Extracting data from PDF invoices and bills for financial accounting

Description of the project: This project aims to develop a complete workflow for discovering bills (in a directory or mail folder, or with a browser plugin that extracts them from web pages), storing them (in a document management system, folder or Git repository), extracting the relevant data (bill date, currency and amount) and saving that data (in a format like cXML) in the same document management system. It may be necessary to create a GUI window that helps the tool 'learn' how to read a PDF: the tool would remember the placement of the different data fields in the PDF and automatically extract the same fields the next time it sees a bill from the same vendor.
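The 'learn once, reuse per vendor' idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the names Word, Box, learn and extract are hypothetical, the coordinates are invented, and the words are assumed to have already been pulled out of the PDF with their page positions by some underlying parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word with its position on the page (hypothetical parser output)."""
    x: float
    y: float
    text: str


@dataclass(frozen=True)
class Box:
    """A rectangular region that the user marked for one field."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, word: Word) -> bool:
        return self.x0 <= word.x <= self.x1 and self.y0 <= word.y <= self.y1


# Remembered placements: vendor name -> field name -> region on the page.
TEMPLATES: dict[str, dict[str, Box]] = {}


def learn(vendor: str, field: str, box: Box) -> None:
    """Record where a field sits on bills from this vendor."""
    TEMPLATES.setdefault(vendor, {})[field] = box


def extract(vendor: str, words: list[Word]) -> dict[str, str]:
    """Read every remembered field by collecting the words inside its box."""
    fields = {}
    for name, box in TEMPLATES.get(vendor, {}).items():
        hits = sorted((w for w in words if box.contains(w)),
                      key=lambda w: (w.y, w.x))
        fields[name] = " ".join(w.text for w in hits)
    return fields
```

In this sketch the GUI would call learn() once per field when the user marks a region, and extract() would then be applied to every later bill from the same vendor. An unknown vendor simply yields an empty result, which could be the signal to open the GUI.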

Little or no work on the underlying PDF parsing software should be necessary, as there are already various projects, such as invoice2data, that attempt to extract data from PDF invoices such as phone bills. The summer project will use invoice2data, or similar software, as the underlying data extraction component.
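Because the extraction component may be invoice2data or something similar, one option is to keep it behind a small interface so that it can be swapped. The sketch below is an assumption about how that might look, not a description of any existing tool: Extractor, StubExtractor and process are hypothetical names, and the stub stands in for a real adapter that would wrap the chosen library.

```python
from typing import Optional, Protocol


class Extractor(Protocol):
    """Anything that can turn a PDF path into a dict of fields, or None."""

    def extract(self, pdf_path: str) -> Optional[dict]:
        ...


class StubExtractor:
    """Stand-in backend returning canned results; a real adapter would
    call the underlying extraction library here instead."""

    def __init__(self, canned: dict):
        self.canned = canned

    def extract(self, pdf_path: str) -> Optional[dict]:
        return self.canned.get(pdf_path)


def process(paths: list, extractor: Extractor) -> tuple:
    """Split bills into those read successfully and those still unreadable."""
    done, pending = {}, []
    for path in paths:
        result = extractor.extract(path)
        if result is None:
            pending.append(path)  # candidate for the GUI 'learning' step
        else:
            done[path] = result
    return done, pending
```

Bills that end up in the pending list are the ones for which no data could be extracted, so they are natural candidates for the GUI described above.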