Extracting data from PDF invoices and bills for financial accounting
Description of the project: This project aims to develop a complete workflow for discovering bills (in a directory, in a mail folder, or with a browser plugin that extracts them from web pages), storing them (in a document management system, folder or Git repository), extracting the relevant data (bill date, currency and amount) and saving that data (in a format like cXML) in the same document management system. It may be necessary to create a GUI window that helps the tool 'learn' how to read a PDF: the tool would remember where the different data fields are placed in the PDF and automatically extract the same fields the next time it sees a bill from the same vendor.
Little or no work on the underlying PDF parsing software should be necessary, as there are already various projects, such as invoice2data, that attempt to extract data from PDF invoices such as phone bills. The summer project will use invoice2data, or similar software, as the underlying data extraction component.
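To make the idea of per-vendor extraction more concrete, here is a minimal sketch in Python. It assumes the PDF has already been converted to plain text by some other component. The template layout, the vendor name "Example Telecom", the field labels and the function name extract_fields are all hypothetical and chosen only for illustration; this is not how invoice2data itself is implemented.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical per-vendor templates: a keyword that identifies the vendor,
# plus regular expressions for the fields we want to extract.
TEMPLATES = {
    "Example Telecom": {
        "keyword": "Example Telecom",
        # (pattern, strptime format) for the bill date
        "date": (r"Invoice date:\s*(\d{4}-\d{2}-\d{2})", "%Y-%m-%d"),
        # group 1 is the currency code, group 2 the amount
        "amount": r"Total due:\s*([A-Z]{3})\s*([\d,]+\.\d{2})",
    },
}


def extract_fields(text, templates=TEMPLATES):
    """Return vendor, date, currency and amount from the first template
    that matches the text, or None if no template matches."""
    for vendor, template in templates.items():
        if template["keyword"] not in text:
            continue
        date_pattern, date_format = template["date"]
        date_match = re.search(date_pattern, text)
        amount_match = re.search(template["amount"], text)
        if not (date_match and amount_match):
            continue
        return {
            "vendor": vendor,
            "date": datetime.strptime(date_match.group(1), date_format).date(),
            "currency": amount_match.group(1),
            # Drop thousands separators before converting to Decimal
            "amount": Decimal(amount_match.group(2).replace(",", "")),
        }
    return None


# Made-up text standing in for the output of a PDF-to-text step
sample_text = """Example Telecom
Invoice date: 2016-03-01
Total due: EUR 1,234.50
"""

result = extract_fields(sample_text)
```

In a sketch like this, a 'learning' GUI would only need to produce a new template entry for each vendor, so that the next bill from the same vendor is matched automatically. Amounts are kept as Decimal rather than float to avoid rounding problems in accounting data.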
Confirmed Mentors: Manuel Riel
How to contact the mentor: click here, complete all details, make sure you are subscribed to the relevant mailing lists before posting
Confirmed co-mentors: Thomas Levine
Deliverables of the project: script or scripts performing the functions required
Desirable skills: Version control systems (Git), Python, Regular Expressions
What the intern will learn: financial and accounting workflows, document management
Application tasks: make a small improvement to the invoice2data project and submit a pull request.
Related projects: discussion on Stack Overflow, PDFMiner