GSoC Proposal for "Extracting Data From PDF Invoices And Bills"

Name

Harshit Joshi

Contact/Email/IRC

Homepage: http://harshitjoshi.in/

Blog: http://blog.harshitjoshi.in/

Github: https://www.github.com/duskybomb/

Debian Wiki: HarshitJoshi

Email: <harshit113@ducic.ac.in>

IRC: duskybomb (OFTC/freenode/Matrix)

Timezone: GMT +5:30 (IST)

Background

I am a freshman at Cluster Innovation Centre, University of Delhi, pursuing a B.Tech in Information Technology and Mathematical Innovations with a minor in Management and Economics. I like to write and experiment with Python scripts that automate work for others and for me. I also have a keen interest in Artificial Intelligence and Machine Learning, and I enjoy participating in competitive programming contests.

Cluster Innovation Centre: https://www.ducic.ac.in

University of Delhi: https://www.du.ac.in

Location: New Delhi, India

Achievements

I qualified for the ACM-ICPC Asia-Kolkata Kanpur Regionals and received an Honourable Mention for the Online Contest.

My hardware

I use a Dell Inspiron Gaming 7567 with a 7th Gen i7 processor, 8 GB RAM and an Nvidia GTX 1050 Ti. I also use an external 32-inch monitor with 1080p resolution at 60 Hz.

Operating System

I have been using Linux for the last 4 years. I started with Elementary OS and, after 3 years on it, switched to Ubuntu 16.04 LTS. For the last 3 months I have been using Debian 9.

Currently, my laptop dual-boots Debian 9 and Windows 10 Home Edition, and I also run Ubuntu 16.04 LTS in VirtualBox.

Internet Connection

I have a stable 50 Mbps internet connection.

Development tools

I use Sublime Text 3 as my text editor. I also use PyCharm, IntelliJ IDEA and CLion (all Professional editions, through the student pack), along with IPython and Jupyter Notebooks.

Project Title

Extracting Data From PDF Invoices And Bills

Synopsis: There are various projects like invoice2data that attempt to extract data from PDF invoices such as phone bills.

Detailed description of the project: This project aims to develop a complete workflow for discovering bills (in a directory, mail folder or with a browser plugin to extract them from web pages), storing them (a document management system, folder or Git repository), extracting relevant data (bill data, currency and amount) and saving the data (in a format like cXML) in the same document management system. It may be necessary to create a GUI window to help the tool 'learn' how to read a PDF, remember the placement of different data fields in the PDF and automatically extract the same fields next time it sees a bill from the same vendor.
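The workflow described above (discover, extract, save) could be sketched roughly as follows. This is only a minimal illustration under some assumptions: it covers the "directory" discovery case only, it writes JSON next to each PDF as a stand-in for a format like cXML, and the names discover_bills, process_bills and Extractor are hypothetical, not part of invoice2data.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional

# Hypothetical type: any callable that takes a PDF path and returns the
# extracted fields, or None when the bill could not be read.
Extractor = Callable[[Path], Optional[Dict[str, str]]]


def discover_bills(root: Path) -> List[Path]:
    """Find every PDF below root (the 'directory' discovery case only)."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def process_bills(root: Path, extractor: Extractor) -> List[Path]:
    """Extract data from each bill and store it next to the PDF as JSON."""
    written = []
    for pdf in discover_bills(root):
        fields = extractor(pdf)
        if fields is None:
            continue  # unknown vendor or unreadable file: skip, do not guess
        sidecar = pdf.with_suffix(".json")
        sidecar.write_text(json.dumps(fields, indent=2, sort_keys=True))
        written.append(sidecar)
    return written
```

Keeping the extractor as a plain callable means the underlying component can be swapped without touching the discovery or storage steps. For example, a thin wrapper around invoice2data could be passed in, provided it returns a dictionary of fields or None.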

Little or no work on the underlying PDF parsing software should be necessary, as there are already various projects like invoice2data that attempt to extract data from PDF invoices such as phone bills. The summer project will use invoice2data, or similar software, as the underlying data extraction component.
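To illustrate the idea of remembering where fields sit for each vendor, here is a simplified stand-in for a template-based extractor. It is not invoice2data's own template format or API: the TEMPLATES structure, the vendor "Example Telecom" and the function extract_fields are all hypothetical, and the sketch assumes the PDF text has already been extracted to a string.

```python
import re
from typing import Dict, List, Optional

# Hypothetical templates: one entry per vendor, with a keyword that
# identifies the vendor and one regular expression per field.
TEMPLATES: List[Dict] = [
    {
        "vendor": "Example Telecom",
        "keyword": "Example Telecom Ltd",
        "fields": {
            "date": r"Bill date:\s*(\d{4}-\d{2}-\d{2})",
            "currency": r"Total due:\s*([A-Z]{3})",
            "amount": r"Total due:\s*[A-Z]{3}\s*(\d+\.\d{2})",
        },
    },
]


def extract_fields(text: str, templates: List[Dict] = TEMPLATES) -> Optional[Dict[str, str]]:
    """Return the fields of the first template whose keyword is in text."""
    for template in templates:
        if template["keyword"] not in text:
            continue
        result = {"vendor": template["vendor"]}
        for name, pattern in template["fields"].items():
            match = re.search(pattern, text)
            if match is None:
                return None  # vendor recognised, but the layout did not match
            result[name] = match.group(1)
        return result
    return None  # no template knows this vendor


sample = "Example Telecom Ltd\nBill date: 2018-03-01\nTotal due: EUR 42.50\n"
print(extract_fields(sample))
```

A GUI that helps the tool 'learn' a new vendor would, in this sketch, amount to adding one more entry to the template list.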

Project Proposal

Project Timeline

Up to April 23

Community Bonding

Week 1

Week 2-3

Week 4-5

Week 6-8

Week 9-10

Week 11

Week 12

Week 13

Deliverables

Scripts performing the functions required

Benefits to Debian

An open source project to help Debian (and other OS) users to extract relevant data from various bills

Contributions to invoice2data

invoice2data

Issues created in invoice2data

Pull Requests made in invoice2data

PR #74 (Merged) Added xml and json features

PR #85 (Merged) Adding currency to output and test for checking output of file (Somehow the tests are still failing).

PR #78 (Merged) Attempt to add CLI testing

PR #71 (Merged) Solved Issue #70

PR #63 (Merged) Typo in Documentation

PR #83 (Closed but idea implemented by mentor in his own commit)

PR #68 (Closed but idea implemented by mentor in his own commit)

PR #94 (Merged) Added examples to README.md

Issue #70 (Closed) Build was failing

Issue #77 (Open) Resource Warning for unclosed file

Communication

Other summer plans

Nothing is planned yet, and nothing much is going to happen either. I might visit my native town in Uttarakhand, India for a few days, but it won’t affect the work much: I will have internet access and will take my laptop with me.

Exams and Other commitments

I have my university end-semester examinations from 4th May to 20th May, so I will be a little occupied during that period. Other than that, I have no other commitments.

Why Debian?

I have always admired the Open Source community for its transparency and for the freedom to adapt software to your own needs. There have been times when I saw an error in a program or a webpage and wished it were an Open Source project, so that I could fix the error and help myself as well as others. I have been using Linux for the last 4 years and all my distros have been built upon Debian, so I feel this is a great opportunity to give back a little to the community that supported me when I had a low-end PC and still does.

My Previous Debian Contributions

This will be my first time contributing to Debian.

Are you applying for other projects in SoC?

No, I am only applying for this project.