GSoC Proposal for "Extracting Data From PDF Invoices And Bills"
Debian Wiki: HarshitJoshi
IRC: duskybomb (OFTC/freenode/Matrix)
Timezone: GMT +5:30 (IST)
I am a freshman at Cluster Innovation Centre, University of Delhi, pursuing a B.Tech in Information Technology and Mathematical Innovations with a minor in Management and Economics. I like writing and experimenting with Python scripts that automate work for myself and others. I also have a keen interest in Artificial Intelligence and Machine Learning, and I enjoy participating in competitive programming contests.
Cluster Innovation Centre: https://www.ducic.ac.in
University of Delhi: https://www.du.ac.in
Location: New Delhi, India
Achievements: I qualified for the ACM-ICPC Asia-Kolkata Kanpur Regionals and received an Honourable Mention in the Online Contest.
I use a Dell Inspiron Gaming 7567 with a 7th-generation i7 processor, 8 GB RAM and an Nvidia GTX 1050 Ti. I also use an external 32-inch monitor at 1080p resolution and 60 Hz.
Operating System: I have been using Linux for the last 4 years. I started with Elementary OS and, after 3 years on it, switched to Ubuntu 16.04 LTS. For the last 3 months I have been using Debian 9.
Currently, my laptop dual-boots Debian 9 and Windows 10 Home Edition, with Ubuntu 16.04 LTS running in VirtualBox.
Internet Connection: I have a stable 50 Mbps internet connection.
Development tools: I use Sublime Text 3 as my text editor. Besides that, I use PyCharm, IntelliJ IDEA and CLion (all Professional editions from the student pack), along with IPython and Jupyter Notebooks.
Extracting Data From PDF Invoices And Bills
Synopsis: There are various projects like invoice2data that attempt to extract data from PDF invoices such as phone bills.
Detailed description of the project: This project aims to develop a complete workflow for discovering bills (in a directory, mail folder or with a browser plugin to extract them from web pages), storing them (a document management system, folder or Git repository), extracting relevant data (bill data, currency and amount) and saving the data (in a format like cXML) in the same document management system. It may be necessary to create a GUI window to help the tool 'learn' how to read a PDF, remember the placement of different data fields in the PDF and automatically extract the same fields next time it sees a bill from the same vendor.
Little or no work on the underlying PDF parsing software should be necessary, as there are already various projects like invoice2data that attempt to extract data from PDF invoices such as phone bills. The summer project will use invoice2data, or a similar software, as the underlying data extraction component.
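One way to picture the "learn and remember" step described above is a small store that keeps, for each vendor, the pattern that locates each field in the bill's text. The following is only a sketch under assumptions: the class `VendorTemplates`, its methods and the JSON file layout are hypothetical names chosen for illustration, and it matches fields with regular expressions on already extracted text rather than with positions recorded from a GUI.

```python
import json
import re
from pathlib import Path


class VendorTemplates:
    """Hypothetical store: vendor keyword -> {field name -> regex pattern}."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload templates learned in an earlier session, if any.
        if self.path.exists():
            self.templates = json.loads(self.path.read_text())
        else:
            self.templates = {}

    def learn(self, vendor_keyword, field, pattern):
        """Remember that group 1 of `pattern` captures `field` for this vendor."""
        re.compile(pattern)  # fail early on an invalid pattern
        self.templates.setdefault(vendor_keyword, {})[field] = pattern
        self.path.write_text(json.dumps(self.templates, indent=2))

    def extract(self, text):
        """Return (vendor, fields) for the first vendor whose keyword is in `text`."""
        for vendor, fields in self.templates.items():
            if vendor in text:
                found = {}
                for field, pattern in fields.items():
                    match = re.search(pattern, text)
                    if match:
                        found[field] = match.group(1)
                return vendor, found
        # No known vendor keyword occurs in the text.
        return None, {}
```

With this sketch, teaching the tool once (for example `store.learn("Example Telecom", "amount", r"Total:\s*([\d.]+)")`) is enough for `store.extract(text)` to return the amount for any later bill whose text contains the same vendor keyword. The vendor name and pattern here are made up for the example.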
- Discovering bills from directory
- Storing bills in folders
- Extracting relevant data like bill data, currency, amount
- Saving data as CSV, JSON, cXML or UBL
- GUI to make the tool learn placements of different data fields
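The first deliverables above (discovering bills in a directory, extracting data and saving it) could fit together roughly as follows. This is a minimal sketch, not the project's implementation: `discover_bills`, `process_bills` and the output file names are hypothetical, and `extract` is an assumed callable that would wrap a call into invoice2data or similar software and return a dict of fields, or None when nothing could be read. Only JSON and CSV output are shown; cXML and UBL need their own schemas.

```python
import csv
import json
from pathlib import Path


def discover_bills(directory):
    """Return all PDF files under `directory`, sorted for a stable order."""
    return sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def process_bills(directory, extract, out_dir):
    """Run `extract` on every bill and save the results as JSON and CSV.

    `extract` is an assumed callable: path -> dict of fields, or None.
    """
    rows = []
    for pdf in discover_bills(directory):
        data = extract(pdf)
        if data:
            rows.append({"file": pdf.name, **data})

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # default=str lets non-JSON values such as dates be written as text.
    (out / "bills.json").write_text(json.dumps(rows, indent=2, default=str))

    # Use the union of all keys so bills with different fields share one CSV.
    fieldnames = sorted({key for row in rows for key in row})
    with open(out / "bills.csv", "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Passing the extractor in as a parameter keeps discovery and storage independent of the extraction component, so the same loop works whichever library ends up doing the PDF parsing.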
Up to April 23
- Make small improvements to invoice2data (proper tests for the project)
- Experiment with PyQt5
- Research Universal Business Language (UBL), a standardized XML format, and cXML
- Setting up my environment
- Add feature to export output file in UBL and cXML format
- Start work on the graphical user interface (GUI) using PyQt5
- Implement simple commands which are currently given as command line arguments.
- Addition to GUI:
- Enter a field name and pull the matching value out of the extracted data.
- Addition to GUI:
- Remember placement of different data fields.
- Adding more templates from India.
- Improvements to OCR module.
- Testing OCR module
- Making proper tests for all the work done
- Polishing the scripts
- Update documentation
- Prepare final summary.
- Look into choosing and classifying fields with machine learning as future work (post GSoC 2018)
Scripts performing the functions required
Benefits to Debian
An open source project to help Debian (and other OS) users to extract relevant data from various bills
Contributions to invoice2data
Issues created in invoice2data
Pull Requests made in invoice2data
PR #74 (Merged) Added xml and json features
PR #85 (Merged) Adding currency to output and test for checking output of file (Somehow the tests are still failing).
PR #78 (Merged) Attempt to add CLI testing
PR #71 (Merged) Solved Issue #70
PR #63 (Merged) Typo in Documentation
PR #83 (Closed but idea implemented by mentor in his own commit)
PR #68 (Closed but idea implemented by mentor in his own commit)
PR #94 (Merged) Added examples to README.md
Issue #70 (Closed) Build was failing
Issue #77 (Open) Resource Warning for unclosed file
- I will be available on IRC/Matrix and by email during the daytime.
- I will report regularly to my mentor(s) on the project's progress and any problems I face.
- I plan to write weekly blog posts about my progress on the project.
- Weekly progress updates will be sent to the mailing list for the community's feedback.
Other summer plans
Nothing is planned yet, and nothing much is likely to come up either. I might visit my native town in Uttarakhand, India for a few days, but it won't affect the work much: I will have internet access and will take my laptop with me.
Exams and Other commitments
I have my university end-semester examinations from 4th May to 20th May, so I will be a little occupied during that period. Other than that, I have no other commitments.
I have always admired the Open Source community for its transparency and for the freedom to bend software the way you like. There have been times when I saw errors in a program or a webpage and wished it were an Open Source project, so that I could fix them and help myself as well as others keep using it. Since I have been using Linux for the last 4 years and all my distros have been built upon Debian, I feel this is a great opportunity to give back a little to the community which supported me when I had a low-end PC and still never ceases to amaze me.
My Previous Debian Contributions
This will be my first time contributing to Debian.
Are you applying for other projects in SoC?
No, I am only applying for this project.