I am a 20-year-old undergraduate student in computer science at the University of Strasbourg (France). Fascinated by the human mind, I plan to study Artificial Intelligence after I complete my Bachelor's degree. For this reason, I've been looking for an opportunity to start doing machine learning on my own.
I am not new to open source development: since late 2010, I have made small contributions to the Weboob project, a Python framework for interacting with websites. This taught me how to work with others and with their code, using Git as the VCS.
Should this be relevant (for interaction with existing code, or implementing performance-critical features), I also have some experience in C programming, mainly from school courses.
Because of my long-term interest in machine learning, and my willingness to acquire as many skills as I can in this domain, I shall be entirely committed to this project. I will probably stick around after the summer is over, particularly if I feel my work is not polished enough.
Project title: Semantic Package Review Interface for mentors.debian.net
When a new contributor uploads a package, it is analyzed to extract relevant metadata. This metadata is then compared against a database of existing packages to find those most similar to the new one. The maintainers of those similar packages are then shortlisted as potential sponsors.
Metadata extraction could be done with a supervised learning algorithm, using existing packages and Debtags' database for training. I'm not certain this is the right way, and I'll research that before the beginning of GSoC.
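To make the supervised idea more concrete, here is a minimal sketch of tag prediction from package descriptions, using a small naive Bayes classifier written with the Python standard library only. Everything in it is an assumption made for illustration: the training examples, the tag names "editor" and "network" (which are not real Debtags identifiers), the one-tag-per-package simplification (Debtags assigns several tags per package), and the helper names tokenize, train and predict. The algorithm actually used in the project may well be different.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a description and keep only purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def train(examples):
    """Build per-tag word counts from (description, tag) pairs."""
    word_counts = defaultdict(Counter)
    tag_counts = Counter()
    vocabulary = set()
    for description, tag in examples:
        words = tokenize(description)
        word_counts[tag].update(words)
        tag_counts[tag] += 1
        vocabulary.update(words)
    return word_counts, tag_counts, vocabulary


def predict(model, description):
    """Return the most probable tag (naive Bayes, Laplace smoothing)."""
    word_counts, tag_counts, vocabulary = model
    total = sum(tag_counts.values())
    best_tag, best_score = None, -math.inf
    for tag in sorted(tag_counts):
        # Log prior of the tag, then log likelihood of each known word.
        score = math.log(tag_counts[tag] / total)
        denom = sum(word_counts[tag].values()) + len(vocabulary)
        for word in tokenize(description):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[tag][word] + 1) / denom)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Hypothetical training data: a real run would use descriptions of
# existing packages together with their tags from the Debtags database.
TRAINING = [
    ("text editor with syntax highlighting", "editor"),
    ("lightweight editor for source code", "editor"),
    ("command line tool to download files over http", "network"),
    ("network scanner and port mapper", "network"),
]

MODEL = train(TRAINING)

if __name__ == "__main__":
    print(predict(MODEL, "simple editor for plain text"))
    print(predict(MODEL, "download files from the network"))
```

With real data, the same structure could be extended to several tags per package, for example by training one yes/no classifier per tag.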
Matching a package with sponsors will be done in an unsupervised way, looking for similarities with the existing packages. The database could consist of all packages from registered sponsors, and/or be based on the entire Debian archive, using for example Debtags to avoid excessive calculation.
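As an illustration of the matching step, the sketch below ranks maintainers by the cosine similarity between the description of a new package and the descriptions of existing packages. It is only one possible approach, under stated assumptions: the package names, maintainer addresses, and the functions vectorize, cosine and suggest_sponsors are all hypothetical, and plain word counts stand in for whatever metadata the extraction step would really produce.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a description into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_sponsors(new_description, archive, limit=2):
    """Rank maintainers by the best similarity of any of their packages."""
    query = vectorize(new_description)
    best = {}
    for package in archive:
        score = cosine(query, vectorize(package["description"]))
        maintainer = package["maintainer"]
        best[maintainer] = max(best.get(maintainer, 0.0), score)
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [(m, s) for m, s in ranked[:limit] if s > 0]


# Hypothetical archive: a real one would come from the sponsors' packages
# or from the Debian archive, possibly pre-filtered with Debtags.
ARCHIVE = [
    {"name": "pkg-a", "maintainer": "alice@example.org",
     "description": "python library for parsing web pages"},
    {"name": "pkg-b", "maintainer": "bob@example.org",
     "description": "fast image viewer for the desktop"},
    {"name": "pkg-c", "maintainer": "alice@example.org",
     "description": "command line client for web services"},
]

if __name__ == "__main__":
    query = "python framework for scraping web pages"
    for maintainer, score in suggest_sponsors(query, ARCHIVE):
        print(f"{maintainer}: {score:.2f}")
```

Common words such as "for" inflate the scores in this toy version; a real implementation would likely weight terms (TF-IDF, for instance) and restrict the comparison to packages sharing some Debtags, which is the kind of pre-filtering that avoids excessive calculation.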
Using automatic metadata extraction from packages and learning algorithms, this project aims to match prospective maintainers with potential sponsors more easily and quickly. An efficient Web interface will be developed, so that maintainers and sponsors can access and improve this semantic metadata.
Benefits to Debian:
- an algorithm for extracting semantic metadata from new packages
- at least one algorithm for matching a package with potential sponsors
- Web UI: access to the metadata
- Web UI: allow sponsors to define their interests
- Web UI: help a new contributor to contact potential sponsors and packaging team with relevant information (improvement over existing interface)
(Tentative) Project schedule:
Before April 20: provide additional details to the mentors if needed
April 23 - May 21: Research and experiment with supervised and unsupervised learning algorithms, and choose the ones best suited to the problem. Choose a way of storing packages and their metadata for efficient retrieval. Learn more about the Debian process. Set up the development environment.
After May 21: I am not sure how much time each task will take.
Exams and other commitments: I have exams in early May; I'll be free of any school obligation from May 21 to the end of the summer.
Other summer plans: I have no other plans: this project would be my only commitment for the entire summer.
Debian has been my distribution of choice since 2004, for servers and desktops alike. However, I have occasionally been frustrated by how slowly some important packages in Sid (KDE, for example) are updated, and I have always felt that the packaging system, while efficient for users, is unnecessarily complex and opaque for new packagers. For this reason, I regularly try other distributions on my desktop computer, and I keep coming back to Debian, because it 'just works'. The grass is not so much greener on the other side.
The research I did after hearing of this project made me realize that package maintenance is not so inaccessible after all. This GSoC looks like a great introduction to the Debian process.
Are you applying for other projects in SoC? No