Name: Damien Desfontaines
Contact/Email: ddfontaines@gmail.com
Background:
I am a 19-year-old student following a Theoretical Computer Science curriculum at the École Normale Supérieure in Paris. I have a solid grounding in graph theory, computability and complexity theory, formal language theory, lambda calculus and functional programming. I have never been involved in a large open-source project, but I would be very happy to start. I am nevertheless familiar with the open-source world: having used Debian for several years, I know the basics of UNIX administration and I report bugs as soon as I find them.
I know how to read and write a man page, a README file, library documentation, or any other technical document about a program. I clearly meet all the project's requirements. What makes me a good choice for this project? I would say my ability to solve problems on my own by reading books and documentation, along with the ease with which I read, understand and implement algorithms from scientific research papers.
Project title: Semantic Package Review Interface for mentors.debian.net
Project details: Use semantic analysis of the Debian package database and of newly submitted packages to make it easier to match packages awaiting review with competent maintainers.
Synopsis: I plan to study, implement and compare the effectiveness of different terminology extraction methods and (probably fuzzy) clustering algorithms, in order to automatically group the packages in the repositories by similarity.
Benefits to Debian: Packages sent to Debexpo for review could be matched more easily with maintainers' areas of interest. The main short-term consequence of this work would therefore be to make the package review process faster and more efficient. In the longer term, this "semantic analysis" could be used to improve package classification in aptitude (but that is not the primary goal of my work).
Deliverables:
First, I plan to test different terminology extraction algorithms and adapt them to the very particular structure of Debian packages. The goal is to automatically extract semantic information, represented as degrees of correlation with keywords ("graphics", "python", "security", etc.), from sources such as man pages, documentation and README files.
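As an illustration only, the "rate of correlation with keywords" could be approximated by a TF-IDF weight computed over each package's documentation. The sketch below is a minimal baseline under that assumption, not the algorithm that will finally be selected; the function names `tokenize` and `keyword_scores` and the shape of the `docs` dictionary are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_scores(docs, keywords):
    """Return, for each package, a TF-IDF weight for each keyword.

    docs: dict mapping a package name to its concatenated documentation text
          (man pages, README files, and so on). Hypothetical input format.
    keywords: iterable of lowercase keywords.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for name, counter in counts.items():
        total = sum(counter.values()) or 1  # avoid dividing by zero on empty docs
        row = {}
        for kw in keywords:
            # Number of packages whose documentation mentions the keyword.
            df = sum(1 for c in counts.values() if c[kw] > 0)
            # Smoothed inverse document frequency.
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0
            row[kw] = (counter[kw] / total) * idf
        scores[name] = row
    return scores
```

A keyword that is frequent in one package's documentation but rare across the archive gets a high weight for that package, which is the behaviour wanted from a first baseline before trying more elaborate methods from the literature.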
Once the textual content of a package has been analysed, we will use its non-textual content to establish relations between it and other packages. The dependency tree can be seen as a graph, and packages that are close neighbors in this graph are more likely to be similar (e.g. if they use almost the same libraries). In practice, I plan to exploit this idea with fuzzy clustering algorithms to group similar packages. The "fuzzy" part is needed because a package can belong to several groups at the same time, and because the relation "being part of a group" is not binary.
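To make the idea of non-binary group membership concrete, here is a minimal sketch. It is a deliberately simplified stand-in for a real fuzzy clustering algorithm such as fuzzy c-means: it assumes that each group is represented by one hand-picked package, and it measures how many libraries two packages share with the Jaccard index of their direct dependency sets. The names `jaccard`, `fuzzy_memberships`, `depends` and `representatives` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def fuzzy_memberships(depends, representatives):
    """Compute soft group memberships from shared dependencies.

    depends: dict mapping a package name to the set of its direct dependencies.
    representatives: dict mapping a group name to the package chosen to
                     represent that group (an assumption of this sketch).
    Returns a dict mapping each package to {group: membership}, where the
    memberships of a package sum to 1, or are all 0 if it shares nothing.
    """
    result = {}
    for pkg, deps in depends.items():
        sims = {
            group: jaccard(deps, depends[rep])
            for group, rep in representatives.items()
        }
        total = sum(sims.values())
        result[pkg] = {
            group: (sim / total if total else 0.0) for group, sim in sims.items()
        }
    return result
```

With this definition, a package that shares libraries with both a graphics representative and a network representative receives a partial membership in each group instead of being forced into one.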
Once the packages have been classified into semantic groups, looking at the packages each maintainer has previously reviewed should be enough to rank new packages by how interesting they are likely to be for that maintainer.
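One possible way to do this ranking, shown here as a sketch under assumptions rather than a settled design: build a maintainer profile by averaging the group memberships of the packages already reviewed, then score each candidate package by the dot product between its memberships and that profile. The names `maintainer_profile`, `rank_candidates` and the `memberships` dictionary are hypothetical.

```python
def maintainer_profile(reviewed, memberships):
    """Average the group memberships of the packages a maintainer reviewed.

    reviewed: non-empty list of package names.
    memberships: dict mapping a package name to {group: weight}.
    """
    profile = {}
    for pkg in reviewed:
        for group, weight in memberships[pkg].items():
            profile[group] = profile.get(group, 0.0) + weight / len(reviewed)
    return profile


def rank_candidates(profile, candidates, memberships):
    """Sort candidate packages from most to least relevant for a profile."""

    def score(pkg):
        # Dot product between the candidate's memberships and the profile.
        return sum(profile.get(g, 0.0) * w for g, w in memberships[pkg].items())

    return sorted(candidates, key=score, reverse=True)
```

For example, a maintainer who has mostly reviewed packages in a "graphics" group would see new graphics-related packages at the top of the list.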
Project schedule:
By the end of June, I will have finished building the terminology extraction algorithm. I will need to create (probably automatically) a list of keywords for the terminology extraction algorithms to use. I will test several algorithms from the literature and select the ones that seem the most precise and informative (possibly asking experienced maintainers for feedback).
By the end of July, I will have implemented a way to generate the "dependency graph", and I will have tested several fuzzy clustering algorithms in order to select the one that gives maintainers the best results. On this point too, I will probably ask them for feedback.
During the last month, I will integrate the system into the official debexpo platform and add a feature that lets maintainers mark a package review suggestion as "very relevant" or "totally irrelevant", so that the system automatically improves over time.
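The feedback mechanism could be as simple as nudging a maintainer's interest profile toward or away from the groups of the rated package. The sketch below assumes a profile stored as a {group: weight} dictionary and a fixed learning rate; the function name `apply_feedback` and the parameter `rate` are hypothetical, and nothing here reflects the actual debexpo code.

```python
def apply_feedback(profile, package_groups, relevant, rate=0.2):
    """Return a new profile adjusted by one piece of maintainer feedback.

    profile: dict mapping a group name to the maintainer's interest weight.
    package_groups: dict mapping a group name to the rated package's membership.
    relevant: True for "very relevant", False for "totally irrelevant".
    rate: step size of the adjustment (an arbitrary value for this sketch).
    """
    sign = 1.0 if relevant else -1.0
    updated = dict(profile)  # leave the caller's profile untouched
    for group, weight in package_groups.items():
        new_value = updated.get(group, 0.0) + sign * rate * weight
        updated[group] = max(0.0, new_value)  # keep weights non-negative
    return updated
```

Each rating then shifts later suggestions slightly, so the matching improves as maintainers keep using the system.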
Exams and other commitments: I have exams during the first week of June.
Other summer plans: I have nothing planned apart from a four-day vacation with friends, probably at the beginning of August.
Why Debian?: I have personally used Debian for three years, and this distribution really suits me. I particularly appreciate the usability of Aptitude, the huge size of the official repositories, the robustness of the Stable release and the way development is organized. I first wanted to contribute to the Debian project a year ago.
Are you applying for other projects in SoC?: I am applying to two other projects: one for GeoGebra and the other for Sage.
