I am a 19-year-old student following the Theoretical Computer Science curriculum at the École Normale Supérieure in Paris. I have a solid grounding in graph theory, computability and complexity theory, language theory, lambda calculus, and functional programming. I have never been involved in a large open-source project, but I would really be happy to start. I am nevertheless familiar with the open-source world: having used Debian for several years, I know the basics of UNIX administration, and I report bugs as soon as I find them.

I know how to read and write a man page, a README file, library documentation, or any other technical document about a program. I clearly meet all of the project's requirements. What makes me a good choice for this project? I would say my ability to solve problems on my own by reading books and documentation, along with the ease with which I read, understand, and implement algorithms from research papers.

First, I plan to test different terminology extraction algorithms and adapt them to the very particular shape of Debian packages. The goal is to automatically extract semantic information from man pages, documentation, and READMEs, represented as correlation rates with keywords ("graphics", "python", "security", etc.).
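To make the idea concrete, here is a deliberately crude sketch of what such a "correlation rate" could look like: simple relative keyword frequency over a package's documentation text. The keyword list and the sample text are invented for illustration; a real terminology extraction algorithm would be far more sophisticated.

```python
import re
from collections import Counter

# Hypothetical keyword list; in the real system it would be
# generated automatically rather than hard-coded.
KEYWORDS = {"python", "security", "graphics"}

def keyword_rates(text):
    """Return each keyword's frequency relative to the total word
    count -- a crude stand-in for the correlation rates a real
    terminology extraction algorithm would produce."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return {kw: counts[kw] / total for kw in KEYWORDS}

sample = "A Python library for security auditing. Written in Python."
rates = keyword_rates(sample)
```

Running this on the sample text rates "python" above "security", and "graphics" at zero, which is the kind of per-keyword signal the classifier downstream would consume.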

Once the content of a package has been analysed, we will use its non-textual content to establish relations between it and other packages. The dependency tree can be seen as a graph, and packages that are close neighbours in this graph are more likely to be similar (e.g. if they use almost the same libraries). In practice, I plan to exploit this idea with fuzzy clustering algorithms to group similar packages. The "fuzzy" part is needed because a package can belong to several groups at the same time, and because the relation "being part of a group" is not binary.
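The fuzzy-membership idea can be sketched in a few lines. This is not a full fuzzy c-means implementation, just the core notion: packages are described by their dependency sets, and each package receives a degree of membership in every group, proportional to its similarity to that group's representative. All package and group names below are invented.

```python
def jaccard(a, b):
    """Jaccard similarity between two dependency sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy dependency sets (invented names).
deps = {
    "pkg-gui":  {"libgtk", "libglib", "libc"},
    "pkg-gui2": {"libgtk", "libc"},
    "pkg-net":  {"libssl", "libc"},
}
# One representative package per group, used as a cluster "seed".
seeds = {"gui-group": deps["pkg-gui"], "net-group": deps["pkg-net"]}

def memberships(pkg_deps):
    """Normalized similarities -> fuzzy membership degrees summing to 1."""
    sims = {g: jaccard(pkg_deps, s) for g, s in seeds.items()}
    total = sum(sims.values()) or 1.0
    return {g: v / total for g, v in sims.items()}

m = memberships(deps["pkg-gui2"])
```

Here `pkg-gui2` ends up mostly in the GUI group but retains a nonzero membership in the networking group, which is exactly the non-binary behaviour we want.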

Once the packages have been classified into semantic groups, looking at each maintainer's previously reviewed packages should be enough to sort new packages by their relevance to that particular maintainer.
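One simple way to do this (an assumption about the eventual design, not a commitment): summarize a maintainer's history as the average membership vector of the packages they reviewed, then rank candidates by the dot product with that profile.

```python
# Each package is a fuzzy membership vector over hypothetical groups.
GROUPS = ["gui", "net", "science"]

def profile(reviewed):
    """Average the membership vectors of previously reviewed packages."""
    n = len(reviewed)
    return [sum(vec[i] for vec in reviewed) / n for i in range(len(GROUPS))]

def score(pkg, prof):
    """Dot product: how well a candidate matches the maintainer profile."""
    return sum(p * q for p, q in zip(pkg, prof))

reviewed = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]   # mostly GUI packages
prof = profile(reviewed)

candidates = {
    "candidate-gui": [0.9, 0.05, 0.05],
    "candidate-net": [0.1, 0.80, 0.10],
}
ranked = sorted(candidates, key=lambda p: score(candidates[p], prof),
                reverse=True)
```

With this toy data the GUI-flavoured candidate is ranked first, as a maintainer with a GUI-heavy review history would expect.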

By the end of June, I will have finished building the terminology extraction step. I will need to create (probably automatically) a list of keywords for the extraction algorithms. I will test several algorithms from the literature and select those that seem the most precise and informative (possibly asking experienced maintainers for feedback).

By the end of July, I will have implemented a way to generate the dependency graph, and I will test several fuzzy clustering algorithms to select the one that gives the best results for maintainers; on this point too, I will probably ask them for feedback.

During the last month, I will deploy the system on the official debexpo platform and add a feature allowing maintainers to mark a package review suggestion as "very relevant" or "totally irrelevant", so that the system automatically improves over time.
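One possible way this feedback could feed back into the system (an assumption about the design, not a final choice): nudge the maintainer's profile toward packages marked relevant and away from those marked irrelevant, keeping it normalized. The learning rate and vector shapes here are purely illustrative.

```python
LEARNING_RATE = 0.2  # illustrative value, to be tuned

def update_profile(prof, pkg_memberships, relevant):
    """Nudge the maintainer profile toward (or away from) a package's
    group memberships, then renormalize so it stays a distribution."""
    sign = 1.0 if relevant else -1.0
    new = [max(0.0, p + sign * LEARNING_RATE * m)
           for p, m in zip(prof, pkg_memberships)]
    total = sum(new) or 1.0
    return [v / total for v in new]

prof = [0.5, 0.5]                                    # two toy groups
prof = update_profile(prof, [1.0, 0.0], relevant=True)
```

After a "very relevant" mark on a package that belongs entirely to the first group, the profile shifts toward that group while still summing to one.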