Progress reports for this project:

[1] June 4th, 2011 [2] June 19th, 2011

(Proposal has been updated on April 6th)

Project Proposal


In Debian Pure Blends, team performance was measured on the basis of postings to the relevant mailing lists as well as by inspecting the package uploads recorded in the Ultimate Debian Database. The information gathered received positive feedback both from Blends and pure packaging teams. Andreas wrote the initial code, a mixture of Perl and shell scripts; you can check out a version from [2].

With this project, we intend to completely replace the older code, using existing tools where possible and writing our own implementation in Python where required. The aim is to make it more flexible and maintainable, and to include additional resources that will help in measuring performance.

This project is divided into four parts:

1. List Statistics

This will involve parsing the mailing list to identify the most active contributors. The conventional approach was to download the mailing list archive pages and then parse the HTML. In this project, we will instead parse the mbox archives [3]. This new approach will be faster, and the output will be more precise, as mbox is a standardized format for holding collections of messages.

This phase will be undertaken in four steps (a sketch of steps 2 and 4 follows the list):

  1. Fetch the archive in mbox format from the given mailing list,
  2. Parse the mbox archive,
  3. Save the information in a database,
  4. Create graphs that show the top n posters every month.
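
As a rough illustration of steps 2 and 4, here is a minimal sketch using Python's standard mailbox module. The archive filename is hypothetical; a real run would store counts in a database (step 3) and group them by month:

    import mailbox
    from collections import Counter
    from email.utils import parseaddr

    # Hypothetical archive file; step 1 would fetch one such file per list.
    archive = mailbox.mbox("debian-blends.mbox")

    # Step 2: count messages per sender address.  Matching several
    # addresses to one person is discussed later in this proposal.
    posters = Counter()
    for message in archive:
        name, address = parseaddr(message.get("From", ""))
        if address:
            posters[address.lower()] += 1

    # Step 4 (simplified): top posters over the whole archive instead of
    # per month, and printed instead of stored in a database (step 3).
    for address, count in posters.most_common(10):
        print("%5d %s" % (count, address))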

The current list statistics, generated by the code Andreas wrote, are available at [4]; have a look to get an idea of what we are aiming at. With this project, we will improve the flexibility and accuracy of this approach. For this task we will be using MailListStat [5]. The data gathered will then be used to generate the graphs.

2. Most Active Uploaders

In this phase, the most active uploaders within a team are measured; this data is available from the Ultimate Debian Database [6]. We will fetch the information from UDD and then generate graphs from it. During this phase we will also investigate whether data about bugs fixed by team members can be obtained from UDD or the BTS. This is still under discussion; if it turns out to be required, it can be implemented within a week. This part will be implemented in Python.
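
As a rough sketch of the UDD query, assuming psycopg2, the public UDD mirror, and an upload_history table with a changed_by column (the connection details and schema are assumptions that must be verified against the UDD documentation):

    import psycopg2

    # Connection details for the public UDD mirror are an assumption here;
    # check the UDD documentation for current access parameters.
    conn = psycopg2.connect(host="udd-mirror.debian.net", database="udd",
                            user="udd-mirror", password="udd-mirror")
    cursor = conn.cursor()

    # The upload_history table and its changed_by column are assumptions
    # based on the public schema; verify before relying on them.
    cursor.execute("""
        SELECT changed_by, count(*) AS uploads
        FROM upload_history
        WHERE date >= %s
        GROUP BY changed_by
        ORDER BY uploads DESC
        LIMIT 10
    """, ("2011-01-01",))
    for changed_by, uploads in cursor.fetchall():
        print("%5d %s" % (uploads, changed_by))
    conn.close()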

3. Commit Details

For measuring the performance using commits, we decided on two factors:

  1. the number of commits,
  2. the total number of lines committed.

We believe that no single metric is complete: the number of commits cannot be an absolute measure of productivity, and neither can the total number of lines committed, so we have decided to include both. To analyze Git repositories, we will use GitStats [7], which serves our purpose well. For parsing the commits from SVN repositories, however, we will write our own implementation, because the standard tools that can do this don't fit our requirements.
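
For the SVN side, here is a minimal sketch of what our implementation could look like, assuming a hypothetical repository URL and counting only commits per author (lines committed would need an additional diff per revision):

    import subprocess
    import xml.etree.ElementTree as ET
    from collections import Counter

    # Hypothetical repository URL; "svn log --xml" produces output that
    # is far easier to parse reliably than the plain-text log.
    repo = "svn://svn.debian.org/blends"
    xml_log = subprocess.check_output(["svn", "log", "--xml", repo])

    # Count commits per author.  Counting lines committed would need an
    # additional "svn diff" per revision and is omitted here.
    commits = Counter()
    for entry in ET.fromstring(xml_log).findall("logentry"):
        commits[entry.findtext("author", default="unknown")] += 1

    for author, count in commits.most_common(10):
        print("%5d %s" % (count, author))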

4. Data Presentation

After the information has been gathered, we will work on making it easily accessible. In addition to generating graphs from phases (1), (2) and (3), we will make our data available in the form of JSON [8]. JSON not only integrates well with Python, but is also easy to learn and has parsers for a large number of languages.
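
A minimal sketch of such an export follows; the record shape is a made-up example, and the real data would come from the database populated in the earlier phases:

    import json

    # The record shape here is hypothetical; the real export would be
    # generated from the database populated in phases (1)-(3).
    stats = {
        "list": "debian-blends",
        "month": "2011-06",
        "top_posters": [
            {"name": "Jane Doe", "posts": 42},
            {"name": "John Doe", "posts": 37},
        ],
    }

    with open("debian-blends-2011-06.json", "w") as out:
        json.dump(stats, out, indent=2)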

The final phase will be a web interface that allows anyone to easily access the information. This will allow dynamic generation of data, something along the lines of the Debian Popularity Contest [9]. Note that this is easier than it sounds: the data is ready, we just have to give it an interface through which it can be accessed.
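
To illustrate how thin this layer can be, here is a minimal WSGI sketch that only serves one pre-generated JSON file (the filename is hypothetical); the real site would also render the graphs and select data dynamically:

    from wsgiref.simple_server import make_server

    # Minimal sketch of the interface layer: it serves one pre-generated
    # JSON file; the real site would render graphs and pick data per list.
    def app(environ, start_response):
        with open("debian-blends-2011-06.json", "rb") as f:
            body = f.read()
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()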

There will still be many technicalities involved at this stage. For example, in the case of list statistics, the same person often posts to a mailing list under several names or addresses. To measure real team performance, this name matching is quite important, and the tool we are using for this job has no such feature [4]. This will involve patching it or writing our own implementation if required.
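
One possible approach is a hand-maintained alias table mapping known addresses to a single canonical identity, as in this sketch (all addresses are made up); a real implementation might also compare display names:

    # A hand-maintained alias table mapping known addresses to one
    # canonical identity; all addresses here are made up.
    ALIASES = {
        "jdoe@example.org": "jane@example.org",
        "jane.doe@example.com": "jane@example.org",
    }

    def canonical(address):
        """Map a sender address to a single canonical identity."""
        address = address.lower().strip()
        return ALIASES.get(address, address)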

Please note that this is a tentative timeline which may change once actual development takes place.

Community Bonding Period : I am already familiar with my mentors! I will use this period to further improve and analyze the design of the project. If there is one thing that I learnt from last year's GSoC, it's that a robust design is essential.

23rd May - 15th June : Initial coding; set up the infrastructure by gathering the required statistics and implement a very basic model of the website (presentation).

16th June - 25th June : Write a script to fetch the active uploaders list from the UDD. Hopefully, integrate the bug details as well.

26th June - 11th July : Run a feedback campaign across Debian, compiling user feedback from IRC and mailing lists, and patch the basic website. Work on improving the data gathered.

11th July - 15th July : Mid-term evaluations! Review the old code and confirm that everything is working fine.

I really want to attend DebConf11 in Banja Luka. It would be awesome to meet my mentors and other community members in person! If this happens, I will adjust my schedule accordingly.

16th July - 15th August : Implement the presentation layer and complete the website that will present the data.

16th August - 22nd August : Improve code, check for bugs and write documentation.

With the updated proposal, we will spend more time on a polished interface and documentation and less time on implementing the tools, as suggested by Marc Brockschmidt. Also, a substantial part of our time will be spent on making these tools and our own implementations work for our specific case.