Ultimate Debian database
This proposal has been accepted and is being worked on!
Mentor: Lucas Nussbaum <lucas@debian.org>
Co-Mentor: Stefano Zacchiroli <zack@debian.org>
Co-Mentor: Marc 'HE' Brockschmidt <he@ftwca.de>
Summary: Import all interesting data about Debian into a database and data-mine it
Required skills:
- Relational model, SQL databases, both theoretical and practical knowledge. You will probably have to deal with complex queries, optimization of tables, etc.
- Knowledge of a scripting language (Python, Ruby, Perl, ...)
There's a lot of data in Debian, in many different places: Sources and Packages files, the bug tracking system, popcon, DEHS, etc. When someone wants to combine two different kinds of data to look for discrepancies, or simply to present data in a different, more useful way, they usually have to write scripts to import this data in a usable form, and more scripts to combine it. This is *very* painful and error-prone.
The goal of this GSOC project is to move from ad-hoc scripts to a centralized approach, by implementing:
- an SQL database where all the interesting data will be stored
- scripts to import data from other sources to this database
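As a rough illustration of what one such import script could look like, here is a minimal sketch that parses Packages-style stanzas ("Field: value" lines, paragraphs separated by blank lines, continuation lines starting with whitespace) and loads them into a table. Everything specific here is an assumption made for the example: the `packages` table and its columns, the `parse_stanzas` and `import_packages` helpers, and the sample data. SQLite is used only so the sketch runs without any setup; the real project targets PostgreSQL.

```python
import sqlite3

# Made-up sample in the Packages file layout (not real archive data).
SAMPLE = """\
Package: hello
Version: 2.10-2
Maintainer: Jane Doe <jane@example.org>
Description: example package
 A continuation line of the long description.

Package: world
Version: 1.0-1
Maintainer: John Roe <john@example.org>
Description: another example
"""


def parse_stanzas(text):
    """Yield one dict per blank-line-separated stanza."""
    stanza, last = {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current stanza.
            if stanza:
                yield stanza
            stanza, last = {}, None
        elif line[0] in " \t" and last:
            # Continuation line: append to the previous field.
            stanza[last] += "\n" + line.strip()
        else:
            key, _, value = line.partition(":")
            last = key.strip()
            stanza[last] = value.strip()
    if stanza:
        yield stanza


def import_packages(conn, text, suite):
    """Load all stanzas of `text` into a hypothetical `packages` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS packages ("
        " package TEXT, version TEXT, maintainer TEXT, suite TEXT,"
        " PRIMARY KEY (package, version, suite))"
    )
    rows = [
        (s["Package"], s["Version"], s.get("Maintainer"), suite)
        for s in parse_stanzas(text)
    ]
    # Re-running the import replaces existing rows instead of failing.
    conn.executemany("INSERT OR REPLACE INTO packages VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
count = import_packages(conn, SAMPLE, "unstable")
print(f"imported {count} packages")
```

Keeping the parser separate from the loader means the same parsing code could be reused for Sources files, which use the same stanza layout.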
What the student should have done at the end of the project:
- define a database schema that works, and implement it in a PostgreSQL database.
- write scripts to import data from various sources into the database (including, but not limited to, Sources and Packages files for all suites and sections, BTS data, popcon data, debtags tags, etc.).
- write example scripts (big SQL query + presentation code) that present data in useful ways. For example:
  - RC bugs in packages in testing, sorted by popcon
  - Packages that are in unstable, but not in testing, sorted by popcon
  - Packages with the most bugs
  - Maintainers with the most bugs
  - ...
- make it possible to easily move the DB and the scripts to another system, and write documentation.
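To give an idea of the "big SQL query + presentation code" pattern, here is a minimal sketch of the first example report (RC bugs in packages in testing, sorted by popcon). The schema is purely hypothetical: the `packages`, `bugs` and `popcon` tables, their columns, and the demo rows are all invented for this example, and designing the real schema is part of the project. SQLite stands in for PostgreSQL so that the sketch is runnable as-is. The only Debian-specific fact relied on is that release-critical bugs are those with severity critical, grave or serious.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a tiny, made-up data set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE packages (package TEXT, suite TEXT);
        CREATE TABLE bugs (id INTEGER PRIMARY KEY, package TEXT, severity TEXT);
        CREATE TABLE popcon (package TEXT PRIMARY KEY, insts INTEGER);
    """)
    conn.executemany("INSERT INTO packages VALUES (?, ?)", [
        ("foo", "testing"), ("foo", "unstable"),
        ("bar", "testing"), ("baz", "unstable"),
    ])
    conn.executemany("INSERT INTO bugs VALUES (?, ?, ?)", [
        (1, "foo", "serious"), (2, "foo", "minor"),
        (3, "bar", "grave"), (4, "bar", "critical"), (5, "baz", "grave"),
    ])
    conn.executemany("INSERT INTO popcon VALUES (?, ?)", [
        ("foo", 500), ("bar", 9000), ("baz", 20),
    ])
    return conn


# Count RC bugs per package, keep only packages present in testing,
# and sort by the number of popcon installations (highest first).
RC_BUGS_IN_TESTING = """
    SELECT b.package, COUNT(*) AS rc_bugs, COALESCE(pc.insts, 0) AS insts
    FROM bugs b
    LEFT JOIN popcon pc ON pc.package = b.package
    WHERE b.severity IN ('critical', 'grave', 'serious')
      AND b.package IN (SELECT package FROM packages WHERE suite = 'testing')
    GROUP BY b.package, pc.insts
    ORDER BY insts DESC, b.package
"""

conn = build_demo_db()
report = conn.execute(RC_BUGS_IN_TESTING).fetchall()

# Presentation code: a plain-text table.
for package, rc_bugs, insts in report:
    print(f"{package:10} {rc_bugs:3} RC bug(s) {insts:8} installations")
```

With the demo rows above, `bar` comes first (two RC bugs, 9000 installations), then `foo`. `baz` is left out because it is only in unstable. The other example reports would follow the same shape, with a different query and the same kind of presentation loop.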
Related stuff:
projectb: projectb is the DB used by the FTP team to keep track of the archive. However:
- it doesn't contain data about other aspects of Debian
- it's archive-centric, making it difficult to query it for other things
Mole: mole is a framework to store large amounts of QA-related data, and schedule tasks based on the data. For example, it's possible to automatically run a script to update the data for a package when that package is updated in Debian. The difference between Mole and this project is that this project doesn't do any scheduling, and is really aimed at combining not-so-large data.