Debsources as a Platform
Name: Clément Schreiner
Contact/Email: mailto:clement@mux.me - irc://irc.oftc.net/clemux
Background: I am a 23-year-old student in computer science at University of Strasbourg (France).
I am familiar with FLOSS, as a user since I installed Sarge in 2004, and as a developer since 2010, when I made a few contributions to the Weboob framework, which allows easy interaction between console or graphical applications and various websites.
I participated in GSoC in 2012 and 2014:
- The former on debexpo (mentors.d.n), which allowed me to get better at programming, learn to use sqlalchemy, and get interested in software quality assurance.
- The latter on Debile, on which I am still working when I have some spare time, currently trying to make it installable without causing Sudden Hair Loss Syndrome to potential contributors, as well as improving the code coverage.
I am now familiar with Debian packaging, QA tools, and communication within the project.
Project title: Debsources as a Platform
Project details
Debsources has already become a useful service for the Debian and FLOSS communities, giving everyone the ability to browse the source code of all packages in several suites, as well as a great research tool.
However, some issues must be resolved before it can reach its full potential.
The currently synchronous and sequential architecture of debsources-updater, the program that feeds data from packages into the database, does not allow existing data to be reindexed in batch, for example after changing a plugin's options. Running workers on several machines would also reduce the indexing time.
Although the situation is currently improving, the web application is still very monolithic, which makes providing new services challenging.
For example, debsources' infrastructure could be used to provide a replacement for patch-tracker.debian.org, which does not run anymore.
Another interesting application that can be developed on top of debsources' backend is copyright.d.n: the service will leverage the existing advanced search features (search by ctag symbol) and huge database of source code to answer questions like:
- What's the license of this library, am I allowed to use it in my X-licensed project?
- This binary blob contains the symbol Y, where does Y come from? Does the binary violate the library's license?
Indeed, Free Software projects use numerous licenses, some of which are incompatible with each other (to say nothing of proprietary software). Some projects also use different licenses for different parts of their code. As a result, it is currently hard for companies and communities to ensure perfect license compliance.
Moreover, by implementing SPDX export, copyright.d.n could become the largest SPDX database in the world, which would be very useful to companies involved in FLOSS.
Below is the plan for implementing these changes. All new features will be supplied with appropriate unit tests to ensure correctness.
updater
The currently synchronous update daemon will be refactored and made asynchronous, converting the various plugins into Celery tasks.
- Message broker
celery supports sqlalchemy, but the documentation warns that it is experimental
redis: probably the simplest option, and very fast
rabbitmq (default): recommended for professional-scale projects, very efficient, more complicated to set up, more of a performance drain
rabbitmq seems a sane choice, but I'll discuss that with Zack.
- Result backend. Either of:
- rabbitmq
- sqlalchemy with the existing postgresql database
There are two types of tasks:
- updater stages
- extract: updater.extract_new
- suites: updater.update_suites
- gc: updater.garbage_collector
- stats: updater.update_statistics
- cache: updater.update_metadata
- charts: updater.update_charts
- hooks
- checksums
- ctags
- metrics
- sloccount
Hooks handle two events: add_package and rm_package.
add-package events are sent by the stage extract: extract_new → add_package → _add_package → notify.
rm-package events are sent by the stage gc: garbage_collect → _rm_package → notify.
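The event flow above can be sketched with the standard library alone. This is only an illustration: the registry, the `subscribe` helper, the placeholder hooks and the package versions are all hypothetical, and a thread pool stands in for the Celery workers that the real updater would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical registry: event name -> list of (hook name, callable).
_subscribers = defaultdict(list)


def subscribe(event, name, func):
    """Register a hook for an event ('add-package' or 'rm-package')."""
    _subscribers[event].append((name, func))


def notify(event, package, version, executor):
    """Submit every hook subscribed to `event`; return {hook name: result}."""
    futures = {
        name: executor.submit(func, package, version)
        for name, func in _subscribers[event]
    }
    return {name: future.result() for name, future in futures.items()}


# Placeholder hooks; the real ones would compute checksums, run ctags, etc.
def checksums_add(package, version):
    return f"checksums indexed for {package} {version}"


def ctags_add(package, version):
    return f"ctags indexed for {package} {version}"


def checksums_rm(package, version):
    return f"checksums removed for {package} {version}"


subscribe("add-package", "checksums", checksums_add)
subscribe("add-package", "ctags", ctags_add)
subscribe("rm-package", "checksums", checksums_rm)

# The thread pool is a stand-in for a pool of Celery workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    added = notify("add-package", "hello", "2.9-2", pool)
    removed = notify("rm-package", "hello", "2.8-4", pool)
```

Because each hook only receives the package name and version, the same functions could later be wrapped as tasks and dispatched to remote workers without changing the stages that emit the events.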
Web application refactoring
Before starting, I will check the code coverage on the relevant parts of debsources. If not perfect, I will improve the tests.
- → what's the status of the OPW work on this? Do we have an estimation for its status at the beginning of GSoC?
Blueprints:
base/sources
All the base features: searching, browsing, listing and displaying packages and their contents
- /advancedsearch → this will need refactoring: ctag-based search cannot work if ctags plugin is not enabled, and it needs to be adapted for copyright.d.n and patch-tracker.d.n
- /doc
- /embed
- /list
- /prefix
- /search
- /sha256
- /src
stats
Displaying some stats will be conditional on what plugins are enabled (sloccount, ctags, ...).
/stats[/<suite>]
/api/stats[/<suite>]
infobox
/info/package/<package>/<version>/
/api/info/package/<package>/<version>/
ctags
- /ctag
- /api/ctag
sloc
I'll need to make the infobox (debsources/app/infobox.py) display the sloccount info only when the sloccount plugin is enabled.
/sloc/<package>
/api/sloc/<package>
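A minimal sketch of how blueprints and infobox sections could depend on the enabled plugins. The mapping and function names are assumptions made for illustration, and the sketch deliberately avoids the web framework itself so that the selection logic can be tested in isolation.

```python
# Hypothetical mapping: blueprint name -> updater plugin it depends on
# (None means the blueprint is always available).
BLUEPRINT_REQUIREMENTS = {
    "sources": None,
    "stats": None,
    "infobox": None,
    "ctags": "ctags",
    "sloc": "sloccount",
}


def enabled_blueprints(enabled_plugins):
    """Return the blueprints that can be registered for this configuration."""
    plugins = set(enabled_plugins)
    return [
        name
        for name, plugin in BLUEPRINT_REQUIREMENTS.items()
        if plugin is None or plugin in plugins
    ]


def infobox_sections(enabled_plugins):
    """Return the infobox sections to render; 'sloc' needs the sloccount plugin."""
    sections = ["metadata"]
    if "sloccount" in enabled_plugins:
        sections.append("sloc")
    return sections
```

With only the checksums and ctags plugins enabled, `enabled_blueprints(["checksums", "ctags"])` leaves out the sloc blueprint, and the infobox shows no sloccount section.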
copyright.d.n
Database
New tables:
author
- id
- name
license
Licenses will be identified by the SPDX identifiers when available. See spdx's git repository for a machine-readable list.
- id
- name: GPL-2, MPL-1.1, …
- text
copyright
- package_id
- file_id
- license_id
- author_copyright_id
author_copyright junction-table for the many-to-many relationship between copyright and author
- id
- author_id
- copyright_id
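The tables above could look roughly as follows. This sketch uses the standard `sqlite3` module instead of sqlalchemy and postgresql so that it runs on its own. It also assumes that copyright gets its own id column and that the junction table alone carries the link to authors; the sample rows are made up.

```python
import sqlite3

SCHEMA = """
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE license (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,  -- SPDX identifier when available
    text TEXT
);
CREATE TABLE copyright (
    id INTEGER PRIMARY KEY,     -- assumption: not listed in the plan above
    package_id INTEGER NOT NULL,
    file_id INTEGER NOT NULL,
    license_id INTEGER NOT NULL REFERENCES license(id)
);
CREATE TABLE author_copyright (
    id INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL REFERENCES author(id),
    copyright_id INTEGER NOT NULL REFERENCES copyright(id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up sample data: one package (id 10) with two files.
conn.execute("INSERT INTO author (id, name) VALUES (?, ?)", (1, "Jane Doe"))
conn.executemany(
    "INSERT INTO license (id, name) VALUES (?, ?)",
    [(1, "GPL-2"), (2, "MPL-1.1")],
)
conn.executemany(
    "INSERT INTO copyright (id, package_id, file_id, license_id) VALUES (?, ?, ?, ?)",
    [(1, 10, 100, 1), (2, 10, 101, 2)],
)
conn.execute("INSERT INTO author_copyright (author_id, copyright_id) VALUES (1, 1)")


def licenses_of_file(conn, file_id):
    """Return the license names recorded for a file, sorted by name."""
    rows = conn.execute(
        "SELECT license.name FROM copyright "
        "JOIN license ON license.id = copyright.license_id "
        "WHERE copyright.file_id = ? ORDER BY license.name",
        (file_id,),
    )
    return [name for (name,) in rows]


def authors_of_file(conn, file_id):
    """Return the authors linked to a file through the junction table."""
    rows = conn.execute(
        "SELECT author.name FROM copyright "
        "JOIN author_copyright ON author_copyright.copyright_id = copyright.id "
        "JOIN author ON author.id = author_copyright.author_id "
        "WHERE copyright.file_id = ? ORDER BY author.name",
        (file_id,),
    )
    return [name for (name,) in rows]
```

The two query helpers correspond to the "display the licenses of a file" feature; the package-level view would use the same joins filtered on package_id.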
Worker
Parse debian/copyright files. See https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/, formerly DEP5.
- Some packages don't have a machine-readable copyright file. I need to figure out a way to handle those packages.
- feed the parsed data into the database
This will need to be future-proof, so the worker will be designed in a way that future copyright formats can easily be added.
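As a rough idea of what the parsing step involves, here is a small standard-library sketch. It is not a complete implementation of the copyright format: field names are matched case-sensitively, only the short license name is extracted, and the function names and sample file are hypothetical. A real worker would more likely build on an existing parser.

```python
def parse_paragraphs(text):
    """Split a control-file style document into a list of {field: value} dicts."""
    paragraphs, current, field = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current paragraph.
            if current:
                paragraphs.append(current)
            current, field = {}, None
        elif line[0] in " \t" and field is not None:
            # Continuation line: append to the previous field.
            current[field] += "\n" + line.strip()
        elif ":" in line:
            field, _, value = line.partition(":")
            field = field.strip()
            current[field] = value.strip()
    if current:
        paragraphs.append(current)
    return paragraphs


def is_machine_readable(text):
    """Heuristic: the first paragraph of a machine-readable file has a Format field."""
    paragraphs = parse_paragraphs(text)
    return bool(paragraphs) and "Format" in paragraphs[0]


def file_licenses(text):
    """Return [(file patterns, license short name)] for each Files paragraph."""
    result = []
    for paragraph in parse_paragraphs(text):
        if "Files" in paragraph:
            # The first line of the License field is the short name.
            short_name = paragraph.get("License", "").split("\n")[0]
            result.append((paragraph["Files"].split(), short_name))
    return result


# Made-up example file.
SAMPLE = """\
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: hello

Files: *
Copyright: 2010 Jane Doe
License: GPL-2+
 This program is free software.

Files: debian/* lib/compat.c
Copyright: 2012 John Roe
License: MPL-1.1
"""

parsed = file_licenses(SAMPLE)
```

Keeping `is_machine_readable` separate from the extraction makes it easy to count the packages that need another strategy, and to plug in parsers for future formats next to this one.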
App
New blueprint: copyright.
Essential features to implement:
- display the licenses of a file
- display the licenses of a package
- display the text of a license
Bonus features:
- for each license, determine whether it is compatible with another given license
- stats: number of packages under a given license
- stats: number of files under a given license
Routes:
/licenses/<license>
- /packages/
/packages/<package>
/files/<file>
- /authors/
/authors/<authors>
- /stats
- /api
Render the information in a user-friendly way.
API
- search a package or file, get the licenses
- export a package's copyright data to SPDX
- export a package's copyright data to JSON, for reuse in third-party web applications, and as a warm-up for implementing SPDX export
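The JSON export could be as simple as the sketch below. The document layout and key names are assumptions chosen for illustration, not a fixed API, and the sample values are made up.

```python
import json


def export_package_json(package, version, entries):
    """Serialize copyright data; `entries` holds (path, license, authors) tuples."""
    document = {
        "package": package,
        "version": version,
        "files": [
            {"path": path, "license": license_name, "authors": sorted(authors)}
            for path, license_name, authors in entries
        ],
    }
    return json.dumps(document, indent=2, sort_keys=True)


exported = export_package_json(
    "hello",
    "2.9-2",
    [("src/hello.c", "GPL-2+", ["Jane Doe"]), ("lib/compat.c", "MPL-1.1", [])],
)
```

The same intermediate dictionary could later feed the SPDX export, which is why the JSON export works as a warm-up.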
patch-tracker
The goal is to replace the now defunct patch-tracker.d.o.
Database
New tables:
patch
- id
- package_id
- diff_name
- origin (nullable)
- author (nullable)
- bug_debian (nullable)
- bug_upstream (nullable)
- description
patch_file: tracks what a patch does to a single file
- id
- patch_id
- file_id
- additions (int)
- deletions (int)
Worker
Parse debian/patches/*. See DEP3.
Problem: do all packages use DEP3 for their patches? I'd guess not, since it was accepted only 3 years ago.
Bonus feature: parse each diff to gather statistics (number of additions/deletions on a given file)
→ unidiff might help
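Before settling on a library, both steps can be prototyped with the standard library. The sketch below reads a few DEP3 header fields and counts additions and deletions per file in a unified diff. The function names and the sample patch are hypothetical, multi-line descriptions are reduced to their first line, and file paths keep whatever prefix the diff uses.

```python
import re

# Subset of DEP3 fields matching the columns of the patch table.
HEADER_FIELDS = ("Description", "Origin", "Author", "Bug-Debian", "Bug")
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def parse_dep3_header(patch_text):
    """Collect the DEP3 fields found before the first diff."""
    header = {}
    for line in patch_text.splitlines():
        if line.startswith(("---", "diff ", "Index: ")):
            break  # end of the header
        name, sep, value = line.partition(":")
        if sep and name in HEADER_FIELDS:
            header[name] = value.strip()
    return header


def diff_stats(patch_text):
    """Return {file path: [additions, deletions]} for a unified diff."""
    stats = {}
    current = None
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in patch_text.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk: classify the line by its first character.
            if line.startswith("+"):
                stats[current][0] += 1
                new_left -= 1
            elif line.startswith("-"):
                stats[current][1] += 1
                old_left -= 1
            elif not line.startswith("\\"):
                # Context line (a backslash line is the "no newline" marker).
                old_left -= 1
                new_left -= 1
            continue
        if line.startswith("+++ "):
            current = line[4:].split("\t")[0].strip()
            stats.setdefault(current, [0, 0])
        else:
            match = HUNK_RE.match(line)
            if match and current is not None:
                old_left = int(match.group(1) or 1)
                new_left = int(match.group(2) or 1)
    return stats


# Made-up example patch.
SAMPLE_PATCH = """\
Description: Fix greeting typo
Author: Jane Doe <jane@example.org>
Bug-Debian: https://bugs.debian.org/123456
---
--- a/src/hello.c
+++ b/src/hello.c
@@ -1,3 +1,4 @@
 #include <stdio.h>
-int main() { puts("helo"); }
+int main(void)
+{ puts("hello"); }
 /* end */
"""

header = parse_dep3_header(SAMPLE_PATCH)
stats = diff_stats(SAMPLE_PATCH)
```

Tracking the line counts announced by each hunk header avoids misreading a removed line that itself starts with dashes as a new file header. Patches without any DEP3 header simply yield an empty dictionary, which maps naturally to the nullable columns of the patch table.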
App
New blueprint
/packages/<package>
- view patch series of the package
- view statistics about the patches (additions, deletions per file, per patch)
- bonus: evolution of the number of patches/additions/deletions since previous version of the package
/patches/<patch> : view the diff and statistics of a patch
- /api
API
- download diff
- retrieve data equivalent to the views above
Synopsis
- Redesign the architecture of debsources updater, to make it more flexible and more scalable
- Turn Debsources into a platform for running novel web applications using the same backend as the existing searching, browsing and statistics-gathering features.
- Replace the late patch-tracker with an application built on top of that platform
- Develop an application for viewing copyright data of all Debian packages, with SPDX export.
Benefits to Debian
Debsources is becoming more and more important in the ecosystem of Debian web applications: it has been integrated into codesearch.d.n, the PTS (both original and the new tracker.d.o) and is used by firewoes, which will be important to Debile once it overcomes its current issues.
[WIP] These changes are necessary to make Debsources reach its full potential as an important piece of the Debian ecosystem, as a service to FLOSS users and contributors in general, and as a research tool.
Deliverables
updater
- new, asynchronous architecture for the updater, using Celery
- new worker for parsing debian/copyright files and injecting them into the database
- new worker for parsing debian/patches/ and injecting information about the patches into the database
Web application
- refactoring of the current web application to make it modular
- new web application: copyright.debian.net
- new web application: patch-tracker, to replace the defunct patch-tracker.d.o
Project schedule
I am planning buffer periods for each subproject. They will be used for debugging, improving unit tests, and writing documentation. Non-essential ("bonus") features will be developed within those when I have time to spare.
Community bonding
If not done before, proof-of-concept of celery tasks. For example, gathering statistics about copyright-format usage in jessie, to get an idea of how many packages don't have machine-readable copyright files.
Set up a blog where I'll write short daily reports. In addition to helping me write the weekly reports to soc-coordination, it'll allow me to quickly realize where I'm unproductive or getting off track.
Maximize code coverage, especially in areas where I'll break stuff: updater, webapp blueprints and views.
Celerify debsources-updater
Week 1: celerify the updater
Week 2: safety buffer
Finalize the web app refactoring
Week 3
- finish web app refactoring
- safety buffer
copyright.d.n
Week 4
- new celery tasks for parsing the copyright files
- start the blueprint for viewing copyright data
Week 5-7
- continue working on the web app
- fix bugs in the celery tasks
- safety buffer
patch-tracker.d.n
Week 8
- sqlalchemy models for storing patch data
- new celery tasks for parsing patches
- start the web app
Week 9-11
- continue working on the web app
- fix bugs in the celery tasks
- safety buffer
End of the summer
Week 12-13
- finalize everything
- improve tests
- improve doc
Exams and other commitments
End of exams mid-May, no other commitment after that.
Why Debian?
Are you applying for other projects in SoC?
Yes.
