Name: Boris Bobrov
Background: I am a third-year student at the Tashkent branch of Moscow State University, in the faculty of applied mathematics and informatics. Experience:
- 2.5 years of Python
- 2 years of Django
- 2.5 years of C in university
- 1.5 years of C++ in university
- 5 years of HTML and CSS
- 1 year of Scheme and 6 months of Common Lisp
- Basic knowledge of system administration (set up nginx, uwsgi etc, familiar with cron, shell scripting)
- Familiar with unit tests and test-driven development techniques
- Debian GNU/Linux as main OS for 4 years
- git (also know mercurial)
- vim (emacs for Lisp)
Most of my code was written for private projects, and I am not sure whether I can show it publicly, though I can show it on request.
Project title: Debian Metrics Portal
The project is to create the Debian Metrics Portal, a central place for various metrics and stats. The portal will:
- Perform measurements itself, from various sources and in different ways;
- Collect ready-made stats from various places;
- Display collected data in various ways (in text, in plots)
What is required to describe when creating a new metric
- Name, category etc
- Where measurement happens (local/remote)
- If remote, what script will receive and save the data
- If local, what script will collect and save the data
- What script will return JSON with the required data (see "Why JSON" section below). This script will be invoked when a visitor requests a page with metrics
- What fields can be selected for building graphs, what their types are, and what the type of the graph is (plot, histogram etc)
- A Django template, which will be used to render the data (though other templating languages may be added in the future)
The author of the script decides how and where to store the data.
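As a rough illustration, the items a metric author has to describe could be collected in one structure. This is only a sketch: the class name `MetricDescription`, its field names and the default template path are hypothetical and are not part of the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class MetricDescription:
    """Hypothetical container for what a metric author describes."""
    name: str
    category: str
    location: str          # "local" or "remote"
    collect_script: str    # collects (local) or receives (remote) and saves the data
    output_script: str     # prints JSON when a page with the metric is requested
    graph_fields: dict = field(default_factory=dict)  # field name -> type name
    graph_type: str = "plot"                          # e.g. "plot" or "histogram"
    template: str = "metrics/generic.html"            # assumed default template path

    def validate(self):
        """Return a list of human-readable problems (empty if none were found)."""
        problems = []
        if not self.name:
            problems.append("name is empty")
        if self.location not in ("local", "remote"):
            problems.append("location must be 'local' or 'remote'")
        return problems


# Example: a locally measured metric with two fields that may be graphed.
bugs_metric = MetricDescription(
    name="Open bugs",
    category="BTS",
    location="local",
    collect_script="collect_bugs.py",
    output_script="bugs_json.py",
    graph_fields={"date": "string", "bugs": "int"},
)
```

How and where the data is stored is deliberately absent from the structure, since that is left to the script author.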
For remote measurements
- The remote script will send a request to the portal with JSON containing the data
- The JSON will be passed to a defined script (via stdin)
- The output of the script will be returned in the reply
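The stdin hand-off described above could look roughly like the following. It is a minimal sketch using only the standard library; the function name `handle_remote_measurement`, the timeout value and the stand-in handler `COUNT_HANDLER` are assumptions made for illustration.

```python
import json
import subprocess
import sys


def handle_remote_measurement(payload, handler_cmd, timeout=30.0):
    """Pass the JSON payload to the handler script on stdin and return its stdout.

    The returned text is what the portal would send back in the reply.
    """
    result = subprocess.run(
        handler_cmd,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=timeout,  # assumed limit so that a stuck script cannot block the portal
        check=True,       # a non-zero exit status raises CalledProcessError
    )
    return result.stdout


# Stand-in for a user-provided script: it reads the JSON from stdin and
# prints the number of samples it received.
COUNT_HANDLER = [
    sys.executable,
    "-c",
    "import json, sys; data = json.load(sys.stdin); print(len(data['samples']))",
]

reply = handle_remote_measurement({"samples": [1, 2, 3]}, COUNT_HANDLER)
```

In the portal itself, the command would come from the metric's description rather than from a constant.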
Why give so much freedom
There can be many use cases for how statistics are displayed. Examples:
http://udd.debian.org/sponsorstats.cgi (text, multiple levels)
http://ftp-master.debian.org/stat.html (multiple graphs on one page, no need to select which plots to graph)
http://gnu.ethz.ch/linuks.mine.nu/sizematters/ (data implanted nicely into text)
http://buildd.debian.org/stats/ (dynamic (!) page which shows stats depending on passed parameters)
http://www.lucas-nussbaum.net/blog/?p=751 (multiple graphs with multiple plots, and for each graph it'd be nice to select what plots to build)
And they all belong in the Debian Metrics Portal.
Not all metrics are complex or require this much freedom, though. Access to the full set of possibilities could be granted only by the portal administrator.
- Most Debian statistics collectors are written in either Python or Ruby. Both languages support conversion to JSON well, and the data we pass is rather simple
- A tag for inserting graphs into user templates will be made (see the "graph creating" section below)
- Some default generic templates will be available in order to make the interface of the site more uniform
- In the future, another templating engine could be used (alongside Django's), but because there will be a set of predefined templates, this will not be part of the GSoC project
One of the main features of the portal will be an easy way to plot data. The metric author should not care how to graph, only what can be graphed.
- The portal visitor may wish to plot different fields of the metric
- There should be some defaults
- Some data cannot be plotted (for example, we cannot plot Names on X axis and Mailing Lists on Y)
- What is used for graphing is decided by the portal
- Some data is too heavy to be graphed by various fields ("SELECT count(*) from public.all_sources" returns 237818)
- TODO: how many entries is "heavy"? Does the "heavy" depend on the number of entries only? Leave it to the metrics author?
- But even some heavy relations can be pre-graphed and returned as static image files
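A check of which fields can be plotted against each other might be sketched as below. The rule used here (the Y field must be numeric, the X field only has to be declared) is an assumption chosen to match the Names/Mailing Lists example; the proposal does not fix the actual rule, and the names `can_plot`, `plottable_pairs` and `NUMERIC_TYPES` are hypothetical.

```python
# Assumed set of type names that may go on the Y axis.
NUMERIC_TYPES = {"int", "float"}


def can_plot(fields, x, y):
    """Return True if field `y` can be plotted against field `x`.

    `fields` maps field names to the type names declared by the metric author.
    """
    if x not in fields or y not in fields or x == y:
        return False
    return fields[y] in NUMERIC_TYPES


def plottable_pairs(fields):
    """List every (x, y) combination the portal could offer to the visitor."""
    return [(x, y) for x in fields for y in fields if can_plot(fields, x, y)]


declared = {"name": "string", "mailing_list": "string", "posts": "int"}
pairs = plottable_pairs(declared)
```

With these declarations, `name` against `mailing_list` is rejected, while `posts` can be plotted against either of the other two fields.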
A template tag should handle it all. It will:
- Output the graph (default or with fields requested by user)
- Output controls to build a graph from other fields (if allowed)
- Decide how to organize the graph (labels, titles etc)
- Handle metrics adding, zooming and other manipulations with the plot
Some users might want to share a hotlink to the graph image. A URL to the graph image, plotted by matplotlib, will be generated for each change made to the Flot plot, such as adding a new metric or zooming.
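One way to derive such a URL is to encode the current state of the plot in a query string. The sketch below assumes a hypothetical image endpoint `/graphs/image.png` and hypothetical parameter names (`metric`, `xmin`, `xmax`); metrics are sorted so that the same plot state always yields the same URL, which should also make the images easier to cache.

```python
from urllib.parse import urlencode


def static_graph_url(base, metrics, x_range=None):
    """Build the URL of the static image that matches the current plot state."""
    # Sorting makes the URL independent of the order in which metrics were added.
    params = [("metric", name) for name in sorted(metrics)]
    if x_range is not None:
        # The zoom level is expressed as the visible range on the X axis.
        params.append(("xmin", x_range[0]))
        params.append(("xmax", x_range[1]))
    return base + "?" + urlencode(params)


url = static_graph_url(
    "/graphs/image.png",
    ["uploads", "bugs"],
    ("2013-01-01", "2013-06-01"),
)
```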
If JS is disabled in the visitor's browser, a default version generated by matplotlib will be shown, with controls for changing the metrics on the plot, zooming etc, but each action will require a page reload. Time and design permitting, this fallback version will be made as close to the JS version as possible.
The Django cache framework will be used for caching. As a preliminary assessment, caching may be required in the interaction between the portal and user-provided scripts.
Metrics can be pretty simple. For example, the number of bugs as a function of date can be represented as a simple list of dicts. For such cases a more generic approach can be used.
A simple metric
- Name, category etc
- Where measurement happens (local/remote)
- If local, what script generates the JSON with data (and how often it should be called)
- Whether the data is a delta from the previous measurement or the script regenerates the whole sample
- What keys the data has and what their types are (int or string)
- What keys can be used for graphing
- A template (with an option to use a generic template)
This data will be saved in an internal table; the script author does not need to care where the measurements are saved.
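The handling of a simple metric's data could be sketched as follows: rows are checked against the declared keys and types, then either appended (delta) or used to replace the stored sample (full regeneration). A plain list stands in for the internal table, and the names `check_rows` and `store_measurement` are hypothetical.

```python
# The two key types a simple metric may declare.
ALLOWED_TYPES = {"int": int, "string": str}


def check_rows(rows, key_types):
    """Raise ValueError if a row does not match the declared keys and types."""
    for row in rows:
        if set(row) != set(key_types):
            raise ValueError("unexpected keys: %r" % sorted(row))
        for key, type_name in key_types.items():
            if not isinstance(row[key], ALLOWED_TYPES[type_name]):
                raise ValueError("%s should be of type %s" % (key, type_name))


def store_measurement(stored, new_rows, key_types, is_delta):
    """Return the new stored sample; a list stands in for the internal table."""
    check_rows(new_rows, key_types)
    if is_delta:
        return stored + list(new_rows)  # append only what changed
    return list(new_rows)               # the script regenerated the whole sample


# Number of bugs by date, represented as a list of dicts.
key_types = {"date": "string", "bugs": "int"}
sample = [{"date": "2013-04-01", "bugs": 120}]
sample = store_measurement(
    sample, [{"date": "2013-04-02", "bugs": 118}], key_types, is_delta=True
)
```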
Some other notes
A nice example of graphs layout: https://metrics.torproject.org/
- TODO: Decide where to use Ajax and for what.
- The "simple metric" will be based on the "generic metric" and new types of metrics can be added, if required.
- For example, a "UDD metric", where the author will be required only to select tables and fields from UDD, which will be used for statistics building
Synopsis: building a Debian metrics portal with a uniform (Web) interface to peruse Debian metrics, as well as a uniform (programming) interface to maintain them.
Benefits to Debian:
- A single place for all statistics and metrics
- An easy way to create simple metrics
- standardized interface to add/remove metrics to be graphed (possibly with different sampling rate)
- integration of existing graphs in the metrics infrastructure
- web interface to show daily (or more frequently) updated graphs of the various metrics
- dynamic web interface to graph, on demand, specific metrics (possibly more than one at a time) over specific time period
- a portal with modular architecture
- where generic tasks could be simplified
- suitable for use by users of different programming languages
- Community Bonding Period:
- discussing details
- getting to know the Debian infrastructure and common Debian practices
- making different arrangements, like creating the project, setting up the db etc
- measuring performance of common data sources, making decisions about caching and pre-graphing possibilities
- planning how to organize the (Django) views, planning how to organize the models, planning, planning, planning...
- June 17 - 30: the beginning of the project, create a first prototype.
- Create the database schema
- Build a simple interface for adding metrics
- Add some simple scripts (from /Statistics) for local data generation and data output
- Simple printing of the collected data
- July 1 - 7: Remote data collection - URLs, APIs, script handling
- July 8 - 28: Graph creating
- Flot and required JS (and Ajax) code
- URLs for static plots images, matplotlib plots generation
- Fallback (no-JS) matplotlib graphs
- July 29 - August 4
- Put everything together. At this point there should be a working version of the generic metric
- August 5 - 11: Make some templates for typical use cases
- August 12 - 25: Code the "Simple Metric". DebConf13 is held somewhere around here
- August 26 - September 1:
- Move more scripts listed in /Statistics under the control of the portal
- Check their performance, add caching where required
- September 2 - 23:
- Check how the scripts behave, make necessary changes to the code in case of problems (most of the time)
- Fix bugs, make a pretty UI
- Add features that come up unexpectedly
- Time-permitting, package the code
Exams and other commitments:
- Possibly 2 exams in the middle of June, with a day off for each one.
- My next semester begins on the 3rd of September, so most of the time-consuming work should be done before then.
Other summer plans:
If I get selected for this task, I'd like to visit DebConf13
Why Debian?: I have used Debian GNU/Linux for about 4 years and see GSoC as a good way to become more closely integrated into the community.
Applications to other orgs: none