Name: Apoorv Upreti
Contact/Email: apoorvupreti@gmail.com, I idle on #debian-qa as nerdap
Background:
Project title: PTS Rewrite in Django
Project details:
- There will be seven major parts to the project (listed below in no particular order):
- Porting the PTS email interface currently written in Perl to Python.
- Retrieving package information from a variety of sources and making it accessible via Django models.
- Creating a REST API to access this package information.
- Creating a web interface where data can be presented in HTML (for humans) and RDF/XML and/or Turtle (for machines).
- (Optional, if time permits) Creating a portal where subscribers can register and add/delete subscriptions to packages/tags, change some settings, etc.
- Comparing the performance of the rewritten PTS to the old one along with some scalability analysis.
- Implementing caching at various levels in the application.
- I will be using Django's unittest-based test framework to write tests for each part as it is completed. I am also considering test-driven development as the development process.
- More details on each of the above points:
Porting the PTS email interface
- This should be relatively straightforward compared to the rest of the project. I've already read most of the current email interface's code. A direct file-by-file conversion from Perl to Python is probably not a good idea. It would be more appropriate to consider what features are offered by the current interface and then implement these features in the new application.
Retrieving package information and making it accessible via Django models
- This is probably the most challenging part of the project and will involve a bit of research.
- There are two main issues here:
When to download package information
Downloading the data on demand each time a package's information is required is unreasonable. Stefano mentioned (http://lists.debian.org/debian-qa/2013/04/msg00059.html) that some package information changes less frequently and thus needs to be downloaded less frequently, while other package information, like bug reports, should be downloaded as frequently as is reasonable. Some information, like bug reports, could in fact be downloaded on demand; the performance implications of this will have to be looked into.
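The idea of per-source download frequencies can be sketched in a few lines of Python. This is only an illustration: the source names and intervals below are assumptions made up for the example, not values taken from the current PTS.

```python
from datetime import datetime, timedelta

# Hypothetical source names and intervals, chosen only for illustration.
REFRESH_INTERVALS = {
    "bugs": timedelta(minutes=15),          # changes often
    "source_metadata": timedelta(hours=6),  # changes a few times a day
    "popcon": timedelta(days=1),            # changes slowly
}


def sources_due(last_fetched, now):
    """Return the names of sources whose data is older than its interval.

    last_fetched maps a source name to the datetime of its last successful
    download; a source that has never been fetched is always due.
    """
    due = []
    for name, interval in REFRESH_INTERVALS.items():
        fetched_at = last_fetched.get(name)
        if fetched_at is None or now - fetched_at >= interval:
            due.append(name)
    return sorted(due)
```

A scheduler could call sources_due() periodically and download only the sources it returns.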
How to make this accessible via Django models
The other issue is how to make the package information accessible via Django models. It would be possible to insert data into a database as it is downloaded and remove it when we download new data, but since the data will be updated so frequently, it might be a good idea to store it directly in a cache along with an appropriate timeout. django-redis is a popular cache backend that could be a good option (https://github.com/niwibe/django-redis). A custom database backend would have to be written so that data is looked up in the cache first and, if it is not found, downloaded from an appropriate source. Alternatively, the custom backend would simply look for the data in the cache, and a scheduler like cron or Celery could populate the cache at regular intervals.
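The "look in the cache, download on a miss" behaviour can be sketched without Django. In the sketch below, TimedCache is a small in-memory stand-in for a real cache backend such as django-redis, and get_package_info and fetch are hypothetical names used only for illustration; it is not the custom database backend itself.

```python
import time


class TimedCache:
    """A tiny in-memory stand-in for a cache backend with per-key timeouts."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry
            return None
        return value

    def set(self, key, value, timeout):
        self._entries[key] = (self._clock() + timeout, value)


def get_package_info(cache, package, fetch, timeout=300):
    """Return cached data for a package, downloading it on a miss.

    fetch is a hypothetical callable that downloads the data from its source.
    """
    key = "package:" + package
    info = cache.get(key)
    if info is None:
        info = fetch(package)
        cache.set(key, info, timeout)
    return info
```

With the scheduler-based alternative, get_package_info would only read from the cache, and a periodic job would be the only caller of cache.set().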
- There are a few reasons why I want to try as much as possible to use Django models to access the data, even though writing a custom database backend could be challenging:
- It's more in the spirit of the MVC paradigm
- It abstracts the worries of data retrieval away from someone who might wish to add features to other parts of the PTS
- API frameworks like Tastypie (http://tastypieapi.org/) can expose models as resources with very little effort and very elegantly
- There is a project called django-rdf that can produce RDF/XML data from any Django model
Creating a REST API
I want to get the REST API specifications done early because the Debian Android app GSoC idea (http://wiki.debian.org/SummerOfCode2013/Projects#Debian_Android_Application) plans on displaying information from the PTS. Since the current SOAP API is unlikely to work after the rewrite, the new REST API would have to be used instead. The app might also provide the ability for users to log in to the PTS and manage their subscriptions so it might be necessary to add the ability for users to log in and manage subscriptions using the API.
Tastypie (http://tastypieapi.org/) could be used to develop the API.
Creating a web interface where data can be presented in HTML and RDF/XML and/or Turtle
- Currently the data downloaded by the PTS is stored in XML files. XSLT is used to generate HTML pages from this XML. In this part of the project, the objective is to implement a similar web interface using Django templates.
- I don't have any experience with RDF, but a quick Google search turned up django-rdf (http://code.google.com/p/django-rdf/), which claims to produce RDF/XML data from any Django model. There have been a few complaints about performance issues with django-rdf, so I'll have to look into that first.
- Views will decide whether to serve HTML or RDF using the Accept header of the HTTP request.
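Choosing the output format from the Accept header could look roughly like the following. This is a simplified sketch: it honours q-values but ignores partial wildcards such as text/*, and the function names and the fallback to HTML are assumptions made for illustration.

```python
# Media types the views could serve: HTML, RDF/XML and Turtle.
SUPPORTED_TYPES = ("text/html", "application/rdf+xml", "text/turtle")


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, best first."""
    parsed = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not wanted"
        parsed.append((media_type, quality))
    # sorted() is stable, so entries with equal q keep the client's order.
    return sorted(parsed, key=lambda item: item[1], reverse=True)


def choose_content_type(header, default="text/html"):
    """Pick the best supported media type, falling back to the default."""
    for media_type, quality in parse_accept(header or ""):
        if quality <= 0:
            continue
        if media_type in SUPPORTED_TYPES:
            return media_type
        if media_type == "*/*":
            return default
    return default
```

A view could pass the request's Accept header to choose_content_type() and then render either a template or an RDF serialization depending on the result.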
(Optional, if time permits) Creating a portal for subscribers to manage subscriptions
- Pretty self-explanatory and relatively straightforward. It's worth noting that the models for subscribers will need a different database backend than the models which access package information, so Django will have to be configured to use multiple databases. Still, the task is fairly simple.
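Django's multiple-database support uses router classes to decide which database a model lives in. A minimal sketch of such a router is shown below; the app label "packages" and the database alias "package_data" are hypothetical, and the methods that control schema creation are omitted.

```python
class PackageDataRouter:
    """Send models from one app to a separate database alias.

    The app label "packages" and the alias "package_data" are hypothetical.
    """

    package_app = "packages"
    package_db = "package_data"

    def _db_for(self, model):
        # Works for model classes and instances; both expose _meta.app_label.
        if model._meta.app_label == self.package_app:
            return self.package_db
        return "default"

    def db_for_read(self, model, **hints):
        return self._db_for(model)

    def db_for_write(self, model, **hints):
        return self._db_for(model)

    def allow_relation(self, obj1, obj2, **hints):
        # Only allow relations between objects stored in the same database.
        return self._db_for(obj1) == self._db_for(obj2)
```

The router would be enabled by listing its dotted path in the DATABASE_ROUTERS setting, with both aliases defined in DATABASES.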
Comparing the performance of the rewritten PTS to the old one along with some scalability analysis
One concern that some of the developers on the debian-qa mailing list had was that the new, somewhat more dynamic PTS would perform poorly or use too much computational power compared to the current PTS (http://lists.debian.org/debian-qa/2013/04/msg00065.html). In the final part of the project, the performance and scalability of the new implementation would be benchmarked against the old one.
Implementing caching
- There are many caching options worth looking into:
- Caching parts of templates using the cache template tag
- Using Django cache middleware to cache responses returned by views
- Caching database queries using Johnny Cache (http://pythonhosted.org/johnny-cache/)
- Using a package like django-staticgenerator (https://github.com/timetric/django-staticgenerator) to generate static HTML files
- Using a caching HTTP reverse proxy like Varnish (https://www.varnish-cache.org/)
Synopsis: Debian's Package Tracking System lets developers view a variety of interesting and useful information about packages available in Debian. Currently the PTS is a mix of Perl, Python (CGI) and shell scripts. This project aims to rewrite the PTS using the Django web framework and Python. Key goals are making a web application that is easy to deploy; making the PTS more dynamic, with some of the data being downloaded on demand or more frequently than before; and ensuring the new PTS is easier to hack on by using technologies that are popular nowadays. All of this should be done while ensuring that the new PTS uses as little computing power as possible (by caching, intelligently deciding what data needs to be downloaded more frequently, etc.).
Benefits to Debian
Deliverables: quantifiable results e.g. 'Port Debian to VAX', 'Write 3 articles for X website'.
Project schedule: how long will the project take? When can you begin work?
Exams and other commitments: do you have university exams inside the SoC period? If so, that's most likely not a problem but please tell us early!
Other summer plans: are you getting married? Do you have a long vacation planned? Are you expecting to start a job? Be aware that if you are accepted for the summer, then Google will be paying you as though you were working for them. We (in Debian) will therefore expect you to be working 35-40 hours per week on your project. It is very unlikely that you will be able to combine a successful SoC with another summer job working for somebody else.
Why Debian?: Why are you choosing Debian? What attracts you about Debian?
Are you applying for other projects in SoC? Note that letting us know about this does not impact your chances of acceptance or rejection with us; we ask this because it helps us to resolve deduplications wherein a student is accepted for multiple projects.