Name: Apoorv Upreti
Contact/Email: firstname.lastname@example.org, I idle on #debian-qa as nerdap
- I'm a 3rd year Computer Science undergraduate from BITS Pilani, India. I'm a member of the Department of Visual Media at my university, where for three years I've been using Django to develop web sites for my college's student-run fests. Some of these were simple, with nothing more than a couple of models and a few views to render data from these models onto templates. Some others were more complex, with over 4000 participants, XLSes/PDFs being generated for each participating college, extending the Django admin to add some features for confirming and mailing participants who had registered for an event, the ability for participants to form teams among themselves for events that required team participation, generating barcodes for each confirmed participant so it could be printed on their ID cards, etc.
I also spent about a year working for a startup called TunePatrol (http://tunepatrol.com/), where I was the backend developer and server administrator. I was responsible for developing the server-side application using Django and Python libraries such as PIL and python-memcached. This involved things like authentication via Facebook, streaming mp3 files, generating newsfeeds and activity feeds (somewhat like Facebook's timeline), a mechanism for liking songs/albums/playlists and so on. To improve performance, we ended up using memcached along with johnny-cache, a Django package that caches database queries, which helped considerably.
Project title: PTS Rewrite in Django
- There will be seven major parts to the project (listed below in no particular order):
- Porting the PTS email interface currently written in Perl to Python.
- Retrieving package information from a variety of sources and making it accessible via Django models.
- Creating a REST API to access this package information.
- Creating a web interface where data can be presented in HTML (for humans) and RDF/XML and/or Turtle (for machines).
- Creating a portal where subscribers can register and add/delete subscriptions to packages/tags, change some settings, etc.
- Comparing the performance of the rewritten PTS to the old one along with some scalability analysis.
- Implementing caching at various levels in the application.
- I will be using Django's unittest framework to write tests for each part as it is completed. I'm also considering test-driven development as the process for the project.
- More details on each of the above points:
Porting the PTS email interface
- This should be relatively straightforward as compared to the rest of the project. I've already read most of the current email interface's code. A direct file by file conversion from Perl to Python is probably not a good idea. It would be more appropriate to consider what features are offered by the current interface and then implement these features in the new application.
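As a rough illustration of the feature-by-feature approach, here is a minimal sketch of a command dispatcher for a plain-text control mail. The command table, the handler functions and their reply strings are hypothetical and only loosely modelled on a subscribe/unsubscribe style of interface; the real feature list would come from reviewing the current Perl code.

```python
from email import message_from_string
from email.utils import parseaddr


def subscribe(sender, args):
    # args is "<package> [<email>]"; default to the sender's address.
    package = args[0]
    address = args[1] if len(args) > 1 else sender
    return f"subscribed {address} to {package}"


def unsubscribe(sender, args):
    package = args[0]
    address = args[1] if len(args) > 1 else sender
    return f"unsubscribed {address} from {package}"


# Hypothetical command table; each feature of the old interface
# would become one entry here.
COMMANDS = {"subscribe": subscribe, "unsubscribe": unsubscribe}


def process_control_mail(raw_message):
    """Run every recognised command line in a plain-text control mail."""
    message = message_from_string(raw_message)
    sender = parseaddr(message.get("From", ""))[1]
    replies = []
    for line in message.get_payload().splitlines():
        words = line.split()
        # Skip blank lines and comment lines.
        if not words or line.lstrip().startswith("#"):
            continue
        handler = COMMANDS.get(words[0].lower())
        if handler is None:
            replies.append(f"unknown command: {words[0]}")
        elif len(words) < 2:
            replies.append(f"{words[0]}: missing package name")
        else:
            replies.append(handler(sender, words[1:]))
    return replies
```

Keeping the commands in a table means each ported feature can be added and tested on its own, without touching the mail-parsing code.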
Retrieving package information and making it accessible via Django models
- This is probably the most challenging part of the project and will involve a bit of research.
- There are two main issues here:
When to download package information
Downloading the data on demand each time a package's information is required is unreasonable. Stefano mentioned (http://lists.debian.org/debian-qa/2013/04/msg00059.html) that some package information changes less frequently and thus needs to be downloaded less frequently, while other package information, like bug reports, should be downloaded as frequently as is reasonable.
How to make this accessible via Django models
The other issue is how to make the package information accessible via Django models. One option is to insert data into a database as it is downloaded and remove it when new data is downloaded, but since the data will be updated so frequently, it might be better to store it directly in a cache with an appropriate timeout. django-redis (https://github.com/niwibe/django-redis) is a popular cache backend that could be a good option. A custom database backend would have to be written so that data is looked up in the cache first and, if it is not found, downloaded from an appropriate source. Alternatively, the custom backend would only look in the cache, and a scheduler like cron or Celery would populate the cache at regular intervals.
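The "look in the cache, otherwise download" idea can be sketched without committing to a particular backend. In the sketch below, TimedCache is a hypothetical in-memory stand-in for something like django-redis, and get_package_info and its fetch argument are illustrative names, not part of any existing PTS code.

```python
import time


class TimedCache:
    """In-memory stand-in for a cache backend with per-key timeouts."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (expiry time, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            # Missing or expired: drop any stale entry and report a miss.
            self._entries.pop(key, None)
            return None
        return entry[1]

    def set(self, key, value, timeout):
        self._entries[key] = (self._clock() + timeout, value)


def get_package_info(cache, package, fetch, timeout):
    """Return cached data for a package, downloading it on a miss.

    `fetch` is a hypothetical callable that downloads the data from
    its source; `timeout` is how long (in seconds) the result stays valid.
    """
    key = f"pkg:{package}"
    info = cache.get(key)
    if info is None:
        info = fetch(package)
        cache.set(key, info, timeout)
    return info
```

With the scheduler-based alternative, get_package_info would shrink to the cache lookup alone, and the fetch-and-set step would move into the periodic job.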
- There are a few reasons why I want to try as much as possible to use Django models to access the data, even though writing a custom database backend could be challenging:
- It's more in the spirit of the MVC paradigm
- It abstracts the worries of data retrieval away from someone who might wish to add features to other parts of the PTS
- API frameworks like Tastypie (http://tastypieapi.org/) can expose models as resources with very little effort and very elegantly
- There is a project called django-rdf that can produce RDF/XML data from any Django model
Creating a REST API
I want to get the REST API specifications done early because the Debian Android app GSoC idea (http://wiki.debian.org/SummerOfCode2013/Projects#Debian_Android_Application) plans on displaying information from the PTS. Since the current SOAP API is unlikely to work after the rewrite, the new REST API would have to be used instead. The app might also provide the ability for users to log in to the PTS and manage their subscriptions so it might be necessary to add the ability for users to log in and manage subscriptions using the API.
Tastypie (http://tastypieapi.org/) could be used to develop the API.
Creating a web interface where data can be presented in HTML and RDF/XML and/or Turtle
- Currently the data downloaded by the PTS is stored in XML files. XSLT is used to generate HTML pages from this XML. In this part of the project, the objective is to implement a similar web interface using Django templates.
I don't have any experience with RDF, but a quick Google search turned up django-rdf (http://code.google.com/p/django-rdf/) which claims to produce RDF/XML data from any Django model. There were a few complaints of performance issues that people have had with django-rdf, so I'll have to look into that first. If it isn't possible for some reason to use django-rdf, I'll look into how to generate RDF/XML or Turtle content from package information.
- Views will decide whether to serve HTML or RDF using the Accept header of the HTTP request.
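A minimal sketch of how that choice could be made, assuming the view receives the raw Accept header as a string. The function names and the list of supported media types are assumptions for illustration, and the parsing is deliberately simplified (for example, partial wildcards such as text/* are not handled).

```python
# Media types the web interface might serve; an assumption for this sketch.
SUPPORTED = ("text/html", "application/rdf+xml", "text/turtle")


def parse_accept(header):
    """Return the media types in an Accept header, most preferred first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by order of appearance.
            ranked.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def choose_format(accept_header, supported=SUPPORTED, default="text/html"):
    """Pick the response media type for a request."""
    for media_type in parse_accept(accept_header or ""):
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return default
    return default
```

A view would call choose_format with the request's Accept header and then render either the HTML template or the RDF serialisation accordingly.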
Creating a portal for subscribers to manage subscriptions
- This is a relatively straightforward task. It's worth noting that the models for subscribers will need a different database backend than the models which access package information, so Django will have to be configured to use multiple databases. Even so, the task is fairly simple.
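Django supports this through database routers, which are plain classes with a few well-known methods. The sketch below follows the router method names in the current Django documentation; the app label "subscriptions" and the database alias "subscribers" are hypothetical, and fake_model exists only so the router can be tried without a configured Django project.

```python
from types import SimpleNamespace


class SubscriptionRouter:
    """Send models from the subscription app to their own database."""

    app_label = "subscriptions"  # hypothetical Django app name
    db_alias = "subscribers"     # hypothetical alias in DATABASES

    def db_for_read(self, model, **hints):
        if model._meta.app_label == self.app_label:
            return self.db_alias
        return None  # no opinion; let Django fall through

    def db_for_write(self, model, **hints):
        if model._meta.app_label == self.app_label:
            return self.db_alias
        return None

    def allow_relation(self, obj1, obj2, **hints):
        # Relations are only allowed when both objects live on the same side.
        labels = {obj1._meta.app_label, obj2._meta.app_label}
        if self.app_label in labels:
            return len(labels) == 1
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if app_label == self.app_label:
            return db == self.db_alias
        if db == self.db_alias:
            return False  # keep other apps' tables out of this database
        return None


def fake_model(app_label):
    """Stand-in for a model, exposing only the attribute the router reads."""
    return SimpleNamespace(_meta=SimpleNamespace(app_label=app_label))
```

In a real project the router would be listed in the DATABASE_ROUTERS setting, with the subscriber database defined under its alias in DATABASES.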
Comparing the performance of the rewritten PTS to the old one along with some scalability analysis
One concern that some of the developers on the debian-qa mailing list had was that the new, somewhat more dynamic PTS would perform poorly or use too much computational power compared to the current PTS (http://lists.debian.org/debian-qa/2013/04/msg00065.html). In the final part of the project, the performance and scalability of the new implementation would be benchmarked against the old one.
Implementing caching at various levels in the application
- There are many caching options worth looking into:
- Caching parts of templates using the cache template tag
- Using Django cache middleware to cache responses returned by views
- Caching database queries using johnny-cache (http://pythonhosted.org/johnny-cache/)
- Using a package like django-staticgenerator (https://github.com/timetric/django-staticgenerator) to generate static HTML files
- Using a caching HTTP reverse proxy like Varnish (https://www.varnish-cache.org/)
Other miscellaneous points
- There are a few more things I'm planning to take into consideration while working on this project:
- I'll have a deployable project ready by the end of every week so that the mentors and other members of the community can give feedback and validate the progress that has been made. I'll make the project available as a Debian package to make it easier to deploy and manage dependencies.
- Some discussion with the Debian community and most probably the django-users community would be useful in figuring out what approach is most suitable for the data retrieval part. Since one of the goals of this project is to make it easier for interested third parties like Debian derivatives to set up their own instance of the PTS, some general way of specifying a data source would be useful. The approach I suggested above, using a custom database backend, would probably make things a bit easier in this regard, but it could prove worthwhile to build a mechanism for specifying data sources that the database backend could pull data from. This would also allow us to specify a custom expiry time for each data source, set according to how often that data source is updated.
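One possible shape for such a mechanism is a small registry of data source descriptions, each carrying its own expiry time. This is only a sketch under assumptions: DataSource, SourceRegistry and the field names are hypothetical, and a derivative would register its own sources in place of Debian's.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DataSource:
    """Description of one place package data can be pulled from."""

    name: str
    fetch: Callable[[str], dict]  # downloads data for one package
    expiry_seconds: int           # how long fetched data stays fresh


class SourceRegistry:
    def __init__(self):
        self._sources: Dict[str, DataSource] = {}

    def register(self, source: DataSource) -> None:
        if source.name in self._sources:
            raise ValueError(f"duplicate data source: {source.name}")
        self._sources[source.name] = source

    def stale_sources(self, last_fetched: Dict[str, float], now: float) -> List[str]:
        """Names of sources never fetched or older than their expiry time."""
        stale = []
        for name, source in self._sources.items():
            fetched_at = last_fetched.get(name)
            if fetched_at is None or now - fetched_at >= source.expiry_seconds:
                stale.append(name)
        return stale
```

A periodic job could call stale_sources and refresh only what has expired, so frequently changing data such as bug reports is refreshed often while slow-moving data is left alone.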
Synopsis: Debian's Package Tracking System (PTS) lets developers view a variety of interesting and useful information about packages available in Debian. Currently the PTS is a mix of Perl, Python (CGI) and shell scripts. This project aims to rewrite the PTS using the Django web framework and Python. The key goals are to make the web application easy to deploy; to make the PTS more dynamic, with some of the data being downloaded on demand or more frequently than before; and to make the new PTS easier to hack on by using technologies that are popular nowadays, all while using as little computing power as possible (by caching, intelligently deciding what data needs to be downloaded more frequently, etc.).
Benefits to Debian
- The PTS is a pretty crucial part of Debian's infrastructure. Any improvements to it would benefit Debian greatly.
- Using newer web technologies like Django might attract more interest from developers who wish to improve or add features to the PTS.
- Stefano mentioned (http://lists.debian.org/debian-qa/2013/04/msg00059.html) that the static nature of the PTS had led to many bogus bug reports and much confusion, since the data was updated only four times a day. Making the PTS (or at least some parts of it) dynamic would contribute to solving this problem.
- REST web services are getting more and more common and often replace SOAP, especially in the mobile world. It is likely that the developer of the Debian Android application (another GSoC project this year) will find working with a REST API much more convenient than working with the current SOAP interface (which is in alpha anyway).
- Porting the PTS email interface, currently written in Perl, to Python.
- A REST API to access package information.
- A web interface where data can be presented in HTML (for humans) and RDF/XML and/or Turtle (for machines) depending on the Accept header.
- A portal where subscribers can register and add/delete subscriptions to packages/tags, change some settings, etc.
- I have elaborated on most of these points in the project description, so I'm not going to repeat them here.
May 27 - June 16: (Community bonding period)
- Discuss with the mentors and the community about the project
- Read through the current codebase, especially update_incoming.sh (I've gone through almost everything else already)
- Make a high level design of how the new PTS would work
- Work on a deployment strategy, set up the development environment, set up a blank project and deploy this project to make sure the system works.
June 17 - June 30 (2 Weeks):
- Retrieving package information from a variety of sources and making it accessible via Django models
July 1 - July 7 (1 Week):
- Creating a REST API to access package information using Tastypie
July 8 - July 14 (1 Week):
- Porting the PTS email interface
July 15 - July 28 (2 Weeks):
- Creating the web interface
July 29 - August 4 (1 Week):
- Creating the portal for subscribers to manage subscriptions
August 5 - August 11 (1 Week):
- Researching, experimenting with and discussing possible caching solutions with the community.
August 12 - August 25 (2 Weeks):
- Benchmarking the performance of the new PTS implementation against the old one, and analysing the scalability of the new PTS.
- Load testing the new PTS
August 26 - September 8 (2 Weeks):
- Testing in a production environment
- Fixing bugs
- Getting feedback from mentors and the community
September 9 - September 15 (1 Week):
- Deciding on a deployment stack: Apache with mod_wsgi, nginx with Gunicorn, or something else
- Writing documentation on deployment
- Trying to ease future deployment and development by using different settings files for production and testing, virtualenvs, requirements.txt, etc.
September 16 - September 23 (1 Week):
- Further testing, fixing bugs, documentation and community feedback
- Making a Debian package of the PTS to make it easier for Debian derivatives and other interested parties to set up the PTS
- Making sure code conforms to best practices
Exams and other commitments:
- I don't have any exams or other commitments for the summer. My vacations end some time in August, but this shouldn't be a problem since the next semester is my final semester - I have no commitments except two courses to complete. My exams will be in October, long after the SoC is over.
Other summer plans:
I've been an Ubuntu user for around two years now. I'd always known Ubuntu was based on Debian, but I'd never really made the effort to check it out. It was only when I saw Debian in the SoC mentoring organizations list that I decided to take a look. I find the Debian social contract appealing. I'm also fascinated by how such a huge project is managed completely by volunteers with no commercial backing at all. Being part of a community where people contribute purely out of their own interest is something I'm looking forward to.
- Are you applying for other projects in SoC?
- I have also applied for SDL.