Differences between revisions 50 and 51
Revision 50 as of 2013-05-14 09:16:53
Size: 14651
Editor: ?ManishGill
Comment:
Revision 51 as of 2013-06-23 05:47:06
Size: 14632
Editor: ?ManishGill
Comment:
  • Name: Manish Gill

  • Contact/Email: <redacted> (Personal/Melange), <redacted> (Mailing Lists). IRC: <redacted> @ Freenode and OFTC.

  • Background: I am a final year Computer Science student at Guru Gobind Singh Indraprastha University, New Delhi, India. So far, my experience has been in web development, but I am also fascinated by (and motivated to learn) various other things like Compiler Theory, AI, Machine Learning, NLP etc.

    • Programming Experience

      • I am most comfortable with Python, but I am also familiar with C, Ruby, Scheme/Racket and JavaScript.
      • I have been programming in Python for more than a year. Last year I also spent 7 months as an intern at a startup, SubtleDisruption, where I worked mostly with Python and JavaScript.

      • The applications I worked on as an intern are:
        • 1. Batch.me - An app to queue and publish tweets.

          • Uses Flask, Celery, web2py DAL over PostgreSQL.
          • OAuth integration, using a custom built OAuth client wrapper.

          2. Sportschimp - A single-page HTML5 mobile app to pick and choose sports events among Facebook friends.

          • Uses Flask to serve the app itself, which consumes a REST API.
          • Node.js/Express.js for the API, which interacts with a Redis database and serves JSON responses.
          • Google App Engine, which runs a crawler, periodically updating the Redis database.
      • I am also fairly familiar with version control systems such as Git, Bazaar and Mercurial.
      • My Github and Bitbucket profiles.

    • What makes me the best person to work on this project?

      • Having worked with various loosely coupled web applications in the past, I am familiar with the type of work that might go on in creating a Django web application.
      • I have experience building applications with Flask and Django.
      • I have been reading the PTS source code, and I think I understand what rewriting it would entail.
      • I worked on migrating a REST API from Node.js to Flask, so I have experience in rewriting code in a different language/framework.
      • I've been going through tutorials and understand the basics of Package management in Debian well.
  • Project title: PTS Rewrite in Django

  • Project details:

    • Current state of PTS

      • PTS (the Debian Package Tracking System) is currently a mix of Perl, Python and Bash scripts that work together to serve Debian package information on the web. This polyglot mashup of technologies makes the PTS harder to maintain, extend with new features, and hack on in general. This project aims to rewrite the PTS using the Django web framework.
      • In the current version of PTS, update_incoming.sh downloads the package information, which is then processed into XML files. XSLT then transforms these files into the HTML pages that are served live.
    • New PTS Architecture: The following is a high-level view of what the new PTS architecture would entail.

      • Dynamic Web Interface: Instead of the XML/XSLT pipeline that produces static pages, use the Django templating engine to render HTML dynamically. Although Django is not designed to produce static HTML pages, static generators are available for it, which might be an interesting option to explore.

      • UDD Integration: Currently, the information that the PTS fetches comes from multiple sources, including UDD.

        • UDD is just a Postgres instance, which generates a database dump every 2 days (which isn't exactly real-time).
        • That dump can be imported locally into the PTS to display the package information.
        • Alternatively, accessing and monitoring UDD directly can be an option. Either way, the UDD schema will most likely have to be mapped to Django models in order to use it.
        • UDD integration will have to be discussed more thoroughly with the mentors during the summer.
      • Live data monitoring (or as close as we can get to live): The data should always be recent, and refreshes of the database should be fast. This can be done in two ways: with a queuing mechanism that periodically polls the data sources and keeps the data fresh, or on demand, whereby a particular view handler fetches the latest information from the database and caches it for use in subsequent requests.
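The on-demand option described above can be sketched in plain Python. This is only an illustration under stated assumptions: `OnDemandCache`, its `fetch` callback and the time-to-live parameter are hypothetical names, and in the real application Django's caching would most likely play this role.

```python
import time
from typing import Callable, Dict, Tuple


class OnDemandCache:
    """Fetch a value on first request, then reuse it until it expires."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # hypothetical function that queries the data source
        self._ttl = ttl_seconds    # how long a cached entry counts as fresh
        self._clock = clock        # injectable so the expiry logic can be tested
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, package: str) -> dict:
        now = self._clock()
        cached = self._entries.get(package)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]               # still fresh: serve from the cache
        value = self._fetch(package)       # stale or missing: go back to the source
        self._entries[package] = (now, value)
        return value
```

A view handler would call `cache.get(name)` on each request, so the data source is only queried again once the entry is older than the chosen time-to-live.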

      • Email subscription shall remain more or less the same.

        • Basic subscribe/unsubscribe functionality, and "summary" emails, which summarize the information from a given time period and send it to the subscribers (weekly/monthly/quarterly). This will be a good way to get a high-level view of the evolution of the package.
        • Email can be queued using Django-Mailer instead of using the MTA directly.

        • Group Subscription: As mentioned in the Wishlist, this feature would let users subscribe to all packages belonging to a "group" (like Gnome, Python, etc). Using tags and/or keywords to subscribe to all associated packages sounds like a good option. A single confirmation message for multiple bug subscriptions is also desirable, instead of something like this

        • Raphael made me aware of the DDPO by mail initiative, an automated email that sends out package information to developers. Something in that vein, with a script that generates statistics and then summarizes them, is likely to interest many PTS users and should definitely be a feature of the subscription system. I've actually written similar scripts before. :)
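As a rough sketch of the "summary" email idea above, the following builds a per-package digest for a time period using only the Python standard library. The event tuple shape, the function name `build_summary` and the sender address are assumptions made for illustration; actual delivery (for example through a mail queue) is left out.

```python
from collections import Counter
from datetime import datetime, timedelta
from email.message import EmailMessage
from typing import Iterable, Tuple

Event = Tuple[datetime, str, str]  # (timestamp, package, kind): a hypothetical shape


def build_summary(events: Iterable[Event], recipient: str, package: str,
                  period_end: datetime, days: int = 7) -> EmailMessage:
    """Summarise the events for one package over the last `days` days."""
    period_start = period_end - timedelta(days=days)
    counts = Counter(
        kind for when, pkg, kind in events
        if pkg == package and period_start <= when < period_end
    )
    lines = [f"{kind}: {count}" for kind, count in sorted(counts.items())]
    msg = EmailMessage()
    msg["To"] = recipient
    msg["From"] = "pts@example.org"  # placeholder sender address
    msg["Subject"] = (f"[PTS summary] {package}: "
                      f"{sum(counts.values())} events in {days} days")
    msg.set_content("\n".join(lines) or "No activity in this period.")
    return msg
```

Changing `days` to 30 or 90 would give the monthly and quarterly variants mentioned above.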

      • Caching: Caching will be used at various levels in the application.

        • Django's caching will be the primary mechanism. This includes caching of the templates and view responses.
        • One option I think would be interesting is to use Redis as memcached-on-steroids for the database lookups. This can be done as some sort of Python middleware.
        • Performance and usage analysis, to see which parts are used more frequently than others, will also play a big part in deciding how caching is done.
        • That will include writing scripts, or using drop-in Django packages, that focus on performance analysis.
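One way the usage analysis above might look is a small decorator that records how often each handler runs and how long it takes, so that frequently used and expensive handlers can be picked as caching candidates. This is a minimal sketch; `profiled`, `timings` and `caching_candidates` are hypothetical names, not part of Django or any existing package.

```python
import time
from collections import defaultdict
from functools import wraps
from typing import Callable, Dict, List

# Maps a handler name to the list of observed durations, in seconds.
timings: Dict[str, List[float]] = defaultdict(list)


def profiled(func: Callable) -> Callable:
    """Record how often a handler runs and how long each call takes."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper


def caching_candidates(min_calls: int) -> List[str]:
    """Handlers called at least `min_calls` times, largest total time first."""
    busy = {name: sum(d) for name, d in timings.items() if len(d) >= min_calls}
    return sorted(busy, key=busy.get, reverse=True)
```

Decorating view functions with `@profiled` and inspecting `caching_candidates(...)` after some real traffic would give a first, data-driven list of what to cache.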
      • RSS: Currently, the PTS exposes news feeds via RSS. This is an important feature and will be migrated to Django as well. According to the Wishlist, the current RSS interface exposes nothing besides news. That should be changed to include other information relevant to packages as well.
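To illustrate a feed that carries more than news, here is a minimal sketch that builds an RSS 2.0 document with the standard library and tags each item with a category. The function name, the item tuple shape, the category values and the example.org URL are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def build_feed(package: str, items: Iterable[Tuple[str, str, str]]) -> str:
    """Return an RSS 2.0 document; each item is (title, link, category)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"PTS updates for {package}"
    ET.SubElement(channel, "link").text = f"https://example.org/{package}"  # placeholder URL
    ET.SubElement(channel, "description").text = f"News and other events for {package}"
    for title, link, category in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        # The category lets readers filter, e.g. "news", "bug" or "upload".
        ET.SubElement(item, "category").text = category
    return ET.tostring(rss, encoding="unicode")
```

Feed readers that only want news could then filter on the category, while other clients see every kind of event.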

      • REST API: Currently, the PTS exposes a SOAP interface, which few people use.

        • Instead of rewriting the PTS with the older SOAP functionality, I would like to create a JSON-based REST API which can be consumed by clients.
        • Django has various frameworks that work in conjunction with Django models to expose an API, like Django-rest-framework and Tastypie (I prefer the latter, and have some experience with it).
        • Potential clients of the REST API include the Debian Android Application, which is another GSoC project.
        • I would discuss with the mentors which approach might be the best way to go.
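The shape of such a JSON-based API can be sketched without any framework. The example below is a plain WSGI application, not Tastypie or Django-rest-framework; the URL layout `/api/packages/<name>`, the `PACKAGES` stand-in data and the field names are hypothetical.

```python
import json
from typing import Callable, Dict, Iterable

# Stand-in data store with made-up values; the real app would query its database.
PACKAGES: Dict[str, dict] = {
    "hello": {"name": "hello", "version": "1.0-1", "maintainer": "Example Maintainer"},
}


def api_app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Minimal WSGI app serving GET /api/packages/<name> as JSON."""
    parts = environ.get("PATH_INFO", "").strip("/").split("/")
    if environ.get("REQUEST_METHOD") != "GET":
        status, payload = "405 Method Not Allowed", {"error": "only GET is supported"}
    elif len(parts) == 3 and parts[:2] == ["api", "packages"] and parts[2] in PACKAGES:
        status, payload = "200 OK", PACKAGES[parts[2]]
    else:
        status, payload = "404 Not Found", {"error": "no such resource"}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]
```

For a quick local trial, the app can be served with `wsgiref.simple_server.make_server("", 8000, api_app).serve_forever()`; a framework would add model-backed resources, pagination and authentication on top of this basic request/response shape.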
      • Expose package RDF metadata: Package metadata is exposed as RDF/XML and Turtle. There are several Python libraries that might be used for this task. Django-rdf seems to be no longer maintained, so writing a thin ORM wrapper over other libraries, or porting Django-rdf to the latest stable version of Django, might be in order. One interesting link I encountered was JSON serialization of RDF.

  • Synopsis: The Debian Package Tracking System is currently written using Python, Perl and Bash scripts, which work together to periodically pull package data from the web, serve it on the web using CGI/Perl, and send it out to subscribers' email addresses. Currently, this setup uses a cronjob to refresh the data, so the data may already be stale by the time it is viewed. This project will use modern web technologies, like the Django web framework, to rewrite the Debian PTS. The new PTS aims to be more dynamic, provide live data monitoring, and handle scalability issues through extensive use of caching. This new version of PTS will thus be much more loosely coupled, extensible, and dynamic.

  • Benefits to Debian

    1. Allow PTS to update its information as soon as new data becomes available, which is much better than the current situation. Currently, the information shown can differ from the actual state, which may have changed since the last refresh.
    2. A homogeneous codebase instead of a polyglot mashup is much easier to hack on.
    3. A REST API, which other Debian projects might find interesting.
    4. According to a few members of the community that I've interacted with so far, not many people have worked on the PTS or shown interest in it. Hopefully, this will change with this project. I would love to be involved in the maintenance and further development of the project even after the summer.
  • Deliverables: A Django implementation of PTS, which serves the package information at packages.qa.debian.org. This app will have:

    1. A package tracking system that gets updated regularly and provides a real-time (or as close as we can get) feed to the PTS app.
    2. An email subscription system integrated within PTS.
    3. Caching of various dynamic parts of PTS, wherever it is required.
    4. Possible integration with various Debian infrastructure tools, like UDD and debtags.
    5. Documentation and Tests for as much of the app as possible. This will be done in conjunction with writing the code.
  • Project schedule: The following is a tentative schedule. I will try to keep things as close to the timeline as possible, but depending on the various design decisions that the mentors take, this might change.

    • Major Milestones:

      • Basic app that allows viewing package information.
      • Email Integration.
      • UDD integration.
      • Caching Implementation.
      • Integration with other Debian infrastructure.
      • Deployment
    • Timeline:

      • May 27 - June 16:

        • Community Bonding Period.
        • Familiarize myself with all the relevant Debian infrastructure that will be used in the Project.
        • Start discussions with the mentors. Discuss various high-level design decisions, and roadmap for the project.
        • Design the Schema for the app.
        • Begin mapping out the initial Django application. Start making models and prototype views.
      • (Week 1 and 2) June 17 - June 30:

        • Since the rewrite of dispatch.pl was part of the initial proposal, continue work on Email subscription system.
        • This includes writing a system that allows for:
          1. Subscribing/Unsubscribing - based on keywords chosen by user.
          2. Add summary package statistics, as explained above.
          3. Bounce handling.
        • Rewrites of bounce_handler.pl, dump-bounces.pl.
        • Tests for the scripts as well.
      • (Week 3 and 4) July 1 - July 14:

        • Work on scripts to pull the packaging information from the various database sources. This is essentially a rewrite of update_incoming.sh, with improvements.
        • Start working on UDD integration.
        • Evaluate a queue scheduling system like Celery, or the more lightweight pyres, as a possible replacement for cron.
        • Tests and documentation.
      • (Week 5 and 6) July 15 - July 28:

        • Work on the web interface.
        • Write views and templates for displaying basic package related information.
        • The functionality of sources_to_xml.py, excuses_to_xml.py and generate_html.py to be rewritten, making use of templates to generate HTML.
        • Write tests and documentation.
        • Mid term evaluation

      • (Week 7 and 8) July 29 - August 11:

        • Start working on caching implementation, discuss various possibilities with Mentors.
        • Write scripts to profile and analyse usage statistics for the app.
        • Deploy these scripts live and decide on various caching methodologies based on the resulting data.
        • Django-caching for the app itself, and any other techniques that can be leveraged to cache the data.
        • Performance benchmarking.
      • (Week 9 and 10) August 12 - August 25:

        • The basic app should be done by this point.
        • Deploy on Debian infrastructure, get feedback from community.
        • Write down the REST API specification - mapping of URL resources, allowed HTTP methods etc.
        • Start integrating Tastypie. This should be easy as long as the schema stays within the constraints of Django's ORM.
        • Scripts to expose RDF metadata with possible porting of Django-RDF.
      • (Week 11 and 12) August 26 - September 9:

        • API integration should be finished by this point.
        • Maybe work on additional features like RSS integration, debtags, etc.
        • Write tests and documentation for work done.
      • (Week 13 and 14) September 10 - September 22:

        • Final weeks. Full system deployment and testing.
        • Feedback from community, bug fixes.
        • Final evaluation of GSoC.

  • Exams and other commitments: My end term exams will most likely be in early June (2 exams in the first week, the third in the middle of the month). I won't be taking more than 7 days off for preparation + exam.

  • Other summer plans: No plans, I am available to work with Debian full time during the summer. :)

  • Why Debian?: I've been using Debian or Debian-based operating systems for over 2 years. Debian is also the friendliest open source community that I've interacted with. There is a rich environment here for anyone who wants to contribute to the open source community. I've been passionate about that, but just hadn't found the right platform/community before now. :)

  • Are you applying for other projects in SoC? Yes, but I prefer to work with Debian.