Name: Gustavo Prado Alkmim
Contact/Email: galkmim@gmail.com, alkmim on irc.debian.org
Background:
General: I graduated in Computer Science at the Federal University of Lavras (Ufla/Brazil) in which I worked with virtualization using Xen. I was a member of Tecnolovre, a cooperative specialized on open-source based solutions. (http://www.tecnolivre.com.br) (They have also a very incomplete English page: http://www.tecnolivre.com.br/?p[l]=en). Currently, I am a master student at State University of Campinas (Unicamp/Brazil), working with Future Internet. My interest areas are network, security and general Linux management.
Skills on Linux and programming: I used Linux since the very early of my graduation. I am one of the administrators of of the LRC (Network Computer Laboratory) at Computer Institute, also in Unicamp, in which we use just Linux systems. I have advanced skills on C++ and shell script and basic skills on Python and Java. If necessary, I have no problems in start on another programming language like Perl.
Experience with the proposed project: In the last two weeks I started to take a look on the code and think on solutions together with the mentor.
Project title: Apt ordering code improvements
Synopsis:
The objectives of this project can be summarized in five topics.
- Insert a function to verify the correctness of the ordered list (dependency checking).
- Add regression tests with various known configurations to test the new (and old) code
- Create other types of ordering according to various constraints.
- Divide the ordered list in independent lists that can be processed in parallel (probably use the idea of graphs).
- Create a way to parallelize the installation process. (using independent list created in step 2).
The objectives 4 and 5 are optional objectives. This means that they are going to be done if the time was sufficient because these improvements possibly are going to request big changes in other other classes of libapt. If the time was not sufficient to conclude all the objectives, after the GSoC period I will keep working on them.
Benefits to Debian:
The libapt is an important part of the Debian package manager. The ordering code is essencial do guarantees that packages are going to be installed in the right order.
- Considering this fact, a function the verifies the correctness of this ordering is very important (Objective 1). It will ensure that broken packages and conflicts are not going to appear "imprevisible" while the installation occurs.
- The objective 2 is very important to test the new code (and old) codes. Also, any new improvements on the apt ordering may be verified using these regression tests.
- Ordering the packages according to other constraints (Objective 3) are also an interesting improvement since it may improve the performance in special cases.
- The objectives 4 e 5 are going to increase a lot the performance of the installation/download of the packages since currently processors have up to 4 cores, and so the ability to make parallel downloads and installs is very interesting.
Deliverables:
- Correctness Verification function for the ordering process
- Set of regression tests
- Other ordering types based on different constraints
- Functions that optimize the big ordered chunks in small ordered chunks when possibly
- The possibility to parallelize the installation of process
- Documentation for all of the above items.
- Better code documentation of pieces of libapt
Project details
Small introduction
The libapt have three main classes related with the apt ordering process represented by the files:
- [orderlist.{h,cc}]: This is the class responsible to represent and manipulate the ordered list of packages. In this class all order related stuff are done
- [packagemanager.{h,cc}]: This is the apt package manager class. This class uses the orderlist class to have an ordered list and install the packages in the correct order.
- [dpkgpm.{h,cc}]: this is the low level class of the dpkg package manager. It adds all the sequence of operations in a list and later starts to execute the operations using this list. The pkgDPkgPM will connect the libapt with the dpkg.
It is important to observe that all the operations in the ordering process happen over the cache and after the process they are written on disk. First the cache is ordered, then dpkg is called in the right order to commit the changes to disk. The cache class is defined in the pkgcache.{h,cc} and depcache.{h,cc} files.
Some important definitions of the Debian policy:
- dependency: This declares an absolute dependency. A package will not be configured unless all of the packages listed in its Depends field have been correctly configured (unless there is a circular dependency)
- pre-dependency: This field is like Depends, except that it also forces dpkg to complete installation of the packages named before even starting the installation of the package which declares the pre-dependency.
- conflicts: When one binary package declares a conflict with another using a Conflicts field, dpkg will refuse to allow them to be unpacked on the system at the same time
Main Objectives and possible solutions
- All solutions/ideas shown here are just possibility. Possibly, while I get deep on the code better solutions or problems may appear.
- Insert a function to verify the correctness of the ordered list (dependency checking). This involves to create a new function and call it on the end of the order process. This function should verify if the generated list solves all dependencies. It needs to check that the ordering is correct, i.e. that packages that depend on other packages (or version) are unpacked before the other package is configured, that pre-depends are honored (i.e. packages with pre-depends are unpacked and configured before the package is unpacked). Above is shown a very very high level possibly implementation of this function: (of course this function may be not perfect)
1 bool VerifyFunction() { 2 for each package A from the ordered list { 3 for each package B that is a pre-dependency of A { 4 if unpack_B and configure_B isn't before configure_A in the order list return false 5 } 6 for each package B that is a dependency of A { 7 if unpack_B isn't before configure_A in the order list return false 8 } 9 if unpack_A isn't before configure_A in the order list return false 10 for each package B that is installed or in the order list { 11 if B conflicts with A return false 12 } 13 } 14 }
Line 9 tries to solve a bugreport (https://bugs.launchpad.net/ubuntu/+source/apt/+bug/291482).
- Add regression tests with various known configurations to test the new (and old) code. These tests should be choose in a way it will cover a wide variety of possibly errors.
- Create other types of order according to various constraint as "order-for-minimal amounts of dpkg invocations" and "order for minimal amount of broken packages at any point during the install".
- Divide the ordered list in independent lists that can be processed in parallel. One good approach is use graphs. The selected packages may be see as a directed graph. Each package is going to be a node. And the edges are going to be the dependences between these packages. The packages are going to be inserted in the graphs when it is a dependency of the requested package. In this way, we can create a DAG. This will eliminate the cycles in dependences (or try). After this, we put the DAG in a topological order. So, if we can create more the one DAG with the initial graph, we can parallelize them. This involves change the structure of the orderlist.h file (create a way to store the independent lists, graphs, etc.) or create another file like orderlistparallel.h. In the first approach, the code is simplified since it is everything on the same place. In the second approach the changes are well isolated from the orderlist.h and more free of bugs, because the user may choose to used one class or another and later it may be integrated in the orderlist.h.
- Create a way to parallelize the installation process. (using the independent lists created in step 4). This is not easy because this involves change other files of the libapt to call the right functions of the parallelized ordered lists.
Project schedule:
- April 08 - Submit proposal
April 25: 19:00 UTC
- - Accepted student proposals announced on the Google Summer of Code 2011 site. - Community Bonding Period: Students get to know mentors, read documentation, get up to speed to begin working on their projects.
- April 25 - May 9: Study the source code and the documentation of all the libapt.
- May 9 - May 16: Study the source code and the documentation of mainly files related with the apt ordering.
- May 16 - May 23: Study in deep the source code and the documentation of the function currently used on the apt ordering.
May 23:
- - Students begin coding for their GSoC projects; Google begins issuing initial student payments provided tax forms are on file and students are in good standing with their communities. - Interim Period: Mentors give students a helping hand and guidance on their projects.
- May 23 - Jun 20: Project what needs to be changed on the code to accomplish all of the objectives proposed
July 11:
- - Mentors and students can begin submitting mid-term evaluations.
- Jun 13 - Jun 20:
July 15: 19:00 UTC
- - Mid-term evaluations deadline; Google begins issuing mid-term student payments provided passing student survey is on file. - Interim Period: Mentors give students a helping hand and guidance on their projects.
- Jun 20 - Ago 15: Code everthing
August 15:
- - Suggested 'pencils down' date. Take a week to scrub code, write tests, improve documentation, etc.
- Ago 15 - Ago 22: Pencils down
August 22: 19:00 UTC
- - Firm 'pencils down' date. Mentors, students and organization administrators can begin submitting final evaluations to Google.
- Ago 22 - Ago 26: Pencils down
August 26: 19:00 UTC
- - Final evaluation deadline; Google begins issuing student and mentoring organization payments provided forms and evaluations are on file.
August 29:
- - Final results of GSoC 2011 announced
August 30:
- - Students can begin submitting required code samples to Google
October 22 & 23:
- - Mentor Summit at Google: Representatives from each successfully participating organization are invited to Google to greet, collaborate and code. Our mission for the weekend: make the program even better, have fun and make new friends.
Other summer plans: I plan to work on my master research, but a hope this will not be a problem since I have no classes and I'm going to conciliate both things. Many students at Unicamp have managed to complete GSoCs before with the same type of constraint, so that will not be a problem for me.
Exams and other commitments: I don't have any kind of exams. Possibly I may submit articles from my research of the university, but I will try to conciliate it and I hope this will not interfere on GSoC.
If you are not a Debian Developer: Do you have plans for Debian after the summer ?
I am not a Debian Developer yet, but I got very interested on it while I was studying the code and discussing ideas with the mentor. I discover that we may do very good things while working on our normal jobs (or university in my case). I really want to keep the work on the apt code after the GSoC even if my proposal were not choose, I will try to work on it.