I did my Bachelor of Computer Science in 2011 and am now working on a Master's program in Computer Science, both at Jacobs University Bremen. I have a liking for developing tools that process big datasets (as weird as this sounds). For example, I worked on converting Wikipedia dumps to offline archives, combining OpenStreetMap data with Wikipedia location information, and processing timetable information for Deutsche Bahn.

Academically, my Bachelor thesis was about analysing network flow traces, and now in my Master's program I'm writing algorithms to develop approximate solutions for the 3D orthogonal knapsack problem, as well as integrating RPL and SNMP on Contiki OS.

Debian-related, I have been building root filesystem images for an open mobile phone framework and have also automated its Debian package cross-building process for armel on i386. Most recently I wrote a tool that automatically builds a Debian rootfs for foreign platforms using multistrap without the need for superuser privileges. Given my liking for exactly this kind of problem, plus my experience with Debian packaging, I think I can handle this task well.

This GSoC project aims to develop a tool for planning the build order when bootstrapping Debian for a new architecture. The tool will analyze package information in the archive and output build instructions that can then be read by another tool doing the actual compilation (like buildd). This project is important for several reasons. Firstly, bootstrapping Debian for a new architecture is not uncommon and has already happened 17 times so far. With this tool (and other efforts), building Debian from scratch can become automated, repeatable and deterministic, and would no longer be as painful and manual as it is today. The project could also help to update a lagging architecture (m68k) or to build architectures that are not self-hosting for performance reasons (avr32). One can also use it to make sub-arch builds that are more optimized for a specific CPU (flavored builds, such as are currently planned for the Raspberry Pi). On top of that, in my personal opinion, "the universal operating system" should be able to make a new port without the help of another distribution.

The tool will not deal with the problems of cross-compiling Debian packages (but this one will) but with the problem of cyclic build dependencies. Using the results of the 'Bootstrappable Debian' GSoC project (and synthetic package files until first results are available), this tool will be able to detect cyclic build dependencies and 'break' them using the staged-build information the other project supplies. It will output an ordering in which the problematic package is first built in its reduced form, then the packages that depend on it are built, then the problematic package is built in its full form, and finally the packages that depend on it are rebuilt to avoid incomplete functionality.
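
As a minimal sketch (with made-up package names), this is the kind of ordering the tool would emit for a two-package cycle in which one package supports a staged build:

    def staged_build_order(cycle, staged_pkg):
        """Build order for a cycle broken by a staged build of staged_pkg."""
        others = [p for p in cycle if p != staged_pkg]
        order = [(staged_pkg, "stage1")]              # reduced build breaks the cycle
        order += [(p, "full") for p in others]        # dependents can now be built
        order += [(staged_pkg, "full")]               # rebuild with full functionality
        order += [(p, "rebuild") for p in others]     # rebuild against the full package
        return order

    print(staged_build_order(["src:foo", "src:bar"], "src:foo"))
    # [('src:foo', 'stage1'), ('src:bar', 'full'), ('src:foo', 'full'), ('src:bar', 'rebuild')]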

Staged build information allows the removal of edges in the graph of build dependencies so that cycles disappear and the graph becomes a directed acyclic graph. Finding a (preferably minimum) set of edges whose removal turns a graph into a DAG is known as the minimum feedback arc set problem. Pietro Abate (co-mentor of this project) has already done substantial work on implementing algorithms tackling this issue. The algorithm proposed by D. B. Johnson enumerates all elementary cycles with a time bound that is linear per cycle and promises an improvement over the algorithms proposed by R. Tarjan and J. C. Tiernan. Having found an algorithm that performs well on our dependency data, one would then have to find a good heuristic for choosing which package in a cycle should be stage-built to break that cycle. Another way of approaching the issue would be to directly search for a minimum feedback arc set of a weighted graph, as proposed in a paper by C. Demetrescu and I. Finocchi; since the problem is NP-complete, they give an approximation algorithm.
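
To illustrate (this is not the actual dose3/OCaml implementation), the following Python sketch uses networkx, whose simple_cycles() implements Johnson's algorithm, to enumerate cycles in a toy build-dependency graph and to check that removing one edge, as a staged build would, yields a valid build order; the package names and the chosen edge are made up:

    import networkx as nx

    g = nx.DiGraph()
    # an edge u -> v means: u must be built before v (v build-depends on u)
    g.add_edges_from([
        ("src:gdbm", "src:perl"),
        ("src:perl", "src:gdbm"),
        ("src:perl", "src:gcc"),
    ])

    for cycle in nx.simple_cycles(g):          # Johnson's algorithm
        print("cycle:", cycle)

    # pretend a heuristic decided that a staged build of src:gdbm may drop
    # its build dependency on src:perl
    g.remove_edge("src:perl", "src:gdbm")
    print("build order:", list(nx.topological_sort(g)))
    # build order: ['src:gdbm', 'src:perl', 'src:gcc']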

Another problem that will be solved by this tool is finding out when to switch over from cross-building to native building. In other words: what is the minimal set of packages that needs to be present to build the rest natively? This information is useful because making a package cross-compile properly (and ensuring cross compilation does not break when changes are made to the package) is a difficult task that package maintainers should preferably not have to worry about. To minimize the work and maintenance that has to go into making a set of packages cross-compile, that set should be kept as small as possible. Thus, this tool will also be able to figure out the minimal set of packages needed to build every package in the archive natively. If it turns out there is no easy/fast way to compute this set, then alternatively the tool could check whether a given set of packages is enough to build the rest. A naive first assumption would be that this minimal set consists of all packages marked as 'Essential', plus build-essential, plus all their install dependencies.
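
The naive assumption can at least be made concrete. The following rough sketch (using python-debian for parsing, with very crude handling of version constraints and alternatives, and a placeholder path for the Packages file) computes the install-dependency closure of all Essential: yes packages plus build-essential:

    import re
    from debian import deb822

    def dep_names(field):
        """Extract bare package names from a Depends-style field (crude)."""
        names = set()
        for clause in field.split(","):
            for alt in clause.split("|"):
                name = re.sub(r"\(.*?\)|\[.*?\]", "", alt).strip()
                if name:
                    names.add(name)
        return names

    packages = {}
    seeds = {"build-essential"}
    with open("Packages") as f:                       # placeholder path
        for pkg in deb822.Packages.iter_paragraphs(f):
            name = pkg["Package"]
            packages[name] = (dep_names(pkg.get("Pre-Depends", ""))
                              | dep_names(pkg.get("Depends", "")))
            if pkg.get("Essential", "") == "yes":
                seeds.add(name)

    # transitive closure of install dependencies starting from the seed set
    closure, todo = set(), list(seeds)
    while todo:
        name = todo.pop()
        if name in closure or name not in packages:
            continue
        closure.add(name)
        todo.extend(packages[name])

    print(len(closure), "packages in the naive minimal set")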

Another desired piece of functionality is to find all dependency cycles in the archive that cannot be broken due to missing (or outdated) staged-build information. Using the algorithm implemented by Pietro Abate and mentioned above, the tool will be able to check the whole archive and enumerate the cycles it finds. Using certain heuristics, the tool can suggest which packages are good candidates for modification and what could possibly be changed. These heuristics would work, for example, by identifying dependencies that are only needed for documentation generation, or by parsing the ./configure output and searching for options like --disable-libfoo that match an actual package name libfoo. Since package maintainers might break staged build dependencies during package updates, this tool can also be used to periodically check the archive for newly introduced cyclic build dependencies.
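
A toy version of the ./configure heuristic could look like the following sketch; the package list, the source tree and the assumption that the options appear in the --help output are placeholders, and real configure scripts will need more careful matching:

    import re
    import subprocess

    known_packages = {"libxml2", "libxml2-dev", "libssl-dev", "doxygen"}

    help_text = subprocess.run(["./configure", "--help"],
                               capture_output=True, text=True).stdout

    candidates = set()
    for flag in re.findall(r"--(?:disable|without)-([a-z0-9.+-]+)", help_text):
        # try a few naming patterns that might map the flag to a package name
        for name in (flag, "lib" + flag, flag + "-dev", "lib" + flag + "-dev"):
            if name in known_packages:
                candidates.add((flag, name))

    for flag, pkg in sorted(candidates):
        print(f"--disable/--without-{flag} might allow dropping build dependency {pkg}")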

This build-ordering tool will be used for both compilation phases: the initial cross compilation for the new architecture as well as the later native compilation, as there will be build dependency cycles that need to be broken in both cases. It is expected that there will be more build dependency cycles (i.e. the problem will be harder) in the native compilation case, because during cross compilation packages from the build host can be used (for example to create documentation).

The problem will be tackled using the dose3 software suite, as it is already able to read Debian package metadata and analyse graphs of package dependencies, and it will thus serve as the base of this GSoC project. Its machinery will be used to parse the Debian metadata and encode it into CUDF (the internal dose3 data format).


The graphic on the side shows how Pietro Abate has modeled the problem so far, and it will most likely continue to be modeled like that. The graph contains source packages and the dependency closures of binary packages as nodes. The dependency closure of a binary package is the set of its runtime dependencies. Since packages marked as 'Essential' are usually not listed as runtime dependencies (unless a specific version is required), the set of 'Essential' packages is implicitly part of every installation set. There are two types of edges. A dotted edge represents a binary package that is a build dependency of a source package: it points from the source package to the dependency closure of that binary package and means that these binary packages need to be installed for the build dependency to be available. In the example, the source package src:A has the build dependencies bin:1 and bin:2 and therefore requires the dependency closures of bin:1 and bin:2. A solid edge denotes a "belongs to" relationship and points from a dependency closure to the source packages that are needed to build one (or more) of the binary packages in it. In the example, the dependency closure of bin:1 is built by compiling the source packages src:B, src:C and src:D.
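
The model can be sketched in a few lines; here the figure's example is rebuilt as a directed graph with an edge attribute standing in for the dotted and solid edge types (networkx is used purely for illustration, the actual implementation lives in dose3):

    import networkx as nx

    g = nx.DiGraph()

    # dotted edges: source package -> dependency closure of a binary build dependency
    g.add_edge("src:A", "closure(bin:1)", kind="build-depends")
    g.add_edge("src:A", "closure(bin:2)", kind="build-depends")

    # solid edges: dependency closure -> source packages that build its binary packages
    for src in ("src:B", "src:C", "src:D"):
        g.add_edge("closure(bin:1)", src, kind="built-from")

    for u, v, data in g.edges(data=True):
        print(f"{u} -[{data['kind']}]-> {v}")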

This mixed representation of build dependencies and runtime dependencies makes it possible to approach two of the problems at hand: how to break build dependency cycles and how to figure out the set of packages that have to be cross-compiled. The latter can be approached by searching for cycles of runtime dependencies: a runtime dependency cycle can only be broken by cross-compiling one of the packages in the cycle. Which one that should be must be chosen by a heuristic or, alternatively, by human intervention. The problem of breaking build dependency cycles can be solved as described above, by reducing the direct build dependencies through staged builds.
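
For the runtime-dependency side, a minimal sketch (made-up packages and a deliberately trivial heuristic) of picking cross-compilation candidates could look like this:

    import networkx as nx

    runtime = nx.DiGraph()
    runtime.add_edges_from([
        ("bin:foo", "bin:bar"),   # foo depends on bar at runtime
        ("bin:bar", "bin:foo"),
        ("bin:baz", "bin:foo"),
    ])

    # every runtime cycle needs at least one cross-compiled package;
    # here we simply pick the member with the most reverse dependencies
    cross_compile = set()
    for cycle in nx.simple_cycles(runtime):
        pick = max(cycle, key=lambda p: runtime.in_degree(p))
        cross_compile.add(pick)

    print("cross-compile candidates:", cross_compile)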

This project would benefit a lot if the 'Bootstrappable Debian' project were also selected this year. If it is not, this project would not perform worse, as it can always be shown to work on synthetic package data; it would just delay the usefulness of this project until real-world data becomes available.

This project sets out into unexplored territory and tries to give answers to questions no one knows the answers to yet. It is not a mechanical code-writing project, and at many points there is no clear solution yet. The tools I will write will attempt to provide those answers, and those answers will determine how best to proceed with each of the subtasks.

The SCM will be git, hosted on gitorious.org, starting from Pietro Abate's codebase.