This wiki page is intended to describe the issues of, and mechanisms for, bootstrapping a Debian rootfs image from sources.
There is a real need to bootstrap Debian from sources when doing new ports or flavours. Every new architecture or optimisation flavour needs to do this at least once, and making it easier than the current 'really very hard' would be great. It is also very useful for cross-compiling to new or non-self-hosted architectures, and for a genuinely new arch at least part of the system (toolchains+build-essential) has to be cross-built until there is enough to become self-hosting.
Recent new bootstraps have been done for sh4, armhf, uclibc and avr32. More are coming down the line. The subarch flavoured rebuilds (e.g. to optimise for a particular CPU) are particularly useful on ARM and MIPS architectures.
This is also helpful when bringing a lagged architecture up to date, especially considering documentation tools (that may be too old), optional dependencies (that may be too old or not exist) like php5, which depends on everything and the kitchen sink, etc. I wish this had been implemented when starting to work on m68k…
Currently people tend to use non-Debian tools (such as Yocto, Gentoo or OpenEmbedded) to get a basic rootfs image of the target arch/ABI, then do native building within that. This works but needs a great deal of manual loop-breaking, and we really ought to be able to bootstrap our own OS.
Putting the necessary bootstrapping metadata and build rules into the packages themselves in an orderly fashion enables the info to be maintained easily. QA tests to report on breakage will help enormously here. It also makes for a repeatable and deterministic process.
This work does need build-system and policy changes, which are detailed on this page.
An important principle is that the packaging changes necessary for this to work are reasonably clear and transparent. A Debian packager should not have to understand this stuff (staged builds and cross-building) in loving detail to avoid breaking things whilst making maintenance changes. Considering this principle helps when deciding between different technically-satisfactory ways of achieving things.
All the metadata needed should be part of the packages, so it can be maintained over time. Any solution with external patches/metadata is doomed to bitrot.
The concept is simple: add support for minimal/reduced/staged builds to packages involved in build-dependency loops, so that build loops are broken. Also ensure packages cross-build properly so that an initial native-building system can be produced.
Working out which packages to modify, and how, is a manual process, done by examining build-dep loops and choosing which packages are most easily and cleanly modified. Once that is done, building bootstrap-able packages is an automatable process.
This spec caters for multiple stages of staged/bootstrap build, so that if necessary a package can have stage-1, stage-2, etc before the final, normal, build. Almost all packages only need one stage other than the standard build. Only toolchain packages are known to have more than one stage at this time.
The reduced dependencies are specified in the control file, using normal Build-Depends: syntax, in new fields named Build-Depends-Stage1, Build-Depends-Stage2 etc.
An environment variable (DEB_BUILD_OPTIONS) is used to control when packages are built in reduced staged/bootstrap mode, and at what stage. debian/rules can check this variable and leave out some optional features to reduce the dependency tree (e.g. building kerberos without LDAP support). dpkg-buildpackage and dpkg-checkbuilddeps also check the reduced/changed build-dependencies instead of the normal ones.
So setting DEB_BUILD_OPTIONS=stage1 will cause dpkg-buildpackage to call dpkg-checkbuilddeps with --stage=1 so that Build-Depends-Stage1 dependencies are checked, rather than the normal set.
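As a rough illustration of the mapping just described, the following Python sketch extracts a stage number from a DEB_BUILD_OPTIONS string and turns it into the proposed --stage=N argument. The function names are hypothetical, the --stage flag is the one proposed in this spec rather than an option of released dpkg, and matching both "stage1" and "STAGE1" is an assumption made because this page uses both spellings.

```python
import re


def stage_from_build_options(value):
    """Return the stage number requested in a DEB_BUILD_OPTIONS string, or None.

    DEB_BUILD_OPTIONS is treated as a whitespace-separated list of options.
    """
    for option in value.split():
        match = re.fullmatch(r"stage(\d+)", option, flags=re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None


def checkbuilddeps_args(value):
    """Build the extra arguments for the proposed 'dpkg-checkbuilddeps --stage=N'."""
    stage = stage_from_build_options(value)
    if stage is None:
        return []  # normal build: check the ordinary Build-Depends
    return [f"--stage={stage}"]
```

With no stage option present the argument list is empty, so the normal Build-Depends set is checked.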
Bootstrapped/Staged packages should be marked as such (in version string or control file) and not uploaded to normal repositories. It is important to avoid accidentally mistaking a bootstrap/staged package for a 'real' (normally-built) package. As soon as possible a bootstrap package should be rebuilt as a full package, to avoid having to rebuild many packages against the full version once it is available. The mechanism for this is not fleshed out, but an extra control header seems the obvious thing to do. A version suffix may be useful too, mostly to help humans.
This process is sometimes called 'staged' builds as well as 'bootstrap' builds. Exact field names and variable names are a subject for bikeshedding on debian-devel. Whatever is most likely to be clear to developers and not clash with other purposes is best.
Proof-of-concept patches for packages that need to understand the new fields have been made. They are here: http://wookware.org/software/cyclicdeps/patches/
Bootstrapping is closely related to support for cross-building Debian packages because at least part of the process must be done cross. Enough packages to make a bootable image need to be cross-buildable, because you cannot magic a system out of thin air. To move from cross to native building you need build-essential to be cross-buildable.
The number of build-loops that must be broken for cross-building is much smaller than the number that need to be broken for native building. This spec proposes that we start by fixing the loops that mean you can't even cross-build a base Debian image before going on to fix all the packages which have native build-dep loops.
Debian/Ubuntu cross-building is documented here: https://wiki.linaro.org/CrossBuilding
Patched sources to make Ubuntu Maverick base packages cross-buildable are here: https://launchpad.net/~peter-pearse/+archive/cross-source
For cross-building to be reliable, cross-dependency metadata needs to be in packages, so that it is clear whether a build dependency should be satisfied by the build architecture or the host architecture. Multiarch information can be used to provide this, along with build-dependency decoration for the fairly rare exceptions. Details are specified here: https://wiki.ubuntu.com/MultiarchCross
The current state of buildability using that technology is recorded here: http://people.linaro.org/~wookey/buildd/
Until that metadata is in packages, heuristics must be used, as implemented in xdeb (and the now-deprecated apt-cross), or all dependencies must be installed for both host and native, as implemented in xapt. These are all ugly and horrid, but better than nothing.
The full automated bootstrapping process needs to keep track of staged/bootstrap builds and rebuilding things as needed so that staged/bootstrap builds don't hang around any longer than necessary. However any such tool could get out of sync with the current status, unless it is always determinable from the current package-set state. This spec attempts to define things such that it is always intrinsically stateful. Please speak up if you see ways that this isn't going to work.
It might be useful to append ~stageN+M to the package version automatically, where N is the stage number and M a continuously incremented (by the buildd) number. Or do binNMUs, which are already recognised by the package management system very well, and almost all packages are (supposedly) binNMU safe.
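A minimal sketch of the ~stageN+M suffix idea, assuming the suffix is simply appended to the full package version. Both helper names are hypothetical. One property worth noting is that in Debian version comparison a tilde sorts before everything, including the end of the string, so a version with this suffix sorts lower than the same version without it, and the normally-built package would supersede the staged one.

```python
import re

# Matches "<base>~stage<N>+<M>" as suggested in the text above.
_STAGED_VERSION = re.compile(r"^(?P<base>.+)~stage(?P<stage>\d+)\+(?P<count>\d+)$")


def staged_version(version, stage, build_count):
    """Append ~stageN+M to a version (N = stage, M = buildd rebuild counter)."""
    return f"{version}~stage{stage}+{build_count}"


def parse_staged_version(version):
    """Return (base_version, stage, build_count), or None for a normal version."""
    match = _STAGED_VERSION.match(version)
    if match is None:
        return None
    return match.group("base"), int(match.group("stage")), int(match.group("count"))
```

A tool could use parse_staged_version to spot staged packages in a package set without keeping any separate state.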
The toolchain has a complex 2- or 3-stage bootstrapping process involving binutils, gcc, libc and kernel-headers. It has been fixed up (in the Ubuntu Maverick packaging onwards) to bootstrap itself. This has currently only been demonstrated on armel. Once tested/extended to other architectures it can be uploaded to Debian. This work is ongoing, by Marcin Juszkiewicz.
It already uses the DEB_STAGE variable name internally to control the build.
The cross-toolchain has also had 'flavoured builds' added so that it is easy to rebuild the toolchain locally for a different default CPU/ISA/optimisation unit (e.g. with/without VFP, or for the v5/v7 instruction set on ARM).
Circular dependencies/staged builds
The main issue is circular build-dependencies. These fall into three main areas:
- Most languages depend on themselves to build (gcc, openjdk, mono, haskell, perl, python, ada(gnat)).
- Libraries sometimes circularly depend:
  kerberos -> ldap -> kerberos
  qt -> poppler -> cups -> qt
- Documentation packages. Many packages need documentation tools (sgmltools, jade, tex, doxygen) which cannot be built until many other packages are built. This is largely only a problem for native-builds, as the doc-tools are generally available when cross-building.
The generic way to deal with all of these is 'staged builds', where a version of the package is built with lesser functionality and thus a smaller dependency tree. This allows the depending package to then be built, then for the 'staged' package to be built normally.
This could be controlled by a tool that keeps track of which packages have currently been built as 'staged' packages and thus need rebuilding, but if we can correctly encode things in dependencies then this process can be made automatic and intrinsic. Exactly how this needs to be done is the subject of ongoing study.
A partial spec has been proposed here: https://wiki.ubuntu.com/Specs/M/ARMAutomatedBootstrap This document fills out that spec and proposes some further ideas and changes.
CircularBuildDependencies is a list of loops found in the last analysis (run in early 2011).
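Loops of this kind can be found mechanically once build-dependencies are expressed as a graph. The sketch below is a plain depth-first search, not the tool used for the analysis mentioned above; the graph contents are illustrative and reuse the kerberos/ldap and qt/poppler/cups loops from this page.

```python
def find_cycle(graph):
    """Return one build-dependency loop as a list (first == last), or None.

    graph maps a source package to the packages it build-depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {node: WHITE for node in graph}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for dep in graph.get(node, ()):
            state = colour.get(dep, WHITE)
            if state == GREY:  # dep is already on the current path: a loop
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Illustrative graphs only: the normal build has a loop, the staged one does not.
NORMAL = {"kerberos": ["ldap"], "ldap": ["kerberos"]}
STAGED = {"kerberos": [], "ldap": ["kerberos"]}
```

Dropping the ldap dependency from the staged kerberos build is what removes the loop: find_cycle(NORMAL) reports it, while find_cycle(STAGED) finds nothing.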
'Staged' builds are invoked by setting DEB_BUILD_OPTIONS=STAGEn to specify a staged build to dpkg-buildpackage. When no 'STAGEn' option is set then a normal build occurs. Some packages may need more than one staged build. We do not know what the maximum number of stages needed is: it is probably two, but to assume so would be foolish. We count up from STAGE1, STAGE2 to 'normal'. Hopefully it is reasonably clear to the average packager what is going on.
Any 'staged' package must be identified as such in the metadata so it is not accidentally uploaded as a 'real' package. Is the 'UNRELEASED' codename indicator sufficient or do we need something more explicit: e.g. X-Staged-Build:N header?
It must be possible for the build-tools to identify what build-stages are available. We propose Build-Depends-StageN headers, one for each stage. The existence of that defines such a stage as being available.
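A small sketch of how a build tool might discover the available stages from the proposed headers. It assumes the source stanza of debian/control has already been parsed into a dict of field name to value, with field names in their canonical spelling; the function names are hypothetical.

```python
import re

_STAGE_FIELD = re.compile(r"^Build-Depends-Stage(\d+)$")


def available_stages(fields):
    """Return the sorted stage numbers for which a Build-Depends-StageN field exists."""
    stages = []
    for name in fields:
        match = _STAGE_FIELD.match(name)
        if match:
            stages.append(int(match.group(1)))
    return sorted(stages)


def build_depends_for(fields, stage=None):
    """Return the dependency string to check for a normal (None) or staged build."""
    if stage is None:
        return fields.get("Build-Depends", "")
    key = f"Build-Depends-Stage{stage}"
    if key not in fields:
        # No such header means the package does not offer that stage.
        raise KeyError(f"package does not define stage {stage}")
    return fields[key]
```

Because the existence of the header is the only signal, no separate list of supported stages has to be maintained.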
Let's consider kerberos as a typical example of a library package involved in a circular dependency. krb5 needs libldap2-dev to build (from openldap). openldap needs libkrb5-dev (from krb5) to build. To fix this we add a staged build to krb5 to miss out the generation of the krb5-ldap package. This is easy to do with a debhelper-based package by simply setting DH_OPTIONS="--no-package=krb5-ldap", and running configure with --without-ldap (when DEB_BUILD_OPTIONS=STAGE1).
Dealing with changed build dependencies
Build-Depends-StageN simply lists all the build-dependencies again, except changing or missing out some as required. This makes it very easy to implement, but it does need to be maintained along with the normal Build-Depends. It would be nice to just list a 'diff' from the normal build-dependency list - i.e. 'except package-foo' or 'package-minimal instead of package'. I'm not sure this is practical, but if anyone can work out how to do it...
So for krb5 we'd add either:
Build-Depends-StageN: except libldap2-dev
or:
Build-Depends-StageN: debhelper (>= 7), byacc | bison, comerr-dev, docbook-to-man, libkeyutils-dev [!kfreebsd-i386 !kfreebsd-amd64 !hurd-i386], libncurses5-dev, libssl-dev, ss-dev, texinfo
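To get a feel for whether the 'diff' form is workable, here is a sketch that derives a staged dependency list from a normal one by applying an 'except' set. The 'except' semantics are an assumption (this page does not define them), the function names are hypothetical, and the NORMAL string is only the explicit krb5 list from this page with libldap2-dev added back for illustration.

```python
import re


def _package_name(alternative):
    """Strip version, architecture and other qualifiers from one alternative."""
    return re.split(r"[\s(\[<:]", alternative.strip(), maxsplit=1)[0]


def apply_except(build_depends, excluded):
    """Return build_depends with every alternative naming an excluded package removed.

    A comma-separated clause disappears only when none of its alternatives survive.
    """
    kept = []
    for clause in build_depends.split(","):
        alternatives = [alt.strip() for alt in clause.split("|") if alt.strip()]
        remaining = [alt for alt in alternatives if _package_name(alt) not in excluded]
        if remaining:
            kept.append(" | ".join(remaining))
    return ", ".join(kept)


# Illustrative only: assumed normal Build-Depends for the krb5 example.
NORMAL = (
    "debhelper (>= 7), byacc | bison, comerr-dev, docbook-to-man, "
    "libkeyutils-dev [!kfreebsd-i386 !kfreebsd-amd64 !hurd-i386], "
    "libldap2-dev, libncurses5-dev, libssl-dev, ss-dev, texinfo"
)
STAGED = apply_except(NORMAL, {"libldap2-dev"})
```

The attraction of this form is that only the exception has to be maintained. The 'package-minimal instead of package' case would need a substitution rule as well, which this sketch leaves out.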
For packages which depend on themselves (usually languages), the build-dependencies should be changed to depend on lang | lang-bootstrap. In a normal repository the (native version of) lang-bootstrap will not be available, so lang will be used. In a bootstrapping environment lang may well not be available, in which case lang-bootstrap needs to be built. The bootstrapping tool knows to do a staged build in this case.
Setting DEB_STAGE and building this package causes it to produce lang-bootstrap (which is normally not emitted). This is implemented by adding a new control stanza for lang-bootstrap and specifying --no-package=lang-bootstrap in debian/rules for normal builds, but not for the stage build (which will probably exclude a load of other stuff).
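The decision a bootstrapping tool has to make for a lang | lang-bootstrap dependency could look roughly like this. It is a sketch under stated assumptions: the function name and the returned action strings are hypothetical, alternatives are bare package names, and a '-bootstrap' suffix is taken as the marker for a package that a staged build can produce.

```python
def plan_for(dependency, available):
    """Decide how to satisfy one 'a | b' build-dependency.

    Returns ("use", package) when an alternative is already installable,
    ("staged-build", package) when a -bootstrap alternative must be built first,
    or ("unsatisfiable", None) otherwise.
    """
    alternatives = [alt.strip() for alt in dependency.split("|") if alt.strip()]
    for alt in alternatives:
        if alt in available:
            return ("use", alt)  # first available alternative wins, as usual
    for alt in alternatives:
        if alt.endswith("-bootstrap"):
            return ("staged-build", alt)
    return ("unsatisfiable", None)
```

In a normal repository the first branch always picks lang, so the extra alternative has no effect outside a bootstrap.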
For documentation issues being able to specify DEB_BUILD_OPTIONS=nodocs would be simplest. Building with docs affects the dependencies, so it is not like other DEB_BUILD_OPTIONS, so perhaps this is not a good mechanism to use? Something generic is attractive if we can make it work.
Documentation loops are primarily an issue for native building, although they do cause issues for cross-building too (gobject introspection, perl module docs).
These are some related activities and documents which have generated input for this one.
- GSOC 2011 project
Thanks to Jonathan Austin, Steve McIntyre, Steve Langasek and Loic Minier for helping clarify the thoughts described above.