Differences between revisions 42 and 43
Revision 42 as of 2013-08-17 19:25:57
Size: 9300
Editor: Lunar
Comment: Record progress!
Revision 43 as of 2013-08-17 19:29:05
Size: 9379
Editor: Lunar
Comment: add link to the patch

It should be possible to reproduce, byte for byte, every build of every package in Debian.

For now, we will start with a few maintainers who want to opt in to this goal as we flesh out the details of what will make it possible. This page tracks our progress.

Drivers

Why do we want reproducible builds?

  • Independent verification that a binary matches what the source is intended to produce.
  • Help co-installation of Multi-Arch: same packages (every file they share must be byte-identical).

  • Be able to generate debug symbols for packages which do not have a “debug package”.

Others?

Status

  • Proof of concept success: hello package: Contents of data.tar.gz and control.tar.gz are the same even if you build the package twice (as of version 2.8-4, per 719848).

  • Buy-in within Debian: 5 maintainers have opted in with 5 packages, of which 0 so far have reproducible contents of {data,control}.tar.gz

  • Waiting on a few dpkg bugs for avoiding timestamps and file order inconsistency in {data,control}.tar.gz (or .xz)
  • Lunar has done a good refactoring (http://people.debian.org/~lunar/volatile/dpkg_do_build_refactor.patch) of the dpkg-deb/build.c:do_build() function (initially 225 lines long with a comment saying “Overly complex” on top). It needs a little more work; the next step is to fix the timestamp and file order issues.

  • It is possible to reinstall exactly the same set of Debian packages used for the initial build using snapshot.debian.org.
  • Things that need further investigation (by e.g. you!)
    • Write a script to recreate a build chroot from a list of binary packages and their respective versions. See below for some preliminary work that has already been done.
    • Find out if {control,data}.tar.gz files created by dpkg 1.17.1+ have a timestamp embedded.
    • Research about other distributions: NixOS, SUSE (see build-compare)

Use cases

  • If the Debian build daemons are compromised, end users can assure themselves that their binaries are OK if they can regenerate them (and their build dependencies). (You could use a more complicated equivalence test than "do the hashes match?" but if the hashes do match, this is simple.)
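
The simple "do the hashes match?" test can be sketched in shell. This is only an illustration, not part of any Debian tooling: the function name same_build is made up for this page, and it assumes sha256sum from GNU coreutils.

```shell
# Hypothetical helper: succeed only if two build products have the same SHA-256 hash.
same_build() {
    # Read from stdin so the file name is not part of sha256sum's output.
    hash_a=$(sha256sum < "$1") || return 2
    hash_b=$(sha256sum < "$2") || return 2
    [ "$hash_a" = "$hash_b" ]
}

# Usage (paths are placeholders):
#   same_build official/hello.deb rebuilt/hello.deb && echo "match"
```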

Detailed package status list

  • alpine (Asheesh Laroia)
    • Status: Untested
  • haveged (Lunar)
    • Status: the contents of data.tar and control.tar do not vary with time, but both archives still differ between builds because of the tar members' mtimes.

  • iotop (pabs)
    • Status: data.tar and control.tar differ although their contents are the same, because the mtime and checksum fields of the tar members differ (checked with hachoir-urwid)
  • debhelper (joeyh)
    • Status: Unknown
  • magit (lindi)
    • Status: Unknown

Reproducing builds

There are two sides to the problem: first we need to record the initial build environment, and then we need a way to set up the same environment.

Recording the environment

The right place to record the build environment is the .changes file. Rationale: it lists the checksums of the build products and is signed by either the maintainer or the buildd operator.

To add a field to the .changes file, we need to call dpkg-buildpackage using something like:

dpkg-buildpackage --changes-option="-DBuild-Environment=$(
        COLUMNS=999 dpkg -l | awk '
            /^ii/ { ORS=", "; print $2 " (= " $3 ")" }' |
        sed -e 's/, $//'
)"

The idea is not new, see 138409. The above could eventually be integrated into dpkg proper if our experiments turn out to be successful.

(See 719854 for the first attempt which tried using XC- field in debian/control.)
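
To see what the field value looks like, the awk and sed stages can be tried on canned input. A minimal sketch: format_build_env is a name made up for this page, and the sample lines only imitate dpkg -l output (the gzip version is invented).

```shell
# Hypothetical helper: turn `dpkg -l` style lines on stdin into a
# "name (= version), name (= version)" list.
format_build_env() {
    awk '/^ii/ { ORS=", "; print $2 " (= " $3 ")" }' | sed -e 's/, $//'
}

# Canned input standing in for real `dpkg -l` output.
printf '%s\n' \
    'Desired=Unknown/Install/Remove/Purge/Hold' \
    'ii  gzip   1.6-1  amd64  GNU compression utilities' \
    'ii  hello  2.8-4  amd64  The classic greeting' |
    format_build_env
echo
```

With GNU awk and sed this prints gzip (= 1.6-1), hello (= 2.8-4). Only lines starting with "ii" (installed packages) are kept.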

Reproduce the build environment

Actions:

  • We need a script that takes a list of binary packages and their respective versions, installs them in a chroot, and starts the build. Maybe based on pbuilder?

A Ruby script that generates the URLs of .deb files on snapshot.debian.org from a list of binary packages and their respective versions: http://people.debian.org/~paulproteus/lunar-verify-script.rb
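
One possible first step for such a script is turning the recorded package list back into arguments for apt-get install, which accepts name=version. This is a sketch under assumptions: build_env_to_apt_args is a made-up name, the input is assumed to be in the "name (= version), ..." form shown earlier, and it is unrelated to the Ruby script above.

```shell
# Hypothetical helper: read a "name (= version), ..." list on stdin and
# print one "name=version" argument per line.
build_env_to_apt_args() {
    tr ',' '\n' | sed -n 's/^ *\([^ ]*\) (= \(.*\)) *$/\1=\2/p'
}

# Usage sketch (inside the chroot, with a matching snapshot.debian.org
# archive configured as the APT source):
#   echo "$build_environment" | build_env_to_apt_args | xargs apt-get install --yes
```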

Known bugs we are waiting on

  • dpkg: (719844) about gzip timestamps

  • dpkg: (719845) tar directory order

Different problems, and their solutions

Build systems tend to capture information about the environment, which makes them produce different results across different systems, despite having the same architecture and software installed.

Ideally, such variations should be fixed in the build system itself, but this is not always possible.

Non-problems

  • You might think ELF binaries (e.g. /usr/bin/hello in the hello package) have embedded timestamps. Luckily, they don't!

Files in data.tar.gz contain build paths

These should really be patched out in one way or another. This is not useful information and can actually hide real bugs.

For debug files, use debugedit.

Files in data.tar.gz depend on readdir order

The build system needs to be patched to sort directory listings.
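
As an illustration of the kind of patch meant here, consider a build step that concatenates every file in a directory. A sketch only: concat_sorted is a made-up name, and the -print0, sort -z and xargs -0 -r options assume GNU findutils and coreutils.

```shell
# Hypothetical build step: concatenate every *.txt file under DIR in a
# fixed, locale-independent order instead of whatever order readdir() returns.
concat_sorted() {
    find "$1" -type f -name '*.txt' -print0 | LC_ALL=C sort -z | xargs -0 -r cat
}
```

Sorting with LC_ALL=C keeps the order independent of the builder's locale as well as of the filesystem.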

Files in data.tar.gz vary with the locale

Builds should be made with LC_ALL=C.UTF-8.

It's quite impractical to force such a value in debian/rules and there is actually no reason this should not be the default.

Actions:

  • We could make dpkg-buildpackage export this variable, but we would need to change the policy to make dpkg-buildpackage the canonical way to build packages.
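
Until something like that exists, the effect can be tried by hand. A minimal sketch: with_build_locale is a made-up wrapper name, and sort stands in for any locale-sensitive build step.

```shell
# Hypothetical wrapper: run a command with the locale pinned.
with_build_locale() {
    LC_ALL=C.UTF-8 "$@"
}

# Sorting is one visible effect: with the locale pinned, "A" sorts before
# "a" and "b" (code point order), whatever the builder's own locale is.
printf '%s\n' b A a | with_build_locale sort
```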

Files in data.tar.gz contain hostname, uname output, username

Actions:

  • We could write an LD_PRELOAD library that returns consistent results for several system calls, on the same model as libfaketime. Bdale suggested we call it liblietome.

Files in data.tar.gz contain timestamps

  • Recommended solution:
    • Use the timestamp of the last debian/changelog entry as reference.
    • touch all files to the reference timestamp before building the binary packages.
    • Use gzip -n when gzipping anything.
    • Get rid of non-determinism (yup...)
  • Alternate solutions:
    • libfaketime (probably breaks some things) (sudo apt-get install faketime)

For the worst cases, we could record the calls to gettimeofday() on the first build and have something like libfaketime replay them on rebuilds.
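
The touch and gzip -n steps of the recommended solution could look roughly like this. It is a sketch under assumptions: both function names are made up, and the --no-dereference and --date options assume GNU touch.

```shell
# REF is any date string GNU touch accepts. In a package tree it could
# come from the changelog, for example:
#   ref=$(dpkg-parsechangelog | sed -n 's/^Date: //p')

# Hypothetical helper: set the mtime of everything under DIR to REF.
normalize_mtimes() {
    find "$1" -exec touch --no-dereference --date="$2" {} +
}

# Hypothetical helper: compress FILE to stdout without recording its
# name or mtime in the gzip header.
gzip_without_timestamp() {
    gzip -n -c "$1"
}
```

With -n, compressing the same bytes twice gives identical output even if the file's mtime changed in between.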

Members of control.tar have varying mtime

We can fix this by giving tar the --mtime= option with the date of the last debian/changelog entry or a similar fixed point in time. Change to be done in dpkg-deb/build.c:do_build() around line 462.
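
Outside of dpkg, the effect of --mtime can be checked with GNU tar directly. A sketch only: tar_with_fixed_mtime is a made-up name and this is not the code path dpkg-deb uses.

```shell
# Hypothetical helper: archive DIR to stdout with every member's mtime
# forced to REF (any date string GNU tar accepts).
tar_with_fixed_mtime() {
    tar -C "$1" --mtime="$2" -cf - .
}

# Usage (REF taken from the last debian/changelog entry):
#   tar_with_fixed_mtime debian/tmp/DEBIAN "$ref" > control.tar
```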

{data,control}.tar.{gz,xz,bz2} may have timestamps

  • dpkg 1.17.1 might or might not store a timestamp for the .gz versions of these files.
  • *.xz and *.bz2 seem to provide no ability to store a timestamp.
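
One way to find out for the .gz case is to read the MTIME field straight from the gzip header. RFC 1952 puts it at byte offset 4 as a 4-byte little-endian integer, with 0 meaning no timestamp. The helper name below is made up for this sketch.

```shell
# Hypothetical helper: print the MTIME field of a gzip file as a decimal number.
gzip_header_mtime() {
    # Read the four bytes one by one so the result does not depend on
    # the host's endianness.
    set -- $(od -An -tu1 -j4 -N4 "$1")
    echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}

# Usage sketch (package name is a placeholder):
#   ar p hello.deb control.tar.gz > control.tar.gz
#   gzip_header_mtime control.tar.gz
```

A result of 0 means no timestamp was stored; anything else is seconds since the epoch.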

{data,control}.tar.{gz,xz,bz2} will store files in readdir order

This is dependent on an accident of filesystem layout at build time, so it would sometimes not be reproducible.

We should probably fix this in dpkg by sorting the contents of the tar files.

For control.tar, we need to feed tar a sorted list of files in dpkg-deb/build.c:do_build() around line 462.

For data.tar, we need to sort the output of find in dpkg-deb/build.c:do_build() around line 571.

Changes are discussed in 719845.
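
The same idea can be tried from the shell by feeding GNU tar a sorted file list. A sketch only, not the dpkg patch: tar_sorted is a made-up name, and --null, --no-recursion and -T are GNU tar options.

```shell
# Hypothetical helper: archive DIR to stdout with members in sorted order.
tar_sorted() {
    ( cd "$1" &&
      find . -mindepth 1 -print0 |
      LC_ALL=C sort -z |
      tar --null --no-recursion -T - -cf - )
}
```

--no-recursion matters here: find already lists every entry, so tar must not walk the directories again in readdir order.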

.deb ar-archive header contains a timestamp

.deb files are ar archives. The header currently contains the “current time”. It is written by dpkg at line 103 of lib/dpkg/ar.c.

Guillem said he would rather keep this.
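
To look at that timestamp in an existing .deb, the field can be read without any ar tooling. In the common ar format an 8-byte magic string is followed by a 60-byte member header, whose 12-byte decimal mtime field comes right after the 16-byte member name. The helper name is made up for this sketch.

```shell
# Hypothetical helper: print the timestamp stored in the first ar member
# header of FILE (bytes 24 to 35: 8-byte magic + 16-byte member name).
ar_first_member_mtime() {
    dd if="$1" bs=1 skip=24 count=12 2>/dev/null | tr -d ' '
}

# Usage (package name is a placeholder):
#   ar_first_member_mtime hello.deb
```

Two builds of the same package done at different times would show different values here.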

References