Differences between revisions 88 and 89
Revision 88 as of 2013-09-08 20:13:30
Size: 19810
Editor: Lunar
Comment: summarize Yann's position
Revision 89 as of 2013-09-08 20:24:44
Size: 20111
Editor: Lunar
Comment: dh-buildinfo does what we want to record the environment
It should be possible to reproduce, byte for byte, every build of every package in Debian.

For now, we will start with a few maintainers who want to opt in to this goal as we flesh out the details of what will make it possible. This page tracks our progress.

To participate in the project, we recommend you create an account on this wiki, and then "watch" this page.

Drivers

Why do we want reproducible builds?

  • Independent verifications that a binary matches what the source intended to produce.
  • Help Multi-Arch: same packages co-installation (as they need every matching file to be byte identical).

  • Be able to generate debug symbols for packages which do not have a “debug package”.

Others?

Status

  • Proof of concept success: hello package: Contents of data.tar.gz and control.tar.gz are the same even if you build the package twice (as of version 2.8-4, per 719848).

  • Buy-in within Debian: 5 packages from 5 maintainers are interested, of which 0 so far have reproducible contents of {data,control}.tar.gz

  • Waiting on a few dpkg bugs for avoiding timestamps and file order inconsistency in {data,control}.tar.gz (or .xz)
  • Lunar has a dpkg branch that handles timestamps and file order in .deb and .changes, and passes the right CFLAGS to get deterministic debug symbols. See below.
  • Sort out the results of the rebuild done by David Suárez with the pu/reproducible_builds branch of dpkg.

  • Perform the next archive rebuild with binutils recompiled with --enable-deterministic-archives and see if that helps.

  • It is possible to reinstall exactly the same set of Debian packages used for the initial build using snapshot.debian.org.

Useful things you (yes, you!) can do

  • Write a script to rebuild a package from a list of binary packages and their respective versions. See below for some preliminary work by Lunar that has already been done. One way to do that is to download the deb files from snapshot.debian.org and then put them into a chroot with pbuilder, and then use pbuilder to do a build of the desired Debian package.
  • Research other distributions: NixOS, SUSE (see build-compare), then write about it on your blog and link to it on this wiki page.

  • Write a tool to take a *.deb file and extract, into a YAML file, all the non-essential metadata about a *.deb -- for now, that is the order of files within the control.tar and data.tar, and their mtimes. You can use the iotop package as a starting point for this. Once we can generate this file, we can write a second tool to re-apply the metadata to a target *.deb file. That way you will be able to combine a new, rebuilt *.deb with the YAML file, and create a binary-identical *.deb to the one in the Debian archive.

If you want to help with this, join #debian-devel and ping paulproteus (Asheesh) or tumbleweed (Stefano) or the other people listed on this page.

Use cases

  • If the Debian build daemons are compromised, end users can assure themselves that their binaries are OK if they can regenerate them (and their build dependencies). (You could use a more complicated equivalence test than "do the hashes match?" but if the hashes do match, this is simple.)

Detailed package status list

  • alpine (Asheesh Laroia)
    • Status: Untested
  • haveged (Lunar)
    • Status: content of data.tar and control.tar do not vary with time. control.tar is different because of tar member's mtimes. Same for data.tar.

  • iotop (pabs)
    • Status: data.tar and control.tar different (contents same): because the tar mtime/check_sum members differ (used hachoir-urwid to check)
  • debhelper (joeyh)
    • Status: contains timestamps; got initial reproducible build using faketime with clock stuck at epoch. Build environment recording needed in order to get long-term reproducible build.
  • magit (lindi)
    • Status: Unknown

Reproducing builds

There are two sides to the problem: first we need to record the initial build environment, and then we need a way to set up the same environment.

Recording the environment

The right place to record the build environment is the .changes file. Rationale: it lists the checksums of the build products and is signed by either the maintainer or the buildd operator.

To add a field to the .changes file, we need to call dpkg-buildpackage using something like:

dpkg-buildpackage --changes-option="-DBuild-Environment=$(
COLUMNS=999 dpkg -l | awk '
            /^ii/ { ORS=", "; print $2 " (= " $3 ")" }' |
        sed -e 's/, $//'
)"

The idea is not new, see 138409. The above could eventually be integrated in dpkg proper if our experiments turn out to be successful.

The dh-buildinfo script already captures the build environment quite nicely. We should simply re-use it. dh-buildinfo should be turned into dpkg-buildinfo. Either it should produce an extra file that is bundled together with a .changes, or it should add some extra fields directly in .changes.

(See 719854 for the first attempt which tried using XC- field in debian/control.)

Reproduce the build environment

Actions:

  • We need a script that would take a list of binary packages and their respective versions, install them in a chroot, and start the build. Maybe based on pbuilder?
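
As a rough sketch of the first step such a script would need, the function below turns a Build-Environment value, in the format produced by the dpkg-buildpackage command earlier on this page, into package=version arguments that apt-get accepts. The name env_to_apt_args is made up for this example, and installing the result inside a chroot is an assumption about how the rest of the script might work, not an existing tool.

```shell
# Hypothetical helper: read "pkg (= ver), pkg2 (= ver2)" on stdin and print
# "pkg=ver pkg2=ver2" on a single line.
env_to_apt_args() {
        tr ',' '\n' |
            sed -n -e 's/^ *\([^ ]*\) *(= *\([^ )]*\) *) *$/\1=\2/p' |
            paste -s -d ' ' -
}

# Possible use inside the chroot (not run here, BUILD_ENV is a placeholder):
#   apt-get install --yes $(echo "$BUILD_ENV" | env_to_apt_args)
```

Entries that do not match the "name (= version)" shape are silently dropped, so a real script would want to check that nothing was lost.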

Ruby script that generates URLs to .deb files on snapshot.debian.org from a list of binary packages and their respective versions: http://people.debian.org/~paulproteus/lunar-verify-script.rb

Here's another potential piece of the puzzle. The following script will convert a RFC822 date (as found in a .changes) to the URL of the last known archive state recorded by snapshot.debian.org. This might be useful to debootstrap the proper chroot before installing packages…

require 'date'
require 'uri'
require 'net/http'
require 'nokogiri'

changes_date = 'Mon, 30 Jan 2012 12:52:28 +0100'

build_date = DateTime.rfc822(changes_date)
url = "http://snapshot.debian.org/archive/debian/?year=#{build_date.year}&month=#{build_date.month}"
response = Net::HTTP.get_response(URI.parse(url))

run = nil
doc = Nokogiri::HTML(response.body)
doc.css('p a').each do |link|
  date = DateTime.parse(link.content)
  break if date >= build_date
  run = link['href']
end
puts "http://snapshot.debian.org/archive/debian/#{run}"

Note: it would probably be a lot better to add a new query to the machine interface of snapshot.d.o instead of parsing HTML.
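
A possibly simpler route, sketched below in shell, is to build a timestamp URL directly from the date. This assumes GNU date, and it assumes that snapshot.debian.org resolves a URL containing an arbitrary timestamp to the closest earlier snapshot; that behaviour should be checked against the service before relying on it. The function name snapshot_url is made up for this example.

```shell
# Hypothetical helper: map an RFC 822 date (as found in a .changes file) to a
# snapshot.debian.org archive URL. Relies on GNU date for the -d option.
snapshot_url() {
        stamp="$(date -u -d "$1" +%Y%m%dT%H%M%SZ)" || return 1
        echo "http://snapshot.debian.org/archive/debian/$stamp/"
}

snapshot_url 'Mon, 30 Jan 2012 12:52:28 +0100'
```

The date is converted to UTC first, because the timestamps in snapshot URLs end in Z.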

Known bugs we are waiting on

  • dpkg: (719844) about gzip timestamps

  • dpkg: (719845) tar directory order

Different problems, and their solutions

Build systems tend to capture information about the environment that makes them produce different results across different systems, despite having the same architecture and software installed.

Ideally, such variations should be fixed in the build system itself, but it might sometimes not be possible.

Non-problems

  • You might think ELF binaries (e.g. /usr/bin/hello in the hello package) have embedded timestamps. Luckily, they don't!

Files in data.tar.gz contain build paths

These should really be patched out in one way or another. This is not useful information and can actually hide real bugs.

The build path is embedded in DWARF sections of ELF files. The path can be made deterministic by adding two CFLAGS for GCC:

  1. -fdebug-prefix-map to replace the build path by a predetermined path.

  2. -gno-record-gcc-switches to prevent the previous option from being recorded in the debug file (as it changes with the build path).

Both are documented in GCC's manual.

We can pass both options using the dpkg-buildflags mechanism. Lunar's branch has a patch for that.
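
For experimenting without a patched dpkg, the two options can be assembled by hand. The sketch below only prints the flags; the function name repro_debug_flags and the replacement path /usr/src/build are placeholders chosen for this example, not something dpkg or Lunar's branch defines.

```shell
# Hypothetical helper: print the two GCC options discussed above.
#   $1: the real build directory
#   $2: the fixed path to record instead (placeholder default)
repro_debug_flags() {
        builddir="$1"
        fixed="${2:-/usr/src/build}"
        echo "-fdebug-prefix-map=$builddir=$fixed -gno-record-gcc-switches"
}

# Possible use with the stock dpkg-buildflags mechanism (not run here):
#   DEB_CFLAGS_APPEND="$(repro_debug_flags "$PWD")" dpkg-buildpackage
```

DEB_CFLAGS_APPEND only reaches packages whose debian/rules actually honours dpkg-buildflags.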

Another option is to use debugedit together with -fno-merge-debug-strings. The latter is needed because the hashtable used when merging strings will output strings in a different order depending on the build path.

Files in data.tar.gz depend on readdir order

The build system needs to be patched to sort directory listings.
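
What "sorted" means in practice can be sketched in shell as follows. This assumes GNU find and sort (for -print0 and -z); the function name sorted_files is made up for this example. The sort runs under LC_ALL=C so that the order does not in turn depend on the locale.

```shell
# List regular files under a directory in a stable, byte-wise order,
# independent of the order in which readdir() returns them.
sorted_files() {
        (cd "$1" && find . -type f -print0 | LC_ALL=C sort -z | tr '\0' '\n')
}
```

A build step that previously looped over a raw directory listing could loop over this output instead.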

Files in data.tar.gz vary with the locale

Builds should be made with LC_ALL=C.UTF-8.

It is quite impractical to force such a value in debian/rules, and there is actually no reason this should not be the default.

Actions:

  • We could make dpkg-buildpackage export this variable, but we would need to change the policy to make dpkg-buildpackage the canonical way to build packages.
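
Until something like that exists, a small wrapper can pin the locale for a single command. This is only a sketch; the name with_fixed_locale is made up for this example, and it assumes the C.UTF-8 locale is available on the build machine.

```shell
# Hypothetical wrapper: run a command with a fixed locale, whatever the
# caller's LANG and LC_* settings are (LC_ALL overrides all of them).
with_fixed_locale() {
        env LC_ALL=C.UTF-8 "$@"
}

# Possible use (not run here):
#   with_fixed_locale dpkg-buildpackage
```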

Files in data.tar.gz contain hostname, uname output, username

Actions:

  • We could write an LD_PRELOAD library that returns consistent results for several system calls, on the same model as libfaketime. Bdale suggested we call it liblietome.

Files in data.tar.gz contains timestamps

  • Recommended solution:
    • Use the timestamp of the last debian/changelog entry as reference.
    • touch all files to the reference timestamp before building the binary packages.
    • gzip -n when gzipping anything
    • get rid of non-determinism (yup...)
  • Alternate solutions:
    • (or) libfaketime (probably breaks some things) (sudo apt-get install faketime)
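
The first three recommended steps could be sketched in shell as below. The function names are made up for this example, and the code assumes GNU date, find, touch and gzip. The changelog is parsed with sed here only to keep the sketch self-contained.

```shell
# Print the date of the newest debian/changelog entry as seconds since the
# epoch, by reading the first " -- Maintainer <email>  date" trailer line.
changelog_epoch() {
        date -d "$(sed -n -e 's/^ -- .*>  *//p' "$1" | head -n 1)" +%s
}

# Set the mtime of everything under a directory to the reference timestamp.
clamp_mtimes() {
        find "$1" -exec touch -h -d "@$2" {} +
}

# Compress to stdout without storing the original file name and mtime in the
# gzip header.
gzip_repro() {
        gzip -n -9 -c "$1"
}

# Possible use from the top of a source tree (not run here):
#   clamp_mtimes debian/tmp "$(changelog_epoch debian/changelog)"
```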

For the worst cases, we could record the calls to gettimeofday() on the first build and have something like libfaketime replay them on rebuilds.

Members of control.tar have varying mtime

We can fix this by giving tar the --mtime= option with the date of the last debian/changelog entry or a similar fixed point in time. Change to be done in dpkg-deb/build.c:do_build() around line 462.

Lunar's branch uses a single timestamp for all mtimes of tar members and allows presetting it during rebuilds, see below.
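
The effect of --mtime can be tried outside dpkg with a few lines of shell. This is a sketch that assumes GNU tar; the name tar_fixed_mtime is made up for this example, and it is not how dpkg-deb or Lunar's branch implements the change.

```shell
# Hypothetical helper: write a tar stream of a directory to stdout with every
# member's mtime forced to the given epoch value.
#   $1: directory to archive
#   $2: reference timestamp in seconds since the epoch
tar_fixed_mtime() {
        tar -C "$1" --mtime="@$2" -cf - .
}
```

Archiving the same tree twice with the same reference timestamp then gives the same member mtimes even if the files were touched in between.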

{data,control}.tar.{gz,xz,bz2} do not have timestamps

  • dpkg 1.17.1 does not store a timestamp for the .gz versions of these files.
  • *.xz and *.bz2 seem to provide no ability to store a timestamp.

{data,control}.tar.{gz,xz,bz2} will store files in readdir order

This is dependent on an accident of filesystem layout at build time, so it would sometimes not be reproducible.

We should probably fix this in dpkg by sorting the contents of the tar files.

For control.tar, we need to feed tar a sorted list of files in dpkg-deb/build.c:do_build() around line 462.

For data.tar, we need to sort the output of find in dpkg-deb/build.c:do_build() around line 571.

Changes are discussed in 719845. Test case patch for pkg-tests.
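
The idea of feeding tar a sorted file list can be illustrated in shell. This sketch assumes GNU find, sort and tar, and the name tar_sorted is made up for this example; the real fix belongs in dpkg-deb as described above.

```shell
# Hypothetical helper: archive a directory to stdout with members in byte-wise
# sorted order, by handing tar an explicit, sorted, NUL-separated file list.
# --no-recursion stops tar from walking directories in readdir order itself.
tar_sorted() {
        (cd "$1" && find . -print0 | LC_ALL=C sort -z |
            tar --null --no-recursion -T - -cf -)
}
```

Recent GNU tar releases also provide a --sort=name option that gives a similar result without the explicit list.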

.deb ar-archive header contains a timestamp

.deb files are ar archives. The header currently contains the “current time”. It is written by dpkg at line 103 of lib/dpkg/ar.c.

Guillem said he would rather keep this.

Lunar's branch uses a single timestamp for all ar headers and allows presetting it during rebuilds, see below.
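
To see which timestamp a given .deb carries, the field can be read straight out of the archive. The sketch below relies on the common ar layout: an 8-byte global magic, then for each member a header that starts with a 16-byte name followed by a 12-byte decimal mtime. The name ar_first_mtime is made up for this example, and it only looks at the first member (debian-binary in a .deb).

```shell
# Hypothetical helper: print the mtime field of the first member header of an
# ar archive. Bytes 0-7 are the magic, 8-23 the member name, 24-35 the mtime.
ar_first_mtime() {
        dd if="$1" bs=1 skip=24 count=12 2>/dev/null | tr -d ' '
}

# Possible use (not run here):
#   ar_first_mtime ../hello_2.8-4_amd64.deb
```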

dh-buildinfo

dh-buildinfo encodes information about the build environment in a buildinfo_all.gz file in the doc directory. For our goal, we believe that such information should not be recorded in the binary package itself and so dh-buildinfo should disappear.

Lunar wrote on 2013-09-08 to Yann Dirson to ask his opinion. Yann agrees that what buildinfo produces should not be in the .deb file but rather in a separate file. He's advocating for having an additional file as part of an upload rather than adding fields in .changes. He also agrees that what dh-buildinfo produces should actually be done at the dpkg level.

dpkg branch handling timestamps, file order, and debug symbols

The pu/reproducible_builds dpkg branch published by Lunar:

  1. makes file order deterministic in the control and data parts of the .deb,
  2. uses a single timestamp for .deb mtimes and allows presetting that timestamp,
  3. adjusts dpkg-buildflags to pass CFLAGS leading to deterministic debug symbols.

Usage example (after building and installing the new dpkg):

$ apt-get source hello
$ cd hello-2.8
$ dpkg-buildpackage
[…]
$ cp ../hello_2.8-4_amd64.deb ../hello_2.8-4_amd64.deb.orig
$ DEB_BUILD_TIMESTAMP=$(date +%s -d"$(sed -n -e 's/^Date: //p' ../hello_2.8-4_amd64.changes)") dpkg-buildpackage
[…]
$ sha256sum ../hello_2.8-4_amd64.deb ../hello_2.8-4_amd64.deb.orig
1e944abfceac7e593f6706da971e0444e5cee9aab680de5292d52661940ee9c4  ../hello_2.8-4_amd64.deb
1e944abfceac7e593f6706da971e0444e5cee9aab680de5292d52661940ee9c4  ../hello_2.8-4_amd64.deb.orig

Success!

bash script to compare two package builds

Usage: ./diffp r1/hello_2.8-4_amd64.changes r2/hello_2.8-4_amd64

# diffp: compare two package builds
# Copyright © 2013 Lunar <lunar@debian.org>
# Licensed under WTFPL — http://www.wtfpl.net/txt/copying/
#
# Depends: bash, binutils, devscripts (for dcmd), unzip

CHANGES_A="$1"
CHANGES_B="$2"

trim_diff() {
        grep -Ev '^(@@ |--- |\+\+\+ )'
}

get_ops() {
        local file="$1"

        case "$file" in
            *.so|*.so.[0-9]*)
                echo "readelf -a FILE"
                echo "readelf -w FILE"
                echo "objdump -d FILE"
                ;;
            *.a)
                echo "ar tv FILE"
                ;;
            *.zip|*.jar)
                echo "unzip -lv FILE"
                ;;
        esac
}

diffc() {
        local diff

        diff="$(diff -u0 <(echo "$@" | sed -e "s,PACKAGE,$PACKAGE_A," | sh) \
                 <(echo "$@" | sed -e "s,PACKAGE,$PACKAGE_B," | sh))"
        [ "$diff" ] || return 0
        echo "$diff" | trim_diff
        return 1
}

paste <(dcmd "$CHANGES_A" | sort | grep '\.deb$') <(dcmd "$CHANGES_B" | sort | grep '\.deb$') | while read PACKAGE_A PACKAGE_B; do
        PACKAGE="$(basename "$PACKAGE_A")"
        if [ "$PACKAGE" != "$(basename "$PACKAGE_B")" ]; then
                echo "$PACKAGE_A and $PACKAGE_B does not match. Something is wrong."
                exit 1
        fi
        echo "***** $PACKAGE"

        diffc "sha1sum < PACKAGE | sed -e 's,-$,$PACKAGE,'" && continue

        diffc 'ar tv PACKAGE' && continue

        MISMATCH=
        for file in debian-binary control.tar.gz data.tar.xz; do
                # diffc returns non-zero when the two packages differ
                if ! diffc "ar p PACKAGE $file | sha1sum | sed -e s/-$/$file/"; then
                        MISMATCH=1
                fi
        done
        [ "$MISMATCH" ] || continue

        echo "===== control.tar.gz"

        diffc 'ar p PACKAGE control.tar.gz | tar -ztvf -'
        ar p $PACKAGE_A control.tar.gz | tar -zvtf - | grep '^-' | while read flags user size date time file; do
                echo "----- $file"
                diffc "ar p PACKAGE control.tar.gz | tar -zxOf - $file"
        done

        echo "===== data.tar.xz"
        if ar p "$PACKAGE_A" control.tar.gz | tar -ztf - ./md5sums; then
                FILES="$(diffc "ar p PACKAGE control.tar.gz | tar -zxOf - ./md5sums" | awk '/^-/ { print "./" $2 }')"
        else
                FILES="$(ar p $PACKAGE_A data.tar.xz | tar -Jvtf - | awk '/^-/ { print $6 }')"
        fi
        diffc 'ar p PACKAGE data.tar.xz | tar -Jtvf -'
        echo "$FILES" | while read file; do
                echo "----- $file"
                diffc "ar p PACKAGE data.tar.xz | tar -JxOf - $file" |
                    sed -e "s,Binary files [^ ]* and [^ ]* differ,Binary file $file differ,"

                OPS="$(get_ops "$file")"
                [ "$OPS" ] || continue

                TMP_A=$(mktemp)
                TMP_B=$(mktemp)
                ar p $PACKAGE_A data.tar.xz | tar -JxOf - $file > "$TMP_A"
                ar p $PACKAGE_B data.tar.xz | tar -JxOf - $file > "$TMP_B"
                echo "$OPS" | while read op; do
                        diff -u0 <(echo "$op" | sed -e "s,FILE,$TMP_A," | sh | sed -e "s,$TMP_A,$file,g") \
                                 <(echo "$op" | sed -e "s,FILE,$TMP_B," | sh | sed -e "s,$TMP_B,$file,g") | trim_diff
                done
                rm -f "$TMP_A" "$TMP_B"
        done
done

How to build a deb using faketime

sudo apt-get install faketime
cat > /tmp/fakeroot-faketime << 'EOF'
#!/bin/sh
faketime "2013-08-15T11:02:00" fakeroot "$@"
EOF
chmod a+x /tmp/fakeroot-faketime
dpkg-buildpackage -r/tmp/fakeroot-faketime

Note that this retains *one* timestamp, which is the timestamp of the 'ar' container of the *.deb. To erase that, somehow regenerate the package within the fakeroot-faketime environment by using dpkg-deb to unpack it, then dpkg-deb to repack it.

Note also that this is a total hack and not something I (AsheeshLaroia) think it makes sense to do on the Debian build daemons. In particular, some programs (e.g., gpg) hang forever when time does not advance.

Upstream changes may solve the problems we face with faketime 0.9.1. (rbalint) Faketime upstream has been improved to advance time linearly at a preset pace for each time() call and to save/load timestamps. We could try rebuilding many packages, saving timestamps in the first build and replaying them in successive builds. For example gnupg 1.4.14-1 builds fine:

NO_FAKE_STAT=1  ~/projects/libfaketime.git/src/faketime -f '+0 i0.01' dpkg-buildpackage -rfakeroot -us -uc

References

Publicity

This section lists URLs, people, and dates for when other people have publicly expressed interest, or shared information about, the project.