Differences between revisions 257 and 258
Revision 257 as of 2014-10-08 12:46:26
Size: 26760
Editor: HolgerLevsen
Comment: better explain what to do with debhelper7 packages
Revision 258 as of 2014-10-08 12:59:49
Size: 26819
Editor: HolgerLevsen
Comment: explain better when to execute which dh_ command

It should be possible to reproduce, byte for byte, every build of every package in Debian.

This is an ongoing project. To participate, we recommend you create an account on this wiki, subscribe to this page, and join the reproducible-builds@lists.alioth.debian.org mailing list. The IRC channel is #debian-reproducible on OFTC.

Other resources:

Drivers:

  • Lunar

Why do we want reproducible builds?

  • Allow independent verification that a binary matches what the source intended to produce.
  • Help co-installation of Multi-Arch: same packages (every file they share needs to be byte-identical).

  • Be able to generate debug symbols for packages which do not have a “debug package”.
  • Ensure packages can be built from source. The archive could be made to only accept reproducible uploads: the maintainer would stop uploading .deb files but keep them referenced in the .changes. A buildd would then build the source. The upload is accepted only if the hashes match.
  • Allow file-level deduplication on Debian mirror sites, or maybe snapshot.d.o, of .deb files whose contents didn't really change between versions.
  • Allow .deb deltas to be smaller.
  • Packages with build profiles must offer the exact same functionality for all profiles. Reproducible builds could be used to verify that this is the case.

  • Make sure that Architecture: all packages are built identically on different build architectures.

  • Validate cross-builds against native builds.


Status

  • Current focus is on the toolchain: trying to make as few changes as possible in key packages to make as many builds as possible reproducible.
  • We have a custom toolchain that will allow a good amount of packages to be reproducible, as long as they use dh for their build process.

  • We have a specification and a prototype implementation for recording the build environment.
  • A build tool that would reproduce a build environment using packages from snapshot.debian.org is still missing.

Useful things you (yes, you!) can do

  • If you maintain a package for Debian, you can make sure that your package uses a modern debhelper style (e.g. one-liner debian/rules with overrides as needed). We aim to fix many causes of non-deterministic builds in the debhelper suite directly, so packages that use debhelper will be much easier to make reproducible with just an upgrade of the toolchain.

  • Investigate failing packages from the latest experiment. Submit wishlist bugs with fixes for individual issues or find general solutions if needed.

  • Look at the last 24h of results from Jenkins reproducible jobs, pick a package, look at the debbindiff output and investigate.

  • Find a way to prevent javadoc from writing timestamps.
  • Find a way to prevent Epydoc from writing timestamps and from outputting links in filesystem order.
  • Understand why binaries produced by Mono are different.
  • Write a script to rebuild a package from a .buildinfo file. Probably wrapping or modifying sbuild or pbuilder.

  • Create a patch for pbuilder to build packages in /usr/src/debian/hello-2.8-1 instead of /tmp/buildd.

  • Research about other distributions: NixOS, SUSE (see build-compare), then write about it on your blog and link to it on this wiki page.

  • Write a more integrated dh_genbuildinfo helper for debhelper that would create a .buildinfo file (see below for the format). It should be called after dh_builddeb and use dpkg-distaddfile so that the file is included in the .changes. The current implementation remixes the output of dpkg-genchanges and dh-buildinfo. Good for experimenting, but not enough to be integrated upstream.

  • Write a patch to cdbs to call dh_strip_nondeterminism, dh_fixmtimes, and dh_genbuildinfo at the right times.

If you want to help with this, feel free to ping the mailing list or edit this wiki page.

Reported bugs

All bugs relevant to the reproducible builds project should use usertags with user reproducible-builds@lists.alioth.debian.org.

Current usertags in use:

toolchain
affects a tool used by other package build systems
infrastructure
affects the whole Debian infrastructure or policies
timestamps
time of build is recorded during the build process
fileordering
build output varies with readdir() order
buildpath
path of sources is recorded during the build process
username
username is recorded during the build process
hostname
hostname is recorded during the build process
uname
uname output is recorded during the build process
randomness
some build aspects are dependent on (pseudo-)randomness

Control commands to update the view on the BTS.

Lintian tags

Here's a list of relevant Lintian tags:

Archive wide rebuilds

  • 2013-09-07 by David Suárez. 24% of 5240 source packages reproducible. Variations: time, build path.

  • 2014-01-26 by David Suárez. 67% of 6887 source packages reproducible. Variations: time, build path.

  • 2014-09-19 by Lunar, 30% of 172 source core packages reproducible. Variations: time, file order.

  • Updated daily since 2014-09-28 by jenkins.debian.net. On 2014-10-08 60.2% of 3500 source packages already tested built reproducibly...

UDD query to get a list of core packages (172 as of 2014-09-19):

SELECT DISTINCT source
  FROM packages
 WHERE release = 'sid'
   AND section != 'debian-installer'
   AND (   essential = 'yes'
        OR build_essential = 'yes'
        OR priority IN ('required', 'important', 'standard')
       )
 ORDER BY source;


Reproducing builds

There are two sides to the problem: first we need to record the initial build environment, and then we need a way to set up the same environment.

Recording the environment

Information on a build will be recorded in a new control file with extension `.buildinfo`.

Reproduce the build environment

Actions:

  • We need a script that would take a list of binary packages and their respective versions, install them in a chroot and start the build. Maybe based on pbuilder?
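As a starting point for such a script, the list of packages and versions can be turned into pinned arguments for apt-get. This is only a sketch: the function name pinned_args and the "package version" input format (one pair per line) are assumptions made for illustration, and the recorded versions would still need to be available from the APT sources configured in the chroot.

```shell
#!/bin/sh
# Hypothetical helper: turn "package version" lines (one pair per line)
# into the "package=version" arguments understood by apt-get install.
pinned_args() {
    # Skip blank lines and comment lines; print one pinned argument per line.
    awk 'NF >= 2 && $1 !~ /^#/ { printf "%s=%s\n", $1, $2 }' "$@"
}

# Usage sketch, inside the chroot (packages.txt is a hypothetical file name):
#   pinned_args packages.txt | xargs apt-get install --yes
```

Keeping the conversion separate from the installation makes it easy to inspect the exact versions that will be requested before touching the chroot.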

A Ruby script that generates URLs to .deb files on snapshot.debian.org from a list of binary packages and their respective versions: http://people.debian.org/~paulproteus/lunar-verify-script.rb

Here's another potential piece of the puzzle. The following script will convert an RFC 822 date (as found in a .changes) to the URL of the last known archive state recorded by snapshot.debian.org. This might be useful to debootstrap the proper chroot before installing packages…

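require 'date'
require 'uri'
require 'net/http'
require 'nokogiri'

# Date taken from the Date field of a .changes file (RFC 822 format).
changes_date = 'Mon, 30 Jan 2012 12:52:28 +0100'

build_date = DateTime.rfc822(changes_date)
# Monthly index page of the archive runs recorded by snapshot.debian.org.
url = "http://snapshot.debian.org/archive/debian/?year=#{build_date.year}&month=#{build_date.month}"
response = Net::HTTP.get_response(URI.parse(url))

# Keep the last run dated before the build. This assumes the page lists
# the runs in chronological order.
run = nil
doc = Nokogiri::HTML(response.body)
doc.css('p a').each do |link|
  date = DateTime.parse(link.content)
  break if date >= build_date
  run = link['href']
end
# Nothing matched, e.g. the build predates the first run of that month.
abort "No snapshot found before #{changes_date} in that month" if run.nil?
puts "http://snapshot.debian.org/archive/debian/#{run}"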

Note: it would probably be much better to add a new query to the snapshot.d.o API instead of parsing HTML.

Or you could use this shell one-liner:

  echo -n "Input date: " &&\
    read date &&\
    ts=$(date -d "$date" -u +'%Y%m%dT%H%M%SZ') &&\
    origurl="http://snapshot.debian.org/archive/debian/$ts/" &&\
    echo "Use either $origurl" &&\
    redirected=$(curl -s -L -I -o /dev/null -w '%{url_effective}' "$origurl") &&\
    echo "or $redirected"


Reproducible builds automated on jenkins.d.n

Several jobs have been created to regularly test packages (from sid main) on jenkins.d.n. As a result there is the reproducible build overview of packages, which eventually will have results for all >21k source packages in Debian.

The setup is explained in this blog post only, but this post is somewhat outdated by now and needs to be amended.


The basics for making packages build reproducible

Currently a plain sid environment is not enough to easily build packages reproducibly. Instead, a few packages need to be taken from the reproducible APT repository described below (at least debhelper 9.20141004~reproducible1 and dpkg 1.17.17~reproducible1 are needed). Besides this, these are the basics for different types of packaging:

  1. use dh with compat=9

  2. use debhelper version 7 style packaging, add strip-nondeterminism to build-depends and make these modifications to debian/rules:

    1. add dh_strip_nondeterminism before dh_compress

    2. add dh_fixmtimes after dh_md5sums, right before dh_builddeb

    3. add dh_genbuildinfo at the end of the dh_ sequence, so after dh_builddeb

  3. use cdbs and FIXME_EXPLAIN_HERE_WHAT_TO_DO

Other types of packaging should be avoided and really be converted to dh.

With this, the basics should be covered and simple packages should build reproducibly. See the next chapter for a discussion of common reproducibility issues and their solutions.


Identified problems, and possible solutions

Build systems tend to capture information about the environment that makes them produce different results across different systems, despite having the same architecture and software installed.

Ideally, such variations should be fixed in the build system itself, but it might sometimes not be possible.

Files in data.tar.gz contain build paths

The build path is embedded in DWARF sections of ELF files, among other types of files generated during builds. This has proven a real headache to fix after the path has been captured.

We are thus going to make it mandatory to build packages in a directory named like /usr/src/debian/hello-2.8-1.

As a bonus, this means that it will be easier to unpack packages in this canonical location for use with tools looking at the source code like gdb.

Files in data.tar.gz depend on readdir order

The build system needs to be patched to sort directory listings.

Epydoc

It looks like python-epydoc will produce links in an order that depends on the readdir order. This needs to be investigated.

Files in data.tar.gz vary with the locale

Builds should be made with LC_ALL=C.UTF-8.

It's quite impractical to force such a value in debian/rules and there is actually no reason this should not be the default.

Actions:

  • We could make dpkg-buildpackage export this variable; but we would need to change the policy to make dpkg-buildpackage the canonical way to build packages.
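A small illustration of why the locale matters: the ordering produced by sort depends on the collation rules of the active locale, so any build step that sorts its input can produce different files on different machines. The sketch below pins LC_ALL=C rather than C.UTF-8 only because the C locale is available everywhere; the function name is an assumption made for illustration.

```shell
#!/bin/sh
# Sort standard input under a pinned locale. With LC_ALL=C the comparison
# is byte-wise, so uppercase ASCII letters sort before lowercase ones
# regardless of how the build machine is configured.
sorted_in_c_locale() {
    LC_ALL=C sort
}

# Prints A, B, a, b (one per line) on every system.
printf 'b\nA\na\nB\n' | sorted_in_c_locale
```

Under many other locales the same input would typically come out with upper and lower case interleaved, which is exactly the kind of variation a fixed LC_ALL removes.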

Files in data.tar.gz contain hostname, uname output, username

We could write an LD_PRELOAD library, on the same model as libfaketime, that returns consistent results for several system calls. Bdale suggested we call it liblietome.

But we can also take the position that no build system should capture such information, or produce different builds depending on it, and fix the build systems that do.

Files in data.tar contain timestamps

Recommended solutions in order of preference:

  • Prevent the timestamp from being written entirely in the build products.
  • Tell the tools to use the timestamp of “0” if the timestamp is not used.
  • Tell the tools to use the timestamp of the last debian/changelog entry.

  • Strip timestamps at the end of the build process.
  • Replace timestamps at the end of the build process.
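As an illustration of the first option, gzip can be told not to write a timestamp at all: its -n option leaves both the original file name and the mtime out of the header. The sketch below assumes GNU coreutils for touch -d; the function name and the demo directory are only for illustration.

```shell
#!/bin/sh
# Compress a file without storing its name or mtime in the gzip header.
gzip_without_timestamp() {
    gzip -n -c "$1"
}

# Demonstration: the same content with two different mtimes compresses
# to identical bytes once the header timestamp is left out.
demo_dir=$(mktemp -d)
printf 'hello\n' > "$demo_dir/file"
touch -d '2014-01-01 00:00:00 UTC' "$demo_dir/file"
gzip_without_timestamp "$demo_dir/file" > "$demo_dir/a.gz"
touch -d '2014-10-08 12:59:49 UTC' "$demo_dir/file"
gzip_without_timestamp "$demo_dir/file" > "$demo_dir/b.gz"
cmp "$demo_dir/a.gz" "$demo_dir/b.gz" && echo "identical"
```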

Specific issues:

Now fixed:

  • Timestamp in gzip headers

  • Timestamp in jar files

Generation of files in data.tar depends on (pseudo-)randomness

Now fixed:

Members of control.tar and data.tar have varying mtimes

dpkg-deb will record the mtime of files it packs in control.tar and data.tar. This is bad as most of these files are generated during the build process and will thus change with each build.

759886 contains a patch against debhelper that adds a new dh_fixmtimes helper that will ensure that the mtime of any file created after the date of latest changelog entry will be set to the date of the latest changelog entry.
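The idea behind the helper can be sketched in a few lines of shell. This is not the code from the patch, only a hypothetical re-implementation of the clamping rule described above; it assumes GNU find, xargs and touch, and the function name clamp_mtimes is made up for illustration.

```shell
#!/bin/sh
# Set the mtime of every file under DIR that is newer than REF_DATE back
# to REF_DATE. Older files are left untouched.
clamp_mtimes() {
    dir=$1
    ref_date=$2
    find "$dir" -newermt "$ref_date" -print0 |
        xargs -0 -r touch --no-dereference --date="$ref_date"
}

# Usage sketch in an unpacked source tree, where dpkg-parsechangelog -SDate
# prints the date of the latest changelog entry (debian/tmp is only an
# example of a staging directory):
#   clamp_mtimes debian/tmp "$(dpkg-parsechangelog -SDate)"
```

Clamping instead of resetting every mtime keeps the timestamps of files shipped unmodified from the source, which are already stable across builds.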

{data,control}.tar.{gz,xz,bz2} will store files in readdir order

This is dependent on an accident of filesystem layout at build time, so it would sometimes not be reproducible.

We should probably fix this in dpkg by sorting the contents of the tar files.

Changes are discussed in 719845. Test case patch for pkg-tests. Patches that fork `sort` to get a stable order for files in control and data archives.
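The same approach can be tried outside dpkg with standard tools. The following is a minimal sketch, assuming GNU find, sort and tar; it is not the code from the patches, and the function name sorted_tar is hypothetical.

```shell
#!/bin/sh
# Create a tar archive of SRC in OUT whose members are stored in sorted
# order instead of readdir() order.
sorted_tar() {
    src=$1
    out=$2
    # List every entry, sort the NUL-separated names byte-wise, and hand
    # the list to tar. --no-recursion stops tar from walking directories
    # again on its own, which would reintroduce readdir() order.
    (cd "$src" && find . -mindepth 1 -print0 |
        LC_ALL=C sort -z |
        tar --create --null --no-recursion --files-from=- --file=-) > "$out"
}

# Usage sketch:
#   sorted_tar debian/tmp data.tar
```

Sorting with LC_ALL=C matters here as well: without it, the member order would again depend on the locale of the build machine.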

Randomness in control file

Now fixed:

.deb ar-archive header contains a timestamp

.deb files are ar archives. The header currently contains the “current time”.

759999 contains patches against dpkg that will preset the timestamp to the time of the latest entry of debian/changelog when a package is built using dpkg-buildpackage.

building the kernel

Uncoordinated experiments, see SameKernel.

XSLT generate-id() is non-deterministic

XSLT's generate-id() function is explicitly allowed by the XSLT spec to be non-deterministic, and is frequently implemented using memory addresses of XML nodes, which are of course non-deterministic thanks to ASLR. Consequently, files that are generated by XSLT (typically documentation) and include the result of generate-id() in their output do not build deterministically.

piuparts, which uses xmlto to generate documentation, is affected by this.


Custom build environment

We maintain a set of modifications to the toolchain to perform our experiments. Commit notifications are sent to a dedicated mailing list.

Our modified packages can be found in the following APT archive, which is signed by 49B6 5747 36D0 B637 CC37 01EA 5DB7 CA67 EA59 A31F:

deb http://reproducible.alioth.debian.org/debian/ ./
deb-src http://reproducible.alioth.debian.org/debian/ ./

debhelper

The pu/reproducible_builds debhelper branch in the reproducible project contains:

  • dh_fixmtimes: a helper that will make the mtimes of control.tar and data.tar deterministic. It will be run by dh before building the .deb.

  • dh_strip_nondeterminism (see below) will be called before dh_installdeb in dh.

  • dh_genbuildinfo: a helper to generate .buildinfo control files as described above.

dpkg

The pu/reproducible_builds dpkg branch in the reproducible repository:

  1. makes file order deterministic in the control and data parts of the .deb,
  2. uses a single timestamp for .deb ar members,
  3. presets the aforementioned timestamp to the date of the latest changelog entry,
  4. adds -Wdate-time to CPPFLAGS in dpkg-buildflags.

strip-nondeterminism

strip-nondeterminism is a post-processing tool that will normalize various file types. dh_strip_nondeterminism will be run by debhelper at the end of the build process.

dh-python

dh-python needs a patch for stable ordering of control variables. See 759231 and the pu/reproducible_builds branch.

discount

discount needs a patch to produce stable output of email addresses. See 762622 and the pu/reproducible_builds branch.

Usage example

If you already have pbuilder set up, it's fairly easy to set up an environment with the custom toolchain:

# Work on a copy of the base tarball so the regular one stays untouched.
sudo cp /var/cache/pbuilder/base.tgz /var/cache/pbuilder/base-reproducible.tgz
sudo pbuilder --login --save-after-exec --basetgz /var/cache/pbuilder/base-reproducible.tgz
# The following commands are typed inside the chroot shell opened by pbuilder.
echo 'deb http://reproducible.alioth.debian.org/debian/ ./' > /etc/apt/sources.list.d/reproducible.list
apt-get update
apt-get install dpkg dpkg-dev debhelper dh-python discount
exit 0

Once that's done, rebuilding a package can be done through:

apt-get source --download-only acl
sudo DEB_BUILD_OPTIONS=nocheck pbuilder --build --debbuildopts '-b' --basetgz /var/cache/pbuilder/base-reproducible.tgz acl_*.dsc
mkdir b1 b2
dcmd cp /var/cache/pbuilder/result/acl_*.changes b1
sudo dcmd rm /var/cache/pbuilder/result/acl_*.changes
sudo DEB_BUILD_OPTIONS=nocheck pbuilder --build --debbuildopts '-b' --basetgz /var/cache/pbuilder/base-reproducible.tgz acl_*.dsc
dcmd cp /var/cache/pbuilder/result/acl_*.changes b2
sudo dcmd rm /var/cache/pbuilder/result/acl_*.changes

debbindiff (available in Debian main) is useful to check the result:

debbindiff --html $output_file b1/*.changes b2/*.changes
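Before reading a full debbindiff report, a quick yes/no answer can be obtained by comparing checksums. The helper below is a hypothetical sketch, not part of debbindiff or of the project's scripts; it assumes both directories contain the same set of .deb file names and that sha256sum from GNU coreutils is available.

```shell
#!/bin/sh
# Succeed only if the .deb files in two result directories are
# byte-for-byte identical (same names, same SHA-256 checksums).
same_debs() {
    first=$(cd "$1" && sha256sum -- *.deb) || return 1
    second=$(cd "$2" && sha256sum -- *.deb) || return 1
    [ -n "$first" ] && [ "$first" = "$second" ]
}

# Usage sketch with the b1 and b2 directories from the rebuild commands:
#   same_debs b1 b2 && echo reproducible || echo different
```

When the check fails, debbindiff remains the tool to use to find out where the two builds differ.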

Adding a package to the APT archive

On alioth.debian.org:

  1. Import the private signing key to your keyring, if you haven't already: gpg --import /home/groups/reproducible/private/reproducible-private.gpg

  2. Place the package files in /home/groups/reproducible/htdocs/debian/

  3. Run make from that directory.


bash script to compare two package builds

Usage: ./diffp r1/hello_2.8-4_amd64.changes r2/hello_2.8-4_amd64

The script is available in the misc.git repository.


Further work

Having reproducible builds allows us to trust binary packages better, because it becomes easier to have:

  • diversity of buildd location and jurisdiction - build packages in more than one location, including the developer's
  • diversity of buildd hardware, in case of hardware bugs, or malicious implants - a mix of VMs, some real hardware, different CPU manufacturers, different date of manufacture and supplier
  • diversity of people - multiple signatures on a .changes file
  • diversity of kernels, explained below

Kernel packages

Special features of kernel packages (including bootloaders and hypervisors) - GRUB2, Xen, linux, kfreebsd...

  • we put huge trust in them - kernels are the ultimate target of any rootkit, able to completely hide from userland
  • a kernel image built for amd64, if the build system is portable and reproducible enough, will be the same whether built from linux-amd64 or kfreebsd-amd64
  • or maybe from different kernel versions - for example, a jessie build chroot on a wheezy host system

Then we would be better protected from something that could affect many systems at once, such as a kernel vulnerability; or widespread infection by a rootkit, which now must be compatible with more than one type of kernel to go unnoticed.


References

Presentations

Publicity

This section lists URLs, people, and dates for when other people have publicly expressed interest in, or shared information about, the project.

Related projects

  • CARE monitors the execution of the specified command to create an archive that contains all the material required to re-execute it in the same context.