Differences between revisions 10 and 11
Revision 10 as of 2017-02-14 16:49:31
Size: 13138
Editor: Infinity0
Comment: deduplicate some more wording
Revision 11 as of 2017-02-14 19:04:29
Size: 13147
Editor: Infinity0
Comment: clarify "this"

To make many more software packages reproducible in practice today, we have introduced some new standard environment variables for build tools to consume, so that their outputs become reproducible. This article discusses the reasons behind our choice of mechanism, and describes the situations in which it is or is not ideal to use it.

These variables are meant to minimise the work required across the whole software community to achieve reproducible builds by default. However, they do add a bit of complexity to certain build tools. Furthermore, builders that wish to verify the reproducibility of a package are burdened with the additional requirement of setting an environment variable to a specific value.

So before implementing support for one of these variables, you should go through the following checklist to see if you can avoid doing so:

Checklist

Do I really need to use this?

  1. See if strip-nondeterminism will get rid of the difference for you, or if it's easy to add this functionality to it. If yes, then you don't need this variable.

  2. See if you can patch the tool that generates this information, to simply not generate this information in the first place. If this is a good idea and the maintainer accepts your idea, then you don't need this variable.
  3. See if you can patch the tool that generates this information, to instead take this from an easier-to-reproduce input that is already present in any parent buildsystem that it expects to be part of. If this is the case and the maintainer accepts your idea, then you don't need this variable. For example, some programming languages, such as Java and Go, have a very rigid structure for where they expect source files to be kept. In these cases, full paths to source files need not be recorded in the build output; paths relative to the source root(s) should be (and are) used instead. It may be possible to set up a similar mechanism even for other languages that don't have this feature.
  4. See if you can patch the tool that generates this information, to instead take this from the standard environment variables that we have designed, which would be set centrally to reproducible values by your distribution. This is what this document is about.

  5. See if you can patch the package that uses the tool, to give it some tool-specific options like CLI arguments or its own environment variables, either to avoid generating this information or to generate a reproducible value. For example, if doing out-of-tree builds with GNU autotools, you must call ./configure using a relative instead of an absolute path.

    In certain cases this may be a good solution, such as when the tool in question is the top-level buildsystem of the package and the options to be used are not too verbose. However, in most cases this should be avoided in favour of the previous options, since you will have to do this for every package that uses the tool. See #tool-specific-args for more discussion.
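As a rough illustration of item 3 in the checklist, the sketch below records a source path relative to a source root instead of the full path. The helper name `recorded_path` and the example directories are hypothetical and are not taken from any real build tool.

```python
import os.path


def recorded_path(source_file, source_root):
    """Return the path to embed in build output for source_file.

    Hypothetical helper: files under source_root are recorded relative to
    it; anything outside the root is returned as an absolute path.
    """
    source_file = os.path.abspath(source_file)
    source_root = os.path.abspath(source_root)
    if os.path.commonpath([source_file, source_root]) == source_root:
        return os.path.relpath(source_file, source_root)
    return source_file


# Two rebuilders using different build directories record the same path.
print(recorded_path("/build/alice/pkg-1.0/src/utils.c", "/build/alice/pkg-1.0"))
print(recorded_path("/tmp/bob/pkg-1.0/src/utils.c", "/tmp/bob/pkg-1.0"))
```

Since the recorded value no longer depends on where the source tree was unpacked, no variable needs to be set by the rebuilder in this case.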

Rationale

During our investigations into reproducing different software packages, we have noticed many common sources of irreproducibility. Among the most significant are timestamps and full filesystem paths that are tied to a single build. Instead of patching many thousands of packages (sometimes with uncooperative upstreams), we wondered if there was a better solution.

Fixing the information

One way to make something unreproducible reproducible is to fix (i.e. make constant) the source of irreproducibility at a system-wide level. This is relatively easy to do, but has several downsides:

It breaks some build systems. For example:

  • Fixing the system clock to some constant time T breaks various build tools such as GNU make.

It does not always remove the irreproducibility. For example:

  • Setting the system clock to some constant time T at the start of the build, then letting it run normally, does not achieve reproducibility in the cases where the build takes a non-deterministic amount of time and something embeds "the current time" near the end of the build process.

It restricts the user in major ways. For example:

  • Mandating that builds take place in a standardised absolute build path prevents parallel builds from being done without extra system-level support (e.g. virtualisation or separate kernel namespaces), and prevents users without permission to use that path from verifying reproducibility.

Ignoring the information

Another way to make something reproducible is to patch packages or build tools to completely ignore the information. This is hard to do, since one must patch every single package, and it also has a few downsides:

It breaks some functionality. For example:

  • Ignoring path information in debugging output makes it much harder to select the right source file to display to the end user when debugging, especially when debugging a large program with many dependencies that contain files with the same base name - "utils.c", etc.
  • Ignoring build dates when generating documentation means that users no longer get an indication of "how old" their software is.

This sort of loss of functionality means that software maintainers have to make a trade-off between keeping some functionality and getting reproducible builds. Inevitably, some maintainers choose against reproducible builds, and that sets us back in the push for universal reproducibility. We at the reproducible builds project may think reproducibility is more important than these "minor" functions, but people that need to debug large programs a lot (or developers that imagine they need to support such people) might equally regard reproducible builds as the "minor" concern.

From a "software philosophy" viewpoint, reproducible builds shouldn't force us to compromise on anything else. Indeed, if we simply output "0" all the time, this is a "reproducible build", but it is not very useful. This is an extreme example, but as I have said elsewhere, exploring corner cases and extreme examples helps us understand reality better. An analogous example is often given in security courses: the "most secure" system is one that is destroyed, shut down, or that literally nobody can access. So when we do security we need to preserve as much of the existing functionality as possible, otherwise a significant fraction of people will always choose against security.

Redirecting the information

"Redirecting" means patching packages to take information not from hard-to-reproduce system sources, but from an alternative source that any rebuilder can easily control when they try to reproduce a package.

It is usually best to patch build tools rather than end-user packages, since it is the build tools themselves that embed the unreproducible output in the first place, and there are fewer build tools than end-user packages, so it costs much less time and effort.

Writing this patch is harder than writing the patch for "ignoring the information", since we are doing something more complex. However, we believe the cost-benefit tradeoff clearly favours this approach in the case of timestamps (several hundred packages) and full filesystem paths (several thousand packages). It also has the added benefit that upstream maintainers are more likely to accept this type of patch, because it doesn't reduce the original functionality.

Nevertheless, we want to emphasise that this is not meant to be a "universal mechanism" for achieving reproducible builds. In particular, the rebuilder is burdened with having to set appropriate values for the alternative sources of information. It would be good if a package could be reproducible without having to resort to this mechanism, so that rebuilders unaware of this mechanism can still reproduce a package. See the checklist at the top of this page for other approaches that do not involve this mechanism.

Standard environment variables

Our choice of "alternative source" is that of system environment variables. These are easily controlled by any rebuilder, are available on every platform, and are read in the same way by all build tools (unlike command line options, for example).

We have chosen names that have never been used before for any other purpose, so their presence should not cause additional irreproducibility. For example, certain build scripts save the build-time CFLAGS into the final output, which is a reasonable thing to do since these affect other parts of the output, and CFLAGS is a widely-known environment variable. On the other hand, saving arbitrary environment variables is not a reasonable thing to do, but we already test for and fix this, so these new variables should not cause further irreproducibility.

Some people feel uneasy at taking input from yet another environment variable - or simply at taking input from environment variables in general. To this, we reply that the program is already taking input from "an environment", i.e. the system environment including the clock, filesystem, and other things; and that our mechanism merely redirects this to a more reproducible environment.

Finally, the general abstract concept of taking input from "an environment" is also not "against" ideas of functional programming or deterministic build processes. Environments are useful for writing more succinct code. Without one, you have to declare the same inputs to every function, either (a) across many functions in your own layer or (b) through many functions down the stack in order to pass higher-level information down into a lower-level function, where the middle functions do nothing with these inputs other than pass them on. Instead of doing all of that, you use "an environment", which in some codebases is called a "context", and in pure functional programming languages is often implemented using the Reader monad.
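The "context" style described above might look roughly like the following sketch. The `BuildContext` class and the three functions are hypothetical names invented for this illustration; the point is only that the build-wide inputs travel together in one read-only value, and that the output remains a deterministic function of that value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildContext:
    """Hypothetical read-only environment shared by every build step."""
    timestamp: int
    source_root: str


def render_footer(ctx):
    # The lowest layer is the only one that actually reads the context.
    return f"built at {ctx.timestamp} from {ctx.source_root}"


def render_page(ctx, body):
    # The middle layer passes the context along without inspecting it.
    return body + "\n" + render_footer(ctx)


def build_docs(ctx, pages):
    # The top layer hands the same context to every page.
    return [render_page(ctx, body) for body in pages]


ctx = BuildContext(timestamp=0, source_root=".")
print(build_docs(ctx, ["Introduction", "Usage"]))
```

Adding another build-wide input here would mean adding a field to `BuildContext`, without changing the signatures of the intermediate functions.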

Our variables are:

| Page | Variable name | Original hard-to-reproduce information | New easy-to-reproduce information |
|---|---|---|---|
| ../TimestampsProposal | SOURCE_DATE_EPOCH | Time of the build | Time of the last modification to the source code |
| ../BuildPathProposal | BUILD_PATH_PREFIX_MAP | Full filesystem path of build files, including the extracted source code. | Partial filesystem path of build files, relative to some standard run-time directory defined by the original builder. |
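As a minimal sketch of how a build tool might consume the first of these variables, the code below reads SOURCE_DATE_EPOCH and falls back to the current time when it is unset. It assumes the value is an integer number of seconds since the Unix epoch; the linked proposal page remains the authority on the exact format. The function names are hypothetical.

```python
import os
import time


def build_timestamp(environ=os.environ):
    """Return the Unix timestamp a tool should embed in its output.

    Assumption: SOURCE_DATE_EPOCH, when set, holds an integer number of
    seconds since the Unix epoch. Otherwise the current time is used.
    """
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is None:
        return int(time.time())
    return int(value)


def format_build_date(timestamp):
    """Format the timestamp in UTC so the local timezone does not leak in."""
    return time.strftime("%Y-%m-%d", time.gmtime(timestamp))


# A rebuilder sets the variable; the embedded date no longer depends on
# when the build actually runs.
print(format_build_date(build_timestamp({"SOURCE_DATE_EPOCH": "86400"})))
```

Formatting in UTC is a deliberate choice in this sketch: formatting in local time would reintroduce a dependency on the build machine's timezone.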

More detailed discussion

Sometimes developers of build tools do not want to support our environment variable, or they tweak the suggestion into something related but different. We really do think the best approach is to use the environment variable exactly as described in our proposal, without any variation. Here we explain our reasoning against the arguments we have encountered:

"We'll add a command-line flag instead"

Build tool developers dislike the variable name or the mechanism of environment variables, and prefer to use command line options, perhaps to be more consistent with the rest of their program.

We understand the desire to avoid inconsistency, or to avoid supporting seemingly arbitrary environment variables. However, timestamps and full filesystem paths are by far the largest issues that prevent reproducible builds from becoming reality, and every tool supporting a standardised mechanism would greatly lower the cost of fixing them. If every tool supported a different mechanism to do the same thing, the overall cost would be much higher. In other words, we ask that you weigh global consistency across many different projects above any smaller-scale inconsistency in your own project.

From another point of view: we agree that it is easy for users of your tool to add a command-line option, but then they have to specifically think about reproducible builds. This is not, and ought not to be, in the minds of most software developers. Reproducible builds should be the default situation for all software, and we shouldn't burden developers with these trivialities. Opt-in security is not security, because nobody will make the effort to opt in even if in general they think it's a good idea. (This is not a paradox; people like getting things for free.)

Furthermore, some buildsystems or build scripts like to embed the arguments passed to an inner tool, somewhere in the overall build output. So if the option itself contains unreproducible data (e.g. to tell the inner tool to remove it from its output), it would still make the overall output unreproducible. Whether this is a bug depends on the situation, but usually this is a perfectly reasonable thing to do, in which case a command-line option to the inner tool "to add reproducibility" to its own output, would cause more irreproducibility elsewhere.
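The problem described above can be sketched as follows. Everything here is hypothetical: `inner_tool`, `outer_build`, and the `--strip-prefix` option are invented for illustration and do not correspond to any real tool or flag.

```python
import hashlib


def inner_tool(source, strip_prefix=""):
    """Hypothetical inner tool: removes strip_prefix from embedded paths."""
    return source.replace(strip_prefix, "") if strip_prefix else source


def outer_build(source, inner_args):
    """Hypothetical outer build script that records the inner invocation."""
    strip_prefix = inner_args.get("--strip-prefix", "")
    artefact = inner_tool(source, strip_prefix)
    invocation = " ".join(f"{k}={v}" for k, v in sorted(inner_args.items()))
    return f"# built with: {invocation}\n{artefact}"


def digest(text):
    return hashlib.sha256(text.encode()).hexdigest()


alice = outer_build("file: /build/alice/src/utils.c",
                    {"--strip-prefix": "/build/alice/"})
bob = outer_build("file: /tmp/bob/src/utils.c",
                  {"--strip-prefix": "/tmp/bob/"})

# The inner tool's own output is identical for both builders...
print(alice.splitlines()[1] == bob.splitlines()[1])
# ...but the recorded invocation differs, so the overall output does too.
print(digest(alice) == digest(bob))
```

In this sketch the option value itself carries the build path, so recording the invocation reintroduces exactly the data the option was meant to remove.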