Differences between revisions 1 and 34 (spanning 33 versions)
Revision 1 as of 2016-10-06 16:53:56
Size: 214
Editor: Infinity0
Comment: Testing "template" system
Revision 34 as of 2017-04-11 20:26:06
Size: 9297
Editor: Infinity0
Comment:
<<Include(ReproducibleBuilds/BuildPathProposal, , to="^{{{build_time_information}}}$")>>
the full path of a source file
<<Include(ReproducibleBuilds/BuildPathProposal, , from="^{{{build_time_information}}}$")>>
/!\ A [[https://reproducible-builds.org/specs/build-path-prefix-map/|DRAFT version]] of the spec is available.

This is a standard that defines an environment variable, `BUILD_PATH_PREFIX_MAP`, that distributions can set centrally and that build tools can consume in order to produce reproducible output.

Before implementing this, you should scan through [[../StandardEnvironmentVariables#Checklist|our checklist]] to see if you can avoid implementing it.

<<TableOfContents()>>

== Proposal ==

Please read our [[https://reproducible-builds.org/specs/build-path-prefix-map/|DRAFT specification]] for details.

See [[../StandardEnvironmentVariables|Standard Environment Variables]] for more detailed discussion of the rationales behind this mechanism.

Below we also have [[#More_detailed_discussion|more detailed discussion]] about this specific variable, as well as documentation on [[#history-and-alternatives|history and alternative proposals]].


== Example code ==

See our [[https://anonscm.debian.org/cgit/reproducible/build-path-prefix-map-spec.git|git repo]] for example code in various languages.

TODO:

* link to build tools to be patched


== Implementation notes ==

=== Language-specific ===

Since our encoding only deals with ASCII-compatible characters, and UTF-16 uses surrogate pairs to encode code points not in the BMP, it should be possible to implement our encoding by "naively" operating on string units, regardless of whether a unit is an 8-bit octet (e.g. POSIX C), a 16-bit `wchar_t` (e.g. Windows C++), or an actual decoded Unicode code point (e.g. Python 3). However, in practice this is only possible when your language provides APIs that either do not attempt to automatically decode environment variables or filesystem paths, or that do so in a reversible (non-standard) way.

For example, in Python 3, `os.getenv` and the path functions normally return a unicode string (where each unit is a decoded Unicode code point), unless you specifically use `os.getenvb` instead (available only where the OS environment is bytes, i.e. POSIX) or give "bytes"-type path arguments.
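
A minimal sketch of the two access paths follows. `read_env_units` and `EXAMPLE_VARIABLE` are hypothetical names chosen for illustration; they are not part of the spec or of Python.

```python
import os

def read_env_units(name):
    """Return an environment variable in the platform's native unit type.

    Hypothetical helper for illustration: bytes where Python exposes a
    bytes environment (POSIX), str elsewhere (e.g. Windows).
    """
    if os.supports_bytes_environ:
        # os.getenvb takes and returns bytes; it only exists on such platforms.
        return os.getenvb(os.fsencode(name))
    return os.getenv(name)

os.environ["EXAMPLE_VARIABLE"] = "/build/dir"
decoded = os.getenv("EXAMPLE_VARIABLE")      # always a str
native = read_env_units("EXAMPLE_VARIABLE")  # bytes on POSIX, str otherwise
```

Code that takes this route needs one implementation per unit type, which is the duplication the next paragraph avoids.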

Luckily, on Python 3.3+ one can implement our encoding in a cross-platform way without duplicating code. Paths and environment variables are still presented as (decoded) Unicode strings. However, on POSIX, where the underlying OS values are bytes, each byte that cannot be UTF-8 decoded to valid Unicode is instead decoded into a lone "low surrogate" code point, which never occurs in well-formed Unicode text. Python calls this the `surrogateescape` error handler. Encoding the resulting string back into bytes with the same handler restores the original byte values; these are invalid UTF-8, but that doesn't matter to a POSIX OS. Therefore, it is correct to implement a "naive" algorithm that operates on Python unicode strings even when the OS type is bytes, and the benefit is that the same code will also work on Windows.
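
The round trip can be sketched as follows, assuming UTF-8 as the encoding. `remap_prefix` is a hypothetical stand-in for "some naive string operation"; it is not the algorithm from the spec.

```python
def remap_prefix(path, old, new):
    """Replace a leading `old` with `new` (hypothetical helper, plain str ops)."""
    return new + path[len(old):] if path.startswith(old) else path

# 0xE9 is "e acute" in Latin-1; on its own it is not valid UTF-8.
raw = b"/build/caf\xe9/src/main.c"

# The undecodable byte 0xE9 becomes the lone low surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")
escaped = text[10]

# Operate naively on the str, then encode with the same handler.
mapped = remap_prefix(text, "/build", ".")
result = mapped.encode("utf-8", "surrogateescape")
```

On POSIX, `os.fsdecode` and `os.fsencode` apply this handler with the filesystem encoding, so values read through `os.environ` already arrive in this form.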

In Rust, by contrast, the `OsString` type is platform-dependent and opaque; one must write platform-specific code to convert it either to an array of `u8` (for POSIX) or to an array of `u16` (for Windows). In the latter case, `u16` units that are invalid UTF-16 are represented internally as [[https://simonsapin.github.io/wtf-8/|WTF-8]], but this is only an implementation detail not exposed to Rust stdlib API users.

In NodeJS (as of v4.6.1), non-UTF-8 bytes in environment variables are ''not supported'': they get replaced by U+FFFD instead. If you need to map non-UTF-8 paths, it is best to file a bug against NodeJS.

Our [[https://github.com/infinity0/rb-prefix-map/tree/master/consume/testcases|test cases]] include a non-UTF-8 case, so you can test how to make this work (or not) in your favourite language. Unfortunately, we do not yet have invalid UTF-16 test cases for Windows.


=== Transmitting these values ===

Our encoding only transforms sequences of printable ASCII characters. If you have reason to believe that you need to escape or encode your file paths before transmitting them across your chosen medium, e.g. because they contain non-printable or non-ASCII characters, it should suffice to simply apply the same escape or encoding mechanism to this environment variable as well. This is an entirely separate concern from anything else mentioned in this document, and the code to do this should be clearly separated from code that implements this document.
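
As a sketch of that separation, assume a hypothetical transport that can only carry ASCII, with Base64 as an arbitrarily chosen wire encoding. The names `to_wire`, `from_wire` and `pack_message` are illustrative, and the map value is an opaque placeholder rather than real syntax.

```python
import base64

def to_wire(value: bytes) -> str:
    """Transport-level encoding; knows nothing about prefix maps."""
    return base64.b64encode(value).decode("ascii")

def from_wire(text: str) -> bytes:
    """Inverse of to_wire."""
    return base64.b64decode(text.encode("ascii"))

def pack_message(path: bytes, prefix_map: bytes) -> dict:
    # The same wrapper is applied to the path and to the variable's value.
    return {
        "path": to_wire(path),
        "BUILD_PATH_PREFIX_MAP": to_wire(prefix_map),
    }

message = pack_message(b"/build/caf\xe9/main.c", b"opaque-map-value")
```

The code that interprets the variable only ever sees the output of `from_wire`, so neither layer needs to know about the other.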


== More detailed discussion ==

(See [[ReproducibleBuilds/StandardEnvironmentVariables#more-detailed-discussion|Standard Environment Variables]] for general arguments.)

=== Comparison to SOURCE_DATE_EPOCH ===

`SOURCE_DATE_EPOCH`'s underlying information (date of last modification) is a property of the source code, and is therefore a constant reproducible value by definition. By contrast, `BUILD_PATH_PREFIX_MAP`'s underlying information (maps from paths to other paths) is not ''itself'' a property of the source code.

Rather, it is roughly the "difference" between build-time path information and a property of the source code. In other words, build tools read paths from the filesystem, then "subtract" `BUILD_PATH_PREFIX_MAP` from them to get reproducible paths. These might be relative paths, or they could be abstract absolute paths that might not exist on the build machine, but could exist (based on a contract with other tools) on the end user's run-time machine. It is this latter information that is output, instead of the build-time path information.

Why don't we use a reproducible value directly? In fact some buildsystems can do this, and they don't need this variable. For example, some buildsystems and programming languages force a very rigid structure to the source tree, and either both the high-level and low-level build tools are able to determine the relative paths of every source file, or else the high-level tools detect this information and pass it down into the low-level tools.

However, this is only easy to arrange for vertically-integrated build stacks where the whole stack is controlled by just a few parties. For C/C++ and other languages, there are several different buildsystems that each want to work for several different compilers at the same time, so there is a disincentive to add special logic for tool-specific command-line options.

For example, GCC does in fact support a `-fdebug-prefix-map` option where a high-level build tool (or human) can supply information on "what to subtract". But historically, the only use-case that was imagined for this was debug info (hence its name), and the map does not apply to other things like `__FILE__` macros. It's unlikely that a higher-level buildsystem would want to spend the effort to detect appropriate values for this automatically, merely to have nicer debuginfo. More generally, command-line based solutions are hard for us, the Reproducible Builds project, to scale across the whole FOSS ecosystem; see [[../StandardEnvironmentVariables#tool-specific-args|"We'll add a command line flag instead"]] for more discussion.

This environment variable can therefore also act as a ''standard interface'' between parent buildsystems and low-level build tools. The former can pass the latter information about the path structure of the overall software project. This can even be done without co-ordination: unlike command-line options, programs usually ignore environment variables that they don't recognise. So a parent buildsystem can pre-emptively choose to set this variable, even if not all of its child build tools can read it yet.
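
The "without co-ordination" point can be sketched with two hypothetical child tools, written here as inline Python one-liners: one reads the variable, the other has never heard of it. The value is a placeholder, not real syntax; the spec defines the actual format.

```python
import os
import subprocess
import sys

# Hypothetical child tools for illustration only.
AWARE_TOOL = "import os; print(os.environ.get('BUILD_PATH_PREFIX_MAP', 'unset'))"
UNAWARE_TOOL = "print('built without looking at the environment')"

def run_tool(source, prefix_map):
    """Run a child tool with BUILD_PATH_PREFIX_MAP set pre-emptively."""
    env = dict(os.environ, BUILD_PATH_PREFIX_MAP=prefix_map)
    done = subprocess.run(
        [sys.executable, "-c", source],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise if the child exits with an error
    )
    return done.stdout.strip()

PLACEHOLDER = "placeholder-value"  # assumption: opaque stand-in, see the spec
aware_output = run_tool(AWARE_TOOL, PLACEHOLDER)
unaware_output = run_tool(UNAWARE_TOOL, PLACEHOLDER)
```

Both children run successfully. Had the same information been passed as an unknown command-line option, the second tool would typically have rejected it.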

Similarly, our usage of `SOURCE_DATE_EPOCH` so far has been for the distribution's package builder program to set this value in an envvar, but it is conceivable that buildsystems could set this themselves, e.g. by reading it from VCS or a `ChangeLog` file in the source tarball, for lower-level build tools to consume.
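
A buildsystem that sets the value itself would presumably still defer to a value set further up the stack. A minimal sketch, in which `export_source_date_epoch` is a hypothetical helper and `fallback_epoch` stands for whatever the buildsystem derived from VCS or a `ChangeLog` (how it is derived is out of scope here):

```python
def export_source_date_epoch(environ, fallback_epoch):
    """Keep a value the caller already set; otherwise export the fallback.

    `environ` is any mutable mapping, e.g. os.environ.
    """
    environ.setdefault("SOURCE_DATE_EPOCH", str(int(fallback_epoch)))
    return int(environ["SOURCE_DATE_EPOCH"])

# Plain dicts stand in for os.environ so the sketch has no side effects.
unset_env = {}
preset_env = {"SOURCE_DATE_EPOCH": "1475772836"}

from_fallback = export_source_date_epoch(unset_env, 1491942366)
from_caller = export_source_date_epoch(preset_env, 1491942366)
```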

<<Anchor(history-and-alternatives)>>
== History and alternative proposals ==

FIXME: stuff about `-fdebug-prefix-map`, `DW_AT_producer`, etc.

=== Rejected options ===

 * Simple-split using semi-common characters like `:`, because it loses the ability to map paths containing those characters. For example, if we wanted to remap paths to `$package-$version/$relative-path`, Debian versions sometimes contain `:` as the epoch separator.

 * Simple-split using never-in-path characters like `\t` or `0x1E RECORD SEPARATOR`, because they make the output unconditionally non-printable.

 * Any variant of backslash-escape, because it's not clean to implement in high-level languages. (Need to use regex or an explicit loop.)

 * Any variant of hex-encoding, because different languages decode hex codes >127 in different ways when inserting them back into a string.

 * Any variant of url-encoding: as for hex-encoding, and additionally because the original perceived gain (being able to use existing decoders) did not work out in the end:

    * Extra characters like `+` `;` need to be encoded.

    * Decoders in many languages only decode to a `{ key → value list }`; there is no easy way to turn this into a list-of-pairs that preserves the original ordering.

 * Mapping `%` into `%%` (or `\` into `\\`, etc), because this causes differences when decoding sequences like `%%+` via different strategies.
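
Two of the failure modes in the list above can be sketched in a few lines. Everything here is hypothetical and for illustration only: the `source=target` layout, the escape pair in which `%+` stands for `:`, and all function names. None of it is the syntax adopted by the spec.

```python
def split_on_colon(value):
    """Naive simple-split on ':' and then '=' (hypothetical layout)."""
    return [tuple(item.split("=", 1)) for item in value.split(":")]

# The target contains the Debian version "1:2.0-1", whose epoch uses ':'.
broken = split_on_colon("/build/foo=foo-1:2.0-1")

# Hypothetical escapes: "%%" stands for "%", "%+" stands for ":".
def decode_plus_first(s):
    return s.replace("%+", ":").replace("%%", "%")

def decode_percent_first(s):
    return s.replace("%%", "%").replace("%+", ":")

def decode_scan(s):
    """Single left-to-right pass that consumes one escape pair at a time."""
    table = {"%%": "%", "%+": ":"}
    out, i = [], 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

decoders = (decode_plus_first, decode_percent_first, decode_scan)
results = [decode("%%+") for decode in decoders]
```

Here `broken` is `[("/build/foo", "foo-1"), ("2.0-1",)]`, so the single mapping has been cut in two, and `results` is `["%:", ":", "%+"]`: three plausible decoders, three different answers for the same input.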

----
'''Footnotes:'''
