Unifying the process to strip files with problematic copyright from upstream tarballs

Existing hacks that deal with the problem to some extent

Many packages call a script from the get-orig-source debian/rules target. Some other maintainers let uscan call it from debian/watch.

Daniel Leidert calls a script from debian/watch.

Mike Hommey also calls a script from debian/watch. It filters the tarball while it is being downloaded, without extracting anything to disk. The set of files to remove is passed through a separate file that supports wildcards and extra sed-like filters. This may seem a pointless optimization, but for huge source tarballs (say 80MB bzipped) and slow download links, the whole process is about as fast as downloading alone. See this example.

Another variant is used by pkg-perl, documented here.

There is also the --filter-pristine-tar option to git-import-orig. See this gbp.conf example. git-import-orig may later be modified to ignore files excluded by debian/copyright, as uscan does. It already handles changing the compression scheme.

Proposed triggering of repackaging a tarball

The idea is to add information to the source package that triggers a repackaging process automatically. It seems to be accepted that debian/copyright is a reasonable file in which to specify the needed information because:

  1. it is machine readable (or at least it should be, according to DEP5);

  2. the reasons for removal are usually copyright related.

When the tarball is repacked, +dfsg is appended to the version string to express the fact that the content of the original source was changed. This suffix should be configurable, in case upstream re-releases the same upstream version repackaged to fix a purely tarball-related issue. To prevent uscan from repackaging automatically, pass the --no-exclusion command-line option or set the USCAN_NO_EXCLUSION variable in /etc/devscripts.conf or ~/.devscripts.
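The configurable suffix can be pictured with a minimal sketch. The function name repacked_version and the alternative suffix used in the usage note are hypothetical, chosen only for illustration; this is not how uscan implements it.

```python
def repacked_version(upstream_version, suffix="+dfsg"):
    """Return the version string of a repacked tarball.

    Hypothetical helper: the suffix marks that the content of the
    original source was changed, and the caller may override it.
    """
    return upstream_version + suffix
```

With the default, repacked_version("2.5") gives "2.5+dfsg"; a maintainer who needs a different marker could pass suffix="+dfsg2" or any other value.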

Ideally, the deletion could be executed from outside uscan too, in case the upstream tarball is generated from a VCS repository and uscan is never called. This will only be useful until uscan understands all VCS kinds in the world.

The question of where to specify the removals has caused many discussions, see #561494 and later a long debian-devel thread that finally led to the creation of this Wiki page. It seems that a consensus is reached (TM) to group into a single place the information about where files are copied from and whether, and why, they may or may not be redistributed. debian/copyright seems a natural candidate, even if its name suggests something less general.

It may be useful to let Lintian produce a warning when a file designated for removal still exists in the source package.

Options for deleted files specification

Files-Excluded

This option was discussed in a long debian-devel thread. There is a uscan implementation, maintained by Andreas Tille in this git repository, that is based on that discussion.

This implementation relies on a new Files-Excluded: pattern field in the debian/copyright format. The pattern is searched in the top directory with find -name if it contains no slash, with find -path if it contains one, then all matching files or directories are removed from the repackaged tarball.
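The slash rule above can be sketched as follows. This only illustrates the rule as described here and is not the code from the git repository; the function name find_args is hypothetical, and prefixing the pattern with the top directory is an assumption made because find -path matches against the whole path, including the starting point.

```python
import shlex

def find_args(pattern, top="."):
    """Build a find(1) command line for one Files-Excluded pattern."""
    if "/" in pattern:
        # Assumption: anchor the pattern at the top directory, since
        # find -path compares against paths such as ./lib/ReadSeq.jar.
        return ["find", top, "-path", f"{top}/{pattern}"]
    # No slash: match the base name anywhere below the top directory.
    return ["find", top, "-name", pattern]

if __name__ == "__main__":
    for pattern in ["*.jar", "lib/ReadSeq.jar"]:
        # shlex.join quotes the wildcard so the shell does not expand it.
        print(shlex.join(find_args(pattern)))
```

Returning an argument list rather than a shell string keeps wildcards away from the shell, so they reach find unexpanded.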

not-shipped-by-debian

Another solution may be considered, as in this experimental implementation. The latest debian/copyright format allows defining sets of files sharing the same license by successive exclusions. Existing parsers and glob syntax may be reused if a fake license is defined, meaning that the maintainer wants some files out of the Debian tarball. The list of accepted license abbreviations in the 1.0 copyright format should be updated. Here is an example debian/copyright. Excluded patterns are listed in separate paragraphs to demonstrate per-file-set comments. In real life, the full text of GPL3+ would be in a separate paragraph.

Files: *
License: GPL3+
 Full license text.

Files: __MACOSX */__MACOSX
License: not-shipped-by-debian
 Optionally explain here why __MACOSX are rejected.

Files: *.jar
License: not-shipped-by-debian
 Optionally explain here why jar files are rejected.

Files: rdp_classifier_2.5/lib/ReadSeq.jar
License: GPL3+
 Full license text.
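How the example above resolves can be sketched in a few lines. In the 1.0 copyright format the last Files paragraph that matches a path applies to it, which is what lets ReadSeq.jar escape the *.jar exclusion. The names PARAGRAPHS, license_for and is_excluded are hypothetical. Python's fnmatchcase is used as a stand-in for the format's glob syntax: like the format, it lets * match across slashes, but it also interprets brackets, which the copyright format does not.

```python
from fnmatch import fnmatchcase

# (patterns, license short name) pairs, in debian/copyright order.
PARAGRAPHS = [
    (["*"], "GPL3+"),
    (["__MACOSX", "*/__MACOSX"], "not-shipped-by-debian"),
    (["*.jar"], "not-shipped-by-debian"),
    (["rdp_classifier_2.5/lib/ReadSeq.jar"], "GPL3+"),
]

def license_for(path, paragraphs=PARAGRAPHS):
    """Return the license of the last paragraph matching path, or None."""
    result = None
    for patterns, license_name in paragraphs:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            result = license_name  # later paragraphs override earlier ones
    return result

def is_excluded(path):
    """True if the path should be dropped from the Debian tarball."""
    return license_for(path) == "not-shipped-by-debian"
```

Under these assumptions, is_excluded("rdp_classifier_2.5/lib/ReadSeq.jar") is false while any other jar file and both __MACOSX forms are excluded.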

Considerations about debian/copyright pattern specification

The thread showed that understanding the format is quite difficult. The next revision of the format should explicitly mention that a pattern ending with / or beginning with ./ will never match anything (#688481).

Information for developers

TODO: brackets in debian/copyright patterns should be escaped before being passed to find.
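One possible way to handle this TODO is sketched below. The function name escape_brackets is hypothetical, and the sketch assumes the pattern does not already contain backslash-escaped brackets (the copyright format only defines the escapes \*, \? and \\).

```python
import re

def escape_brackets(pattern):
    """Backslash-escape [ and ] so find does not read them as a character class."""
    # In the replacement, \\ is a literal backslash and \1 is the matched bracket.
    return re.sub(r"([\[\]])", r"\\\1", pattern)
```

Other wildcard characters are left untouched, so * and ? keep their meaning when the result is passed to find -name or find -path.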

Debian::Copyright is packaged as libdebian-copyright-perl.

Parse::DebControl is packaged as libparse-debcontrol-perl and used in devscripts.

Dpkg::Control::Hash is packaged as libdpkg-perl and used in devscripts.

The first seems too strict about non-standard fields. The latter two seem so similar that the eventual choice may be deferred to the uscan maintainer.