Unifying the process to strip files with problematic copyright from upstream tarballs
Existing hacks that deal with the problem to some extent
Many packages call a script from the get-orig-source debian/rules target. Some other maintainers let uscan call it from debian/watch.
Mike Hommey also calls a script from debian/watch that filters the tarball while it is being downloaded, without extracting it to disk. The set of files to remove is passed through a separate file, which supports wildcards and extra (sed-like) filters. This may look like a pointless optimization, but for huge source tarballs (say 80MB bzipped) and slow download links, the whole process is about as fast as downloading alone. See this example.
There is also the --filter-pristine-tar option to git-import-orig. See this gbp.conf example. Git-import-orig may later be modified to ignore files excluded by debian/copyright, as uscan does. It already handles changing the compression scheme.
Proposed triggering of repackaging a tarball
The idea is to add information to the source package that triggers a repackaging process automatically. Where to specify the removals has caused many discussions; see #561494 and, later, a long debian-devel thread that finally led to the creation of this Wiki page. It seems that a consensus is reached (TM) to group in a single place the information about where files are copied from and why they are or are not allowed to be redistributed. debian/copyright seems a natural candidate, even if its name suggests something less general. Both of the following implementation suggestions are based on this consensus.
When a tarball is repacked, the suffix +dfsg is appended to the version string to express the fact that the content of the original source was changed. This suffix should be configurable, in case upstream re-releases the same upstream version repackaged to fix a purely tarball-related issue. To prevent uscan from repackaging automatically, the --no-exclusion option may be given on the command line, or the USCAN_NO_EXCLUSION variable may be set in /etc/devscripts.conf or ~/.devscripts.
Ideally, the deletion could be executed from outside uscan too, in case the upstream tarball is generated from a VCS repository and uscan is never called. This will only be useful until uscan understands all VCS kinds in the world.
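As a minimal sketch of the version handling described above: the function name and its default argument are hypothetical and are not taken from uscan. It only shows a configurable suffix being appended to the upstream version.

```python
def repacked_version(upstream_version, suffix="+dfsg"):
    """Return the version string to use for a repacked tarball.

    `suffix` is configurable so that a maintainer can pick something
    other than the default, for example when upstream re-releases the
    same version with a repackaged tarball.
    """
    return upstream_version + suffix


if __name__ == "__main__":
    print(repacked_version("1.0.3"))
    print(repacked_version("1.0.3", suffix="+dfsg2"))
```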
It may be useful to let Lintian produce a warning when a file designated for removal still exists in the source package.
Options for deleted files specification
This option was discussed in the debian-devel thread mentioned above. There is a uscan implementation, maintained by Andreas Tille in this git repository, that is based on the discussion in that thread.
This implementation relies on a new Files-Excluded: pattern field in the debian/copyright format. The pattern is searched for using find -path. (Remark: originally the implementation was a mix of find -name and find -path, which was fixed in January 2013 following the patch suggested by Nicolas Boulenguez.)
Example debian/copyright file:

Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: Spread
Source: https://github.com/phylogeography/SPREAD/downloads
Files-Excluded: *.jar release/Mac release/Windows release/tools bin classes .git
(Please also read the section below on what happens once we start removing files.)
One drawback of the Files-Excluded method was mentioned: there is no reasonable way to attach a comment to each file (or rather each pattern) explaining why the file(s) were removed.
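The following Python sketch illustrates how a Files-Excluded field could be read and applied. It is not the code from the git repository mentioned above. The function names are hypothetical, paths are assumed to be relative to the top-level source directory, and fnmatch is used as a stand-in for find -path (in both, * also matches slashes). A path counts as excluded when the path itself or one of its parent directories matches, which mimics removing a matched directory recursively.

```python
import fnmatch


def parse_files_excluded(copyright_text):
    """Collect the patterns of Files-Excluded from the header paragraph."""
    patterns = []
    in_field = False
    for line in copyright_text.strip().splitlines():
        if not line.strip():
            break  # the header paragraph ends at the first blank line
        if line.lower().startswith("files-excluded:"):
            in_field = True
            patterns.extend(line.split(":", 1)[1].split())
        elif in_field and line[0] in " \t":
            patterns.extend(line.split())  # continuation line
        else:
            in_field = False
    return patterns


def is_excluded(path, patterns):
    """True if the path or one of its parent directories matches a pattern."""
    parts = path.strip("/").split("/")
    prefixes = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return any(
        fnmatch.fnmatchcase(prefix, pattern)
        for prefix in prefixes
        for pattern in patterns
    )


if __name__ == "__main__":
    text = (
        "Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/\n"
        "Upstream-Name: Spread\n"
        "Files-Excluded: *.jar release/Mac release/Windows release/tools bin classes .git\n"
    )
    patterns = parse_files_excluded(text)
    for path in ("release/Mac/app.dmg", "lib/foo.jar", "src/Main.java"):
        print(path, is_excluded(path, patterns))
```

Note that fnmatch also treats brackets as metacharacters, so the escaping concern raised under "Information for developers" applies to this sketch as well.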
Another solution may be considered, as in this experimental implementation. The latest debian/copyright format allows defining sets of files sharing the same license by successive exclusions. Existing parsers and glob syntax may be reused if a fake license is defined, meaning that the maintainer wants some files out of the Debian tarball. The list of accepted license abbreviations in the 1.0 copyright format should be updated. Here is an example debian/copyright. Excluded patterns are separated to demonstrate per-file-set comments. In real life, "Text of GPL3+" would be in a separate paragraph.

Files: *
License: GPL3+
 Full license text.

Files: __MACOSX */__MACOSX
License: not-shipped-by-debian
 Optionally explain here why __MACOSX are rejected.

Files: *.jar deps
License: not-shipped-by-debian
 Optionally explain here why most jar files and precompiled libs are rejected.

Files: rdp_classifier_2.5/lib/ReadSeq.jar deps/Linux-deps/README.TXT
License: GPL3+
 Full license text.

The successive exclusions make it possible to remove a whole subdirectory tree except for one file, and to remove every file matching some pattern except for one.
A single License: not-shipped-by-debian stanza at the end of debian/copyright is equivalent to a Files-Excluded field containing the same pattern.
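The "successive exclusions" rule can be sketched in Python as follows. This is an illustration, not the experimental implementation itself. It assumes that the last Files stanza matching a path decides its fate, and that a pattern naming a directory (such as deps) also covers everything below it. The stanza list is written by hand here; a real tool would obtain it from one of the debian/copyright parsers.

```python
import fnmatch

NOT_SHIPPED = "not-shipped-by-debian"


def matches(path, pattern):
    """True if the path or one of its parent directories matches the pattern."""
    parts = path.split("/")
    prefixes = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return any(fnmatch.fnmatchcase(prefix, pattern) for prefix in prefixes)


def license_for(path, stanzas):
    """Return the license of the last stanza whose patterns match the path."""
    result = None
    for patterns, license_name in stanzas:
        if any(matches(path, pattern) for pattern in patterns):
            result = license_name
    return result


def files_to_remove(paths, stanzas):
    """Select the paths whose effective license is the fake 'not shipped' one."""
    return [p for p in paths if license_for(p, stanzas) == NOT_SHIPPED]


if __name__ == "__main__":
    stanzas = [
        (["*"], "GPL3+"),
        (["__MACOSX", "*/__MACOSX"], NOT_SHIPPED),
        (["*.jar", "deps"], NOT_SHIPPED),
        (["rdp_classifier_2.5/lib/ReadSeq.jar",
          "deps/Linux-deps/README.TXT"], "GPL3+"),
    ]
    paths = [
        "src/main.c",
        "lib/foo.jar",
        "rdp_classifier_2.5/lib/ReadSeq.jar",
        "deps/Linux-deps/libfoo.so",
        "deps/Linux-deps/README.TXT",
        "doc/__MACOSX/junk",
    ]
    print(files_to_remove(paths, stanzas))
```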
Once we start removing files
The current implementation in the git repository from Andreas Tille does a bit more once repackaging becomes necessary:
Removing VCS cruft from tarball
When repackaging, tar --exclude-vcs is used. Usually there is no point in having VCS metainformation in upstream tarballs. Whether this option should be used unconditionally should be debated in a separate thread, but this is how the current implementation behaves. So in the example above the specification of .git is redundant, because it will be left out anyway.
Specifying better compression method
You can specify a more reasonable compression method using uscan --repack-compression <compression>. You can use xz, bz2, gz, or lzma here. The current default is gz; the author is tempted to change the default to xz.
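The two repackaging steps (dropping VCS cruft and choosing the compression) can be sketched together in Python with the standard tarfile module. This is only an illustration: the current implementation uses tar --exclude-vcs, the set of names below is a partial, hand-picked approximation of what that option skips, and the function names are hypothetical. The compression argument accepts the tarfile modes gz, bz2 and xz.

```python
import io
import tarfile

# Partial list of VCS files and directories (an assumption, not tar's exact list).
VCS_NAMES = {
    ".git", ".gitignore", ".gitattributes", ".gitmodules",
    ".svn", ".hg", ".hgignore", ".hgtags",
    ".bzr", ".bzrignore", "CVS", ".cvsignore", "_darcs",
}


def is_vcs_member(name):
    """True if any component of the member path is a known VCS name."""
    return any(part in VCS_NAMES for part in name.split("/"))


def repack_without_vcs(src, dst, compression="gz"):
    """Copy members from the tarball in `src` to `dst`, skipping VCS cruft.

    `src` and `dst` are binary file objects; `src` must be seekable.
    """
    with tarfile.open(fileobj=src, mode="r:*") as tin, \
            tarfile.open(fileobj=dst, mode="w:" + compression) as tout:
        for member in tin:
            if is_vcs_member(member.name):
                continue
            data = tin.extractfile(member) if member.isfile() else None
            tout.addfile(member, data)


if __name__ == "__main__":
    # Build a tiny gzip tarball in memory, then repack it as xz.
    src = io.BytesIO()
    with tarfile.open(fileobj=src, mode="w:gz") as tar:
        for name in ("pkg/src/a.c", "pkg/.git/config"):
            info = tarfile.TarInfo(name)
            info.size = 1
            tar.addfile(info, io.BytesIO(b"x"))
    src.seek(0)
    dst = io.BytesIO()
    repack_without_vcs(src, dst, compression="xz")
    dst.seek(0)
    with tarfile.open(fileobj=dst, mode="r:*") as tar:
        print(tar.getnames())
```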
Considerations about debian/copyright pattern specification
The thread showed that understanding the format is quite difficult. The next revision should explicitly mention that a pattern ending with / or beginning with ./ will never match anything (#688481).
Information for developers
Brackets in debian/copyright patterns should be escaped before being passed to find, as they are metacharacters for find but not in debian/copyright. Also, some shell metacharacters should be escaped (consider the "$(evil_command)" pattern). The actual unlink/rmdir actions should be echoed depending on command line options/environment/debug level. All this should be checked once both implementations have been merged.
Parsers for debian/copyright
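A minimal Python sketch of the two escaping concerns, with hypothetical function names: brackets are backslash-escaped before the pattern reaches find, and the command is built as an argument list, so that no shell ever interprets something like "$(evil_command)". shlex.quote is used only to echo the command in a form that is safe to copy and paste.

```python
import shlex


def escape_brackets(pattern):
    """Backslash-escape [ and ], which find -path treats as metacharacters."""
    return pattern.replace("[", r"\[").replace("]", r"\]")


def find_argv(root, pattern):
    """Build a find command as an argument list (never passed to a shell)."""
    return ["find", root, "-path", root + "/" + escape_brackets(pattern), "-print"]


def describe(argv):
    """Render the command for echoing, quoting every shell metacharacter."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    for pattern in ("data[1].bin", "$(evil_command)"):
        print(describe(find_argv(".", pattern)))
```

The argument list can be handed to subprocess.run without shell=True, in which case the pattern reaches find exactly as written.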
Debian::Copyright packaged as libdebian-copyright-perl. It may be too strict about non-standard fields.
Parse::DebControl packaged as libparse-debcontrol-perl, used in devscripts.
Dpkg::Control::Hash packaged as libdpkg-perl, used in devscripts.
These two seem so similar that the eventual choice may be deferred to the uscan maintainer.
Config::Model has a module for debian/copyright, packaged in libconfig-model-perl.