dedup.debian.net

The Debian duplication detector is a service that scans binary Debian packages and records hashes of the regular files they contain. It can then discover files that are shipped in multiple packages, or multiple times in one package, and that could possibly be replaced by links to save space. Another use case is discovering embedded copies of code written in scripting languages.
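The idea can be tried locally on an unpacked binary package. The following is only a rough sketch, not the service's actual implementation: the function name list_duplicates is made up for illustration, SHA-256 is an assumed hash choice, and the uniq options require GNU coreutils.

```shell
# Hypothetical helper: print groups of regular files with identical content
# found under the directory given as $1 (e.g. an unpacked binary package).
# Assumption: SHA-256 is used here; the service's own hash is not stated.
list_duplicates() {
    # -type f skips symlinks; hard links to one inode are still listed.
    find "$1" -type f -exec sha256sum {} + \
        | sort \
        | uniq -w 64 --all-repeated=separate
}
```

For example, list_duplicates debian/mypackage/ prints one block of lines per set of identical files, with blocks separated by a blank line.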

FAQ

Q: The PTS says that my package foo shares data with itself. How can that be?

A: This can happen when your package ships multiple copies of the same file. Those copies consume space on disk and on the mirrors multiple times. Both hard links and soft links are properly detected and are not reported as duplication. See the section "Within a single binary package, using jdupes".

Q: Why is there a sharing notice in the todo section of my package at all? I cannot do anything about it.

A: The current heuristic is to list packages in which shared files account for at least 1 MB and at least 10% of the installed size. Being a heuristic means that it can be wrong. To get the notice removed for your particular package, report a bug (see "Bugs and known issues").
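As an illustration, the threshold described above could be checked as follows. This is a sketch under assumptions: the function name exceeds_sharing_threshold is hypothetical, both sizes are taken in bytes, and "1 MB" is assumed to mean 1048576 bytes, which may differ from what the service actually uses.

```shell
# Hypothetical check mirroring the stated heuristic.
# $1 = shared size in bytes, $2 = installed size in bytes.
# Assumption: 1 MB is taken as 1048576 bytes.
exceeds_sharing_threshold() {
    shared=$1
    installed=$2
    # Both conditions must hold: at least 1 MB and at least 10% shared.
    [ "$shared" -ge 1048576 ] && [ $((shared * 10)) -ge "$installed" ]
}
```

A package with 2 MB shared out of 10 MB installed would be listed, while one with 2 MB shared out of 100 MB installed would not.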

Tips for reducing duplication in packages

Within a single binary package, using jdupes

If the software accessing the duplicate files supports symlinks, add the following Build-Depends in debian/control

Build-Depends:...
              jdupes,

Then run the following command from debian/rules after the files have been installed by make install or similar.

# Replace duplicate files with relative symlinks
jdupes -rl debian/mypackage/

Alternatively, use rdfind together with symlinks. Again, if the software accessing the duplicate files supports symlinks, add the following Build-Depends in debian/control

Build-Depends:...
              rdfind,
              symlinks

Then run the following commands from debian/rules after the files have been installed by make install or similar.

# Replace duplicate files with symlinks
rdfind -outputname /dev/null -makesymlinks true debian/mypackage/
# Fix those symlinks to make them relative
symlinks -r -s -c debian/mypackage/

An example package using this technique is megaglest.
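After replacing duplicates with symlinks, it can be worth checking that no absolute symlinks are left in the package tree. A minimal sketch, assuming GNU find (for -lname); the function name list_absolute_symlinks is hypothetical:

```shell
# Hypothetical check: print every symlink under $1 whose target is an
# absolute path (i.e. starts with "/"). Requires GNU find for -lname.
list_absolute_symlinks() {
    find "$1" -type l -lname '/*'
}
```

Running list_absolute_symlinks debian/mypackage/ should print nothing once all symlinks have been made relative.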

Within multiple binary packages from a single source package

If the duplicated files are significant, you might want to pool them in a foo-common package and have the other binary packages depend on that. If there is one particular package required by all other packages, consider using dh_installdocs --link-doc=foo-common.
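To get a rough idea of which files would be candidates for such a common package, the unpacked trees of two binary packages can be compared by content. This is only a sketch: the function name list_shared_files is made up for illustration, and SHA-256 is an assumed hash choice.

```shell
# Hypothetical helper: list files under directory $1 whose content also
# occurs somewhere under directory $2.
list_shared_files() {
    hashes_b=$(mktemp)
    # Collect the distinct content hashes found in the second tree.
    find "$2" -type f -exec sha256sum {} + | cut -c1-64 | sort -u > "$hashes_b"
    # Keep only files from the first tree whose hash is in that set.
    find "$1" -type f -exec sha256sum {} + | grep -F -f "$hashes_b" || true
    rm -f "$hashes_b"
}
```

For example, list_shared_files debian/foo-gui debian/foo-cli would print the hash and path of each file in the first tree that is duplicated in the second; the two directory names here are placeholders.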

Within multiple binary packages from multiple source packages

You should co-ordinate with the maintainers of the source packages and come up with a solution.

Where the files are from embedded copies of other projects, the other projects should be packaged separately and the packages containing them should drop the files and depend on the new packages.

The dh-linktree helper can assist with replacing embedded copies by symbolic links to files in other packages.

Talks

DebConf13 lightning talk: dedup.debian.net intro by Helmut Grohne: slides, video (starting at 15:55)

Bugs and known issues

If you discover a bug or want a new feature, email helmut@subdivi.de.

A known limitation is that the PTS reports shared files between different versions of the same software. At the moment wesnoth and python are filtered out via regular expressions. If more filters are needed, report a bug.

Thanks!

dedup helped me find an LGPL violation in apt-offline-gui due to it copying icons from oxygen-icon-theme without also copying the SVG source of the icons. Thanks!

-- Paul Wise

Ideas