Status: draft

Problem

As of now dpkg does not keep track of pathname metadata. The information is there in some places, but it gets lost on the way. Properly tracking metadata all the way, would make it possible to fix some long standing issues, such as:

The metadata is (or should be) present in three different stages of a package life-cycle (although the deployment will start from the most internal to the most external; db → bin → src):

Source packages

Most source packages contain already some kind of upstream manifest that specifies what goes into the binary packages, but those are not standardized and are in most cases programmatic manifests (via the upstream build systems install-style target) instead of declarative manifests.

To generate the data.tar archives for the .deb packages, w/o needing root, we need a template manifest that specifies at least the owner:group, permissions, etc. and the list of files, whether they are conffiles, or other classes of files (such as logs, caches or similar), and possibly extended attributes. Some of the other data should be always taken from the filesystem, such as timestamp, size, etc.

The template should support some form of globbing or pattern matching to avoid repetition. It should also allow specifying defaults/globals for the metadata to also avoid repetition.

One strong candidate is mtree(5), although one problem with it is that none of the known implementations support arbitrary metadata, so it ends up not being very extensible, so the appeal of using this format so that "standard" tools can be used kind of disappears. Or a new format could be devised perhaps based off mtree(5), the rpm specs format, or similar.

The current best option is to use mtree(5) 2.0 (as defined in libarchive) as the base format, and extend the key/values as needed, even if that means not being able to use stock mtree(1) commands.

Binary packages

Binary packages contain all basic metadata as transported by the tar entry header. Additional metadata is transported via control files in the .deb control member area, such as digests or the list of conffiles.

While this is one of the less problematic parts, as the metadata is already there. It still presents the problem that it is not transparently extensible, as each new control file needs support from dpkg so that its data gets imported into the database.

Possible solutions would be to either:

Installed database

The dpkg database currently only tracks pathnames, possibly their md5sums and whether those are conffiles or not.

We need a manifest format that is extensible, and contains the pathname and a list of key/value pairs. This would take over at least the .list, .conffiles, and .md5sums files. Initially each package would have one such manifest entry, but eventually the whole filesystem tree would be tracked in a single file, so we fix the long-standing loading time slowdown inflicted by the massive amount of seeking. This will require handling possibly conflicting metadata for shared pathnames such as directories or ref-counted files. We'll also need some kind of journal (similar to the control updates mechanism) to avoid having to write the whole filesystem metadata file on each package operation affecting the filesystem.

The current WIP branch is using mtree(5) 2.0 for the internal dpkg db, and as a first step will make it possible to track all metadata available from the current tar archives. So a simple reinstall will make fill the dpkg db with the metadata.

The deployment of this has several problems, that need to be handled first: