As of now dpkg does not keep track of pathname metadata. The information is there in some places, but it gets lost on the way. Properly tracking metadata all the way would make it possible to fix some long-standing issues, such as:
- Performing consistency checks between filesystem vs database («dpkg -V»)
- Restoring out-of-sync metadata on the filesystem (new «dpkg -M»?)
- Listing dpkg pathname metadata (new «dpkg-query --mtree»?)
- Building binary packages w/o requiring root/fakeroot («dpkg-deb -b»)
- Tracking spurious and volatile pathnames so that dpkg knows about them and takes care of their removal (caches, logs, etc)
- Possible automatic handling of symlink ←→ dir pathname switches (because we'd know if this is package or admin induced)
- Making it possible to set extended attributes on the filesystem
- Making it possible to generate installation media w/o requiring root/fakeroot (by using an unprivileged installation tree and the metadata manifests)
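As an illustration of the first item, a filesystem vs database consistency check could look roughly like the following Python sketch. The manifest layout (a mapping from pathname to optional `type` and `mode` values) and the names `file_type` and `verify` are assumptions made for this example, not dpkg's actual data structures or interfaces.

```python
import os
import stat


def file_type(st_mode):
    """Map a stat mode to a coarse type name (illustrative names only)."""
    if stat.S_ISDIR(st_mode):
        return "dir"
    if stat.S_ISLNK(st_mode):
        return "link"
    if stat.S_ISREG(st_mode):
        return "file"
    return "other"


def verify(root, manifest):
    """Compare recorded metadata against the filesystem under root.

    manifest maps a relative pathname to a dict with optional 'type'
    and 'mode' keys (a hypothetical layout for this sketch).
    Returns a list of (pathname, problem) tuples.
    """
    problems = []
    for relpath, meta in sorted(manifest.items()):
        try:
            # lstat so that symlinks are checked themselves, not their targets.
            st = os.lstat(os.path.join(root, relpath))
        except FileNotFoundError:
            problems.append((relpath, "missing"))
            continue
        actual_type = file_type(st.st_mode)
        if "type" in meta and actual_type != meta["type"]:
            problems.append((relpath, f"type {actual_type} != {meta['type']}"))
        actual_mode = stat.S_IMODE(st.st_mode)
        if "mode" in meta and actual_mode != meta["mode"]:
            problems.append(
                (relpath, f"mode {actual_mode:04o} != {meta['mode']:04o}")
            )
    return problems
```

Restoring out-of-sync metadata would be the same walk with a chmod/chown step instead of a report, which is why both features depend on the database carrying this information in the first place.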
The metadata is (or should be) present in three different stages of a package life-cycle (although the deployment will start from the most internal to the most external; db → bin → src):
Most source packages already contain some kind of upstream manifest that specifies what goes into the binary packages, but those are not standardized and are in most cases programmatic manifests (via the upstream build system's install-style target) instead of declarative manifests.
To generate the data.tar archives for the .deb packages w/o needing root, we need a template manifest that specifies at least the owner:group, permissions, etc., the list of files and whether they are conffiles or other classes of files (such as logs, caches or similar), and possibly extended attributes. Some of the other data should always be taken from the filesystem, such as timestamp, size, etc.
The template should support some form of globbing or pattern matching to avoid repetition. It should also allow specifying defaults/globals for the metadata to also avoid repetition.
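A minimal sketch of the idea, in Python: global defaults plus pattern rules are resolved per pathname, and the result is written into the tar entry header directly, so no root or fakeroot is needed to get root-owned entries. The rule layout, the `class` key, the numeric ids and the helper names `resolve` and `add_file` are all assumptions for illustration. Note that `fnmatch` patterns let `*` match across `/`, which a real template format might not want.

```python
import fnmatch
import io
import tarfile

# Hypothetical template: global defaults plus ordered (pattern, overrides)
# rules. All values here are illustrative.
DEFAULTS = {"uname": "root", "gname": "root", "uid": 0, "gid": 0, "mode": 0o644}
RULES = [
    ("usr/bin/*", {"mode": 0o755}),
    ("etc/*", {"class": "conffile"}),
    ("var/log/*", {"gname": "adm", "gid": 4, "mode": 0o640, "class": "log"}),
]


def resolve(pathname, defaults=DEFAULTS, rules=RULES):
    """Return the metadata for pathname: defaults, then every matching rule."""
    meta = dict(defaults)
    for pattern, overrides in rules:
        if fnmatch.fnmatchcase(pathname, pattern):
            meta.update(overrides)
    return meta


def add_file(tar, pathname, data):
    """Add a regular file whose ownership and mode come from the template."""
    meta = resolve(pathname)
    info = tarfile.TarInfo(name=pathname)
    info.size = len(data)
    info.mode = meta["mode"]
    info.uid, info.gid = meta["uid"], meta["gid"]
    info.uname, info.gname = meta["uname"], meta["gname"]
    tar.addfile(info, io.BytesIO(data))


if __name__ == "__main__":
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        add_file(tar, "usr/bin/tool", b"#!/bin/sh\n")
        add_file(tar, "var/log/tool.log", b"")
    buf.seek(0)
    with tarfile.open(fileobj=buf) as tar:
        tar.list(verbose=True)
```

In this sketch later rules override earlier ones; whether a real template should use first-match or last-match semantics is an open design choice.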
One strong candidate is mtree(5). One problem with it is that none of the known implementations support arbitrary metadata, so it ends up not being very extensible, and the appeal of using this format so that "standard" tools can be used largely disappears. Alternatively, a new format could be devised, perhaps based off mtree(5), the rpm spec format, or similar.
The current best option is to use mtree(5) 2.0 (as defined in libarchive) as the base format, and extend the key/values as needed, even if that means not being able to use stock mtree(1) commands.
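To make the shape of such an extended manifest concrete, here is a Python sketch that parses a simplified subset of the mtree(5) syntax: comments, `/set` and `/unset` defaults, and one entry per line with `key=value` words. The `dpkg-class` keyword is a hypothetical extension invented for this example. Line continuations and the octal escapes used for special characters in pathnames are deliberately not handled.

```python
SAMPLE = """\
#mtree
/set type=file uname=root gname=root mode=0644
./etc/foo.conf dpkg-class=conffile
./usr/bin/foo mode=0755
./var/log/foo.log gname=adm mode=0640 dpkg-class=log
"""


def parse_keywords(words):
    """Turn ['key=value', ...] into a dict; a bare key maps to ''."""
    return dict(word.partition("=")[::2] for word in words)


def parse_mtree(text):
    """Parse a simplified mtree subset into {pathname: {key: value}}."""
    defaults = {}
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        words = line.split()
        if words[0] == "/set":
            defaults.update(parse_keywords(words[1:]))
        elif words[0] == "/unset":
            for key in words[1:]:
                defaults.pop(key, None)
        else:
            # Per-entry keywords override the current /set defaults.
            entries[words[0]] = {**defaults, **parse_keywords(words[1:])}
    return entries


if __name__ == "__main__":
    for pathname, meta in parse_mtree(SAMPLE).items():
        print(pathname, meta)
```

Unknown keywords such as `dpkg-class` pass through this parser untouched, which is exactly what stock tools would not guarantee.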
Binary packages contain all basic metadata as transported by the tar entry header. Additional metadata is transported via control files in the .deb control member area, such as digests or the list of conffiles.
While this is one of the less problematic parts, as the metadata is already there, it still presents the problem that it is not transparently extensible, as each new control file needs support from dpkg so that its data gets imported into the database.
Possible solutions would be to either:
- Use PAX formatted tar entries, but:
  - Those imply a non-insignificant increase in the tar size, due to the overhead of the additional tar entries for the tar extended headers.
  - The format support is not universal in all systems where dpkg can be used.
  - On systems where tar does not support the PAX extended headers, manually extracting via ar+tar (or dpkg-deb) would create a mess of files representing the PAX extended headers (as per the spec).
  - The filesystem xattrs have not been universally supported nor universally enabled even when they were supported.
  - On systems where filesystem xattrs are not supported or enabled, manually extracting via ar+tar or dpkg-deb would mean that those xattrs would get lost.
  - Old dpkg-deb will fail hard on the new unknown tar entry types (the new PAX 'g' and 'x' extended header entries).
  - We'd be switching most metadata to the new transport format, and most of it needs to end up in the dpkg db and is not appropriate for the files in the filesystem, so possibly having lots of duplicated information as xattrs does not make sense.
  - The PAX format does not have an officially defined keyword namespace for xattrs. Several vendor namespaces are currently in use, none with universal adoption, such as LIBARCHIVE and SCHILY, the latter being somewhat undesirable given previous bad history within Debian with the author.
- Use a new extensible control file manifest (current candidate mtree(5) 2.0).
  - This manifest must not duplicate metadata already present in the tar header, otherwise manual extraction (with ar+tar) would diverge, and extraction using an older dpkg-deb would also diverge.
  - This manifest could be a restricted version of the source package manifest template, so that globbing/patterns are not allowed.
  - There is the problem of whether to accept any unknown key silently, warn or fail hard.
    - If the file is generated automatically then it might always have a good form (barring bugs), but if it is manually crafted, then typoed keywords might creep in, in which case a solution might be to lint them, either with a new dpkg command or with lintian.
    - Accepting random keys makes it easy for 3rd-parties to extend and experiment, but it might become a support and extension headache whenever dpkg needs to claim one of the keywords w/o stomping on keys used unofficially.
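The three unknown-key policies discussed above (accept silently, warn, fail hard) can be sketched in Python as follows. The set of known keys is a small subset of mtree(5) keywords picked for illustration, and the `x-` prefix for unofficial keys is a purely hypothetical convention, shown only as one way to keep 3rd-party keys out of the namespace dpkg might later want to claim.

```python
import difflib

# A subset of mtree(5) keywords, chosen for illustration only.
KNOWN_KEYS = {
    "type", "mode", "uname", "gname", "uid", "gid",
    "size", "time", "link", "sha256",
}
# Hypothetical convention: unofficial keys carry a reserved prefix.
VENDOR_PREFIX = "x-"


def check_keys(keys, policy="warn"):
    """Check manifest keys against the known set.

    policy is 'ignore' (accept silently), 'warn' (return messages)
    or 'fail' (raise ValueError on the first batch of unknown keys).
    """
    messages = []
    for key in keys:
        if key in KNOWN_KEYS or key.startswith(VENDOR_PREFIX):
            continue
        message = f"unknown key '{key}'"
        # Suggest the closest known keyword to help catch typos.
        close = difflib.get_close_matches(key, sorted(KNOWN_KEYS), n=1)
        if close:
            message += f" (did you mean '{close[0]}'?)"
        messages.append(message)
    if policy == "ignore" or not messages:
        return []
    if policy == "fail":
        raise ValueError("; ".join(messages))
    return messages


if __name__ == "__main__":
    for line in check_keys(["mode", "gnmae", "x-vendor-note"]):
        print("warning:", line)
```

A lint-style tool would likely run with the `warn` policy, while dpkg itself could stay on `ignore` to remain forward compatible; that split is a suggestion, not something the design has settled.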
The dpkg database currently only tracks pathnames, possibly their md5sums and whether those are conffiles or not.
We need a manifest format that is extensible, and contains the pathname and a list of key/value pairs. This would take over at least the .list, .conffiles, and .md5sums files. Initially each package would have one such manifest, but eventually the whole filesystem tree would be tracked in a single file, which would fix the long-standing loading time slowdown inflicted by the massive amount of seeking. This will require handling possibly conflicting metadata for shared pathnames such as directories or ref-counted files. We'll also need some kind of journal (similar to the control updates mechanism) to avoid having to write the whole filesystem metadata file on each package operation affecting the filesystem.
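The shared-pathname problem can be sketched in Python like this: per-package manifests are folded into one tree, each pathname keeps the list of packages referencing it, and disagreeing values are reported instead of silently overwritten. The data layout and the names `merge_manifests` and `remove_package` are assumptions for this example; the journal is not modelled here.

```python
def merge_manifests(packages):
    """Fold {package: {pathname: metadata}} into a single tree.

    Returns (tree, conflicts). tree maps each pathname to
    {'meta': {...}, 'owners': [package, ...]}; conflicts lists
    (pathname, key, package) for every value that disagrees with
    what an earlier package already recorded.
    """
    tree = {}
    conflicts = []
    for pkg in sorted(packages):
        for pathname, meta in packages[pkg].items():
            node = tree.get(pathname)
            if node is None:
                tree[pathname] = {"meta": dict(meta), "owners": [pkg]}
                continue
            node["owners"].append(pkg)
            for key, value in meta.items():
                # The first recorded value wins; later mismatches are reported.
                if node["meta"].setdefault(key, value) != value:
                    conflicts.append((pathname, key, pkg))
    return tree, conflicts


def remove_package(tree, pkg):
    """Drop pkg's references; return pathnames left without any owner."""
    orphaned = []
    for pathname in list(tree):
        owners = tree[pathname]["owners"]
        if pkg in owners:
            owners.remove(pkg)
            if not owners:
                del tree[pathname]
                orphaned.append(pathname)
    return orphaned
```

For example, two packages shipping `./usr/share/doc` with different modes would both appear in `owners`, the mismatch would show up in `conflicts`, and removing one of them would leave the directory in place because its reference count has not dropped to zero.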
The current WIP branch is using mtree(5) 2.0 for the internal dpkg db, and as a first step will make it possible to track all metadata available from the current tar archives. So a simple reinstall will fill the dpkg db with the metadata.
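As a rough picture of that first step, the following Python sketch turns the entries of a tar archive into mtree-style lines using only what the tar headers carry. It is not the code from the WIP branch; the function name `tar_to_mtree` and the choice of keywords are assumptions, sub-second timestamps are dropped, and pathnames containing whitespace would need the mtree escaping that is omitted here.

```python
import tarfile

# Mapping from tar entry types to mtree type names; hard links are
# recorded as plain files in this sketch.
TYPE_NAMES = {
    tarfile.REGTYPE: "file",
    tarfile.AREGTYPE: "file",
    tarfile.LNKTYPE: "file",
    tarfile.DIRTYPE: "dir",
    tarfile.SYMTYPE: "link",
    tarfile.CHRTYPE: "char",
    tarfile.BLKTYPE: "block",
    tarfile.FIFOTYPE: "fifo",
}


def tar_to_mtree(tar):
    """Return mtree-style lines describing every entry of an open TarFile."""
    lines = ["#mtree"]
    for info in tar:
        name = info.name if info.name.startswith("./") else "./" + info.name
        words = [name, "type=" + TYPE_NAMES[info.type]]
        # Prefer symbolic names; fall back to numeric ids when absent.
        words.append(f"uname={info.uname}" if info.uname else f"uid={info.uid}")
        words.append(f"gname={info.gname}" if info.gname else f"gid={info.gid}")
        words.append(f"mode={info.mode:04o}")
        # Whole seconds only, padded to nine fractional digits.
        words.append(f"time={int(info.mtime)}.000000000")
        if info.isreg():
            words.append(f"size={info.size}")
        if info.issym():
            words.append(f"link={info.linkname}")
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    import sys

    # Usage: pass the path to an already extracted data.tar member.
    with tarfile.open(sys.argv[1]) as archive:
        print("\n".join(tar_to_mtree(archive)))
```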