Differences between revisions 1 and 2
Revision 1 as of 2017-03-16 05:02:05
Size: 3606
Editor: GuillemJover
Comment: Add early draft of metadata tracking spec
Revision 2 as of 2017-03-17 16:57:08
Size: 4162
Editor: GuillemJover
Comment: Update problems being solved
Status: draft

Problem

As of now dpkg does not keep track of pathname metadata. The information is available in some places, but it gets lost along the way. Tracking metadata properly all the way through would make it possible to fix some long-standing issues, such as:

  • Performing consistency checks between the filesystem and the database («dpkg -V»)
  • Restoring out-of-sync metadata on the filesystem (new «dpkg -M»?)
  • Listing dpkg pathname metadata (new «dpkg-query --mtree»?)
  • Building binary packages w/o requiring root/fakeroot («dpkg-deb -b»)
  • Tracking spurious and volatile pathnames so that dpkg knows about them and takes care of their removal (caches, logs, etc.)
  • Possible automatic handling of symlink ←→ dir pathname switches (because we'd know if this is package or admin induced)
  • Making it possible to generate installation media w/o requiring root/fakeroot (by using an unprivileged installation tree and the metadata manifests)
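
As an illustration of the consistency check in the first item above, here is a minimal sketch that compares recorded metadata against the filesystem. The manifest entry layout (a dict with 'type', 'mode', 'uid' and 'gid' keys) and the function names are assumptions made for this example, not an existing dpkg interface.

```python
import os
import stat
import tempfile


def check_path(path, expected):
    """Return the sorted metadata keys whose on-disk value differs.

    `expected` is a hypothetical manifest entry: a dict that may contain
    'type', 'mode' (permission bits), 'uid' and 'gid'.
    """
    st = os.lstat(path)  # lstat: inspect symlinks themselves, do not follow
    if stat.S_ISDIR(st.st_mode):
        kind = "dir"
    elif stat.S_ISLNK(st.st_mode):
        kind = "link"
    else:
        kind = "file"  # simplification: everything else is reported as a file
    actual = {
        "type": kind,
        "mode": stat.S_IMODE(st.st_mode),
        "uid": st.st_uid,
        "gid": st.st_gid,
    }
    # Only keys present in the recorded entry are compared.
    return sorted(k for k, v in expected.items() if actual.get(k) != v)


def demo():
    """Create a file, record its metadata, then let it drift out of sync."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "example.conf")
        with open(path, "w") as fh:
            fh.write("x\n")
        os.chmod(path, 0o644)
        recorded = {"type": "file", "mode": 0o644}
        before = check_path(path, recorded)
        os.chmod(path, 0o600)  # simulate an admin changing the permissions
        after = check_path(path, recorded)
        return before, after
```

Restoring out-of-sync metadata would be the inverse operation: for each key reported by check_path(), apply the recorded value back to the filesystem.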

The metadata is (or should be) present in three different stages of a package life-cycle:

Source packages

Source packages contain some kind of manifest that specifies what goes into the binary packages, but these are not standardized and are in most cases programmatic (via the upstream build system's install-style target) instead of declarative (via some kind of manifest).

To generate the data.tar archives for the .deb packages without needing root, we need a template manifest that specifies at least the list of files with their owner:group, permissions, and extended attributes, and whether they are conffiles or belong to other classes of files, such as logs or caches. Other data, such as timestamps and sizes, should be taken from the filesystem.

The template should support some form of globbing or pattern matching to avoid repetition. It should also allow specifying defaults/globals for the metadata to also avoid repetition.
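
A minimal sketch of how pattern matching and global defaults could combine, assuming a hypothetical template made of (pattern, overrides) rules applied in order. The key names ('owner', 'group', 'mode', 'class') and the rule list are invented for illustration only.

```python
from fnmatch import fnmatchcase

# Hypothetical global defaults, applied to every pathname.
DEFAULTS = {"owner": "root", "group": "root", "mode": 0o644}

# Hypothetical template rules: (pattern, overrides). Later rules win.
# Note that fnmatch's "*" also matches "/" characters, so "etc/*"
# covers nested pathnames too.
RULES = [
    ("usr/bin/*", {"mode": 0o755}),
    ("etc/*", {"class": "conffile"}),
    ("var/log/*", {"class": "log", "group": "adm", "mode": 0o640}),
]


def resolve(pathname, rules=RULES, defaults=DEFAULTS):
    """Return the metadata for `pathname`: defaults plus matching overrides."""
    meta = dict(defaults)
    for pattern, overrides in rules:
        if fnmatchcase(pathname, pattern):
            meta.update(overrides)
    return meta
```

With these rules, resolve("usr/bin/foo") yields the defaults with the mode raised to 0755, while a pathname matching no rule gets the defaults unchanged.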

One strong candidate is mtree(5). One problem with it is that none of the known implementations support arbitrary metadata, so it ends up not being very extensible, and the appeal of using this format so that "standard" tools can be used largely disappears. Alternatively, a new format could be devised, perhaps based on mtree(5), the rpm spec format, or similar.
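
To make the extensibility concern concrete, here is a sketch of a parser for a very small subset of the mtree(5) syntax: comments, «/set» defaults, and full-pathname entries followed by keyword=value words. It keeps unknown keywords instead of rejecting them. The «dpkg.class» keyword is hypothetical and is not part of mtree(5) or of any known implementation.

```python
def parse_mtree_subset(text):
    """Parse a tiny subset of mtree(5): comments, /set, and full entries.

    Returns {pathname: {keyword: value}}. Values are kept as strings, and
    unknown keywords are preserved rather than rejected.
    """
    defaults = {}
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        words = line.split()
        pairs = {}
        for word in words[1:]:
            # Keywords without "=" get an empty value.
            key, _, value = word.partition("=")
            pairs[key] = value
        if words[0] == "/set":
            defaults.update(pairs)  # defaults for subsequent entries
        else:
            entries[words[0]] = {**defaults, **pairs}
    return entries


SPEC = """\
#mtree
/set uname=root gname=root mode=0644 type=file
./usr/bin/foo mode=0755
./etc/foo.conf dpkg.class=conffile
"""
```

Calling parse_mtree_subset(SPEC) gives each entry the «/set» defaults plus its own keywords, so the hypothetical «dpkg.class» value survives the round trip through this parser.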

Binary packages

Binary packages contain all basic metadata as transported by the tar entry header. Additional metadata is transported via control files in the .deb control member area, such as digests or the list of conffiles.
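
The basic metadata carried by a tar entry header can be listed with a few lines of code. This sketch uses Python's standard tarfile module on an in-memory archive built for the example. It does not unpack a real .deb, and the sample pathname and values are made up.

```python
import io
import tarfile


def build_sample_tar():
    """Build a one-entry ustar archive in memory (sample data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        data = b"#!/bin/sh\n"
        info = tarfile.TarInfo("./usr/bin/foo")
        info.size = len(data)
        info.mode = 0o755
        info.uname = info.gname = "root"
        info.mtime = 1489683725
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_metadata(tar_bytes):
    """Return one dict per tar member with the header metadata."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        return [
            {
                "path": m.name,
                "type": "dir" if m.isdir() else "link" if m.issym() else "file",
                "mode": m.mode,
                "uname": m.uname,
                "gname": m.gname,
                "size": m.size,
                "mtime": m.mtime,
            }
            for m in tar
        ]
```

Anything beyond these header fields, such as digests or the conffile flag, has no place in the ustar header, which is why it travels in separate control files.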

This is one of the less problematic parts, as the metadata is already there. It still presents the problem that it is not transparently extensible, as each new control file needs support from dpkg so that its data gets imported into the database.

Possible solutions would be to either:

  • Use PAX formatted tar entries, but those imply a not insignificant increase in the tar size.
  • Use a new extensible control file manifest, although this might duplicate metadata already present in the tar entry. It could be made to list only non-duplicate information though. This manifest could be a restricted version of the source package manifest template, so that globbing/patterns are not allowed.
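
The size cost of the PAX option can be measured with a small experiment. This sketch writes the same set of empty entries twice with Python's tarfile module, once as plain ustar and once as PAX with one extra per-entry keyword. The «DPKG.class» keyword is hypothetical and only serves to force an extended header on every entry.

```python
import io
import tarfile


def archive_size(fmt, pax_headers=None, count=100):
    """Return the size in bytes of an uncompressed archive of empty files.

    `pax_headers` is an optional dict of extra per-entry keywords. It only
    has an effect when `fmt` is tarfile.PAX_FORMAT.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        for i in range(count):
            info = tarfile.TarInfo(f"usr/share/example/file{i}")
            if pax_headers:
                info.pax_headers = dict(pax_headers)
            tar.addfile(info)  # zero-length regular file
    return len(buf.getvalue())


def compare(count=100):
    """Return (ustar_size, pax_size) for the same set of entries."""
    ustar = archive_size(tarfile.USTAR_FORMAT, count=count)
    pax = archive_size(tarfile.PAX_FORMAT, {"DPKG.class": "conffile"}, count=count)
    return ustar, pax
```

Each entry that carries an extended header needs additional 512-byte blocks before its regular header, so the uncompressed overhead grows with the number of entries. Compression reduces the difference but does not remove it.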

Installed database

The dpkg database currently only tracks pathnames, and whether those are conffiles or not.

We need a manifest format that is extensible, and contains the pathname and a list of key/value pairs. This would take over at least the .list, .conffiles, and .md5sums files. Initially each package would have one such manifest, but eventually the whole filesystem tree would be tracked in a single file, which would fix the long-standing loading time slowdown caused by the massive amount of seeking. This will require handling possibly conflicting metadata for shared pathnames such as directories or ref-counted files. We'll also need some kind of journal (similar to the control updates mechanism) to avoid having to write the whole filesystem metadata file on each package operation affecting the filesystem.
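
A rough sketch of the two mechanisms described above: merging per-package manifests into a single tree while reporting conflicting metadata for shared pathnames, and replaying a journal of small updates on top of that tree. The data layout, the operation names ("set", "remove") and the function names are assumptions made for this example, not a proposed on-disk format.

```python
def merge_manifests(per_package):
    """Merge {package: {pathname: {key: value}}} into one tree.

    Returns (tree, conflicts). Each tree entry records the owning packages
    and the first metadata value seen per key. Each conflict is a tuple
    (pathname, key, kept_value, conflicting_value).
    """
    tree = {}
    conflicts = []
    for pkg in sorted(per_package):  # deterministic processing order
        for path, meta in per_package[pkg].items():
            entry = tree.setdefault(path, {"owners": [], "meta": {}})
            entry["owners"].append(pkg)  # owner list acts as a ref-count
            for key, value in meta.items():
                kept = entry["meta"].setdefault(key, value)
                if kept != value:
                    conflicts.append((path, key, kept, value))
    return tree, conflicts


def replay_journal(tree, journal):
    """Apply (operation, pathname, payload) records to the tree in order."""
    for op, path, payload in journal:
        if op == "set":
            entry = tree.setdefault(path, {"owners": [], "meta": {}})
            entry["meta"].update(payload)
        elif op == "remove":
            tree.pop(path, None)
    return tree
```

For example, two packages shipping the same directory with different modes would produce one tree entry with two owners and one reported conflict. How such a conflict should be resolved is left open here, as it is in the text above.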