Differences between revisions 21 and 22
Revision 21 as of 2018-11-18 21:03:05
Size: 11520
Editor: GuillemJover
Comment: Move non-blockers together
Revision 22 as of 2018-11-18 22:10:21
Size: 11489
Editor: GuillemJover
Comment: Remove packages with now filed bugs

Status: draft

Problem

dpkg currently does not keep track of pathname metadata. The information is available at some stages, but it gets lost along the way. Properly tracking metadata end to end would make it possible to fix some long-standing issues, such as:

  • Performing consistency checks between filesystem vs database («dpkg -V»)
  • Restoring out-of-sync metadata on the filesystem (new «dpkg -M»?)
  • Listing dpkg pathname metadata (new «dpkg-query --mtree»?)
  • Building binary packages w/o requiring root/fakeroot («dpkg-deb -b»)
  • Tracking spurious and volatile pathnames so that dpkg knows about them and takes care of their removal (caches, logs, etc)
  • Possible automatic handling of symlink ←→ dir pathname switches (because we'd know if this is package or admin induced)
  • Making it possible to set extended attributes on the filesystem
  • Making it possible to generate installation media w/o requiring root/fakeroot (by using an unprivileged installation tree and the metadata manifests)
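
As an illustration of the first item, a consistency check between recorded metadata and the filesystem could look roughly like the sketch below. This is not how «dpkg -V» is implemented; the function name, the manifest layout (a dict of pathname to metadata) and the problem labels are assumptions made up for the example.

```python
import os
import stat


def verify(root, manifest):
    """Compare recorded metadata against the filesystem under root.

    manifest is a hypothetical structure: it maps a relative pathname to
    a dict with optional "type" ("file", "dir" or "link") and "mode"
    (integer permission bits) keys.
    Returns a list of (pathname, problem) tuples, sorted by pathname.
    """
    problems = []
    for pathname, meta in sorted(manifest.items()):
        full = os.path.join(root, pathname)
        try:
            # lstat() so that symlinks are checked themselves, not their target.
            st = os.lstat(full)
        except FileNotFoundError:
            problems.append((pathname, "missing"))
            continue
        if stat.S_ISDIR(st.st_mode):
            ftype = "dir"
        elif stat.S_ISLNK(st.st_mode):
            ftype = "link"
        else:
            ftype = "file"
        if "type" in meta and meta["type"] != ftype:
            problems.append((pathname, "type"))
        if "mode" in meta and stat.S_IMODE(st.st_mode) != meta["mode"]:
            problems.append((pathname, "mode"))
    return problems
```

Restoring out-of-sync metadata would be the same walk, calling chmod/chown instead of recording a problem.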

The metadata is (or should be) present in three different stages of a package life-cycle (although the deployment will start from the most internal to the most external; db → bin → src):

Source packages

Most source packages already contain some kind of upstream manifest that specifies what goes into the binary packages, but those are not standardized and are in most cases programmatic manifests (via the upstream build system's install-style target) instead of declarative manifests.

To generate the data.tar archives for the .deb packages, w/o needing root, we need a template manifest that specifies at least the owner:group, permissions, etc. and the list of files, whether they are conffiles, or other classes of files (such as logs, caches or similar), and possibly extended attributes. Some of the other data should be always taken from the filesystem, such as timestamp, size, etc.

The template should support some form of globbing or pattern matching to avoid repetition. It should also allow specifying defaults/globals for the metadata to also avoid repetition.

One strong candidate is mtree(5). One problem with it is that none of the known implementations support arbitrary metadata, so it is not very extensible, and the appeal of using a format that "standard" tools can handle largely disappears. Alternatively, a new format could be devised, perhaps based on mtree(5), the rpm spec format, or similar.

The current best option is to use mtree(5) 2.0 (as defined in libarchive) as the base format, and extend the key/values as needed, even if that means not being able to use stock mtree(1) commands.
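
To make the globbing and defaults idea concrete, here is a minimal sketch of expanding a template into per-pathname metadata. The rule layout, the "class" key and the function name are hypothetical and are not mtree(5) syntax; the sketch only shows how defaults plus pattern rules avoid repetition.

```python
from fnmatch import fnmatchcase

# Hypothetical global defaults applied to every pathname.
DEFAULTS = {"uname": "root", "gname": "root", "mode": "0644"}

# Hypothetical pattern rules; later rules override earlier ones.
# Note that fnmatch's "*" also matches "/" characters.
RULES = [
    ("usr/bin/*", {"mode": "0755"}),
    ("etc/*", {"class": "conffile"}),
    ("var/log/*", {"class": "log", "gname": "adm", "mode": "0640"}),
]


def expand(pathnames, rules=RULES, defaults=DEFAULTS):
    """Return a dict mapping each pathname to its resolved metadata."""
    manifest = {}
    for pathname in pathnames:
        meta = dict(defaults)
        for pattern, overrides in rules:
            if fnmatchcase(pathname, pattern):
                meta.update(overrides)
        manifest[pathname] = meta
    return manifest
```

Data such as size and timestamps is deliberately absent here, as it would be taken from the filesystem at build time.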

Binary packages

Binary packages contain all basic metadata as transported by the tar entry header. Additional metadata is transported via control files in the .deb control member area, such as digests or the list of conffiles.

This is one of the less problematic parts, as the metadata is already there. It still presents the problem that it is not transparently extensible, as each new control file needs support from dpkg so that its data gets imported into the database.

Possible solutions would be to either:

  • Use PAX formatted tar entries, but:
    • Those imply a not insignificant increase in the tar size, due to the overhead of the additional tar entries for the tar extended headers.
    • The format support is not universal in all systems where dpkg can be used.
    • On systems where tar does not support the PAX extended headers, manually extracting via ar+tar (or dpkg-deb) would create a mess of files representing the PAX extended headers (as per the spec).
    • The filesystem xattrs have not been universally supported nor universally enabled even when they were supported.
    • On systems where filesystem xattrs are not supported or enabled (which has been a problem as these have not seen universal adoption), manually extracting via ar+tar or dpkg-deb would mean that those xattrs would get lost.
    • Old dpkg-deb will fail hard on the new unknown tar entry types (the new PAX 'g' and 'x' extended header entries).
    • We'd be switching most metadata to the new transport format, and most of it needs to end up in the dpkg db and is not appropriate for the files in filesystem, so possibly having lots of duplicated information as xattrs does not make sense.
    • The PAX format does not have an officially defined keyword namespace for xattrs. Several vendor namespaces are currently in use, none universally adopted: there's LIBARCHIVE and SCHILY, the latter being somewhat undesirable given previous bad history within Debian with the author.
  • Use a new extensible control file manifest (current candidate mtree(5) 2.0).
    • This manifest must not duplicate metadata already present in the tar header, otherwise manual extraction (with ar+tar) would diverge, and extraction using an older dpkg-deb would also diverge.
    • This manifest could be a restricted version of the source package manifest template, so that globbing/patterns are not allowed.
    • There is the problem of whether to accept any unknown key silently, warn or fail hard.
      • If the file is generated automatically then it might always have a good form (barring bugs), but if it is manually crafted, then typoed keywords might creep in. In that case a solution might be to lint the file, either with a new dpkg command or with lintian, to catch those.
      • Accepting random keys makes it easy for 3rd parties to extend and experiment, but it might become a support and extension headache whenever dpkg needs to claim one of the keywords w/o stomping on keys used unofficially.
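
The three options for unknown keys (accept silently, warn, fail hard) could be sketched as a parser policy like the one below. This is a simplified illustration, not a full mtree(5) parser: it only handles «key=value» words, does not decode escaped pathnames, and the function name and the `unknown` parameter are assumptions. The known keys listed are a subset of the standard mtree keywords.

```python
import warnings

# Subset of standard mtree(5) keywords; anything else counts as unknown here.
KNOWN_KEYS = {"type", "mode", "uid", "gid", "uname", "gname", "size", "time", "link"}


def parse_entry(line, unknown="warn"):
    """Parse one simplified manifest line into (pathname, metadata).

    unknown selects the policy for keys outside KNOWN_KEYS:
    "ignore" accepts them silently, "warn" accepts them with a warning,
    "fail" raises ValueError.
    """
    if unknown not in {"ignore", "warn", "fail"}:
        raise ValueError(f"invalid policy: {unknown}")
    pathname, *words = line.split()
    meta = {}
    for word in words:
        key, sep, value = word.partition("=")
        if not sep:
            raise ValueError(f"{pathname}: malformed word '{word}'")
        if key not in KNOWN_KEYS:
            if unknown == "fail":
                raise ValueError(f"{pathname}: unknown key '{key}'")
            if unknown == "warn":
                warnings.warn(f"{pathname}: unknown key '{key}'")
        meta[key] = value
    return pathname, meta
```

A lint-style check would run with the "fail" or "warn" policy, so a typo such as «mdoe=0755» is caught before the package ships, while dpkg itself could be more lenient at unpack time.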

Installed database

The dpkg database currently only tracks pathnames, possibly their md5sums and whether those are conffiles or not.

We need a manifest format that is extensible, and contains the pathname and a list of key/value pairs. This would take over at least the .list, .conffiles, and .md5sums files. Initially each package would have one such manifest entry, but eventually the whole filesystem tree would be tracked in a single file, so we fix the long-standing loading time slowdown inflicted by the massive amount of seeking. This will require handling possibly conflicting metadata for shared pathnames such as directories or ref-counted files. We'll also need some kind of journal (similar to the control updates mechanism) to avoid having to write the whole filesystem metadata file on each package operation affecting the filesystem.
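
A rough sketch of merging per-package manifests into a single tree, while tracking owners and detecting conflicting metadata for shared pathnames, might look as follows. The data layout and the function name are assumptions for illustration; this says nothing about how the on-disk file or the journal would be structured.

```python
def merge(manifests):
    """Merge per-package manifests into one filesystem-wide tree.

    manifests maps a package name to a dict of pathname -> metadata dict.
    Returns (tree, conflicts): tree maps each pathname to its list of
    owning packages plus the metadata recorded by the first owner, and
    conflicts lists (pathname, first_owner, conflicting_package) tuples.
    """
    tree = {}
    conflicts = []
    for package in sorted(manifests):
        for pathname, meta in manifests[package].items():
            entry = tree.setdefault(pathname, {"owners": [], "meta": dict(meta)})
            # A shared pathname (e.g. a directory) must agree with what
            # the first owner recorded, otherwise it is a conflict.
            if entry["meta"] != meta:
                conflicts.append((pathname, entry["owners"][0], package))
            entry["owners"].append(package)
    return tree, conflicts
```

The length of the owners list acts as the reference count: a shared directory can only be removed once its last owner is gone.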

The current WIP branch is using mtree(5) 2.0 for the internal dpkg db, and as a first step will make it possible to track all metadata available from the current tar archives. So a simple reinstall will fill the dpkg db with the metadata.
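
To show what "all metadata available from the current tar archives" amounts to, here is a small sketch that turns tar entry headers into mtree-style lines using Python's tarfile module. It is not the WIP branch's code; the function name is made up, pathnames are not escaped as mtree(5) would require, and the exact set and order of keywords emitted is an assumption.

```python
import tarfile

# Map tar entry types to mtree(5) "type" values; hard links are
# reported as regular files.
TYPES = {
    tarfile.REGTYPE: "file",
    tarfile.AREGTYPE: "file",
    tarfile.LNKTYPE: "file",
    tarfile.DIRTYPE: "dir",
    tarfile.SYMTYPE: "link",
    tarfile.CHRTYPE: "char",
    tarfile.BLKTYPE: "block",
    tarfile.FIFOTYPE: "fifo",
}


def tar_to_mtree(fileobj):
    """Return mtree-style lines for every entry of a tar archive."""
    lines = ["#mtree"]
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if member.type not in TYPES:
                raise ValueError(f"{member.name}: unsupported tar entry type")
            words = [
                member.name,
                f"type={TYPES[member.type]}",
                f"mode={member.mode & 0o7777:04o}",
                f"uid={member.uid}",
                f"gid={member.gid}",
            ]
            if member.uname:
                words.append(f"uname={member.uname}")
            if member.gname:
                words.append(f"gname={member.gname}")
            # Sub-second precision is dropped in this sketch.
            words.append(f"time={int(member.mtime)}.0")
            if member.isreg():
                words.append(f"size={member.size}")
            if member.issym():
                words.append(f"link={member.linkname}")
            lines.append(" ".join(words))
    return lines
```

Anything not carried by the tar header (conffile status, digests) would still have to come from the control files.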

The deployment of this has several problems that need to be handled first:

  • Many packages access the package info files database directly, which means they will break once we stop storing at least the .list and .md5sums files.
    • (blocker) We need to provide access to it via new interfaces, this includes:
      • (./) Print the «dpkg-query --status» for all packages; a status db dump, but programmatically, and supporting the dpkg journal (since dpkg 1.19.1).

      • (./) Query all conffiles from shell: «dpkg-query -f '${Package} : ${Conffiles}\n' -W»

      • (./) Query packages with specific fields, or field values (all packages with Modaliases field f.ex.): «dpkg-query -f '${Package} : ${Modaliases}\n' -W»

      • (./) Query filesystem objects from C: libdpkg-dev provides the fsys functions in 1.19.1, which were previously only within the tools themselves. Even though it is currently just a static library, this should not be a problem as it is built now as PIC everywhere.

      • Query filesystem objects from Perl: libdpkg-perl should grow a new interface to fetch all package files (could use the internal db for now if there's no other fast interface, as it is part of the dpkg suite).
      • Query filesystem objects from shell: dpkg-query interface to fetch all files in the system related to their owning packages.
      • Show the package files digests. While we can use «dpkg-query --control-show <pkgname> md5sums», this interface would need to map data from mtree in the future, and does not match the control data in the db. Also --control-show has unfortunately the arguments swapped and does not allow batching queries. :/

      • Show the package installation/last-modification time; affected packages currently use the .list file mtime for this.
      • Query which package owns a specific db file (reportbug), as in: who-owns("<admindir>/info/dpkg.conffiles") → "dpkg"

    • New lintian tag(s). 905469 913974

    • (blocker) dpkg database access

      • aide
      • aptitude
      • avfs
      • crosshurd
      • cruft: needs db-fsys:Files
      • debian-goodies: needs db-fsys:Last-Modified
      • debootstrap
      • debsums
      • dh-make-perl: needs libdpkg-perl interfaces or switching to dpkg-query
      • dlocate
      • dpkg-www: needs db-fsys:Last-Modified
      • fai
      • gkdebconf
      • hw-detect
      • live-build
      • live-config
      • live-installer
      • mc
      • multistrap
      • ocsinventory-agent
      • open-infrastructure-container-tools
      • open-infrastructure-system-build
      • open-infrastructure-system-config
      • opensvc: needs db-fsys:Last-Modified
      • piuparts
      • popularity-contest
      • reportbug
      • ruby-debian
      • salt: needs db-fsys:Last-Modified
    • (non-blocker) dpkg database access for currently unaffected files (status/available/diversions/triggers db, .shlibs, .clilibs, .starlibs, .templates, and maint-scripts):
      • anna
      • apt
      • autopkgtest
      • bacula
      • bpython
      • btrfs-progs
      • chrome-gnome-shell
      • cli-common
      • configure-debian
      • cruft-ng
      • debconf
      • debian-edu-install
      • denyhosts
      • fakechroot
      • gnu-smalltalk
      • grub2
      • kickseed
      • libreoffice
      • lowmem
      • lxc
      • mono
      • multipath-tools
      • obs-build
      • pam
      • puppet
      • rootskel
      • ruby-defaults
      • wajig
    • (non-blocker, inert code) dpkg database access

    • (non-blocker) We need a way to query package states in a more reliable way. Update and merge the pu/compare-status branch.
    • (non-blocker) We need to fix the problem with a harmful prerm by making it possible for a new package to declare the old prerm should not be run (add a new field?).
  • (blocker) We cannot install .mtree files under /var/lib/dpkg/info/ because old or new .debs could have shipped those, and these might be invalid, or not match the contents. In general it seems like a bad idea to store the files handled and generated by dpkg itself alongside files coming straight from the .debs. We need to separate them into different directories. Perhaps /var/lib/dpkg/info/<pkg>.<ctrl-file> and /var/lib/dpkg/meta/<pkg>.<meta-file> or similar.
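
The who-owns query listed among the new interfaces, applied to the <pkg>.<ctrl-file> naming above, could be sketched as follows. The function name is hypothetical, the meta/ directory is only a proposal in this page, and the sketch assumes that control and metadata file names never contain a dot while package names may.

```python
import posixpath


def who_owns(db_pathname):
    """Return the package owning a db file, or None if it has no owner.

    Handles names such as "<pkg>.<ctrl-file>" and the multiarch form
    "<pkg>:<arch>.<ctrl-file>" under an info/ or (proposed) meta/ directory.
    """
    directory, filename = posixpath.split(db_pathname)
    if posixpath.basename(directory) not in {"info", "meta"}:
        return None
    # Split on the last dot, since package names may contain dots.
    stem, sep, _filetype = filename.rpartition(".")
    if not sep or not stem:
        return None
    # Drop the architecture qualifier, if any.
    package, _, _arch = stem.partition(":")
    return package
```

For example, who_owns("/var/lib/dpkg/info/dpkg.conffiles") yields "dpkg", matching the mapping described in the interface list.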