This page should be used as a basis for discussion on how to implement an advanced startup/shutdown system that can use multi-layered block devices for its filesystems.

Introduction

Filesystems

Filesystems are one of the basic objects needed in every Linux/UNIX system. They're needed very early (usually right after the kernel itself has "booted") and for a very long time (usually until the machine powers off or reboots, and also while it is in any suspend state).

A filesystem generally lies "on top" of a block device. Usually this can be any block device, for example:

There are different kinds of filesystem classes to be considered:

Block Layers and Block Devices

The kernel supports many types of block layers:

All of them usually provide block devices of some kind and have their own userspace tools.

For the whole topic it is very important to understand that block layers can be stacked nearly arbitrarily, as the following examples should show:

physical disk/partition --\
physical disk/partition --+-> md (RAID6) --> lvm --+-> dm-crypt --> filesystem (e.g. ext4)
physical disk/partition --|                        \-> dm-crypt --> filesystem (e.g. btrfs) 
physical disk/partition --/

physical disk/partition --\                                    /-> filesystem (e.g. xfs)
physical disk/partition --+-> md (RAID5) --> dm-crypt --> lvm -+-> filesystem (e.g. ext4)
physical disk/partition --/                                    \-> filesystem (e.g. jfs)

physical disk/partition --> dm-crypt --\                        /-> filesystem (e.g. xfs)
physical disk/partition --> dm-crypt --+-> md (RAID10) --> lvm -+-> filesystem (e.g. ext4)
physical disk/partition --> dm-crypt --/                        \-> filesystem (e.g. jfs)

Things to notice:

The Boot Process and the Shutdown Process

The boot process consists of several systems/steps:

  1. Boot loader (loads kernel and optionally the initramfs image)
  2. Kernel
  3. optionally, an initramfs image
    • The initramfs image should do everything that is necessary to mount the root-filesystem (specified by the kernel parameter "root") and/or to resume from any resume device.
    • It should only contain what is required to do this. For example, if the root-filesystem/resume device is neither directly nor indirectly on an LV, no LVM components should be included or executed; likewise, if btrfs is not used for the root-filesystem, the "btrfs-scan" (btrfsctl -a) should not be run.

    • It should not mount any other filesystems or make any other devices/block layers available that are not required.
  4. init-system (e.g. sysvinit) and its initscripts
    • If no initramfs image was used in step 3, the root-filesystem must already be present.
    • The init-system should mount any further filesystems as specified by /etc/fstab and also make any further block devices available as specified by the respective configuration of their userspace tools (e.g. /etc/crypttab, the VG descriptors of LVM, et cetera).

Things to notice:

The shutdown (halt/reboot) process consists of several systems/steps:

  1. init-system (e.g. sysvinit) and its initscripts
    • The init-system should first unmount all non-root filesystems and close all their related block devices from all layers in the correct (reverse) order.
      • After that, the root-filesystem is still mounted (hopefully along with everything else that is required to proceed) and all the block devices/layers it uses directly or indirectly are still open. If an attempt was made to close them (the block devices), an error should have occurred (something like "still busy").
    • Next, the root-filesystem should be remounted read-only (see below for why it is not unmounted).
  2. shutdown (halt/reboot)
    • At this point, there are special dm-crypt related security issues, which will be discussed below.
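
The "still busy" behaviour described above can be illustrated with a small Python sketch. This is only a toy model and not how the kernel or any init-script actually tracks usage; the class `Layers`, the exception `Busy` and all device names are made up for illustration.

```python
class Busy(Exception):
    """A device cannot be closed while something above it is still open."""


class Layers:
    def __init__(self, stack):
        # stack maps each object to the devices directly below it.
        self.stack = stack
        self.open = set(stack)
        for lowers in stack.values():
            self.open.update(lowers)

    def close(self, name):
        # Anything still open that sits directly on top of `name` keeps it busy.
        users = sorted(d for d in self.open if name in self.stack.get(d, ()))
        if users:
            raise Busy(f"{name}: still busy (used by {', '.join(users)})")
        self.open.discard(name)


# Hypothetical system: two filesystems, each on its own dm-crypt device.
layers = Layers({
    "fs:/": ["root_crypt"],
    "fs:/home": ["home_crypt"],
    "root_crypt": ["sda2"],
    "home_crypt": ["sda3"],
})

layers.close("fs:/home")        # unmount the non-root filesystem first
layers.close("home_crypt")      # then close its block device, top-down
try:
    layers.close("root_crypt")  # "/" is only remounted read-only, so this fails
except Busy as err:
    message = str(err)
```

In this model the root-filesystem never leaves the set of open objects, so everything below it stays busy, which is exactly the situation the remaining sections have to deal with.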

Going into suspend modes:

AFAIK, there is no kind of "shutdown procedure" that affects filesystems or block devices when going into suspend modes. There is, however, some driver unloading, and the network will be interrupted (which affects at least NBD and DRBD). At least suspend-to-RAM has huge security risks when using dm-crypt encrypted systems.

Assumptions

The information presented in the introduction leads to two questions:

  1. Which setup (i.e. which way to stack all the different devices) is "the right one"?

    • The definite answer:

      None; each of them is valid. Every user might have a different environment and different wishes, and we should not force anyone to go a way that we personally prefer. As an example showing that every setup has advantages and disadvantages:

    • ...--> lvm --> dm-crypt --> filesystem

      • Here, every filesystem will "have" different dm-crypt keys, making it possible to "decrypt" them independently.
    • ...--> dm-crypt --> lvm --> filesystem

      • Here, all LVs and therefore all filesystems will "have" the same dm-crypt keys, making it possible to "decrypt" all of them "at once".
      In both setups, things like resizing should still be possible. It should be quite clear that both are valid.
  2. Where do we allow the root-filesystem?

    • The definite answer:

      Everywhere. It would be an unjustifiable restriction if we, for example, restricted the root-fs to plain partitions or to "only one" intermediate block layer (e.g. LVM). At least one block layer in between is required to support fully encrypted systems, where even the root-filesystem is encrypted and one boots from, e.g., a USB stick. This is especially interesting because with modern filesystems that do checksumming (e.g. btrfs) one gets a kind of integrity/authenticity for free in addition to the encryption by dm-crypt, as we then have something like a MAC. But even without checksumming, fully encrypted systems are nearly unmodifiable, as an attacker does not know what to modify and to which value (if he wants, for example, to compromise the system).

Problems

In order to achieve everything outlined above, many problems have to be solved, including the following (please add any further ones). Most of them are probably not Debian-only :) .

How-to "Tell" The Initramfs Image And The Init-System The Order In Which Block Devices/Layers Must Be Created And Filesystems Mounted

For both,

we must find a way to tell them how (meaning in which order) to do so.
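
One conceivable way to express such an order is a dependency graph that gets sorted topologically. The following is a minimal Python sketch of that idea, not an existing Debian mechanism; the `STACK` mapping and all device names (`md0`, `md0_crypt`, `vg0-root`, ...) are hypothetical and loosely follow the second example diagram from the introduction.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stack: each key is built on top of the devices it maps to.
STACK = {
    "md0": ["sda1", "sdb1", "sdc1"],   # md (RAID5) over three partitions
    "md0_crypt": ["md0"],              # dm-crypt on top of the RAID
    "vg0-root": ["md0_crypt"],         # LVs in a VG whose PV is md0_crypt
    "vg0-home": ["md0_crypt"],
}


def setup_order(stack):
    """Return all devices so that each one comes after the ones below it."""
    return list(TopologicalSorter(stack).static_order())


def teardown_order(stack):
    """Closing has to happen in the reverse order of creation."""
    return list(reversed(setup_order(stack)))
```

With such a description, neither the initramfs image nor the init-system would need a statically defined order of layers; both could derive it from the same graph, whatever way the user chose to stack the devices.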

calestyo: Jonas has recently started a discussion on this at debian-devel, IIRC. Event-based init-systems were mentioned there, but I'm not sure whether they would really help us here; perhaps I just misunderstand something.

How-to Add Only Strictly Necessary Stuff Into The Initramfs Image

In order to add only the strictly necessary stuff into the initramfs image, we need to find out what is needed, by determining which kinds of block devices / block layers are either directly or indirectly used by the root-filesystem and the resume devices.

calestyo: I'm not sure how this could work, but it probably starts at the / entry in /etc/fstab and goes through all other related configs (or can we get all the information from device mapper?), resulting in a tree of block devices with the root-filesystem on top. The same applies, of course, to each resume device. BTW: I think that once any PV of an LVM VG is below a device, we'd have to add all the other PVs too, regardless of whether they actually hold extents of the respective LV or not, as I guess we cannot partially close a VG.
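
As a rough sketch of how such a tree could be obtained from a running system: the kernel exposes the direct lower devices of a stacked device as entries in the `slaves` directory below `/sys/class/block/<name>/`. The Python code below walks these entries recursively. It is only an illustration under several assumptions: sysfs is mounted, the kernel name of the device holding the root-filesystem (e.g. `dm-1`) is already known, and the `CRYPT-`/`LVM-` prefixes of the device-mapper UUID are used to tell dm-crypt and LVM apart (these are the prefixes cryptsetup and lvm2 are commonly seen to set, but treat that as an assumption). The function names are made up, and a partition is not mapped back to its parent disk.

```python
import os


def devices_below(name, sysfs="/sys/class/block"):
    """Return the names of all block devices `name` sits on, directly or indirectly."""
    found = set()
    todo = [name]
    while todo:
        current = todo.pop()
        slaves_dir = os.path.join(sysfs, current, "slaves")
        if not os.path.isdir(slaves_dir):
            continue  # no slaves directory: nothing is listed below this device
        for slave in os.listdir(slaves_dir):
            if slave not in found:
                found.add(slave)
                todo.append(slave)
    return found


def layer_kind(name, sysfs="/sys/class/block"):
    """Guess which block layer a device belongs to (assumption-based heuristic)."""
    base = os.path.join(sysfs, name)
    if os.path.isdir(os.path.join(base, "md")):
        return "md"
    uuid_file = os.path.join(base, "dm", "uuid")
    if os.path.isfile(uuid_file):
        with open(uuid_file) as fh:
            uuid = fh.read().strip()
        if uuid.startswith("CRYPT-"):
            return "dm-crypt"
        if uuid.startswith("LVM-"):
            return "lvm"
        return "dm"
    return "plain"


def needed_layers(name, sysfs="/sys/class/block"):
    """Block layers that would have to be supported to bring up `name`."""
    kinds = {layer_kind(d, sysfs) for d in devices_below(name, sysfs) | {name}}
    return kinds - {"plain"}
```

Calling `needed_layers()` for the root device and for each resume device would give the set of layers whose tools belong in the initramfs image. Note that this only inspects the currently running system, so it cannot answer the open question about the other PVs of a VG.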

How-to Keep Cryptsetup Secure When Booting

I recently described a kind of "meta-attack" (others have probably found it before), where an attacker can trick a user by replacing his (fully encrypted) physical volumes with compromised ones.

This is probably not the most important issue here, but it is related to the boot process, and therefore we might want to have a look at it.

See here for the (hopefully clear) description: http://www.saout.de/pipermail/dm-crypt/2010-June/000856.html

How-to shut the system down cleanly (halt or reboot)

There is also some discussion on this at: http://thread.gmane.org/gmane.linux.kernel/1003210

I'll try to explain the following as well as I can, but others (Milan?) surely have more knowledge and should please correct my mistakes and add what is missing.

When shutting down, we need to do the following (naively):

  1. Unmount all non-root-filesystems
  2. Close all block devices/layers (in the correct order), which were solely used by these non-root-filesystems
  3. Unmount the root-filesystem.
  4. Close all block devices/layers (in the correct order), which were solely used by the root-filesystem.
  5. Halt or Reboot.

However, it does not (and cannot) work like this.

Is our root-filesystem (data) really safe after we remounted it read-only?

Probably yes. Milan tried to describe the technical details in the lkml thread referenced above. (Correct me if anything is wrong!!!)

At least one question is still open (to me): what if remount,ro and the flushing cannot finish because some block layer below has to wait or blocks? Consider, e.g., LVM used with the cluster-thingy: the network could be down. Or would it (the daemon) already be gone at that stage, having cleanly stopped and locked/unlocked everything as necessary before?

We cannot close all the block devices/layers below the remounted,ro root-filesystem. What are the resulting problems?

Are there any problems at all? Yes!

What would happen at all? We would somehow (perhaps with error messages) continue and move forward to halting/rebooting, of course after the root-filesystem had been remounted read-only.

One (really stupid) "solution" to the error-message issue is to simply not close _ANY_ of the block devices (below the root-fs as well as below non-root filesystems), by not calling the respective init-scripts in rc0 and rc6. While the data should still be safe (as all the filesystems are unmounted), this is really ugly, as we wouldn't close devices that we actually could close (namely everything that is only below non-root-filesystems).

Could we filter out the error messages for just those devices which _are_ somewhere below the root-fs? How?
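
One possible approach, sketched here in Python: if the stacking is known as a mapping from each device to the devices directly below it, we can compute everything that lies below the root-filesystem and skip exactly those devices when closing, so no error is produced for them in the first place. This is only an illustration; the mapping `STACK`, the list `TEARDOWN` and all device names are hypothetical, and how the init-scripts would obtain this information is left open.

```python
def below(device, stack):
    """All devices that `device` depends on, directly or indirectly."""
    seen = set()
    todo = list(stack.get(device, []))
    while todo:
        dev = todo.pop()
        if dev not in seen:
            seen.add(dev)
            todo.extend(stack.get(dev, []))
    return seen


def split_for_shutdown(teardown, root_device, stack):
    """Split a teardown list into devices to close and devices to leave open."""
    pinned = below(root_device, stack) | {root_device}
    closable = [d for d in teardown if d not in pinned]
    kept = [d for d in teardown if d in pinned]
    return closable, kept


# Hypothetical layout: root and home LVs share one encrypted RAID,
# a separate backup disk has its own dm-crypt device.
STACK = {
    "vg0-root": ["md0_crypt"],
    "vg0-home": ["md0_crypt"],
    "md0_crypt": ["md0"],
    "md0": ["sda1", "sdb1"],
    "backup_crypt": ["sdc1"],
}
TEARDOWN = ["vg0-home", "vg0-root", "backup_crypt", "md0_crypt", "md0"]

closable, kept = split_for_shutdown(TEARDOWN, "vg0-root", STACK)
# closable: vg0-home and backup_crypt; kept: vg0-root, md0_crypt and md0
```

Both lists keep the order of the original teardown list, so the devices that can be closed are still closed top-down.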

How-to Keep Cryptsetup Secure When Shutting down

Milan noted in the lkml thread that not closing the dm-crypt device(s) leads to (I guess at least?) one security problem: the key is not deleted from memory. Perhaps Milan can describe this (and any solutions) please :) ... hopefully finding the perfect solution ;) .

Further Problems

Are there any further problems? For example related to suspend modes or security issues?

Other Notes

I've already played with such thoughts, and Milan spoke it out loud in the lkml thread: one solution for some of the above problems (but not all!) would be to "invent" the opposite of initramfs images, "uninitramfs images" ;) , which could then "easily" clean up and really unmount the root-filesystem.

Current Situation In Debian

This shows the current status in Debian. Please correct anything that's wrong.

Affected Systems And Packages

The Problems From Above And Their Status In Debian

How-to "Tell" The Initramfs Image And The Init-System The Order In Which Block Devices/Layers Must Be Created And Filesystems Mounted

AFAIK, all this is statically defined at the moment in both the initramfs boot scripts and the init-scripts.

For the init-system the order seems to be: physical disks/partitions --> md --> dm-crypt --> lvm --> dm-crypt --> filesystem

Each of them is optional, e.g. not having mdadm installed will mean no md.

The double dm-crypt is done via cryptdisks-early and cryptdisks.

I guess the order is solely defined via the LSB init-script headers.

For the initramfs image the order seems to be: physical disks/partitions --> lvm --> dm-crypt --> filesystem

Not sure whether there's also a kind of "cryptdisks_early", and where mdadm fits in.

Not sure how the order is defined here. There are these ORDER files in ../scripts/local-*/, there is the prereqs mechanism in boot-scripts, and IIRC the remaining ones are simply run alphabetically?

How-to Add Only Strictly Necessary Stuff Into The Initramfs Image

Some packages already do some checking. For example, cryptsetup tries to include only what is necessary. But I guess (Jonas, correct me) this is limited: I assume you start scanning in /etc/fstab, but will this work if the dm-crypt device is not directly below the root-filesystem but only at some other, indirect place below it (see scenarios above)?

Other packages, e.g. lvm2, simply always include and run their stuff within initramfs images when installed (see my notes in the "Notes" section below ;) ).

How-to Keep Cryptsetup Secure When Booting

Currently, we are probably vulnerable to that issue, at least IMHO ;)

How-to shut the system down cleanly (halt or reboot)

Same issues as mentioned above.

In fact, we effectively only have this (at least as far as I understand):

  1. Unmount all non-root-filesystems
  2. Close _ALL_ block devices/layers (in the correct order)
  3. remount,ro the root-filesystem.
  4. No further attempt is made to close the block devices/layers (in the correct order) which were solely used by the root-filesystem.
  5. Halt or Reboot.

Is our root-filesystem (data) really safe after we remounted it read-only?

Same issues as mentioned above.

We cannot close all the block devices/layers below the remounted,ro root-filesystem. What are the resulting problems?

Same issues as mentioned above.

Some packages, e.g. lvm2, let the init-script fail if closing didn't work, leading to a big red "failed" on the terminal ;). Others, e.g. cryptsetup (at least until the next upload), print a warning but do not let the init-script fail (right?).

How-to Keep Cryptsetup Secure When Shutting down

As we cannot close the dm-crypt devices lying below an encrypted root-filesystem, we're currently vulnerable.

Bugs

There are already some related bugs in the BTS, including:

Please add them if you find more.

Notes

calestyo

ToDo

Things we might need to consider in addition