This should be used as a discussion base for implementing an advanced startup/shutdown system that can use multi-layered block devices for its filesystems.

Introduction

Filesystems

Filesystems are one of the basic objects needed in every Linux/UNIX system. They are needed very early (usually right after the kernel itself has "booted") and for a very long time (usually until the machine powers off or reboots, including while it is in any suspend state).

A filesystem generally lies on top of a block device. This can usually be any block device, for example:

  • /dev/sda (directly on the physical disk)
  • /dev/sda1 (directly on the physical disk, but in any partition of any partition table type)
  • /dev/loop0 (on loopback devices)
  • /dev/nbd0 (on network block devices)
  • /dev/vg00/lv00 (on logical volumes from LVM)
  • /dev/mapper/dmcryptDevice (on plain- or LUKS-dm-crypt device)
  • et cetera
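
For instance, on a running system one can quickly check which block device and filesystem type a mount point uses (a small illustration using util-linux tools, not part of any boot script):

findmnt -n -o SOURCE,FSTYPE /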

There are different kinds of filesystem classes to be considered:

  • root-filesystem (which hopefully includes everything from e.g. /usr, /bin, /sbin, /etc, /lib, et cetera that is required to boot; note that /var is not necessarily on the root filesystem)

  • non-root-filesystems (which are not required to boot, e.g. /home, /srv or /root)
  • resume devices, which are used in suspend-to-disk

Block Layers and Block Devices

The kernel supports many types of block layers:

  • Physical disks and their partitions
  • Loopback devices
  • Network block devices
  • Device Mapper (generic block device mapper)
    • Logical Volume Manager (LVM2) (which uses physical volumes (PV, not necessarily whole disks) and maps them to logical volumes (LV), grouping them in volume groups (VG))
    • dm-crypt (providing plain and LUKS dm-crypt devices)
  • MD ("software RAID"; strictly speaking its own subsystem rather than a Device Mapper target)
  • DRBD
  • et cetera

All of them usually provide their own block devices and come with their own userspace tools.

For the whole topic that follows, it is very important to understand that block layers can be stacked nearly arbitrarily, as the following examples show (a command-level sketch of assembling such a stack follows below the notes):

physical disk/partition --\
physical disk/partition --+-> md (RAID6) --> lvm --+-> dm-crypt --> filesystem (e.g. ext4)
physical disk/partition --|                        \-> dm-crypt --> filesystem (e.g. btrfs) 
physical disk/partition --/

physical disk/partition --\                                    /-> filesystem (e.g. xfs)
physical disk/partition --+-> md (RAID5) --> dm-crypt --> lvm -+-> filesystem (e.g. ext4)
physical disk/partition --/                                    \-> filesystem (e.g. jfs)

physical disk/partition --> dm-crypt --\                        /-> filesystem (e.g. xfs)
physical disk/partition --> dm-crypt --+-> md (RAID10) --> lvm -+-> filesystem (e.g. ext4)
physical disk/partition --> dm-crypt --/                        \-> filesystem (e.g. jfs)

Things to notice:

  • One can get even more complex setups, by using the same type of block layers multiple times (e.g. ...-->lvm--> dm-crypt --> lvm -->...).

  • Using loopback devices on top of filesystems, everything gets even more complex.
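
As a concrete illustration, this is roughly how the second example stack above (md RAID5 --> dm-crypt --> lvm --> filesystems) could be assembled by hand. It is only a sketch; the device names, sizes and filesystem choices are assumptions:

# 1. Build the RAID5 array from three partitions.
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1

# 2. Put a LUKS dm-crypt layer on top of the whole array.
cryptsetup luksFormat /dev/md0
cryptsetup luksOpen /dev/md0 cryptpv

# 3. Use the decrypted device as the only physical volume of a volume group.
pvcreate /dev/mapper/cryptpv
vgcreate vg00 /dev/mapper/cryptpv
lvcreate -L 10G -n data vg00
lvcreate -L 10G -n home vg00

# 4. Finally, create the filesystems on the logical volumes.
mkfs.xfs  /dev/vg00/data
mkfs.ext4 /dev/vg00/home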

The Boot Process and the Shutdown Process

The boot process consists of several systems/steps:

  1. Boot loader (loads kernel and optionally the initramfs image)
  2. Kernel
  3. optionally, an initramfs image
    • The initramfs image should do everything that is necessary to mount the root-filesystem (specified by the kernel parameter "root") and/or to resume from any resume device (a rough sketch follows below the notes).
    • It should only contain stuff that is required to do this (e.g. if the root-filesystem/resume device is neither directly nor indirectly on a LV, no LVM stuff should be included and/or executed; or, as another example, if btrfs is not used for the root-filesystem, the btrfs scan (btrfsctl -a) should not be done).

    • It should not mount any other filesystems or make any other devices/block layers available that are not required.
  4. init-system (e.g. sysvinit) and its initscripts
    • If no initramfs image was used in step 3, the root-filesystem must already be present.
    • The init-system should mount any further filesystems as specified by /etc/fstab and also make any further block devices available as specified by the respective userspace configuration (e.g. /etc/crypttab, the VG metadata of LVM, et cetera).

Things to notice:

  • We do not care about anything before step 2, so the boot loader is rather irrelevant for us.
  • Of course, for any complex setup (especially where userspace tools and/or configuration are required), the initramfs image is not optional. There may be some exceptions to this, e.g. LVM is (IIRC) quite smart in figuring everything out on its own.

    • However, we really should try to avoid forcing users to use initramfs images for setups where they're not strictly needed (e.g. (physical disk --> root-filesystem) or (physical disk(s) --> lvm --> root-filesystem)).
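
As an illustration of step 3, here is a hypothetical, stripped-down sketch of what an initramfs has to do for a "md --> dm-crypt --> lvm --> root-filesystem" setup. The device and volume names are assumptions, and this is not the actual Debian initramfs code:

# Assemble the RAID array from its member devices.
mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1

# Open the dm-crypt layer (asks for the passphrase on the console).
cryptsetup luksOpen /dev/md0 cryptpv

# Activate the logical volumes of the volume group.
vgchange -a y vg00

# Mount the root-filesystem read-only, as the real init expects it.
mount -o ro /dev/vg00/root /root

# Hand over to the init-system on the mounted root-filesystem
# (busybox syntax; Debian's initramfs-tools uses run-init instead).
exec switch_root /root /sbin/init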

The shutdown (halt/reboot) process consists of several systems/steps:

  1. init-system (e.g. sysvinit) and its initscripts
    • The init-system should first unmount all non-root filesystems and close all their related block devices from all layers in the correct (reverse) order.
      • After that, the root-filesystem is still mounted (hopefully together with everything that is required to proceed) and all the block devices/layers it uses directly or indirectly are still open. If one tried to close them (the block devices), an error should occur (something like "still busy").
    • Next, the root-filesystem should be remounted read-only (see below why it is not unmounted).
  2. shutdown (halt/reboot)
    • At this point, there are special dm-crypt related security issues, which will be discussed below.

Going into suspend modes:

AFAIK, there is no kind of "shutdown procedure" that affects filesystems or block devices when going into suspend modes. There is, however, some driver unloading, and the network will be interrupted (which affects at least NBD and DRBD). At least suspend-to-RAM has huge security risks when using dm-crypt encrypted systems.

Assumptions

The information presented in the introduction leads to two questions:

  1. Which setup (i.e. which way to stack all the different devices) is "the right one"?

    • The definite answer:

      None; each of them is valid. Every user might have a different environment and different wishes, and we should not force them to go any way that we personally prefer. As an example that every setup has its advantages and disadvantages:

    • ...--> lvm --> dm-crypt --> filesystem

      • Here, every filesystem will "have" different dm-crypt keys, making it possible to "decrypt" them independently.
    • ...--> dm-crypt --> lvm --> filesystem

      • Here, all LVs and therefore all filesystem will "have" the same dm-crypt keys, making it possible to "decrypt" all "at once".
      In both setups, things like resizing should still be possible. It should be quite clear that both are valid.
  2. Where do we allow the root-filesystem?

    • The definite answer:

      Everywhere. It would be an unjustifiable restriction if we e.g. restricted the root-fs to plain partitions or to being on "only one" intermediate block layer (e.g. LVM). For example, at least one block layer in between is required to support fully encrypted systems, where even the root-filesystem is encrypted and one boots from e.g. a USB stick. This is especially interesting because with modern filesystems that do checksumming (e.g. btrfs) one gets a kind of integrity/authenticity for free, in addition to the encryption by dm-crypt, as we then have something like a MAC. But even without checksumming, fully encrypted systems are nearly unmodifiable, as an attacker does not know what to modify and to which value (if he wants, for example, to compromise the system).

Problems

In order to get everything outlined above working, many problems have to be solved, including the following (add more if you find any). Most of them are probably not Debian-only :) .

How-to "Tell" The Initramfs Image And The Init-System The Order In Which Block Devices/Layers Must Be Created And Filesystems Mounted (“Problem I”)

For both,

  • the initramfs image, in order to get the root-filesystem and resume-device(s) as well as for
  • the init-system, in order to get the non-root-filesystem

we must find a way to tell them how (meaning in which order) to do so.
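
One conceivable way to do this would be a declarative table that lists every layer together with the device(s) it sits on, so that both the initramfs and the init-system could derive the correct order from it. The following file name and format are purely hypothetical, just to illustrate the idea:

# Hypothetical /etc/blocklayers.tab -- NOT an existing configuration file.
# <name>     <type>      <built on top of>
md0          md          /dev/sda1,/dev/sdb1,/dev/sdc1
cryptpv      dm-crypt    /dev/md0
vg00         lvm-vg      /dev/mapper/cryptpv
root         ext4        /dev/vg00/root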

Detailed discussion and proposed solutions can be found and should take place here.

How-to Add Only Strictly Necessary Stuff Into The Initramfs Image (“Problem II”)

In order to add only the strictly necessary stuff into the initramfs image, we need to find out what is needed, by determining which kinds of block devices / block layers are either directly or indirectly used by the root-filesystem and the resume devices.
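
A minimal sketch of how this dependency information could be gathered at runtime, assuming util-linux's findmnt and lsblk are available (this is not what the Debian hook scripts currently do):

#!/bin/sh
# List every block device the root-filesystem directly or indirectly
# depends on, so that only the matching tools need to go into the image.
rootdev=$(findmnt -n -o SOURCE /)
echo "root-filesystem is on: $rootdev"

# "lsblk -s" walks from the given device down to the physical disks,
# printing every intermediate layer (dm-crypt, LVM, md, ...) on the way.
lsblk -s -o NAME,TYPE "$rootdev"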

Detailed discussion and proposed solutions can be found and should take place here.

How-to Keep Cryptsetup Secure When Booting (“Problem III”)

I recently described a kind of "meta-attack" (others have probably found it before), where an attacker can trick a user by replacing his (fully encrypted) physical volumes with compromised ones.

This is probably not the most important issue here, but it is related to the boot process, and therefore we might want to have a look at it.

See here for the (hopefully clear) description: http://www.saout.de/pipermail/dm-crypt/2010-June/000856.html

Detailed discussion and proposed solutions can be found and should take place here.

How-to shut the system down cleanly (halt or reboot) (“Problem IV”)

There is also some discussion on this at: http://thread.gmane.org/gmane.linux.kernel/1003210

I try to explain the following as well as I can, but others (Milan?) surely have more knowledge and should please correct my mistakes and fill in what is missing.

When shutting down, we need to do the following (naively):

  1. Unmount all non-root-filesystems
  2. Close all block devices/layers (in the correct order), which were solely used by these non-root-filesystems
  3. Unmount the root-filesystem.
  4. Close all block devices/layers (in the correct order), which were solely used by the root-filesystem.
  5. Halt or Reboot.

However, it does not (and cannot) work like this.

  • Steps 1-2: These are already not that easy; in principle they are as complex as the other way round, i.e. setting everything up (discussed above).
  • Step 3:
    • We cannot really unmount the root-filesystem. I guess the only reason is that it is still in use (or is there any other?): the init-system is still running, it e.g. still wants to read later init-scripts, and we want to somehow call halt or shutdown. What we do instead is remount the root-filesystem read-only.

  • Step 4: This doesn't work either: as the root-filesystem is not really unmounted, the block devices at all layers below it think (correctly) that they're still in use.
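
Summarized as a hedged sketch (device and mount point names are assumptions), the shutdown therefore ends up doing something like this:

# 1. Unmount the non-root-filesystems.
umount /home
umount /srv

# 2. Close the layers that were solely used by them.
cryptsetup luksClose crypthome
vgchange -a n vg01
mdadm --stop /dev/md1

# 3. The root-filesystem cannot be unmounted; it is only remounted read-only.
mount -o remount,ro /

# 4. Closing the layers below / would fail with "busy" errors, so it is skipped.

# 5. Halt or reboot.
halt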

Detailed discussion and proposed solutions can be found and should take place here.

Is our root-filesystem (data) really safe after we remounted read-only?

Probably yes. Milan tried to describe the technical details in the lkml thread referenced above. (Correct me if anything is wrong!!!)

  • It seems that unmount / remount,ro does a flush, and this flush goes down through all the different block layers. This uses some technique like barriers, which are block-IO-barriers and not the same ones as e.g. those from ext4 (right??).

  • mount -o remount,ro / waits until all this flushing is really finished before it returns.
  • I hope the init-system (what about sysvinit??) would wait forever if mount does not return, and does not simply try to kill it and proceed with the next step (e.g. halt/reboot). (Petter, can you tell us?)
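
In shell terms the crucial step boils down to something like the following sketch (mount only returns once the remount, and with it the flush, has finished):

# Flush everything and make the root-filesystem read-only;
# -n avoids writing to /etc/mtab, which may no longer be possible.
sync
mount -n -o remount,ro /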

At least one question is still open (to me): what if remount,ro and the flushing cannot finish because some block layer below has to wait or blocks? Consider e.g. LVM with the cluster extension (clvmd): the network could be down. Or would it (the daemon) already be gone at that stage, having cleanly stopped and locked/unlocked everything as necessary before?

Detailed discussion and proposed solutions can be found and should take place here.

We cannot close all the block devices/layers below the remounted,ro root-filesystem. What are the resulting problems?

Are there any problems at all? Yes!

What would happen? We would somehow (perhaps with error messages) continue and move on to halting/rebooting, of course after the root-filesystem has been remounted read-only.

  • As mentioned just above, our data is probably safe, as everything has already been safely flushed out.
  • Can there be any other problems when not closing the underlying block devices? E.g. I could imagine that there might be problems with RAID metadata, or similar things which are perhaps not flushed correctly; or, even if they were flushed, an ongoing process could still be running that would need a "clean stop" analogous to the remount,ro, but the only thing providing this is "close the device" (which we cannot use, however).
  • The user gets annoying error messages about devices that could not be closed. This is not critical, but we should solve it. It might also irritate non-expert users.
  • There are, however, as also discussed in the lkml thread mentioned above, quite serious dm-crypt security problems arising from this, which are discussed below.
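
As an illustration of these failing close attempts (device names are assumptions): with / still mounted read-only on top of the stack, the kernel correctly considers every layer below it in use, so all of the following simply fail:

vgchange -a n vg00            # fails: the root LV is still open
cryptsetup luksClose cryptpv  # fails: the dm-crypt mapping is still in use
mdadm --stop /dev/md0         # fails: the array is still in use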

Detailed discussion and proposed solutions can be found and should take place here.

How-to Keep Cryptsetup Secure When Shutting down (“Problem V”)

Milan noted in the lkml thread that not closing the dm-crypt device(s) leads to (I guess at least?) one security problem: the key is not deleted from memory. Perhaps Milan can describe this (and any solutions), please :) ... and hopefully find the perfect solution ;) .

Detailed discussion and proposed solutions can be found and should take place here.

Further Problems

Are there any further problems? For example related to suspend modes or security issues?

Other Notes

I've already played with such thoughts and Milan spoke it out loudly in the lkml thread... one solution for some of the above problems (but not all!) would be to "invent" the opposite of initramfs images, "uninitramfs images" ;) , which could then "easily" clean up and really unmount the root-filesystem.

Current Situation In Debian

This shows the current status in Debian. Please correct anything that's wrong.

Affected Systems And Packages

The Problems From Above And Their Status In Debian

How-to "Tell" The Initramfs Image And The Init-System The Order In Which Block Devices/Layers Must Be Created And Filesystems Mounted

AFAIK, all this is statically defined at the moment in both the initramfs boot scripts and the init-scripts.

For the init-system the order seems to be: physical disks/partitions --> md --> dm-crypt --> lvm --> dm-crypt --> filesystem

Each of them is optional, e.g. not having mdadm installed means no md.

The double dm-crypt is done via cryptdisks-early and cryptdisks.

I guess the order is solely defined via the LSB init-script headers.
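
For reference, such an LSB header looks roughly like the following (the field values here are only illustrative and not copied from any actual Debian script):

### BEGIN INIT INFO
# Provides:          example-blocklayer
# Required-Start:    checkroot
# Required-Stop:     umountfs
# Default-Start:     S
# Default-Stop:      0 6
# Short-Description: Illustrative header only, not a real Debian script.
### END INIT INFO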

For the initramfs image the order seems to be: physical disks/partitions --> lvm --> dm-crypt --> filesystem

Not sure whether there's also some kind of "cryptdisks-early" here, or where mdadm fits in.

Not sure how the order is defined here... there are these ORDER files in ../scripts/local-*/, there is the prereqs mechanism in the boot scripts, and IIRC the remaining ones are simply run in alphabetical order?
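
The prereqs mechanism mentioned above is the usual boilerplate at the top of every initramfs-tools boot script; the PREREQ value "mdadm" here is only an example:

#!/bin/sh
PREREQ="mdadm"
prereqs()
{
        echo "$PREREQ"
}
case "$1" in
prereqs)
        prereqs
        exit 0
        ;;
esac
# ... the actual work of the boot script follows here ...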

How-to Add Only Strictly Necessary Stuff Into The Initramfs Image

Some packages already do some checking. For example, cryptsetup tries to include only the stuff that is necessary. But I guess (Jonas, correct me) this is limited: I assume the scanning starts from /etc/fstab, but will this work if the dm-crypt device is not directly below the root-filesystem but only at some indirect place further down (see the scenarios above)?

Other packages, e.g. lvm2, simply always include and run their stuff within the initramfs image when installed (see my notes in the "Notes" section below ;) ).

How-to Keep Cryptsetup Secure When Booting

Currently, we are probably vulnerable to that issue, at least IMHO ;)

How-to shut the system down cleanly (halt or reboot)

Same issues as mentioned above.

Effectively, we even have only this (at least as far as I understand it):

  1. Unmount all non-root-filesystems
  2. Close _ALL_ block devices/layers (in the correct order)
  3. remount,ro the root-filesystem.
  4. No further attempt to close the block devices/layers (in the correct order) which were solely used by the root-filesystem.

  5. Halt or Reboot.

Is our root-filesystem (data) really safe after we remounted read-only?

Same issues as mentioned above.

We cannot close all the block devices/layers below the remounted,ro root-filesystem. What are the resulting problems?

Same issues as mentioned above.

Some packages e.g. lvm2 let the init-script fail if closing didn't work, leading to a big red "failed" on the terminal ;). Others, e.g. cryptsetup (at least until the next upload), print a warning but do not let the init-script fail (right?).

How-to Keep Cryptsetup Secure When Shutting down

As we cannot close the dm-crypt devices lying below an encrypted root-filesystem, we're currently vulnerable.

Bugs

There are already some related bugs in the BTS, including:

Please add them if you find more.

Notes

calestyo

  • I'm really not the kernel/block layers/etc. expert. Much of the information I wrote down here is probably borrowed from other people or at least inspired by them. Many people gave me - for my level of knowledge - great insight into how these things work (including, but not limited to, Milan Broz, Jonas Meurer, Arno Wagner, and many I've probably forgotten). Further, my apologies if I pointed fingers at any Debian package, showing that something does not yet work there. I don't mean to be offensive, or to imply that I'm smarter or so ;)... I just want to contribute :) .

  • I guess all these issues are, at least mostly, not only Debian-related. Therefore I think we should try to do team work with other distros :) , which does not mean we should throw away our own tools in places where they use different ones.

  • I guess that to really solve all these issues cleanly and securely, it is critical to have all experts (especially the upstream gurus like Milan ;) ) as well as all the affected maintainers at one (virtual) table, pulling on the same rope. The latter probably includes cryptsetup, lvm2, mdadm, initscripts (as they do the booting/umountroot/etc.) and many other packages I can't think of right now.

ToDo

Things we might need to consider in addition

  • Although iSCSI is similar to NBD (given that it is a block device via network), it might be more complicated in reality?
  • any special things to consider when using network filesystems like NFS(v3,v4), GFS, etc. (especially as root-filesystem)?
  • If we allow using the same type of block layer several times (e.g. as in
    •    physical disk/partition --> dm-crypt --\                        /-> '''root'''-filesystem (e.g. xfs)
         physical disk/partition --> dm-crypt --+-> md (RAID10) --> lvm -+-> filesystem (e.g. ext4)
         physical disk/partition --> dm-crypt --/                        \-> filesystem (e.g. jfs)
      or even more complicated (e.g. having another dm-crypt on top of the above lvm, and still below the root-filesystem), this could add a lot of complexity for us.

      Taking cryptsetup as an example (which I know a bit better), we'd need to include keyscripts (as well as their configuration) and checkscripts for all of the devices in the initramfs image.

  • cryptsetup, especially when using plain dm-crypt devices, must always check whether the decryption was correct (otherwise data corruption on read/write could occur). For LUKS this is easy; for plain dm-crypt devices it is not (Debian can, however, check whether any valid filesystem is on top of the dm-crypt device, as in the sketch below). However, this is AFAIK not done within the initramfs (and there's probably nothing similar for resume devices?!). Further, if the filesystem is not directly on top of dm-crypt, we cannot check for it anyway. (See also some notes in 587222.)
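
A rough sketch of such a check (the device and mapping names are assumptions, and this is not what the cryptsetup package currently does in the initramfs):

# Set up a plain dm-crypt mapping and check whether something that looks
# like a valid filesystem sits on top of it, as a hint that the passphrase
# and cipher options were correct.
cryptsetup create cryptdata /dev/sdb1
if blkid /dev/mapper/cryptdata >/dev/null; then
        echo "known filesystem signature found -- mapping is probably correct"
else
        echo "no known signature found -- wrong passphrase or options?" >&2
        cryptsetup remove cryptdata
fi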