Btrfs is intended to address the lack of pooling, snapshots, checksums, and integral multi-device spanning in Linux file systems, these features being crucial as the use of Linux scales upward into larger storage configurations. Btrfs is designed to be a multipurpose filesystem, scaling well on very large block devices.
Even though Btrfs has been in the kernel since 2.6.29, the developers state that "as of 2.6.31, we only plan to make forward compatible disk format changes". Please note that "forward compatible changes" means that booting a newer kernel and then booting back into an older kernel is risky (further citations needed). The workaround is to mount the btrfs volume read-only while testing the newer kernel. When using a kernel from backports, this means your rescue disk will also need a kernel at least as recent as the one in backports. With every release the developers improve the user and management tools and make them easier to use. For more information about Btrfs, follow the links in the See also section.
Ext2/3/4 filesystems should be upgradable to Btrfs, but this is not recommended because members of linux-btrfs have noted that a number of issues have been linked to conversions, and that the convert utility needs further work (Btrfs Wiki, Conversion from Ext3). Preparation for a rewrite of btrfs-convert began in btrfs-progs-4.4 (Btrfs progs release 4.4), and the major rewrite was released in v4.6 (2016-06-10). As of v4.6, it is still strongly recommended to back up, wipefs -a, mkfs.btrfs, and restore from backup rather than use btrfs-convert. Please do not use btrfs-convert with data you value.
One is substantially less likely to run into issues with a simple two disk mirror mounted noatime than an n-disk raid6 mounted with compress=lzo and autodefrag. Many people have reported years of btrfs usage without issue, and this wiki page will continue to be updated with configuration recommendations known to be good and cautions against those known to cause issues. eg: the number of snapshots per volume and per subvolume must be carefully monitored and/or automatically pruned, because too many snapshots can wedge the filesystem into an out of space condition or gravely degrade performance (Duncan, 2016-02-16, linux-btrfs). There are also reports that IO becomes sluggish and lags with far fewer snapshots, eg: only 86/subvolume on linux-4.0.4; this might be fixed in a newer kernel (Pete, 2016-03-11, linux-btrfs).
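The snapshot monitoring mentioned above can be sketched as follows. This is a minimal example, assuming that "btrfs subvolume list -s" prints one line per snapshot on the volume; the function name check_snapshot_count and the limit passed to it are hypothetical and are not upstream recommendations.

```shell
#!/bin/sh
# Hypothetical helper (not part of btrfs-progs): reads a snapshot listing
# on stdin, counts the non-empty lines, and compares the count to a limit.
check_snapshot_count() {
    limit="$1"
    count=$(awk 'NF { n++ } END { print n + 0 }')
    if [ "$count" -gt "$limit" ]; then
        echo "WARNING: $count snapshots exceeds limit of $limit"
        return 1
    fi
    echo "OK: $count snapshots (limit $limit)"
}

# Usage on a real system (as root); the limit of 12 is only an example:
#   btrfs subvolume list -s /btrfs_mountpoint | check_snapshot_count 12
```

A non-zero return value makes the helper easy to call from a cron job that sends mail on failure.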
In my personal and professional opinion, btrfs before linux-4.4 and btrfs-progs-v4.4 is simply not worth the risk; 4.4 is the point where one could stop worrying "is my btrfs volume going to mysteriously blow up tomorrow, even with a simple use-case". Please use a kernel and progs >=4.4 to avoid needless maintenance headaches and needless risk. I also recommend sticking with the newest upstream LTS kernel rather than tracking the latest version in backports. For example, tracking the linux-image backport can result in bugs such as this one, which forces a reboot --NicholasDSteeves
A Btrfs volume created on a raw partition can be used to boot using grub-pc. If booting with EFI firmware, please consult UEFI for ESP partitioning requirements; if you boot using EFI and you'd like your rootfs to be on btrfs, you must partition your drive[s]! As of 2016-03-14 it is highly recommended to use a swap partition rather than a manually configured swap file through a loop-device; classic swap files are not supported (Btrfs Wiki Btrfs FAQ).
The DebianInstaller can format and install to single-disk Btrfs volumes. The way that Btrfs combines multiple disks to create a single volume is not compatible with the data model of the current installer (#686097). Various people have described ways of installing Debian onto a raid1 Btrfs without too much trouble. It is also possible to install normally, then add another disk, then rebalance as raid1 (Btrfs Wiki, Converting to RAID).
As of linux-4.4.16 the official upstream status of btrfs is "Btrfs is under heavy development, and is not suitable for any uses other than benchmarking and review" (git.kernel.org). In a linux-btrfs thread where a user is unable to rebalance his btrfs volume due to out of space errors, there is a discussion about changing this to "you should have backups and be prepared to use them if you're using btrfs, and...it's not suitable for production systems yet" (Duncan, 2016-03-06, linux-btrfs). Please refer to BackupAndRecovery if you do not yet have a backup strategy in place.
Subvolumes cannot currently be mounted with different btrfs-specific options; the first btrfs line for a given volume in /etc/fstab takes effect. eg: you cannot mount / with noatime and /var with nodatacow,compress=lzo (Btrfs Wiki, Mount options).
At present, nodatacow implies nodatasum; this means that anything with the nodatacow attribute does not receive the benefits of btrfs' checksum protection and self-healing (for raid levels >= 1). Consequently, it is almost always preferable to disable COW in the application.
"Do not use BTRFS raid6 mode in production, it has at least 2 known serious bugs that may cause complete loss of the array due to a disk failure. Both of these issues have as of yet unknown trigger conditions, although they do seem to occur more frequently with larger arrays" (Austin S. Hemmelgarn, 2016-06-03, linux-btrfs).
Additionally, "RAID5 with one degraded disk won't be able to reconstruct data on this degraded disk because reconstructed extent content won't match checksum. Which kinda makes RAID5 pointless" (Andrei Borzenkov, 2016-06-24, linux-btrfs).
Compress=lzo might be dangerous. In the 'linux-btrfs' thread "Trying to rescue my data" (2016-06-26) it has finally come to light that mounting with compress=lzo might be something that causes btrfs volumes to break, because "if it gets too many [csum errors] at once, it *does* unfortunately crash, despite the second copy being available and being just fine as later demonstrated by the scrub fixing the bad copy from the good one" (Duncan). Later in the thread Steven Haigh confirms the behaviour and suggests "maybe here lays a common problem"?
- Mounting with -o compress will amplify fragmentation. All COW filesystems necessarily fragment. There is also a relation between the number of snapshots and the degree of fragmentation. Fragmentation manifests as higher than expected CPU usage on SSDs and increased read latency on rotational disks, because each of the references present in a frequently updated file will tend to necessitate a mechanical seek. Because the focus of btrfs development is currently on stabilisation, bug fixes, and core features, seek optimisation has not yet become a priority.
- Mounting with -o autodefrag will duplicate reflinked or snapshotted files when you run a balance. Also, whenever a portion of the fs is defragmented with "btrfs filesystem defragment" those files will lose their reflinks and the data will be "duplicated" with n-copies. The effect of this is that volumes that make heavy use of reflinks or snapshots will run out of space. At this point in time, to avoid such unexpected surprises and for peace of mind, please minimize the use of snapshots, and use a deduplicating backup manager. And remember snapshots != backups!
Once again, please do not use btrfs' raid5 or raid6 profiles at this point in time! In the thread "[BUG] Btrfs scrub sometime recalculate wrong parity in raid5" Chris Murphy found the following while testing the btrfs raid5's ability to recover from csum errors:
- I just did it a 2nd time and both file's parity are wrong now. So I did it several more times. Sometimes both files' parity is bad. Sometimes just one file's parity is bad. Sometimes neither file's parity is bad. It's a very bad bug, because it is a form of silent data corruption and it's induced by Btrfs. And it's apparently non-deterministically hit (2016-06-26).
In another email in this thread, Duncan suggested "And what's even clearer is that people /really/ shouldn't be using raid56 mode for anything but testing with throw-away data, at this point. Anything else is simply irresponsible" (linux-btrfs, 2016-06-26).
As a btrfs volume ages, you may notice performance degrade. This is because btrfs is a Copy On Write file system, and all COW filesystems eventually reach a heavily fragmented state; this includes ZFS. Over time, logs in /var/log/journal will become split across tens of thousands of extents. This is also the case for sqlite databases such as those that are used for Firefox and a variety of common desktop software. Fragmentation is a major contributing factor to why COW volumes become slower over time.
ZFS addresses the performance problems of fragmentation using an intelligent Adaptive Replacement Cache (ARC); the ARC requires massive amounts of RAM. Btrfs took a different approach, and benefits from—some would say requires—periodic defragmentation. In the future, maintenance of btrfs volumes on Debian systems will be automated using btrfsmaintenance. For now use:
sudo ionice -c idle btrfs filesystem defragment -t 32M -r /path/to/directory
This command must be run as root, and it is recommended to ionice it to reduce the load on the system. To further reduce the IO load, flush data after defragmenting each file using:
sudo ionice -c idle btrfs filesystem defragment -f -t 32M -r /path/to/directory
Target extent size is a little-known but, for practical purposes, absolutely essential argument. By default btrfs fi defrag only defrags files of less than 256KiB, because it does not touch extents bigger than the target extent size, which defaults to 256KiB! While the argument "-t 1G" would seem to be better than "-t 32M", because most volumes will have a 1GiB chunk size, in practice this is not the case. Additionally, if you have a lot of snapshots or reflinked files, please use "-f" to flush data for each file before going to the next file. As of btrfs-progs-4.6.1, "-t 32M" is still necessary, but "-t 32M" is the default after btrfs-progs-4.7. Please consult the following linux-btrfs thread for more information.
- Which package contains the tools?
- Does btrfs really protect me from hard drive corruption?
Yes, but this requires at least two disks in raid1 profile. (eg: -m raid1 -d raid1). Without at least two copies of data, corruption can be detected but not corrected. Additionally, like for "mdadm or lvm raid, you need to make sure that the SCSI command timer (a kernel setting per block device) is longer than the drive's SCT ERC setting...If the command timer is shorter, bad sectors will not get reported as read errors for proper fixup, instead there will be a link reset and it's just inevitable there will be worse problems" (Chris Murphy, 2016-04-27, linux-btrfs). The Debian bug for this issue can be found here. For now do the following for all drives in the array, and then configure your system to change the SCSI command timer automatically on boot:
cat /sys/block/<dev>/device/timeout
smartctl -l scterc /dev
# echo -n ((the scterc value)/10)+10 to /sys/block/<dev>/device/timeout
The default value is 30 seconds, which should be fine for disks that support SCT and likely have low timeout values like 7 sec. For disks that fail smartctl -l scterc, and thus do not support SCT, set the timeout value to 120. Consider a timeout of 180 to be extra safe with large consumer-grade disks.
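The arithmetic above can be sketched as a small shell function. It assumes the SCT ERC value is given in tenths of a second, as smartctl reports it (for example 70 for 7 seconds); the function name recommended_timeout is hypothetical, and an empty argument stands for a disk without SCT support.

```shell
#!/bin/sh
# Hypothetical helper: compute the SCSI command timer in seconds from an
# SCT ERC value given in tenths of a second, as (scterc / 10) + 10.
# An empty argument means the disk reports no SCT ERC support.
recommended_timeout() {
    if [ -z "$1" ]; then
        echo 120                    # fallback for disks without SCT
    else
        echo $(( $1 / 10 + 10 ))
    fi
}

recommended_timeout 70              # 7-second SCT ERC setting, prints 17

# Usage on a real system (as root), with <dev> replaced by the disk name:
#   recommended_timeout 70 > /sys/block/<dev>/device/timeout
```

The value written to sysfs does not survive a reboot, so it still has to be reapplied at boot.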
- Does it support SSD optimizations?
Yes, Debian Jessie and later automatically detect non-rotational hard disks and ssd is added to the btrfs mount options. For more details on using SSDs with Debian, refer to SSDOptimization.
- What are the recommended options for installing on a pendrive, a SD card or a slow SSD drive?
When installing, use manual partitioning and select btrfs as the file system. On the first boot, edit /etc/fstab to use these options, which should give a noticeable improvement in speed and responsiveness:
/dev/sdaX / btrfs x-systemd.device-timeout=0,noatime,compress=lzo,commit=0,ssd_spread,autodefrag 0 0
- But I have a super-small pendrive and keep running out of space! Now what?
Using another system, you can try something like the following if your device is small:
mkdir /tmp/pendrive
mount /dev/sdX -o noatime,ssd_spread,compress /tmp/pendrive
btrfs sub snap -r /tmp/pendrive /tmp/pendrive/tmp_snapshot
btrfs send /tmp/pendrive/tmp_snapshot > /tmp/pendrive_snapshot.btrfs
umount /tmp/pendrive
wipefs -a /dev/sdX
mkfs.btrfs --mixed /dev/sdX
mount /dev/sdX -o noatime,ssd_spread,compress /tmp/pendrive
btrfs receive -f /tmp/pendrive_snapshot.btrfs /tmp/pendrive
sync
btrfs fi sync /tmp/pendrive/
Now follow the procedure for converting a read-only snapshot to a live system and/or enabling / on a subvolume. Also, the bootloader needs to be reinstalled if your pendrive is a bootable OS drive and not just a data drive (Needs to be written --NicholasDSteeves).
- Does it support compression?
Yes, by adding compress=lzo or compress=zlib (depending on the level of compression or speed, lzo being faster and zlib having more compression):
/dev/sdaX / btrfs defaults,compress=lzo 0 1
Replace /dev/sdaX with your actual root device (UUID support in btrfs is a work-in-progress, but it works for mounting volumes; use the command blkid to get the UUID of all filesystems). In fact, there are many more options you can add; just look here. (Remember: all fstab mount options must be comma separated but NOT space separated, so do not insert a space after the comma or the equal symbol).
In order to check if you have written the options correctly before rebooting and therefore before being in trouble, run this command as root:
mount -o remount /
If no error is reported, everything is OK. Never try to boot with a broken fstab file, or you will have to recover it manually, which is a more complicated procedure.
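A stray space inside the mount options can also be caught before remounting. The sketch below is a heuristic, not a full fstab validator: it assumes every entry writes out all six fields (fstab permits omitting the last two), so a space after a comma shows up as an extra field. The function name check_fstab_fields is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read fstab content on stdin and report every entry
# that does not have exactly six whitespace-separated fields. A space
# inside the options column splits it into extra fields.
check_fstab_fields() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }    # skip blank lines and comments
        NF != 6 {
            printf "line %d has %d fields, expected 6\n", NR, NF
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Usage on a real system:
#   check_fstab_fields < /etc/fstab && mount -o remount /
```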
- But what if I just want to compress the files in a directory?
You can do this by applying the following two commands (for example for /var):
btrfs filesystem defragment -r -v -clzo /var
chattr +c /var
By adding the +c attribute you ensure that any new file created inside the folder is compressed.
- What are the recommended options for a rotational hard disk?
In fstab :
UUID=<the_device_uuid> /mount/point/ btrfs noauto,compress=lzo,noatime,autodefrag 0 0
The noauto option will prevent the system from freezing at boot if a non-system (and likely unplugged) device or partition is missing. Alternatively, if you are using systemd, the device is necessary for normal functioning of the system, and you want to limit the boot delay to 10 seconds when it is missing, you can try this. System boot will halt with an error if the device is not found:
UUID=<the_device_uuid> /mount/point btrfs x-systemd.device-timeout=10,noatime,compress=lzo,autodefrag 0 0
(Consider revoking this recommendation, because autodefrag, like -o discard, can trigger buggy behaviour. Also consider revoking the compress=lzo recommendation for rotational disks, because while it increases throughput for sequentially written compressible data, it also magnifies fragmentation...which means lots more seeks and increased latency -- NicholasDSteeves)
- Can I encrypt a btrfs installation?
Yes, you can, by selecting manual partitioning, creating an encrypted volume, and then creating a btrfs file system on top of it. For the moment, btrfs does not support direct encryption, so the installer uses cryptsetup; native encryption is a planned feature, and experimental patches have recently been submitted to enable it (Anand Jain, linux-btrfs, Add btrfs encryption support).
- Does it work on RaspberryPi?
Yes, and it improves filesystem I/O responsiveness a lot. You may have to convert the filesystem to btrfs first from a PC and change the filesystem type in /etc/fstab from ext4 to btrfs (just by changing the name) before the first boot. Look above for recommended sdcard options in /etc/fstab.
- Fsck.btrfs doesn't do anything, how do I verify the integrity of my filesystem?
Rather than a fsck, btrfs has two methods to detect and repair corruption. The first method executes as a background process for a mounted volume. It has a default IO priority of idle, and it strives to minimize the impact on other active processes; nevertheless, like any IO-intensive background job, it is best to run it at a time when the system is not busy. To run it:
btrfs scrub start /btrfs_mountpoint
To monitor its progress:
btrfs scrub status /btrfs_mountpoint
The second method checks an unmounted filesystem. It verifies that the metadata and filesystem structures of the volume are intact and uncorrupted. It should not usually be necessary to run this type of check. Please note that it runs read-only; this is by design, and there are usually better methods to recover a corrupted btrfs volume than to use the dangerous "--repair" option. Please do not use "--repair" unless someone has assured you that it is absolutely necessary. To run a standard read-only metadata and filesystem structures verification:
btrfs check -p /dev/sdX
btrfs check -p /dev/disk/by-partuuid/UUID
- Is there anything I can do to improve system responsiveness while running a scrub, balance, or defrag?
Yes, but only if the CFQ scheduler is enabled for the affected btrfs drives, because ionice requires the CFQ scheduler.
cat /sys/block/sdX/queue/scheduler
# Should return "noop anticipatory deadline [cfq]" for rotational disks
# If it does not, then
echo -n cfq > /sys/block/sdX/queue/scheduler
Use your preferred method to make this permanent (eg: /etc/rc.local, or a udev rule).
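The active scheduler is the name shown in square brackets. A minimal sketch for extracting it in a script follows; the function name active_scheduler is hypothetical, and the sample string mirrors the expected output quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print the scheduler name enclosed in square
# brackets from a line such as "noop anticipatory deadline [cfq]".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

echo "noop anticipatory deadline [cfq]" | active_scheduler    # prints cfq

# Usage on a real system:
#   active_scheduler < /sys/block/sdX/queue/scheduler
```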
- How can I quickly check to see if my btrfs volume has experienced errors, with per-device accounting of any possible errors?
If you have a new enough copy of btrfs-progs you get an at-a-glance overview of all devices in your pool by running the following:
btrfs dev stats /btrfs_mountpoint
For a healthy two device raid1 volume this command will output something like:
[/dev/sdb1].write_io_errs 0
[/dev/sdb1].read_io_errs 0
[/dev/sdb1].flush_io_errs 0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0
[/dev/sdc1].write_io_errs 0
[/dev/sdc1].read_io_errs 0
[/dev/sdc1].flush_io_errs 0
[/dev/sdc1].corruption_errs 0
[/dev/sdc1].generation_errs 0
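For unattended checks, the counters can be summed so that a single number tells you whether anything is wrong. This is a minimal sketch, assuming the output format shown above with the counter as the last field of each line; the function name total_device_errors is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "btrfs dev stats" output on stdin and print
# the sum of all error counters (the last field of every line).
total_device_errors() {
    awk '{ sum += $NF } END { print sum + 0 }'
}

# Usage on a real system (as root):
#   btrfs dev stats /btrfs_mountpoint | total_device_errors
```

A result of 0 corresponds to the healthy output above; any other value means at least one device has logged an error and the full per-device listing should be inspected.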
- COW on COW: Don't do it!
This includes overlayfs, unionfs, databases that do their own COW, certain cowbuilder configurations, and virtual machine disk images. Please disable COW in the application if possible. For example, for QEMU, refer to qemu-img(1) and take care to use raw images. If this is not possible, you can disable COW on a single directory like this
mkdir directory
chattr +C directory
New files in this directory will inherit the nodatacow attribute. Alternatively, nodatacow can be applied to a single file, but only for empty files
touch file
chattr +C file
Please read the earlier warning about using nodatacow. If your application supports integrity checks and/or self-healing, you will want to enable them if you use nodatacow for that application...but that might not be enough if you lose a whole disk!
- What happens if I mix differently sized disks in raid1 profile?
"RAID1 (and transitively RAID10) guarantees two copies on different disks, always. Only dup allows the copies to reside on the same disk. This is guaranteed is preserved, even when n=2k+1 and mixed-capacity disks. If disks run out of available chunks to satisfy the redundancy profile, the result is ENOSPC and requires the administrator to balance the file system before new allocations can succeed. The question essentially is asking if Btrfs will spontaneously degrade into "dup" if chunks cannot be allocated on some devices. That will never happen." (Justin Brown, 2016-06-03, linux-btrfs).
- Write section explaining what btrfs' raid1 and raid10 profiles actually are eg: 2 copies distributed on n devices, and that adding more devices does not make more copies; adding devices increases the size of the volume, but both raid1 and raid10 profiles always only make 2 copies. Adding more devices to increase redundancy is what upstream calls "raid1 profile n-copies" and no one is currently working on implementing this functionality.
- Add warning for current remount behavior when raid1 or raid10 experiences a failed device. Does it still add chunks in profile=single, creating a volume that has both degraded raid1 chunks and single chunks? If this still happens, then the volume locks to read-only the next time it is mounted.
- Write HOWTO for sbuild + schroot + btrfs, either here or somewhere else. (where should it go? --NicholasDSteeves)
- Write section on mlocate issues with btrfs (eg: allow bind mounts, mask off certain directories, etc.)
- More explicitly, address the dangers of going snapshot crazy, or using a loose and easy snapper config, because performance crashes somewhere between 250 and 300 snapshots per subvolume, and this also sometimes wedges the volume into an unmountable state. (More recently I've read more conservative estimates of no more than a dozen snapshots per subvolume, with a limit of 250 subvolumes--including snapshots)
- Warn about ways to innocently make a system unbootable, while innocently experimenting in a way that can't be worked around
Btrfs wiki: https://btrfs.wiki.kernel.org/
Primary manpages: btrfs(5) btrfs(8) mkfs.btrfs(8) btrfs-balance(8) btrfs-device(8) btrfs-filesystem(8) btrfs-property(8) btrfs-scrub(8) btrfs-show(8) btrfs-subvolume(8) btrfstune(8), and others from btrfs-tools.
Btrfs on Wikipedia
Btrfs mailing list: email@example.com