Btrfs

Btrfs was created to address the lack of pooling, snapshots, checksums, and integrated multi-device spanning in Linux file systems, particularly as the need for such features emerged when working at the petabyte scale. It aspires to be a multipurpose filesystem that scales well from massive block devices all the way down to cellular phones (Sailfish OS and Android). Because all reads are checksum-verified, Btrfs ensures that one's backups are not poisoned by silently corrupted source data (ZFS provides similar integrity guarantees).

History

Btrfs has been part of the mainline Linux kernel since 2.6.29, and Debian's Btrfs support was introduced in DebianSqueeze.

Ext2/3/4 filesystems can in principle be upgraded in place to Btrfs; while a btrfs-convert utility has existed for some time, its use is not recommended. For the time being, please back up, wipefs -a, mkfs.btrfs, and restore from backup, or replicate an existing Ext volume to the new Btrfs one (eg: using tar, cpio, rsync, et al); see the sketch below.
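
A minimal sketch of the backup-and-recreate approach (the device name, mountpoints, and backup destination are examples):

# Back up the existing Ext filesystem:
mkdir -p /mnt/old /mnt/new
mount /dev/sdX1 /mnt/old
tar -C /mnt/old -cpf /backup/old-fs.tar .
umount /mnt/old

# Recreate the filesystem as Btrfs and restore:
wipefs -a /dev/sdX1
mkfs.btrfs /dev/sdX1
mount /dev/sdX1 /mnt/new
tar -C /mnt/new -xpf /backup/old-fs.tar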

"Google is evaluating btrfs for its potential use in android, but currently the lack of native file-based encryption unfortunately makes it a nonstarter" (Filip Bystricky, linux-btrfs, 2017-06-09).

Facebook has "now deployed [btrfs] on millions of servers, driving significant efficiency gains", because "btrfs helped eliminate priority inversions caused by the journaling behavior of the previous filesystem, when used for I/O control with cgroup2", and "btrfs is the only filesystem implementation that currently works with resource isolation" (Facebook open-sources new suite of Linux kernel components and tools, code.fb.com, 2018-10-30).

Linux ≥ 5.5 and btrfs-progs ≥ 5.4 finally bring support for checksum algorithms that are stronger than CRC32C: xxHash, SHA256, and BLAKE2. Additionally, with these releases the raid1c3 and raid1c4 profiles have finally been introduced. Briefly, these are both profiles that provide data redundancy. Where the well-tested raid1 profile supports two copies on N devices, raid1c3 supports three copies, and raid1c4 supports four. Rebalancing to these new profiles is also supported, allowing one to, for example, add an extra disk for redundancy before replacing a failing disk, and then drop back down to the previous level. Note that this operation will rewrite all data twice.
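
A minimal sketch of both features (device names and mountpoint are examples; requires Linux ≥ 5.5 and btrfs-progs ≥ 5.4):

# Create a volume using a stronger checksum algorithm:
mkfs.btrfs --csum xxhash /dev/sdX1

# Temporarily raise a raid1 volume from two to three copies, eg before
# replacing a failing disk, then drop back down:
btrfs balance start -dconvert=raid1c3 -mconvert=raid1c3 /mnt
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt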

Status

Somewhat official upstream status is available from The Btrfs Wiki: Status and from git.kernel.org: btrfs.txt.

The DebianInstaller can format and install to single-disk Btrfs volumes, but does not yet support multi-disk btrfs volumes nor subvolume creation (Bug #686097). Daniel Pocock has a good article on how to Install Debian wheezy and jessie directly with btrfs RAID1; strictly speaking, however, it showcases Btrfs' integrated multi-device flexibility: install to a single disk, add a second disk to the volume, then rebalance while converting all data and metadata to the raid1 profile (see the sketch below).
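
A minimal sketch of that single-disk-to-raid1 conversion (the device name is an example):

# Add a second disk to the mounted volume, then convert all data and
# metadata to the raid1 profile:
btrfs device add /dev/sdb1 /
btrfs balance start -dconvert=raid1 -mconvert=raid1 /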

Two-disk raid1-profile Btrfs volumes created on msdos or gpt partitions are bootable using grub-pc or grub-efi without a dedicated /boot, and it should also be possible to boot from a volume created on a raw disk using grub-pc. If booting with EFI firmware, consult UEFI for additional ESP partitioning requirements. Please note that if you boot using EFI and would like your rootfs to be on btrfs, you must partition your drive[s]!

While support for swap files was added in linux-5.0, it is highly recommended to use a dedicated swap partition instead. Furthermore, enabling swap using a virtual block (loop) device is dangerous, because this "will only cause memory allocation lock-ups" (Martin Raiber, linux-btrfs).

When using DebianStretch, please use a backported kernel and btrfs-progs from Backports. DebianBuster has good btrfs support out-of-the-box. DebianJessie users are urged to upgrade to a newer Debian release.

Users who do not yet have a backup strategy in place are urgently recommended to consult BackupAndRecovery, and to regularly verify that these backups are restorable.

Warnings

Here are some of Btrfs' shortcomings:

Maintenance

As a btrfs volume ages, its performance may degrade. This is because btrfs is a Copy On Write (COW) file system, and all COW filesystems eventually reach a heavily fragmented state; this includes ZFS. Over time, frequently appended or updated-in-place files become split across tens of thousands of extents. This affects unrotated logs and databases, such as those used by Firefox and a variety of common desktop software. Fragmentation is a major contributing factor to why COW volumes become slower over time.

ZFS addresses the performance problems of fragmentation using an intelligent Adaptive Replacement Cache (ARC), but the ARC requires massive amounts of RAM and only speeds up access to the hottest (most frequently and consistently accessed) data and metadata. Btrfs took a different approach and benefits from (some would say requires) periodic defragmentation. Btrfsmaintenance can be used to automate defragmentation and other btrfs maintenance tasks.
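
For those not using btrfsmaintenance, a hypothetical sketch of a monthly defragmentation job (the path and extent-size target are examples):

#!/bin/sh
# Hypothetical /etc/cron.monthly/btrfs-defrag: recursively defragment /home,
# rewriting extents smaller than 32M, at idle IO priority.
ionice -c idle btrfs filesystem defragment -r -t 32M /home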

That said, it is almost certain that a database's own defragmentation/compaction operation will result in a better on-disk layout than "btrfs filesystem defragment". For example, "notmuch compact" produces a state that has between 10x and 100x fewer extents than the "btrfs filesystem defragment" equivalent.

Performance Considerations and Tuning

Recommendations

Many people have used btrfs for upwards of three years without issue, and this wiki page will continue to be updated with configuration recommendations known to be good, and with cautions against those known to cause issues.

  1. Use two (ideally three) equally sized disks, partition them identically, and add each partition to a btrfs raid1-profile volume (see the sketch after this list). Alternatively, dedicate one third of the disks to holding backups, because not much benefit in throughput or IOPS is yet gained by using btrfs raid1.
  2. Do not enable or use transparent filesystem compression with a mount option, in fstab, with "chattr +c", or with "btrfs filesystem defrag".
  3. Do not use quotas/qgroups.
  4. Keep regular backups and use a backup program that supports deduplication (eg: borgbackup).
  5. Do not enable mount -o discard, autodefrag, or space_cache=v2.
  6. Overprovision your SSD when partitioning so periodic trim won't be needed.
  7. Periodically run btrfs defrag against source subvolumes.
  8. Never run btrfs defrag against a child subvolume (eg: a snapshot), because doing so breaks shared extents and can multiply disk usage.
  9. Ensure that the number of snapshots per volume/filesystem never exceeds 12; two or three times that might not cause ill effects, but keeping the count in the small double digits provides the best odds of avoiding severe performance issues and out-of-space conditions. On the upside, many more btrfs snapshots can be taken before performance crashes than with LVM snapshots, where even a single snapshot can introduce a performance crash.
  10. Take care not to fill the volume beyond 90%. If this occurs, it will become necessary to run periodic balances to consolidate free space into contiguous chunks, and performance will become less predictable (eg: often poor).
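
A minimal sketch of recommendation 1 (device names and mountpoint are examples):

# Create a two-disk raid1-profile volume (data and metadata) and mount it:
mkfs.btrfs -m raid1 -d raid1 /dev/sda2 /dev/sdb2
mount /dev/sda2 /mnt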

FAQ

Which package contains the tools?

btrfs-tools in Debian 6 (squeeze) to Debian 8 (jessie), and btrfs-progs thereafter. Most interaction with Btrfs' advanced features requires these tools.

Does btrfs really protect my data from hard drive corruption?

Yes, but this requires at least two disks in raid1 profile (eg: -m raid1 -d raid1). Without at least two copies of data, corruption can be detected but not corrected. Btrfs raid5 or raid6 profiles will not protect your data. Additionally, "like for mdadm or lvm raid, you need to make sure that the SCSI command timer (a kernel setting per block device) is longer than the drive's SCT ERC setting...If the command timer is shorter, bad sectors will not get reported as read errors for proper fixup, instead there will be a link reset and it's just inevitable there will be worse problems" (Chris Murphy, 2016-04-27, linux-btrfs). The Debian bug for this issue can be found here. For now, do the following for all drives in the array, and then configure the system to set the SCSI command timer automatically on boot (see the udev sketch below):

cat /sys/block/<dev>/device/timeout
smartctl -l scterc /dev/<dev>

# Set the timeout to ((the scterc value)/10)+10 seconds; eg, for an scterc
# value of 70 (7.0 seconds):
echo 17 > /sys/block/<dev>/device/timeout
The default value is 30 seconds, which should be fine for disks that support SCT ERC and likely have low timeout values like 7 seconds. For disks that fail smartctl -l scterc, and thus do not support SCT ERC, set the timeout value to 120. Consider a timeout of 180 to be extra safe with large consumer-grade disks.
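
To set the timer automatically on boot, one option is a udev rule along these lines (a hypothetical sketch; the file name and the 180-second value are examples, and the match should be narrowed to the drives that need it):

# /etc/udev/rules.d/60-scsi-timeout.rules (hypothetical)
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"
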
Does it support SSD optimizations?

Yes. For more details on using SSDs with Debian, refer to SSDOptimization. NOTE: Using "-o discard" with btrfs is generally unsafe. For an up-to-date discussion relevant to anything before Debian 11/bullseye see [LSF/MM TOPIC] More async operations for file systems - async discard (linux-btrfs via spinics).

What are btrfs' raid1 and raid10 profiles?

It is not classic RAID1, but rather 2 copies distributed on n devices. Adding more devices does not make more copies; it increases the size of the volume, but both the raid1 and raid10 profiles always make only 2 copies. Using more devices to hold more copies is what upstream calls "raid1 profile n-copies", now addressed by the raid1c3 and raid1c4 profiles described above. Btrfs' raid10 profile is not currently optimised and usually performs identically to, or worse than, the same disks in raid1 profile. Given the raid10 profile's added complexity, raid1 should continue to be preferred at this time.

Does it support compression?

Yes, but consider this functionality experimental. Add compress=lzo, compress=zlib, or compress=zstd according to whether the priority is throughput (lzo), best compression and fewest bugs (zlib), or something in between the two (zstd). If "=choice" is not specified then zlib will be used:

/dev/sdaX /  btrfs defaults,compress=choice 0 1

Change /dev/sdaX to your actual root device (UUID support in btrfs is a work-in-progress, but it works for mounting volumes; use the command blkid to get the UUID of all filesystems). Labels are also supported.

How do I transparently compress files in a directory?

For example, for /var:

btrfs filesystem defragment -r -v -clzo /var
chattr +c /var

Adding the +c attribute ensures that any new file created inside the directory is compressed.

What are the recommended options for installing on a pendrive, an SD card or a slow SSD drive?

When installing, use manual partitioning and select btrfs as the file system. On first boot, edit /etc/fstab with these options for possible improvements in throughput and latency (compression is used here because it is assumed that the pendrive contains throwaway or easily replaceable data). Zstd compresses better than lzo, but requires a separate /boot partition that does not use zstd compression, because grub does not support zstd:

/dev/sdaX / btrfs noatime,compress=lzo,commit=0,ssd_spread,autodefrag 0 0

But I have a super-small pendrive and keep running out of space! Now what?

Using another system, try something like the following, based on the upstream FAQ entry "If Your Device is Small" (see the note above regarding compression and throwaway data):

mkdir /tmp/pendrive
mount /dev/sdX -o noatime,ssd_spread,compress /tmp/pendrive
btrfs sub snap -r /tmp/pendrive /tmp/pendrive/tmp_snapshot
btrfs send /tmp/pendrive/tmp_snapshot > /tmp/pendrive_snapshot.btrfs
umount /tmp/pendrive

wipefs -a /dev/sdX
mkfs.btrfs --mixed /dev/sdX
mount /dev/sdX -o noatime,ssd_spread,compress /tmp/pendrive
btrfs receive -f /tmp/pendrive_snapshot.btrfs /tmp/pendrive
# Convert snapshot into writeable subvolume
btrfs property set -ts /tmp/pendrive/tmp_snapshot ro false
# Rename subvolume
mv /tmp/pendrive/tmp_snapshot /tmp/pendrive/rootfs

# Alternatively, this conversion can be done thus:
# btrfs subvolume snap /tmp/pendrive/tmp_snapshot /tmp/pendrive/rootfs
# btrfs subvolume delete /tmp/pendrive/tmp_snapshot

# Now edit /tmp/pendrive/rootfs/etc/fstab to
# 1) Update UUID if using UUIDs
# 2) Use the "noatime,ssd_spread,compress" mount options

sync
btrfs fi sync /tmp/pendrive/

Now follow the procedure for enabling / on a subvolume. Also, the bootloader needs to be reinstalled if your pendrive is a bootable OS drive and not just a data drive (--TODO: Needs to be written).

Can I encrypt a btrfs installation?

Yes: select manual partitioning, create an encrypted volume, and then create a btrfs file system on top of it. For the moment btrfs does not support native encryption, so the installer uses cryptsetup; native encryption is a planned feature, and experimental patches have recently been submitted to enable it (Anand Jain, linux-btrfs, Add btrfs encryption support). (--IN PROGRESS: I am currently testing btrfs on LUKS1 using cryptsetup, without LVM. After six months without issues on the desktop (since June 2019), I've also transitioned my laptop to this setup. In both cases, this is with linux-4.19.x. --Nicholas D Steeves)
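
A minimal sketch of the cryptsetup-then-btrfs layering (device and mapping names are examples; the installer performs equivalent steps for you):

# Create a LUKS1 container, open it, and create btrfs on the mapped device:
cryptsetup luksFormat --type luks1 /dev/sdX2
cryptsetup open /dev/sdX2 cryptroot
mkfs.btrfs /dev/mapper/cryptroot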

Does it work on RaspberryPi?

Yes, and it may improve filesystem I/O responsiveness. You may have to convert the filesystem to btrfs from a PC first, and change the filesystem type in /etc/fstab from ext4 to btrfs (just by changing the name) before the first boot. See above for the recommended sdcard options in /etc/fstab.

Fsck.btrfs doesn't do anything; how do I verify the integrity of my filesystem?

Rather than a fsck, btrfs has two methods to detect and repair corruption. The first method, scrub, executes as a background process against a mounted volume. It verifies the checksums for all data and metadata; if a checksum fails, the block is marked as bad, and if a good copy is available on another device the scrub updates the bad copy using the good one: it heals the corruption. This operation runs at a default IO priority of idle, which strives to minimize the impact on other active processes; nevertheless, like any IO-intensive background job, it is best run when the system is not busy. To run it:

btrfs scrub start /btrfs_mountpoint

To monitor its progress:

btrfs scrub status /btrfs_mountpoint

The second method, btrfs check, runs against an unmounted filesystem. It verifies that the metadata and filesystem structures of the volume are intact and uncorrupted. It should not usually be necessary to run this type of check. Please note that it runs read-only by default; this is by design, and there are usually better methods to recover a corrupted btrfs volume than the dangerous "--repair" option. Please do not use "--repair" unless an upstream linux-btrfs developer has assured you that it is the best course of action. To run a standard read-only verification of metadata and filesystem structures:

btrfs check -p /dev/sdX 

or

btrfs check -p /dev/disk/by-partuuid/UUID

How can I quickly check to see if my btrfs volume has experienced errors, with per-device accounting of any possible errors?

Get an at-a-glance overview of all devices in your pool with the following:

btrfs dev stats /btrfs_mountpoint

Command output for a healthy two device raid1 volume should show all zeroes, like this:

[/dev/sdb1].write_io_errs   0
[/dev/sdb1].read_io_errs    0
[/dev/sdb1].flush_io_errs   0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0
[/dev/sdc1].write_io_errs   0
[/dev/sdc1].read_io_errs    0
[/dev/sdc1].flush_io_errs   0
[/dev/sdc1].corruption_errs 0
[/dev/sdc1].generation_errs 0

COW on COW: Don't do it!

This includes unionfs, databases that do their own COW, certain cowbuilder configurations, and virtual machine disk images like qcow. Please disable COW in the application if possible. Schroot+overlayfs appears to be safe with linux > 4.9. For QEMU, refer to qemu-img(1) and take care to use raw images. If this is not possible, COW may be disabled for a single empty directory, like this:

mkdir directory
chattr +C directory

Newly created files in this directory will inherit the nodatacow attribute. Alternatively, nodatacow can be applied to a single file, but only while the file is empty:

touch file
chattr +C file

Please read the earlier warning about using nodatacow. Applications that support integrity checks and/or self-healing can somewhat mitigate the risk of nodatacow, but please note that nodatacow files are not protected by the raid1 profile's second copy in the event that a disk fails.

What happens if I mix differently sized disks in raid1 profile?

"RAID1 (and transitively RAID10) guarantees two copies on different disks, always. Only dup allows the copies to reside on the same disk. This is guaranteed is preserved, even when n=2k+1 and mixed-capacity disks. If disks run out of available chunks to satisfy the redundancy profile, the result is ENOSPC and requires the administrator to balance the file system before new allocations can succeed. The question essentially is asking if Btrfs will spontaneously degrade into "dup" if chunks cannot be allocated on some devices. That will never happen." (Justin Brown, 2016-06-03, linux-btrfs).

Why doesn't updatedb index /home when /home is on its own subvolume?

Consult this thread on linux-btrfs. The workaround I use is to have each top-level subvolume (id=5 or subvol=/) mounted at /btrfs-admin/$LABEL, where /btrfs-admin is root:sudo 750, and this is what I use in /etc/updatedb.conf:

PRUNE_BIND_MOUNTS="no"
PRUNENAMES=".git .bzr .hg .svn"
PRUNEPATHS="/tmp /var/spool /media /btrfs-admin /var/cache /var/lib/lxc"
PRUNEFS="NFS nfs nfs4 rpc_pipefs afs binfmt_misc proc smbfs autofs
iso9660 ncpfs coda devpts ftpfs devfs mfs shfs sysfs cifs lustre tmpfs
usbfs udf fuse.glusterfs fuse.sshfs curlftpfs"

With the exception of LXC rootfses, I have a flat subvolume structure under each subvol=/. These subvolumes are mounted at specific mountpoints using fstab. Given that updatedb and locate work flawlessly, I'm inclined to conclude that this is the least disruptive configuration. If I used snapper I'd add it to PRUNEPATHS and rely on its facilities to find files that had been deleted, because I don't want to see n duplicates for a file when I use locate. A user who wanted locate to return duplicate paths could omit the path from PRUNEPATHS.

Old but still relevant References

The number of snapshots per volume and per subvolume must be carefully monitored and/or automatically pruned, because too many snapshots can wedge the filesystem into an out of space condition or gravely degrade performance (Duncan, 2016-02-16, linux-btrfs). There are also reports that IO becomes sluggish and lags with far fewer snapshots, eg: only 86/subvolume on linux-4.0.4; this might be fixed in a newer kernel (Pete, 2016-03-11, linux-btrfs).
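
A minimal sketch of manual snapshot inventory and pruning (paths are examples; tools like snapper automate this):

# List the snapshots on a volume:
btrfs subvolume list -s /btrfs_mountpoint
# Delete a snapshot that is no longer needed:
btrfs subvolume delete /btrfs_mountpoint/snapshots/oldest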

The defragmentation command must be run as root, and it is recommended to ionice it to reduce the load on the system. To further reduce the IO load, flush data after defragmenting each file (-f) using:

 sudo ionice -c idle btrfs filesystem defragment -f -t 32M -r /path/to/subvolume

Raid5 and Raid6 Profiles

2016-06-26 Update

Once again, please do not use btrfs' raid5 or raid6 profiles at this point in time! In the thread [BUG] Btrfs scrub sometime recalculate wrong parity in raid5, Chris Murphy found, while testing btrfs raid5's ability to recover from csum errors, that scrub would sometimes recalculate the wrong parity; see the thread for details.

In another email in this thread, Duncan suggested "And what's even clearer is that people /really/ shouldn't be using raid56 mode for anything but testing with throw-away data, at this point. Anything else is simply irresponsible" (linux-btrfs, 2016-06-26).

TODO

Documentation

See also

Upstream Contacts


CategoryKernel