Differences between revisions 97 and 98
Revision 97 as of 2018-10-09 01:30:31
Size: 37962
Editor: ?NicholasDSteeves
Comment: Update "Does it support SSD optimizations" section
Revision 98 as of 2018-10-09 01:37:06
Size: 38204
Editor: ?NicholasDSteeves
Comment: Enhance "Fsck.btrfs doesn't do anything..." section


Btrfs was created to address the lack of pooling, snapshots, checksums, and integrated multi-device spanning in Linux file systems, particularly as the need for such features emerged when working at the petabyte scale. It aspires to be a multipurpose filesystem that scales well from massive block devices all the way down to cellular phones (Sailfish OS and Android). Because all reads are checksum-verified, Btrfs takes care to ensure that your backups are not poisoned by silently corrupted source data—ZFS similarly ensures data integrity.

History

Btrfs has been part of the mainline Linux kernel since 2.6.29, and Debian's Btrfs support was introduced in DebianSqueeze.

In the future Ext2/3/4 filesystems will be upgradable to Btrfs. While a btrfs-convert utility has existed for some time, its use is presently not recommended. For the time being please back up, wipefs -a, mkfs.btrfs, and restore from backup, or replicate an existing Ext volume to the new Btrfs one using your choice of tools (eg: tar, cpio, rsync, et al).

"Google is evaluating btrfs for its potential use in android, but currently the lack of native file-based encryption unfortunately makes it a nonstarter" (Filip Bystricky, linux-btrfs, 2017-06-09).

Status

The official upstream status is available at The Btrfs Wiki: Status and at git.kernel.org:btrfs.txt.

The DebianInstaller can format and install to single-disk Btrfs volumes, but does not yet support multi-disk btrfs volumes nor subvolume creation (Bug #686097). Daniel Pocock has a good article on how to Install Debian wheezy and jessie directly with btrfs RAID1; however, strictly speaking it showcases Btrfs' integrated multi-device flexibility. eg: Install to a single disk, add a second disk to the volume, rebalance while converting all data and metadata to raid1 profile.

Two-disk raid1-profile Btrfs volumes created on msdos or gpt partitions are bootable using grub-pc or grub-efi without a dedicated /boot, and it should also be possible to boot from a volume created on a raw disk using grub-pc. If booting with EFI firmware then consult UEFI for additional ESP partitioning requirements. Please note that if you boot using EFI and you would like your rootfs to be on btrfs, you must partition your drive[s]! It is highly recommended to use a swap partition rather than a manually configured swap file through a loop-device; classic swap files are not supported (Btrfs Wiki Btrfs FAQ).

  • In my opinion, Btrfs before linux-4.4 and btrfs-progs-v4.4 is too risky to use, and 4.4 was the point where one could stop worrying "is my btrfs volume going to mysteriously blow up tomorrow, even with a simple use-case". When using DebianJessie, please use a backported kernel and btrfs-tools from Backports. DebianStretch has good btrfs support out-of-the-box. If at some point in the Stretch life-cycle one needs features from a newer kernel, then I recommend exclusively tracking LTS kernels rather than the latest version in backports, because tracking the linux-image backport can result in bugs such as this one—which forces a reboot:

    A lot warnings in dmesg while running thunderbird, or corruptions such as Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0).

    DebianTesting and DebianUnstable are also necessarily affected, because they track the latest stable upstream kernel. That said, if you would like to participate in the effort to debug and stabilise btrfs and can risk encountering bugs such as these, please use the newest kernel available to you. The upstream mailing list linux-btrfs appreciates this! --Nicholas D Steeves

Warnings of "Btrfs is under heavy development, and is not suitable for any uses other than benchmarking and review" were removed for linux-4.6 (git.kernel.org), and the consensus on the linux-btrfs mailing list seems to be that raid1 and raid10 profiles are now mature. Please refer to BackupAndRecovery if you do not yet have a backup strategy in place, and take care to regularly verify that your backups are restorable.

Here are some of Btrfs' shortcomings:

Warnings

  • Block-level clones often corrupt a btrfs volume. This is because btrfs expects UUID to truly be unique. The common case is 1) Install using LVM 2) Format an LV as a btrfs volume 3) Create a new LV 4) Dd the LV from #2 to #3 5) Udev probes the new LV and triggers btrfs device scan, or a reboot triggers btrfs device scan 6) The UUID collision between the LVs from #2 and #3 causes misdirected writes, which corrupt #2, #3, or both at once. Creating a block-level clone as a file and then mounting that file via loopback on a system where the original block device is mounted will also trigger this form of corruption.
  • Layering btrfs volumes on top of LVM might be implicated in other passive causes of filesystem corruption (Needs citation -- Nicholas D Steeves).
  • All kernels before linux-4.14 are affected by a bug in the SSD-specific allocator, but this is easy to work around. The consequences of this bug are as follows: extents are allocated in a highly fragmented way, free space becomes highly fragmented and inefficient, new data chunks are needlessly allocated from the unallocated free space pool (which has the consequence of making TRIM ineffective), and balance operations do not actually balance or they fail due to out-of-space conditions. The workaround is to add "nossd" to the mount options, reboot, then (optionally) defragment, and finally (necessarily) "btrfs balance start --full-balance /mount/point". See Re: Recommendations for balancing as part of regular maintenance? (Hugo Mills, linux-btrfs, 2018-01-08).
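A quick way to spot the UUID collisions described in the block-level clone warning above is to look for filesystem UUIDs that are reported by more than one block device. The following is a minimal sketch, not an official tool: the function name find_duplicate_uuids and the sample device names are made up for illustration. Note that all member devices of a single multi-device btrfs volume legitimately share one UUID, so a hit only matters when the listed devices are not supposed to belong to the same volume.

```shell
# Print every filesystem UUID that appears on more than one block device.
# Input (stdin): lines of "DEVICE UUID"; devices without a UUID are ignored.
find_duplicate_uuids() {
    awk 'NF >= 2 { count[$2]++ }
         END { for (u in count) if (count[u] > 1) print u }' | sort
}

# Demo on made-up sample data: two LVs share a UUID, sda1 does not.
printf '%s\n' \
    'vg-lv1 1111-aaaa' \
    'vg-lv2 1111-aaaa' \
    'sda1 2222-bbbb' | find_duplicate_uuids
```

On a real system the input could come from "lsblk -rno NAME,UUID | find_duplicate_uuids", ideally run before the clone is ever exposed to a btrfs device scan.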

  • There is currently (2018-01-24, linux ≤ 4.14.15) a bug that causes a two-disk raid1 profile to forever become read-only the second time it is mounted in a degraded state, for example due to a disk that is missing, broken, or has suffered a SATA link reset (unix.stackexchange.com, How to replace a disk drive that is physically no more there?). The easiest way to avoid this is to use a three-disk raid1 profile filesystem. With two copies of all data and metadata spread over three disks the filesystem can lose any one disk and continue to function across reboots unless a second disk dies, because with two surviving devices, two copies of data and metadata can be made. The "filesystem becomes read-only" bug is avoided, because it is only triggered when it becomes impossible to make two copies of data and metadata on two different devices. As an alternative, Adam Borowski has submitted [PATCH] [NOT-FOR-MERGING] btrfs: make "too many missing devices" check non-fatal to linux-btrfs, which addresses this issue; it is also addressed by Qu Wenruo's yet-unmerged Btrfs: Per-chunk degradable check patch. The thread surrounding Borowski's patch is an excellent introduction to the debate surrounding whether or not btrfs volumes should be run in a degraded state.

  • While using btrfs with bcache increases performance, bcache can introduce grave errors such as Regression in 4.14: wrong data being read from bcache device (Pavel Goran, linux-bcache, 2017-11-16).

  • Do not use raid5 or raid6 profiles.
  • Quotas and qgroups are still broken and are implicated in filesystem corruption (Justin Maggard, linux-btrfs, 2017-09-31).

  • Subvolumes cannot yet be mounted with different btrfs-specific options; the first line for a given volume in /etc/fstab takes effect. eg: you cannot mount / with noatime and /var with nodatacow,compress=lzo (Btrfs Wiki, Mount options).

  • At present, nodatacow implies nodatasum; this means that anything with the nodatacow attribute does not receive the benefits of btrfs' checksum protection and self-healing (for raid levels >= 1). Disabling CoW (Copy on Write) also means that a VM disk image will not be consistent if the host crashes or loses power. Nodatacow also carries the following additional danger on multidisk systems: because checksums are disabled, there is no way to verify which disk in a two-disk raid1 profile volume contains the correct data. After a crash there is a roughly 50% probability that the bad copy will be read on each request. Consequently, it is almost always preferable to disable COW in the application, and nodatacow should only be used for disposable data.

  • Compress=lzo might be dangerous. In the linux-btrfs thread "Trying to rescue my data" (2016-06-26) it has finally come to light that mounting with compress=lzo might be something that causes btrfs volumes to break, because "if it gets too many [csum errors] at once, it *does* unfortunately crash, despite the second copy being available and being just fine as later demonstrated by the scrub fixing the bad copy from the good one" (Duncan). Later in the thread Steven Haigh confirms the behaviour and suggests "maybe here lays a common problem"? Zlib appears to be unaffected.

  • Mounting with -o autodefrag will often duplicate reflinked or snapshotted files when a balance is run.
  • Any "btrfs filesystem defrag" operation can potentially duplicate reflinked or snapshotted blocks. Files with shared extents lose their shared reflinks, which are then duplicated with n-copies. The effect of the lack of "snapshot aware defrag" is that volumes that make heavy use of reflinks or snapshots will unexpectedly run out of free space. Avoid this by minimizing the use of snapshots, and instead use deduplicating backup software to store backups efficiently (eg: borgbackup).
  • And others from The Btrfs Wiki: Gotchas

Raid5 and Raid6 Profiles

  • "Do not use BTRFS raid6 mode in production, it has at least 2 known serious bugs that may cause complete loss of the array due to a disk failure. Both of these issues have as of yet unknown trigger conditions, although they do seem to occur more frequently with larger arrays" (Austin S. Hemmelgarn, 2016-06-03, linux-btrfs).

  • Do not use raid5 mode in production because, "RAID5 with one degraded disk won't be able to reconstruct data on this degraded disk because reconstructed extent content won't match checksum. Which kinda makes RAID5 pointless" (Andrei Borzenkov, 2016-06-24, linux-btrfs).

2016-06-26 Update

Once again, please do not use btrfs' raid5 or raid6 profiles at this point in time! In the thread [BUG] Btrfs scrub sometime recalculate wrong parity in raid5 Chris Murphy found the following while testing the btrfs raid5's ability to recover from csum errors:

  • I just did it a 2nd time and both file's parity are wrong now. So I did it several more times. Sometimes both files' parity is bad. Sometimes just one file's parity is bad. Sometimes neither file's parity is bad. It's a very bad bug, because it is a form of silent data corruption and it's induced by Btrfs. And it's apparently non-deterministically hit (2016-06-26).

In another email in this thread, Duncan suggested "And what's even clearer is that people /really/ shouldn't be using raid56 mode for anything but testing with throw-away data, at this point. Anything else is simply irresponsible" (linux-btrfs, 2016-06-26).

Maintenance

As a btrfs volume ages, you might notice that its performance degrades. This is because btrfs is a Copy On Write file system, and all COW filesystems eventually reach a heavily fragmented state; this includes ZFS. Over time, logs in /var/log/journal will become split across tens of thousands of extents. This is also the case for sqlite databases such as those that are used for Firefox and a variety of common desktop software. Fragmentation is a major contributing factor to why COW volumes become slower over time.

ZFS addresses the performance problems of fragmentation using an intelligent Adaptive Replacement Cache (ARC), but the ARC requires massive amounts of RAM. Btrfs took a different approach and benefits from—some would say requires—periodic defragmentation. Btrfsmaintenance can be used to automate defragmentation and other btrfs maintenance tasks.

Performance Considerations and Tuning

  • While segmenting datasets using subvolumes will usually speed up operations that require walking the backref tree (Citation needed, probably something Qu Wenruo wrote), too many snapshots have the opposite effect. Going snapshot crazy, or using a loose and easy snapper config, will cause performance crashes somewhere between 12 snapshots per subvolume and/or 100 subvolumes per volume, including all snapshots. I've also read that too many snapshots can sometimes wedge the volume into an unmountable state. (Need to find the source for this on linux-btrfs -- Nicholas D Steeves)

  • In the linux-btrfs thread Re: Understanding BTRFS RAID0 Performance (2018-10-08), Austin S. Hemmelgarn writes "If you can find some way to logically subdivide your workload, you should look at creating one subvolume per subdivision. This will reduce lock contention (and thus make bumping up the thread_pool option actually have some benefits)". So for example, on a combined web and mail server, /var/www/html and the location where maildirs are stored (usually /home/$user/Maildir or /var/spool/mail/$user) should be on different subvolumes (eg: make subvolumes named "@html", and "@home" or "@mail").

  • Use maildirs and not mbox spool files.
  • dpkg, and thus apt, is very slow on btrfs (bug marked as won't fix). This is important to know if you want to use btrfs as your root filesystem and you are using stable, and it is extremely annoying if you are using unstable or want to run sbuild. eatmydata helps a lot here, but a power failure can leave you with a broken dpkg database.

  • Mounting with -o compress will amplify fragmentation. All COW filesystems necessarily fragment. There is also a relation between the number of snapshots and the degree of fragmentation. Fragmentation manifests as higher than expected CPU usage on SSDs and increased read latency on rotational disks, because each of the references present in a frequently updated file will tend to necessitate a mechanical seek. Because the focus of btrfs development is currently on stabilisation, bug fixes, and core features, seek optimisations have not yet become a priority. This said, some workloads show marked benefit from compression!
  • Is there anything I can do to improve system responsiveness while running a scrub, balance, or defrag?

    Yes, but only if the CFQ scheduler is enabled for the affected btrfs drives, because the "idle" ionice class requires the CFQ scheduler.

    cat /sys/block/sdX/queue/scheduler
    # Should return "noop anticipatory deadline [cfq]" for rotational disks
    # If it does not, then
    echo -n cfq > /sys/block/sdX/queue/scheduler
    Use your preferred method to make this permanent (eg: /etc/rc.local, or a udev rule).

  • Btrfs makes my desktop slow, is there anything I can do to restore a snappy feeling?

    Yes, but at the cost of greater slowdown during scrub, balance, and defrag. (I use this for both my SSD and two 4200RPM disk btrfs raid1 --Nicholas D Steeves)

    cat /sys/block/sdX/queue/scheduler
    # Should return "noop anticipatory [deadline] cfq" if deadline is enabled
    # If it does not, then
    echo -n deadline > /sys/block/sdX/queue/scheduler
    Use your preferred method to make this permanent (eg: /etc/rc.local, or a udev rule). Please note that using deadline on a rotational boot disk is not a panacea for all btrfs performance issues and this is very much a case of "your mileage may vary".

  • I want to run defrag and balance to restore performance. Is it possible to reduce the incidence of unexpected out of space errors?

    Yes. The default target extent size of defrag primarily affects files ≤32MB, because "on an average aged filesystem…whole files [are overwritten] breaking the reflinks" (David Sterba, btrfsmaintenance issue #43, 2018-01-23). Alternatively, skip all extents larger than the most reflinked files with btrfs filesystem defrag -t 4M /mountpoint. Also, care should be taken to only defrag source subvolumes, and to never defrag their snapshots.
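The scheduler checks in the answers above rely on reading the bracketed entry in /sys/block/sdX/queue/scheduler by eye. As a small sketch (the helper name active_scheduler is hypothetical, not part of any package), the active scheduler can be extracted like this:

```shell
# Print the active I/O scheduler, i.e. the name inside square brackets,
# from a line in the format used by /sys/block/sdX/queue/scheduler.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Demo on the sample line quoted earlier in this section.
active_scheduler 'noop anticipatory deadline [cfq]'
```

For a real disk, pass the file contents instead, for example active_scheduler "$(cat /sys/block/sda/queue/scheduler)", and compare the result against cfq or deadline before deciding whether to change it.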

Recommendations

Many people have reported years of btrfs usage without issue, and this wiki page will continue to be updated with configuration recommendations known to be good and cautions against those known to cause issues.

  1. Use two (ideally three) equally sized disks, partition them identically, and add each partition to a btrfs raid1 profile volume. Alternatively, dedicate one disk for holding backups, because not much benefit in throughput or iops is yet gained by using btrfs raid1.
  2. Do not use compression. That said, if one wants to test this functionality then zlib seems to have fewer issues than lzo. There is insufficient data on the new zstd mode to either recommend it or caution against its use.
  3. Do not use quotas/qgroups.
  4. Keep regular backups and use a backup program that supports deduplication (eg: borgbackup).
  5. Do not enable mount -o discard, autodefrag, or space_cache=v2.
  6. Overprovision your SSD when partitioning so periodic trim won't be needed.
  7. Periodically run btrfs defrag against source subvolumes.
  8. Never run btrfs defrag against a child subvolume (eg: snapshots).
  9. Ensure that the number of snapshots per volume/filesystem never exceeds 12; two or three times that might not cause ill effects, but keeping it under this number provides the greatest odds for avoiding morbid performance issues and out of space conditions. On the upside, many more btrfs snapshots can be taken before performance crashes when compared to LVM snapshots, where a single snapshot can introduce a performance crash.
  10. Take care to not fill the volume beyond 90%. If this occurs it may become necessary to run periodic balances to consolidate free space into contiguous chunks. Also, performance will become less predictable.
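Two of the recommendations above are simple numeric limits (at most 12 snapshots, at most 90% full) and lend themselves to a periodic check. The sketch below only compares numbers that you pass in; the function name check_btrfs_limits is made up for illustration, and how you collect the two numbers is left to you.

```shell
# Compare a fill percentage and a snapshot count against the guideline
# values from the recommendations (90% full, 12 snapshots).
# Prints "OK" or one warning per exceeded limit; returns 1 on any warning.
check_btrfs_limits() {
    pct=$1
    snaps=$2
    status=0
    if [ "$pct" -gt 90 ]; then
        echo "WARNING: volume is ${pct}% full (guideline: 90%)"
        status=1
    fi
    if [ "$snaps" -gt 12 ]; then
        echo "WARNING: ${snaps} snapshots (guideline: 12)"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "OK"
    fi
    return "$status"
}

# Demo with made-up numbers that are within both limits.
check_btrfs_limits 45 3
```

One possible source for the snapshot count is "btrfs subvolume list -s /mountpoint | wc -l". The fill percentage is better judged from "btrfs filesystem usage /mountpoint" than from df, because df can be misleading on multi-device volumes.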

FAQ

Which package contains the tools?

btrfs-tools in Debian 6 (squeeze) to Debian 8 (jessie), and btrfs-progs thereafter. Most interaction with Btrfs' advanced features requires these tools.

Does btrfs really protect my data from hard drive corruption?

Yes, but this requires at least two disks in raid1 profile. (eg: -m raid1 -d raid1). Without at least two copies of data, corruption can be detected but not corrected. Btrfs raid5 or raid6 profiles will not protect your data. Additionally, like for "mdadm or lvm raid, you need to make sure that the SCSI command timer (a kernel setting per block device) is longer than the drive's SCT ERC setting...If the command timer is shorter, bad sectors will not get reported as read errors for proper fixup, instead there will be a link reset and it's just inevitable there will be worse problems" (Chris Murphy, 2016-04-27, linux-btrfs). The Debian bug for this issue can be found here. For now do the following for all drives in the array, and then configure your system to change the SCSI command timer automatically on boot:

cat /sys/block/<dev>/device/timeout
smartctl -l scterc /dev/<dev>

# Then, as root, write ((the scterc value)/10)+10 to the command timer:
# echo -n <seconds> > /sys/block/<dev>/device/timeout
The default value is 30 seconds, which should be fine for disks that support SCT and likely have low timeout values like 7 sec. For disks that fail smartctl -l scterc, and thus do not support SCT, set the timeout value to 120. Consider a timeout of 180 to be extra safe with large consumer-grade disks.
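The arithmetic above can be sketched as a small helper. It assumes the SCT ERC value is given in tenths of a second, which is how smartctl prints it (for example 70 for 7.0 seconds), and it falls back to 120 seconds when no value is available, as suggested above. The function name scsi_timeout_for is hypothetical.

```shell
# Compute a SCSI command timer value in seconds.
# $1: SCT ERC value in deciseconds, or empty if the drive lacks SCT ERC.
scsi_timeout_for() {
    if [ -z "$1" ]; then
        # No SCT ERC support: use the conservative value from the text.
        echo 120
    else
        # ((the scterc value)/10)+10
        echo $(( $1 / 10 + 10 ))
    fi
}

# Demo: a drive reporting 70 (7.0 seconds), and a drive without SCT ERC.
scsi_timeout_for 70
scsi_timeout_for ''
```

The result is the minimum the formula asks for; if it is below the kernel default of 30 seconds there is no need to lower the timer. To apply a value, run as root: echo -n "$(scsi_timeout_for 70)" > /sys/block/<dev>/device/timeout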
Does it support SSD optimizations?

Yes. For more details on using SSDs with Debian, refer to SSDOptimization. NOTE: Do not use "-o discard" with btrfs.

What are btrfs' raid1 and raid10 profiles?
It is not classic RAID1, but rather 2 copies distributed on n devices. Adding more devices does not make more copies; adding devices increases the size of the volume, but both raid1 and raid10 profiles always only make 2 copies. Adding more devices to increase redundancy is what upstream calls "raid1 profile n-copies" and no one is currently working on implementing this functionality. Btrfs' raid10 profile is currently not optimised and usually performs identically to or worse than the same disks in raid1 profile. Given the raid10 profile's added complexity, it is clear that raid1 should continue to be preferred at this time.
Does it support compression?

Yes, but consider this functionality experimental. Add compress=lzo, compress=zlib, or compress=zstd, according to the priority of throughput (lzo), best compression and fewest bugs (zlib), or something in between the two (zstd). If "=choice" is not specified then zlib will be used:

/dev/sdaX /  btrfs defaults,compress=choice 0 1

Change /dev/sdaX to your actual root device (UUID support in btrfs is a work-in-progress, but it works for mounting volumes; use the command blkid to get the UUID of all filesystems). Labels are also supported.

What if I just want to compress the files in a directory?

You can do this by applying the following two commands (for example for /var):

btrfs filesystem defragment -r -v -clzo /var
chattr +c /var

By adding the +c attribute you ensure that any new file created inside the folder is compressed.

What are the recommended options for installing on a pendrive, a SD card or a slow SSD drive?

When installing, use manual partitioning and select btrfs as the file system. On first boot, edit /etc/fstab to use these options, which should noticeably improve speed and responsiveness (note that compression might cause issues -- Nicholas D Steeves):

/dev/sdaX / btrfs x-systemd.device-timeout=0,noatime,compress=lzo,commit=0,ssd_spread,autodefrag 0 0
But I have a super-small pendrive and keep running out of space! Now what?

Using another system, you can try something like this If Your Device is Small (note that compression might cause issues -- Nicholas D Steeves):

mkdir /tmp/pendrive
mount /dev/sdX -o noatime,ssd_spread,compress /tmp/pendrive
btrfs sub snap -r /tmp/pendrive /tmp/pendrive/tmp_snapshot
btrfs send /tmp/pendrive/tmp_snapshot > /tmp/pendrive_snapshot.btrfs
umount /tmp/pendrive

wipefs -a /dev/sdX
mkfs.btrfs --mixed /dev/sdX
mount /dev/sdX -o noatime,ssd_spread,compress /tmp/pendrive
btrfs receive -f /tmp/pendrive_snapshot.btrfs /tmp/pendrive
# Convert snapshot into writeable subvolume
btrfs property set -ts /tmp/pendrive/tmp_snapshot ro false
# Rename subvolume
mv /tmp/pendrive/tmp_snapshot /tmp/pendrive/rootfs

# Alternatively, this conversion can be done thus:
# btrfs subvolume snap /tmp/pendrive/tmp_snapshot /tmp/pendrive/rootfs
# btrfs subvolume delete /tmp/pendrive/tmp_snapshot

# Now edit /tmp/pendrive/rootfs/etc/fstab to
# 1) Update UUID if using UUIDs
# 2) Use the "noatime,ssd_spread,compress" mount options

sync
btrfs fi sync /tmp/pendrive/

Now follow the procedure enabling / on a subvolume. Also, the bootloader needs to be reinstalled if your pendrive is a bootable OS drive and not just a data drive (Needs to be written --Nicholas D Steeves).

What are the recommended options for a rotational hard disk? (note that compression might cause issues -- Nicholas D Steeves)

In /etc/fstab

UUID=<the_device_uuid> /mount/point/ btrfs noauto,compress=lzo,noatime,autodefrag 0 0

The noauto option prevents the system from freezing at boot when a non-system device or partition is (likely) unplugged. Alternatively, if you are using systemd, the device is necessary for normal functioning of the system, and you want to limit the boot delay to 10 seconds when it is missing, you can try the following. System boot will halt with an error if the device is not found:

UUID=<the_device_uuid> /mount/point btrfs x-systemd.device-timeout=10,noatime,compress=lzo,autodefrag 0 0
(Consider revoking this recommendation, because autodefrag, like -o discard, can trigger buggy behaviour. Also consider revoking the compress=lzo recommendation for rotational disks, because while it increases throughput for sequentially written compressible data, it also magnifies fragmentation...which means lots more seeks and increased latency -- Nicholas D Steeves)
Can I encrypt a btrfs installation?

Yes, by selecting manual partitioning, creating an encrypted volume, and then creating a btrfs file system on top of that. For the moment, btrfs does not support direct encryption so the installer uses cryptsetup, but this is a planned feature, and experimental patches have recently been submitted to enable this (Anand Jain, linux-btrfs, Add btrfs encryption support).

Does it work on RaspberryPi?

Yes, improving filesystem I/O responsiveness a lot. You may have to convert the filesystem to btrfs first from a PC and change the /etc/fstab type of filesystem from ext4 to btrfs (just by changing the name) before the first boot. Look above for recommended sdcard options in /etc/fstab.

Fsck.btrfs doesn't do anything, how do I verify the integrity of my filesystem?

Rather than a fsck, btrfs has two methods to detect and repair corruption. The first method, scrub, executes as a background process for a mounted volume. It verifies the checksums for all data and metadata. If a checksum fails, the affected block is marked as bad, and if a good copy is available on another device then scrub overwrites the bad copy with the good one, healing the corruption. This operation runs at a default IO priority of idle, which strives to minimize the impact on other active processes; nevertheless, like any IO-intensive background job, it is best to run it at a time when the system is not busy. To run it:

btrfs scrub start /btrfs_mountpoint

To monitor its progress:

btrfs scrub status /btrfs_mountpoint

The second method checks an unmounted filesystem. It verifies that the metadata and filesystem structures of the volume are intact and uncorrupted. It should not usually be necessary to run this type of check. Please note that it runs read-only; this is by design, and there are usually better methods to recover a corrupted btrfs volume than to use the dangerous "--repair" option. Please do not use "--repair" unless someone has assured you that it is absolutely necessary. To run a standard read-only metadata and filesystem structures verification:

btrfs check -p /dev/sdX 

or

btrfs check -p /dev/disk/by-partuuid/UUID
How can I quickly check to see if my btrfs volume has experienced errors, with per-device accounting of any possible errors?

If you have a new enough copy of btrfs-progs you get an at-a-glance overview of all devices in your pool by running the following:

btrfs dev stats /btrfs_mountpoint

For a healthy two device raid1 volume this command will output something like:

[/dev/sdb1].write_io_errs   0
[/dev/sdb1].read_io_errs    0
[/dev/sdb1].flush_io_errs   0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0
[/dev/sdc1].write_io_errs   0
[/dev/sdc1].read_io_errs    0
[/dev/sdc1].flush_io_errs   0
[/dev/sdc1].corruption_errs 0
[/dev/sdc1].generation_errs 0
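On pools with many devices the list above gets long, so it can help to show only the counters that are not zero. This is a minimal sketch that assumes the two-column output format shown above; the function name nonzero_dev_stats is made up for illustration.

```shell
# Read "btrfs dev stats" style lines on stdin and print only the counters
# whose value is not zero. Exit status is 1 if any such counter was found.
nonzero_dev_stats() {
    awk 'NF == 2 && $2 != 0 { print; bad = 1 } END { exit bad }'
}

# Demo on made-up sample output containing one non-zero counter.
printf '%s\n' \
    '[/dev/sdb1].write_io_errs   0' \
    '[/dev/sdb1].corruption_errs 3' | nonzero_dev_stats || echo 'errors found'
```

In real use the input would come from "btrfs dev stats /btrfs_mountpoint | nonzero_dev_stats", and the exit status makes it easy to call from a cron job that should only send mail when something is wrong.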
COW on COW: Don't do it!

This includes overlayfs, unionfs, databases that do their own COW, certain cowbuilder configurations, and virtual machine disk images. Please disable COW in the application if possible. For example, for QEMU, refer to qemu-img(1) and take care to use raw images. If this is not possible, you can disable COW on a single directory like this

mkdir directory
chattr +C directory

New files in this directory will inherit the nodatacow attribute. Alternatively, nodatacow can be applied to a single file, but only for empty files

touch file
chattr +C file
Please read earlier warning about using nodatacow. If your application supports integrity checks and/or self-healing, you will want to enable them if you use nodatacow for that application...but that might not be enough if you lose a whole disk!
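To confirm that the attribute was actually applied, the output of lsattr can be inspected for the uppercase C flag. The helper below is a sketch that works on one line of lsattr output; the name has_nocow and the sample lines are made up for illustration, and the exact width of the flag field varies between e2fsprogs versions.

```shell
# Succeed if a line of lsattr output carries the No_COW flag (uppercase C).
# $1: one output line, e.g. from "lsattr -d directory".
has_nocow() {
    # Keep only the first field (the flag string) and look for "C".
    case "${1%% *}" in
        *C*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Demo on a made-up sample line.
has_nocow '---------------C-- directory' && echo 'nodatacow is set'
```

A real check could look like: has_nocow "$(lsattr -d directory)". Lowercase c is the unrelated compression attribute, which is why the match is case sensitive.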
What happens if I mix differently sized disks in raid1 profile?

"RAID1 (and transitively RAID10) guarantees two copies on different disks, always. Only dup allows the copies to reside on the same disk. This is guaranteed is preserved, even when n=2k+1 and mixed-capacity disks. If disks run out of available chunks to satisfy the redundancy profile, the result is ENOSPC and requires the administrator to balance the file system before new allocations can succeed. The question essentially is asking if Btrfs will spontaneously degrade into "dup" if chunks cannot be allocated on some devices. That will never happen." (Justin Brown, 2016-06-03, linux-btrfs).

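To see when ENOSPC will be reached with mixed sizes, the usable capacity of a raid1 profile can be estimated from the rule quoted above that every chunk needs two copies on two different disks. The sketch below uses the commonly cited estimate min(total/2, total - largest); this formula is an assumption that does not come from the quoted email, it ignores metadata overhead and chunk granularity, and the function name raid1_usable is made up for illustration.

```shell
# Estimate usable raid1-profile capacity from a list of device sizes.
# All sizes must be whole numbers in the same unit (e.g. GB).
raid1_usable() {
    total=0
    largest=0
    for size in "$@"; do
        total=$(( total + size ))
        if [ "$size" -gt "$largest" ]; then
            largest=$size
        fi
    done
    half=$(( total / 2 ))
    rest=$(( total - largest ))   # space available to mirror the largest disk
    if [ "$rest" -lt "$half" ]; then
        echo "$rest"
    else
        echo "$half"
    fi
}

# Demo: one 6 TB disk with two 1 TB disks can only mirror 2 TB.
raid1_usable 6 1 1
```

With disks of 4, 2, and 2 the estimate is 4, because the two small disks together can mirror the large one. In the 6, 1, 1 case the remaining 4 TB on the large disk stay unusable for raid1 chunks.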
Why doesn't updatedb index /home when /home is on its own subvolume?

Consult this thread on linux-btrfs. The workaround I use is to have each top-level subvolume (id=5 or subvol=/) mounted at /btrfs-admin/$LABEL, where /btrfs-admin is root:sudo 750, and this is what I use in /etc/updatedb.conf:

PRUNE_BIND_MOUNTS="no"
PRUNENAMES=".git .bzr .hg .svn"
PRUNEPATHS="/tmp /var/spool /media /btrfs-admin /var/cache /var/lib/lxc"
PRUNEFS="NFS nfs nfs4 rpc_pipefs afs binfmt_misc proc smbfs autofs
iso9660 ncpfs coda devpts ftpfs devfs mfs shfs sysfs cifs lustre tmpfs
usbfs udf fuse.glusterfs fuse.sshfs curlftpfs"
With the exception of LXC rootfss I have a flat subvolume structure under each subvol=/. These subvolumes are mounted at specific mountpoints using fstab. Given that updatedb and locate work flawlessly, and that I've only had two issues (freespacecache) while using LTS kernels, I'm inclined to conclude that this is the least disruptive configuration. If I used snapper I'd add it to PRUNEPATHS and rely on its facilities to find files that had been deleted, because I don't want to see n-duplicates-for-file when I use locate. A user who wanted to see those duplicates could remove the path from PRUNEPATHS.

Old but still relevant References

The number of snapshots per volume and per subvolume must be carefully monitored and/or automatically pruned, because too many snapshots can wedge the filesystem into an out of space condition or gravely degrade performance (Duncan, 2016-02-16, linux-btrfs). There are also reports that IO becomes sluggish and lags with far fewer snapshots, eg: only 86/subvolume on linux-4.0.4; this might be fixed in a newer kernel (Pete, 2016-03-11, linux-btrfs).

This command must be run as root, and it is recommended to ionice it to reduce the load on the system. To further reduce the IO load, flush data after defragmenting each file using:

 sudo ionice -c idle btrfs filesystem defragment -f -t 32M -r <path>

Target extent size is a little-known but, for practical purposes, absolutely essential argument. While the argument "-t 1G" would seem to be better than the "-t 32M" default, in practise this is not the case, because most volumes will have a 1GiB chunk size. Additionally, if you have a lot of snapshots or reflinked files, please use "-f" to flush data for each file before going to the next file. Please consult the following thread for more information: Re: btrfs fi defrag does not defrag files >256kB?. Since btrfs-progs-4.9.1 "-t 32M" is the default and no longer needs to be specified; it is still necessary on Stretch, unless using backported btrfs-progs.

TODO

  • Add warning for current remount behaviour when raid1 or raid10 experiences a failed device. Does it still add chunks in profile=single, creating a volume that has both degraded raid1 chunks and single chunks? If this still happens, then the volume locks to read-only the next time it is mounted.
  • Write HOWTO for sbuild + schroot + btrfs, either here or somewhere else. (where should it go? --Nicholas D Steeves)
  • Warn about ways to innocently make a system unbootable, while experimenting?
  • Write FAQ entry on "my array is so slow!" -- needs research into both bcache and upstream's recommendation of btrfs raid1 of raid0 (either mdraid or hardware raid) pairs.
  • Rewrite "Does it work on RaspberryPi?" to not use btrfs convert

  • Add "Tuning for throughput" on SSD and rotational disks, and also "Tuning for Latency" to "Performance Considerations and Tuning" section. Also link to some other source for tuning for latency, throughput using different knobs like VM tuning of dirty pages.
  • Add/write/link to a HOWTO for migrating from a default installation to a system with subvolumes for / and /home.
  • TODO: Check the wiki page on SSDs and add a section on overprovisioning when partitioning. Giving the firmware more unallocated space to work with allows the SSD to maintain more consistent performance when nearing maximal filled capacity; the benefit is particularly noticeable on lower priced SSDs. Some rotational hard drives also benefit from overprovisioning in a practise known as "short stroking".
  • Merge this info to fstab, and also add manpage link for fstab (from bin:mount) to that page.

    • (Remember: all fstab mount options must be comma separated but NOT space separated, so do not insert a space after the comma or the equal symbol).

      In order to check if you have written the options correctly before rebooting and therefore before being in trouble, run this command as root:

      mount -o remount /
      If no error is reported, everything is OK. Never try to boot with a broken fstab file, or you will have to recover it manually, a procedure that is more complicated.

Documentation

See also

Contact


CategoryKernel