ZFS

ZFS is a combined file system and logical volume manager designed by Sun Microsystems (now owned by Oracle). It was released as open-source software under the Common Development and Distribution License (CDDL) as part of the OpenSolaris project in November 2005. OpenZFS, announced in September 2013 as the truly open-source successor to the ZFS project, brings together developers and users from the various open-source forks of the original ZFS on different platforms.

Described as "the last word in filesystems", ZFS is scalable and includes extensive protection against data corruption, support for high storage capacities, efficient data compression, integration of the concepts of file system and volume management, snapshots and copy-on-write clones, continuous integrity checking and automatic repair, RAID-Z, and native NFSv4 ACLs. It can also be configured very precisely.

Status

Debian kFreeBSD users have been able to use ZFS since the release of Squeeze. For those using the Linux kernel, it has been available from the contrib archive area in the form of DKMS source since the release of Stretch. There is also a deprecated userspace implementation that uses the FUSE framework. This page demonstrates ZFS on Linux (ZoL) unless the kFreeBSD or FUSE implementation is specifically mentioned.

Due to potential legal incompatibilities between the CDDL and the GPL, even though both are OSI-approved free software licenses that comply with the DFSG, ZFS development is not supported by the Linux kernel. ZoL is a project funded by the Lawrence Livermore National Laboratory to develop a native Linux kernel module for its massive storage requirements and supercomputers.

Features

Installation

ZFS on Linux is provided to Debian users in the form of DKMS source. To get the packages, add the contrib section to your apt sources configuration. The Debian ZFS on Linux Team also recommends installing the ZFS packages from the Backports archive, where upstream stable patches are tracked and compatibility is always maintained. Once apt is configured, install the packages with the following commands:

  apt update
  apt install linux-headers-amd64
  apt install -t buster-backports zfsutils-linux

The example installs the Linux headers and ZFS in separate steps. Combining everything into one command also works, but being explicit avoids any chance of mixing up versions. Future updates will be taken care of by apt.

(!) If automatic installation of Recommends is disabled, zfs-dkms must also be installed explicitly.

/!\ The modules are built automatically only for kernels that have the corresponding linux-headers package installed. Install the linux-headers-<arch> package to always have the latest Linux headers installed (analogous to the linux-image-<arch> package).

Creating the Pool

Many disks can be added to a storage pool, and ZFS allocates space from the pool, so the first step in using ZFS is to create one. Using more than one whole disk is recommended to get the full benefits, but a single device or even just a partition also works.

In the world of ZFS, disks are typically identified by their by-id or by-path names, because device names like /dev/sdX may change on every boot. These names can be listed with ls -l /dev/disk/by-id/ or ls -l /dev/disk/by-path/.
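
As a rough sketch, the stable names of whole disks can be listed by skipping the -partN entries that udev creates for partitions. The helper name list_disk_ids is hypothetical, and the optional directory argument exists only so that the function can be tried against a test directory.

```shell
# list_disk_ids [DIR]: print the entries of DIR (default /dev/disk/by-id)
# that do not look like partitions (names containing -part<N>).
list_disk_ids() {
    dir="${1:-/dev/disk/by-id}"
    for path in "$dir"/*; do
        # Skip the unexpanded pattern when the directory is empty or missing
        [ -e "$path" ] || [ -L "$path" ] || continue
        name="${path##*/}"
        case "$name" in
            *-part[0-9]*) continue ;;   # partition entry, not a whole disk
        esac
        printf '%s\n' "$name"
    done
}

list_disk_ids
```

The printed names can be passed directly to zpool create, as in the examples on this page.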

{i} When whole disks are used, ZFS automatically reserves 8 MiB at the end of the device, to allow for replacement or additional physical devices that are not exactly the same size as the other devices in the pool.

(!) When using partitions (or preparing the disk manually in advance), GPT partition labels can also be used to identify the partition or disk, as they are customizable and easier for humans to understand. These can be found in /dev/disk/by-partlabel/.

Basic Configuration

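The most common pool configurations are mirror, raidz and raidz2. Striped and single-disk pools without redundancy are also possible. Choose one of the following:

    # mirror pool (similar to RAID-1)
    zpool create tank mirror scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c

    # raidz pool (similar to RAID-5, single parity)
    zpool create tank raidz scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c scsi-35000cca26c108480

    # raidz2 pool (similar to RAID-6, double parity)
    zpool create tank raidz2 scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c scsi-35000cca26c108480 scsi-35000cca266ccbdb4

    # stripe pool (similar to RAID-0, no redundancy)
    zpool create tank scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c

    # single-disk pool (no redundancy)
    zpool create tank scsi-35000cca26c108480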

Advanced Configuration

When building a pool with a larger number of disks, you are encouraged to arrange them into more than one group (vdev) and stripe the pool across these vdevs. This allows a more flexible pool design that trades off space, redundancy and efficiency.

Different configurations may have different I/O characteristics under certain workload patterns; please refer to the See Also section at the end of this page for more information.

    # stripe of 5 mirrors (similar to RAID-10)
    zpool create tank mirror scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c \
                    mirror scsi-35000cca26c108480 scsi-35000cca266ccbdb4 \
                    mirror scsi-35000cca266c75c74 scsi-35000cca26c0e84dc \
                    mirror scsi-35000cca266cda748 scsi-35000cca266cd14b4 \
                    mirror scsi-35000cca266cb8ae4 scsi-35000cca266cbad80

    # stripe of 2 raidz vdevs (similar to RAID-50)
    zpool create tank raidz scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c scsi-35000cca26c108480 scsi-35000cca266ccbdb4 scsi-35000cca266c75c74 \
                    raidz scsi-35000cca26c0e84dc scsi-35000cca266cda748 scsi-35000cca266cd14b4 scsi-35000cca266cb8ae4 scsi-35000cca266cbad80
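
To make the trade-off between space and redundancy concrete, the following back-of-the-envelope sketch counts how many disks' worth of data a striped pool can hold. It is only an approximation: it assumes equally sized disks and ignores metadata, padding and reserved space. The function name usable_disks is made up for this example.

```shell
# usable_disks KIND VDEVS DISKS_PER_VDEV
# Print the approximate number of disks' worth of usable space.
usable_disks() {
    kind="$1" vdevs="$2" per_vdev="$3"
    case "$kind" in
        mirror)       data=1 ;;                  # all disks in a mirror hold the same data
        raidz|raidz1) data=$((per_vdev - 1)) ;;  # one disk's worth of parity per vdev
        raidz2)       data=$((per_vdev - 2)) ;;  # two disks' worth of parity per vdev
        stripe)       data="$per_vdev" ;;        # no redundancy at all
        *) echo "unknown vdev kind: $kind" >&2; return 1 ;;
    esac
    echo $((vdevs * data))
}

usable_disks mirror 5 2   # the five-mirror layout above: prints 5
usable_disks raidz 2 5    # the two-raidz layout above: prints 8
```

Both layouts use ten disks. The mirrors give up more space but tolerate one failed disk in every pair, while the raidz layout keeps more space and tolerates one failed disk in each group of five.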

ZFS can use a fast SSD as a second-level cache (L2ARC) behind RAM (ARC), which can improve the cache hit rate and thus overall performance. Because cache devices may be read and written very frequently when the pool is busy, consider using more durable SSDs (SLC/MLC rather than TLC/QLC), preferably with an NVMe interface. This cache serves read operations only: data is written to the cache device as a result of reads, and it plays no part in write operations.

    zpool add tank cache nvme-MT001600KWHAC_S3M0NA0K700264

ZFS can also use NVRAM, Optane or SSD devices as a SLOG (Separate ZFS Intent Log) device. A SLOG is often thought of as a kind of write cache, but that is far from the truth. SLOG devices speed up synchronous writes by sending those transactions to the SLOG in parallel with the slower disks. As soon as a transaction succeeds on the SLOG, the operation is marked as completed, so the synchronous operation is unblocked sooner without compromising resistance against power loss. A mirrored setup of SLOG devices is recommended. Note also that asynchronous writes are not sent to the SLOG by default; you could try setting the sync=always property on the working dataset and see whether performance improves.

    zpool add tank log mirror nvme-MT001600KWHAC_S3M0NA0K700244 nvme-MT001600KWHAC_S3M0NA0K700246

Provisioning file systems or volumes

After creating the zpool, we can provision file systems or volumes (ZVOLs). A ZVOL is a block device whose space is allocated from the pool. Like any other block device, it can hold another file system (or swap) or serve as a storage device for virtual machines.

    # provision a file system mounted at /data
    mkdir -p /data
    zfs create -o mountpoint=/data tank/data

    # provision a sparse (-s) 4 GB volume, then format and mount it manually
    zfs create -s -V 4GB tank/vol
    mkfs.ext4 /dev/zvol/tank/vol
    mount /dev/zvol/tank/vol /mnt

    # destroy the file system; ZFS handles the mounts that it manages itself
    zfs destroy tank/data
    # the volume was mounted by the user, so unmount it before destroying it
    umount /dev/zvol/tank/vol
    zfs destroy tank/vol

/!\ Using ZVOLs for swap on Linux can lead to deadlocks, see https://github.com/openzfs/zfs/issues/7734.

Snapshots

Snapshots are one of the most wanted features of a modern file system, and ZFS supports them.

Creating and Managing Snapshots

    # create a snapshot of tank/data
    zfs snapshot tank/data@2019-06-24

    # destroy the snapshot again
    zfs destroy tank/data@2019-06-24
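
Timestamped names keep snapshots in chronological order when listed. The following is a minimal sketch that builds names in the same style used elsewhere on this page; the helper names snapshot_name and snapshot_now are hypothetical, and snapshot_now assumes that the dataset exists and that it is run with sufficient privileges.

```shell
# snapshot_name DATASET: print DATASET@<UTC timestamp>,
# for example tank/data@2019-06-24T18-10
snapshot_name() {
    printf '%s@%s\n' "$1" "$(date -u +%Y-%m-%dT%H-%M)"
}

# snapshot_now DATASET: take a snapshot named after the current time
snapshot_now() {
    zfs snapshot "$(snapshot_name "$1")"
}
```

For example, snapshot_now tank/data takes a snapshot of tank/data, and zfs list -t snapshot shows the snapshots that exist.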

Backup and Restore (with remote)

It is possible to back up a ZFS dataset to another pool with the zfs send/recv commands, even if the pool is located at the other end of a network.

    # create an initial snapshot
    zfs snapshot tank/data@initial
    # send it to another local pool named tank2, calling the dataset packman
    zfs send tank/data@initial | zfs recv -F tank2/packman
    # send it to a pool named tanker on the remote side
    zfs send tank/data@initial | ssh remotehost zfs recv -F tanker/data
    # after using tank/data for a while, create another snapshot
    zfs snapshot tank/data@2019-06-24T18-10
    # incrementally send the new state to the remote side
    zfs send -i initial tank/data@2019-06-24T18-10 | ssh remotehost zfs recv -F tanker/data

File Sharing

ZFS integrates with the operating system's NFS, CIFS and iSCSI servers; it does not implement its own server but reuses existing software. However, iSCSI integration is not yet available on Linux. Enabling xattr=sa and dnodesize=auto is recommended for these uses.

NFS shares

To share a dataset through NFS, the nfs-kernel-server package needs to be installed:

    apt install nfs-kernel-server

Set the recommended properties on the target ZFS file system:

    zfs set xattr=sa dnodesize=auto tank/data

Configure a very simple NFS share (read/write to 192.168.0.0/24, read only to 10.0.0.0/8):

    zfs set mountpoint=/data tank/data
    zfs set sharenfs="rw=192.168.0.0/24,ro=10.0.0.0/8" tank/data
    zfs share tank/data

Verify that the share is exported successfully:

    showmount -e 127.0.0.1

Stop the NFS share:

    zfs unshare tank/data
    # To disable the share permanently, also turn the property off
    zfs set sharenfs=off tank/data

CIFS shares

CIFS is a dialect of the Server Message Block (SMB) protocol and can be used on Windows, VMS, several versions of Unix, and other operating systems.

To share a dataset through CIFS, the samba package needs to be installed:

    apt install samba

Because Microsoft Windows is not case sensitive, it is recommended to set casesensitivity=mixed on the dataset to be shared. This property can only be set at creation time:

    zfs create -o casesensitivity=mixed -o xattr=sa -o dnodesize=auto tank/data

Configure a very simple CIFS share (read/write to 192.168.0.0/24, read only to 10.0.0.0/8):

    zfs set mountpoint=/data tank/data
    zfs set sharesmb=on tank/data
    zfs share tank/data

Verify that the share is exported successfully:

    smbclient -U guest -N -L localhost

Stop the CIFS share:

    zfs unshare tank/data
    # To disable the share permanently, also turn the property off
    zfs set sharesmb=off tank/data

Encryption

Native encryption has been available in ZFS since the ZoL 0.8.0 release. For older versions, the alternative is to wrap ZFS with LUKS (see cryptsetup). Creating an encrypted ZFS dataset is straightforward, for example:

    zfs create -o encryption=on -o keyformat=passphrase tank/secret

ZFS will prompt for the passphrase. Alternatively, the key location can be specified with the "keylocation" property.

ZFS can also encrypt a dataset during "recv":

    zfs send tank/data | zfs recv -o encryption=on -o keyformat=raw -o keylocation=file:///path/to/my/raw/key backup/data
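
The raw key file referenced above has to exist before it can be used. As a sketch, a raw key for ZFS native encryption is 32 bytes of random data; the helper name make_raw_key is a placeholder, not part of ZFS.

```shell
# make_raw_key PATH: write 32 random bytes to PATH, readable by the owner only
make_raw_key() {
    (
        umask 077
        dd if=/dev/urandom of="$1" bs=32 count=1 2>/dev/null
    )
}
```

After running make_raw_key with a path of your choice, pass the same path as keylocation=file://<path> together with keyformat=raw. Keep a copy of the key somewhere safe, because the data cannot be read without it.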

Before an encrypted dataset can be mounted, its key has to be loaded (zfs load-key tank/secret). "zfs mount" provides a shortcut that combines the two steps:

    zfs mount -l tank/secret

TRIM support

TRIM is a command that informs a disk device which blocks of data are no longer considered in use and can therefore be erased internally. It was introduced soon after SSDs appeared and is also widely used on SMR hard drives. TRIM support was added in the OpenZFS 0.8 release.

TRIM helps the disk controller do a better job of garbage collection and reduces write amplification. Proper trimming may help mitigate or avoid performance degradation and improve endurance. In the real world TRIM is not always vital; the situation varies with the disk controller implementation, flash over-provisioning and workload pattern. Excessive TRIM could hurt online performance and affect long-term endurance. If your workload constantly writes and deletes a lot of data (measured in DWPD, drive writes per day), you might need to TRIM more frequently.
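
The write volume mentioned above is usually expressed as drive writes per day (DWPD): the amount of data written per day divided by the capacity of the drive. A small sketch of the arithmetic follows; the function name dwpd and the sample figures are made up for illustration.

```shell
# dwpd WRITTEN_GB_PER_DAY CAPACITY_GB: print drive writes per day with two decimals
dwpd() {
    LC_ALL=C awk -v written="$1" -v capacity="$2" \
        'BEGIN { printf "%.2f\n", written / capacity }'
}

dwpd 480 960    # half of a 960 GB drive rewritten per day: prints 0.50
dwpd 2880 960   # three full drive writes per day: prints 3.00
```

A workload near or above the DWPD rating of the drive is the kind of heavy usage for which more frequent trimming may be worth considering.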

Traditional RAID (hardware or md) can suffer from performance problems when using TRIM. Because there are two levels in the path (file system and RAID), individual TRIM commands reach the disks in small sizes such as 4 KB or 64 KB (usually PAGE_SIZE); some merging is possible but often difficult to implement. ZFS does not have this issue because it has direct knowledge of both the file system and the space allocation.

Manual TRIM

To perform TRIM on a pool, use

    zpool trim tank

Status and progress can be viewed with

    zpool status -t tank

Be aware that, depending on disk performance, pool performance may be affected while a TRIM is running under heavy workload.

Periodic TRIM

On Debian systems, since the Bullseye release (or 2.0.3-9 in buster-backports), periodic TRIM is implemented using a custom per-pool property: org.debian:periodic-trim

By default, these TRIM jobs are scheduled on the first Sunday of every month. The completion time depends on disk size, disk speed and workload pattern. Cheap QLC disks could take considerably more time than very expensive enterprise-grade NVMe disks.

org.debian:periodic-trim | Pool SSD  | TRIM
auto [1]                 | nvme-only | yes
auto [1]                 | sata3.0   | no [2]
auto [1]                 | sata>=3.1 | no [3]
auto [1]                 | mixed     | no [4]
enable                   | any       | yes
disable                  | any       | no

  1. When org.debian:periodic-trim is not present on a pool, or the property is present but its value is empty or invalid, it is treated as auto.

  2. SATA SSDs with protocol version 3.0 or lower handle TRIM (UNMAP) synchronously, which can block all other I/O on the disk until the command finishes and lead to severe interruptions. In this case, pool trim is only recommended during scheduled maintenance periods.

  3. SATA SSDs with protocol version 3.1 or higher may perform TRIM in a queued manner, making the operation non-blocking. Enabling TRIM on these disks is planned by the Debian ZFS maintainers (990871) but not yet implemented, because there are issues to consider: for example, some disks advertise Queued TRIM support although their implementation is known to be broken. Users can enable pool trim by setting the property to enable after checking carefully.

  4. When the >=3.1 support is properly implemented, a pool with mixed types of SSDs will be judged by whether all of its disks are of the recommended types. Users can enable pool trim by setting the property to enable after carefully checking all disks in the pool.

To set the property to enable, use:

    zfs set org.debian:periodic-trim=enable tank

Please note that this property is set on the root dataset of the pool rather than on the pool itself, because custom properties on the pool itself are not yet implemented.
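
The default schedule described above (the first Sunday of every month) can be expressed as a simple date test. This is only an illustration of the rule, not necessarily how the Debian package schedules its job; the function name is_first_sunday is made up, and the -d option assumes GNU date.

```shell
# is_first_sunday [DATE]: succeed if DATE (default: today) is the first
# Sunday of its month. Relies on the -d option of GNU date.
is_first_sunday() {
    day="${1:-today}"
    weekday="$(date -d "$day" +%u)"   # 1 = Monday ... 7 = Sunday
    dom="$(date -d "$day" +%d)"       # day of month, 01..31
    # The first Sunday always falls within the first seven days of the month
    [ "$weekday" -eq 7 ] && [ "${dom#0}" -le 7 ]
}

if is_first_sunday 2021-08-01; then
    echo "2021-08-01 is a first Sunday"
fi
```

A custom maintenance script could use such a test to decide when to run zpool trim on pools that are excluded from the automatic job.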

autotrim property

ZFS can perform TRIM after data deletion, which is in some ways similar to the discard mount option in other file systems. To use autotrim, set the pool property:

    zpool set autotrim=on tank

Automatic TRIM looks for space that has recently been freed and is no longer allocated by the pool, and trims it periodically. It does not reclaim blocks immediately after a free, which keeps it efficient, at the cost of being more likely to encounter tiny ranges.

The periodic TRIM described above does not conflict with autotrim and is already sufficient for light usage. For heavy workloads, periodic TRIM (which is a full trim) can be used to complement autotrim.

Interoperability

The last version of ZFS released from OpenSolaris is zpool v28. After that, Oracle decided not to publish future updates, so version 28 has the best interoperability across all implementations. It is also the last pool version that zfs-fuse supports.

Later it was decided that the open-source implementation would stick to zpool v5000 and track and control any future changes with feature flags. This change is incompatible with the closed-source successor, and v28 will remain the last interoperable pool version.

By default, new pools are created with all supported features enabled (use the -d option to disable them). If you want a pool of version 28, use:

    zpool create -o version=28 tank mirror scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c

All known OpenZFS implementations support zpool v5000 and feature flags in their major stable versions; this includes illumos, FreeBSD, ZFS on Linux and OpenZFS on OS X. The supported features differ among these implementations: for example, the large_dnode feature flag was first introduced on Linux, and spacemap_v2 was not supported on Linux until ZoL 0.8.x. Some features beyond feature flags also differ in availability: xattr=sa is only available on Linux and OS X, whereas TRIM was not supported on Linux until ZoL 0.8.x.

Advanced Topics

These topics are not really advanced in the sense of ZFS or storage internals; they are simply not relevant to everyone.

    # set the sector size exponent at pool creation (ashift=12 means 4 KiB sectors)
    zpool create -o ashift=12 tank mirror scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c

(!) Consider using ashift=12 or ashift=13 even if you currently use only disks with 512-byte sectors. Adding devices with bigger sectors to the same VDEV can severely impact performance due to wrong alignment, while a device with 512-byte sectors also works with a higher ashift (bigger sectors stay aligned because they are multiples of 512).
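
The ashift value is the base-2 logarithm of the sector size, so ashift=12 corresponds to 4096-byte sectors and ashift=13 to 8192-byte sectors. The following sketch derives the value from a sector size; the function name ashift_for is hypothetical, and it assumes the size is a power of two. On Linux, the sector size a disk reports can be read from /sys/block/<device>/queue/physical_block_size.

```shell
# ashift_for SECTOR_BYTES: print log2 of a power-of-two sector size
ashift_for() {
    size="$1"
    bits=0
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        bits=$((bits + 1))
    done
    echo "$bits"
}

ashift_for 512    # prints 9
ashift_for 4096   # prints 12
ashift_for 8192   # prints 13
```

Some disks report 512 bytes even though they use larger sectors internally, which is one more reason to pick 12 or 13 explicitly rather than rely on the reported value.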

    zfs set compression=on tank

This will enable compression using the current default compression algorithm (currently lz4 if the pool has the lz4_compress feature enabled).

    # dragons ahead, you have been warned
    zfs set dedup=on tank/data

    # the property is normally inherited by all child datasets
    zfs set xattr=sa tank

    # let ZFS choose the dnode size automatically
    zfs set dnodesize=auto tank/data

See Also

