ZFS
ZFS is a combined file system and logical volume manager designed by Sun Microsystems (now owned by Oracle). It was released as open-source software under the Common Development and Distribution License (CDDL) as part of the OpenSolaris project in November 2005. OpenZFS, announced in September 2013 as the truly open-source successor to the ZFS project, brings together developers and users of the various open-source forks of the original ZFS on different platforms.
Described as "the last word in filesystems", ZFS is scalable and includes extensive protection against data corruption, support for high storage capacities, efficient data compression, integration of the concepts of file system and volume management, snapshots and copy-on-write clones, continuous integrity checking with automatic repair, RAID-Z, and native NFSv4 ACLs, and it can be very precisely configured.
- Creating the Pool
- Provisioning file systems or volumes
- File Sharing
- TRIM support
- Advanced Topics
- See Also
Debian GNU/kFreeBSD users have been able to use ZFS since the release of Squeeze; for those who use the Linux kernel, it has been available from the contrib archive area in the form of DKMS source since the release of Stretch. There is also a deprecated userspace implementation based on the FUSE framework. This page demonstrates ZFS on Linux (ZoL), unless the kFreeBSD or FUSE implementation is specifically pointed out.
Due to potential legal incompatibilities between the CDDL and the GPL, even though both are OSI-approved free software licenses that comply with the DFSG, ZFS development is not supported by the Linux kernel. ZoL is a project funded by the Lawrence Livermore National Laboratory to develop a native Linux kernel module for its massive storage requirements and supercomputers.
- Pool based storage
- Data integrity against silent data corruption
- Software Volume Manager
- Software RAID
ZFS on Linux is provided in the form of DKMS source for Debian users. It is necessary to add the contrib section to your APT sources configuration to be able to get the packages. The Debian ZFS on Linux Team also recommends installing ZFS-related packages from the Backports archive; upstream stable patches are tracked there and compatibility is always maintained. Once configured, use the following commands to install the packages:
apt update
apt install linux-headers-amd64
apt install -t buster-backports zfsutils-linux
The given example separates the steps of installing the Linux headers and ZFS. It is fine to combine everything in one command, but being explicit avoids any chance of mixing up versions. Future updates will be taken care of by apt.
If automatic installation of Recommends is disabled it is also necessary to install zfs-dkms.
The modules will be built automatically only for kernels that have the corresponding linux-headers package installed. Install the linux-headers-<arch> package to always have the latest linux headers installed (analog to the linux-image-<arch> package).
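After installation it is worth verifying that the DKMS module was built for the running kernel and can be loaded. A quick sanity check might look like the following (the zfs version subcommand is available since ZoL 0.8):

```shell
# show DKMS build status for the zfs module ("installed" means built OK)
dkms status zfs
# try loading the kernel module
modprobe zfs
# print userland and kernel module versions; they should match
zfs version
```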
Creating the Pool
Many disks can be added to a storage pool, and ZFS allocates space from it, so the first step of using ZFS is creating a pool. It is recommended to use more than one whole disk to take full advantage of ZFS's benefits, but it is fine to proceed with only one device or just a partition.
In the world of ZFS, device names with a path/id are typically used to identify a disk, because device names like /dev/sdX may change on every boot. These names can be retrieved with ls -l /dev/disk/by-id/ or ls -l /dev/disk/by-path/.
In case of using whole disks ZFS will automatically reserve 8 MiB at the end of the device, to allow for replacement and/or additional physical devices that don't have the exact same size as the other devices in the pool.
When using partitions (or preparing the disk manually in advance) it is possible to also use the GPT partition labels to identify the partition/disk, as they are customizable and nicer for humans to understand. These can be found in /dev/disk/by-partlabel/.
The most common pool configurations are mirror, raidz and raidz2, choose one from the following:
- mirror pool (similar to raid-1, ≥ 2 disks, 1:1 redundancy)
zpool create tank mirror scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c
- raidz1 pool (similar to raid-5, ≥ 3 disks, 1 disk redundancy)
zpool create tank raidz scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c scsi-35000cca26c108480
- raidz2 pool (similar to raid-6, ≥ 4 disks, 2 disks redundancy)
zpool create tank raidz2 scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c scsi-35000cca26c108480 scsi-35000cca266ccbdb4
- stripe pool (similar to raid-0, no redundancy)
zpool create tank scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c
- single disk stripe pool
zpool create tank scsi-35000cca26c108480
If building a pool with a larger number of disks, you are encouraged to configure them into more than one group (vdev) and construct a striped pool across these vdevs. This allows a more flexible pool design that trades off space, redundancy and efficiency.
Different configurations can have different I/O characteristics under certain workload patterns; please refer to the See Also section at the end of this page for more information.
- 5 mirrors (like raid-10, 1:1 redundancy)
zpool create tank mirror scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c \
  mirror scsi-35000cca26c108480 scsi-35000cca266ccbdb4 \
  mirror scsi-35000cca266c75c74 scsi-35000cca26c0e84dc \
  mirror scsi-35000cca266cda748 scsi-35000cca266cd14b4 \
  mirror scsi-35000cca266cb8ae4 scsi-35000cca266cbad80
- 2 raidz1 vdevs (like raid-50, 1 disk of redundancy per vdev)
zpool create tank raidz scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c scsi-35000cca26c108480 scsi-35000cca266ccbdb4 scsi-35000cca266c75c74 \
  raidz scsi-35000cca26c0e84dc scsi-35000cca266cda748 scsi-35000cca266cd14b4 scsi-35000cca266cb8ae4 scsi-35000cca266cbad80
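Whichever layout is chosen, it is worth verifying the resulting vdev tree and pool health before putting data on it, for example:

```shell
# display the vdev tree and health of the new pool
zpool status tank
# show size, allocated space and capacity
zpool list tank
```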
ZFS can use a fast SSD as a second level cache (L2ARC) behind RAM (ARC), which can improve the cache hit rate and thus overall performance. Because cache devices may be read and written very frequently when the pool is busy, consider using more durable SSDs (SLC/MLC over TLC/QLC), preferably with the NVMe protocol. This cache is used only for read operations: writes to the cache device are driven by read operations and are not related to pool write operations at all.
zpool add tank cache nvme-MT001600KWHAC_S3M0NA0K700264
ZFS can also use NVRAM/Optane/SSD devices as a SLOG (Separate ZFS Intent Log). This is sometimes described as a kind of write cache, but that is far from the truth. SLOG devices are used to speed up synchronous writes: transactions are sent to the SLOG in parallel with the slower disks, and as soon as a transaction is safe on the SLOG the operation is marked as completed, so the synchronous operation is unblocked sooner without compromising resistance against power loss. A mirrored SLOG setup is strongly recommended. Please also note that asynchronous writes are not sent to the SLOG by default; you can set the sync=always property on the working dataset and check whether performance improves.
zpool add tank log mirror nvme-MT001600KWHAC_S3M0NA0K700244 nvme-MT001600KWHAC_S3M0NA0K700246
Provisioning file systems or volumes
After creating the zpool, we can provision file systems or volumes (ZVOLs). A ZVOL is a block device whose space is allocated from the pool. It is possible to create another file system on it (including swap) or to use it as a storage device for virtual machines, like any other block device.
- provision a file system named data under pool tank, and have it mounted on /data
mkdir -p /data
zfs create -o mountpoint=/data tank/data
- thin provision a ZVOL of 4 GB named vol under pool tank, format it as ext4, then mount it on /mnt temporarily
zfs create -s -V 4GB tank/vol
mkfs.ext4 /dev/zvol/tank/vol
mount /dev/zvol/tank/vol /mnt
- destroy previously created file systems and ZVOL
# ZFS will handle mounts that are managed by it
zfs destroy tank/data
# Need to umount first, because this mount is user managed
umount /dev/zvol/tank/vol
zfs destroy tank/vol
Using ZVOLs for swap on Linux can lead to deadlocks, see https://github.com/openzfs/zfs/issues/7734.
Snapshots are one of the most wanted features of a modern file system, and ZFS definitely supports them.
Creating and Managing Snapshots
- making a snapshot of tank/data
zfs snapshot tank/data@2019-06-24
- removing a snapshot
zfs destroy tank/data@2019-06-24
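Existing snapshots can be listed before deciding which ones to keep or destroy:

```shell
# list all snapshots under tank/data, including child datasets
zfs list -t snapshot -r tank/data
```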
Backup and Restore (with remote)
It is possible to back up a ZFS dataset to another pool with the zfs send/recv commands, even when that pool is located at the other end of a network.
# create an initial snapshot
zfs snapshot tank/data@initial
# send it to another local pool named tank2, calling the dataset packman
zfs send tank/data@initial | zfs recv -F tank2/packman
# send it to a remote pool named tanker on the remote side
zfs send tank/data@initial | ssh remotehost zfs recv -F tanker/data
# after using tank/data for a while, create another snapshot
zfs snapshot tank/data@2019-06-24T18-10
# incrementally send the new state to the remote side
zfs send -i initial tank/data@2019-06-24T18-10 | ssh remotehost zfs recv -F tanker/data
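After a send/recv cycle, the replica can be checked for the expected snapshots (dataset names follow the example above):

```shell
# snapshots travel with the stream; verify them on the local backup pool
zfs list -t snapshot -r tank2/packman
# and on the remote pool
ssh remotehost zfs list -t snapshot -r tanker/data
```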
File Sharing
ZFS integrates with the operating system's NFS, CIFS and iSCSI servers; it does not implement its own servers but reuses the existing software. However, iSCSI integration is not yet available on Linux. It is recommended to enable xattr=sa and dnodesize=auto for these use cases.
To share a dataset through NFS, nfs-kernel-server package needs to be installed:
apt install nfs-kernel-server
Set the recommended properties on the target ZFS file system:
zfs set xattr=sa dnodesize=auto tank/data
Configure a very simple NFS share (read/write to 192.168.0.0/24, read only to 10.0.0.0/8):
zfs set mountpoint=/data tank/data
zfs set sharenfs="rw=192.168.0.0/24,ro=10.0.0.0/8" tank/data
zfs share tank/data
Verify the share is exported successfully:
showmount -e 127.0.0.1
Stop the NFS share:
zfs unshare tank/data
# If you want to disable the share permanently, do the following
zfs set sharenfs=off tank/data
CIFS is a dialect of the Server Message Block (SMB) protocol and can be used on Windows, VMS, several versions of Unix, and other operating systems.
To share a dataset through CIFS, samba package needs to be installed:
apt install samba
Because Microsoft Windows is not case sensitive, it is recommended to set casesensitivity=mixed on the dataset to be shared; this property can only be set at creation time:
zfs create -o casesensitivity=mixed -o xattr=sa -o dnodesize=auto tank/data
Configure a very simple CIFS share (read/write to 192.168.0.0/24, read only to 10.0.0.0/8):
zfs set mountpoint=/data tank/data
zfs set sharesmb=on tank/data
zfs share tank/data
Verify the share is exported successfully:
smbclient -U guest -N -L localhost
Stop the CIFS share:
zfs unshare tank/data
# If you want to disable the share permanently, do the following
zfs set sharesmb=off tank/data
Encryption
ZFS native encryption has been available since the ZoL 0.8.0 release. For any older version, the alternative solution is to layer ZFS on top of LUKS (see cryptsetup). Creating an encrypted dataset is straightforward, for example:
zfs create -o encryption=on -o keyformat=passphrase tank/secret
ZFS will prompt for the passphrase. Alternatively, the key location can be specified with the keylocation property.
ZFS can also encrypt a dataset during "recv":
zfs send tank/data | zfs recv -o encryption=on -o keylocation=file:///path/to/my/raw/key backup/data
Before mounting an encrypted dataset, its key has to be loaded first (zfs load-key tank/secret). zfs mount provides a shortcut for the two steps:
zfs mount -l tank/secret
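The encryption state of a dataset can be inspected, and the key unloaded again when it is no longer needed (using tank/secret from the example above):

```shell
# show encryption settings and whether the key is currently loaded
zfs get encryption,keystatus,keyformat tank/secret
# unmount the dataset, then unload its key
zfs unmount tank/secret
zfs unload-key tank/secret
```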
TRIM support
TRIM is a set of commands that inform a disk device which blocks of data are no longer considered in use and can therefore be erased internally. It was introduced soon after SSDs appeared and is also widely used on SMR hard drives. TRIM support was added in OpenZFS 0.8.
TRIM helps the disk controller do a better job at garbage collection and reduces write amplification. Proper trimming may help mitigate or avoid performance degradation and improve endurance. In the real world TRIM is not always vitally necessary; the situation varies with the disk controller implementation, flash over-provisioning, and workload patterns. Excessive TRIM can hurt online performance and affect long-term endurance. If your workload constantly writes and deletes a lot of data (measured in DWPD, drive writes per day), you might need to TRIM more frequently.
Traditional RAID (hardware or md) can suffer from performance problems when using TRIM, because there are two levels in the path (filesystem and RAID): individual TRIM commands reach the disks in small sizes like 4 KB or 64 KB (usually PAGE_SIZE), and merging them is possible but often difficult to implement. ZFS does not have this kind of issue because it has direct knowledge of both the file system and space allocation.
To perform TRIM on a pool, use
zpool trim tank
Status and progress can be viewed with
zpool status -t tank
Be aware that pool performance can be affected while trimming, depending on disk performance, when the workload is heavy.
Periodic TRIM
On Debian systems, since the Bullseye release (or 2.0.3-9 in buster-backports), periodic TRIM is implemented using a custom per-pool property: org.debian:periodic-trim
By default, these TRIM jobs are scheduled on the first Sunday of every month. Completion speed depends on disk size, disk speed and workload pattern. Cheap QLC disks can take considerably more time than expensive enterprise-grade NVMe disks.
When org.debian:periodic-trim is not present on the pool, or the property is present but its value is empty or invalid, it is treated as auto.
- SATA SSDs with protocol version 3.0 or lower handle TRIM (UNMAP) synchronously, which can block all other I/O on the disk until the command finishes; this can lead to severe interruption. In such cases, pool trim is only recommended during scheduled maintenance periods.
SATA SSDs with protocol version 3.1 or later may perform TRIM in a queued manner, making the operation non-blocking. Enabling TRIM on these disks is planned by the Debian ZFS maintainers (990871) but is yet to be implemented, because there are issues to consider; for example, some disks advertise the ability to do Queued TRIM although the implementation is known to be broken. Users can enable pool trim by setting the property to enable after checking carefully.
When the >=3.1 support is properly implemented, a pool with mixed types of SSDs will be measured by whether all disks are of the recommended types. Users can enable pool trim by setting the property to enable after carefully checking all disks in the pool.
To set the property to enable, use:
zfs set org.debian:periodic-trim=enable tank
Please note this property is set on the root dataset of the pool, not on the pool itself, because pool-level user properties are not yet implemented.
Autotrim
ZFS can also perform TRIM as data is deleted, somewhat similar to the discard mount option of other file systems. To use autotrim, set the pool property:
zpool set autotrim=on tank
Automatic TRIM looks for space that has been recently freed and is no longer allocated by the pool, and trims it periodically. It does not, however, immediately reclaim blocks after a free, which makes it very effective, at the cost of being more likely to encounter tiny ranges.
Note that the previously mentioned periodic-trim does not conflict with autotrim, and autotrim alone is already sufficient for light usage. For heavy workloads, periodic-trim (which is a full trim) can be used to complement autotrim.
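The current TRIM-related settings of a pool can be reviewed as follows (the org.debian property lives on the root dataset, as noted above):

```shell
# pool-wide automatic trim setting
zpool get autotrim tank
# Debian's periodic-trim user property, stored on the root dataset
zfs get org.debian:periodic-trim tank
```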
Compatibility
The last version of ZFS released from OpenSolaris is zpool v28. After that, Oracle decided not to publish further updates, so version 28 has the best interoperability across all implementations. It is also the last pool version that zfs-fuse supports.
It was later decided that the open-source implementations would stick to zpool v5000 and track any future changes with feature flags. This is an incompatible change relative to the closed-source successor, and v28 will remain the last interoperable pool version.
By default, new pools are created with all supported features enabled (use the -d option to disable this). If you want a pool of version 28:
zpool create -o version=28 tank mirror scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c
All known OpenZFS implementations support zpool v5000 and feature flags in their major stable versions; this includes illumos, FreeBSD, ZFS on Linux and OpenZFS on OS X. There are differences in the supported features among these implementations: for example, the large_dnode feature flag was first introduced on Linux, and spacemap_v2 was not supported on Linux until ZoL 0.8.x. More features have differing availability beyond feature flags; for example, xattr=sa is only available on Linux and OS X, whereas TRIM was not supported on Linux until ZoL 0.8.x.
Advanced Topics
These are not really advanced topics like the internals of ZFS and storage, but rather topics that are not relevant to everyone.
- 64-bit hardware and a 64-bit kernel are recommended. ZFS wants a lot of memory (and address space) to work best, and it was developed from the beginning with the assumption of being 64-bit only. It is possible to use ZFS in 32-bit environments, but a lot of care must be taken by the user.
Use ashift=12 or ashift=13 when creating the pool if applicable (though ZFS can detect this correctly in most cases). The ashift value is a power-of-two exponent that should match the physical sector size of the disks, for example 2^9=512, 2^12=4096, 2^13=8192. Some disks report a logical sector size of 512 bytes while having a 4 KiB physical sector size (aka 512e), and some SSDs have an 8 KiB physical sector size. USB enclosures can also prevent correct detection of the physical sector size.
zpool create -o ashift=12 tank mirror scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c
Consider using ashift=12 or ashift=13 even if you currently use only disks with 512-byte sectors. Adding devices with bigger sectors to the same vdev can severely impact performance due to wrong alignment, while a device with 512-byte sectors will also work with a higher ashift (bigger sectors are aligned, as they are multiples of 512).
- Enable compression unless you are absolutely paranoid, because ZFS can skip compression of objects that it sees are not compressible, and compressed objects can improve I/O efficiency:
zfs set compression=on tank
This will enable compression using the current default compression algorithm (currently lz4 if the pool has the lz4_compress feature enabled).
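Whether compression is paying off can be checked afterwards via the compressratio property, which reports the ratio achieved across the data written so far:

```shell
# show the compression setting and the achieved ratio
zfs get compression,compressratio tank
```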
- Install as much RAM as feasible. ZFS has an advanced caching design that can take advantage of a lot of memory to improve performance. This cache is called the Adaptive Replacement Cache (ARC).
- Block-level deduplication is scary when RAM is limited, but the feature is increasingly promoted in professional storage solutions nowadays, since it can perform impressively in scenarios like storing VM disks that share common ancestors. Because the deduplication table is part of the ARC, it is possible to use a fast L2ARC (NVMe SSD) to mitigate a lack of RAM. A typical space requirement is 2-5 GB of ARC/L2ARC per 1 TB of disk; if you are building a storage system with 1 PB raw capacity, at least 1 TB of L2ARC space should be planned for deduplication (minimum size, assuming the pool is mirrored).
# dragons ahead, you have been warned
zfs set dedup=on tank/data
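If deduplication is enabled, the size of the deduplication table (DDT) should be monitored, since it competes with other data for ARC space; zdb can also estimate the achievable dedup ratio before committing:

```shell
# -D adds dedup table statistics (entries, on-disk and in-core size)
zpool status -D tank
# simulate dedup on existing data without enabling it (can take a while)
zdb -S tank
```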
- ECC RAM is always preferred. ZFS uses checksums to ensure data integrity, which depends on the system memory being correct. This does not mean you should turn to other file systems when ECC memory is not possible, but it opens the door to failing to detect silent data corruption when the RAM generates random errors unexpectedly. If you are building a serious storage solution, ECC RAM is required.
Store extended attributes as system attributes (Linux only). With xattr=on (the default), ZFS stores extended attributes in hidden subdirectories, which can hurt performance.
# the property is most likely intended to be inherited by all child datasets
zfs set xattr=sa tank
Set dnodesize=auto for non-root datasets. This allows ZFS to automatically determine the dnode size, which is useful if the dataset uses the xattr=sa property and the workload makes heavy use of extended attributes (SELinux-enabled systems, Lustre servers, and Samba/NFS servers). This setting relies on the large_dnode feature flag on the pool, which may not be widely supported on all OpenZFS platforms; please also note that GRUB does not yet support this feature.
zfs set dnodesize=auto tank/data
Thin provisioning allows a volume to use up to a limited amount of space without reserving any resources until explicitly demanded, making over-provisioning possible, at the risk of being unable to allocate space when the pool gets full. It is usually considered a way of facilitating flexible management and improving the space efficiency of the backing storage.
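The difference shows up in the refreservation property; a quick way to compare (the volume names tank/thick and tank/thin are illustrative):

```shell
# thick volume: space is reserved up front via refreservation
zfs create -V 4G tank/thick
# thin (sparse, -s) volume: no reservation is made
zfs create -s -V 4G tank/thin
# thick shows a refreservation close to 4G, thin shows none
zfs get volsize,refreservation,used tank/thick tank/thin
```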
See Also
Aaron Toponce's ZFS on Linux User Guide
The Z File System (ZFS) from FreeBSD handbook
ZFS article on Archlinux Wiki
ZFS article on Gentoo Wiki
ZFS 101 - Understanding ZFS storage and performance (2020-08-05)