ZFS

ZFS is a combined file system and logical volume manager designed by Sun Microsystems (now owned by Oracle). It was released as open-source software under the Common Development and Distribution License (CDDL) as part of the OpenSolaris project in November 2005. OpenZFS, announced in September 2013 as the truly open-source successor to the ZFS project, brings together developers and users of the various open-source forks of the original ZFS on different platforms.

Described as "the last word in filesystems", ZFS is scalable and includes extensive protection against data corruption, support for high storage capacities, efficient data compression, integration of the concepts of file system and volume management, snapshots and copy-on-write clones, continuous integrity checking and automatic repair, RAID-Z, native NFSv4 ACLs, and can be very precisely configured.

Status

Debian GNU/kFreeBSD users have been able to use ZFS since the release of Squeeze; for those who use the Linux kernel, it has been available from the contrib archive area in the form of a DKMS source since the release of Stretch. There is also a deprecated userspace implementation based on the FUSE framework. This page demonstrates ZFS on Linux (ZoL) unless the kFreeBSD or FUSE implementation is specifically mentioned.

Due to potential legal incompatibilities between the CDDL and the GPL, despite both being OSI-approved free software licenses which comply with the DFSG, ZFS cannot be included in the mainline Linux kernel. ZoL is a project funded by the Lawrence Livermore National Laboratory to develop a native Linux kernel module for its massive storage requirements and supercomputers.

Features

  • Pool based storage
  • Copy-on-Write
  • Snapshots
  • Data integrity against silent data corruption
  • Software Volume Manager
  • Software RAID

Installation

ZFS on Linux is provided in the form of a DKMS source for Debian users. It is necessary to add the contrib section to your apt sources configuration to be able to get the packages. The Debian ZFS on Linux Team also recommends installing ZFS-related packages from the Backports archive, where upstream stable patches are tracked and compatibility is always maintained. Once configured, use the following commands to install the packages:

  sudo apt update
  sudo apt install linux-headers-amd64
  sudo apt install -t stable-backports zfsutils-linux

Future updates will be taken care of by apt.

If you receive the following error:

E: The value 'stable-backports' is invalid for APT::Default-Release as such a release is not available in the sources

Make sure /etc/apt/sources.list contains the correct Debian codename backports entry:

codename=$(lsb_release -cs); echo "deb http://deb.debian.org/debian $codename-backports main contrib non-free" | sudo tee -a /etc/apt/sources.list && sudo apt update

(!) If automatic installation of Recommends is disabled, it is also necessary to install zfs-dkms.

/!\ The modules will be built automatically only for kernels that have the corresponding linux-headers package installed. Install the linux-headers-<arch> package to always have the latest Linux headers installed (analogous to the linux-image-<arch> package).
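
If automatic installation of Recommends is disabled on your system, the DKMS package can be pulled in explicitly; a minimal sketch, assuming the backports entry above is already configured:

    # only needed when automatic installation of Recommends is disabled
    sudo apt install -t stable-backports zfs-dkms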

Creating the Pool

Many disks can be added to a storage pool, and ZFS allocates space from it, so the first step of using ZFS is creating a pool. It is recommended to use more than one whole disk to take full advantage of ZFS, but it is fine to proceed with only one device or even just a partition.

In the world of ZFS, device names with path/id are typically used to identify a disk, because device names like /dev/sdX may change across reboots. These names can be retrieved with ls -l /dev/disk/by-id/ or ls -l /dev/disk/by-path/.

{i} When using whole disks, ZFS automatically reserves 8 MiB at the end of the device, to allow for replacement and/or additional physical devices that don't have exactly the same size as the other devices in the pool.

(!) When using partitions (or preparing the disk manually in advance) it is possible to also use the GPT partition labels to identify the partition/disk, as they are customizable and nicer for humans to understand. These can be found in /dev/disk/by-partlabel/.

Basic Configuration

The most common pool configurations are mirror, raidz and raidz2; choose one of the following:

  • mirror pool (similar to raid-1, ≥ 2 disks, 1:1 redundancy)

    zpool create tank mirror scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c
  • raidz1 pool (similar to raid-5, ≥ 3 disks, 1 disk redundancy)

    zpool create tank raidz scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c scsi-35000cca26c108480
  • raidz2 pool (similar to raid-6, ≥ 4 disks, 2 disks redundancy)

    zpool create tank raidz2 scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c scsi-35000cca26c108480 scsi-35000cca266ccbdb4
  • stripe pool (similar to raid-0, no redundancy)

    zpool create tank scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c
  • single disk stripe pool

    zpool create tank scsi-35000cca26c108480

Starting from v2.1, OpenZFS supports a new draid configuration that allows fast resilvering and (in a future release) dynamic expansion.

To begin with, just replace raidz[1,2,3] with draid[1,2,3] in your zpool create command. If you want more advanced features such as distributed hot spares, use the following syntax:

    zpool create tank draid[<parity>][:<data>d][:<children>c][:<spares>s] <vdevs...>

where parity is the parity level from 1 to 3, data is the number of data devices per redundancy group (defaults to 8), children is the expected number of children (all disks), and spares is the number of distributed hot spares (can be zero).
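
For example, a minimal sketch of a small dRAID layout (the disk names below are placeholders): single parity, 2 data disks per redundancy group, 4 children in total and 1 distributed spare, which satisfies the requirement that children minus spares is at least data plus parity:

    # hypothetical 4-disk dRAID: 1 parity, 2 data per group, 4 children, 1 spare
    zpool create tank draid1:2d:4c:1s disk1 disk2 disk3 disk4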

Advanced Configuration

If building a pool with a larger number of disks, you are encouraged to configure them into more than one group (vdev) and construct a stripe pool over these vdevs. This allows a more flexible pool design, trading off space, redundancy and efficiency.

Different configurations can have different IO characteristics under certain workload patterns; please refer to the See Also section at the end of this page for more information.

  • 5 mirrors (like raid-10, 1:1 redundancy)

    zpool create tank mirror scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c \
                    mirror scsi-35000cca26c108480 scsi-35000cca266ccbdb4 \
                    mirror scsi-35000cca266c75c74 scsi-35000cca26c0e84dc \
                    mirror scsi-35000cca266cda748 scsi-35000cca266cd14b4 \
                    mirror scsi-35000cca266cb8ae4 scsi-35000cca266cbad80
  • 2 raidz vdevs (like raid-50, 2 disks of redundancy in total)

    zpool create tank raidz scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c scsi-35000cca26c108480 scsi-35000cca266ccbdb4 scsi-35000cca266c75c74 \
                    raidz scsi-35000cca26c0e84dc scsi-35000cca266cda748 scsi-35000cca266cd14b4 scsi-35000cca266cb8ae4 scsi-35000cca266cbad80

ZFS can use a fast SSD as a second level cache (L2ARC) after RAM (ARC), which can improve the cache hit rate and thus overall performance. Because cache devices may be read and written very frequently when the pool is busy, please consider using more durable SSD devices (SLC/MLC over TLC/QLC), preferably NVMe. This cache is used for read operations only: data is written to the cache device so that it can serve future reads, and it plays no role in write operations.

    zpool add tank cache nvme-MT001600KWHAC_S3M0NA0K700264

ZFS can also make use of NVRAM/Optane/SSD as a SLOG (Separate ZFS Intent Log) device. It is sometimes described as a kind of write cache, but that is not accurate. SLOG devices speed up synchronous writes by sending those transactions to the SLOG in parallel with the slower disks; as soon as a transaction is committed to the SLOG the operation is reported as completed, so the synchronous operation is unblocked sooner while resistance against power loss is not compromised. A mirrored setup of SLOG devices is strongly recommended. Please also note that asynchronous writes are not sent to the SLOG by default; you could try setting the sync=always property on the working dataset and see whether performance improves.

    zpool add tank log mirror nvme-MT001600KWHAC_S3M0NA0K700244 nvme-MT001600KWHAC_S3M0NA0K700246

Provisioning file systems or volume

After creating the zpool, we are able to provision file systems or volumes (ZVOL). A ZVOL is a kind of block device whose space is allocated from the pool. It is possible to create another file system on it (including swap) or use it as a storage device for virtual machines, like any other block device.

  • provision a file system named data under pool tank, and have it mounted on /data

    mkdir -p /data
    zfs create -o mountpoint=/data tank/data
  • thin-provision a 4 GB ZVOL named vol under pool tank, format it as ext4, then mount it on /mnt temporarily

    zfs create -s -V 4GB tank/vol
    mkfs.ext4 /dev/zvol/tank/vol
    mount /dev/zvol/tank/vol /mnt
  • destroy previously created file systems and ZVOL

    # ZFS will handle mounts that are managed by it
    zfs destroy tank/data
    # Need to umount first, because this mount is user managed
    umount /dev/zvol/tank/vol
    zfs destroy tank/vol

/!\ Using ZVOLs for swap on Linux can lead to deadlocks, see https://github.com/openzfs/zfs/issues/7734.

Snapshots

Snapshots are one of the most wanted features of a modern file system, and ZFS definitely supports them.

Creating and Managing Snapshots

  • making a snapshot of tank/data

    zfs snapshot tank/data@2019-06-24
  • removing a snapshot

    zfs destroy tank/data@2019-06-24

Backup and Restore (with remote)

It is possible to back up a ZFS dataset to another pool with the zfs send/recv commands, even if the pool is located at the other end of the network.

    # create an initial snapshot
    zfs snapshot tank/data@initial
    # send it to another local pool named tank2, calling the dataset packman
    zfs send tank/data@initial | zfs recv -F tank2/packman
    # send it to a remote pool named tanker on the remote side
    zfs send tank/data@initial | ssh remotehost zfs recv -F tanker/data
    # after using tank/data for a while, create another snapshot
    zfs snapshot tank/data@2019-06-24T18-10
    # incrementally send the new state to the remote side
    zfs send -i initial tank/data@2019-06-24T18-10 | ssh remotehost zfs recv -F tanker/data

Snapshot Utilities

Package    | Integration     | Replication | Homepage                                        | Note
sanoid     | Systemd or Cron | YES         | https://github.com/jimsalterjrs/sanoid          |
simplesnap | Cron            | NO          | https://github.com/jgoerzen/simplesnap          |
zfsnap     | Manual          | NO          | https://github.com/graudeejs/zfSnap             |
zsnapsd    | Systemd         | YES         | https://github.com/khenderick/zfs-snap-manager  | Not maintained anymore

File Sharing

ZFS integrates with the operating system's NFS, CIFS and iSCSI servers; it does not implement its own servers but reuses existing software. However, iSCSI integration is not yet available on Linux. It is recommended to enable xattr=sa and dnodesize=auto for these use cases.

NFS shares

To share a dataset through NFS, the nfs-kernel-server package needs to be installed:

    apt install nfs-kernel-server

Set the recommended properties on the target ZFS file system:

    zfs set xattr=sa dnodesize=auto tank/data

Configure a very simple NFS share (read/write to 192.168.0.0/24, read only to 10.0.0.0/8):

    zfs set mountpoint=/data tank/data
    zfs set sharenfs="rw=192.168.0.0/24,ro=10.0.0.0/8" tank/data
    zfs share tank/data

Verify the share is exported successfully:

    showmount -e 127.0.0.1

Stop the NFS share:

    zfs unshare tank/data
    # If you want to disable the share forever, do the following
    zfs set sharenfs=off tank/data

CIFS shares

CIFS is a dialect of Server Message Block (SMB) Protocol and could be used on Windows, VMS, several versions of Unix, and other operating systems.

To share a dataset through CIFS, the samba package needs to be installed:

    apt install samba

Because Microsoft Windows is not case-sensitive, it is recommended to set casesensitivity=mixed on the dataset to be shared; this property can only be set at creation time:

    zfs create -o casesensitivity=mixed -o xattr=sa -o dnodesize=auto tank/data

Configure a very simple CIFS share:

    zfs set mountpoint=/data tank/data
    zfs set sharesmb=on tank/data
    zfs share tank/data

Verify the share is exported successfully:

    smbclient -U guest -N -L localhost

Stop the CIFS share:

    zfs unshare tank/data
    # If you want to disable the share forever, do the following
    zfs set sharesmb=off tank/data

Encryption

ZFS native encryption has been available since the ZoL 0.8.0 release. For any older version the alternative is to run ZFS on top of LUKS (see cryptsetup). Creating an encrypted dataset is straightforward, for example:

    zfs create -o encryption=on -o keyformat=passphrase tank/secret

ZFS will prompt you to enter the passphrase. Alternatively, the key location can be specified with the keylocation property.
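
For example, a minimal sketch of creating a dataset whose wrapping key is read from a raw key file instead of a passphrase prompt (the key path and dataset name are placeholders):

    # generate a hypothetical 32-byte raw key (keyformat=raw requires exactly 32 bytes)
    dd if=/dev/urandom of=/path/to/my/raw/key bs=32 count=1
    zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///path/to/my/raw/key tank/secret2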

When creating a child dataset of an encrypted dataset, the encryption is inherited and shown in the encryptionroot property; mounting the child requires loading the encryption key of the parent dataset.
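
A quick sketch illustrating this, continuing the tank/secret example above:

    # the child inherits encryption from tank/secret
    zfs create tank/secret/child
    # encryptionroot points at the dataset whose key protects it
    zfs get encryptionroot tank/secret/child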

ZFS can also encrypt a dataset during "recv":

    zfs send tank/data | zfs recv -o encryption=on -o keylocation=file:///path/to/my/raw/key backup/data

Before mounting an encrypted dataset, the key has to be loaded first:

    zfs load-key tank/secret

This is usually useful when the parent dataset is created with the canmount=off property. Also, zfs mount provides a shortcut for the two steps:

    zfs mount -l tank/secret

To unload the encryption secret, after all relevant datasets are unmounted:

    zfs unload-key tank/secret

TRIM support

TRIM is a set of commands used to inform the disk device which blocks of data are no longer considered to be in use and can therefore be erased internally. It was introduced soon after SSDs appeared and is also widely used on SMR hard drives. TRIM support was added in OpenZFS 0.8.

TRIM helps the disk controller do a better job at garbage collection and reduces write amplification. Proper trimming may help mitigate or avoid performance degradation and improve endurance. In the real world TRIM is not always vitally necessary; the situation varies with the disk controller implementation, flash over-provisioning and workload patterns. Excessive TRIM can hurt online performance and affect long-term endurance. If your workload constantly writes and deletes a lot of data (measured in DWPD, drive writes per day), you might need a higher TRIM frequency.

Traditional RAIDs (hardware/md) can suffer from performance problems when using TRIM, because there are two levels in the path (filesystem - raid): individual TRIM commands are issued in small sizes like 4KB or 64KB (usually PAGE_SIZE) when they reach the disks, and while some merging is possible it is often difficult to implement. ZFS does not have this kind of issue because it has direct knowledge of both the file system and space allocation.

Manual TRIM

To perform TRIM on a pool, use

    zpool trim tank

Status and progress can be viewed with

    zpool status -t tank

Be aware that pool performance could be affected while trimming, depending on disk performance and how heavy the workload is.

Periodic TRIM

On Debian systems, since the Bullseye release (or 2.0.3-9 in buster-backports), periodic TRIM is implemented using a custom per-pool property: org.debian:periodic-trim

By default, these TRIM jobs are scheduled on the first Sunday of every month. The completion time depends on disk size, disk speed and workload pattern. Cheap QLC disks can take considerably more time than expensive enterprise-grade NVMe disks.

org.debian:periodic-trim | Pool SSD  | TRIM
auto (1)                 | nvme-only | yes
auto (1)                 | sata3.0   | no (2)
auto (1)                 | sata>=3.1 | no (3)
auto (1)                 | mixed     | no (4)
enabled                  | any       | yes
disabled                 | any       | no

  1. When org.debian:periodic-trim is not present on the pool, or the property is present but its value is empty/invalid, it is treated as auto.

  2. SATA SSDs with protocol version 3.0 or lower handle TRIM (UNMAP) in a synchronous manner, which can block all other I/O on the disk until the command is finished; this can lead to severe interruption. In such cases, pool trim is only recommended during scheduled maintenance periods.
  3. SATA SSDs with protocol version >=3.1 may perform TRIM in a queued manner, making the operation non-blocking. Enabling TRIM on these disks is planned by the Debian ZFS maintainers (990871), but it is yet to be implemented because there are issues to consider - for example, some disks advertise the ability to do Queued TRIM although their implementation is known to be broken. Users can enable pool trim by setting the property to enable after checking carefully.

  4. When the >=3.1 support is properly implemented, a pool with mixed types of SSDs will be evaluated by whether all disks are of the recommended types. Users can enable pool trim by setting the property to enable after checking all disks in the pool carefully.

To set the property to enable, use:

    zfs set org.debian:periodic-trim=enable tank

Please note this property is set on the root dataset of the pool, not on the pool itself, because per-pool user properties are not yet implemented.
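
To check what the property is currently set to on a pool's root dataset, a quick sketch using the standard zfs get command:

    # "-" in the output means the property is not set and is treated as auto
    zfs get org.debian:periodic-trim tank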

autotrim property

ZFS can perform TRIM after data deletion, which is in some ways similar to the discard mount option of other file systems. To use autotrim, set the pool property:

    zpool set autotrim=on tank

Automatic TRIM looks for space which has been recently freed and is no longer allocated by the pool, and trims it periodically. However, it does not immediately reclaim blocks after a free, which makes it very effective at the cost of being more likely to encounter small TRIM ranges.

Note that the previously mentioned periodic TRIM does not conflict with autotrim, and on its own it is already sufficient for light usage. For heavy workloads, periodic TRIM (which is a full trim) can be used to complement autotrim.

Auto Scrub of all pools

Debian by default has a cron job entry to scrub all pools on the second Sunday of every month at 24 minutes past midnight.

See /etc/cron.d/zfsutils-linux and /usr/lib/zfs-linux/scrub for details.
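
Independently of the cron job, a scrub can also be started manually at any time and its progress checked afterwards:

    # start a scrub of the whole pool
    zpool scrub tank
    # check progress and any repaired errors
    zpool status tank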

It is possible to disable this by setting a ZFS user-defined property on the root dataset of a pool.

To set the property to disable, use:

    zfs set org.debian:periodic-scrub=disable tank

Interoperability

The last version of ZFS released from OpenSolaris is zpool v28; after that, Oracle decided not to publish further updates, so version 28 has the best interoperability across all implementations. This is also the last pool version zfs-fuse supports.

Later it was decided that the open-source implementations would stick to zpool v5000 and track and control any future changes with feature flags. This is incompatible with the closed-source successor, and v28 will remain the last interoperable pool version.

By default new pools are created with all supported features enabled (use the -d option to disable this); if you want a pool of version 28:

    zpool create -o version=28 tank mirror scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c

All known OpenZFS implementations support zpool v5000 and feature flags in their major stable versions, including illumos, FreeBSD, ZFS on Linux and OpenZFS on OS X. There are differences in the supported features among these implementations; for example, support for the large_dnode feature flag was first introduced on Linux, and spacemap_v2 was not supported on Linux until ZoL 0.8.x. There are more features with differing availability beyond feature flags: for example, xattr=sa is only available on Linux and OS X, whereas TRIM was not supported on Linux until ZoL 0.8.x.
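
To see which feature flags the installed ZFS version supports, and their state on an existing pool, a quick sketch:

    # list all feature flags supported by the installed ZFS version
    zpool upgrade -v
    # show the state of feature flags on an existing pool
    zpool get all tank | grep feature@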

Advanced Topics

These are not really advanced topics like ZFS and storage internals, but rather topics that are not relevant to everyone.

  • 64-bit hardware and kernel is recommended. ZFS wants a lot of memory (and address space) to work best, and it was developed with the assumption of being 64-bit only from the beginning. It is possible to use ZFS in 32-bit environments, but a lot of care must be taken by the user.
  • Use ashift=12 or ashift=13 when creating the pool if applicable (though ZFS can detect this correctly in most cases). The value of ashift is an exponent of 2 and should be aligned to the physical sector size of the disks, for example 2^9=512, 2^12=4096, 2^13=8192. Some disks report a logical sector size of 512 bytes while having a 4 KiB physical sector size (aka 512e), and some SSDs have an 8 KiB physical sector size. USB enclosures can also prevent correct detection of the physical sector size.

    zpool create -o ashift=12 tank mirror scsi-35000cca2735cbc38 scsi-35000cca266cc4b3c

(!) Consider using ashift=12 or ashift=13 even if you currently only use disks with 512-byte sectors. Adding devices with bigger sectors to the same VDEV can severely impact performance due to wrong alignment, while a device with 512-byte sectors will also work with a higher ashift (bigger sectors will be aligned, as they are multiples of 512).
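
To check the logical and physical sector sizes the kernel reports for your disks before choosing ashift, one common way (a sketch; LOG-SEC and PHY-SEC are standard lsblk columns) is:

    # LOG-SEC is the logical sector size, PHY-SEC the physical one
    lsblk -o NAME,LOG-SEC,PHY-SEC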

  • Enable compression unless you are absolutely paranoid, because ZFS skips storing blocks compressed when compression is not effective, and compressed data can improve IO efficiency

    zfs set compression=on tank

This will enable compression using the current default compression algorithm (currently lz4, if the pool has the lz4_compress feature enabled).
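
A specific algorithm can also be selected explicitly, and the achieved ratio inspected afterwards; a quick sketch:

    # explicitly pick an algorithm instead of the default
    zfs set compression=lz4 tank
    # see how well the stored data actually compresses
    zfs get compressratio tank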

  • Install as much RAM as feasible. ZFS has advanced caching design which could take advantage of a lot of memory to improve performance. This cache is called Adjustable Replacement Cache (ARC).
  • Block-level de-duplication is scary when RAM is limited, but the feature is increasingly promoted in professional storage solutions nowadays, since it can perform impressively in scenarios like storing VM disks that share common ancestors. Because the de-duplication table is part of the ARC, it is possible to use a fast L2ARC (NVMe SSD) to mitigate a lack of RAM. A typical space requirement is 2-5 GB of ARC/L2ARC per 1 TB of disk; if you are building a storage system of 1 PB raw capacity, at least 1 TB of L2ARC space should be planned for de-duplication (minimum size, assuming the pool is mirrored).

    # dragons ahead, you have been warned
    zfs set dedup=on tank/data
  • ECC RAM is always preferred. ZFS uses checksums to ensure data integrity, which depends on the system memory being correct. This does not mean you should turn to other file systems when ECC memory is not possible, but it opens the door to failing to detect silent data corruption when the RAM generates random errors unexpectedly. If you are building a serious storage solution, ECC RAM is required.
  • Store extended attributes as system attributes (Linux only). With xattr=on (the default), ZFS stores extended attributes in hidden sub-directories, which can hurt performance.

    # the property will be inherited by all child datasets (unless overridden)
    zfs set xattr=sa tank
  • Set dnodesize=auto for non-root datasets. This allows ZFS to automatically determine the dnode size, which is useful if the dataset uses the xattr=sa property and the workload makes heavy use of extended attributes (SELinux-enabled systems, Lustre servers, and Samba/NFS servers). This setting relies on the large_dnode feature flag on the pool, which may not be widely supported on all OpenZFS platforms; please also note that GRUB does not yet support this feature.

    zfs set dnodesize=auto tank/data
  • Thin provisioning allows a volume to use up to a limited amount of space without reserving any of it until explicitly demanded, making over-provisioning possible, at the risk of being unable to allocate space when the pool is getting full. It is usually considered a way of enabling flexible management and improving space efficiency of the backing storage.
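
A quick sketch contrasting a thin and a thick ZVOL (names and sizes below are placeholders); the -s flag skips the space reservation:

    # thin: no refreservation, space is only consumed as data is written
    zfs create -s -V 100G tank/thinvol
    # thick: the full 100G is reserved up front
    zfs create -V 100G tank/thickvol
    # compare the reservations
    zfs get refreservation tank/thinvol tank/thickvol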

  • In order to create a package containing only the binary ZFS kernel modules (e.g. to install on systems where a full build environment is not needed or desired), make sure the debhelper package is installed (it is only a Suggests: of zfs-dkms) and run dkms mkbmdeb zfs/2.0.3. /!\ These packages are for your own use only, as distributing them would infringe the licenses of both Linux and ZFS.

See Also


CategoryStorage CategorySystemAdministration