ZFS: Setting up ZFS storage on Ubuntu

If you are new to ZFS, I would advise doing a little research first to understand the fundamentals. Jim Salter’s articles on storage and ZFS are highly recommended.

https://arstechnica.com/information-technology/2020/05/zfs-101-understanding-zfs-storage-and-performance/

The examples below create a pool from a single disk, with separate datasets used for network backups.

In some examples I use device names for simplicity, but you are advised to use disk IDs or serials.
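To find the persistent identifier of a disk, one option is to check which links under /dev/disk/by-id point at its device node. This is a minimal sketch, assuming a Linux system with udev and GNU readlink; the function name by_id_for and its optional second argument (the directory to search) are my own, for illustration only.

```shell
# Print every persistent link that resolves to the given device node.
# Usage: by_id_for /dev/sda [directory]   (directory defaults to /dev/disk/by-id)
by_id_for() {
    target=$(readlink -f "$1")
    for link in "${2:-/dev/disk/by-id}"/*; do
        [ -e "$link" ] || continue    # skip when the glob matched nothing
        if [ "$(readlink -f "$link")" = "$target" ]; then
            echo "$link"
        fi
    done
}
```

For example, by_id_for /dev/sda should list the names that can be used instead of sda when creating the pool.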

Installing ZFS

Ubuntu makes it very easy.

# apt install zfsutils-linux

ZFS Cockpit module

If Cockpit is installed, it is possible to install a module for ZFS. This module is sadly no longer in development. If you know of alternatives, please share!

$ git clone https://github.com/optimans/cockpit-zfs-manager.git
[...]
# cp -r cockpit-zfs-manager/zfs /usr/share/cockpit

Configuring automatic snapshots

This service generates automatic snapshots every hour, and it can be configured to retain your preferred period.

# apt install zfs-auto-snapshot

The snapshot retention is set in the following files:

/etc/cron.hourly/zfs-auto-snapshot
/etc/cron.daily/zfs-auto-snapshot
/etc/cron.weekly/zfs-auto-snapshot
/etc/cron.monthly/zfs-auto-snapshot

By default, the configuration runs the following snapshots and retention policies:

Period    Retention
Hourly    24 hours
Daily     31 days
Weekly    Eight weeks
Monthly   12 months

I configured the following snapshot retention policy:

Period    Retention
Hourly    48 hours
Daily     14 days
Weekly    Four weeks
Monthly   Three months

Hourly

# vim /etc/cron.hourly/zfs-auto-snapshot
#!/bin/sh

# Only call zfs-auto-snapshot if it's available
which zfs-auto-snapshot > /dev/null || exit 0

exec zfs-auto-snapshot --quiet --syslog --label=hourly --keep=48 //

Daily

# vim /etc/cron.daily/zfs-auto-snapshot
#!/bin/sh

# Only call zfs-auto-snapshot if it's available
which zfs-auto-snapshot > /dev/null || exit 0

exec zfs-auto-snapshot --quiet --syslog --label=daily --keep=14 //

Weekly

# vim /etc/cron.weekly/zfs-auto-snapshot
#!/bin/sh

# Only call zfs-auto-snapshot if it's available
which zfs-auto-snapshot > /dev/null || exit 0

exec zfs-auto-snapshot --quiet --syslog --label=weekly --keep=4 //

Monthly

# vim /etc/cron.monthly/zfs-auto-snapshot
#!/bin/sh

# Only call zfs-auto-snapshot if it's available
which zfs-auto-snapshot > /dev/null || exit 0

exec zfs-auto-snapshot --quiet --syslog --label=monthly --keep=3 //
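After editing the four scripts, it can be handy to confirm the retention values in one go. The sketch below only reads the files and prints the --keep value found in each. The function name show_keep and its optional argument (the directory that holds cron.hourly, cron.daily and so on) are my own, for illustration.

```shell
# Print the --keep value configured in each periodic cron script.
# The optional argument is the directory holding cron.hourly, cron.daily, etc.
show_keep() {
    base=${1:-/etc}
    for period in hourly daily weekly monthly; do
        script="$base/cron.$period/zfs-auto-snapshot"
        [ -r "$script" ] || continue    # skip periods that are not configured
        keep=$(grep -o -- '--keep=[0-9]*' "$script" | cut -d= -f2)
        echo "$period $keep"
    done
}

show_keep
```

With the scripts above in place, I would expect it to report 48, 14, 4 and 3. Once the service has run for a while, zfs list -t snapshot shows the snapshots themselves.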

Setting up the ZFS pool

This post has several use cases and examples, and I recommend it highly if you want further details on different commands and ways to configure your pools.

https://www.thegeekdiary.com/zfs-tutorials-creating-zfs-pools-and-file-systems/

In my example there is no resilience, as there is only one attached disk. For me, this is acceptable because I have an additional local backup besides this filesystem.

A second backup (ideally off-site) is preferable to a single one, regardless of any resilience you add to that single copy.

I create a single pool with an external drive. Read below for an explanation of the different command flags.

# zpool create -f \
    -o ashift=12 \
    -O compression=lz4 \
    -O acltype=posixacl \
    -O xattr=sa \
    -O relatime=on \
    -O atime=off \
    -O normalization=formD \
    -O canmount=off \
    -O dnodesize=auto \
    -O sync=standard \
    backup_pool scsi-SSeagate_Desktop_NA7HP4VK

Block size / ashift

Of the above values, the most important one by far is ashift.

The ashift property sets the block size of the vdev. It can’t be changed once set, and if it isn’t correct, it will cause massive performance issues with the filesystem.

Find out your drive’s optimal block size and match it to ashift.

The value is the base-2 exponent of the sector size, not a size in bits: the sector size is 2^ashift bytes.

ashift   sector size
9        512 bytes
10       1 KiB
11       2 KiB
12       4 KiB
13       8 KiB
14       16 KiB
15       32 KiB
16       64 KiB
recordsize is another property that affects performance, especially on the Raspberry Pi. Smaller values can improve performance for random access, while larger values give better throughput and compression when reading sequential data. On the Raspberry Pi, however, a value of 1M increased the system load until filesystem activity eventually stalled and the system had to be restarted.

The default value (128k) has performed without any noticeable issue.

Compression

lz4 compression gives a good balance between performance and compression ratio. It will often make the storage perform faster than with no compression, because less data has to be read from and written to the disk.

ZFS 0.8 doesn’t give many choices regarding compression but bear in mind that you can change the algorithm on a live system.

gzip will impact performance but yields a higher compression rate. It might be worth checking the performance with different compression formats on the Pi 4. With older Raspberry Pi models, the limitation will be the USB / network in most cases.

For reference, on the same amount of data these were the compression ratios I obtained:

gzip-7
backup_pool 1.34x
backup_pool/backintime 1.35x
backup_pool/timecapsule 1.33x

lz4
backup_pool 1.27x
backup_pool/backintime 1.30x
backup_pool/timecapsule 1.33x

All in all, the performance impact and memory consumption didn’t make switching from lz4 worthwhile.
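To put those ratios in perspective, a compression ratio r means the data occupies 1/r of its original size, so the space saved is (1 - 1/r). A quick sketch of that arithmetic, using awk for the floating point maths (the helper name saved_pct is my own):

```shell
# Percentage of space saved for a given compression ratio: (1 - 1/ratio) * 100
saved_pct() {
    awk -v r="$1" 'BEGIN { printf "%.1f\n", (1 - 1 / r) * 100 }'
}

saved_pct 1.34    # gzip-7, whole pool: prints 25.4
saved_pct 1.27    # lz4, whole pool: prints 21.3
```

In other words, gzip-7 saved roughly four percentage points more space than lz4 on the pool as a whole.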

Permissions

acltype=posixacl
xattr=sa

acltype=posixacl enables POSIX ACLs, and xattr=sa stores Linux extended attributes in the inodes rather than in separate files.

Access times

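Disabling atime (off) is recommended to reduce the number of IOPS.

relatime offers a good compromise between the atime and noatime behaviours, but it only takes effect when atime=on. With atime=off, as in the command above, access times are not updated at all.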

Normalisation

The normalization property indicates whether a file system should perform a Unicode normalisation of file names whenever two file names are compared and which normalisation algorithm should be used.

formD is the default set by Canonical when setting up a pool. It seems to be a good choice when sharing the volume via NFS with macOS systems, as it avoids files not being displayed because their names use non-ASCII characters.

Additional properties

The pool is configured with the canmount property off so that it can’t be mounted.

This is because I will be creating separate datasets, one for Time Capsule backups, and another two for Backintime, and I don’t want them to mix.

All datasets will share the same pool, but I don’t want the pool root to be mounted. Only datasets will mount.

dnodesize is set to auto, as per several recommendations when datasets are using the xattr=sa property.

sync is set as standard. There is a performance hit for writes, but disabling it comes at the expense of data consistency if there is a power cut or similar.

A brief test showed a lower system load with sync=standard than with sync=disabled, and fewer load spikes. Write performance is probably lower with standard, but the system clearly copes better.

Encryption

I am not too keen to encrypt physically secure volumes because when doing data recovery, you are adding an additional layer that might hamper and slow things down.

For reference, here is an example of the encryption options, added to the zpool create command, for a volume that uses an external key file. This might not be appropriate for your particular scenario. Research alternatives if needed.

-O encryption=aes-256-gcm \
-O keylocation=file:///etc/pool_encryption_key \
-O keyformat=raw \
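With keyformat=raw, the key file has to contain exactly 32 bytes. One way to generate such a file is sketched below. The helper name make_raw_key is my own, and whether /dev/urandom is an acceptable source of key material for your scenario is an assumption you should check.

```shell
# Write a 32-byte random key to the given path, readable only by its owner.
make_raw_key() {
    # The subshell keeps the restrictive umask from leaking into the caller.
    ( umask 077 && head -c 32 /dev/urandom > "$1" )
    chmod 600 "$1"    # also covers the case where the file already existed
}
```

As root, make_raw_key /etc/pool_encryption_key would create the file referenced by keylocation above. Keep a copy of the key somewhere safe, as the data cannot be read without it.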

Pool options

Automatic trimming of the pool is essential for SSDs:

# zpool set autotrim=on backup_pool

Disable automatic mounting for the pool. (This applies only to the root of the pool; the datasets can still be set to be mountable regardless of this setting.)

# zfs set canmount=off backup_pool

Setting up the ZFS datasets

I will create three separate datasets with assigned quotas for each.

[Create datasets]
# zfs create backup_pool/backintime_tuxedo
# zfs create backup_pool/backintime_ab350
# zfs create backup_pool/timecapsule

[Set mountpoints]
# zfs set mountpoint=/backups/backintime_tuxedo  backup_pool/backintime_tuxedo
# zfs set mountpoint=/backups/backintime_ab350  backup_pool/backintime_ab350
# zfs set mountpoint=/backups/timecapsule  backup_pool/timecapsule

[Set quotas]
# zfs set quota=2T backup_pool/backintime_tuxedo
# zfs set quota=2T backup_pool/backintime_ab350
# zfs set quota=2T backup_pool/timecapsule
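The three steps can also be combined, since zfs create accepts properties with -o at creation time. The sketch below only prints the commands so that they can be reviewed first. The helper name dataset_cmds is my own, and it assumes every dataset is mounted under /backups as in the examples above.

```shell
# Print (without running) the zfs create command for each dataset.
# Usage: dataset_cmds POOL QUOTA NAME...
dataset_cmds() {
    pool=$1
    quota=$2
    shift 2
    for name in "$@"; do
        echo "zfs create -o mountpoint=/backups/$name -o quota=$quota $pool/$name"
    done
}

dataset_cmds backup_pool 2T backintime_tuxedo backintime_ab350 timecapsule
```

If the printed commands look right, they can be run as root one by one.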

Changing compression on a dataset

The lz4 compression that the datasets inherit from the pool is recommended. gzip consumes a lot of CPU and makes data transfers slower, which affects backup restoration.

If you still want to change the compression for a given dataset:

# zfs set compression=gzip-7 backup_pool/timecapsule

A comparison of compression and decompression using different algorithms with OpenZFS:

https://github.com/openzfs/zfs/pull/9735

Querying pool properties, current compression algorithm and compress ratio

# zfs get all backup_pool
# zfs get compression backup_pool
# zfs get compressratio backup_pool
# zfs get all | grep compressratio

Changing ZFS settings

For reference, below are some examples of properties and settings that can be changed after a pool has already been created.

Renaming pools and datasets

If a dataset was given a name that needs to be changed, it can be renamed with a command like this:

# zfs rename backup_pool/Test1 backup_pool/backintime_tuxedo

A pool can be renamed by exporting it and importing it under the new name.

# zpool export test_pool
# zpool import test_pool backup_pool

Attaching mirror disks

You can attach an additional disk/partition to the existing one and make the pool redundant in a mirror configuration. Unfortunately, attaching a disk this way cannot turn the pool into a RAID-Z, RAID-Z2 or RAID-Z3.

# zpool attach backup_pool /dev/sda7 /dev/sdb7

Renaming disks in pools

By default, Ubuntu uses device names (such as /dev/sda) for the disks. This should not be an issue, but in some cases, adding or connecting drives might change the device name order and degrade one or more pools.

This is why creating a pool with disk IDs or serials is recommended. You can still fix this if you created your pool using device names.

With the pool unmounted, export it, and reimport pointing to the right path:

# zpool export backup_pool
# zpool import -d /dev/disk/by-id/ backup_pool

There are additional examples in this handy blog post:

https://plantroon.com/changing-disk-identifiers-in-zpool/

ZFS optimisation

ZFS should be running on a system with at least 4 GiB of RAM. If you plan to use it on a Raspberry Pi (or any other system with limited resources), reduce the ARC size.

In this case, I am limiting it to 3 GiB. The change can be made live:

# echo 3221225472 > /sys/module/zfs/parameters/zfs_arc_max

To make it persistent between boots:

# vim /etc/modprobe.d/zfs.conf

[add this line]
options zfs zfs_arc_max=3221225472

# update-initramfs -u
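The value used above is simply 3 GiB expressed in bytes. For a different limit, the number can be derived with shell arithmetic. This is a small sketch with a helper name of my own (gib_to_bytes):

```shell
# Convert a whole number of GiB to bytes (1 GiB = 1024 * 1024 * 1024 bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 3    # prints 3221225472

# Print the matching line for /etc/modprobe.d/zfs.conf
echo "options zfs zfs_arc_max=$(gib_to_bytes 3)"
```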

You can check the ARC statistics:

$ less /proc/spl/kstat/zfs/arcstats

More on ZFS performance

Some other links with interesting points on performance:

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html

https://icesquare.com/wordpress/how-to-improve-zfs-performance/