ZFS ‘Failed to start Mark current ZSYS boot as successful’ fix

On Ubuntu 20.04 after installing the NVIDIA driver 510 metapackage the system stopped booting.

It will either hang with a black screen and a blinking cursor in the top-left corner, or show the following error message:

[FAILED] Failed to start Mark current ZSYS boot as successful.
See 'systemctl status zsys-commit.service' for details.
[  OK  ] Stopped User Manager for UID 1000.

Attempting to revert from a snapshot ends with the same error message. This wasn’t the case on a separate system that had received the same upgrade.

The “20.04 zsys-commit.service fails” message is quite telling: the underlying cause seems to be a version mismatch between the ZFS userland tools and the kernel module.

These are the steps I followed to fix it. Many thanks to Lockszmith for his research identifying the issue and finding a fix. He raised it in two posts, linked here:

https://askubuntu.com/users/720005/lockszmith

https://askubuntu.com/questions/1388997/zsys-commit-service-fails-with-couldnt-promote-dataset-not-a-cloned-filesy

Fix

Restart Ubuntu and boot in recovery mode

[In GRUB]

*Advanced options for Ubuntu 20.04.3 LTS

[Select the first recovery option in the menu]
*Ubuntu 20.04.3 LTS, with Linux 5.xx.x-xx-generic (recovery mode)

[Wait for the system to load the menu and select:]
root

[Press Enter for Maintenance to get the CLI]

Check the reason for the error.

# systemctl status zsys-commit.service
[...]
 Feb 17 11:11:24 ab350 systemd[1] zsysctl[4068]: level=error msg="couldn't commit: couldn't promote dataset "rpool/ROOT/ubuntu_733qyk": couldn't promote "rpool/ROOT/ubuntu_733qyk": not a cloned filesystem"
 [...]

Attempting to promote it manually fails:

# zfs promote rpool/ROOT/ubuntu_733qyk

cannot promote 'rpool/ROOT/ubuntu_733qyk': not a cloned filesystem

Uninstall the NVIDIA drivers.

# dkms uninstall nvidia/510.47.03
# dkms remove nvidia/510.47.03 --all

Make sure you can connect to the internet. You can temporarily assign a DHCP address to one of the network interfaces.

# dhclient -v eno1
# ip address

Update the system and install a third-party set of ZFS tools.

# apt update
# apt upgrade
# apt autoremove

[Add 3rd party PPA for zfstools]
# add-apt-repository ppa:jonathonf/zfs
# apt update 

[Upgrade ZFS]
# apt upgrade

[If ZFS isn't upgraded, do it manually]
# apt install zfs-initramfs zfs-zed zfsutils-linux

It might take a bit to update. Reboot normally.

# reboot

It should boot normally.

If this doesn’t work for you, reboot in recovery mode again and promote the filesystem manually.

# zfs promote rpool/ROOT/ubuntu_733qyk

Sort out the graphics drivers

Revert to NVIDIA metapackage 470 (if this is what broke your system). Reboot, and fix resolution settings.

Upgrading back to 510 will bring the error back and make it even more difficult to fix. Don’t!

Things will only work if the zfs userland and zfs-kmod versions match. The output below shows a mismatched pair:

$ zfs --version
zfs-0.8.3-1ubuntu12.13
zfs-kmod-2.0.6-1ubuntu2
[boot in recovery mode]
# apt reinstall zfs-initramfs zfs-zed zfsutils-linux
# zfs promote rpool/ROOT/ubuntu_733qyk

[reboot in normal mode]
[Configure the 470 drivers]

Reverting to the previous ZFS version

The system should now be back to normal, but you might want to revert to the mainline ZFS version despite the bug. After all, this was a hack to promote the filesystem and get the system working again.

# add-apt-repository --remove ppa:jonathonf/zfs

[Check that it has been removed]
$ apt policy

# apt update

[Pray]
# apt remove zfs-initramfs zfs-zed zfsutils-linux
# apt install zfs-initramfs zfs-zed zfsutils-linux

[Check the right version is installed]
# apt list --installed | grep zfs

# apt autoremove

[Pray harder]
# reboot

With that I managed to bring my system back to a working condition, but updating the drivers a second time made it fail again and I couldn’t fix it. A clean install of 20.04.3 doesn’t seem to exhibit this problem. I’m not sure of the reason behind it, but there are a few bugs open with Canonical regarding this.

I hope that 22.04 will bring a better ZFS version.




Raspberry Pi: Configuring a Time Capsule/Backintime server

In this post, I am setting up a Time Capsule and Backintime server. I am using a Raspberry Pi that has Ubuntu installed, with a USB disk that has been configured into a ZFS pool.

Setting up backup users

You are going to have to create users for each of the services/users that will be connecting to the server. You want to keep files and access as isolated as possible: a given user shouldn’t have any visibility or notion of other users’ backups. For Time Machine we are also creating accounts that can’t log in to the system, only authenticate.

Check if there is an entry for nologin in:

$ cat /etc/shells

If there is no entry, add it:

# vim /etc/shells
# /etc/shells: valid login shells
/bin/sh
/bin/bash
/bin/rbash
/bin/dash
/usr/bin/tmux
/usr/sbin/nologin
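
If you prefer a one-liner, something like this (run as root; a small convenience sketch, not part of the original steps) appends the entry only when it is missing:

# grep -qx /usr/sbin/nologin /etc/shells || echo /usr/sbin/nologin >> /etc/shells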

Create a generic user for the backups, or dedicated accounts for each user to increase security:

Generic user example:

# useradd -s /usr/sbin/nologin timemachine
# passwd timemachine

Dedicated user example:

# useradd -s /usr/sbin/nologin timemachine_john
# passwd timemachine_john

Note that useradd doesn’t create a home directory by default.

If required, the default shell can be changed with:

# usermod -s /usr/sbin/nologin timemachine_john

Setting up backup user groups

If more than one system is going to be backed up, it is advisable to use different accounts for each.

It is possible to isolate users by assigning them individual datasets, but that might create storage silos.

An alternative is to create individual users that belong to the same backup group. The backup group can access the backintime dataset, but not each other’s data.

Create the group.

# addgroup backupusers

Assign the primary group and a secondary group (the secondary group is the shared one).

# usermod -g timemachine_john -G backupusers timemachine_john

Although not required, you could force the UID and GID to specific values.

# usermod -u 1012 timemachine
# groupmod -g 1012 timemachine

Time Capsule

Install netatalk

Install netatalk from the repositories.

# apt install netatalk

Give all the appropriate accounts access to the directory where the backups are going to be written:

# chown :timemachine_john /backups/timecapsule/
# chmod 775 /backups/timecapsule/

Edit the settings of the netatalk service so that the share is advertised with the name of your choice and works as a Time Capsule server.

# vim /etc/netatalk/AppleVolumes.default

Enter the following:

/backups/timecapsule "pi-capsule" options:tm

Note that you can give the capsule a name containing spaces, as in the example below.
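
[A hypothetical example with a multi-word name]
/backups/timecapsule "Pi Time Capsule" options:tm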

Restart the service:

# systemctl restart netatalk

Check that netatalk has been installed correctly:

# afpd -V

afpd 3.1.12 - Apple Filing Protocol (AFP) daemon of Netatalk
[...]
afpd has been compiled with support for these features:

          AFP versions: 2.2 3.0 3.1 3.2 3.3 3.4 
         CNID backends: dbd last tdb 
      Zeroconf support: Avahi
  TCP wrappers support: Yes
         Quota support: Yes
   Admin group support: Yes
    Valid shell checks: Yes
      cracklib support: No
            EA support: ad | sys
           ACL support: Yes
          LDAP support: Yes
         D-Bus support: Yes
     Spotlight support: Yes
         DTrace probes: Yes

              afp.conf: /etc/netatalk/afp.conf
           extmap.conf: /etc/netatalk/extmap.conf
       state directory: /var/lib/netatalk/
    afp_signature.conf: /var/lib/netatalk/afp_signature.conf
      afp_voluuid.conf: /var/lib/netatalk/afp_voluuid.conf
       UAM search path: /usr/lib/netatalk//
  Server messages path: /var/lib/netatalk/msg/

Configure netatalk

# vim /etc/nsswitch.conf

Change this line:

hosts:          files mdns4_minimal [NOTFOUND=return] dns

to this:

hosts:          files mdns4_minimal [NOTFOUND=return] dns mdns4 mdns

Note that if you are running Netatalk 3.1.11 or above, it is no longer necessary to create the /etc/avahi/services/afpd.service file. Using this file will cause an error.

If you are running an older version go ahead, otherwise jump to the next section.

Create /etc/avahi/services/afpd.service as root

# vim /etc/avahi/services/afpd.service

and fill it up with:

<?xml version="1.0" standalone='no'?><!--*-nxml-*-->
<!DOCTYPE service-group SYSTEM "avahi-service.dtd">
<service-group>
        <name replace-wildcards="yes">%h</name>
        <service>
                <type>_afpovertcp._tcp</type>
                <port>548</port>
        </service>
        <service>
                <type>_device-info._tcp</type>
                <port>0</port>
                <txt-record>model=TimeCapsule</txt-record>
        </service>
</service-group>

Configure the AFP service

Edit the configuration file.

# vim /etc/netatalk/afp.conf
[Global]
; Global server settings
mimic model = TimeCapsule6,106

[pi-capsule]
path = /backups/timecapsule
time machine = yes

Check configuration and reload if needed:

# systemctl status avahi-daemon	

[restart if necessary]
# systemctl restart netatalk

[Make the service automatically start]
# systemctl enable netatalk.service

If you go to your Mac’s Time Machine preferences, the new volume will be available and you can start using it.

netatalk troubleshooting

Some notes on things to check from the server side (the Time Capsule server):

https://wiki.archlinux.org/index.php/Netatalk

Backintime setup

Configuring Backintime

Prepare users

If you have disabled passwords and are only using keys, you will need to temporarily change the security settings to allow Backintime to exchange keys.

On the remote system/Pi/server:

# vim /etc/ssh/sshd_config
PasswordAuthentication yes
# systemctl restart ssh

Backintime uses SSH, so the user accounts need to be allowed to log in; the default login shell needs to reflect this.

If not created already, assign the user a home directory. Finally, allow the user to read and write the folder containing the backups.

# usermod -s /usr/bin/bash backintime_john

# mkdir /home/backintime_john

# chown backintime_john:backintime_john /home/backintime_john/

# usermod -d /home/backintime_john/ backintime_john

# chown :backupusers /backups/backintime/

# chmod 775 /backups/backintime/

Permission changes for some of the subfolders might be required in a multi-user configuration after the first backup:

# chown :backupusers /backups/backintime*/backintime

# chmod 770 /backups/backintime*/backintime/system1/
# chmod 770 /backups/backintime*/backintime/laptop2/

Prepare keys

To simplify things these are the roles:

[Local system]
The client machine that is running Backintime and that you want to backup your data from.

[Remote system]
The SSH server that has the storage where your backup is going to be stored.

From the local system account that you want to run Backintime as (either your user or root, depending on how you run Backintime), SSH into the remote system. In my case, a Raspberry Pi.

# ssh backintime_john@pi-capsule.local

After logging in check the host key.

$ ssh-keygen -l -f /etc/ssh/ssh_host_ecdsa_key.pub
256 SHA256:KjzU6aGqH6tXri/K87xz3H+cP35PMT7n+Ob6MIaBZb0 root@pi-capsule (ECDSA)

You can then log out from the remote machine.

From the local account you want to run Backintime from, generate a new SSH key pair.

# ssh-keygen

And then copy the public key to the Pi.

# ssh-copy-id -i ~/.ssh/id_rsa.pub backintime_david@pi-capsule
[...]
ECDSA key fingerprint is SHA256:KjzU6aGqH6tXri/K87xz3H+cP35PMT7n+Ob6MIaBZb0.
[...]
Number of key(s) added: 1
[...]

Note that the fingerprint is the same as the one displayed in the previous step.

Configure Backintime profile

You can now configure the SSH profile from Backintime and make the first run.

In the General tab:

Mode:               SSH

SSH Settings
Host:   pi-capsule
Port:   22
User:   backintime_david
Path:   /backups/backintime_david
Cipher:     [Leave as default]
Private Key: /root/.ssh/id_rsa

Password
SSH private key: [empty in most cases]
Enable Cache Password

Advanced
Host:       tuxedo
User:       root
Profile:    2

Schedule
[Select appropriate setting after testing]

Include

/home
/etc/
/boot
/root
/steam
/opt
/usr
/var

Exclude (example)

/steam/steamapps/downloading
/var/cache
/vm/kvm_images/__security/Security_TryHackMe*.qcow2

Auto-remove
Older than 10 years
If free space is less than 50GiB
If free inodes is less than 2%

Smart remove
Run in background on remote Host
Keep last
14 days (7 days)
21 days (14 days)
8 weeks (6 weeks)
36 months (14 months)

Don't remove named snapshots

Options
Enable notifications
Backup replaced files on restore
Continue on errors (keep incomplete snapshots)
Log level: Changes & Errors

After the first run has completed, you can check which is the best-performing cipher from the CLI.

# backintime benchmark-cipher --profile-id 2

After a few rounds, aes192-ctr came out as the best performing cipher for me.

Secure SSH

If you changed the SSH configuration at the beginning, after setting everything up, remember to secure SSH again on the server/remote system.

# vim /etc/ssh/sshd_config
PasswordAuthentication no
# systemctl restart ssh

Restoring restrictions to backup users

A login-capable account is required for Backintime to be able to run rsync. It is worth doing a bit more research on how to harden/limit these accounts.
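
As a starting point (an assumption on my part, adjust to your setup), an sshd_config Match block at the end of the file can switch off features the backup account doesn’t need while still allowing rsync over SSH:

# vim /etc/ssh/sshd_config
Match User backintime_john
    AllowTcpForwarding no
    X11Forwarding no
    PermitTunnel no
# systemctl restart ssh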

Troubleshooting

Some examples of issues and the troubleshooting steps you can apply.

Time Capsule can’t be reached / firewall settings

Make sure the server is allowing AFP connections from the Mac client.

# ufw allow proto tcp from CLIENT_IP to PI_CAPSULE_IP port 548

Time Capsule – Configuring Time Machine backups via the network on a macOS VM

The destination needs to be configured manually.

Mount the AFP/Time Capsule mount via the Finder.

In the CLI configure the destination:

# tmutil setdestination -a /Volumes/pi-capsule

The backups can then be started from the GUI.

You can get information about the current configured destinations via the CLI.

# tmutil destinationinfo
====================================
Name            : pi-capsule
Kind            : Network
Mount Point     : /Volumes/pi-capsule
ID              : 7B648734-9BFC-417F-B5A1-F31B8AD52F4B

Time Capsule – Checking backup status

# tmutil currentphase
# tmutil status

ZFS stalling on a Raspberry Pi

Check the recordsize property. Reduce it to the default 128 KiB.

Reduce the ARC size to limit the amount of memory consumed/reserved by ZFS. Both changes are sketched below.
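
For example (the pool name here is just an illustration, adapt it to yours):

# zfs get recordsize backup_pool
# zfs set recordsize=128k backup_pool

[Cap the ARC at 3 GiB, effective immediately]
# echo 3221225472 > /sys/module/zfs/parameters/zfs_arc_max

Note that recordsize only affects newly written files.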

Understanding rsync logs

The logs indicate the type of change rsync is seeing. The characters decode as follows:

XYcstpoguax  path/to/file
|||||||||||
||||||||||╰- x: The extended attribute information changed
|||||||||╰-- a: The ACL information changed
||||||||╰--- u: The u slot is reserved for future use
|||||||╰---- g: Group is different
||||||╰----- o: Owner is different
|||||╰------ p: Permission are different
||||╰------- t: Modification time is different
|||╰-------- s: Size is different
||╰--------- c: Different checksum (for regular files), or
||              changed value (for symlinks, devices, and special files)
|╰---------- the file type:
|            f: for a file,
|            d: for a directory,
|            L: for a symlink,
|            D: for a device,
|            S: for a special file (e.g. named sockets and fifos)
╰----------- the type of update being done:
             <: file is being transferred to the remote host (sent)
             >: file is being transferred to the local host (received)
             c: local change/creation for the item, such as:
                - the creation of a directory
                - the changing of a symlink,
                - etc.
             h: the item is a hard link to another item (requires 
                --hard-links).
             .: the item is not being updated (though it might have
                attributes that are being modified)
             *: means that the rest of the itemized-output area contains
                a message (e.g. "deleting")

Some example output:

>f+++++++++ some/dir/new-file.txt
.f....og..x some/dir/existing-file-with-changed-owner-and-group.txt
.f........x some/dir/existing-file-with-changed-unnamed-attribute.txt
>f...p....x some/dir/existing-file-with-changed-permissions.txt
>f..t..g..x some/dir/existing-file-with-changed-time-and-group.txt
>f.s......x some/dir/existing-file-with-changed-size.txt
>f.st.....x some/dir/existing-file-with-changed-size-and-time-stamp.txt 
cd+++++++++ some/dir/new-directory/
.d....og... some/dir/existing-directory-with-changed-owner-and-group/
.d..t...... some/dir/existing-directory-with-different-time-stamp/ 



ZFS: Setting up ZFS storage on Ubuntu

If you are new to ZFS, I would advise doing a little bit of research first to understand the fundamentals. Jim Salter’s articles on storage and ZFS come highly recommended.

https://arstechnica.com/information-technology/2020/05/zfs-101-understanding-zfs-storage-and-performance/

The examples below create a pool from a single disk, with separate datasets used for network backups.

In some examples, I might use device names for simplicity, but you are advised to use disk IDs or serials.

Installing ZFS

Ubuntu makes it very easy.

# apt install zfsutils-linux

ZFS Cockpit module

If Cockpit is installed, it is possible to install a module for ZFS. This module is sadly no longer in development. If you know of alternatives, please share!

$ git clone https://github.com/optimans/cockpit-zfs-manager.git
[...]
# cp -r cockpit-zfs-manager/zfs /usr/share/cockpit

Configuring automatic snapshots

This tool generates automatic snapshots on an hourly, daily, weekly and monthly schedule, and each can be configured to retain snapshots for your preferred period.

# apt install zfs-auto-snapshot

The snapshot retention is set in the following files:

/etc/cron.hourly/zfs-auto-snapshot
/etc/cron.daily/zfs-auto-snapshot
/etc/cron.weekly/zfs-auto-snapshot
/etc/cron.monthly/zfs-auto-snapshot

By default, the configuration runs the following snapshots and retention policies:

Period    Retention
Hourly    24 hours
Daily     31 days
Weekly    8 weeks
Monthly   12 months

I configured the following snapshot retention policy:

Period    Retention
Hourly    48 hours
Daily     14 days
Weekly    4 weeks
Monthly   3 months

Hourly

# vim /etc/cron.hourly/zfs-auto-snapshot
#!/bin/sh

# Only call zfs-auto-snapshot if it's available
which zfs-auto-snapshot > /dev/null || exit 0

exec zfs-auto-snapshot --quiet --syslog --label=hourly --keep=48 //

Daily

# vim /etc/cron.daily/zfs-auto-snapshot
#!/bin/sh

# Only call zfs-auto-snapshot if it's available
which zfs-auto-snapshot > /dev/null || exit 0

exec zfs-auto-snapshot --quiet --syslog --label=daily --keep=14 //

Weekly

# vim /etc/cron.weekly/zfs-auto-snapshot
#!/bin/sh

# Only call zfs-auto-snapshot if it's available
which zfs-auto-snapshot > /dev/null || exit 0

exec zfs-auto-snapshot --quiet --syslog --label=weekly --keep=4 //

Monthly

# vim /etc/cron.monthly/zfs-auto-snapshot
#!/bin/sh

# Only call zfs-auto-snapshot if it's available
which zfs-auto-snapshot > /dev/null || exit 0

exec zfs-auto-snapshot --quiet --syslog --label=monthly --keep=3 //

Setting up the ZFS pool

This post has several use cases and examples, and I recommend it highly if you want further details on different commands and ways to configure your pools.

https://www.thegeekdiary.com/zfs-tutorials-creating-zfs-pools-and-file-systems/

In my example there is no resilience, as there is only one attached disk. For me, this is acceptable because I have an additional local backup besides this filesystem.

It is preferable to have a second backup (ideally off-site) than to rely on a single copy, regardless of any added resilience you might configure.

I create a single pool with an external drive. Read below for an explanation of the different command flags.

# zpool create -f \
    -o ashift=12 \
    -O compression=lz4 \
    -O acltype=posixacl \
    -O xattr=sa \
    -O relatime=on \
    -O atime=off \
    -O normalization=formD \
    -O canmount=off \
    -O dnodesize=auto \
    -O sync=standard \
    backup_pool scsi-SSeagate_Desktop_NA7HP4VK
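
A quick sanity check afterwards (optional) confirms the pool came up and which ashift was applied:

# zpool status backup_pool
# zpool get ashift backup_pool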

Block size / ashift

Of the above values, the most important one by far is ashift.

The ashift property sets the block size of the vdev. It can’t be changed once set, and if it isn’t correct, it will cause massive performance issues with the filesystem.

Find out your drive’s optimal sector size and match ashift to it (see the lsblk example after the table below).

ashift is the base-2 logarithm of the sector size in bytes:

bits  sector size
9     512 bytes
10    1 KiB
11    2 KiB
12    4 KiB
13    8 KiB
14    16 KiB
15    32 KiB
16    64 KiB
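
To see what the drive reports, something like this works; a logical value of 512 can hide a 4K physical sector, so check the vendor specifications for SSDs:

$ lsblk -o NAME,PHY-SEC,LOG-SEC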

recordsize is another property that impacts performance, especially on the Raspberry Pi. Smaller sizes can improve performance for random access, while larger values give better throughput and compression when reading sequential data. The problem on the Raspberry Pi was that with a value of 1M the system load increased, eventually stalling all filesystem activity until the system was restarted.

The default value (128k) has performed without any noticeable issue.

Compression

lz4 compression yields an excellent performance/compression ratio. In most cases it makes the storage perform faster than running with no compression at all.

ZFS 0.8 doesn’t give many choices regarding compression but bear in mind that you can change the algorithm on a live system.

gzip will impact performance but yields a higher compression rate. It might be worth checking the performance with different compression formats on the Pi 4. With older Raspberry Pi models, the limitation will be the USB / network in most cases.

For reference, on the same amount of data these were the compression ratios I obtained:

gzip-7
backup_pool 1.34x
backup_pool/backintime 1.35x
backup_pool/timecapsule 1.33x

lz4
backup_pool 1.27x
backup_pool/backintime 1.30x
backup_pool/timecapsule 1.33x

All in all, the performance impact and memory consumption didn’t make switching from lz4 worthwhile.

Permissions

acltype=posixacl
xattr=sa

These enable POSIX ACLs and store Linux extended attributes in the inodes rather than in separate hidden files.

Access times

Disabling atime (atime=off) is recommended to reduce the number of IOPS.

relatime offers a good compromise between the atime and noatime behaviours.
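
Both can also be changed on an existing pool or dataset, for example:

# zfs set atime=off backup_pool
# zfs set relatime=on backup_pool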

Normalisation

The normalization property indicates whether a file system should perform a Unicode normalisation of file names whenever two file names are compared and which normalisation algorithm should be used.

formD is the default set by Canonical when setting up a pool. It seems to be a good choice when sharing the volume with macOS systems via NFS, as it avoids files not being displayed because their names use non-ASCII characters.

Additional properties

The pool is configured with the canmount property off so that it can’t be mounted.

This is because I will be creating separate datasets, one for Time Capsule backups, and another two for Backintime, and I don’t want them to mix.

All datasets will share the same pool, but I don’t want the pool root to be mounted. Only datasets will mount.

dnodesize is set to auto, as per several recommendations when datasets are using the xattr=sa property.

sync is set as standard. There is a performance hit for writes, but disabling it comes at the expense of data consistency if there is a power cut or similar.

A brief test showed a lower system load when sync=standard than with sync=disabled. Also, with standard there were fewer spikes. It is likely that the performance is lower, but it certainly causes the system to suffer less.

Encryption

I am not too keen on encrypting physically secure volumes because, when doing data recovery, the extra layer can hamper and slow things down.

For reference, I am writing down an example of encryption options using an external key for a volume. This might not be appropriate for your particular scenario. Research alternatives if needed.

-O encryption=aes-256-gcm 
-O keylocation=file:///etc/pool_encryption_key 
-O keyformat=raw 
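
One way (my own sketch, not part of the original setup) to generate the 32-byte raw key referenced above before creating the pool:

# dd if=/dev/urandom of=/etc/pool_encryption_key bs=32 count=1
# chmod 400 /etc/pool_encryption_key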

Pool options

Automatic trimming of the pool is essential for SSDs:

# zpool set autotrim=on backup_pool

Disabling automatic mount for the pool. (This applies only to the root of the pool, the datasets can still be set to be mountable regardless of this setting.)

# zfs set canmount=off backup_pool

Setting up the ZFS datasets

I will create three separate datasets with assigned quotas for each.

[Create datasets]
# zfs create backup_pool/backintime_tuxedo
# zfs create backup_pool/backintime_ab350
# zfs create backup_pool/timecapsule

[Set mountpoints]
# zfs set mountpoint=/backups/backintime_tuxedo  backup_pool/backintime_tuxedo
# zfs set mountpoint=/backups/backintime_ab350  backup_pool/backintime_ab350
# zfs set mountpoint=/backups/timecapsule  backup_pool/timecapsule

[Set quotas]
# zfs set quota=2T backup_pool/backintime_tuxedo
# zfs set quota=2T backup_pool/backintime_ab350
# zfs set quota=2T backup_pool/timecapsule
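
You can then verify the datasets, mountpoints and quotas in one go:

$ zfs list -r -o name,mountpoint,quota,used backup_pool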

Changing compression on a dataset

The default lz4 compression is recommended. gzip consumes a lot of CPU and makes data transfers slower, impacting backups restoration.

If you still want to change the compression for a given dataset:

# zfs set compression=gzip-7 backup_pool/timecapsule

A comparison of compression and decompression using different algorithms with OpenZFS:

https://github.com/openzfs/zfs/pull/9735

Querying pool properties, current compression algorithm and compress ratio

# zfs get all backup_pool
# zfs get compression backup_pool
# zfs get compressratio backup_pool
# zfs get all | grep compressratio

Changing ZFS settings

For reference, below are some examples of properties and settings that can be changed after a pool has already been created.

Renaming pools and datasets

If for any reason, a dataset was given a name that needs to be changed, this can be done with a command like this:

# zfs rename backup_pool/Test1 backup_pool/backintime_tuxedo

A zpool can be renamed by exporting and importing it.

# zpool export test_pool
# zpool import test_pool backup_pool

Attaching mirror disks

You can attach an additional disk/partition to turn the single-disk vdev into a mirror and make the pool redundant. Unfortunately, this doesn’t work for converting it to RAID-Z, RAID-Z2 or RAID-Z3.

# zpool attach backup_pool /dev/sda7 /dev/sdb7

Renaming disks in pools

By default, Ubuntu uses device identifiers for the disks. This should not be an issue, but in some cases, adding or connecting drives might change the device name order and degrade one or more pools.

This is why creating a pool with disk IDs or serials is recommended. You can still fix this if you created your pool using device names.

With the pool unmounted, export it, and reimport pointing to the right path:

# zpool export backup_pool
# zpool import -d /dev/disk/by-id/ backup_pool

There are additional examples in this handy blog post:

https://plantroon.com/changing-disk-identifiers-in-zpool/

ZFS optimisation

ZFS should be running on a system with at least 4GiB of RAM. If you plan to use it on a Raspberry Pi (or any other system with limited resources), reduce the ARC size.

In this case, I am limiting it to 3GiB. It is a change that can be done live:

# echo 3221225472 > /sys/module/zfs/parameters/zfs_arc_max

To make it persistent between boots:

# vim /etc/modprobe.d/zfs.conf

[add this line]
options zfs zfs_arc_max=3221225472

# update-initramfs -u

You can check the ARC statistics:

$ less /proc/spl/kstat/zfs/arcstats
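
For a quick look at the current and maximum ARC size (the size and c_max fields):

$ grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats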

More on ZFS performance

Some other links with interesting points on performance:

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html

https://icesquare.com/wordpress/how-to-improve-zfs-performance/




Ubuntu: ZFS bpool is full and not running snapshots during apt updates

When running apt to update my system I kept seeing a message saying that bpool had less than 20% space free and that the automatic snapshotting would not run.

What I didn’t realise is that this applies to the rpool as well, even if it has plenty of free space: the snapshots are taken together and have to match. Checking the snapshots, it seemed they had stopped running several months earlier. Yikes!

You can list the current snapshots in several ways:

[List existing snapshots with their names and creation date.]

$ zsysctl show
Name:           rpool/ROOT/ubuntu_dd5xf4
ZSys:           true
Last Used:      current
History:        
  - Name:       rpool/ROOT/ubuntu_dd5xf4@autozsys_qfi5pz
    Created on: 2021-01-12 23:35:01
  - Name:       rpool/ROOT/ubuntu_dd5xf4@autozsys_1osqbq
    Created on: 2021-01-12 23:33:22

You can also use the zfs commands for the same purpose.

List existing snapshots with default properties information
(name, used, references, mountpoint)

$ zfs list -t snapshot

You can also list the creation date by asking for the creation property.

$ zfs list -t snapshot -o name,creation

It should list them in creation order, but if not, you can use the -s option to sort them.

$ zfs list -t snapshot -o name,creation -s creation

Deciding which snapshots to delete will vary. You might want to get rid of the older ones, or maybe the ones that are consuming the most space.

My snapshots were a few months old so there wasn’t much point in keeping them. I deleted all with the following one-liner:

[-H removes headers]
[-o name displays the name of the filesystem]
[-t snapshot displays only snapshots]

# zfs list -H -o name -t snapshot | grep auto | xargs -n1 zfs destroy

I can’t stress enough how important it is that whatever zfs destroy command you issue, especially if doing several automatic iterations, applies only to the snapshots you intend to remove.

You can delete filesystems, volumes and snapshots with the above command. Deleting snapshots isn’t an issue. Deleting the filesystem is a pretty big one.

Please, ensure that the command lists only snapshots you want to remove before running it. You have been warned.
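
One way to double-check (my own precaution, not part of the original workflow) is to preview the list and then dry-run the destroys with -n before running them for real:

$ zfs list -H -o name -t snapshot | grep auto
# zfs list -H -o name -t snapshot | grep auto | xargs -n1 zfs destroy -nv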




Ubuntu 20.04: Install Ubuntu with ZFS and encryption

Ubuntu 20.04 offers installing ZFS as the default filesystem. This has lots of advantages. My favourite is being able to revert the system and home partitions (simultaneously or individually) to a previous state through the boot menu.

One major drawback for me is the lack of an option to encrypt the filesystem during the installation.

You have the option to use LUKS and ext4 but there isn’t an encryption option in the installer for ZFS.

Some people have used LUKS and ZFS in the past, but that solution didn’t quite work for me. The tutorials I saw were using LUKS1 instead of LUKS2 and it also felt that the approach was cumbersome now that ZFS on Linux supports native encryption.

The more you deviate from a standard installation the more complicated it will be to do any troubleshooting if anything breaks in the future. Keep it simple.

The ZFS on Linux version included with the 20.04 installer is 0.8.3.

The installation of Ubuntu 20.04 on ZFS will create two pools: bpool and rpool.

bpool contains the boot partition and rpool all the other mountpoints in several datasets.

In a very security-minded world both pools should be encrypted, but I prefer not to encrypt the boot partition. Adding that extra layer of security might make a system recovery that much more difficult or impossible.

The default partitioning during the install creates four partitions and two ZFS pools, using all the storage in the installation disk:

/boot/efi   512 MiB               EFI System Partition (vfat)
SWAP        2 GiB                 Linux swap partition (swap)
bpool       2 GiB                 ZFS/Solaris boot partition (zfs)
rpool       all remaining space   ZFS/Solaris root partition (zfs)

To encrypt the rpool we will need to edit the installation script.

Steps

  • Click the “Try Ubuntu” button.
  • Open a terminal window.
  • Edit /usr/share/ubiquity/zsys-setup
# vim /usr/share/ubiquity/zsys-setup

This script is responsible for setting up ZFS. We can modify the default options for rpool.

  • Edit the rpool section from this:
# Pools
        # rpool
        zpool create -f \
                -o ashift=12 \
                -O compression=lz4 \
                -O acltype=posixacl \
                -O xattr=sa \
                -O relatime=on \
                -O normalization=formD \
                -O mountpoint=/ \
                -O canmount=off \
                -O dnodesize=auto \
                -O sync=disabled \
                -O mountpoint=/ -R "${target}" rpool "${partrpool}"

to this:

# Pools
        # rpool
        echo PASSWORD | zpool create -f \
                -o ashift=12 \
                -O compression=lz4 \
                -O acltype=posixacl \
                -O xattr=sa \
                -O relatime=on \
                -O normalization=formD \
                -O mountpoint=/ \
                -O canmount=off \
                -O dnodesize=auto \
                -O sync=disabled \
                -O recordsize=1M \
                -O encryption=aes-256-gcm \
                -O keylocation=prompt \
                -O keyformat=passphrase \
                -O mountpoint=/ -R "${target}" rpool "${partrpool}"
  • Replace PASSWORD with the encryption password you want to use. You will be prompted to type this at boot time.
  • Save the changes to the file and exit.
  • Launch the installer:
# ubiquity
  • Install Ubuntu as you would.
    In the storage section:
  • Select “Use entire disk”
  • Select ZFS (Experimental)

The system will be installed with the encryption options set in the script, and on boot it will prompt you for the password you set up.


Some comments on the options for reference:

-o ashift=12
This is the default setting, meaning your disk’s block size is 4,096 bytes (2^12 = 4,096). Valid values are:

0 for autodetect sector size
9 for 512 bytes
10 for 1,024 bytes
11 for 2,048
12 for 4,096
13 for 8,192
14 for 16,384
15 for 32,768
16 for 65,536

You can output the physical sector size with lsblk -t, although values of 512 might be emulated. You should check the specifications if the drive is an SSD.

Alternative ways to retrieve physical sector sizes are:

$ cat /sys/block/sd*/queue/physical_block_size
# hdparm -I /dev/sda | grep Sector

A value of 12 will work just fine, even on 512-byte-sector drives, which is likely why Canonical sets it as the default.

If set too low this can have a huge and negative impact on performance.

-O recordsize=1M
Other tutorials suggest adding this entry. According to Oracle’s documentation this parameter is mainly tuned for databases, and I have read that it can also help with certain types of VMs.

The default value is 128k. You can tune it for your individual use by changing the record size of an existing pool. Any new files created will use the new record size value. You can cp/rm files to force them to be rewritten with the new value.

You can change this value later on with:

# zfs set recordsize=128k rpool

or

# zfs set recordsize=128k rpool/filesystem

-O encryption=aes-256-gcm
AES with key lengths of 128, 192 and 256 bits in CCM and GCM operation modes are supported natively.
0.8.4 comes with a fix that improves performance with AES-GCM and should hopefully be included in an update to Ubuntu soon.

-O keylocation=prompt
Valid options are prompt or file:///absolute/file/path

prompt will ask you to type the passphrase, in this case during boot.
file:// points to the location of the decryption key, but on a portable system that would defeat its purpose.

-O keyformat=passphrase
Options are raw, hex or passphrase.
When using passphrase the password can be between 8 and 512 bytes in length.
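
If you ever need to rotate the passphrase later, zfs change-key can do it on the live pool (a sketch, assuming rpool is the encryption root):

# zfs change-key -o keyformat=passphrase rpool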


Additional information

Reference sites
Debian ZFS site
Ubuntu ZFS reference
FreeBSD ZFS reference

ZFS on Linux website / Admin documentation
ZFS on Linux manpage
OpenZFS System Administration
OpenZFS FAQ

Oracle ZFS Admin guide (not necessarily in line with ZFS on Linux)
Archlinux ZFS wiki
Alpine Linux with root on ZFS with native encryption wiki

Ars Technica intro to ZFS

Interesting articles on ZFS tuning:
Tuning ZFS recordsize (Oracle blog)
ZFS record size (Joyent blog)
OpenZFS performance tuning wiki