ZFS is a next generation filesystem created by Matthew Ahrens and Jeff Bonwick. It was designed around a few key ideas:
- Administration of storage should be simple.
- Redundancy should be handled by the filesystem.
- File-systems should never be taken offline for repair.
- Automated simulations of worst case scenarios before shipping code is important.
- Data integrity is paramount.
Development of ZFS started in 2001 at Sun Microsystems. It was released under the CDDL in 2005 as part of Open Solaris. Pawel Jakub Dawidek ported ZFS to FreeBSD in 2007. Brian Behlendorf at LLNL started the ZFSOnLinux project in 2008 to port ZFS to Linux for High Performance Computing. Oracle purchased Sun Microsystems in 2010 and discontinued Open Solaris later that year. The Illumos project started to replace Open Solaris and roughly 2/3 of the core ZFS team resigned, including Matthew Ahrens and Jeff Bonwick. Most of them took jobs at companies and continue to develop Open Source ZFS as part of the Illumos project. The 1/3 of the ZFS core team at Oracle that did not resign continue development of an incompatible proprietary branch of ZFS in Oracle Solaris. The first release of Solaris included a few innovative changes that were under development prior to the mass resignation. Subsequent releases of Solaris have included fewer and less ambitious changes. Significant innovation continues in the open source branch of ZFS developed in Illumos. Today, a growing community continues development of the open source branch of ZFS across multiple platforms, including FreeBSD, Illumos, Linux and Mac OS X.
- 1 Features
- 2 Installation
- 3 Installing into the kernel directory (for static installs)
- 4 Usage
- 4.1 Preparation
- 4.2 Zpools
- 4.3 Volumes
- 4.4 Snapshots
- 4.5 Maintenance
- 5 Caveats
- 6 External resources
A detailed list of features can be found in a separate article.
There are out-of-tree Linux kernel modules available from the ZFSOnLinux Project. The current release is version 0.6.2 (zpool version 5000). This succeeds 0.6.1, which was the first release considered "ready for wide scale deployment on everything from desktops to super computers", by the ZFSOnLinux Project.
All changes to the GIT repository are subject to regression tests by LLNL.
echo "sys-kernel/spl ~amd64" >> /etc/portage/package.accept_keywords
echo "sys-fs/zfs-kmod ~amd64" >> /etc/portage/package.accept_keywords
echo "sys-fs/zfs ~amd64" >> /etc/portage/package.accept_keywords
emerge -av zfs
The latest upstream versions require keywording the live ebuilds (optional):
echo "=sys-kernel/spl-9999 **" >> /etc/portage/package.accept_keywords
echo "=sys-fs/zfs-kmod-9999 **" >> /etc/portage/package.accept_keywords
echo "=sys-fs/zfs-9999 **" >> /etc/portage/package.accept_keywords
Add zfs to the boot runlevel to mount all zpools on boot:
rc-update add zfs boot
|USE flag (what is that?)||Default||Recommended||Description|
||No||No||Build with user-specified CFLAGS (unsupported)|
||Yes||Yes||Enable dependencies required for booting off a pool containing a rootfs|
||No||No||Build static versions of dynamic libraries as well|
||No||No||Install regression test suite|
ZFSOnLinux uses ARC page replacement algorithm instead of the Last Recently Used page replacement algorithm used by other filesystems. This has a better hit rate, therefore providing better performance. The implementation of ARC in ZFS differs from the original paper in that the amount of memory used as cache can vary. This permits memory used by ARC to be reclaimed when the system is under memory pressure (via the kernel's shrinker mechanism) and grow when the system has memory to spare. The minimum and maximum amount of memory allocated to ARC varies based on your system memory. The default minimum is 1/32 of all memory, or 64MB, whichever is more. The default maximum is the larger of 1/2 of system memory or 64MB.
The manner in which Linux accounts for memory used by ARC differs from memory used by the page cache. Specifically, memory used by ARC is included under "used" rather than "cached" in the output used by the `free` program. This in no way prevents the memory from being released when the system is low on memory. However, it can give the impression that ARC (and by extension ZFS) will use all of system memory if given the opportunity.
Adjusting ARC memory usage
The minimum and maximum memory usage of ARC is tunable via zfs_arc_min and zfs_arc_max respectively. These properties can be set any of three ways. The first is at runtime (new in 0.6.2):
echo 536870912 >> /sys/module/zfs/parameters/zfs_arc_max
This sysfs value became writable in ZFSOnLinux 0.6.2. Changes through sysfs do not persist across boots. Also, the value in sysfs will be 0 when this value has not been manually configured. The current setting can be viewed by looking at c_max in /proc/spl/kstat/zfs/arcstats
The second is via /etc/modprobe.d/zfs.conf.
echo "options zfs zfs_arc_max=536870912" >> /etc/modprobe.d/zfs.conf
If using genkernel to load ZFS, this value must be set before genkernel is run to ensure that the file is copied into the initramfs.
The third is on the kernel commandline by specifying "zfs.zfs_arc_max=536870912" (for 512MB).
Similarly, the same can be done to adjust zfs_arc_min.
Installing into the kernel directory (for static installs)
This example uses 9999, but just change it to the latest ~ or stable (when that happens) and you should be good. The only issue you may run into is having zfs and zfs-kmod out of sync with each other. Just try to avoid that :D
This will generate the needed files, and copy them into the kernel sources directory.
env EXTRA_ECONF='--enable-linux-builtin' ebuild /usr/portage/sys-kernel/spl/spl-9999.ebuild clean configure
(cd /var/tmp/portage/sys-kernel/spl-9999/work/spl-9999 && ./copy-builtin /usr/src/linux)
env EXTRA_ECONF='--with-spl=/usr/src/linux --enable-linux-builtin' ebuild /usr/portage/sys-fs/zfs-kmod/zfs-kmod-9999.ebuild clean configure
(cd /var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-kmod-9999/ && ./copy-builtin /usr/src/linux)
After this, you just need to edit the kernel config to enable CONFIG_SPL and CONFIG_ZFS and emerge the zfs binaries.
mkdir -p /etc/portage/profile
echo 'sys-fs/zfs -kernel-builtin' >> /etc/portage/profile/package.use.mask
echo 'sys-fs/zfs kernel-builtin' >> /etc/portage/package.use
emerge -1v sys-fs/zfs
The echo's only need to be run once, but the emerge needs to be run every time you install a new version of zfs.
ZFS includes already all programs to manage the hardware and the file systems, there are no additional tools needed.
ZFS supports the use of either block devices or files. Administration is the same in both cases, but for production use, the ZFS developers recommend the use of block devices (preferably whole disks).To go through the different commands and scenarios we can use files in place of block devices. The following commands create 2GB sparse image files in /var/lib/zfs_img/ that we use as our hard drives. This uses at most 8GB disk space, but in practice will use very little because only written areas are allocated:
truncate -s 2G /var/lib/zfs_img/zfs0.img
truncate -s 2G /var/lib/zfs_img/zfs1.img
truncate -s 2G /var/lib/zfs_img/zfs2.img
truncate -s 2G /var/lib/zfs_img/zfs3.img
Now we check which loopback devices are in use:
On pool export, all of the files will be released and the folder /var/lib/zfs_img can be deleted
The program /usr/sbin/zpool is used with any operation regarding zpools.
To export (unmount) an existing zpool named zfs_test into the file system, you can use the following command:
zpool export zfs_test
To import (mount) the zpool named zfs_test use this command:
zpool import zfs_test
The root mountpoint of zfs_test is a property and can be changed the same way as for volumes. To import (mount) the zpool named zfs_test root on /mnt/gentoo, use this command:
zpool import -R /mnt/gentoo zfs_test
ZFS will automatically search on the hard drives for the zpool named zfs_test
One Hard Drive
Create a new zpool named zfs_test with one hard drive:
zpool create zfs_test /var/lib/zfs_img/zfs0.img
The zpool will automatically be mounted, default is the root file system aka /zfs_test
To delete a zpool use this command:
zpool destroy zfs_test
ZFS will not ask if you are sure.
MIRROR Two Hard Drives
In ZFS you can have several harddrives in a MIRROR, where equal copies exist on each storage. This increases the performance and redundancy. To create a new zpool named zfs_test with two hard drives as MIRROR:
zpool create zfs_test mirror /var/lib/zfs_img/zfs0.img /var/lib/zfs_img/zfs1.img
of the two hard drives only 2GB are effective useable so total_space * 1/n
To delete the zpool:
zpool destroy zfs_test
RAIDZ1 Three Hard Drives
RAIDZ1 is the equivalent to RAID5, where data is written to the first two drives and a parity onto the third. You need at least three hard drives, one can fail and the zpool is still ONLINE but the faulty drive should be replaced as soon as possible.
To create a pool with RAIDZ1 and three hard drives:
zpool create zfs_test raidz1 /var/lib/zfs_img/zfs0.img /var/lib/zfs_img/zfs1.img /var/lib/zfs_img/zfs2.img
of the three hard drives only 4GB are effective useable so total_space * (1-1/n)
To delete the zpool:
zpool destroy zfs_test
RAIDZ2 Four Hard Drives
RAIDZ2 is the equivalent to RAID6, where data is written to the first two drives and a parity onto the next two. You need at least four hard drives, two can fail and the zpool is still ONLINE but the faulty drives should be replaced as soon as possible. To create a pool with RAIDZ2 and four hard drives:
zpool create zfs_test raidz2 /var/lib/zfs_img/zfs0.img /var/lib/zfs_img/zfs1.img /var/lib/zfs_img/zfs2.img /var/lib/zfs_img/zfs3.img
of the four hard drives only 4GB are effective useable so total_space * (1-2/n)
To delete the zpool:
zpool destroy zfs_test
You can add hot-spares into your zpool. In case a failure, those are already installed and available to replace faulty vdevs. In this example, we use RAIDZ1 with three hard drives and a zpool named zfs_test:
zpool add zfs_test spare /var/lib/zfs_img/zfs3.img
The status of /dev/loop3 will stay AVAIL until it is set to be online, now we let /var/lib/zfs_img/zfs0.img fail:
zpool offline zfs_test /var/lib/zfs_img/zfs0.img
pool: zfs_test state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM zfs_test ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 /var/lib/zfs_img/zfs0.img ONLINE 0 0 0 /var/lib/zfs_img/zfs1.img ONLINE 0 0 0 /var/lib/zfs_img/zfs2.img ONLINE 0 0 0 spares /var/lib/zfs_img/zfs3.img errors: No known data errors
We replace /var/lib/zfs_img/zfs0.img with our spare /var/lib/zfs_img/zfs3.img:
zpool replace zfs_test /var/lib/zfs_img/zfs0.img /var/lib/zfs_img/zfs3.img
pool: zfs_test state: ONLINE scan: resilvered 62K in 0h0m with 0 errors on Sun Sep 1 15:41:41 2013 config: NAME STATE READ WRITE CKSUM zfs_test ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 spare-0 ONLINE 0 0 0 /var/lib/zfs_img/zfs0.img ONLINE 0 0 0 /var/lib/zfs_img/zfs3.img ONLINE 0 0 0 /var/lib/zfs_img/zfs1.img ONLINE 0 0 0 /var/lib/zfs_img/zfs2.img ONLINE 0 0 0 spares /var/lib/zfs_img/zfs3.img INUSE currently in use errors: No known data errors
The original vdev will automatically get removed asynchronously. If this is not the case, the old vdev may need to be detached with the "zpool detach" command. Later you will see it leave the zpool status output:
pool: zfs_test state: ONLINE scan: resilvered 62K in 0h0m with 0 errors on Sun Sep 1 15:41:41 2013 config: NAME STATE READ WRITE CKSUM zfs_test ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 /var/lib/zfs_img/zfs3.img ONLINE 0 0 0 /var/lib/zfs_img/zfs1.img ONLINE 0 0 0 /var/lib/zfs_img/zfs2.img ONLINE 0 0 0 errors: No known data errors
ZFS automatically resilvered onto /var/lib/zfs_img/zfs0.img and the zpool had no downtime
Now we start a manual scrub:
zpool scrub zfs_test
pool: zfs_test state: ONLINE scan: scrub repaired 0 in 0h0m with 0 errors on Sun Sep 1 15:57:31 2013 config: NAME STATE READ WRITE CKSUM zfs_test ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 /var/lib/zfs_img/zfs3.img ONLINE 0 0 0 /var/lib/zfs_img/zfs1.img ONLINE 0 0 0 /var/lib/zfs_img/zfs2.img ONLINE 0 0 0 errors: No known data errors
Zpool Version Update
With every update of sys-fs/zfs, you are likely to also get a more recent ZFS version. Also the status of your zpools will indicate a warning that a new version is available and the zpools could be upgraded. To display the current version on a zpool:
zpool upgrade -v
This system supports ZFS pool feature flags. The following features are supported: FEAT DESCRIPTION ------------------------------------------------------------- async_destroy (read-only compatible) Destroy filesystems asynchronously. empty_bpobj (read-only compatible) Snapshots use less space. lz4_compress LZ4 compression algorithm support. The following legacy versions are also supported: VER DESCRIPTION --- -------------------------------------------------------- 1 Initial ZFS version 2 Ditto blocks (replicated metadata) 3 Hot spares and double parity RAID-Z 4 zpool history 5 Compression using the gzip algorithm 6 bootfs pool property 7 Separate intent log devices 8 Delegated administration 9 refquota and refreservation properties 10 Cache devices 11 Improved scrub performance 12 Snapshot properties 13 snapused property 14 passthrough-x aclinherit 15 user/group space accounting 16 stmf property support 17 Triple-parity RAID-Z 18 Snapshot user holds 19 Log device removal 20 Compression using zle (zero-length encoding) 21 Deduplication 22 Received properties 23 Slim ZIL 24 System attributes 25 Improved scrub stats 26 Improved snapshot deletion performance 27 Improved snapshot creation performance 28 Multiple vdev replacements For more information on a particular version, including supported releases, see the ZFS Administration Guide.
systems with a lower version installed will not be able to import a zpool of a higher version
To upgrade the version of zpool zfs_test:
zpool upgrade zfs_test
To upgrade the version of all zpools in the system:
zpool upgrade -a
- You cannot shrink a zpool and remove vdevs after its initial creation.
- It is possible to add more vdevs to a MIRROR after its initial creation. Use the following command (/dev/loop0 is the first drive in the MIRROR):
zpool attach zfs_test /dev/loop0 /dev/loop2
- More than 9 vdevs in one RAIDZ could cause performance regression. For example it is better to use 2xRAIDZ with each five vdevs rather than 1xRAIDZ with 10 vdevs in a zpool
- RAIDZ1 and RAIDZ2 cannot be resized after intial creation (you may only add additional hot spares). You can, however, replace the hard drives with bigger ones (one at a time), e.g. replace 1T drives with 2T drives to double the available space in the zpool.
- It is possible to mix MIRROR, RAIDZ1 and RAIDZ2 in a zpool. For example to add two more vdevs in a MIRROR in a zpool with RAIDZ1 named zfs_test, use:
zpool add -f zfs_test mirror /dev/loop4 /dev/loop5
this needs the -f option
- It is possible to restore a destroyed zpool, by reimporting it straight after the accident happened:
zpool import -D
pool: zfs_test id: 12744221975042547640 state: ONLINE (DESTROYED) action: The pool can be imported using its name or numeric identifier.
the option -D searches on all hard drives for existing zpools
The program /usr/sbin/zfs is used for any operation regarding volumes. To control the size of a volume you can set quota and you can reserve a certain amount of storage within a zpool. By default zpool uses the full storage size.
We use our zpool zfs_test to create a new volume called volume1:
zfs create zfs_test/volume1
The volume will be mounted automatically as /zfs_test/volumes1/
Volumes can be mounted with the following command, the mountpoint is defined by the property mountpoint of the volume:
zfs mount zfs_test/volume1
To unmount the volume:
zfs unmount zfs_test/volume1
The folder /zfs_test/volume1 stays without the volume behind it. If you write data to it and then try to mount the volume again, you will see the following error message:
cannot mount '/zfs_test/volume1': directory is not empty
To remove volumes volume1 from zpool zfs_test:
zfs destroy zfs_test/volume1
you cannot destroy a volume if there exist any snapshots of it
Properties for volumes are inherited from the zpool. So you can either change the property on the zpool for all volumes or specifically per individual volume or a mix of both. To set a property for a volume:
zfs set <property> zfs_test/volume1
To show the setting for a particular property on a volume:
zfs get <property> zfs_test/volume1
The properties are used on a volume e.g. compression, the higher is the version of this volume
You can get a list of all properties set on any zpool with the following command:
zfs get all
This is a partial list of properties that can be set on either zpools or volumes, for a full list see man zfs:
|quota=||20m,none||set a quota of 20MB for the volume|
|reservation=||20m,none||reserves 20MB for the volume within it's zpool|
|compression=||zle,gzip,on,off||uses the given compression method or the default method for compression which should be gzip|
|sharenfs=||on,off,ro,nfsoptions||shares the volume via NFS|
|exec=||on,off||controls if programs can be executed on the volume|
|setuid=||on,off||controls if SUID or GUID can be set on the volume|
|readonly=||on,off||sets read only atribute to on/off|
|atime=||on,off||update access times for files in the volume|
|dedup=||on,off||sets deduplication on or off|
|mountpoint=||none,path||sets the mountpoint for the volume below the zpool or elsewhere in the file system, a mountpoint set to none prevents the volume from being mounted|
Set the mountpoint for a volume, use the following command:
zfs set mountpoint=/mnt/data zfs_test/volume1
The volume will be automatically moved to /mnt/data
Create a volume as NFS share:
zfs create -o sharenfs=on zfs_test/volume2
Check what file systems are shared via NFS:
Per default the volume is shared to all networks, to specify share options:
zfs set sharenfs="-maproot=root -alldir -network 192.168.1.254 -mask 255.255.255.0" zfs_test/volume2
To stop sharing the volume:
zfs set sharenfs=off zfs_test/volume2
Snapshots are volumes which have no initial size and save changes made to another volume. With increasing changes between the snapshot and the original volume it grows in size.
To create a snapshot of a volume, use the following command:
zfs snapshot zfs_test/volume1@22082011
volume1@22082011 is the full name of the snapshot, everything after the @ symbol can be any alphanumeric combination
Every time a file in volume1 changes, the old data of the file will be linked into the snapshot.
List all available snapshots:
zfs list -t snapshot -o name,creation
To rollback a full volume to a previous state:
zfs rollback zfs_test/volume1@21082011
if there are other snapshots in between, then you have to use the -r option. This would remove all snapshots between the one you want to rollback and the original volume
ZFS can clone snapshots to new volumes, so you can access the files from previous states individually:
zfs clone zfs_test/volume1@21082011 zfs_test/volume1_restore
In the folder /zfs_test/volume1_restore can now be worked on in the version of a previous state
Remove snapshots of a volume with the following command:
zfs destroy zfs_test/volume1@21082011
Start a scrubbing for zpool zfs_test:
zpool scrub zfs_test
this might take some time and is quite I/O intensive
To check the history of commands that were executed:
Monitor I/O activity on all zpools (refreshes every 6 seconds):
zpool iostat 6
- Swap: Due to the Linux kernel's design, swap is recommended. However, swap on a zvol can deadlock under situations involving high memory pressure. It is better to place swap on a dedicated device until zfsonlinux/zfs#1526 is fixed.
- Memory fragmentation: Memory fragmentation on Linux can cause memory allocations to consume more memory than is actually measured, which means that actual ARC memory usage can exceed zfs_arc_max by a constant factor. This effect will be dramatically reduced once zfsonlinux/zfs#75 is fixed. Recent versions of ZFS on Linux include the arcstats.py script which allows you to monitor ARC usage.