ZFS

ZFS is a next generation filesystem created by Matthew Ahrens and Jeff Bonwick. It was designed around a few key ideas:


 * Administration of storage should be simple.
 * Redundancy should be handled by the filesystem.
 * Filesystems should never be taken offline for repair.
 * Automated simulation of worst-case scenarios before shipping code is important.
 * Data integrity is paramount.

Development of ZFS started in 2001 at Sun Microsystems. It was released under the CDDL in 2005 as part of OpenSolaris. Pawel Jakub Dawidek ported ZFS to FreeBSD in 2007. Brian Behlendorf at LLNL started the ZFSOnLinux project in 2008 to port ZFS to Linux for High Performance Computing. Oracle purchased Sun Microsystems in 2010 and discontinued OpenSolaris later that year. The Illumos project was started to replace OpenSolaris, and roughly 2/3 of the core ZFS team resigned, including Matthew Ahrens and Jeff Bonwick. Most of them took jobs at other companies and continue to develop open source ZFS as part of the Illumos project. The third of the ZFS core team that remained at Oracle continues development of an incompatible proprietary branch of ZFS in Oracle Solaris. The first proprietary release included a few innovative changes that were under development prior to the mass resignation; subsequent releases have included fewer and less ambitious changes. Significant innovation continues in the open source branch of ZFS developed in Illumos. Today, a growing community continues development of the open source branch of ZFS across multiple platforms, including FreeBSD, Illumos, Linux and Mac OS X.

Features
Some of ZFS' features are:
 * Simplified administration (two main administration tools, zpool and zfs)
 * A hierarchical namespace for management of all mountpoints (datasets) and block devices (zvols)
 * Online management (no downtime required for routine administrative tasks)
 * Partitioning is replaced by ZFS storage pools that span multiple disks
 * Dynamic allocation of storage across mountpoints (no need to repartition)
 * Integrated volume management (zvol block devices like LVM logical volumes)
 * Supports thin provisioning of storage
 * Snapshots (maintains a copy of data as it was at a specific point in time)
 * Clones (write-able copies of snapshots that store only changes from the original)
 * Special .zfs directory for viewing the contents of snapshots
 * ZFS Send/Recv of snapshots (online backup without the consistency issues of rsync)
 * Incremental Send/Recv of snapshots (reads the list of changes between snapshots and transmits only those; asymptotically faster than rsync)
 * Integrated RAID with support for N-way mirrors and up to three levels of parity-based RAID (RAID-Z) similar to RAID 5, RAID 6 and an additional level beyond that
   * Variable stripe width (no RAID write hole)
 * Abstraction of all storage into a vdev (virtual device) tree
 * Scaling of IOPS across top-level vdevs in a pool
   * e.g. if two RAID-Z2 vdevs are in a pool, objects are written to one or the other, such that IOPS are more intelligently distributed than with traditional striped storage
 * ARC page replacement algorithm
   * Higher hit rate than the commonly used LRU page replacement algorithm increases IOPS performance
 * ZFS Intent Log (ZIL)
   * Sequentially writes intent records of pending small synchronous writes to safely reduce latencies to the levels of asynchronous IO
 * Tiered storage
   * L2ARC devices that act as an extension of the system's main memory
     * Supports LZ4 compression for increased cache capacity as of ZFSOnLinux 0.6.2
   * SLOG devices that permit the ZIL to be written to dedicated hardware
 * Data deduplication
 * Data compression with zle (zero-length encoding; fast, but only compresses sequences of zeros), LZJB or its replacement LZ4, or gzip (higher compression, but slower)
 * Endian independence (different machine designs do not prevent ZFS-formatted disks from being read)
 * Easy disk format upgrades
 * Persistent pool settings
   * Failure mode configuration
   * bootfs (used by GRUB2 to find binaries)
   * Ability to annotate a pool with a comment
   * Other miscellaneous settings, especially readonly settings
 * Persistent configuration for mountpoints (datasets) and block devices (zvols)
   * Uses inheritance in the hierarchical namespace
   * Dataset-specific options
     * NFS/SMB sharing
       * Automates configuration of NFS and SMB servers, but the original manual way remains an option
     * Case sensitivity/insensitivity
     * Unicode normalization
     * Quotas (limit on how much storage can be allocated from the pool)
     * Multiple copies of data (transparent)
     * Atime updates
     * Xattr
     * setuid
     * Mountpoints
       * Deprecates fstab (although still an option with mountpoint=legacy)
       * Makes it easy to have thousands of datasets, including one per home directory
     * Control of the visibility of the .zfs directory
     * Recordsize (tunable that controls internal CoW granularity)
   * Zvol-specific options
     * volblocksize (tunable that controls internal CoW granularity)
     * volsize (allows online resizing of a zvol)
     * Control of the visibility of snapshots in /dev (Linux-specific)
   * Options common to both
     * Compression (already mentioned)
     * Deduplication (already mentioned)
     * Cache control (great for software that implements its own cache in userland)
     * Control of synchronous IO (whether the ZIL is used)
     * Reservations (storage reserved for use by a zvol/dataset)
       * This permits thin provisioning of zvols
     * Many other miscellaneous settings, especially readonly settings
     * User-defined settings (for use by scripts)

Features in Illumos/Solaris that have yet to be implemented in ZFSOnLinux are:


 * On-access virus scanner integration (ClamAV)
 * iSCSI integration
 * NFSv4 ACLs
   * Does not prevent NFS from being used with ZFS
 * Delegated administration
   * Allows system administrator to give ownership of datasets to users (e.g. their home directories) so that they can manage snapshots, configure compression, etcetera.

Modules
There are out-of-tree Linux kernel modules available from the ZFSOnLinux Project. The current release is version 0.6.2 (zpool version 5000). This succeeds 0.6.1, which was the first release considered "ready for wide scale deployment on everything from desktops to super computers", by the ZFSOnLinux Project.

Installing ZFS on Gentoo Linux requires the ~amd64 keyword for sys-fs/zfs and its dependencies sys-fs/zfs-kmod and sys-kernel/spl:
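For example (a sketch; adjust package names and keywords to your architecture and package versions):

root # echo "sys-fs/zfs" >> /etc/portage/package.accept_keywords
root # echo "sys-fs/zfs-kmod" >> /etc/portage/package.accept_keywords
root # echo "sys-kernel/spl" >> /etc/portage/package.accept_keywords
root # emerge -av sys-fs/zfs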

The latest upstream versions require keywording the live ebuilds (optional):
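For example, assuming the 9999 live ebuilds:

root # echo "=sys-fs/zfs-9999 **" >> /etc/portage/package.accept_keywords
root # echo "=sys-fs/zfs-kmod-9999 **" >> /etc/portage/package.accept_keywords
root # echo "=sys-kernel/spl-9999 **" >> /etc/portage/package.accept_keywords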

Add zfs to the boot runlevel to mount all zpools on boot:
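root # rc-update add zfs boot

(On newer versions of sys-fs/zfs the init script may be split, e.g. into zfs-import and zfs-mount.)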

ARC
ZFSOnLinux uses the ARC (Adaptive Replacement Cache) page replacement algorithm instead of the Least Recently Used (LRU) page replacement algorithm used by other filesystems. This has a better hit rate and therefore provides better performance. The implementation of ARC in ZFS differs from the original paper in that the amount of memory used as cache can vary. This permits memory used by ARC to be reclaimed when the system is under memory pressure (via the kernel's shrinker mechanism) and to grow when the system has memory to spare. The minimum and maximum amount of memory allocated to ARC varies based on your system memory. The default minimum is 1/32 of all memory or 64MB, whichever is more. The default maximum is the larger of 1/2 of system memory or 64MB.

The manner in which Linux accounts for memory used by ARC differs from memory used by the page cache. Specifically, memory used by ARC is included under "used" rather than "cached" in the output of the `free` program. This in no way prevents the memory from being released when the system is low on memory. However, it can give the impression that ARC (and by extension ZFS) will use all of the system's memory if given the opportunity.

Adjusting ARC memory usage
The minimum and maximum memory usage of ARC are tunable via zfs_arc_min and zfs_arc_max respectively. These parameters can be set in any of three ways. The first is at runtime (new in 0.6.2):
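For example, to limit the ARC to 512MB at runtime:

root # echo 536870912 > /sys/module/zfs/parameters/zfs_arc_max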

The second is via /etc/modprobe.d/zfs.conf.
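For example, the following line in that file limits the ARC to 512MB:

options zfs zfs_arc_max=536870912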

The third is on the kernel commandline by specifying "zfs.zfs_arc_max=536870912" (for 512MB).

Similarly, the same can be done to adjust zfs_arc_min.

Installing into the kernel directory (for static installs)
This example uses the 9999 (live) ebuilds, but just change that to the latest ~arch or stable version (when one becomes available) and you should be good. The only issue you may run into is having zfs and zfs-kmod out of sync with each other. Just try to avoid that :D
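A sketch of the procedure (paths and configure options are illustrative and depend on the exact ebuild version; the idea is to configure each ebuild for a built-in kernel module and then run its copy-builtin script against the kernel tree):

root # env EXTRA_ECONF='--enable-linux-builtin' ebuild /usr/portage/sys-kernel/spl/spl-9999.ebuild clean configure
root # (cd /var/tmp/portage/sys-kernel/spl-9999/work/spl-9999 && ./copy-builtin /usr/src/linux)
root # env EXTRA_ECONF='--enable-linux-builtin' ebuild /usr/portage/sys-fs/zfs-kmod/zfs-kmod-9999.ebuild clean configure
root # (cd /var/tmp/portage/sys-fs/zfs-kmod-9999/work/zfs-9999 && ./copy-builtin /usr/src/linux)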

This will generate the needed files, and copy them into the kernel sources directory.

After this, you just need to edit the kernel config to enable CONFIG_SPL and CONFIG_ZFS and emerge the zfs binaries.
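A sketch, assuming the kernel-builtin USE flag of sys-fs/zfs (masked by default, so it is unmasked first):

root # mkdir -p /etc/portage/profile
root # echo 'sys-fs/zfs -kernel-builtin' >> /etc/portage/profile/package.use.mask
root # echo 'sys-fs/zfs kernel-builtin' >> /etc/portage/package.use
root # emerge -av sys-fs/zfs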

The echoes only need to be run once, but the emerge needs to be run every time you install a new version of ZFS.

Usage
ZFS already includes all the programs needed to manage the hardware and the file systems; no additional tools are needed.

Preparation
ZFS supports the use of either block devices or files. Administration is the same in both cases, but for production use the ZFS developers recommend block devices (preferably whole disks). To go through the different commands and scenarios we can use files in place of block devices. The following commands create four 2GB sparse image files in /var/lib/zfs_img/ that we use as our hard drives. This uses at most 8GB of disk space, but in practice will use very little because only written areas are allocated:
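For example:

root # mkdir -p /var/lib/zfs_img
root # for i in 0 1 2 3; do truncate -s 2G /var/lib/zfs_img/zfs$i.img; done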

Now we check which loopback devices are in use:
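root # losetup -a

The images can then be attached to free loop devices, for example:

root # for i in 0 1 2 3; do losetup /dev/loop$i /var/lib/zfs_img/zfs$i.img; done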

Zpools
The program /usr/sbin/zpool is used for any operation regarding zpools.

import/export Zpool
To export (unmount) an existing zpool named zfs_test from the file system, use the following command:
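root # zpool export zfs_test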

To import (mount) the zpool named zfs_test use this command:
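root # zpool import zfs_test

(If the pool was built on image files rather than loop devices, point the import at the directory, e.g. zpool import -d /var/lib/zfs_img zfs_test.)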

The root mountpoint of zfs_test is a property and can be changed in the same way as for volumes. To import (mount) the zpool named zfs_test with its root on /mnt/gentoo, use this command:
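root # zpool import -R /mnt/gentoo zfs_test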

One Hard Drive
Create a new zpool named zfs_test with one hard drive:
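root # zpool create zfs_test /dev/loop0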

The zpool will be mounted automatically; by default under the root file system, i.e. at /zfs_test

To delete a zpool use this command:
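root # zpool destroy zfs_test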

MIRROR Two Hard Drives
In ZFS you can have several hard drives in a MIRROR, where identical copies of the data exist on each drive. This increases performance and redundancy. To create a new zpool named zfs_test with two hard drives as a MIRROR:
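root # zpool create zfs_test mirror /dev/loop0 /dev/loop1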

To delete the zpool:

RAIDZ1 Three Hard Drives
RAIDZ1 is the equivalent of RAID 5: data is striped across the drives together with one drive's worth of parity. You need at least three hard drives; one can fail and the zpool keeps working, but the faulty drive should be replaced as soon as possible. To create a pool with RAIDZ1 and three hard drives:
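root # zpool create zfs_test raidz1 /dev/loop0 /dev/loop1 /dev/loop2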

To delete the zpool:

RAIDZ2 Four Hard Drives
RAIDZ2 is the equivalent of RAID 6: data is striped across the drives together with two drives' worth of parity. You need at least four hard drives; two can fail and the zpool keeps working, but the faulty drives should be replaced as soon as possible. To create a pool with RAIDZ2 and four hard drives:
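root # zpool create zfs_test raidz2 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3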

To delete the zpool:

Spares/Replace vdev
You can add hot spares to your zpool. In case of a failure, these are already installed and available to replace faulty vdevs. In this example we use RAIDZ1 with three hard drives and a zpool named zfs_test:
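For example, using the image files directly (the loop devices work the same way):

root # zpool create zfs_test raidz1 /var/lib/zfs_img/zfs0.img /var/lib/zfs_img/zfs1.img /var/lib/zfs_img/zfs2.img spare /var/lib/zfs_img/zfs3.img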

The status of the spare will stay AVAIL until it is needed. Now we let /var/lib/zfs_img/zfs0.img fail:
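One way to simulate the failure is to take the device offline:

root # zpool offline zfs_test /var/lib/zfs_img/zfs0.img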

We replace /var/lib/zfs_img/zfs0.img with our spare /var/lib/zfs_img/zfs3.img:
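root # zpool replace zfs_test /var/lib/zfs_img/zfs0.img /var/lib/zfs_img/zfs3.img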

The original vdev will be removed automatically and asynchronously; after a while it will disappear from the zpool status output:
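root # zpool status zfs_test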

Now we start a manual scrub:
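root # zpool scrub zfs_test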

Zpool Version Update
With every update of sys-fs/zfs, you are likely to also get a more recent ZFS version. The status of your zpools will then show a warning that a new version is available and that the zpools can be upgraded. To display the current version of a zpool:
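root # zpool get version zfs_test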

To upgrade the version of zpool zfs_test:
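root # zpool upgrade zfs_test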

To upgrade the version of all zpools in the system:
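root # zpool upgrade -a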

Zpool Tips/Tricks

 * You cannot shrink a zpool or remove vdevs after its initial creation.
 * It is possible to add more vdevs to a MIRROR after its initial creation. Use the following command (/dev/loop0 is the first drive in the MIRROR):
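For example, attaching a new drive /dev/loop2 to the existing mirror member /dev/loop0:

root # zpool attach zfs_test /dev/loop0 /dev/loop2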


 * More than 9 drives in one RAIDZ vdev can cause performance regressions. For example, it is better to use two RAIDZ vdevs of five drives each rather than one RAIDZ vdev with 10 drives in a zpool.
 * RAIDZ1 and RAIDZ2 vdevs cannot be resized after initial creation (you can only add additional hot spares). You can, however, replace the hard drives with bigger ones one at a time, e.g. replace 1T drives with 2T drives to double the available space in the zpool.
 * It is possible to mix MIRROR, RAIDZ1 and RAIDZ2 vdevs in a zpool. For example, to add two more drives as a MIRROR to a zpool with RAIDZ1 named zfs_test, use:
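For example, with /dev/loop3 and /dev/loop4 as the two new drives:

root # zpool add zfs_test mirror /dev/loop3 /dev/loop4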


 * It is possible to restore a destroyed zpool, by reimporting it straight after the accident happened:
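root # zpool import -D zfs_test

(Add -d /var/lib/zfs_img if the pool was built on image files.)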

Volumes
The program /usr/sbin/zfs is used for any operation regarding volumes. To control the size of a volume you can set a quota, and you can reserve a certain amount of storage within a zpool; by default the complete storage of the zpool can be used.

Create Volumes
We use our zpool zfs_test to create a new volume called volume1:
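root # zfs create zfs_test/volume1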

The volume will be mounted automatically as /zfs_test/volume1/

Mount/Umount Volumes
Volumes can be mounted with the following command; the mountpoint is defined by the mountpoint property of the volume:
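root # zfs mount zfs_test/volume1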

To unmount the volume:
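root # zfs umount zfs_test/volume1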

The folder /zfs_test/volume1 remains, but without the volume behind it. If you write data to it and then try to mount the volume again, you will see the following error message: cannot mount '/zfs_test/volume1': directory is not empty

Remove Volumes
To remove the volume volume1 from zpool zfs_test:
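root # zfs destroy zfs_test/volume1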

Properties
Properties for volumes are inherited from the zpool. So you can either change a property on the zpool for all volumes, set it individually for each volume, or use a mix of both. To set a property for a volume:
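For example, to enable LZ4 compression on the volume (compression is just an example property):

root # zfs set compression=lz4 zfs_test/volume1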

To show the setting for a particular property on a volume:
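root # zfs get compression zfs_test/volume1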

You can get a list of all properties set on any zpool with the following command:
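root # zfs get all zfs_test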

A partial list of properties that can be set on either zpools or volumes includes compression, dedup, quota, reservation, mountpoint, sharenfs, atime and recordsize; for the full list, see man zfs.

Set Mountpoint
Set the mountpoint for a volume, use the following command:
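root # zfs set mountpoint=/mnt/data zfs_test/volume1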

The volume will be automatically moved to /mnt/data

NFS Volume
Create a volume as NFS share:
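root # zfs create -o sharenfs=on zfs_test/volume1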

Check what file systems are shared via NFS:
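root # exportfs

or query the property directly:

root # zfs get sharenfs zfs_test/volume1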

By default the volume is shared to all networks. To specify share options:
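For example, to restrict the share to one network (the network address here is only an example):

root # zfs set sharenfs='rw=@192.168.1.0/24' zfs_test/volume1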

To stop sharing the volume:
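root # zfs set sharenfs=off zfs_test/volume1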

Snapshots
Snapshots preserve the state of a volume at the moment they are taken and initially take up no space. As the original volume changes, a snapshot grows in size, because it keeps the old data that has since been changed.

Create Snapshots
To create a snapshot of a volume, use the following command:
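For example (the snapshot name snap1 is arbitrary):

root # zfs snapshot zfs_test/volume1@snap1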

Every time a file in volume1 changes, the old data of the file remains referenced by the snapshot.

List Snapshots
List all available snapshots:
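root # zfs list -t snapshot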

Rollback Snapshots
To rollback a full volume to a previous state:
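root # zfs rollback zfs_test/volume1@snap1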

Clone Snapshots
ZFS can clone snapshots to new volumes, so you can access the files from previous states individually:
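root # zfs clone zfs_test/volume1@snap1 zfs_test/volume1_restore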

The folder /zfs_test/volume1_restore now contains the data as it was at the time of the snapshot and can be worked on independently.

Remove Snapshots
Remove snapshots of a volume with the following command:
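root # zfs destroy zfs_test/volume1@snap1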

Scrubbing
Start a scrub of the zpool zfs_test:
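root # zpool scrub zfs_test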

Log Files
To check the history of commands that were executed:
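root # zpool history zfs_test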

Monitor I/O
Monitor I/O activity on all zpools (refreshes every 6 seconds):
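root # zpool iostat -v 6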

Caveat

 * Swap: due to how ZFS and the Linux kernel are designed, if you use ZFS on Linux you need to use swap; it makes it easier to manage high memory pressure situations. That is still true even if you never needed swap before and have large amounts of RAM available. Don't place it on a ZVOL (at least until #1526 is fixed).
 * High memory usage despite a low zfs_arc_max: ZFS on Linux currently suffers from memory fragmentation, which may lead the ARC to exceed the limit you set with zfs_arc_max. You can't do much about it. Recent versions of ZFS on Linux include the arcstat.py script, which allows you to monitor ARC usage; a rough way to compare the ARC size with its target is shown after this list.
 * Release memory by disabling ARC: if you want to free some memory you can disable the ARC on the whole pool for a short period of time to reset the cache (see the example after this list). Beware when you do it, since all reads/writes will go directly to the disk.
 * Release memory by simulating heavy load: if you need to allocate a big amount of memory in one shot, you might have a problem, since ZFS might not be able to release it in time, resulting in an out-of-memory error. To force ZFS to release memory you can simulate high memory pressure using stress (app-benchmarks/stress). You can start with 10 workers and increase their number until you obtain the desired effect (see the example after this list).
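A few sketches of the points above (the pool name zfs_test and the worker count are only examples). Comparing the current ARC size with its target via the kernel statistics:

root # awk '/^size|^c_max/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats

Temporarily disabling and re-enabling the ARC for a pool:

root # zfs set primarycache=none zfs_test
root # zfs set primarycache=all zfs_test

Simulating memory pressure with 10 workers:

root # stress -m 10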

External resources

 * zfs-fuse.net
 * ZFS for Linux
 * ZFS Best Practices Guide
 * ZFS Evil Tuning Guide
 * Article about ZFS on Linux/Gentoo (German)