ZFSOnLinux Development Guide

This page is a temporary home for reference documentation on how ZFSOnLinux development is done. It is a work in progress and is meant only for developers.

= Setting up a development environment =

This is easiest with a Linux environment. It is theoretically possible to use non-Linux environments, but no one does it in practice and anyone trying to use a non-Linux environment will be in new territory.

Once you have a Linux environment, you will want to install and configure git. The "Git book" is a good resource for this, but the simplest setup is:
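The original commands are not reproduced on this page. A minimal sketch, assuming you only need to set the identity that git records in your commits; the name and email address are placeholders:

```shell
# Set the identity recorded in every commit you make.
# Both values are placeholders; substitute your own.
git config --global user.name "Your Name"
git config --global user.email "you@example.com"

# Confirm what git will use.
git config --global --get user.name
git config --global --get user.email
```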

In addition, you will want to set up a GitHub account and configure it so that you can clone over SSH. Once that is done, clone the repositories:

Also, you will likely want to install and use cscope. Using cscope is easy with the following bash alias.
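The alias itself is missing from this page. One plausible definition, offered as a sketch rather than the original: it indexes C sources, headers and assembly files below the current directory and builds the cscope database. The file pattern is an assumption.

```shell
# Build a cscope database for the tree rooted at the current directory.
#   find ... > cscope.files   list of files to index (cscope reads this
#                             file by default when given no file names)
#   -b  build the database and exit
#   -q  build an inverted index for faster lookups
#   -k  kernel mode: do not index /usr/include
alias cscope-init='find . -name "*.[chS]" > cscope.files && cscope -b -q -k'
```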

After running that in your shell, you can run `cscope-init && cscope -d` inside directories to get a nice curses interface for browsing the code.

= Setting up a VM environment =

Virtual machines can save time when regression tests reveal problems in new code, because the environment can be inspected from the hypervisor. It is desirable to test module loading and unloading in the virtual environment, so it is best to use a non-ZFS rootfs in the environment running the regression tests. The simplest one to use is 9p-virtio, a network filesystem whose server is integrated into QEMU and operates over a virtio transport. It introduces a few quirks of its own, but those can be worked around by compiling in a chroot.

== Gentoo with a 9p-virtio rootfs ==
We pick /guest as a place to store the virtual machines. /guest/gentoo will store the rootfs, and we will make clever use of bind mounts to have a chroot at /guest/gentoo-chroot that allows us to operate on the rootfs even while the VM is running.

Now we install the current stage3:

Set up the bind mounts:
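The original mount commands are not reproduced here. The sketch below is one way to get the layout described above, assuming a conventional chroot setup: the rootfs is bind mounted at /guest/gentoo-chroot, and /proc, /sys and /dev are mounted only beneath the chroot path so that they stay out of the directory QEMU exports. It prints the commands by default; the `DRY_RUN` switch and the function names are hypothetical helpers.

```shell
ROOTFS=/guest/gentoo
CHROOT=/guest/gentoo-chroot

# Print each command unless DRY_RUN=0 is set, in which case run it.
# Running for real requires root.
run() {
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; else echo "$*"; fi
}

setup_chroot_mounts() {
    run mkdir -p "$CHROOT"
    # A plain (non-recursive) bind mount: mounts made below $CHROOT
    # afterwards do not appear below $ROOTFS.
    run mount --bind "$ROOTFS" "$CHROOT"
    run mount -t proc proc "$CHROOT/proc"
    run mount --rbind /sys "$CHROOT/sys"
    run mount --rbind /dev "$CHROOT/dev"
}

setup_chroot_mounts
```

Review the printed commands first, then repeat with `DRY_RUN=0` as root to apply them.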

Also set up a bind mount to allow the chroot to access code from our development environment:

Copy resolv.conf so that DNS resolution inside our chroot works:

Now enter the chroot for additional setup:

Set up serial terminals:

Disable the fstab entries. We do not need them with a 9p-virtio rootfs:
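A small sketch of one way to do this, assuming GNU sed: comment out every active line rather than deleting it, so the entries can be restored later. The function name is hypothetical.

```shell
# Comment out every active entry in an fstab-style file.
# Blank lines and lines that are already comments are left alone.
disable_fstab_entries() {
    sed -i -e 's/^[[:space:]]*[^#[:space:]]/#&/' "$1"
}

# Inside the chroot, as root:
#   disable_fstab_entries /etc/fstab
```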

Put /tmp and /var/tmp/portage (PORTAGE_TMPDIR) on a tmpfs:

Now let's set up Portage:

Let's also set up distfiles and packages:

Do some Portage configuration:

Keyword the 9999 ebuilds:

Configure ZFS packaging:

If you are not bind mounting a portage tree, install one:

Install a few tools we will need:

Configure make.conf:

We compile quite a bit, so we try to optimize GCC somewhat:

Let's rebuild everything and install the tools we need at the same time:

Let's cheat at kernel configuration:

In menuconfig, make certain that you enable the following:


 * CONFIG_MODULES=y
 * CONFIG_MODULE_UNLOAD=y
 * CONFIG_EXPERT=y
 * CONFIG_KALLSYMS=y
 * CONFIG_KPROBES=y
 * CONFIG_FTRACE=y
 * CONFIG_KPROBE_EVENT=y (probably already set)
 * CONFIG_UPROBE_EVENT=y
 * CONFIG_FUNCTION_TRACER=y
 * CONFIG_FUNCTION_GRAPH_TRACER=y (probably already set)
 * CONFIG_DEBUG_INFO=y
 * CONFIG_READABLE_ASM=y
 * CONFIG_LOCKUP_DETECTOR=y
 * CONFIG_DEBUG_LIST=y
 * CONFIG_KGDB=y
 * CONFIG_DEBUG_SET_MODULE_RONX=y
 * CONFIG_CC_STACKPROTECTOR_REGULAR=y (CONFIG_CC_STACKPROTECTOR_STRONG preferred, but we don't have GCC 4.9 yet)
 * CONFIG_VIRTIO_PCI=y
 * CONFIG_NET_9P=y
 * CONFIG_NET_9P_VIRTIO=y
 * CONFIG_9P_FS=y
 * CONFIG_9P_FS_POSIX_ACL=y
 * CONFIG_9P_FS_SECURITY=y

Then compile the kernel:

Now install ZFS. It should pull from your local checkout.

Now you should be able to start a VM:
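The original QEMU command is not reproduced on this page. The sketch below prints one command line that should boot a kernel with the 9p-virtio export as its root filesystem. The kernel image path, memory size, CPU count and `security_model` are all assumptions to adjust. `-nographic` puts the monitor and the first serial console on the terminal.

```shell
ROOTFS=/guest/gentoo
# Assumed location of the kernel built inside the chroot.
KERNEL="$ROOTFS/usr/src/linux/arch/x86/boot/bzImage"

# Print the QEMU invocation so it can be reviewed before running it.
vm_command() {
    printf '%s \\\n' \
        "qemu-system-x86_64 -enable-kvm -smp 2 -m 2048 -nographic" \
        "  -kernel $KERNEL" \
        "  -fsdev local,id=root,path=$ROOTFS,security_model=none" \
        "  -device virtio-9p-pci,fsdev=root,mount_tag=/dev/root"
    printf '%s\n' \
        "  -append 'root=/dev/root rootfstype=9p rootflags=trans=virtio,version=9p2000.L console=ttyS0'"
}

vm_command
```

The `mount_tag` given to the device must match the `root=` value on the kernel command line, because the 9p client looks up the export by that tag.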

== Development ==
You will commit changes to a branch in your git checkouts in your home directory. Inside the chroot, you will rebuild the ZFS kernel modules and userland tools as needed:

Edit the appropriate variables in /etc/portage/make.conf to point to the branch that you want to compile. Sometimes, you will encounter situations where a running virtual machine has old copies of the files from the chroot. You can resolve that by running drop_caches inside the VM:
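A minimal sketch of that step. Writing 3 to /proc/sys/vm/drop_caches asks the kernel to drop the page cache as well as dentries and inodes. The optional argument is a hypothetical convenience so the function can be tried against a scratch file.

```shell
# Flush dirty data, then drop clean caches so the guest re-reads
# files from the 9p-virtio export. Run as root inside the VM.
drop_guest_caches() {
    sync
    echo 3 > "${1:-/proc/sys/vm/drop_caches}"
}

# Inside the VM:
#   drop_guest_caches
```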

== Debugging ==
QEMU's monitor and first serial console are multiplexed on the terminal by the QEMU command in the previous section. Switch between them by typing Ctrl+A C. QEMU's monitor has several interesting functions; one is that it lets you start the gdb server by typing `gdbserver`, after which gdb can be attached to debug the guest. A gdb run on the host cannot find the kernel sources, but this can be resolved by running gdb from the chroot to debug the running VM:

On a 64-bit architecture, it is often necessary to tell gdb the architecture explicitly. Afterward, gdb can be told to attach to the server. The default port is 1234.
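These two steps can be kept in a gdb command file so that they do not need to be retyped each session. A sketch, assuming an x86-64 guest; the file name vm.gdb and the vmlinux path in the comment are assumptions:

```shell
# Write the gdb commands to a file that gdb can replay with -x.
cat > vm.gdb <<'EOF'
set architecture i386:x86-64
target remote localhost:1234
EOF

# Inside the chroot, against the kernel built there (assumed path):
#   gdb -x vm.gdb /usr/src/linux/vmlinux
```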

Since the ZFSOnLinux code is loaded as modules, we will need to tell gdb about them. At present, the only known way of doing this is getting the module address from /proc/modules inside the VM environment and then telling gdb to load the symbol files. The zfs module must be loaded when you access /proc/modules or it will not appear. Loading the module symbol tables into gdb is relatively straightforward:

Generating such commands is time consuming, so we can tell the guest to generate them and print them to stdout for us to copy and paste to the host:
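The original generator is not reproduced here. A sketch of one approach: read /proc/modules, keep the SPL and ZFS modules, and print an `add-symbol-file` command for each. The module list and the function name are assumptions, and the sixth field of /proc/modules is the module's load address, which we treat as the address to give gdb, as described above.

```shell
# Print gdb add-symbol-file commands for the loaded SPL/ZFS modules.
# /proc/modules fields: name size refcount dependents state address
# Run as root inside the guest, otherwise the addresses may read as zero.
# $1: modules listing to read (defaults to /proc/modules)
gen_symbol_cmds() {
    while read -r name size refs deps state addr rest; do
        case "$name" in
            spl|splat|zavl|znvpair|zunicode|zcommon|zfs|zpios) ;;
            *) continue ;;
        esac
        # modinfo -n prints the path of the module's .ko file.
        printf 'add-symbol-file %s %s\n' "$(modinfo -n "$name")" "$addr"
    done < "${1:-/proc/modules}"
}

# Inside the guest:
#   gen_symbol_cmds
```

Because the guest and the chroot share one rootfs, the paths printed in the guest are also valid for a gdb started from the chroot.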

Loading the module symbol files will let you set gdb breakpoints in the ZFS code and step through execution. These breakpoints will be lost if you reset the virtual machine using `system_reset` from the QEMU monitor.

Since gdbserver is running at the level of QEMU, you will not be able to see individual threads. Instead, you will see cores, but that in itself can be advantageous in certain scenarios. Getting direct access to threads requires using kgdb.

== Filebench ==
Stress testing the code via random IO can be helpful when making changes that affect performance or potentially correctness.

== Solaris Porting Layer Tests (splat) ==
The Solaris Porting Layer consists of a module that implements Solaris kernel APIs on top of Linux kernel APIs. SPLAT is a regression test platform for verifying the operation of these APIs. Running the tests is simple:

== XFS Test Suite ==
The XFS Test Suite is a POSIX filesystem API conformance test originally developed by SGI for use on IRIX. It was ported from IRIX to Linux around the time SGI was acquired by Rackable Systems. It is the primary test suite used to test the POSIX filesystem API conformance of Linux filesystems. Brian Behlendorf at LLNL maintains a fork of the XFS Test Suite that has been adapted to permit use with ZFS in two separate branches. It is located at:

https://github.com/behlendorf/xfstests

The zfs branch contains the original fork of the XFS tests from mid-2011. The zfs-upstream branch contains a rebase on upstream that was done in November 2013 with the intention of upstreaming ZFS support. The latter is far easier to use than the former and is presently the master branch, so we shall document use of the latter.

It requires git to fetch it and xfsprogs to build it. Once these requirements are met, we can fetch and build it:

Running the tests requires that `hostname -f` works properly on your system. On a Gentoo workstation, this means editing the 127.0.0.1 line in /etc/hosts to be something like '127.0.0.1 localhost myhostname', where myhostname is the value in /etc/conf.d/hostname. Once that works, we can set up the system to run them:

Unfortunately, not all of the newer XFS tests currently work. Specifically, generic/317 has a bash script that hangs in a subshell. Killing the bash interpreter for that subshell will make it correctly report that generic/317 cannot run because of a missing user account. This needs to be fixed, but it can be worked around by running a subset of tests that are known to actually test things on ZFS:

In addition, a few currently fail as of 0.6.3. These failures need to be investigated and fixed.

== ztest ==
ztest is a userland tool. It links against a userland copy of the ZFS kernel code, creates a pool, and then stresses the code by randomly selecting from many possible operations and running them in many parallel threads at the same time. It will occasionally kill itself and resume to simulate system failures. It has some issues on 9p-virtio, so it needs to be run inside the chroot. This can be done much like we run gdb:
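The original invocation is not reproduced on this page. A sketch of an equivalent, assuming ztest's `-T` (total run time in seconds), `-f` (directory for its files) and `-k` (percentage of runs in which it kills itself) options; verify them against the usage output of your build. The helper only assembles the command line so that the run time can be given in hours.

```shell
# Build a ztest command line.
# $1: run time in hours  $2: directory for ztest's files  $3: kill percentage
ztest_cmd() {
    echo "ztest -T $(( $1 * 3600 )) -f $2 -k $3"
}

# 12 hours, files in /tmp/ztest-0, killed in half of all runs.
ztest_cmd 12 /tmp/ztest-0 50

# To run it inside the chroot, as root:
#   chroot /guest/gentoo-chroot $(ztest_cmd 12 /tmp/ztest-0 50)
```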

The above example will run ztest for 12 hours with its files in /tmp/ztest-0 of the chroot, and it will kill itself in half of all runs. If a segmentation fault or assertion failure occurs, it will die leaving a core dump, which can be analyzed with gdb:

= Profiling =

== Flame Graphs ==
Flame Graphs are an indispensable tool for visualizing profile data. They are documented at Brendan Gregg's site:

http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

They require that something generate data. Perf is the primary tool for doing that on Linux. On other platforms, it is DTrace.

=== Generating a Flame Graph with Perf ===
You will need the following prerequisites to generate a sane flame graph with perf:


 * The kernel must be compiled with CONFIG_FRAME_POINTER=y.
 * The kernel modules must not have been stripped. Some distributions strip them in a way that the file utility does not report. You can use `ls -lh $(modinfo -n zfs)` to check. The unstripped ZFS module should be about 28MB, while the stripped ZFS module should be about 2MB to 3MB.
 * The perf binary must be patched to support profiling of out-of-tree kernel modules. perf 3.15 and later include the patch.
 * You must obtain the flame graph scripts from Brendan Gregg's git repository on GitHub.
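The size check for stripped modules can be scripted. A rough sketch based on the sizes quoted above; the 10 MB cut-off is an arbitrary assumption chosen to fall between them, and the function name is hypothetical:

```shell
# Succeed if the module file is large enough to still carry its
# symbols and debug information.
# $1: path to a .ko file
module_looks_unstripped() {
    size=$(( $(wc -c < "$1") ))
    [ "$size" -gt 10000000 ]
}

# Usage:
#   module_looks_unstripped "$(modinfo -n zfs)" && echo "zfs.ko is not stripped"
```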

Once those requirements are met, it is simple to profile and generate flame graphs. The general idea is to generate a perf.data file containing the trace. The perf.data file does not have symbols, so the perf script command is used to convert it into a human readable format using symbols from your system. After that, the data can be processed anywhere to generate the flame graph.

==== Flame Graph of Entire System while a userland program is running ====
The program chosen for this example is the sleep command with argument 60, so that it stays alive for 60 seconds and then terminates, at which point perf writes its output. This is good if you want a timer. If you have a specific program in mind and want to profile the whole system while it runs, you can invoke it from perf instead:
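The original commands are not reproduced here. A sketch of the usual pipeline, assuming the flame graph scripts `stackcollapse-perf.pl` and `flamegraph.pl` live in the directory named by `FLAMEGRAPH_DIR` (a hypothetical variable) and a sampling rate of 99 Hz:

```shell
# Directory holding the flame graph scripts (assumed checkout location).
FLAMEGRAPH_DIR=${FLAMEGRAPH_DIR:-$HOME/FlameGraph}

# Profile the whole system while a command runs, then render an SVG.
# $1: output SVG; remaining arguments: the command to run while sampling.
flamegraph_system() {
    out=$1
    shift
    # -a samples all CPUs, -g records call graphs, -F sets the frequency.
    perf record -F 99 -a -g -- "$@" || return 1
    # perf script resolves symbols; the scripts fold and draw the stacks.
    perf script | "$FLAMEGRAPH_DIR/stackcollapse-perf.pl" |
        "$FLAMEGRAPH_DIR/flamegraph.pl" > "$out"
}

# Usage (system-wide sampling typically requires root):
#   flamegraph_system system.svg sleep 60
```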

==== Flame Graph of Entire System until perf record is killed ====
The perf record command below will block indefinitely until interrupted with Ctrl+C (SIGINT), at which point it writes perf.data. Killing it with SIGKILL will prevent it from writing perf.data.

==== Flame Graph of all threads with a unique name ====
We use sleep 60 here to perform this over 60 seconds, but any of the previous variations on invoking perf record would work.

==== Flame Graph of a specific thread by PID ====
We use sleep 60 here to perform this over 60 seconds, but any of the previous variations on invoking perf record would work.

==== Flame Graph of all frames with a specific function ====
We use sleep 60 here to perform this over 60 seconds, but any of the previous variations on invoking perf record would work.

The purpose of this particular flame graph is to identify the call paths of specific functions under a workload. By identifying the call paths and examining the code for the stack frames, we can gain insight into things like concurrency by thinking about whether two stacks can occur simultaneously and what that means for performance with respect to the locks held by each stack frame. We can also use the knowledge gained to design workloads that exercise particular functions, which is useful when evaluating the impact of changes on concurrency.
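One way to select those call paths, sketched under the assumption that the stacks have already been folded by `stackcollapse-perf.pl` into lines of the form `frame1;frame2;...;frameN count`. The function name `filter_folded` and the ZFS function names in the usage line are illustrative only.

```shell
# Keep only the folded stacks in which a whole frame matches the pattern.
# $1: extended regular expression naming the function(s) of interest.
filter_folded() {
    grep -E "(^|;)($1)(;| [0-9]+\$)"
}

# Usage, with out.folded produced by stackcollapse-perf.pl:
#   filter_folded 'zio_wait|zil_commit' < out.folded | ./flamegraph.pl > zio.svg
```

Anchoring the pattern on the `;` separators avoids matching functions whose names merely contain the one you asked for.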

== Perf ==
Brendan Gregg has excellent documentation on profiling with perf. There is not much one can add to it:

http://www.brendangregg.com/perf.html

== DTrace ==
There is an experimental Linux DTrace port called DTrace4Linux. It is prone to crash systems and should not be run in production. If you wish to profile inside a test environment and are willing to help development, then it is an option:

https://github.com/dtrace4linux/linux