From Gentoo Wiki

Deduplication uses the clone mechanism of a copy-on-write (CoW) capable filesystem. This feature lets copied but identical files share their data, much like a hardlink, until one of the copies is actually written to and thereby changed. The copy operation is in effect delayed until the first write, hence the name copy-on-write. If implemented on a block level, only modified blocks are stored separately in the filesystem, which saves space by sharing identical blocks between multiple files.

Copy-on-write (CoW) can be implemented in-band or out-of-band.[1] The latter is called deduplication and requires a user application that compares files or blocks and sets the CoW status for identical blocks in the filesystem.


On Linux, only a handful of filesystems implement CoW, namely bcachefs, Btrfs, OCFS2 and XFS. The clone ioctl kernel functions were previously private to Btrfs, where CoW debuted on Linux, and were moved to the Virtual File System (VFS) layer starting with Linux kernel 4.5 so that any CoW-supporting filesystem can make use of them.[2] The first additional filesystem to implement CoW was XFS.[3]

Some filesystem tools themselves support deduplication, like the Btrfs subvolumes feature. There are also in-band filesystem options, such as XFS' always_cow sysfs switch.[4]

Applications with deduplication support

The most basic way to deduplicate a file is to clone it with cp --reflink. cp is part of the sys-apps/coreutils package. At first the result is almost identical to a hardlink, in that both files use the same blocks of data on the storage device. The major difference is that with hardlinks, a change to one file changes every linked file as well. With clones (deduplicated files), however, the other files that use data from the same blocks are preserved, and only the changed file, or the changed blocks of that file, are written to storage, hence the name copy-on-write.

user $cp --reflink=always sourcefile destfile

Unlike hardlinks, changing either destfile or sourcefile will preserve the other. Copy-on-write essentially keeps the files separate while (at least initially) benefiting from the same space advantage as hardlinks. Whether a change rewrites the whole file or only the changed block (chunk) of an initially deduplicated file is unclear, and it depends heavily on how an application implements writing files to disk.
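The difference from a hardlink can be observed with a short experiment. The following is a minimal sketch, assuming GNU coreutils; the file names are made up for illustration, and --reflink=auto is used so that the script also runs on filesystems without CoW support, where the clone silently becomes a regular copy.

```shell
#!/bin/sh
# Illustrative only: the file names below are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "original" > sourcefile

# Hardlink: a second name for the same inode.
ln sourcefile hardlink
# Clone: a separate inode that shares data blocks where the filesystem allows it.
cp --reflink=auto sourcefile clone

# Overwrite the source in place (same inode, new content).
echo "changed" > sourcefile

hardlink_content=$(cat hardlink)
clone_content=$(cat clone)
echo "hardlink: $hardlink_content"
echo "clone:    $clone_content"
```

The hardlink follows the change to sourcefile, while the clone still holds the original content.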

If the filesystem doesn't support copy-on-write (CoW), cp will abort with an error message. With the --reflink=auto parameter, cp will automatically make a regular copy instead when CoW is not available.

user $cp --reflink sourcefile destfile
cp: failed to clone 'destfile' from 'sourcefile': Operation not supported
user $cp --reflink=auto --verbose sourcefile destfile
'sourcefile' -> 'destfile'
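When a script needs to know whether a real clone was made, the two modes can be combined by hand. This is only a sketch: clone_or_copy is a hypothetical helper name, not something provided by coreutils, and the temporary files exist purely for demonstration.

```shell
#!/bin/sh
# clone_or_copy is a hypothetical helper, not part of coreutils.
clone_or_copy() {
    src=$1
    dst=$2
    # First try a real clone; this fails on filesystems without CoW support.
    if cp --reflink=always -- "$src" "$dst" 2>/dev/null; then
        echo "cloned"
    else
        # Fall back to an ordinary copy so the caller still gets the file.
        cp -- "$src" "$dst" && echo "copied"
    fi
}

demo_dir=$(mktemp -d)
echo "some data" > "$demo_dir/sourcefile"
result=$(clone_or_copy "$demo_dir/sourcefile" "$demo_dir/destfile")
echo "$result"
```

The helper prints "cloned" on a CoW-capable filesystem and "copied" otherwise. In both cases destfile ends up with the same content as sourcefile.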

There are additional user applications that compare existing files and deduplicate them, which frees disk space. The most common tools are sys-fs/duperemove, app-misc/fdupes and app-misc/jdupes. sys-fs/bees works at the block level, but is limited to Btrfs.
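To get a rough idea of what such tools look for, identical files can be listed with standard utilities alone. The sketch below is not how any of the tools above are implemented. It only groups whole files by their SHA-256 hash and assumes GNU find, sha256sum, sort and uniq. The sample files are hypothetical, and no deduplication is performed.

```shell
#!/bin/sh
# Hypothetical sample data; point scan_dir at a real directory to inspect it.
scan_dir=$(mktemp -d)
printf 'same content\n' > "$scan_dir/a.txt"
printf 'same content\n' > "$scan_dir/b.txt"
printf 'different\n'    > "$scan_dir/c.txt"

# Hash every regular file, sort by hash, and keep only lines whose first
# 64 characters (the SHA-256 hex digest) occur more than once.
duplicates=$(find "$scan_dir" -type f -exec sha256sum {} + | sort | uniq -w 64 -D)
echo "$duplicates"
```

Only a.txt and b.txt are listed, since they share the same hash. A dedicated tool would then ask the kernel to share the underlying blocks of such files.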


The obvious benefit of deduplication and copy-on-write is to regain valuable storage space. It might be argued that in-band copy-on-write may also be beneficial for reducing wear on SSD storage by reducing writes to the device, similar to Portage TMPDIR on tmpfs. However, a wear reducing factor is uncertain when a write operation has already occurred, which is always the case when using out-of-band deduplication tools.

Practical use scenarios

Portage hooks

Deduplication can be hooked into pkg_postinst for specified packages using the standard Portage facilities. For example, to dedupe the Linux kernel sources from package sys-kernel/gentoo-sources after emerging each new version, a Portage environment can be added under /etc/portage/package.env. This will save space for unchanged files of each installed kernel source version under /usr/src/.

The following example uses duperemove:

FILE /etc/portage/env/sys-kernel/gentoo-sources
function post_pkg_postinst() {
    # Deduplicate all installed kernel source trees after the package is merged.
    echo ":: Running duperemove in /usr/src/"
    duperemove -r -d -h -q /usr/src/
}
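A slightly more defensive variant of this hook might check its preconditions first, so that a missing tool or directory never interferes with the emerge. This is a sketch under assumptions: DEDUPE_DIR is a hypothetical variable introduced here to make the target directory overridable, and it is not a Portage setting.

```shell
# Sketch of a more defensive hook; DEDUPE_DIR is a hypothetical variable
# introduced here so that the target directory can be overridden.
post_pkg_postinst() {
    dedupe_dir=${DEDUPE_DIR:-/usr/src/}

    # Skip when duperemove is not installed.
    if ! command -v duperemove >/dev/null 2>&1; then
        echo ":: duperemove not found, skipping deduplication"
        return 0
    fi
    # Skip when the target directory is missing.
    if [ ! -d "$dedupe_dir" ]; then
        echo ":: $dedupe_dir does not exist, skipping deduplication"
        return 0
    fi

    echo ":: Running duperemove in $dedupe_dir"
    duperemove -r -d -h -q "$dedupe_dir"
}
```

Returning 0 in the skip cases keeps the hook from reporting a failure when there is simply nothing to do.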

Additionally, genkernel from sys-kernel/genkernel can trigger deduplication through the callback command configured in /etc/genkernel.conf.

FILE /etc/genkernel.conf
CMD_CALLBACK="duperemove -r -d -h -q /usr/src/"

See also

  • Duperemove — a btrfs tool for finding duplicated extents and submitting them to the kernel for deduplication
  • fdupes — a tool for identifying duplicate files across a set of directories.

External resources