Deduplication

From Gentoo Wiki

Deduplication uses the clone mechanism of a copy-on-write (CoW) capable filesystem. Cloning lets copied but identical files share their data, much like a hard link, until one of the copies is written to and thereby changed. The copy operation is in effect delayed until a write happens, hence the name copy-on-write. If implemented at the block level, only modified blocks are stored separately in the filesystem, which saves space by sharing identical blocks between multiple files.

Copy-on-write (CoW) can be implemented in-band or out-of-band.[1] The latter is called deduplication and requires a user application that compares files or blocks and sets the CoW status for identical blocks in the filesystem.
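The comparison step of an out-of-band tool can be illustrated with a minimal sketch. It only lists files with identical content as deduplication candidates and does not change anything in the filesystem. The function name find_duplicate_candidates is hypothetical, and the use of SHA-256 checksums with GNU find, sort and uniq is an assumption made for illustration, not a description of how any particular deduplication tool works.

```shell
# Hypothetical helper: list groups of files below a directory whose
# contents have the same SHA-256 checksum (candidates for deduplication).
find_duplicate_candidates() {
    dir=$1
    # Hash every regular file, sort by checksum, then keep only the lines
    # whose first 64 characters (the hex digest) occur more than once.
    # Groups of identical files are separated by an empty line.
    find "$dir" -type f -exec sha256sum {} + \
        | sort \
        | uniq -w 64 --all-repeated=separate
}
```

Calling find_duplicate_candidates /usr/src prints one line per duplicate file, grouped by checksum. A real tool would then ask the kernel to share the underlying blocks instead of only reporting them.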

Filesystems

On Linux, only a handful of filesystems implement CoW, namely bcachefs, Btrfs, OCFS2 and XFS. The clone ioctl kernel functions were previously private to Btrfs, where CoW debuted on Linux, and moved to the Virtual File System (VFS) layer starting with Linux kernel 4.5 so that any CoW-supporting filesystem can make use of them.[2] The first additional filesystem to implement CoW was XFS.[3]
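A quick way to see whether a path lives on one of these filesystems is to query the filesystem type. The following is a rough sketch: the helper names is_cow_fstype and fstype_of are hypothetical, the list of names is taken from the paragraph above, and it assumes that GNU stat reports the type under exactly these names. A matching name is only a hint, since XFS, for example, supports clones only when the filesystem was created with reflink enabled.

```shell
# Hypothetical helper: succeed if the given filesystem type name is one
# of the CoW-capable filesystems named in the text.
is_cow_fstype() {
    case $1 in
        bcachefs|btrfs|ocfs2|xfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print the type of the filesystem holding a path,
# as reported by GNU stat.
fstype_of() {
    stat -f -c %T -- "$1"
}
```

For example, is_cow_fstype "$(fstype_of /usr/src)" && echo "clones may be supported" prints the message only on a filesystem from the list.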

Some filesystem tools themselves support deduplication, like the Btrfs subvolumes feature. There are also in-band filesystem options, such as XFS' always_cow sysfs switch.[4]

Applications with deduplication support

There are user applications that compare existing files and deduplicate them, which frees disk space. The most common tools are:


Various tools also support CoW themselves when copying files by using the appropriate Linux syscalls if available:


Applications missing support:

GNU coreutils

Tip
GNU coreutils 9.0 and newer default to --reflink=auto for cp and install.

The most basic way to deduplicate a file is to clone it with cp --reflink. cp is part of the sys-apps/coreutils package. At first the result is almost identical to a hard link, in that both files use the same blocks of data on the storage device. The major difference is that a change made through one hard link is visible through every other link, because all links refer to the same file. With clones (deduplicated files), the other files that use data from the same blocks are preserved and only the changed file, or the changed blocks of that file, are written to storage, hence the name copy-on-write.

user $ cp --reflink=always sourcefile destfile

Unlike hard links, changing either destfile or sourcefile preserves the other. Copy-on-write keeps the files separate while, at least initially, offering the same space advantage as hard links. Whether a later change rewrites the whole file or only the changed blocks of an initially deduplicated file depends heavily on how the application writes files to disk.
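The difference between a hard link and a clone can be tried out in a scratch directory. This is a small sketch that assumes GNU coreutils; it uses --reflink=auto so that it also runs on filesystems without CoW support, where cp makes a regular copy and the visible result is the same. The file names are only examples.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
printf 'original\n' > "$workdir/sourcefile"

# Create a clone (or a regular copy where cloning is unavailable)
# and a hard link of the same source file.
cp --reflink=auto "$workdir/sourcefile" "$workdir/clone"
ln "$workdir/sourcefile" "$workdir/hardlink"

# Appending through the hard link changes sourcefile as well ...
printf 'appended via hardlink\n' >> "$workdir/hardlink"
# ... whereas appending to the clone leaves sourcefile untouched.
printf 'appended to clone\n' >> "$workdir/clone"

# Prints "original" followed by "appended via hardlink".
cat "$workdir/sourcefile"
```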

If the filesystem doesn't support copy-on-write (CoW), cp --reflink will abort with an error message. With the --reflink=auto parameter, cp automatically falls back to a regular copy when CoW is not available.

user $ cp --reflink sourcefile destfile
cp: failed to clone 'destfile' from 'sourcefile': Operation not supported
user $ cp --reflink=auto --verbose sourcefile destfile
'sourcefile' -> 'destfile'
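Because --reflink=auto falls back silently, a script cannot tell whether space was actually shared. One possible approach, sketched here under the assumption of GNU cp, is to try --reflink=always first and fall back explicitly. The function name clone_or_copy is hypothetical.

```shell
# Hypothetical helper: clone src to dst if the filesystem allows it,
# otherwise make a regular copy. Prints which of the two happened.
clone_or_copy() {
    src=$1
    dst=$2
    # --reflink=always fails instead of falling back, so the exit
    # status tells whether a clone was possible.
    if cp --reflink=always -- "$src" "$dst" 2>/dev/null; then
        echo "cloned"
    else
        cp -- "$src" "$dst" && echo "copied"
    fi
}
```

Running clone_or_copy sourcefile destfile prints "cloned" when the files share their blocks and "copied" when a regular copy had to be made.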

Portage

Portage uses copy_file_range or sendfile, if available, when merging packages from PORTAGE_TMPDIR to the live filesystem. This support is implemented as a C extension with USE=native-extensions, which is enabled by default for Portage.[5]

Portage 3.0.48 and newer will also avoid overwriting files on the live filesystem if they're identical, as implemented for bug #722270.
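The idea of not overwriting identical files can be sketched in a few lines of shell. This is only an illustration of the principle and is not taken from Portage's source code; the function name install_if_changed is hypothetical, and GNU cp and cmp are assumed.

```shell
# Hypothetical helper: write src to dst only if dst is missing or its
# content differs, so that identical files are left untouched.
install_if_changed() {
    src=$1
    dst=$2
    if [ -f "$dst" ] && cmp -s -- "$src" "$dst"; then
        # Same content: skip the write and keep any shared blocks.
        echo "unchanged"
    else
        cp --reflink=auto -- "$src" "$dst" && echo "written"
    fi
}
```

Skipping the write matters for deduplication because rewriting a file, even with identical content, would break the sharing of blocks that were deduplicated earlier.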

Benefits

The obvious benefit of deduplication and copy-on-write is to regain valuable storage space. It might be argued that in-band copy-on-write may also be beneficial for reducing wear on SSD storage by reducing writes to the device, similar to Portage TMPDIR on tmpfs. However, a wear reducing factor is uncertain when a write operation has already occurred, which is always the case when using out-of-band deduplication tools.

Practical use scenarios

Portage hooks

Deduplication can be hooked into pkg_postinst for specified packages using the standard Portage facilities. For example, to deduplicate the kernel sources installed by sys-kernel/gentoo-sources after each new version is emerged, a Portage environment file can be added under /etc/portage/env/. This saves space for the unchanged files of each installed kernel source version under /usr/src/.

The following example uses duperemove:

FILE /etc/portage/env/sys-kernel/gentoo-sources
function post_pkg_postinst() {
    echo ":: Running duperemove in /usr/src/"
    duperemove -r -d -h -q /usr/src/
}
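To get a rough idea of how much space a deduplication run frees, the used space of the filesystem can be compared before and after. The following sketch assumes GNU df; the function names used_kib and report_saved_space are hypothetical. The result is approximate, since other processes may write to the filesystem at the same time and some filesystems update their usage figures with a delay.

```shell
# Hypothetical helper: print the used space, in KiB, of the filesystem
# that holds the given path.
used_kib() {
    df -k --output=used -- "$1" | tail -n 1 | tr -d ' '
}

# Hypothetical helper: run a command and report how much less space is
# in use afterwards on the filesystem holding the given path.
report_saved_space() {
    target=$1
    shift
    before=$(used_kib "$target")
    "$@"
    after=$(used_kib "$target")
    echo "freed approximately $((before - after)) KiB on $target"
}
```

With the command from the hook, this could be called as report_saved_space /usr/src duperemove -r -d -h -q /usr/src/.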

Genkernel hooks

Additionally, deduplication can be triggered from genkernel (sys-kernel/genkernel) by setting a callback command in /etc/genkernel.conf:

FILE /etc/genkernel.conf
CMD_CALLBACK="duperemove -r -d -h -q /usr/src/"

See also

  • Duperemove — a btrfs and XFS tool for finding duplicated extents and submitting them to the kernel for deduplication
  • fdupes — a tool for identifying duplicate files across a set of directories.

External resources

References

  1. Read the Docs: Deduplication
  2. IOCTL_FICLONERANGE(2) from the Linux Programmer's Manual
  3. xfs: add reflink and dedupe support on LWN.net, 29 Sep 2016
  4. XFS Copy-On-Write Support Being Improved, Always CoW Option, phoronix, 19 Feb 2019
  5. portage_util_file_copy_reflink_linux.c, Portage source code (3.0.49), 10 Jul 2023