Device-mapper

Users rarely use dmsetup directly. It is a very low-level tool and difficult to use. LVM, mdtool or cryptsetup are generally preferred, as they take care of saving the metadata and issuing the dmsetup commands. However, it is sometimes desirable to work with dmsetup directly, either for recovery purposes or to use a target that hasn't yet been ported to LVM.

Create
The create command activates a new device mapper device. It appears in /dev/mapper. In addition, if the target has metadata, it reads it or, if this is its first use, it initializes the metadata devices. Note that previously created device mapper devices can be passed as parameters (if the target takes a device), so it is possible to "stack" them. The syntax is: dmsetup create device_name --table "logical_start_sector num_sectors target_type target_args"

Remove
The remove command deactivates a device mapper device and removes it from /dev/mapper. The syntax is: dmsetup remove [-f] device_name. Note that it is not possible to remove a device that's in use. The -f option may be passed to replace the target with one that fails all I/O, hopefully allowing the reference count to drop to 0.

Message
The message command sends a message to the device. Which messages are supported depends on the target. The syntax is: dmsetup message device_name sector message. The sector argument tends not to be used and is almost always 0.

Suspend
The suspend command stops any NEW I/O. Existing I/O will still be completed. This can be used to quiesce a device. The syntax is: dmsetup suspend device_name

Resume
The resume command allows I/O to be submitted to a previously suspended device. The syntax is: dmsetup resume device_name

Reload
The reload command replaces an existing device, possibly with new targets and/or parameters. It may be necessary to suspend the target first. Some targets are immutable and can't be replaced with this command. The syntax is the same as create.
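The suspend, reload and resume commands are usually combined to swap a table on a live device. The following is a minimal sketch of that sequence, not a definitive procedure. The run helper, the swap_table function and the device name "example" are all made up for illustration, and by default the commands are only printed so the sketch can be tried without root.

```shell
# Dry-run helper (hypothetical): print each command instead of executing it.
# Set DRY_RUN=0 and run as root to execute the commands for real.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: swap_table DEVICE_NAME "TABLE"
swap_table() {
    name=$1
    table=$2
    run dmsetup suspend "$name"                  # stop new I/O, let in-flight I/O finish
    run dmsetup reload "$name" --table "$table"  # load the replacement table
    run dmsetup resume "$name"                   # switch to the new table, allow I/O again
}

# Replace the table of the placeholder device "example" with a zero target.
swap_table example "0 1953125 zero"
```

Because the table is passed as a single argument, multi-line tables would have to be fed on stdin instead.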

Zero
See Documentation/device-mapper/zero.txt for usage. This target has no target-specific parameters.

The "zero" target creates a device that functions similarly to /dev/zero: all reads return binary zeros, and all writes are discarded. It is normally used in tests, but is also useful in recovering linear and raid-type targets when combined with the 'snapshot' target: a "zero" target of the same size as the missing piece(s) is created, then a (writable) snapshot of it is created (usually on a loop device backed by a large sparse file, but it can be far smaller than the missing piece since it only has to hold the changes). The snapshot can then be mounted, fsck'd, or have recovery tools run against it.

This creates a 1GB (1953125-sector) zero target:
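As a rough sketch, the table for such a device can be generated as follows. The helper names and the device name "zero1" are placeholders chosen for this example. The block only prints the table, and the real dmsetup call is left as a comment because it needs root.

```shell
# Convert a byte count to 512-byte sectors (rounding down).
bytes_to_sectors() { echo $(( $1 / 512 )); }

# Print a one-line table for a zero target of the given length in sectors.
zero_table() { echo "0 $1 zero"; }

sectors=$(bytes_to_sectors 1000000000)   # 1 GB (decimal)
zero_table "$sectors"

# To activate it (as root; "zero1" is a placeholder name):
#   dmsetup create zero1 --table "$(zero_table "$sectors")"
```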

Linear
See Documentation/device-mapper/linear.txt for parameters and usage. The linear target is the basic building block of the device mapper: it is used to both join and split block devices (and often both at once). For a simple identity mapping:

The 4 disks can be joined together as one:
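A minimal sketch of how such a joined table can be generated, assuming four equally sized 1953125-sector devices. The loop device paths, the join_table helper and the name "joined" are placeholders for this example. Each line's start sector is the running total of the lengths before it.

```shell
# Print a multi-line linear table that concatenates equally sized devices.
# usage: join_table SECTORS_PER_DEVICE DEVICE...
join_table() {
    size=$1
    shift
    start=0
    for dev in "$@"; do
        echo "$start $size linear $dev 0"
        start=$(( start + size ))   # next device begins where this one ends
    done
}

join_table 1953125 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3

# To activate it (as root), feed the multi-line table on stdin:
#   join_table 1953125 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 | dmsetup create joined
```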

Note the peculiar syntax on the join. The --table argument only allows single-line tables; multi-line tables must be read from stdin. Also notice that the logical_start_sector is not 0 in this case, as each device being appended needs to start where the previous one ends. It's possible to split a disk, in this case into a "small" 4 MiB (8192-sector) disk and a "large" 1 GB (1953125-sector) disk:
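The split can be sketched as two single-line tables over the same underlying device. The slice_table helper and /dev/loop0 are placeholders. Each table starts at logical sector 0, and only the offset into the underlying device differs.

```shell
# Print a linear table exposing LENGTH sectors of DEVICE starting at OFFSET.
# usage: slice_table DEVICE OFFSET LENGTH
slice_table() { echo "0 $3 linear $1 $2"; }

slice_table /dev/loop0 0 8192          # the "small" 4 MiB device
slice_table /dev/loop0 8192 1953125    # the "large" device, starting 8192 sectors in
```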

Note that in the second device the offset is not 0, since it is desired to start 4 MiB (8192 sectors) in. Both joining and splitting can be combined:

This creates a 4 GB device using the last 1 GB of each disk.

Snapshot
See Documentation/device-mapper/snapshot.txt for parameters and usage. The snapshot target is used to create a special copy-on-write device to store changes for an origin device. If a change is made to an origin, the old chunk is copied to the snapshot device. This is useful for backups. It is possible to write to the snapshot device directly as well, and thus test changes before committing them to the origin.

The origin device should already be populated. To mark a device as an origin, use the snapshot-origin target:

Creating snapshots
A snapshot can then be created through the snapshot target. This has 2 important arguments:
 * persistent controls whether or not this snapshot is invalidated at restart. Values are P for persistent (survives reboot) and N for non-persistent (invalid upon reboot). Non-persistent snapshots have less metadata associated with them.
 * chunksize controls the granularity of copying with the snapshot. Chunks are copied to the snapshot device in intervals of this value (in sectors).

The origin should be suspended before creating the snapshot device, if it is a device mapper device. To create a snapshot device:
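A minimal sketch of a snapshot table, assuming the argument order origin, copy-on-write device, persistent flag, chunksize. The device paths, the 16-sector chunk size and the snapshot_table helper are placeholders chosen for this example.

```shell
# usage: snapshot_table LENGTH ORIGIN_DEV COW_DEV P|N CHUNKSIZE
snapshot_table() {
    case $4 in
        P|N) ;;                                   # persistent or non-persistent
        *) echo "persistent must be P or N" >&2
           return 1 ;;
    esac
    echo "0 $1 snapshot $2 $3 $4 $5"
}

# Persistent snapshot of /dev/loop0, changes stored on /dev/loop1, 16-sector chunks.
snapshot_table 1953125 /dev/loop0 /dev/loop1 P 16
```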

Note how the origin device is not the same device as the one we just created, but rather the origin's underlying device.

Merging snapshots
To merge a snapshot, the origin must be suspended, the snapshot device unmapped, the snapshot-origin target replaced with the snapshot-merge target, and the origin resumed:

At that point the dmsetup status output will need to be polled to find out when the merge is complete. Once the merge is complete, the snapshot-merge target should be replaced with the snapshot-origin target again:

A new snapshot can then be created, reusing the now freed-up old snapshot device if desired.
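The polling step of the merge can be sketched as below. This assumes the status format described in the kernel's snapshot documentation, where the line ends with allocated/total followed by the metadata sector count, and the merge is finished once the allocated count has dropped to the metadata count. The merge_done helper, the sample status lines and the device name "base" are made up for illustration.

```shell
# Succeeds when a snapshot-merge status line indicates the merge has finished.
# usage: merge_done "STATUS_LINE"
merge_done() {
    set -- $1                           # split the status line into words
    while [ $# -gt 2 ]; do shift; done  # keep only the last two fields
    allocated=${1%%/*}                  # "allocated/total" -> allocated
    metadata=$2
    [ "$allocated" -eq "$metadata" ]
}

# Illustrative status line of a merge that is still in progress.
if merge_done "0 8388608 snapshot-merge 281688/2097152 1104"; then
    echo "merge complete"
else
    echo "still merging"
fi

# Polling a real device (as root; "base" is a placeholder name):
#   until merge_done "$(dmsetup status base)"; do sleep 1; done
```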

Mirror
There is no kernel documentation for the mirror target. Parameters obtained from Linux sources: drivers/md/dm-log.c and drivers/md/dm-raid1.c

For log_type there are 4 values with different arguments:
 * core region_size [[no]sync]
 * disk logdevice region_size [[no]sync]

And the values of each argument:
 * region_size is the region size of the mirror in sectors. It must be a power of 2 and at least the size of a kernel page (for Intel x86/x64 processors, this is 4 KiB, or 8 sectors). This is the granularity at which the mirror is kept up to date. It's a tradeoff between increased metadata and wasted I/O. LVM uses a value of 512 KiB (1024 sectors).
 * logdevice is the device in which to store the metadata, for the disk log types.
 * [no]sync is an optional argument. The default is sync. nosync skips the sync step, but reads from any regions not written since the mirror was established are undefined. This is appropriate to use when the initial device is empty.

And there is only 1 feature:
 * handle_errors causes the mirror to respond to an error. Default is to ignore all errors. LVM enables this feature.

To create a mirror with in-memory log:
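A sketch of a mirror table with a core (in-memory) log, assuming the layout log_type, number of log arguments, log arguments, number of devices, device/offset pairs, number of features, features. The device paths and the mirror_core_table helper are placeholders. The region size of 1024 sectors and the handle_errors feature follow the LVM behaviour described above.

```shell
# usage: mirror_core_table LENGTH REGION_SIZE DEVICE...
mirror_core_table() {
    len=$1
    region=$2
    shift 2
    legs=""
    for dev in "$@"; do
        legs="$legs $dev 0"             # each leg is "device offset"
    done
    # "core 1 REGION": core log with one argument (the region size)
    echo "0 $len mirror core 1 $region $#$legs 1 handle_errors"
}

mirror_core_table 1953125 1024 /dev/loop0 /dev/loop1
```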

Without a persistent log, the mirror will have to be recreated every time by copying the entire block device to the other "legs". To avoid this, the log may be stored on disk:

It's possible to do the equivalent of LVM's "--mirrorlog mirror" by creating 2 mirrors: a core mirror for the log device, and a disk mirror for the data devices:

RAID1
See Documentation/device-mapper/dm-raid.txt for parameters and usage. This target has a few important parameters:


 * chunk_size is unused, but its value is required. Set it to 0.
 * region_size has the same meaning as it does in the mirror target. Unlike the mirror target, it has a default of 4 MiB (8192 sectors). LVM uses a region size of 512 KiB (1024 sectors).
 * [no]sync has the same meaning as it does in the mirror target

To create a simple 1 GB raid1 with no metadata devices:
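A sketch of such a table, assuming the raid target layout of raid type, number of raid parameters, the parameters, number of devices, then metadata/data device pairs, with "-" standing for a missing metadata device. The device paths and the raid1_table helper are placeholders for this example.

```shell
# usage: raid1_table LENGTH REGION_SIZE DEVICE...
raid1_table() {
    len=$1
    region=$2
    shift 2
    devs=""
    for dev in "$@"; do
        devs="$devs - $dev"             # "-" means no metadata device for this leg
    done
    # 3 raid parameters: chunk_size (0, unused) plus the "region_size N" pair
    echo "0 $len raid raid1 3 0 region_size $region $#$devs"
}

raid1_table 1953125 1024 /dev/loop0 /dev/loop1
```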

Note that because there's no metadata device, the array must be re-mirrored each time it is created, so a metadata device is normally desired. Each "leg" needs its own metadata device. If /dev/loop2 and /dev/loop3 are small metadata devices (4 MiB), then creating a 1 GB RAID1 would be:

Striped (RAID 0) and RAID 4/5/6/10
See Documentation/device-mapper/striped.txt and Documentation/device-mapper/dm-raid.txt for parameters and usage. The striped target aggregates several disks and splits I/O amongst them for performance. There are a few important parameters. For both the raid and striped targets:
 * chunk_size is the size of an I/O (in sectors) before it is "split" across the array. It must be both a power of two and at least as large as a kernel memory page (for x86/x64 processors, pages are 4 KiB, so it must be at least 8). LVM uses a default value of 64 KiB (128 sectors). Using LVM defaults, a 1 MiB (2048 sector) write will be split into 16 chunks, distributed as evenly as possible across the array. The size of the array MUST be a multiple of this value. Otherwise the target will give the error "Array size does not match requested target length".

These parameters are only applicable to the raid target:
 * region_size has the same meaning and defaults as it does for the RAID1 target.
 * [no]sync has the same meaning as it does for the RAID1 target. It is usually not appropriate for RAID 4, 5 and 6, as parity must still be computed even for blank devices, unless creating a degraded array.

Because the number of sectors (1953125) is not a multiple of 128, it must be rounded down to the nearest multiple of 128 sectors, which can be done using this formula:

So in this case:
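The rounding can be sketched with shell integer arithmetic: integer division discards the remainder, and multiplying back gives the nearest lower multiple. The round_down helper is a name made up for this example.

```shell
# Round SECTORS down to the nearest multiple of CHUNK.
# usage: round_down SECTORS CHUNK
round_down() { echo $(( $1 / $2 * $2 )); }

round_down 1953125 128   # prints 1953024
```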

Striped (RAID0)
Stripe sets allow multiple disks to be combined into one with improved performance. The striped target's parameters are ordered differently from the raid target's. First, the number of devices comes first, not the chunk size. Second, one must specify the offset (usually 0) of each device that makes up the stripe set. Because there are 4 disks of 1953024 sectors each, the total array size will be 7812096 sectors. To create a stripe set (RAID0):
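A sketch of the striped table for this case, assuming the layout number of stripes, chunk size, then device/offset pairs. The device paths and the striped_table helper are placeholders. The total length is the per-device size multiplied by the number of devices.

```shell
# usage: striped_table CHUNK_SIZE SECTORS_PER_DEVICE DEVICE...
striped_table() {
    chunk=$1
    per_dev=$2
    shift 2
    devs=""
    for dev in "$@"; do
        devs="$devs $dev 0"             # each member is "device offset"
    done
    echo "0 $(( per_dev * $# )) striped $# $chunk$devs"
}

striped_table 128 1953024 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
```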

RAID4
RAID4 is a striped set that can tolerate the failure of a single disk. Because RAID4 uses a dedicated parity disk, one disk is "unusable", therefore the total space is 3 disks * 1953024 sectors, or a total of 5859072 sectors. To create a RAID4 set with no metadata devices:
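The capacity calculation and the table can be sketched together. The raid_table helper and device paths are placeholders, and the number of parity disks is passed in explicitly (1 for RAID4 and RAID5, 2 for RAID6) rather than derived from the raid type. Only the chunk size is given as a raid parameter, and "-" again marks a missing metadata device.

```shell
# usage: raid_table RAID_TYPE PARITY_DISKS CHUNK_SIZE SECTORS_PER_DEVICE DEVICE...
raid_table() {
    level=$1
    parity=$2
    chunk=$3
    per_dev=$4
    shift 4
    devs=""
    for dev in "$@"; do
        devs="$devs - $dev"             # no metadata devices
    done
    # usable length = (devices - parity disks) * sectors per device
    echo "0 $(( ($# - parity) * per_dev )) raid $level 1 $chunk $#$devs"
}

raid_table raid4 1 128 1953024 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
```

Calling it with raid6_zr and 2 parity disks yields a length of 3906048 sectors for the same four devices.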

As with RAID1, because there are no metadata devices, the parity disk will have to be rebuilt every time the array is assembled. To create a RAID4 WITH metadata devices:

It is possible to create a RAID4 in degraded mode initially. It is necessary to not specify any metadata devices, and "nosync" must be added:

The reason for doing this is that it's faster to create a degraded array, populate it, then reload the table with the missing metadata devices and data device, so that the parity is only computed once, not twice.

RAID5
RAID5 is similar to RAID4, except in RAID5 the parity data is distributed across the stripe set. There are 4 "flavors" of RAID5. For LVM, the default is raid5_ls. The amount of parity used is the same as RAID4, so the total space is 5859072 sectors. To create a RAID5 set with no metadata devices:

To create a RAID5 with metadata:

To create a degraded RAID5:

RAID6
RAID6 is a stripe set that can tolerate the failure of up to 2 disks. Like RAID5, parity is distributed across the stripe set. There are 3 "flavors" of RAID 6. For LVM, the default is "raid6_zr". The total available space is 3906048 sectors. To create a RAID6 set with no metadata devices:

To create a RAID6 with metadata:

To create a degraded RAID6:

Note that 2 devices are left out instead of 1.

RAID10
RAID10 combines mirroring (RAID 1) and striping (RAID 0). Note that it is better than stacking a RAID1 on top of RAID0 (or vice versa): it is possible to do RAID10 on an odd number of disks. Half the disks are lost to the mirror, so the total available space is 3906048 sectors. To create a RAID10 set with no metadata devices:

To create a RAID10 set with metadata:

If all the devices are empty, the nosync option may be used to skip the initial sync, with the same caveats as for the mirror target.

Crypt
See Documentation/device-mapper/dm-crypt.txt for parameters and usage. The crypt target is used to encrypt an underlying block device. It is the backend to the cryptsetup tool. It supports several encryption modes, and is compatible with Cryptoloop and loop-aes volumes.

To create an encrypted volume, using the default method chosen by cryptsetup:
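A sketch of a crypt table, assuming the layout cipher, key, IV offset, device, device offset. The cipher aes-xts-plain64 is an assumption about what cryptsetup picks by default and may differ between versions. The device path, the crypt_table helper and the name "crypt1" are placeholders, and the key is a dummy 64-hex-digit value built by repeating "babe".

```shell
# usage: crypt_table LENGTH CIPHER HEX_KEY DEVICE
crypt_table() { echo "0 $1 crypt $2 $3 0 $4 0"; }

# Build a dummy 256-bit key (64 hex digits). Never use a key like this for real data.
key=""
i=0
while [ "$i" -lt 16 ]; do
    key="${key}babe"
    i=$(( i + 1 ))
done

crypt_table 1953125 aes-xts-plain64 "$key" /dev/loop0

# To activate it (as root), pass the table on stdin rather than on the command line:
#   crypt_table 1953125 aes-xts-plain64 "$key" /dev/loop0 | dmsetup create crypt1
```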

Rather than specify the key on the command line (which is insecure), it is possible to use the kernel keyring facility:

If there's not enough entropy for /dev/random, dd will hang until there's enough entropy.

Note that for the keyring facility, the crypt target wants binary blobs instead of hex strings. The key above (babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe) would need to be formatted as follows:

To create an encrypted volume, using the default method from older versions of cryptsetup:

Note that it's sha256 and not sha-256!

Note that using the --table option of dmsetup discloses sensitive data via the command line, so it's better to store the parameters in a secure file and redirect it to STDIN instead. Also note that the dmsetup table command does not show the key unless --showkeys is passed. When using the kernel keyring facility, only the key string is shown, not the actual key.

Verity
See Documentation/device-mapper/verity.txt for the parameters and usage. The verity target is a read-only target intended for use in "verified boot" scenarios. It's similar in spirit to IMA/EVM, but doesn't require a TPM, and works at the block level rather than the file level. This target compares the hash of each block of data to the one stored in the metadata; if they do not match, an I/O error is returned. Note that this target only does hashing, not signature verification. Whatever uses this target should keep a signed copy of the hash (with the public key in read-only firmware or similar) and check it before using the target.

Unlike the other device mapper targets, the verity target does NOT initialize the metadata device on its first use. Instead, it must be populated using an external tool, veritysetup, which is part of cryptsetup. To create the metadata (using an empty device in this case):

The length of the verity target must be a multiple of the data block size, so multiply data_blocks*(data_block_size/512) to get the length for the verity target, in this case 1953120. To create the verity device:

The --readonly option must be specified to this target.
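The length calculation for the verity target can be checked with shell arithmetic. The values 244140 data blocks and a 4096-byte data block size are assumptions chosen here because they reproduce the 1953120 figure. The real values come from the veritysetup output.

```shell
# Length of the verity target in 512-byte sectors.
# usage: verity_length DATA_BLOCKS DATA_BLOCK_SIZE_BYTES
verity_length() { echo $(( $1 * ($2 / 512) )); }

verity_length 244140 4096   # prints 1953120
```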

Thin
See Documentation/device-mapper/thin-provisioning.txt for parameters and usage. Thin pools are to block devices what sparse files are to filesystems. It is possible to create large, empty volumes, even ones larger than the pool itself or whose combined size exceeds the pool size, and space isn't allocated until something is actually written to those areas. Furthermore, blocks can be returned to the thin pool via the trim/discard operation. Thin pools have a cheap snapshotting operation (different from the snapshot target) that remains cheap even across multiple layers of indirection (snapshots of snapshots of snapshots...).

The thin-pool target has 3 important parameters:
 * metadata_dev is where to store the metadata for the thin pool. The recommended size is 3*(data_dev_size/(32*data_block_size)) sectors, but at least 2 MiB (4096 sectors). The thin-provisioning-tools package has a program, thin_metadata_size, that will compute a suitable metadata device size given the data_block_size, data_dev_size, and number of volumes in the pool. The maximum supported size of the metadata device is 15.9375 GiB (33423360 sectors).
 * data_block_size controls the granularity of the thin pool. Data is allocated in blocks of this size. It must be at least 64 KiB (128 sectors), and be a multiple of 64 KiB (128 sectors).
 * low_water_mark is a lower boundary of free space within the pool. If the free space drops below this, a message is sent. Set it to 0 to disable.
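The recommended metadata size can be estimated with shell arithmetic. This sketch applies the formula above together with the stated minimum and maximum. It multiplies before dividing to limit rounding error, so it is an approximation and not a replacement for thin_metadata_size. The helper name is made up for this example.

```shell
# Recommended thin-pool metadata size in sectors, clamped to [4096, 33423360].
# usage: thin_meta_sectors DATA_DEV_SECTORS DATA_BLOCK_SECTORS
thin_meta_sectors() {
    sectors=$(( 3 * $1 / (32 * $2) ))
    if [ "$sectors" -lt 4096 ]; then
        sectors=4096                    # at least 2 MiB
    fi
    if [ "$sectors" -gt 33423360 ]; then
        sectors=33423360                # maximum supported metadata size
    fi
    echo "$sectors"
}

thin_meta_sectors 1953125 128      # small pool: clamped to the 4096-sector minimum
thin_meta_sectors 4294967296 128   # 2 TiB data device
```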

Using the thin_metadata_size utility of thin-provisioning-tools:

Even anticipating 100 volumes within a pool, it is still less than the minimum recommended 2 MiB (4096 sectors). To create the pool:
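A sketch of a thin-pool table, assuming the layout metadata device, data device, data block size, low water mark. The device paths, the 1953125-sector length, the helper and the name "pool" are placeholders for this example.

```shell
# usage: thin_pool_table LENGTH METADATA_DEV DATA_DEV DATA_BLOCK_SIZE LOW_WATER_MARK
thin_pool_table() { echo "0 $1 thin-pool $2 $3 $4 $5"; }

# 128-sector (64 KiB) blocks, low water mark disabled.
thin_pool_table 1953125 /dev/loop0 /dev/loop1 128 0

# To activate it (as root; "pool" is a placeholder name):
#   dmsetup create pool --table "$(thin_pool_table 1953125 /dev/loop0 /dev/loop1 128 0)"
```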

Creating thin volumes
The thin-pool target is unusual among the targets, as it does not produce a usable disk by itself. Instead, by sending messages to the target, it produces more device-mapper devices, which can be used for storage via the thin target. Volumes within a pool are referred to by a 24-bit ordinal. Note that there isn't a way to query the pool for which ordinals are in use. To create a new thin volume with ordinal 17:

This allocates an ordinal but no storage. However, it's possible to use the ordinal with the thin target to create a 200 MB (390625-sector) thin volume:
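The two steps above (allocating ordinal 17, then mapping it) can be sketched as a dry run. The run helper, the create_thin_volume function, the pool path /dev/mapper/pool and the volume name "thin17" are placeholders. By default the commands are only printed.

```shell
# Dry-run helper (hypothetical): print each command instead of executing it.
# Set DRY_RUN=0 and run as root to execute the commands for real.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: create_thin_volume POOL_DEV ORDINAL SECTORS NAME
create_thin_volume() {
    pool=$1
    id=$2
    sectors=$3
    name=$4
    run dmsetup message "$pool" 0 "create_thin $id"               # allocate the ordinal
    run dmsetup create "$name" --table "0 $sectors thin $pool $id" # map it as a device
}

create_thin_volume /dev/mapper/pool 17 390625 thin17
```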

Creating internal thin snapshot
Thin snapshots can be created of a thin volume. First, the volume must be quiesced:

Then the snapshot is taken. An ordinal needs to be allocated for it, so for this example 6 will be chosen:

The volume can be resumed after the snapshot is taken:

The snapshot can now be activated like any other thin volume:
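The whole internal snapshot sequence can be sketched as a dry run. It assumes the create_snap message takes the new ordinal followed by the origin's ordinal. The run helper, the thin_snapshot function and all device names are placeholders, and the commands are only printed by default.

```shell
# Dry-run helper (hypothetical): print each command instead of executing it.
# Set DRY_RUN=0 and run as root to execute the commands for real.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: thin_snapshot POOL_DEV ORIGIN_NAME ORIGIN_ORDINAL SNAP_ORDINAL SECTORS SNAP_NAME
thin_snapshot() {
    pool=$1
    origin=$2
    origin_id=$3
    snap_id=$4
    sectors=$5
    snap=$6
    run dmsetup suspend "$origin"                                   # quiesce the origin
    run dmsetup message "$pool" 0 "create_snap $snap_id $origin_id" # take the snapshot
    run dmsetup resume "$origin"                                    # origin usable again
    run dmsetup create "$snap" --table "0 $sectors thin $pool $snap_id"  # activate it
}

thin_snapshot /dev/mapper/pool thin17 17 6 390625 snap6
```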

Creating external thin snapshot
Thin snapshots can be taken of read-only external volumes. First, an ordinal is allocated, as when creating a thin volume:

Then a new thin volume is created like before, but an extra parameter is added to indicate the origin:

Deleting thin volumes
A thin volume can be deleted by unmapping it and sending the pool a delete message with the ordinal of the volume to delete:

Cache
See Documentation/device-mapper/cache.txt and Documentation/device-mapper/cache-policies.txt for parameters and usage. This target is intended to speed up access to a slow but large rotational disk by using a faster but smaller SSD as a cache. There is one important parameter:
 * block_size is the granularity of the cache. Data is promoted/demoted to/from the cache in blocks. It must be a multiple of 32k (64 sectors). LVM uses 64k (128 sectors) by default.

The recommended metadata device size is 8192 sectors + (nr_blocks/32) sectors, where nr_blocks is the number of sectors on the "fast" device divided by the block_size. For this device:
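The formula can be sketched with shell arithmetic. The fast device size used here (262144 sectors, i.e. 128 MiB) is a made-up value for illustration, as is the helper name. Integer division rounds each step down.

```shell
# Recommended cache metadata size in sectors.
# usage: cache_meta_sectors FAST_DEV_SECTORS BLOCK_SIZE_SECTORS
cache_meta_sectors() {
    nr_blocks=$(( $1 / $2 ))
    echo $(( 8192 + nr_blocks / 32 ))
}

cache_meta_sectors 262144 128   # prints 8256
```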

This will be rounded up to 8 MiB (16384 sectors) for safety; however, 4 MiB (8192 sectors) would likely be more than enough anyway.

To create a cache device:
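A sketch of a cache table, assuming the layout metadata device, cache device, origin device, block size, feature count and features, policy name, policy argument count. The device paths, the writeback feature, the default policy and the cache_table helper are choices made for this example only.

```shell
# usage: cache_table LENGTH METADATA_DEV CACHE_DEV ORIGIN_DEV BLOCK_SIZE
cache_table() {
    # "1 writeback": one feature argument; "default 0": default policy, no policy arguments
    echo "0 $1 cache $2 $3 $4 $5 1 writeback default 0"
}

# LENGTH is the size of the origin (slow) device in sectors.
cache_table 1953125 /dev/loop0 /dev/loop1 /dev/loop2 128
```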

It's recommended to mirror the metadata device on the origin and cache devices. To do so:

Era
See Documentation/device-mapper/era.txt for the parameters this target takes. The era target is intended to be used with the cache target. Its purpose is to track which blocks have changed since the last checkpoint, called an era. It does so efficiently at the expense of possible false positives, but never false negatives. Backup software or snapshots would typically use this target. For snapshots, a checkpoint can be created at snapshot time. If it becomes necessary to revert to the snapshot, it's possible to invalidate in the cache all blocks changed since the snapshot, bypass the cache to roll back the snapshot, then re-enable the cache, all without trashing the whole cache and losing valuable metadata on which blocks are still "hot". Backup software can use this to see which blocks have changed since the last backup, and thus do an incremental or differential backup of only the changed blocks.

First, create the cache device. Since there will now be 2 metadata devices (one for cache, and one for era), to avoid 2 mirrored metadata devices, a single mirror will be created with 1 metadata device, and the mirror split up using the linear target:

Unlike the cache example, where 8 MiB was used, only 4 MiB is being used here, in order to make room for the era metadata.

To create the era device:

In this case, we have a mirror of the era metadata, but it's not required; without a mirror, the metadata device should be on the "fast" device. Also note that the block_size argument has the same value as the block_size of the cache.

This target needs other tools to make use of the metadata it records. The thin-provisioning-tools package can fetch this metadata and act upon it, by invalidating all blocks changed in the current era.

Integrity
See Documentation/device-mapper/dm-integrity.txt for the parameters this target takes. The integrity target can either be used alone, or with the crypt target. This target is unusual among device-mapper targets in that it works on more than just the range of sectors it's given: it works on the entire underlying device.

First, the device must be formatted. Unlike the other targets, two things must be true to get this target to do the formatting. First, the first 512 bytes of the destination must be zeroed out. Second, the target must be invoked on the first sector only. It will then write the superblock (always 512 bytes) to the underlying device. The target has 5 mandatory arguments. The tag size parameter is written to the superblock and, if 'J' is specified for the journal parameter, the journal size is written to it. A few optional arguments are also written to the superblock: the journal_sector, interleave_sectors, and block_size parameters. The internal_hash, journal_crypt, and journal_mac parameters are NOT saved to the superblock, and must be repeated in every future invocation of this target on the underlying device.

Once the device is formatted, the target writes the number of usable blocks to the superblock. Unfortunately, this value is required for future invocations of this target on the underlying device, and neither dmsetup nor any other current tool can retrieve it (future versions of cryptsetup will have an integritysetup command that can dump it). Fortunately, writing such a (crude) tool only involves a little C.

Standalone
To use this target by itself, specify "-" for the tag size and include the internal_hash option. This example will format the device using crc32 as the integrity algorithm:

This formats the device with the superblock. To use it, it must be given a sector range based on the AVAILABLE sectors in the superblock, NOT the size of the underlying device. For this example, it's 1922872 blocks.

The device still isn't usable yet though: space is allocated for the hashes, but they are not valid. We must use a special dd command to ready the device:

Note the oflag=direct option. This is important, as this causes the hash check to be bypassed, and new hashes generated. Otherwise, dd would have given an I/O error at the first sector and bailed out.

With crypt target
To use this target with dm-crypt, specify the tag size (the size of the hash, or 0 if there is no hash, plus the size of the IV), and don't specify internal_hash. dm-crypt has a new IV type, random, which is needed with the authenticated hash types. To store just the IV, the tag size depends on the encryption algorithm. cbc(aes) has a 16-byte IV. For gcm(aes)-random and rfc7539(chacha20,poly1305)-random, the hash size is 16 bytes and the IV is 12 bytes, for a total of 28. For authenc(hmac(sha256),xts(aes))-random, it is 48.

Just storing IV
This example will format the device with 16-byte tags (for cbc(aes)), and encrypt the journal with cbc(aes) as well:

Just like with standalone mode, the superblock must be read to figure out the number of available blocks; in this case it's 1878488.

Now the crypt target can be stacked on it:

The important part here is the integrity:16:none option. The tag size here MUST match the integrity target's one. The none option indicates to just store IV and not do any authentication.

There is no need to zero out the device as we are just storing IV, not doing any hashed authentication.

Hashed authentication
This example will format the device with 28-byte tags, encrypt the journal using ChaCha20, and hash it using hmac(sha256):

Reading the superblock reveals there are 1835744 blocks for use. Now the crypt target can be stacked on it.

The important part here is the integrity:28:aead option. Again, the tag size here MUST match the integrity target's one. aead is the integrity profile to use.

The device must be cleared before use:

The oflag=direct is required to generate the new hashes without I/O errors.

Delay
See Documentation/device-mapper/delay.txt for parameters and usage. The delay target is used to simulate a congested or slow device (like a floppy).

Error
There is no kernel documentation for the error target. There are no target-specific parameters. While it can be used like any other target, it is not a "real" target. Instead, it just sets a bit in the current target that causes it to reject all I/O. For that reason, it's normally used with reload instead of create. It can be used to simulate a failed disk (as in a RAID set), or a failed region of a disk when used with linear. This target does have a practical use: convincing processes that have a device open to close it, so that it can be removed. The -f option to remove does exactly that.
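Simulating a failed region with linear and error can be sketched as a three-part table: a linear mapping up to the bad region, an error mapping over it, and a linear mapping for the remainder. The device path, the sector numbers, the bad_region_table helper and the name "flawed" are made up for illustration.

```shell
# Print a table that passes DEVICE through except for one failing region.
# usage: bad_region_table DEVICE TOTAL_SECTORS BAD_START BAD_LENGTH
bad_region_table() {
    dev=$1
    total=$2
    bad_start=$3
    bad_len=$4
    bad_end=$(( bad_start + bad_len ))
    if [ "$bad_start" -gt 0 ]; then
        echo "0 $bad_start linear $dev 0"                 # healthy part before the region
    fi
    echo "$bad_start $bad_len error"                      # all I/O here fails
    if [ "$bad_end" -lt "$total" ]; then
        echo "$bad_end $(( total - bad_end )) linear $dev $bad_end"   # healthy remainder
    fi
}

# 8 bad sectors starting at sector 1000 of a 1953125-sector device.
bad_region_table /dev/loop0 1953125 1000 8

# To activate it (as root), feed the multi-line table on stdin:
#   bad_region_table /dev/loop0 1953125 1000 8 | dmsetup create flawed
```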

Flakey
See Documentation/device-mapper/dm-flakey.txt for parameters and usage. The flakey target is used to simulate an intermittently failing device. During failure mode, it can return I/O errors (like the error target), silently drop writes, or corrupt data in deterministic ways.

Multipath
There is no kernel documentation for the multipath target. This target is unusual among the device-mapper targets in that instead of aggregating disks, it aggregates paths to the disks. This target is the backend to the multipath-tools application. The typical use of this target is on machines with multiple Fibre Channel adapters connecting via one or more SANs for redundancy. Because the same disks can be accessed through any of the host adapters, the multipath target arbitrates how the paths are used: round-robin, failover, and so forth.

Switch
See Documentation/device-mapper/switch.txt for parameters and usage. The switch target is used with the multipath target. Some Fibre Channel arrays (for now, just one: Dell EqualLogic) make the "nodes" semi-transparent: if a request is sent to the wrong node, one that doesn't have the disk of interest, it transparently forwards it to the one that does. Furthermore, it can migrate the data between nodes if needed. Because of all this moving around in the background, the multipath target would need a table possibly millions of entries long. dm-switch acts as an indirection layer, similar to the way page tables work for memory: dm-switch maintains the dynamic data and picks the correct multipath device with far fewer entries.

External resources

 * https://www.sourceware.org/lvm2/wiki/