Device-mapper

Users rarely need to use dmsetup directly. It is a very low-level tool and difficult to use. A higher-level tool such as LVM or cryptsetup is generally preferred, as it takes care of saving the metadata and issuing the dmsetup commands for you. However, sometimes one wants to deal with it directly: for recovery purposes, or because LVM doesn't yet support what you want.

Create
The create command activates a new device mapper device, which appears in /dev/mapper. In addition, if the target has metadata, it reads it or, if this is its first use, initializes the metadata devices. Note that existing device mapper devices can be passed as parameters (if the target takes a device), so it is possible to "stack" them. The syntax is: dmsetup create <name> --table "<logical_start_sector> <num_sectors> <target_type> <target_args>"

Remove
The remove command deactivates a device mapper device and removes it from /dev/mapper. The syntax is: dmsetup remove [-f] <name>. Note that it is not possible to remove a device that's in use. The -f option may be passed to replace the target with one that fails all I/O, hopefully allowing the reference count to drop to 0.

Message
The message command sends a message to the device. Which messages are supported depends on the target. The syntax is: dmsetup message <name> <sector> <message>. The <sector> argument tends not to be used and is almost always 0.

Suspend
The suspend command stops any NEW I/O. Existing I/O will still be completed. This can be used to quiesce a device. The syntax is: dmsetup suspend <name>

Resume
The resume command allows I/O to be submitted to a previously suspended device. The syntax is: dmsetup resume <name>

Reload
The reload command replaces an existing device, possibly with new targets and/or parameters. It may be necessary to suspend the target first. Some targets are immutable and can't be replaced with this command. The syntax is the same as for create.

Zero
See Documentation/device-mapper/zero.txt. This target has no target-specific parameters.

The "zero" target creates a device that functions similarly to /dev/zero: all reads return binary zeros, and all writes are discarded. It is normally used in tests, but is also useful in recovering linear and raid-type targets when combined with the 'snapshot' target: a "zero" target of the same size as the missing piece(s) is created, then a (writable) snapshot of it is created (usually a loop device backed by a large sparse file, but it can be far smaller than the missing piece since it only has to hold the changes). Then the snapshot can be mounted, fsck'd, or have recovery tools run against it.

This creates a 1GB (1953125-sector) zero target:
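As a minimal sketch, the table for such a target can be built from the size in bytes, since device mapper tables count 512-byte sectors. The device name zero1 is only an illustrative assumption, and the dmsetup call is left as a comment because it needs root.

```shell
# Device mapper tables are expressed in 512-byte sectors.
size_bytes=1000000000
size_sectors=$(( size_bytes / 512 ))

# Table format: <logical_start_sector> <num_sectors> <target_type> [<target_args>]
table="0 ${size_sectors} zero"
echo "$table"

# As root, the table could then be loaded with (the name "zero1" is hypothetical):
#   dmsetup create zero1 --table "$table"
```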

Linear
See Documentation/device-mapper/linear.txt for parameters and usage. This target is the basic building block of the device mapper: it is used to both join and split (and often both at once) block devices. For a simple identity mapping:

The 4 disks can be joined together as one:
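A sketch of how such a multi-line table might be generated, assuming four hypothetical disks /dev/loop0 to /dev/loop3 of 1953125 sectors each. Each line starts where the previous one ended, and the result would be fed to dmsetup on stdin.

```shell
disk_sectors=1953125   # assumed size of each disk, in 512-byte sectors
start=0
table=""
for dev in /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3; do
    # <logical_start_sector> <num_sectors> linear <device> <offset>
    table="${table}${start} ${disk_sectors} linear ${dev} 0
"
    start=$(( start + disk_sectors ))
done
printf '%s' "$table"

# As root, a multi-line table is read from stdin (the name "joined" is hypothetical):
#   printf '%s' "$table" | dmsetup create joined
```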

Note the peculiar syntax of the join. The --table argument only allows single-line tables; multi-line tables must be read from stdin. Also notice that the logical_start_sector is not 0 in this case, as each device we're appending needs to start where the previous one ends. It's possible to split a disk, in this case into a 4 MiB (8192-sector) "small" and a 1 GB (1953125-sector) "large" disk:
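The split might look like the sketch below. It assumes a hypothetical disk /dev/loop0 that holds at least 8192 + 1953125 sectors; the names "small" and "large" are illustrative.

```shell
small_sectors=$(( 4 * 1024 * 1024 / 512 ))   # 4 MiB in 512-byte sectors
large_sectors=1953125                        # 1 GB in 512-byte sectors

# Both devices start at logical sector 0; only the offset into the disk differs.
small_table="0 ${small_sectors} linear /dev/loop0 0"
large_table="0 ${large_sectors} linear /dev/loop0 ${small_sectors}"
echo "$small_table"
echo "$large_table"

# As root:
#   dmsetup create small --table "$small_table"
#   dmsetup create large --table "$large_table"
```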

Note that in the second device the offset is not 0, since it is desired to start 4 MiB (8192 sectors) in. Joining and splitting can be combined:

This creates a 4GB device using the last 1GB of each disk.

Snapshot
See Documentation/device-mapper/snapshot.txt for the parameters. This target is used to create a special copy-on-write device. If a change is made to an origin, the old chunk is copied to the snapshot device. This is useful for backups. It is possible to write to the snapshot device directly as well, and thus test changes before committing them to the origin.

The origin device should already be populated. To mark a device as an origin, use the snapshot-origin target:

Creating snapshots
A snapshot can then be created through the snapshot target. This has 2 important arguments:
 * persistent controls whether or not this snapshot is invalidated at restart. Values are P for persistent (survives reboot) and N for non-persistent (invalid upon reboot). Non-persistent snapshots have less metadata associated with them.
 * chunksize controls the granularity of copying with the snapshot. Chunks are copied to the snapshot device in units of this value (in sectors).

The origin should be suspended before creating the snapshot device, if it is a device mapper device. To create a snapshot device:
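A sketch of the two tables involved, under these assumptions: /dev/loop0 is the populated underlying device (1953125 sectors), /dev/loop1 is a hypothetical device that receives the copied chunks, and the snapshot is persistent with 8-sector chunks. The device names "origin" and "snap" are illustrative.

```shell
length=1953125

# <start> <length> snapshot-origin <origin>
origin_table="0 ${length} snapshot-origin /dev/loop0"

# <start> <length> snapshot <origin> <COW device> <persistent P|N> <chunksize>
snap_table="0 ${length} snapshot /dev/loop0 /dev/loop1 P 8"

echo "$origin_table"
echo "$snap_table"

# As root:
#   dmsetup create origin --table "$origin_table"
#   dmsetup suspend origin
#   dmsetup create snap --table "$snap_table"
#   dmsetup resume origin
```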

Note how the origin device given to the snapshot target is not the device we just created, but rather the origin's underlying device.

Merging snapshots
To merge a snapshot, the origin must be suspended, the snapshot device unmapped, the snapshot-origin target replaced with the snapshot-merge target, and the origin resumed:
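The sequence could be sketched as follows. The device names "origin" and "snap", the backing devices /dev/loop0 and /dev/loop1, and the P/8 snapshot settings are all assumptions. The script only prints the commands, since running them needs root.

```shell
# snapshot-merge takes the same arguments as the snapshot target:
#   <origin> <COW device> <persistent P|N> <chunksize>
merge_table="0 1953125 snapshot-merge /dev/loop0 /dev/loop1 P 8"

# Print the sequence rather than executing it.
cat <<EOF
dmsetup suspend origin
dmsetup remove snap
dmsetup reload origin --table "${merge_table}"
dmsetup resume origin
EOF
```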

At that point the dmsetup status output will need to be polled to find out when the merge is complete. Once the merge is complete, the snapshot-merge target should be replaced with the snapshot-origin target again:

A new snapshot can then be created, reusing the now freed-up old snapshot device if desired.

Mirror
There is no kernel documentation for the mirror target. The parameters below were obtained from the Linux sources drivers/md/dm-log.c and drivers/md/dm-raid1.c:

mirror <log_type> <#log_args> <log_arg1>...<log_argN> <#devs> <device_name_1> <offset_1>...<device_name_N> <offset_N> <#features> <feature_1>...<feature_N>

For log_type there are 4 values with different arguments:
 * core region_size [[no]sync]
 * disk logdevice region_size [[no]sync]

And the values of each argument:
 * region_size is the region size of the mirror in sectors. It must be a power of 2 and at least the size of a kernel page (for Intel x86/x64 processors, this is 4 KiB (8 sectors)). This is the granularity at which the mirror is kept up to date. It's a tradeoff between increased metadata and wasted I/O. LVM uses a value of 512 KiB (1024 sectors).
 * logdevice is the device in which to store the metadata, for the disk log types.
 * [no]sync is an optional argument. The default is sync. nosync skips the sync step, but any reads of regions not written since the mirror was established are undefined. This is appropriate to use when the initial device is empty.

And there is only 1 feature:
 * handle_errors causes the mirror to respond to an error. Default is to ignore all errors. LVM enables this feature.

To create a mirror with in-memory log:

Without a persistent log, the mirror will have to be recreated every time by copying the entire block device to the other "legs". To avoid this, the log may be stored on disk:
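A sketch of both variants, assuming two hypothetical 1953125-sector legs (/dev/loop0 and /dev/loop1), a hypothetical log device /dev/loop2, the LVM region size of 1024 sectors, and the handle_errors feature.

```shell
length=1953125
region_size=1024
legs="2 /dev/loop0 0 /dev/loop1 0"   # <#devs> <device> <offset> ...

# core log: one log argument (region_size)
core_table="0 ${length} mirror core 1 ${region_size} ${legs} 1 handle_errors"

# disk log: two log arguments (logdevice, region_size)
disk_table="0 ${length} mirror disk 2 /dev/loop2 ${region_size} ${legs} 1 handle_errors"

echo "$core_table"
echo "$disk_table"

# As root (the name "mirror1" is hypothetical):
#   dmsetup create mirror1 --table "$disk_table"
```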

It's possible to do the equivalent of LVM's "--mirrorlog mirror" by creating 2 mirrors: a core mirror for the log device, and a disk mirror for the data devices:

RAID1
See Documentation/device-mapper/dm-raid.txt. Note that <chunk_size> is unused for RAID1, but a value is still required, therefore its value should be set to 0. There are 2 other important, though optional, parameters: region_size and [no]sync.

 * region_size has the same meaning as it does in the mirror target. Unlike the mirror target, it has a default of 4 MiB (8192 sectors). LVM uses a region size of 512 KiB (1024 sectors).
 * [no]sync has the same meaning as it does in the mirror target.

To create a simple 1 GB raid1 with no metadata devices:

Note that because there's no metadata device, the array must be re-mirrored each time it is created. So normally, a metadata device is desired. Each "leg" needs its own metadata device. If /dev/loop2 and /dev/loop3 are small metadata devices (4 MiB), then creating a 1 GB RAID1 would be:
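A sketch of both RAID1 tables. It assumes the data legs are hypothetical devices /dev/loop0 and /dev/loop1, with /dev/loop2 and /dev/loop3 as the metadata devices. In the raid target each leg is given as a metadata/data pair, and "-" stands for "no metadata device".

```shell
length=1953125

# <start> <length> raid <raid_type> <#raid_params> <chunk_size> <#raid_devs> <meta> <data> ...
# The only raid parameter here is the mandatory chunk_size, set to 0 for RAID1.
nometa_table="0 ${length} raid raid1 1 0 2 - /dev/loop0 - /dev/loop1"
meta_table="0 ${length} raid raid1 1 0 2 /dev/loop2 /dev/loop0 /dev/loop3 /dev/loop1"

echo "$nometa_table"
echo "$meta_table"

# As root (the name "raid1" is hypothetical):
#   dmsetup create raid1 --table "$meta_table"
```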

Striped (RAID 0) and RAID 4/5/6/10
See Documentation/device-mapper/striped.txt and Documentation/device-mapper/dm-raid.txt for the parameters of these targets. Three in particular are important:
 * chunk_size is the size of I/O (in sectors) before it is "split" across the array. It must be both a power of two and at least as large as a kernel memory page (for x86/x64 processors, pages are 4 KiB, so it must be at least 8). LVM uses a default value of 64 KiB (128 sectors). Using LVM defaults, a 1 MiB (2048-sector) write will be split into 16 chunks, distributed as evenly as possible across the array. The size of the array MUST be a multiple of this value. Otherwise the target will give the error "Array size does not match requested target length".
 * region_size has the same meaning and defaults as it does for the RAID1 target.
 * [no]sync has the same meaning as it does for the RAID1 target. It is usually not appropriate for RAID 4, 5 and 6, as even for blank devices parity must still be computed, unless creating a degraded array.

Because the number of sectors (1953125) is not a multiple of 128, it must be rounded down to the nearest multiple of 128 sectors, which can be done with bc using the formula (num_sectors / 128) * 128 with integer division. So in this case, each disk contributes 1953024 sectors.
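Since the array length has to be a multiple of chunk_size, a per-disk sector count can be rounded down with integer arithmetic. A minimal sketch in shell arithmetic, using the 1953125-sector disks and the LVM default chunk size of 128 sectors:

```shell
num_sectors=1953125
chunk_size=128

# Integer division truncates, so dividing and multiplying back rounds down.
rounded=$(( num_sectors / chunk_size * chunk_size ))
echo "$rounded"
```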

Striped (RAID0)
Stripe sets allow multiple disks to be combined into one with improved performance. The striped target's parameters are asymmetric to the RAID ones. First, the number of devices comes first, not the chunk size. Second, one must specify the offset (usually 0) of each device that makes up the stripe set. Because there are 4 disks of 1953024 sectors each, the total array size will be 7812096 sectors. To create a stripe set (RAID0):
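A sketch of the corresponding table, assuming the four disks are hypothetical devices /dev/loop0 to /dev/loop3 and the chunk size is 128 sectors:

```shell
per_disk=1953024
chunk_size=128
total=$(( 4 * per_disk ))

# <start> <length> striped <#stripes> <chunk_size> <device> <offset> ...
table="0 ${total} striped 4 ${chunk_size} /dev/loop0 0 /dev/loop1 0 /dev/loop2 0 /dev/loop3 0"
echo "$table"

# As root (the name "stripe1" is hypothetical):
#   dmsetup create stripe1 --table "$table"
```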

RAID4
RAID4 is a striped set that can tolerate the failure of a single disk. Because RAID4 uses a dedicated parity disk, one disk is "unusable", therefore the total space is 3 disks * 1953024 sectors, or a total of 5859072 sectors. To create a RAID4 set with no metadata devices:
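A sketch of such a table, assuming four hypothetical disks /dev/loop0 to /dev/loop3 and a 128-sector chunk size. Each "-" marks an absent metadata device.

```shell
per_disk=1953024
data_disks=3                      # 4 disks minus 1 parity disk
total=$(( data_disks * per_disk ))

# <start> <length> raid <raid_type> <#raid_params> <chunk_size> <#raid_devs> <meta> <data> ...
table="0 ${total} raid raid4 1 128 4 - /dev/loop0 - /dev/loop1 - /dev/loop2 - /dev/loop3"
echo "$table"

# As root (the name "raid4" is hypothetical):
#   dmsetup create raid4 --table "$table"
```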

As with RAID1, because there are no metadata devices, the parity disk will have to be rebuilt every time the array is assembled. To create a RAID4 WITH metadata devices:

It is possible to create a RAID4 in degraded mode initially. It is necessary to not specify any metadata devices, and "nosync" must be added:

The reason for doing this is that it's faster to create a degraded array, populate it, then reload the table with the missing metadata devices and data device, so that the parity is only computed once, not twice.

RAID5
RAID5 is similar to RAID4, except in RAID5 the parity data is distributed across the stripe set. There are 4 "flavors" of RAID5. For LVM, the default is raid5_ls. The amount of parity used is the same as RAID4, so the total space is 5859072 sectors. To create a RAID5 set with no metadata devices:

To create a RAID5 with metadata:

To create a degraded RAID5:

RAID6
RAID6 is a stripe set that can tolerate the failure of up to 2 disks. Like RAID5, parity is distributed across the stripe set. There are 3 "flavors" of RAID 6. For LVM, the default is "raid6_zr". The total available space is 3906048 sectors. To create a RAID6 set with no metadata devices:

To create a RAID6 with metadata:

To create a degraded RAID6:

Note that 2 devices are left out instead of 1.

RAID10
RAID10 combines mirroring (RAID 1) and striping (RAID 0). Note that it is better than stacking a RAID1 on top of RAID0 (or vice versa): it is possible to do RAID10 on an odd number of disks. Half the disks are lost to the mirror, so the total available space is 3906048 sectors. To create a RAID10 set with no metadata devices:

To create a RAID10 set with metadata:

If all the devices are empty, the nosync option may be used to skip the initial sync, with the same caveats as for the mirror target.

Crypt
See Documentation/device-mapper/dm-crypt.txt for parameters and use. This target is used to encrypt an underlying block device. It is the backend to the cryptsetup tool. It supports several encryption modes, and is compatible with Cryptoloop and loop-aes volumes.

To create an encrypted volume:
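A sketch of a crypt table with a freshly generated random key. The cipher aes-xts-plain64, the 256-bit key length, the device /dev/loop0, and the name crypt1 are all assumptions chosen for illustration. The table is meant to be passed on stdin so that the key does not show up on the command line.

```shell
# 32 random bytes rendered as 64 hex digits.
key=$(od -An -tx1 -N32 /dev/urandom | tr -d ' \n')

# <start> <length> crypt <cipher> <key> <iv_offset> <device> <offset>
table="0 1953125 crypt aes-xts-plain64 ${key} 0 /dev/loop0 0"
echo "key length: ${#key} hex digits"

# As root, feed the table on stdin rather than via --table:
#   printf '%s\n' "$table" | dmsetup create crypt1
```

A key generated this way is lost once the shell exits unless it is stored somewhere safe, so this is only suitable for throwaway volumes.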

Or, for something compatible with older versions of cryptsetup:

Note that it's sha256 and not sha-256!

Note that using the --table option of dmsetup discloses sensitive data via the command line, so it's better to store the parameters in a secure file and redirect it to STDIN instead. Also note that the dmsetup table command does not show the key unless --showkeys is passed.

Verity
See Documentation/device-mapper/verity.txt for the parameters of this target. The verity target is a read-only target intended for use in "verified boot" scenarios. It's similar in spirit to IMA/EVM, but doesn't require a TPM, and works at the block level rather than the file level. This target compares the hash of each block of data to the one stored in the metadata; if they do not match, an I/O error is returned. Note that this target only does hashing, not signature verification. Whatever uses this target should keep a signed copy of the hash (with the public key in read-only firmware or similar) and verify it before using the device.

Unlike the other device mapper targets, the verity target does NOT initialize the metadata device on its first use. Instead, it must be populated using an external tool, veritysetup, which is part of cryptsetup. To create the metadata (using an empty device in this case):

The length of the verity target must be a multiple of the data block size, so multiply data_blocks*(data_block_size/512) to get the length for the verity target, in this case 1953120. To create the verity device:
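The length calculation can be sketched as follows, assuming a 1953125-sector data device and a data block size of 4096 bytes (the block size is an assumption; it is the value that reproduces the 1953120 figure):

```shell
dev_sectors=1953125
data_block_size=4096                                 # bytes, assumed

sectors_per_block=$(( data_block_size / 512 ))
data_blocks=$(( dev_sectors / sectors_per_block ))   # whole blocks only
length=$(( data_blocks * sectors_per_block ))
echo "data_blocks=${data_blocks} length=${length}"
```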

The --readonly option must be specified to this target.

Thin
See Documentation/device-mapper/thin-provisioning.txt for the parameters of this target. Thin pools are to block devices what sparse files are to filesystems. It is possible to create large, empty volumes, even ones larger than the pool itself or whose sizes add up to more than the pool size, and space isn't allocated until something is actually written to those areas. Furthermore, blocks can be returned to the thin pool via the trim/discard operation. Thin pools have a cheap snapshotting operation (different from the snapshot target) that remains cheap even with multiple layers of indirection (snapshots of snapshots of snapshots...).

The thin-pool target has 3 important parameters:
 * metadata_dev is where to store the metadata for the thin pool. The recommended size is 3*(data_dev_size/(32*data_block_size)) sectors, but at least 2 MiB (4096 sectors). The thin-provisioning-tools package has a program, thin_metadata_size, that will compute a suitable metadata device size given the data_block_size, data_dev_size, and number of volumes in the pool. The maximum supported size of the metadata device is 15.9375 GiB (33423360 sectors).
 * data_block_size controls the granularity of the thin pool. Data is allocated in blocks of this size. It must be at least 64 KiB (128 sectors), and be a multiple of 64 KiB (128 sectors).
 * low_water_mark is a lower boundary of free space within the pool. If the free space drops below this, a message is sent. Set it to 0 to disable.
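The metadata size rule of thumb can be sketched in shell arithmetic. The 1953125-sector data device and the 128-sector block size are assumptions for illustration. The multiplication is done before the division, which is algebraically the same formula but loses less to integer truncation.

```shell
data_dev_size=1953125     # sectors, assumed
data_block_size=128       # sectors (64 KiB)
min_sectors=4096          # 2 MiB floor

# 3 * (data_dev_size / (32 * data_block_size)), multiplied first
estimate=$(( 3 * data_dev_size / (32 * data_block_size) ))

# Apply the 2 MiB minimum.
if [ "$estimate" -lt "$min_sectors" ]; then
    metadata_sectors=$min_sectors
else
    metadata_sectors=$estimate
fi
echo "estimate=${estimate} metadata_sectors=${metadata_sectors}"
```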

Using the thin_metadata_size utility of thin-provisioning-tools:

Even anticipating 100 volumes within a pool, it is still less than the minimum recommended 2 MiB (4096 sectors). To create the pool:
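A sketch of a thin-pool table, assuming a hypothetical data device /dev/loop0 and metadata device /dev/loop1, 128-sector blocks, and the low water mark disabled. Rounding the pool length down to a whole number of blocks is a conservative assumption, not something the text above requires.

```shell
data_block_size=128
length=$(( 1953125 / data_block_size * data_block_size ))

# <start> <length> thin-pool <metadata_dev> <data_dev> <data_block_size> <low_water_mark>
table="0 ${length} thin-pool /dev/loop1 /dev/loop0 ${data_block_size} 0"
echo "$table"

# As root (the name "pool" is hypothetical):
#   dmsetup create pool --table "$table"
```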

Creating thin volumes
The thin-pool target is unusual among the targets, as it does not produce a usable disk by itself. Instead, by sending messages to the target, it produces more device-mapper devices which can be used for storage through the thin target. Volumes within a pool are referred to by a 24-bit ordinal. Note that there isn't a way to query the pool for which ordinals are in use. To create a new thin volume with ordinal 17:

This allocates an ordinal but no storage. However, it's possible to use the ordinal with the thin target to create a 200MB (390625-sector) thin volume:
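Both steps could look like the sketch below. It assumes the pool is a device named "pool" under /dev/mapper, and the volume name thin1 is hypothetical. The commands are only printed, since running them needs root.

```shell
pool=/dev/mapper/pool
size_sectors=$(( 200 * 1000 * 1000 / 512 ))   # 200 MB in 512-byte sectors

# <start> <length> thin <pool_dev> <ordinal>
thin_table="0 ${size_sectors} thin ${pool} 17"

cat <<EOF
dmsetup message ${pool} 0 "create_thin 17"
dmsetup create thin1 --table "${thin_table}"
EOF
```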

Creating internal thin snapshot
Thin snapshots can be created of a thin volume. First, the volume must be quiesced:

Then the snapshot is taken. An ordinal needs to be allocated for it, so for this example 6 will be chosen:

The volume can be resumed after the snapshot is taken:

The snapshot can now be activated like any other thin volume:
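The whole sequence might be sketched as follows, assuming a pool device named "pool", an existing 390625-sector thin volume "thin1" with ordinal 17, and a hypothetical snapshot name "snap1". The commands are only printed, since running them needs root.

```shell
pool=/dev/mapper/pool
origin_id=17
snap_id=6

# The snapshot is activated with the thin target, like any other thin volume.
snap_table="0 390625 thin ${pool} ${snap_id}"

cat <<EOF
dmsetup suspend thin1
dmsetup message ${pool} 0 "create_snap ${snap_id} ${origin_id}"
dmsetup resume thin1
dmsetup create snap1 --table "${snap_table}"
EOF
```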

Creating external thin snapshot
Thin snapshots can be taken of read-only external volumes. First, an ordinal is allocated as when creating a thin volume:

Then a new thin volume is created like before, but an extra parameter is added to indicate the origin:

Deleting thin volumes
A thin volume or snapshot can be deleted by unmapping it and sending the pool a delete message with the ordinal of the volume to delete:

Cache
See Documentation/device-mapper/cache.txt and Documentation/device-mapper/cache-policies.txt for parameters and usage. This target is intended to speed up access to a slow but large rotational disk by using a faster but smaller SSD as a cache. There is one important parameter:
 * block_size is the granularity of the cache. Data is promoted/demoted to/from the cache in blocks. It must be a multiple of 32k (64 sectors). LVM uses 64k (128 sectors) by default.

The recommended metadata device size is 8192 sectors + (nr_blocks/32) sectors, where nr_blocks is the number of sectors on the "fast" device divided by the block_size. For this device:
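The calculation can be sketched in shell arithmetic. The 1953125-sector "fast" device and the 128-sector block size are assumptions for illustration, as is the choice to round the result up to the next multiple of 4 MiB (8192 sectors).

```shell
fast_sectors=1953125      # size of the "fast" device, assumed
block_size=128            # cache block size in sectors

nr_blocks=$(( fast_sectors / block_size ))
recommended=$(( 8192 + nr_blocks / 32 ))

# Round up to a multiple of 4 MiB (8192 sectors) for headroom.
rounded=$(( (recommended + 8191) / 8192 * 8192 ))
echo "nr_blocks=${nr_blocks} recommended=${recommended} rounded=${rounded}"
```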

This will be rounded up to 8 MiB (16384 sectors) for safety; however, 4 MiB (8192 sectors) would likely be more than enough anyway.

To create a cache device:
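A sketch of a cache table. The devices are hypothetical: /dev/loop0 as the slow origin (1953125 sectors), /dev/loop1 as the fast cache device, and /dev/loop2 as the metadata device. The writeback feature and the "default" policy with no policy arguments are assumptions as well.

```shell
origin_sectors=1953125
block_size=128

# <start> <length> cache <metadata_dev> <cache_dev> <origin_dev> <block_size>
#   <#feature_args> [<feature_arg>...] <policy> <#policy_args> [<policy_arg>...]
table="0 ${origin_sectors} cache /dev/loop2 /dev/loop1 /dev/loop0 ${block_size} 1 writeback default 0"
echo "$table"

# As root (the name "cache1" is hypothetical):
#   dmsetup create cache1 --table "$table"
```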

It's recommended to mirror the metadata device on the origin and cache devices. To do so:

Era
See Documentation/device-mapper/era.txt for the parameters this target takes. The era target is intended to be used with the cache target. Its purpose is to track which blocks have changed since the last checkpoint, called an era. The target does so efficiently at the expense of possible false positives, but never false negatives. Backup software or snapshots would typically use this target. For snapshots, a checkpoint can be created at snapshot time. If it becomes necessary to revert to the snapshot, it's possible to invalidate in the cache all blocks changed since the snapshot, bypass the cache to roll back the snapshot, then re-enable the cache, all without trashing the whole cache and losing valuable metadata on which blocks are still "hot". Backup software can use this to see which blocks have changed since the last backup, and thus do an incremental or differential backup of only the changed blocks.

First, create the cache device. Since there will now be 2 metadata devices (one for cache, and one for era), to avoid 2 mirrored metadata devices, a single mirror will be created with 1 metadata device, and the mirror split up using the linear target:

Unlike the cache example, where 8 MiB was used, only 4 MiB is being used here, in order to make room for the era metadata.

To create the era device:
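A sketch of an era table. The metadata device /dev/mapper/era_meta and the tracked device /dev/mapper/slow are hypothetical names, and the length and block size simply reuse the figures from the earlier examples.

```shell
length=1953125
block_size=128    # same value as the cache block_size

# <start> <length> era <metadata_dev> <origin_dev> <block_size>
table="0 ${length} era /dev/mapper/era_meta /dev/mapper/slow ${block_size}"
echo "$table"

# As root (the name "era1" is hypothetical):
#   dmsetup create era1 --table "$table"
```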

In this case, we have a mirror of the era metadata, but it's not required. Without a mirror, the metadata device should be on the "fast" device. Also note that the block_size argument has the same value as the block_size of the cache.

This target needs other tools to make use of the metadata it records. The thin-provisioning-tools package can fetch this and act upon it, by invalidating all blocks changed in the current era.