Device-mapper


Normally, users rarely use dmsetup directly. dmsetup is a very low-level and difficult tool to use. LVM, mdtool or cryptsetup is generally the preferred way to do it, as those tools take care of saving the metadata and issuing the dmsetup commands. However, sometimes it is desirable to deal with the device mapper directly: for recovery purposes, or to use a target that hasn't yet been ported to LVM.

Important
The device mapper, like the rest of the Linux block layer, deals with things at the sector level. A sector is defined as 512 bytes, regardless of the actual physical geometry of the block device. All formulas and values passed to the device mapper are in sectors unless otherwise stated.
Note
This article uses both IEC (1024-based) and metric (1000-based) units. So 1 GB is 1,000,000,000 bytes, but 1 GiB is 1,073,741,824 bytes.

Dmsetup commands

Create

The create command activates a new device-mapper device, which appears in /dev/mapper. In addition, if the target has metadata, it reads it, or, if this is its first use, it initializes the metadata devices. Note that prior device-mapper devices can be passed as parameters (if the target takes a device), so it is possible to "stack" them. The syntax is: dmsetup create <new device name> --table '<start sector> <length> <target name> <target parameters>'
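As a minimal sketch (using the zero target described later in this article, so no backing device is needed), a device can be created and its node confirmed:

root #dmsetup create example-dm --table '0 2048 zero'
root #ls /dev/mapper/example-dm
/dev/mapper/example-dm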

Remove

The remove command deactivates a device-mapper device and removes it from /dev/mapper. The syntax is: dmsetup remove [-f] <device name> Note that it is not possible to remove a device that is in use. The -f option may be passed to replace the target with one that fails all I/O, hopefully allowing the reference count to drop to 0.

Message

The message command sends a message to the device. Which messages are supported depends on the target. The syntax is: dmsetup message <device name> <sector number> <target message> The <sector number> tends not to be used and is almost always 0.

Suspend

The suspend command stops any NEW I/O; I/O already in flight will still be completed. This can be used to quiesce a device. The syntax is: dmsetup suspend <device name>

Resume

The resume command allows I/O to be submitted again to a previously suspended device. The syntax is: dmsetup resume <device name>

Reload

The reload command replaces the table of an existing device, possibly with new targets and/or parameters. The device must be suspended and resumed for the new table to take effect. Some targets are immutable and can't be replaced with this command. The syntax is the same as create.
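A sketch of the usual suspend/reload/resume sequence, growing the example-dm zero device from the create example above from 2048 to 4096 sectors and then cleaning it up:

root #dmsetup suspend example-dm
root #dmsetup reload example-dm --table '0 4096 zero'
root #dmsetup resume example-dm
root #dmsetup remove example-dm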

Device-mapper targets

Note
To quickly create a sparse file:
user $dd if=/dev/null of=nameX.img seek=#sectors
The number of sectors for a given device will be stated to facilitate this. A loopback device can then be set up via losetup. Note this won't overwrite old content in the file, so if a fresh file is desired, the loopback device must be torn down, the file deleted, and a new one created.
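For example, to back /dev/loop0 with a sparse 1 GB+4 MiB (1961317-sector) file (a sketch; the file name is arbitrary and losetup may pick a different loop device, which --show prints):

user $dd if=/dev/null of=disk0.img seek=1961317
root #losetup -f --show disk0.img
/dev/loop0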

Zero

See Documentation/device-mapper/zero.txt for usage. This target has no target-specific parameters.

The "zero" target creates a device that functions similarly to /dev/zero: all reads return binary zero, and all writes are discarded. It is normally used in tests, but it is also useful for recovering linear and raid-type targets when combined with the snapshot target: a zero target of the same size as the missing piece(s) is created, then a (writable) snapshot is created on top of it (usually backed by a loop device over a large sparse file, which can be far smaller than the missing piece since it only has to hold the changes). The snapshot can then be mounted, fsck'd, or have recovery tools run against it.

This creates a 1GB (1953125-sector) zero target:

root #dmsetup create 1gb-zero --table '0 1953125 zero'
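As a sketch of the recovery trick described above (the snapshot target and its parameters are covered later in this article; the 100 MiB COW file and /dev/loop9 are arbitrary choices for illustration):

user $dd if=/dev/null of=cow.img seek=204800
root #losetup /dev/loop9 cow.img
root #dmsetup create 1gb-zero-snap --table '0 1953125 snapshot /dev/mapper/1gb-zero /dev/loop9 N 8'

The writable 1gb-zero-snap device can now stand in for a missing 1 GB piece while recovery tools are run against the assembled volume.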

Linear

Note
All the following examples for this target will use four 1 GB+4 MiB (1961317-sector) disks as /dev/loop{0,1,2,3}

See Documentation/device-mapper/linear.txt for parameters and usage. The linear target is the basic building block of the device mapper - it is used to join and split (and often both at once) block devices. For a simple identity mapping:

root #dmsetup create test-linear --table '0 1961317 linear /dev/loop0 0'

The 4 disks can be joined together as one:

root #echo -e '0 1961317 linear /dev/loop0 0'\\n'1961317 1961317 linear /dev/loop1 0'\\n'3922634 1961317 linear /dev/loop2 0'\\n'5883951 1961317 linear /dev/loop3 0' | dmsetup create test-linear

Note the peculiar syntax on the join: the --table argument only allows single-line tables, so multi-line tables must be read from stdin. Also notice the logical start sector is not 0 in this case, as each device being appended needs to start where the previous one ends. It is also possible to split a disk, in this case into a 4 MiB (8192-sector) "small" disk and a 1 GB (1953125-sector) "large" disk:

root #dmsetup create test-linear-small --table '0 8192 linear /dev/loop0 0'
root #dmsetup create test-linear-large --table '0 1953125 linear /dev/loop0 8192'

Note that in the second device the offset is not 0, since it is desired to start 4 MiB (8192 sectors) into the disk. Joining and splitting can be combined:

root #echo -e '0 1953125 linear /dev/loop0 8192'\\n'1953125 1953125 linear /dev/loop1 8192'\\n'3906250 1953125 linear /dev/loop2 8192'\\n'5859375 1953125 linear /dev/loop3 8192' | dmsetup create test-linear

This creates a 4 GB device using the last 1 GB of each disk.
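As a quick sanity check (a sketch; blockdev reports sizes in 512-byte sectors), the resulting device should be 4 * 1953125 = 7812500 sectors:

root #blockdev --getsz /dev/mapper/test-linear
7812500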

Snapshot

Note
All the following examples for this target will use a 1.2 GB (2343750-sector) disk /dev/loop0, split as follows:
root #dmsetup create test-snapshot-base-real --table '0 1953125 linear /dev/loop0 0'
root #dmsetup create test-snapshot-snap-cow --table '0 390625 linear /dev/loop0 1953125'

See Documentation/device-mapper/snapshot.txt for parameters and usage. The snapshot target is used to create a special copy-on-write device that stores changes made to an origin device. When a chunk of the origin is changed, the old chunk is first copied to the snapshot device. This is useful for backups. It is possible to write to the snapshot device directly as well, and thus test changes before committing them to the origin.

The origin device should already be populated. To mark a device as an origin, use the snapshot-origin target:

root #dmsetup create test-snapshot-base --table '0 1953125 snapshot-origin /dev/mapper/test-snapshot-base-real'

Creating snapshots

A snapshot can then be created through the snapshot target. This has 2 important arguments:

  • persistent controls whether or not this snapshot is invalidated at restart. Values are P for persistent (survives reboot) and N for non-persistent (invalid upon reboot). Non-persistent snapshots have less metadata associated with them.
  • chunksize controls the granularity of copying to the snapshot. Chunks are copied to the snapshot device in intervals of this value (in sectors).

The origin should be suspended before creating the snapshot device, if it is a device mapper device. To create a snapshot device:

root #dmsetup create test-snapshot-cow --table '0 1953125 snapshot /dev/mapper/test-snapshot-base-real /dev/mapper/test-snapshot-snap-cow P 8'

Note how the origin device given here is not the snapshot-origin device just created, but rather the origin's underlying device.

Merging snapshots

To merge a snapshot, the origin must be suspended, the snapshot device unmapped, the snapshot-origin target replaced with the snapshot-merge target, and the origin resumed:

root #dmsetup suspend test-snapshot-base
root #dmsetup remove test-snapshot-cow
root #dmsetup reload test-snapshot-base --table '0 1953125 snapshot-merge /dev/mapper/test-snapshot-base-real /dev/mapper/test-snapshot-snap-cow P 8'
root #dmsetup resume test-snapshot-base

At that point, the dmsetup status output will need to be polled to find out when the merge is complete (a sketch of this is shown after the commands below). Once the merge is complete, the snapshot-merge target should be replaced with the snapshot-origin target again:

root #dmsetup suspend test-snapshot-base
root #dmsetup reload test-snapshot-base --table '0 1953125 snapshot-origin /dev/mapper/test-snapshot-base-real'
root #dmsetup resume test-snapshot-base
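While the merge is running (that is, after the reload to snapshot-merge and before the reload back to snapshot-origin), progress can be watched with dmsetup status; per snapshot.txt, the merge is finished once the number of allocated sectors drops back down to the number of metadata sectors. A minimal sketch:

root #watch -n1 'dmsetup status test-snapshot-base'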

A new snapshot can then be created, reusing the now freed-up old snapshot device if desired.

Mirror and RAID1

Note
All the following examples for this target will use two 1 GB+4 MiB (1961317-sector) disks as /dev/loop{0,1}, split as follows:
root #dmsetup create test-mirror-log0 --table '0 8192 linear /dev/loop0 0'
root #dmsetup create test-mirror-log1 --table '0 8192 linear /dev/loop1 0'
root #dmsetup create test-mirror-data0 --table '0 1953125 linear /dev/loop0 8192'
root #dmsetup create test-mirror-data1 --table '0 1953125 linear /dev/loop1 8192'

Mirror

There is no kernel documentation for the mirror target. The parameters below were obtained from the Linux sources: drivers/md/dm-log.c and drivers/md/dm-raid1.c

CODE
<starting_sector> <length> mirror <log_type> <#log_args> <log_arg1>...<log_argN> <#devs> <device_name_1> <offset_1>...<device name N> <offset N> <#features> <feature_1>...<feature_N>

There are several log types; the two commonly used values for log_type, with their arguments, are:

  • core region_size [[no]sync]
  • disk logdevice region_size [[no]sync]

And the values of each argument:

  • region_size is the region size of the mirror, in sectors. It must be a power of 2 and at least the size of a kernel page (for x86/x64 processors, this is 4 KiB, or 8 sectors). This is the granularity at which the mirror is kept up to date; it is a tradeoff between increased metadata and wasted I/O. LVM uses a value of 512 KiB (1024 sectors).
  • logdevice is the device on which to store the metadata, for the disk log type
  • [no]sync is an optional argument. The default is sync. nosync skips the initial synchronization step, but reads of regions not written since the mirror was established are undefined. This is appropriate when the initial devices are empty.

And there is only 1 feature:

  • handle_errors causes the mirror to respond to I/O errors. The default is to ignore all errors. LVM enables this feature.

To create a mirror with in-memory log:

root #dmsetup create test-mirror --table '0 1953125 mirror core 1 1024 2 /dev/mapper/test-mirror-data0 0 /dev/mapper/test-mirror-data1 0 1 handle_errors'

Without a persistent log, the mirror will have to be resynchronized every time it is created, by copying the entire block device to the other "legs". To avoid this, the log may be stored on disk:

root #dmsetup create test-mirror --table '0 1953125 mirror disk 2 /dev/mapper/test-mirror-log0 1024 2 /dev/mapper/test-mirror-data0 0 /dev/mapper/test-mirror-data1 0 1 handle_errors'

It is possible to do LVM's "--mirrorlog mirror" by creating 2 mirrors: a core mirror for the log device, and a disk mirror for the data devices:

root #dmsetup create test-mirror-log --table '0 8192 mirror core 1 1024 2 /dev/mapper/test-mirror-log0 0 /dev/mapper/test-mirror-log1 0 1 handle_errors'
root #dmsetup create test-mirror --table '0 1953125 mirror disk 2 /dev/mapper/test-mirror-log 1024 2 /dev/mapper/test-mirror-data0 0 /dev/mapper/test-mirror-data1 0 1 handle_errors'

RAID1

See Documentation/device-mapper/dm-raid.txt for parameters and usage. This target has a few important parameters:

  • chunk_size is unused, but its value is required. Set it to 0.
  • region_size has the same meaning as it does in the mirror target. Unlike the mirror target, it has a default of 4 MiB (8192 sectors). LVM uses a region size of 512 KiB (1024 sectors).
  • [no]sync has the same meaning as it does in the mirror target

To create a simple 1 GB RAID1 with no metadata devices:

root #dmsetup create test-raid1 --table '0 1953125 raid raid1 3 0 region_size 1024 2 - /dev/mapper/test-mirror-data0 - /dev/mapper/test-mirror-data1'

Note that because there is no metadata device, the array must be re-mirrored each time it is created, so normally a metadata device is desired. Each "leg" needs its own metadata device. Reusing the small 4 MiB devices from above as metadata devices, a 1 GB RAID1 can be created with:

root #dmsetup create test-raid1 --table '0 1953125 raid raid1 3 0 region_size 1024 2 /dev/mapper/test-mirror-log0 /dev/mapper/test-mirror-data0 /dev/mapper/test-mirror-log1 /dev/mapper/test-mirror-data1'

Striped (RAID 0) and RAID 4/5/6/10

Note
All the following examples for this target will use four 1 GB+4 MiB (1961317-sector) disks as /dev/loop{0,1,2,3}, split as follows:
root #for S in `seq 0 3`; do dmsetup create test-raid-metadata$S --table "0 8192 linear /dev/loop$S 0"; dmsetup create test-raid-data$S --table "0 1953125 linear /dev/loop$S 8192"; done

See Documentation/device-mapper/striped.txt and Documentation/device-mapper/dm-raid.txt for parameters and usage. The striped target aggregates several disks and splits I/O amongst them for performance. There are a few important parameters. For both the raid and striped targets:

  • chunk_size is the amount of I/O (in sectors) written to one device before the I/O is "split" across the array. It must be both a power of two and at least as large as a kernel memory page (for x86/x64 processors, pages are 4 KiB, so the chunk size must be at least 8). LVM uses a default value of 64 KiB (128 sectors). Using the LVM default, a 1 MiB (2048-sector) write will be split into 16 chunks, distributed as evenly as possible across the array. The size of the array MUST be a multiple of this value, otherwise the target will give the error "Array size does not match requested target length".

These parameters are only applicable to the raid target:

  • region_size has the same meaning and defaults as it does for the RAID1 target.
  • [no]sync has the same meaning as it does for the RAID1 target. It is usually not appropriate for RAID 4, 5 and 6, as even for blank devices the parity must still be computed, unless creating a degraded array.

Because the number of sectors (1953125) is not a multiple of 128, it must be rounded down to the nearest multiple of 128 sectors, which can be done using this formula:

CODE
echo 'scale=0; x=<dev_size> / <chunk_size>; x*=<chunk_size>; x' | bc

So in this case:

user $echo 'scale=0; x=1953125 / 128; x*=128; x' | bc
1953024

Striped (RAID0)

Stripe sets allow multiple disks to be combined into one with improved performance. The striped target's parameters are laid out differently from the raid target's: first, the number of devices comes before the chunk size; second, one must specify the offset (usually 0) of each device that makes up the stripe set. Because there are 4 disks of 1953024 sectors each, the total array size will be 7812096 sectors. To create a stripe set (RAID0):

root #dmsetup create test-raid0 --table '0 7812096 striped 4 128 /dev/mapper/test-raid-data0 0 /dev/mapper/test-raid-data1 0 /dev/mapper/test-raid-data2 0 /dev/mapper/test-raid-data3 0'

RAID4

RAID4 is a striped set that can tolerate the failure of a single disk. Because RAID4 uses a dedicated parity disk, one disk's worth of space is "unusable", therefore the total space is 3 disks * 1953024 sectors, for a total of 5859072 sectors. To create a RAID4 set with no metadata devices:

root #dmsetup create test-raid4 --table '0 5859072 raid raid4 3 64 region_size 1024 4 - /dev/mapper/test-raid-data0 - /dev/mapper/test-raid-data1 - /dev/mapper/test-raid-data2 - /dev/mapper/test-raid-data3'

As with RAID1, because there are no metadata devices, the parity disk will have to be rebuilt every time the array is assembled. To create a RAID4 set WITH metadata devices:

root #dmsetup create test-raid4 --table '0 5859072 raid raid4 3 64 region_size 1024 4 /dev/mapper/test-raid-metadata0 /dev/mapper/test-raid-data0 /dev/mapper/test-raid-metadata1 /dev/mapper/test-raid-data1 /dev/mapper/test-raid-metadata2 /dev/mapper/test-raid-data2 /dev/mapper/test-raid-metadata3 /dev/mapper/test-raid-data3'

It is possible to create a RAID4 set in degraded mode initially. It is necessary to not specify any metadata devices, and "nosync" must be added:

root #dmsetup create test-raid4 --table '0 5859072 raid raid4 4 64 region_size 1024 nosync 4 - /dev/mapper/test-raid-data0 - /dev/mapper/test-raid-data1 - /dev/mapper/test-raid-data2 - -'

The reason for doing this is that it is faster to create a degraded array, populate it, and then reload the table with the missing metadata and data devices, so that the parity is only computed once, not twice.

RAID5

RAID5 is similar to RAID4, except that in RAID5 the parity data is distributed across the stripe set. There are 4 "flavors" of RAID5; for LVM, the default is raid5_ls. The amount of space used for parity is the same as RAID4, so the total space is 5859072 sectors. To create a RAID5 set with no metadata devices:

root #dmsetup create test-raid5 --table '0 5859072 raid raid5_ls 3 64 region_size 1024 4 - /dev/mapper/test-raid-data0 - /dev/mapper/test-raid-data1 - /dev/mapper/test-raid-data2 - /dev/mapper/test-raid-data3'

To create a RAID5 with metadata:

root #dmsetup create test-raid5 --table '0 5859072 raid raid5_ls 3 64 region_size 1024 4 /dev/mapper/test-raid-metadata0 /dev/mapper/test-raid-data0 /dev/mapper/test-raid-metadata1 /dev/mapper/test-raid-data1 /dev/mapper/test-raid-metadata2 /dev/mapper/test-raid-data2 /dev/mapper/test-raid-metadata3 /dev/mapper/test-raid-data3'

To create a degraded RAID5:

root #dmsetup create test-raid5 --table '0 5859072 raid raid5_ls 4 64 region_size 1024 nosync 4 - /dev/mapper/test-raid-data0 - /dev/mapper/test-raid-data1 - /dev/mapper/test-raid-data2 - -'

RAID6

RAID6 is a stripe set that can tolerate the failure of up to 2 disks. Like RAID5, the parity is distributed across the stripe set. There are 3 "flavors" of RAID6; for LVM, the default is raid6_zr. The total available space is 3906048 sectors. To create a RAID6 set with no metadata devices:

root #dmsetup create test-raid6 --table '0 3906048 raid raid6_zr 3 64 region_size 1024 4 - /dev/mapper/test-raid-data0 - /dev/mapper/test-raid-data1 - /dev/mapper/test-raid-data2 - /dev/mapper/test-raid-data3'

To create a RAID6 with metadata:

root #dmsetup create test-raid6 --table '0 3906048 raid raid6_zr 3 64 region_size 1024 4 /dev/mapper/test-raid-metadata0 /dev/mapper/test-raid-data0 /dev/mapper/test-raid-metadata1 /dev/mapper/test-raid-data1 /dev/mapper/test-raid-metadata2 /dev/mapper/test-raid-data2 /dev/mapper/test-raid-metadata3 /dev/mapper/test-raid-data3'

To create a degraded RAID6:

root #dmsetup create test-raid6 --table '0 3906048 raid raid6_zr 4 64 region_size 1024 nosync 4 - /dev/mapper/test-raid-data0 - /dev/mapper/test-raid-data1 - - - -'

Note that two devices are left out instead of one.

RAID10

RAID10 combines mirroring (RAID1) and striping (RAID0). Note that it is better than stacking a RAID1 on top of a RAID0 (or vice versa) - for example, it is possible to do RAID10 on an odd number of disks. Half the disks are lost to the mirror, so the total available space is 3906048 sectors. To create a RAID10 set with no metadata devices:

root #dmsetup create test-raid10 --table '0 3906048 raid raid10 3 64 region_size 1024 4 - /dev/mapper/test-raid-data0 - /dev/mapper/test-raid-data1 - /dev/mapper/test-raid-data2 - /dev/mapper/test-raid-data3'

To create a RAID10 set with metadata:

root #dmsetup create test-raid10 --table '0 3906048 raid raid10 3 64 region_size 1024 4 /dev/mapper/test-raid-metadata0 /dev/mapper/test-raid-data0 /dev/mapper/test-raid-metadata1 /dev/mapper/test-raid-data1 /dev/mapper/test-raid-metadata2 /dev/mapper/test-raid-data2 /dev/mapper/test-raid-metadata3 /dev/mapper/test-raid-data3'

If all the devices are empty, the nosync option may be used to skip the initial sync, with the same caveats as the mirror target.

Crypt

Note
All the following examples for this target will use a 1 GB+4 MiB (1961317-sector) disk as /dev/loop0

See Documentation/device-mapper/dm-crypt.txt for parameters and usage. The crypt target is used to encrypt an underlying block device. It is the backend to the cryptsetup tool. It supports several encryption modes, and is compatible with Cryptoloop and loop-aes volumes.

To create an encrypted volume, using the default method chosen by cryptsetup:

root #dmsetup create test-crypt --table '0 1953125 crypt aes-xts-plain64 babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0 /dev/loop0 0 1 allow_discards'

Rather than specify the key on the command line (which is insecure), it is possible to use the kernel keyring facility:

root #dd if=/dev/random bs=32 count=1 iflag=fullblock | keyctl padd user test-cryptkey @s
252498285

If there's not enough entropy for /dev/random, dd will hang until there's enough entropy.

root #dmsetup create test-crypt --table '0 1953125 crypt aes-xts-plain64 :32:user:test-cryptkey 0 /dev/loop0 0 1 allow_discards'

Note that for the keyring facility, the crypt target wants binary blobs instead of hex strings. The key above (babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe) would need to be formatted as follows:

root #echo -ne '\xba\xbe\xba\xbe\xba\xbe\xba\xbe\xba\xbe\xba\xbe\xba\xbe\xba\xbe\xba\xbe\xba\xbe\xba\xbe\xba\xbe\xba\xbe\xba\xbe\xba\xbe\xba\xbe' | keyctl pupdate 252498285
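Alternatively (a sketch, assuming the xxd utility is available), the hex string can be converted to the binary blob instead of spelling out the escape sequences by hand:

root #echo -n 'babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe' | xxd -r -p | keyctl pupdate 252498285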

To create an encrypted volume, using the default method from older versions of cryptsetup:

root #dmsetup create test-crypt --table '0 1953125 crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 /dev/loop0 0'

Note that it is sha256 and not sha-256!

Note that using the --table option of dmsetup discloses sensitive data via the command line, so it is better to store the parameters in a secure file and redirect from stdin instead. Also note that the dmsetup table command does not show the key unless --showkeys is passed. When using the kernel keyring facility, only the key description string is shown, not the actual key.

Verity

Note
All the following examples for this target will use a 1 GB+8 MiB (1969509-sector) disk as /dev/loop0, split as follows:
root #dmsetup create test-verity-metadata --table '0 16384 linear /dev/loop0 0'
root #dmsetup create test-verity-data --table '0 1953125 linear /dev/loop0 16384'

See Documentation/device-mapper/verity.txt for the parameters and usage. The verity target is a read-only target intended for use in "verified boot" scenarios. It is similar in spirit to IMA/EVM, but doesn't require a TPM, and works at the block level rather than the file level. This target compares the hash of each block of data against the hash stored in the metadata; if they do not match, an I/O error is returned. Note this target only does hashing, not signature verification. Whatever uses this target should verify a signed copy of the root hash (with the public key in read-only firmware or similar) before using it.

Unlike the other device-mapper targets, the verity target does NOT initialize the metadata device on its first use. Instead, it must be populated using an external tool, veritysetup, which is part of cryptsetup. To create the metadata (using an empty data device in this case):

root #veritysetup format --no-superblock --salt e48da609055204e89ae53b655ca2216dd983cf3cb829f34f63a297d106d53e2d /dev/mapper/test-verity-data /dev/mapper/test-verity-metadata
VERITY header information for /dev/mapper/test-verity-metadata
UUID:            	
Hash type:       	1
Data blocks:     	244140
Data block size: 	4096
Hash block size: 	4096
Hash algorithm:  	sha256
Salt:            	e48da609055204e89ae53b655ca2216dd983cf3cb829f34f63a297d106d53e2d
Root hash:      	29cb87e60ce7b12b443ba6008266f3e41e93e403d7f298f8e3f316b29ff89c5e

The verity target length must be a multiple of the data block size, so multiply data_blocks * (data_block_size/512) to get the length for the verity target; in this case, 244140 * 8 = 1953120. To create the verity device:

root #dmsetup create test-verity --readonly --table '0 1953120 verity 1 /dev/mapper/test-verity-data /dev/mapper/test-verity-metadata 4096 4096 244140 0 sha256 29cb87e60ce7b12b443ba6008266f3e41e93e403d7f298f8e3f316b29ff89c5e e48da609055204e89ae53b655ca2216dd983cf3cb829f34f63a297d106d53e2d'

The --readonly option must be specified for this target.
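Per verity.txt, the status of the device reports V as long as all blocks read so far have verified correctly, and C once corruption has been detected. A quick check might look like this (the output shown is illustrative):

root #dmsetup status test-verity
0 1953120 verity V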

Thin

Note
All the following examples for this target will use 2 disks: a 1 GB+2 MiB (1957221-sector) disk as /dev/loop0, and a 200 MB (390625-sector) read-only disk as /dev/loop1, with /dev/loop0 split as follows:
root #dmsetup create test-thin-metadata --table '0 4096 linear /dev/loop0 0'
root #dmsetup create test-thin-data --table '0 1953125 linear /dev/loop0 4096'

See Documentation/device-mapper/thin-provisioning.txt for parameters and usage. Thin pools are to block devices what sparse files are to filesystems. It is possible to create large, empty volumes - even larger than the pool itself, or with a combined size greater than the pool - and space isn't allocated until something is actually written to those areas. Furthermore, blocks can be returned to the thin pool via the trim/discard operation. Thin pools have a cheap snapshotting operation (different from the snapshot target) that remains cheap even with multiple layers of indirection (snapshots of snapshots of snapshots...).

The thin-pool target has 3 important parameters:

  • metadata_dev is where to store the metadata for the thin pool. The recommended size is 3*(data_dev_size/(32*data_block_size)) sectors, but at least 2 MiB (4096 sectors). The thin-provisioning-tools package has a program, thin_metadata_size, that will compute a suitable metadata device size given the data_block_size, data_dev_size, and number of volumes in the pool. The maximum supported size of the metadata device is 15.9375 GiB (33423360 sectors)
  • data_block_size controls the granularity of the thin pool. Data is allocated in blocks of this size. It must be at least 64 KiB (128 sectors) and a multiple of 64 KiB (128 sectors).
  • low_water_mark is a lower boundary of free space within the pool. If the free space drops below this, an event is triggered. Set it to 0 to disable.

Using the thin_metadata_size utility of thin-provisioning-tools:

user $/sbin/thin_metadata_size -b128s -s1g -m100
thin_metadata_size - 1848 sectors estimated metadata area size for "--block-size=128sectors --pool-size=1gigabytes --max-thins=100"

Even anticipating 100 volumes within the pool, this is still less than the minimum recommended 2 MiB (4096 sectors). To create the pool:

root #dmsetup create test-thin-pool --table '0 1953125 thin-pool /dev/mapper/test-thin-metadata /dev/mapper/test-thin-data 128 0'

Creating thin volumes

The thin-pool target is unusual among the other targets in that it does not produce a usable disk by itself. Instead, by sending messages to the pool, it produces more device-mapper devices which can be used for storage via the thin target. Volumes within a pool are referred to by a 24-bit ordinal. Note that there isn't a way to query the pool for which ordinals are in use. To create a new thin volume with ordinal 17:

root #dmsetup message /dev/mapper/test-thin-pool 0 "create_thin 17"

This allocates an ordinal but no storage. However, it is now possible to use the ordinal with the thin target to create a 200 MB (390625-sector) thin volume:

root #dmsetup create test-thin --table '0 390625 thin /dev/mapper/test-thin-pool 17'
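As a usage sketch (the filesystem, mount point, and mount options are illustrative), space is only taken from the pool as data is written, and it can be handed back with discard/trim:

root #mkfs.ext4 /dev/mapper/test-thin
root #mount -o discard /dev/mapper/test-thin /mnt
root #fstrim -v /mnt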

Creating internal thin snapshot

Thin snapshots can be created of a thin volume. First, the volume must be quiesced:

root #dmsetup suspend /dev/mapper/test-thin

Then the snapshot is taken. An ordinal needs to be allocated for it as well; for this example, 6 is chosen:

root #dmsetup message /dev/mapper/test-thin-pool 0 "create_snap 6 17"

The volume can be resumed after the snapshot is taken:

root #dmsetup resume /dev/mapper/test-thin

The snapshot can now be activated like any other thin volume:

root #dmsetup create test-thin-snap --table '0 390625 thin /dev/mapper/test-thin-pool 6'

Creating external thin snapshots

Thin snapshots can also be taken of read-only external volumes. First, an ordinal is allocated as when creating a thin volume:

root #dmsetup message /dev/mapper/test-thin-pool 0 "create_thin 438"

Then a new thin volume is created like before, but an extra parameter is added to indicate the origin:

root #dmsetup create test-thin-extsnap --table '0 390625 thin /dev/mapper/test-thin-pool 438 /dev/loop1'

Deleting thin volumes

A thin volume can be deleted by unmapping it and sending the pool a delete message with the ordinal of the volume to delete:

root #dmsetup remove test-thin
root #dmsetup message /dev/mapper/test-thin-pool 0 "delete 17"

Cache

Note
All the following examples for this target will use 2 disks: /dev/loop0 will be 1 GB+12 MiB (1977701 sectors), and /dev/loop1 will be 200 MB+12 MiB (415201 sectors), split as follows:
root #for S in `seq 0 1`; do dmsetup create test-cache-meta$S --table "0 16384 linear /dev/loop$S 0"; dmsetup create test-mirror-cache-meta$S --table "0 8192 linear /dev/loop$S 16384"; done
root #dmsetup create test-cache-data0 --table '0 1953125 linear /dev/loop0 24576'
root #dmsetup create test-cache-data1 --table '0 390625 linear /dev/loop1 24576'

See Documentation/device-mapper/cache.txt and Documentation/device-mapper/cache-policies.txt for parameters and usage. This target is intended to speed up access to a slow but large rotational disk by using a faster but smaller SSD as a cache. There is one important parameter:

  • block_size is the granularity of the cache. Data is promoted to and demoted from the cache in blocks of this size. It must be a multiple of 32 KiB (64 sectors). LVM uses 64 KiB (128 sectors) by default.

The recommended metadata device size is 8192 sectors + (nr_blocks/32) sectors where nr_blocks is the number of sectors on the "fast" device divided by the block_size. For this device:

user $echo '390625 / (128 * 32) + 8192' | bc
8287

This will be rounded up to 8 MiB (16384 sectors) for safety; however, 4 MiB (8192 sectors) would likely be more than enough anyway.

To create a cache device:

root #dmsetup create test-cache --table '0 1953024 cache /dev/mapper/test-cache-meta0 /dev/mapper/test-cache-data1 /dev/mapper/test-cache-data0 128 0 default 0'

It is recommended to mirror the metadata device across the origin and cache devices. To do so:

root #dmsetup create test-mirror-cache-meta --table '0 16384 raid raid1 1 0 2 /dev/mapper/test-mirror-cache-meta0 /dev/mapper/test-cache-meta0 /dev/mapper/test-mirror-cache-meta1 /dev/mapper/test-cache-meta1'
root #dmsetup create test-cache --table '0 1953024 cache /dev/mapper/test-mirror-cache-meta /dev/mapper/test-cache-data1 /dev/mapper/test-cache-data0 128 0 default 0'

Era

Note
All the following examples for this target will use 2 disks: /dev/loop0 will be 1 GB+12 MiB (1977701 sectors), and /dev/loop1 will be 200 MB+12 MiB (415201 sectors), split as follows:
root #for S in `seq 0 1`; do dmsetup create test-era-meta$S --table "0 16384 linear /dev/loop$S 0"; dmsetup create test-era-mirror-meta$S --table "0 8192 linear /dev/loop$S 16384"; done
root #dmsetup create test-era-data0 --table '0 1953125 linear /dev/loop0 24576'
root #dmsetup create test-era-data1 --table '0 390625 linear /dev/loop1 24576'

See Documentation/device-mapper/era.txt for the parameters this target takes. The era target is intended to be used with the cache target. The purpose of this target is to track which blocks have changed since the last checkpoint, called an era. It does so efficiently at the expense of possible false positives, but never false negatives. Backup software or snapshots would typically use this target. For snapshots, a checkpoint can be created at snapshot time. If it becomes necessary to revert to the snapshot, it is possible to invalidate all cache blocks changed since the snapshot, bypass the cache to roll back the snapshot, then re-enable the cache - all without trashing the whole cache and losing valuable metadata on which blocks are still "hot". Backup software can use this to see which blocks have changed since the last backup, and thus do an incremental or differential backup of only the changed blocks.

First, create the cache device. Since there will now be two metadata devices (one for the cache and one for era), to avoid two separate mirror metadata devices, a single mirror is created and then split up using the linear target:

root #dmsetup create test-era-mirror-meta --table '0 16384 raid raid1 3 0 region_size 1024 2 /dev/mapper/test-era-mirror-meta0 /dev/mapper/test-era-meta0 /dev/mapper/test-era-mirror-meta1 /dev/mapper/test-era-meta1'
root #dmsetup create test-era-mirror-meta-cache --table '0 8192 linear /dev/mapper/test-era-mirror-meta 0'
root #dmsetup create test-era-mirror-meta-era --table '0 8192 linear /dev/mapper/test-era-mirror-meta 8192'
root #dmsetup create test-era-cache --table '0 1953024 cache /dev/mapper/test-era-mirror-meta-cache /dev/mapper/test-era-data1 /dev/mapper/test-era-data0 128 0 default 0'

Unlike the cache example, where 8 MiB was used, only 4 MiB is used here, in order to make room for the era metadata.

To create the era device:

root #dmsetup create test-era --table '0 1953024 era /dev/mapper/test-era-mirror-meta-era /dev/mapper/test-era-cache 128'

In this case the era metadata is mirrored, but this is not required; if it is not mirrored, the metadata device should be placed on the "fast" device. Also note the block_size argument has the same value as the block_size of the cache.
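Per era.txt, the current era can be closed and a new one started by sending a checkpoint message to the target (a sketch):

root #dmsetup message test-era 0 checkpoint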

This target needs other tools to make use of the metadata it records. The thin-provisioning-tools package can fetch this metadata and act upon it, for example by invalidating all cache blocks changed in the current era.

Integrity

Note
All the following examples for this target will use one disk: /dev/loop0 will be 1 GB (1953125 sectors)

See Documentation/device-mapper/dm-integrity.txt for the parameters this target takes. The integrity target can either be used alone or with the crypt target. This target is unusual among device-mapper targets in that it works on more than just the range of sectors it is given; it works on the entire underlying device.

First, the device must be formatted. Unlike the other targets, to get this target to do the formatting, two things must be true: first, the first 512 bytes of the destination must be zeroed out; second, the target must be invoked over the first sector only. It will then write the superblock (always 512 bytes) to the underlying device. The target has 5 mandatory arguments. The tag size parameter is written to the superblock, and if 'J' is specified for the journal parameter, the journal size is written to it as well. A few optional arguments are also written to the superblock: the journal_sectors, interleave_sectors, and block_size parameters. The internal_hash, journal_crypt, and journal_mac options are NOT saved to the superblock, and must be reproduced in every future invocation of this target on the underlying device.

Once the device is formatted, the target writes the number of usable sectors to the superblock. Unfortunately, this value is required for future invocations of this target on the underlying device, and neither dmsetup nor any other tool can currently retrieve it (future versions of cryptsetup will have an integritysetup command that can dump this). Fortunately, writing such a (crude) tool only involves a little C.

CODE dump-integrity.c
#include <stdio.h>
#include <string.h>
#include <inttypes.h>

struct integrity_sb {
  uint8_t magic[8];               /* "integrt" */
  uint8_t version;                /* superblock version, 1 */
  int8_t log2_interleave_sectors; /* interleave sectors */
  uint16_t integrity_tag_size;    /* tag size per-sector */
  uint32_t journal_sections;      /* size of journal */
  uint64_t provided_data_sectors; /* available size for data */
  uint32_t flags;                 /* flags */
  uint8_t log2_sectors_per_block; /* presented block (sector) size */
} __attribute__ ((packed));

int main(int argc,char** argv)
{
	FILE* integrityf = NULL;
	struct integrity_sb this_sb;

	if(argc>1) {
		if(strcmp(argv[1],"-") !=0 ) {
			integrityf = fopen(argv[1],"r");
			if(!integrityf) {
				perror("fopen: ");
				return 1;
			}
		}
	}

	if(!integrityf)	
		integrityf = stdin;

	fread(&this_sb,sizeof this_sb,1,integrityf);
	if(ferror(integrityf)) {
		perror("fread: ");
		return 1;
	} else
	if(feof(integrityf)) {
		printf("Short read\n");
		return 1;
	}
	if(strcmp((char*)&this_sb.magic,"integrt") != 0) {
		printf("Invalid integrity superblock\n");
		return 1;
	}
	if(this_sb.version != 1) {
		printf("Unknown integrity superblock version\n");
		return 1;
	}
	printf("Interleave sectors: %d\n",1<<this_sb.log2_interleave_sectors);
	printf("Tag size per-sector: %" PRIu16 "\n",this_sb.integrity_tag_size);
	printf("Size of journal: %" PRIu32 "\n",this_sb.journal_sections);
	printf("Available size for data: %" PRIu64 "\n",this_sb.provided_data_sectors);
	printf("Flags: 0x%" PRIx32 "\n",this_sb.flags);
	printf("Sectors per block: %u\n",1U<<this_sb.log2_sectors_per_block);
}
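
Once a device has been formatted (as in the sections below), the tool can be compiled and pointed at the underlying device to read the superblock (a sketch; any C compiler will do):

user $gcc -o dump-integrity dump-integrity.c
root #./dump-integrity /dev/loop0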

Standalone

To use this target by itself, specify "-" for the tag size and include the internal_hash option. This example formats the device using crc32 as the integrity algorithm:

root #dmsetup create test-integrity --table '0 1 integrity /dev/loop0 0 - J 1 internal_hash:crc32'
root #dmsetup remove test-integrity

This formats the device and writes the superblock. To use the device, the target must be given a sector range based on the AVAILABLE sectors recorded in the superblock, NOT the size of the underlying device. For this example, it is 1922872 sectors.

root #dmsetup create test-integrity --table '0 1922872 integrity /dev/loop0 0 - J 1 internal_hash:crc32'

The device still isn't usable yet, though - space has been allocated for the hashes, but they are not valid. A special dd command must be used to ready the device:

root #dd if=/dev/zero of=/dev/mapper/test-integrity count=1922872 oflag=direct

Note the oflag=direct option. This is important, as it causes the hash check to be bypassed and new hashes to be generated. Otherwise, dd would have gotten an I/O error at the first sector and bailed out.

With crypt target

To use this target with dm-crypt, specify the tag size (the size of the hash - 0 if there is none - plus the size of the IV), and do not specify internal_hash. dm-crypt has a new IV type, random, which is needed with the authenticated hash types. When storing just the IV, the tag size depends on the encryption algorithm: cbc(aes) has a 16-byte IV. For gcm(aes)-random and rfc7539(chacha20,poly1305)-random, the hash is 16 bytes and the IV is 12 bytes, for a total of 28. For authenc(hmac(sha256),xts(aes))-random, it is 48.

Just storing IV

This example will format the device with 16-byte tags (for cbc(aes)), and encrypt the journal with cbc(aes) as well:

root #dmsetup create test-integrity --table '0 1 integrity /dev/loop0 0 16 J 1 journal_crypt:cbc(aes):00000000000000000000000000000000'
root #dmsetup remove test-integrity

Just like in standalone mode, the superblock must be read to figure out the number of available sectors; in this case it is 1878488.

root #dmsetup create test-integrity --table '0 1878488 integrity /dev/loop0 0 16 J 1 journal_crypt:cbc(aes):00000000000000000000000000000000'

Now the crypt target can be stacked on it:

root #dmsetup create test-integrity-crypt --table '0 1878488 crypt aes-cbc-random 0123456789ABCDEF0123456789ABCDEF 0 /dev/mapper/test-integrity 0 1 integrity:16:none'

The important part here is the integrity:16:none option. The tag size here MUST match the integrity target's. The none option indicates to just store the IV and not do any authentication.

There is no need to zero out the device, as we are just storing the IV, not doing any hashed authentication.

Hashed authentication

This example will format the device with 28-byte tags, encrypt the journal using ChaCha20, and authenticate it using hmac(sha256):

root #dmsetup create test-integrity --table '0 1 integrity /dev/loop0 0 28 J 2 journal_crypt:chacha20:babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe journal_mac:hmac(sha256):deadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef'
root #dmsetup remove test-integrity

Reading the superblock reveals there are 1835744 sectors available for use. Now the crypt target can be stacked on it.

root #dmsetup create test-integrity-crypt --table '0 1835744 crypt capi:rfc7539(chacha20,poly1305)-random 0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF 0 /dev/mapper/test-integrity 0 1 integrity:28:aead'

The important part here is the integrity:28:aead option. Again, the tag size here MUST match the integrity target's. aead is the integrity profile to use.

The device must be cleared before use:

root #dd if=/dev/urandom of=/dev/mapper/test-integrity-crypt count=1835744 oflag=direct

The oflag=direct option is required to generate the new hashes without I/O errors.

Testing targets

Delay

See Documentation/device-mapper/delay.txt for parameters and usage. The delay target is used to simulate a congested or slow device (like a floppy).
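As a sketch (assuming a 1 GB /dev/loop0; per delay.txt the parameters are device, offset, and delay in milliseconds), a device that delays all I/O by 500 ms could be created with:

root #dmsetup create test-delay --table '0 1953125 delay /dev/loop0 0 500'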

Error

There is no kernel documentation for the error target, and it has no target-specific parameters. While it can be used like any other target, it is not a "real" target: it simply causes the device to reject all I/O. For that reason, it is normally used with reload instead of create. It can be used to simulate a failed disk (like in a RAID set) or a failed region of a disk when combined with linear. This target does have a practical use: to convince processes that have a device open to close it, so the device can be removed. The -f option to remove does exactly that.
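For example (a sketch, reusing the test-linear identity mapping from the linear section), a failed disk can be simulated by swapping its table for an error target:

root #dmsetup suspend test-linear
root #dmsetup reload test-linear --table '0 1961317 error'
root #dmsetup resume test-linear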

Flakey

See Documentation/device-mapper/dm-flakey.txt for parameters and usage. The flakey target is used to simulate an intermittently failing device. During failure mode, it can return I/O errors (like the error target), silently drop writes, or corrupt data in deterministic ways.
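A sketch (assuming a 1 GB /dev/loop0; per dm-flakey.txt the basic parameters are device, offset, seconds up, and seconds down) of a device that works normally for 60 seconds and then fails I/O for 5 seconds, repeating:

root #dmsetup create test-flakey --table '0 1953125 flakey /dev/loop0 0 60 5'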

Other targets

Multipath

There is no kernel documentation for the multipath target. This target is unusual among device-mapper targets in that instead of aggregating disks, it aggregates paths to the disks. This target is the backend of the multipath-tools application. The typical use of this target is on machines with multiple Fibre Channel adapters connecting via one or more SANs for redundancy. Because the same disks can be accessed through any of the host adapters, the multipath target arbitrates how the paths are used - round-robin, failover, and so forth.

Switch

See Documentation/device-mapper/switch.txt for parameters and usage. The switch target is used with the multipath target. Some Fibre Channel arrays (for now, just one - Dell EqualLogic) make their "nodes" semi-transparent: if a request is sent to the wrong node, one that doesn't have the disk of interest, it transparently forwards the request to the one that does; furthermore, the array can migrate data between nodes if needed. Because of all this moving around in the background, the multipath target would need a table possibly millions of entries long. dm-switch acts as an indirection layer, similar to the way page tables work for memory: it maintains the dynamic data and picks the correct multipath device with far fewer entries.

See also

Raid1 with LVM from scratch

External resources