dm-raid
=======

The device-mapper RAID (dm-raid) target provides a bridge from DM to MD.
It allows the MD RAID drivers to be accessed using a device-mapper
interface.


Mapping Table Interface
-----------------------
The target is named "raid" and it accepts the following parameters:

  <raid_type> <#raid_params> <raid_params> \
    <#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]

<raid_type>:
  raid0      RAID0 striping (no resilience)
  raid1      RAID1 mirroring
  raid4      RAID4 with dedicated last parity disk
  raid5_n    RAID5 with dedicated last parity disk supporting takeover
             Same as raid4
               - Transitory layout
  raid5_la   RAID5 left asymmetric
               - rotating parity 0 with data continuation
  raid5_ra   RAID5 right asymmetric
               - rotating parity N with data continuation
  raid5_ls   RAID5 left symmetric
               - rotating parity 0 with data restart
  raid5_rs   RAID5 right symmetric
               - rotating parity N with data restart
  raid6_zr   RAID6 zero restart
               - rotating parity zero (left-to-right) with data restart
  raid6_nr   RAID6 N restart
               - rotating parity N (right-to-left) with data restart
  raid6_nc   RAID6 N continue
               - rotating parity N (right-to-left) with data continuation
  raid6_n_6  RAID6 with dedicated parity disks
               - parity and Q-syndrome on the last 2 disks;
                 layout for takeover from/to raid4/raid5_n
  raid6_la_6 Same as "raid5_la" plus dedicated last Q-syndrome disk
               - layout for takeover from raid5_la from/to raid6
  raid6_ra_6 Same as "raid5_ra" plus dedicated last Q-syndrome disk
               - layout for takeover from raid5_ra from/to raid6
  raid6_ls_6 Same as "raid5_ls" plus dedicated last Q-syndrome disk
               - layout for takeover from raid5_ls from/to raid6
  raid6_rs_6 Same as "raid5_rs" plus dedicated last Q-syndrome disk
               - layout for takeover from raid5_rs from/to raid6
  raid10     Various RAID10 inspired algorithms chosen by additional params
             (see raid10_format and raid10_copies below)
               - RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
               - RAID1E: Integrated Adjacent Stripe Mirroring
               - RAID1E: Integrated Offset Stripe Mirroring
               - and other similar RAID10 variants

  Reference: Chapter 4 of
  http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf

<#raid_params>: The number of parameters that follow.

<raid_params> consists of
    Mandatory parameters:
        <chunk_size>: Chunk size in sectors.  This parameter is often known as
                      "stripe size".  It is the only mandatory parameter and
                      is placed first.

    followed by optional parameters (in any order):
        [sync|nosync]   Force or prevent RAID initialization.

        [rebuild <idx>]   Rebuild drive number 'idx' (first drive is 0).

        [daemon_sleep <ms>]
                Interval between runs of the bitmap daemon that
                clears bits.  A longer interval means less bitmap I/O but
                resyncing after a failure is likely to take longer.

        [min_recovery_rate <kB/sec/disk>]  Throttle RAID initialization
        [max_recovery_rate <kB/sec/disk>]  Throttle RAID initialization
        [write_mostly <idx>]               Mark drive index 'idx' write-mostly.
        [max_write_behind <sectors>]       See '--write-behind=' (man mdadm)
        [stripe_cache <sectors>]           Stripe cache size (RAID 4/5/6 only)
        [region_size <sectors>]
                The region_size multiplied by the number of regions is the
                logical size of the array.  The bitmap records the device
                synchronisation state for each region.
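
                As an illustrative calculation only (the array size is
                borrowed from the example tables further below, and 8192
                sectors is simply an assumed value), 'region_size 8192' on
                a 1960893648-sector array gives:

                    number of regions = ceil(1960893648 / 8192) = 239367

                i.e. one bitmap bit per 4 MiB (8192-sector) region.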

        [raid10_copies   <# copies>]
        [raid10_format   <near|far|offset>]
                These two options are used to alter the default layout of
                a RAID10 configuration.  The number of copies can be
                specified, but the default is 2.  There are also three
                variations to how the copies are laid down - the default
                is "near".  Near copies are what most people think of with
                respect to mirroring.  If these options are left unspecified,
                or 'raid10_copies 2' and/or 'raid10_format near' are given,
                then the layouts for 2, 3 and 4 devices are:

                2 drives         3 drives          4 drives
                --------         ----------        --------------
                A1  A1           A1  A1  A2        A1  A1  A2  A2
                A2  A2           A2  A3  A3        A3  A3  A4  A4
                A3  A3           A4  A4  A5        A5  A5  A6  A6
                A4  A4           A5  A6  A6        A7  A7  A8  A8
                ..  ..           ..  ..  ..        ..  ..  ..  ..

                The 2-device layout is equivalent to 2-way RAID1.  The
                4-device layout is what a traditional RAID10 would look like.
                The 3-device layout is what might be called a 'RAID1E -
                Integrated Adjacent Stripe Mirroring'.

                If 'raid10_copies 2' and 'raid10_format far', then the layouts
                for 2, 3 and 4 devices are:

                2 drives             3 drives             4 drives
                --------             --------------       --------------------
                A1  A2               A1   A2   A3         A1   A2   A3   A4
                A3  A4               A4   A5   A6         A5   A6   A7   A8
                A5  A6               A7   A8   A9         A9   A10  A11  A12
                ..  ..               ..   ..   ..         ..   ..   ..   ..
                A2  A1               A3   A1   A2         A2   A1   A4   A3
                A4  A3               A6   A4   A5         A6   A5   A8   A7
                A6  A5               A9   A7   A8         A10  A9   A12  A11
                ..  ..               ..   ..   ..         ..   ..   ..   ..

                If 'raid10_copies 2' and 'raid10_format offset', then the
                layouts for 2, 3 and 4 devices are:

                2 drives       3 drives           4 drives
                --------       ------------       -----------------
                A1  A2         A1  A2  A3         A1  A2  A3  A4
                A2  A1         A3  A1  A2         A2  A1  A4  A3
                A3  A4         A4  A5  A6         A5  A6  A7  A8
                A4  A3         A6  A4  A5         A6  A5  A8  A7
                A5  A6         A7  A8  A9         A9  A10 A11 A12
                A6  A5         A9  A7  A8         A10 A9  A12 A11
                ..  ..         ..  ..  ..         ..  ..  ..  ..

                Here we see layouts closely akin to 'RAID1E - Integrated
                Offset Stripe Mirroring'.
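
                As an illustration of how these options appear on a table
                line (the device numbers and array size below are
                hypothetical, borrowed from the example tables further
                down), a 4-device RAID10 with 2 'offset' copies and a
                1 MiB chunk size could be expressed as:

                    0 1960893648 raid \
                        raid10 5 2048 raid10_copies 2 raid10_format offset \
                        4 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66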

        [delta_disks <N>]
                The delta_disks option value (-251 < N < +251) triggers
                device removal (negative value) or device addition (positive
                value) to any reshape-supporting raid level, i.e. 4/5/6
                and 10.  RAID levels 4/5/6 allow for addition of devices
                (metadata and data device tuple); raid10_near and
                raid10_offset only allow for device addition; raid10_far
                does not support any reshaping at all.
                A minimum number of devices has to be kept to enforce
                resilience, which is 3 devices for raid4/5 and 4 devices
                for raid6.

        [data_offset <sectors>]
                This option value defines the offset into each data device
                where the data starts.  This is used to provide out-of-place
                reshaping space to avoid writing over data whilst
                changing the layout of stripes, so that an interruption/crash
                may happen at any time without the risk of losing data.
                E.g. when adding devices to an existing raid set during
                forward reshaping, the out-of-place space will be allocated
                at the beginning of each raid device.  The kernel raid4/5/6/10
                MD personalities supporting such device addition will read the
                data from the existing first stripes (those with a smaller
                number of stripes) starting at data_offset to fill up a new
                stripe with the larger number of stripes, calculate the
                redundancy blocks (CRC/Q-syndrome) and write that new stripe
                to offset 0.  The same will be applied to all N-1 other new
                stripes.  This out-of-place scheme is used to change the
                RAID type (i.e. the allocation algorithm) as well, e.g.
                changing from raid5_ls to raid5_n.
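
                As a purely hypothetical illustration of the syntax (device
                numbers, sizes and the offset value are made up, and such
                tables are normally generated by a volume manager such as
                lvm2 rather than written by hand), a raid5 table line
                carrying a 2048-sector data offset could look like:

                    0 1960893648 raid \
                        raid5_ls 5 128 region_size 8192 data_offset 2048 \
                        4 254:1 254:2 254:3 254:4 254:5 254:6 254:7 254:8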

        [journal_dev <dev>]
                This option adds a journal device to raid4/5/6 raid sets and
                uses it to close the 'write hole' caused by the non-atomic
                updates to the component devices, which can cause data loss
                during recovery.  The journal device is used in writethrough
                mode, thus causing writes to be throttled versus
                non-journaled raid4/5/6 sets.
                Takeover/reshape is not possible with a raid4/5/6 journal
                device; it has to be deconfigured before requesting these.

        [journal_mode <mode>]
                This option sets the caching mode on journaled raid4/5/6 raid
                sets (see 'journal_dev <dev>' above) to 'writethrough' or
                'writeback'.  If 'writeback' is selected the journal device
                has to be resilient and must not suffer from the 'write hole'
                problem itself (e.g. use raid1 or raid10) to avoid a single
                point of failure.
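
                As an illustration of the syntax only (device numbers and
                array size are hypothetical), a raid6 set journaled in
                write-back mode onto device 9:0 could be expressed as:

                    0 1960893648 raid \
                        raid6_zr 5 2048 journal_dev 9:0 journal_mode writeback \
                        5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82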

<#raid_devs>: The number of devices composing the array.
        Each device consists of two entries.  The first is the device
        containing the metadata (if any); the second is the one containing
        the data.  A maximum of 64 metadata/data device entries is supported
        up to target version 1.8.0; version 1.9.0 supports up to 253, a limit
        enforced by the MD kernel runtime.

        If a drive has failed or is missing at creation time, a '-' can be
        given for both the metadata and data drives for a given position.
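
        For example (hypothetical device numbers, mirroring the raid4
        example table below), if the drive at position 2 were missing at
        creation time, its metadata/data pair would simply be given as
        '- -':

            0 1960893648 raid \
                raid4 1 2048 \
                5 - 8:17 - 8:33 - - - 8:65 - 8:81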


Example Tables
--------------
# RAID4 - 4 data drives, 1 parity (no metadata devices)
# No metadata devices specified to hold superblock/bitmap info
# Chunk size of 1MiB
# (Lines separated for easy reading)

0 1960893648 raid \
        raid4 1 2048 \
        5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81

# RAID4 - 4 data drives, 1 parity (with metadata devices)
# Chunk size of 1MiB, force RAID initialization,
#       min recovery rate at 20 kiB/sec/disk

0 1960893648 raid \
        raid4 4 2048 sync min_recovery_rate 20 \
        5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82
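
# Such a table is normally handed to the kernel with dmsetup; for example,
# the first table above could be activated as a device named 'my_raid4'
# (the name is arbitrary and the device numbers are hypothetical):

dmsetup create my_raid4 --table \
        "0 1960893648 raid raid4 1 2048 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81"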


Status Output
-------------
'dmsetup table' displays the table used to construct the mapping.
The optional parameters are always printed in the order listed
above with "sync" or "nosync" always output ahead of the other
arguments, regardless of the order used when originally loading the table.
Arguments that can be repeated are ordered by value.


'dmsetup status' yields information on the state and health of the array.
The output is as follows (normally a single line, but expanded here for
clarity):
1: <s> <l> raid \
2:      <raid_type> <#devices> <health_chars> \
3:      <sync_ratio> <sync_action> <mismatch_cnt>

Line 1 is the standard output produced by device-mapper.
Lines 2 & 3 are produced by the raid target and are best explained by example:
        0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0
Here we can see the RAID type is raid4, there are 5 devices - all of
which are 'A'live, and the array is 2/490221568 complete with its initial
recovery.  Here is a fuller description of the individual fields:
        <raid_type>     Same as the <raid_type> used to create the array.
        <health_chars>  One char for each device, indicating:
                        'A' = alive and in-sync,
                        'a' = alive but not in-sync,
                        'D' = dead/failed.
        <sync_ratio>    The ratio indicating how much of the array has
                        undergone the process described by 'sync_action'.
                        If the 'sync_action' is "check" or "repair", then the
                        process of "resync" or "recover" can be considered
                        complete.
        <sync_action>   One of the following possible states:
                        idle    - No synchronization action is being
                                  performed.
                        frozen  - The current action has been halted.
                        resync  - Array is undergoing its initial
                                  synchronization or is resynchronizing after
                                  an unclean shutdown (possibly aided by a
                                  bitmap).
                        recover - A device in the array is being rebuilt or
                                  replaced.
                        check   - A user-initiated full check of the array is
                                  being performed.  All blocks are read and
                                  checked for consistency.  The number of
                                  discrepancies found is recorded in
                                  <mismatch_cnt>.  No changes are made to the
                                  array by this action.
                        repair  - The same as "check", but discrepancies are
                                  corrected.
                        reshape - The array is undergoing a reshape.
        <mismatch_cnt>  The number of discrepancies found between mirror
                        copies in RAID1/10 or wrong parity values found in
                        RAID4/5/6.  This value is valid only after a "check"
                        of the array is performed.  A healthy array has a
                        'mismatch_cnt' of 0.
        <data_offset>   The current data offset to the start of the user data
                        on each component device of a raid set (see the
                        respective raid parameter to support out-of-place
                        reshaping).
        <journal_char>  'A' - active write-through journal device.
                        'a' - active write-back journal device.
                        'D' - dead journal device.
                        '-' - no journal device.
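
        Assuming a fully synchronized, journal-less raid4 set mapped as
        'my_raid4' (a hypothetical name, with illustrative values), a
        complete status line including the trailing <data_offset> and
        <journal_char> fields might look like:

            # dmsetup status my_raid4
            0 1960893648 raid raid4 5 AAAAA 490221568/490221568 idle 0 0 -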


Message Interface
-----------------
The dm-raid target will accept certain actions through the 'message' interface.
('man dmsetup' for more information on the message interface.)  These actions
include:
        "idle"   - Halt the current sync action.
        "frozen" - Freeze the current sync action.
        "resync" - Initiate/continue a resync.
        "recover" - Initiate/continue a recover process.
        "check"  - Initiate a check (i.e. a "scrub") of the array.
        "repair" - Initiate a repair of the array.


Discard Support
---------------
The implementation of discard support among hardware vendors varies.
When a block is discarded, some storage devices will return zeroes when
the block is read.  These devices set the 'discard_zeroes_data'
attribute.  Other devices will return random data.  Confusingly, some
devices that advertise 'discard_zeroes_data' will not reliably return
zeroes when discarded blocks are read!  Since RAID 4/5/6 uses blocks
from a number of devices to calculate parity blocks and (for performance
reasons) relies on 'discard_zeroes_data' being reliable, it is important
that the devices be consistent.  Blocks may be discarded in the middle
of a RAID 4/5/6 stripe and if subsequent read results are not
consistent, the parity blocks may be calculated differently at any time,
making the parity blocks useless for redundancy.  It is important to
understand how your hardware behaves with discards if you are going to
enable discards with RAID 4/5/6.

Since the behavior of storage devices is unreliable in this respect,
even when reporting 'discard_zeroes_data', by default RAID 4/5/6
discard support is disabled -- this ensures data integrity at the
expense of losing some performance.

Storage devices that properly support 'discard_zeroes_data' are
increasingly whitelisted in the kernel and can thus be trusted.

For trusted devices, the following dm-raid module parameter can be set
to safely enable discard support for RAID 4/5/6:
        'devices_handle_discard_safely'
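
For example (assuming dm-raid is built as a module, so that it appears as
dm_raid under /sys/module; adjust accordingly if it is built in), the
parameter can be enabled at runtime or at module load time:

        # enable at runtime
        echo Y > /sys/module/dm_raid/parameters/devices_handle_discard_safely
        # or set it when loading the module
        modprobe dm-raid devices_handle_discard_safely=Y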


Version History
---------------
1.0.0   Initial version.  Support for RAID 4/5/6
1.1.0   Added support for RAID 1
1.2.0   Handle creation of arrays that contain failed devices.
1.3.0   Added support for RAID 10
1.3.1   Allow device replacement/rebuild for RAID 10
1.3.2   Fix/improve redundancy checking for RAID10
1.4.0   Non-functional change.  Removes arg from mapping function.
1.4.1   RAID10 fix redundancy validation checks (commit 55ebbb5).
1.4.2   Add RAID10 "far" and "offset" algorithm support.
1.5.0   Add message interface to allow manipulation of the sync_action.
        New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt.
1.5.1   Add ability to restore transiently failed devices on resume.
1.5.2   'mismatch_cnt' is zero unless [last_]sync_action is "check".
1.6.0   Add discard support (and devices_handle_discard_safely module param).
1.7.0   Add support for MD RAID0 mappings.
1.8.0   Explicitly check for compatible flags in the superblock metadata
        and reject to start the raid set if any are set by a newer
        target version, thus avoiding data corruption on a raid set
        with a reshape in progress.
1.9.0   Add support for RAID level takeover/reshape/region size
        and set size reduction.
1.9.1   Fix activation of existing RAID 4/10 mapped devices
1.9.2   Don't emit '- -' on the status table line in case the constructor
        fails reading a superblock.  Correctly emit 'maj:min1 maj:min2' and
        'D' on the status line.  If '- -' is passed into the constructor, emit
        '- -' on the table line and '-' as the status line health character.
1.10.0  Add support for raid4/5/6 journal device
1.10.1  Fix data corruption on reshape request
1.11.0  Fix table line argument order
        (wrong raid10_copies/raid10_format sequence)
1.11.1  Add raid4/5/6 journal write-back support via journal_mode option
1.12.1  Fix for MD deadlock between mddev_suspend() and md_write_start() available
1.13.0  Fix dev_health status at end of "recover" (was 'a', now 'A')