dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths.  The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there is a
large number of fixed-sized address regions but no simple pattern
(such as the striping of dm-stripe) that would allow for a compact
representation of the mapping.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture.  In this architecture, the storage group
consists of a number of distinct storage arrays ("members"), each having
independent controllers, disk storage and network adapters.  When a LUN
is created it is spread across multiple members.  The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used.  When iSCSI sessions are created, each
session is connected to an Ethernet port on a single member.  Data to a
LUN can be sent on any iSCSI session, and if the blocks being accessed
are stored on another member the I/O is forwarded as required.  This
forwarding is invisible to the initiator.  The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators.  In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth.  An initiator could use a simple
round-robin algorithm to send I/O across all paths and let the storage
array members forward it as necessary, but there is a performance
advantage to sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets.  However, in this architecture the LUN is
spread with an address region size on the order of tens of megabytes,
which means the resulting table could have more than a million entries
and consume far too much memory.
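
For a rough sense of scale (illustrative numbers, not tied to any
particular array): a 16 TiB LUN divided into 16 MiB regions consists of
2^44 / 2^24 = 2^20 regions, i.e. just over a million.  At one dm table
line per region, the table alone would run to tens of megabytes of
text, before counting the kernel's per-target bookkeeping.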

Using this device-mapper switch target we can now build a two-layer
device hierarchy:

    Upper Tier - Determine which array member the I/O should be sent to.
    Lower Tier - Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths.  We also build a
non-preferred priority group containing paths to other array members for
failover reasons.
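
As an illustration only (the multipath details are independent of
dm-switch, and every device name here is a placeholder), a lower-tier
device for one member might be assembled roughly like this, assuming
/dev/sdb and /dev/sdc are direct paths to the member and /dev/sdd is a
path through another member kept only for failover:

    # Priority group 1: two direct paths, load balanced by round-robin.
    # Priority group 2: one indirect path, used only on failover.
    dmsetup create member0 --table "0 `blockdev --getsz /dev/sdb` multipath \
	0 0 2 1 \
	round-robin 0 2 1 /dev/sdb 1 /dev/sdc 1 \
	round-robin 0 1 1 /dev/sdd 1"

The "0 0" pair declares no feature and no hardware handler arguments;
"2 1" says there are two priority groups and the first should be used
initially.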

The upper tier consists of a single dm-switch device.  This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower-tier device to route the I/O.  By using a bitmap we are able to
use 4 bits for each address range in a 16-member group (which is very
large for us).  This is a much denser representation than the dm table
b-tree can achieve.
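
Continuing the hypothetical sketch above, the upper tier is then a
switch table whose paths are the per-member multipath devices (assume a
second device, member1, was built the same way), here with a 16 MiB
(32768-sector) region size.  At the 4 bits per region cited above for a
16-member group, the million-region LUN from the earlier example needs
only about 512 KiB for its region table:

    dmsetup create lun0 --table "0 `blockdev --getsz /dev/mapper/member0` \
	switch 2 32768 0 /dev/mapper/member0 0 /dev/mapper/member1 0"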

Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...]
    [<dev_path> <offset>]+

<num_paths>
    The number of paths across which to distribute the I/O.

<region_size>
    The number of 512-byte sectors in a region. Each region can be redirected
    to any of the available paths.

<num_optional_args>
    The number of optional arguments. Currently, no optional arguments
    are supported and so this must be zero.

<dev_path>
    The block device that represents a specific path to the device.

<offset>
    The offset of the start of data on the specific <dev_path> (in units
    of 512-byte sectors). This number is added to the sector number when
    forwarding the request to the specific path. Typically it is zero.
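
Putting these parameters together, a hypothetical table line for two
paths and a 64 KiB (128-sector) region size could look like this (the
length and device names are placeholders):

    0 2097152 switch 2 128 0 /dev/sdb 0 /dev/sdc 0

Here "0 2097152" is the usual target start and length in 512-byte
sectors, "switch" selects this target, "2" is <num_paths>, "128" is
<region_size>, "0" is <num_optional_args>, and each following
"<dev_path> <offset>" pair supplies one path with a zero offset.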

Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (region size was specified in constructor parameters).
    If index is omitted, the next region (previous index + 1) is used.
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

R<n>,<m>
    This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
    are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
    slots.
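
For example (illustrative values), because indices are hexadecimal and
an omitted index means "previous index + 1", the following message maps
region 0x100 (256) to path 1 and region 0x101 to path 0:

    dmsetup message switch 0 set_region_mappings 100:1 :0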

Status
======

No status line is reported.

Example
=======

Assume that you have volumes vg1/switch0, vg1/switch1 and vg1/switch2
of the same size.

Create a switch device with 64 KiB region size:
    dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0` \
	switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1:
    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1

Set repetitive mapping. This command:
    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10
is equivalent to:
    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
	:1 :2 :1 :2 :1 :2 :1 :2 :1 :2