| b.liu | e958203 | 2025-04-17 19:18:16 +0800 | 1 | ============ | 
 | 2 | dm-integrity | 
 | 3 | ============ | 
 | 4 |  | 
 | 5 | The dm-integrity target emulates a block device that has additional | 
 | 6 | per-sector tags that can be used for storing integrity information. | 
 | 7 |  | 
 | 8 | A general problem with storing integrity tags with every sector is that | 
 | 9 | writing the sector and the integrity tag must be atomic - i.e. in case of a | 
 | 10 | crash, either both the sector and the integrity tag are written or neither is. | 
 | 11 |  | 
 | 12 | To guarantee write atomicity, the dm-integrity target uses a journal. It | 
 | 13 | writes sector data and integrity tags into the journal, commits the journal | 
 | 14 | and then copies the data and integrity tags to their respective locations. | 
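
The write ordering described above can be pictured with a toy in-memory model.
This is only a sketch of the idea, not the kernel implementation; the class and
method names are hypothetical.

```python
class JournaledDevice:
    """Toy model of journaled writes (names are illustrative only)."""

    def __init__(self):
        self.data = {}       # sector number -> data at its final location
        self.tags = {}       # sector number -> integrity tag at its final location
        self.pending = []    # journal entries written but not yet committed
        self.committed = []  # journal entries covered by a commit

    def write(self, sector, data, tag):
        # Data and tag always enter the journal together.
        self.pending.append((sector, data, tag))

    def commit(self):
        # After the commit, the pending entries are considered durable.
        self.committed.extend(self.pending)
        self.pending.clear()

    def recover(self):
        # Simulates a crash followed by journal replay: uncommitted entries
        # are dropped, committed entries are copied to their final location.
        self.pending.clear()
        for sector, data, tag in self.committed:
            self.data[sector] = data
            self.tags[sector] = tag
        self.committed.clear()
```

In this model a crash before `commit()` leaves neither the data nor the tag on
the device, and a crash after it leaves both.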
 | 15 |  | 
 | 16 | The dm-integrity target can be used with the dm-crypt target - in this | 
 | 17 | situation the dm-crypt target creates the integrity data and passes them | 
 | 18 | to the dm-integrity target via bio_integrity_payload attached to the bio. | 
 | 19 | In this mode, the dm-crypt and dm-integrity targets provide authenticated | 
 | 20 | disk encryption - if the attacker modifies the encrypted device, an I/O | 
 | 21 | error is returned instead of random data. | 
 | 22 |  | 
 | 23 | The dm-integrity target can also be used as a standalone target. In this | 
 | 24 | mode, it calculates and verifies the integrity tags internally, so it can | 
 | 25 | be used to detect silent data corruption on the disk or in the I/O | 
 | 26 | path. | 
 | 27 |  | 
 | 28 | There's an alternate mode of operation where dm-integrity uses a bitmap | 
 | 29 | instead of a journal. If a bit in the bitmap is 1, the corresponding | 
 | 30 | region's data and integrity tags are not synchronized - if the machine | 
 | 31 | crashes, the unsynchronized regions will be recalculated. The bitmap mode | 
 | 32 | is faster than the journal mode, because we don't have to write the data | 
 | 33 | twice, but it is also less reliable, because if data corruption happens | 
 | 34 | when the machine crashes, it may not be detected. | 
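
The bitmap bookkeeping can be sketched as follows. This is a simplified model,
assuming one bit covers a fixed number of sectors (the sectors_per_bit argument
described later in this document); the class and method names are hypothetical.

```python
class DirtyBitmap:
    """Toy model of the dirty-region bitmap (names are illustrative only)."""

    def __init__(self, sectors_per_bit):
        self.sectors_per_bit = sectors_per_bit
        self.dirty = set()  # indices of bitmap bits that are currently 1

    def region_of(self, sector):
        # Each bit covers a contiguous run of sectors_per_bit sectors.
        return sector // self.sectors_per_bit

    def begin_write(self, sector):
        # The bit is set before data and tags are written without ordering.
        self.dirty.add(self.region_of(sector))

    def mark_synchronized(self, region):
        # Cleared once data and tags of the region match again.
        self.dirty.discard(region)

    def sectors_to_recalculate(self):
        # After a crash, every sector in a dirty region gets its tag recalculated.
        result = []
        for region in sorted(self.dirty):
            start = region * self.sectors_per_bit
            result.extend(range(start, start + self.sectors_per_bit))
        return result
```

Because the tags of a dirty region are recalculated from whatever data is on
disk, corruption that happened in such a region during the crash goes unnoticed.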
 | 35 |  | 
 | 36 | When loading the target for the first time, the kernel driver will format | 
 | 37 | the device. But it will only format the device if the superblock contains | 
 | 38 | zeroes. If the superblock is neither valid nor zeroed, the dm-integrity | 
 | 39 | target can't be loaded. | 
 | 40 |  | 
 | 41 | To use the target for the first time: | 
 | 42 |  | 
 | 43 | 1. overwrite the superblock with zeroes | 
 | 44 | 2. load the dm-integrity target with a size of one sector; the kernel | 
 | 45 |    driver will format the device | 
 | 46 | 3. unload the dm-integrity target | 
 | 47 | 4. read the "provided_data_sectors" value from the superblock | 
 | 48 | 5. load the dm-integrity target with the target size | 
 | 49 |    "provided_data_sectors" | 
 | 50 | 6. if you want to use dm-integrity with dm-crypt, load the dm-crypt target | 
 | 51 |    with the size "provided_data_sectors" | 
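
The steps above can be sketched in Python. The helper below only builds table
lines and parses a superblock buffer; it does not talk to the kernel. Several
things are assumptions: the "start length target" prefix follows the general
device-mapper table format, the device name /dev/sdX is a placeholder, and the
field widths used for the superblock are guessed from the field order listed in
the layout section of this document. Check them against the kernel source
before relying on them.

```python
import struct

def integrity_table(device, length_sectors, reserved_sectors=0,
                    tag_size="-", mode="J", extra_args=()):
    """Build a table line; the argument order follows "Target arguments"."""
    fields = [0, length_sectors, "integrity", device, reserved_sectors,
              tag_size, mode, len(extra_args), *extra_args]
    return " ".join(str(field) for field in fields)

def provided_data_sectors(superblock):
    """Extract provided_data_sectors from raw superblock bytes.

    Assumed little-endian layout: 8-byte magic, 1-byte version,
    1-byte log2(interleave sectors), 2-byte tag size,
    4-byte journal section count, 8-byte provided data sectors.
    """
    _magic, _version, _log2_interleave, _tag_size, _sections, provided = \
        struct.unpack_from("<8sBBHIQ", superblock)
    return provided

# Step 2: a one-sector table, which makes the driver format a zeroed device.
format_table = integrity_table("/dev/sdX", 1, extra_args=["internal_hash:crc32"])
```

For step 5 the same helper would be called again with `length_sectors` set to
the value returned by `provided_data_sectors()`.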
 | 52 |  | 
 | 53 |  | 
 | 54 | Target arguments: | 
 | 55 |  | 
 | 56 | 1. the underlying block device | 
 | 57 |  | 
 | 58 | 2. the number of reserved sectors at the beginning of the device - the | 
 | 59 |    dm-integrity target won't read or write these sectors | 
 | 60 |  | 
 | 61 | 3. the size of the integrity tag (if "-" is used, the size is taken from | 
 | 62 |    the internal-hash algorithm) | 
 | 63 |  | 
 | 64 | 4. mode: | 
 | 65 |  | 
 | 66 | 	D - direct writes (without journal) | 
 | 67 | 		in this mode, journaling is | 
 | 68 | 		not used and data sectors and integrity tags are written | 
 | 69 | 		separately. In case of a crash, it is possible that the data | 
 | 70 | 		and integrity tag don't match. | 
 | 71 | 	J - journaled writes | 
 | 72 | 		data and integrity tags are written to the | 
 | 73 | 		journal and atomicity is guaranteed. In case of a crash, | 
 | 74 | 		either both data and tag are written or neither is. The | 
 | 75 | 		journaled mode halves write throughput because the data | 
 | 76 | 		have to be written twice. | 
 | 77 | 	B - bitmap mode - data and metadata are written without any | 
 | 78 | 		synchronization, the driver maintains a bitmap of dirty | 
 | 79 | 		regions where data and metadata don't match. This mode can | 
 | 80 | 		only be used with internal hash. | 
 | 81 | 	R - recovery mode - in this mode, the journal is not replayed, | 
 | 82 | 		checksums are not checked and writes to the device are not | 
 | 83 | 		allowed. This mode is useful for data recovery if the | 
 | 84 | 		device cannot be activated in any of the other standard | 
 | 85 | 		modes. | 
 | 86 |  | 
 | 87 | 5. the number of additional arguments | 
 | 88 |  | 
 | 89 | Additional arguments: | 
 | 90 |  | 
 | 91 | journal_sectors:number | 
 | 92 | 	The size of the journal. This argument is used only when formatting | 
 | 93 | 	the device. If the device is already formatted, the value from the | 
 | 94 | 	superblock is used. | 
 | 95 |  | 
 | 96 | interleave_sectors:number | 
 | 97 | 	The number of interleaved sectors. This value is rounded down to | 
 | 98 | 	a power of two. If the device is already formatted, the value from | 
 | 99 | 	the superblock is used. | 
 | 100 |  | 
 | 101 | meta_device:device | 
 | 102 | 	Don't interleave the data and metadata on one device. Use a | 
 | 103 | 	separate device for metadata. | 
 | 104 |  | 
 | 105 | buffer_sectors:number | 
 | 106 | 	The number of sectors in one buffer. The value is rounded down to | 
 | 107 | 	a power of two. | 
 | 108 |  | 
 | 109 | 	The tag area is accessed using buffers; the buffer size is | 
 | 110 | 	configurable. A larger buffer size means that each I/O will be | 
 | 111 | 	larger, but fewer I/Os may be issued. | 
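
Both interleave_sectors and buffer_sectors are rounded down to a power of two.
A minimal sketch of that rounding, with a hypothetical function name:

```python
def round_down_to_power_of_two(value):
    """Return the largest power of two that is not greater than value."""
    if value < 1:
        raise ValueError("value must be a positive integer")
    # bit_length() is the position of the highest set bit, counted from 1.
    return 1 << (value.bit_length() - 1)
```

For example, a requested value of 100 would become 64, while 128 stays 128.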
 | 112 |  | 
 | 113 | journal_watermark:number | 
 | 114 | 	The journal watermark in percent. When the size of the journal | 
 | 115 | 	exceeds this watermark, the thread that flushes the journal will | 
 | 116 | 	be started. | 
 | 117 |  | 
 | 118 | commit_time:number | 
 | 119 | 	Commit time in milliseconds. When this time passes, the journal is | 
 | 120 | 	written. The journal is also written immediately if a FLUSH | 
 | 121 | 	request is received. | 
 | 122 |  | 
 | 123 | internal_hash:algorithm(:key)	(the key is optional) | 
 | 124 | 	Use internal hash or crc. | 
 | 125 | 	When this argument is used, the dm-integrity target won't accept | 
 | 126 | 	integrity tags from the upper target, but it will automatically | 
 | 127 | 	generate and verify the integrity tags. | 
 | 128 |  | 
 | 129 | 	You can use a crc algorithm (such as crc32); the integrity target | 
 | 130 | 	will then protect the data against accidental corruption. | 
 | 131 | 	You can also use a hmac algorithm (for example | 
 | 132 | 	"hmac(sha256):0123456789abcdef"); in this mode it will provide | 
 | 133 | 	cryptographic authentication of the data without encryption. | 
 | 134 |  | 
 | 135 | 	When this argument is not used, the integrity tags are accepted | 
 | 136 | 	from an upper layer target, such as dm-crypt. The upper layer | 
 | 137 | 	target should check the validity of the integrity tags. | 
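
The difference between a crc tag and a keyed hmac tag can be illustrated with
the Python standard library. This is a conceptual sketch only: it hashes just
the sector data, and it is not claimed to produce the same bytes as the
kernel's tags. The function names and the key are hypothetical.

```python
import hashlib
import hmac
import zlib

def crc32_tag(sector_data):
    # 4-byte checksum: catches accidental corruption, but anyone can recompute it.
    return zlib.crc32(sector_data).to_bytes(4, "little")

def hmac_sha256_tag(key, sector_data):
    # 32-byte keyed tag: cannot be recomputed without knowing the key.
    return hmac.new(key, sector_data, hashlib.sha256).digest()

def tag_matches(stored_tag, computed_tag):
    # Constant-time comparison, as is usual when checking authentication tags.
    return hmac.compare_digest(stored_tag, computed_tag)
```

A verifier would recompute the tag for each sector it reads and treat a
mismatch as an I/O error.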
 | 138 |  | 
 | 139 | recalculate | 
 | 140 | 	Recalculate the integrity tags automatically. It is only valid | 
 | 141 | 	when using internal hash. | 
 | 142 |  | 
 | 143 | journal_crypt:algorithm(:key)	(the key is optional) | 
 | 144 | 	Encrypt the journal using the given algorithm to make sure that the | 
 | 145 | 	attacker can't read the journal. You can use a block cipher here | 
 | 146 | 	(such as "cbc(aes)") or a stream cipher (for example "chacha20", | 
 | 147 | 	"salsa20", "ctr(aes)" or "ecb(arc4)"). | 
 | 148 |  | 
 | 149 | 	The journal contains a history of the last writes to the block device; | 
 | 150 | 	an attacker reading the journal could see the last sector numbers | 
 | 151 | 	that were written. From the sector numbers, the attacker can infer | 
 | 152 | 	the size of files that were written. To protect against this | 
 | 153 | 	situation, you can encrypt the journal. | 
 | 154 |  | 
 | 155 | journal_mac:algorithm(:key)	(the key is optional) | 
 | 156 | 	Protect sector numbers in the journal from accidental or malicious | 
 | 157 | 	modification. To protect against accidental modification, use a | 
 | 158 | 	crc algorithm; to protect against malicious modification, use a | 
 | 159 | 	hmac algorithm with a key. | 
 | 160 |  | 
 | 161 | 	This option is not needed when using internal-hash because in this | 
 | 162 | 	mode, the integrity of journal entries is checked when replaying | 
 | 163 | 	the journal. Thus, a modified sector number would be detected at | 
 | 164 | 	this stage. | 
 | 165 |  | 
 | 166 | block_size:number | 
 | 167 | 	The size of a data block in bytes.  The larger the block size the | 
 | 168 | 	less overhead there is for per-block integrity metadata. | 
 | 169 | 	Supported values are 512, 1024, 2048 and 4096 bytes.  If not | 
 | 170 | 	specified the default block size is 512 bytes. | 
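
The effect of the block size on metadata overhead is simple arithmetic: one tag
is stored per data block, so the overhead is the tag size divided by the block
size. A small sketch with a hypothetical function name (journal and padding
overhead are ignored):

```python
SUPPORTED_BLOCK_SIZES = (512, 1024, 2048, 4096)

def tag_overhead(tag_size, block_size=512):
    """Fraction of extra space needed for tags, relative to the data."""
    if block_size not in SUPPORTED_BLOCK_SIZES:
        raise ValueError("unsupported block size")
    return tag_size / block_size
```

With a 4-byte tag, the overhead is 4/512 (about 0.78%) for 512-byte blocks and
4/4096 (about 0.1%) for 4096-byte blocks.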
 | 171 |  | 
 | 172 | sectors_per_bit:number | 
 | 173 | 	In the bitmap mode, this parameter specifies the number of | 
 | 174 | 	512-byte sectors that correspond to one bitmap bit. | 
 | 175 |  | 
 | 176 | bitmap_flush_interval:number | 
 | 177 | 	The bitmap flush interval in milliseconds. The metadata buffers | 
 | 178 | 	are synchronized when this interval expires. | 
 | 179 |  | 
 | 180 | legacy_recalculate | 
 | 181 | 	Allow recalculating of volumes with HMAC keys. This is disabled by | 
 | 182 | 	default for security reasons - an attacker could modify the volume, | 
 | 183 | 	set recalc_sector to zero, and the kernel would not detect the | 
 | 184 | 	modification. | 
 | 185 |  | 
 | 186 |  | 
 | 187 | The journal mode (D/J), buffer_sectors, journal_watermark and commit_time | 
 | 188 | can be changed when reloading the target (load an inactive table and swap the | 
 | 189 | tables with suspend and resume). The other arguments should not be changed | 
 | 190 | when reloading the target because the layout of disk data depends on them | 
 | 191 | and the reloaded target would be non-functional. | 
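
A tool that reloads tables could check this rule before loading the new table.
The sketch below is one conservative way to express it; the function names are
hypothetical, and additional arguments are assumed to be given as "name:value"
strings or bare flags.

```python
CHANGEABLE_ARGS = {"buffer_sectors", "journal_watermark", "commit_time"}

def parse_args(extra_args):
    """Turn ["name:value", "flag"] into {"name": "value", "flag": ""}."""
    parsed = {}
    for arg in extra_args:
        name, _, value = arg.partition(":")
        parsed[name] = value
    return parsed

def reload_is_safe(old_mode, old_args, new_mode, new_args):
    """True if only the settings listed above differ between the two tables."""
    # Only a switch between D and J is allowed for the mode.
    if old_mode != new_mode and {old_mode, new_mode} != {"D", "J"}:
        return False
    old, new = parse_args(old_args), parse_args(new_args)
    fixed_names = (set(old) | set(new)) - CHANGEABLE_ARGS
    # Every other argument must be present with the same value in both tables.
    return all(old.get(name) == new.get(name) for name in fixed_names)
```

Switching from J to D while changing commit_time passes this check, whereas
changing block_size does not.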
 | 192 |  | 
 | 193 |  | 
 | 194 | The layout of the formatted block device: | 
 | 195 |  | 
 | 196 | * reserved sectors | 
 | 197 |     (they are not used by this target, they can be used for | 
 | 198 |     storing LUKS metadata or for other purposes), the size of the reserved | 
 | 199 |     area is specified in the target arguments | 
 | 200 |  | 
 | 201 | * superblock (4KiB) | 
 | 202 | 	* magic string - identifies that the device was formatted | 
 | 203 | 	* version | 
 | 204 | 	* log2(interleave sectors) | 
 | 205 | 	* integrity tag size | 
 | 206 | 	* the number of journal sections | 
 | 207 | 	* provided data sectors - the number of sectors that this target | 
 | 208 | 	  provides (i.e. the size of the device minus the size of all | 
 | 209 | 	  metadata and padding). The user of this target should not send | 
 | 210 | 	  bios that access data beyond the "provided data sectors" limit. | 
 | 211 | 	* flags | 
 | 212 | 	    SB_FLAG_HAVE_JOURNAL_MAC | 
 | 213 | 		- a flag is set if journal_mac is used | 
 | 214 | 	    SB_FLAG_RECALCULATING | 
 | 215 | 		- recalculating is in progress | 
 | 216 | 	    SB_FLAG_DIRTY_BITMAP | 
 | 217 | 		- journal area contains the bitmap of dirty | 
 | 218 | 		  blocks | 
 | 219 | 	* log2(sectors per block) | 
 | 220 | 	* a position where recalculating finished | 
 | 221 | * journal | 
 | 222 | 	The journal is divided into sections, each section contains: | 
 | 223 |  | 
 | 224 | 	* metadata area (4KiB), it contains journal entries | 
 | 225 |  | 
 | 226 | 	  - every journal entry contains: | 
 | 227 |  | 
 | 228 | 		* logical sector (specifies where the data and tag should | 
 | 229 | 		  be written) | 
 | 230 | 		* last 8 bytes of data | 
 | 231 | 		* integrity tag (the size is specified in the superblock) | 
 | 232 |  | 
 | 233 | 	  - every metadata sector ends with | 
 | 234 |  | 
 | 235 | 		* mac (8 bytes); all the macs in 8 metadata sectors form a | 
 | 236 | 		  64-byte value. It is used to store hmac of sector | 
 | 237 | 		  numbers in the journal section, to protect against a | 
 | 238 | 		  possibility that the attacker tampers with sector | 
 | 239 | 		  numbers in the journal. | 
 | 240 | 		* commit id | 
 | 241 |  | 
 | 242 | 	* data area (the size is variable; it depends on how many journal | 
 | 243 | 	  entries fit into the metadata area) | 
 | 244 |  | 
 | 245 | 	    - every sector in the data area contains: | 
 | 246 |  | 
 | 247 | 		* data (504 bytes of data, the last 8 bytes are stored in | 
 | 248 | 		  the journal entry) | 
 | 249 | 		* commit id | 
 | 250 |  | 
 | 251 | 	To test if the whole journal section was written correctly, every | 
 | 252 | 	512-byte sector of the journal ends with an 8-byte commit id. If the | 
 | 253 | 	commit id matches on all sectors in a journal section, then it is | 
 | 254 | 	assumed that the section was written correctly. If the commit id | 
 | 255 | 	doesn't match, the section was written partially and it should not | 
 | 256 | 	be replayed. | 
 | 257 |  | 
 | 258 | * one or more runs of interleaved tags and data. | 
 | 259 |     Each run contains: | 
 | 260 |  | 
 | 261 | 	* tag area - it contains integrity tags. There is one tag for each | 
 | 262 | 	  sector in the data area | 
 | 263 | 	* data area - it contains data sectors. The number of data sectors | 
 | 264 | 	  in one run must be a power of two. log2 of this value is stored | 
 | 265 | 	  in the superblock. |
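
The commit id check and the split of a data sector between the journal data
area and the journal entry can be sketched as below. This is a simplified
model: it assumes the caller already knows the commit id expected in every
sector of the section (how that value is derived is not described here), and
the function names are hypothetical.

```python
SECTOR_SIZE = 512
COMMIT_ID_SIZE = 8
DATA_PER_JOURNAL_SECTOR = SECTOR_SIZE - COMMIT_ID_SIZE  # 504 bytes

def section_is_complete(section, expected_commit_id):
    """Check that every 512-byte sector ends with the expected commit id."""
    if len(section) == 0 or len(section) % SECTOR_SIZE != 0:
        raise ValueError("section must be a whole number of sectors")
    for offset in range(0, len(section), SECTOR_SIZE):
        trailer = section[offset + DATA_PER_JOURNAL_SECTOR:offset + SECTOR_SIZE]
        if trailer != expected_commit_id:
            return False  # partially written section, must not be replayed
    return True

def restore_data_sector(journal_data_sector, last_8_bytes):
    """Rebuild the original 512-byte sector from its two journal parts."""
    # The first 504 bytes come from the data area, the last 8 bytes from
    # the journal entry; the commit id in the data area is discarded.
    return journal_data_sector[:DATA_PER_JOURNAL_SECTOR] + last_8_bytes
```

During replay, a section would be applied only if `section_is_complete()`
returns true for it.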