Blame - src/kernel/linux/v4.19/Documentation/device-mapper/thin-provisioning.txt - T800

blob: 883e7ca5f74588aa54b8ad7a50750ca44224ca89

xj	b04a402	2021-11-25 15:01:52 +0800		1	Introduction
				2	============
				3
				4	This document describes a collection of device-mapper targets that
				5	between them implement thin-provisioning and snapshots.
				6
				7	The main highlight of this implementation, compared to the previous
				8	implementation of snapshots, is that it allows many virtual devices to
				9	be stored on the same data volume. This simplifies administration and
				10	allows the sharing of data between volumes, thus reducing disk usage.
				11
				12	Another significant feature is support for an arbitrary depth of
				13	recursive snapshots (snapshots of snapshots of snapshots ...). The
				14	previous implementation of snapshots did this by chaining together
				15	lookup tables, and so performance was O(depth). This new
				16	implementation uses a single data structure to avoid this degradation
				17	with depth. Fragmentation may still be an issue, however, in some
				18	scenarios.
				19
				20	Metadata is stored on a separate device from data, giving the
				21	administrator some freedom, for example to:
				22
				23	- Improve metadata resilience by storing metadata on a mirrored volume
				24	but data on a non-mirrored one.
				25
				26	- Improve performance by storing the metadata on SSD.
				27
				28	Status
				29	======
				30
				31	These targets are considered safe for production use. But different use
				32	cases will have different performance characteristics, for example due
				33	to fragmentation of the data volume.
				34
				35	If you find this software is not performing as expected, please mail
				36	dm-devel@redhat.com with details and we'll try our best to improve
				37	things for you.
				38
				39	Userspace tools for checking and repairing the metadata have been fully
				40	developed and are available as 'thin_check' and 'thin_repair'. The name
				41	of the package that provides these utilities varies by distribution (on
				42	a Red Hat distribution it is named 'device-mapper-persistent-data').
				43
				44	Cookbook
				45	========
				46
				47	This section describes some quick recipes for using thin provisioning.
				48	They use the dmsetup program to control the device-mapper driver
				49	directly. End users will be advised to use a higher-level volume
				50	manager such as LVM2 once support has been added.
				51
				52	Pool device
				53	-----------
				54
				55	The pool device ties together the metadata volume and the data volume.
				56	It maps I/O linearly to the data volume and updates the metadata via
				57	two mechanisms:
				58
				59	- Function calls from the thin targets
				60
				61	- Device-mapper 'messages' from userspace which control the creation of new
				62	virtual devices amongst other things.
				63
				64	Setting up a fresh pool device
				65	------------------------------
				66
				67	Setting up a pool device requires a valid metadata device, and a
				68	data device. If you do not have an existing metadata device you can
				69	make one by zeroing the first 4k to indicate empty metadata.
				70
				71	    dd if=/dev/zero of=$metadata_dev bs=4096 count=1
				72
				73	The amount of metadata you need will vary according to how many blocks
				74	are shared between thin devices (i.e. through snapshots). If you have
				75	less sharing than average you'll need a larger-than-average metadata device.
				76
				77	As a guide, we suggest you calculate the number of bytes to use in the
				78	metadata device as 48 * $data_dev_size / $data_block_size but round it up
				79	to 2MB if the answer is smaller. If you're creating large numbers of
				80	snapshots which are recording large amounts of change, you may find you
				81	need to increase this.
				82
				83	The largest size supported is 16GB. If the device is larger,
				84	a warning will be issued and the excess space will not be used.
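
As a rough illustration of the sizing guide above, the following sketch
computes a suggested metadata device size in bytes. The function name
metadata_bytes is made up for this example, 2MB and 16GB are taken as powers
of two, and clamping the result to 16GB is an assumption drawn from the limit
above rather than something the kernel requires.

```shell
# metadata_bytes <data_dev_size> <data_block_size>
# Both arguments must use the same unit (512-byte sectors here).
# Hypothetical helper: applies 48 * data_dev_size / data_block_size,
# rounds up to 2MB, and (as an assumption of this sketch) caps at 16GB.
metadata_bytes() {
    data_dev_size=$1
    data_block_size=$2
    min=$((2 * 1024 * 1024))
    max=$((16 * 1024 * 1024 * 1024))
    bytes=$((48 * data_dev_size / data_block_size))
    if [ "$bytes" -lt "$min" ]; then
        bytes=$min
    fi
    if [ "$bytes" -gt "$max" ]; then
        bytes=$max
    fi
    echo "$bytes"
}

metadata_bytes 20971520 128
```

For a 20971520-sector data device with 128-sector blocks this suggests
7864320 bytes (7.5MB).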
				85
				86	Reloading a pool table
				87	----------------------
				88
				89	You may reload a pool's table; indeed, this is how the pool is resized
				90	if it runs out of space. (N.B. While specifying a different metadata
				91	device when reloading is not forbidden at the moment, things will go
				92	wrong if it does not route I/O to exactly the same on-disk location as
				93	previously.)
				94
				95	Using an existing pool device
				96	-----------------------------
				97
				98	    dmsetup create pool \
				99	        --table "0 20971520 thin-pool $metadata_dev $data_dev \
				100	                 $data_block_size $low_water_mark"
				101
				102	$data_block_size gives the smallest unit of disk space that can be
				103	allocated at a time expressed in units of 512-byte sectors.
				104	$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
				105	multiple of 128 (64KB). $data_block_size cannot be changed after the
				106	thin-pool is created. People primarily interested in thin provisioning
				107	may want to use a value such as 1024 (512KB). People doing lots of
				108	snapshotting may want a smaller value such as 128 (64KB). If you are
				109	not zeroing newly-allocated data, a larger $data_block_size in the
				110	region of 256000 (125MB) is suggested.
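
The limits on $data_block_size can be checked before the pool is created.
The sketch below is one possible check; valid_block_size is a hypothetical
helper name, not part of dmsetup.

```shell
# valid_block_size <sectors>
# Succeeds if the value is between 128 and 2097152 sectors and a multiple
# of 128, as required above. Hypothetical helper for illustration only.
valid_block_size() {
    size=$1
    case $size in
        ''|0*|*[!0-9]*) return 1 ;;   # empty, leading zero, or not a number
    esac
    [ "$size" -ge 128 ] && [ "$size" -le 2097152 ] && [ $((size % 128)) -eq 0 ]
}

for s in 128 1024 1000 4194304; do
    if valid_block_size "$s"; then
        echo "$s sectors ($((s / 2))KB): ok"
    else
        echo "$s sectors: rejected"
    fi
done
```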
				111
				112	$low_water_mark is expressed in blocks of size $data_block_size. If
				113	free space on the data device drops below this level then a dm event
				114	will be triggered which a userspace daemon should catch allowing it to
				115	extend the pool device. Only one such event will be sent.
				116
				117	No special event is triggered if a just-resumed device's free space is below
				118	the low water mark. However, resuming a device always triggers an
				119	event; a userspace daemon should verify that free space exceeds the low
				120	water mark when handling this event.
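
A daemon handling such an event might compare free space with the low water
mark along the following lines. This is only a sketch: below_low_water is a
made-up name, and the "used/total" argument is assumed to be the data block
field taken from the pool's status line.

```shell
# below_low_water <used>/<total> <low water mark in blocks>
# Succeeds when the number of free data blocks is below the low water mark.
# Hypothetical helper; the first argument is assumed to come from the
# "<used data blocks>/<total data blocks>" field of the pool status.
below_low_water() {
    used=${1%/*}
    total=${1#*/}
    free=$((total - used))
    [ "$free" -lt "$2" ]
}

# Made-up figures: 840 blocks free against a low water mark of 1024.
if below_low_water "163000/163840" 1024; then
    echo "pool needs extending"
fi
```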
				121
				122	A low water mark for the metadata device is maintained in the kernel and
				123	will trigger a dm event if free space on the metadata device drops below
				124	it.
				125
				126	Updating on-disk metadata
				127	-------------------------
				128
				129	On-disk metadata is committed every time a FLUSH or FUA bio is written.
				130	If no such requests are made then commits will occur every second. This
				131	means the thin-provisioning target behaves like a physical disk that has
				132	a volatile write cache. If power is lost you may lose some recent
				133	writes. The metadata should always be consistent in spite of any crash.
				134
				135	If data space is exhausted the pool will either error or queue IO
				136	according to the configuration (see: error_if_no_space). If metadata
				137	space is exhausted or a metadata operation fails, the pool will error IO
				138	until the pool is taken offline and repair is performed to 1) fix any
				139	potential inconsistencies and 2) clear the flag that imposes repair.
				140	Once the pool's metadata device is repaired it may be resized, which
				141	will allow the pool to return to normal operation. Note that if a pool
				142	is flagged as needing repair, the pool's data and metadata devices
				143	cannot be resized until repair is performed. It should also be noted
				144	that when the pool's metadata space is exhausted the current metadata
				145	transaction is aborted. Given that the pool will cache IO whose
				146	completion may have already been acknowledged to upper IO layers
				147	(e.g. filesystem) it is strongly suggested that consistency checks
				148	(e.g. fsck) be performed on those layers when repair of the pool is
				149	required.
				150
				151	Thin provisioning
				152	-----------------
				153
				154	i) Creating a new thinly-provisioned volume.
				155
				156	To create a new thinly-provisioned volume you must send a message to an
				157	active pool device, /dev/mapper/pool in this example.
				158
				159	    dmsetup message /dev/mapper/pool 0 "create_thin 0"
				160
				161	Here '0' is an identifier for the volume, a 24-bit number. It's up
				162	to the caller to allocate and manage these identifiers. If the
				163	identifier is already in use, the message will fail with -EEXIST.
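
Since the caller manages the identifiers, it can be worth rejecting values
that do not fit in 24 bits before sending the message. A minimal sketch,
assuming a wrapper named create_thin (not a dmsetup feature) and a DMSETUP
variable that can be set to "echo dmsetup" for a dry run:

```shell
DMSETUP=${DMSETUP:-dmsetup}   # assumption: override for a dry run

# create_thin <pool device> <dev id>
# Hypothetical wrapper: checks that the id is a number of at most 24 bits
# (0..16777215), then sends the create_thin message to the pool.
create_thin() {
    pool=$1
    id=$2
    case $id in
        ''|*[!0-9]*)
            echo "device id must be a number: $id" >&2
            return 1
            ;;
    esac
    if [ "${#id}" -gt 8 ] || [ "$id" -gt 16777215 ]; then
        echo "device id does not fit in 24 bits: $id" >&2
        return 1
    fi
    $DMSETUP message "$pool" 0 "create_thin $id"
}
```

With DMSETUP="echo dmsetup", calling create_thin /dev/mapper/pool 0 only
prints the command it would run.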
				164
				165	ii) Using a thinly-provisioned volume.
				166
				167	Thinly-provisioned volumes are activated using the 'thin' target:
				168
				169	    dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"
				170
				171	The last parameter is the identifier for the thinp device.
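
The length in the table is given in 512-byte sectors, so 2097152 corresponds
to a 1GB volume. The following sketch builds such a table line from a size
in MB; thin_table is a hypothetical helper and 1MB is taken as 2048 sectors.

```shell
# thin_table <size in MB> <pool device> <dev id>
# Hypothetical helper: prints a 'thin' table line, converting the size
# to 512-byte sectors (1MB = 2048 sectors).
thin_table() {
    size_mb=$1
    pool=$2
    id=$3
    echo "0 $((size_mb * 2048)) thin $pool $id"
}

thin_table 1024 /dev/mapper/pool 0
```

The printed line matches the table used in the command above and could be
passed to dmsetup create via --table.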
				172
				173	Internal snapshots
				174	------------------
				175
				176	i) Creating an internal snapshot.
				177
				178	Snapshots are created with another message to the pool.
				179
				180	N.B. If the origin device that you wish to snapshot is active, you
				181	must suspend it before creating the snapshot to avoid corruption.
				182	This is NOT enforced at the moment, so please be careful!
				183
				184	    dmsetup suspend /dev/mapper/thin
				185	    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
				186	    dmsetup resume /dev/mapper/thin
				187
				188	Here '1' is the identifier for the volume, a 24-bit number. '0' is the
				189	identifier for the origin device.
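
Because the suspend is not enforced, a script may want to wrap the three
steps so that the origin is always resumed, even if the message fails. One
possible shape is sketched below; snapshot_thin and the DMSETUP override are
assumptions made for this example.

```shell
DMSETUP=${DMSETUP:-dmsetup}   # assumption: override for a dry run

# snapshot_thin <pool device> <origin device> <origin id> <snapshot id>
# Hypothetical wrapper around the suspend / create_snap / resume sequence.
snapshot_thin() {
    pool=$1
    origin_dev=$2
    origin_id=$3
    snap_id=$4
    $DMSETUP suspend "$origin_dev" || return 1
    rc=0
    $DMSETUP message "$pool" 0 "create_snap $snap_id $origin_id" || rc=$?
    # Resume even if create_snap failed, so the origin is not left suspended.
    $DMSETUP resume "$origin_dev" || return 1
    return "$rc"
}
```

For the example above the call would be
snapshot_thin /dev/mapper/pool /dev/mapper/thin 0 1.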
				190
				191	ii) Using an internal snapshot.
				192
				193	Once created, the user doesn't have to worry about any connection
				194	between the origin and the snapshot. Indeed the snapshot is no
				195	different from any other thinly-provisioned device and can be
				196	snapshotted itself via the same method. It's perfectly legal to
				197	have only one of them active, and there's no ordering requirement on
				198	activating or removing them both. (This differs from conventional
				199	device-mapper snapshots.)
				200
				201	Activate it exactly the same way as any other thinly-provisioned volume:
				202
				203	    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
				204
				205	External snapshots
				206	------------------
				207
				208	You can use an external _read only_ device as an origin for a
				209	thinly-provisioned volume. Any read to an unprovisioned area of the
				210	thin device will be passed through to the origin. Writes trigger
				211	the allocation of new blocks as usual.
				212
				213	One use case for this is VM hosts that want to run guests on
				214	thinly-provisioned volumes but have the base image on another device
				215	(possibly shared between many VMs).
				216
				217	You must not write to the origin device if you use this technique!
				218	Of course, you may write to the thin device and take internal snapshots
				219	of the thin volume.
				220
				221	i) Creating a snapshot of an external device
				222
				223	This is the same as creating a thin device.
				224	You don't mention the origin at this stage.
				225
				226	    dmsetup message /dev/mapper/pool 0 "create_thin 0"
				227
				228	ii) Using a snapshot of an external device.
				229
				230	Append an extra parameter to the thin target specifying the origin:
				231
				232	    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"
				233
				234	N.B. All descendants (internal snapshots) of this snapshot require the
				235	same extra origin parameter.
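
To illustrate the note above: if device 1 is an internal snapshot of device
0, both tables carry the same origin. The sketch below only prints the table
lines; thin_table_line is a hypothetical helper, and device id 1 is an
assumed snapshot of device 0.

```shell
# thin_table_line <sectors> <pool device> <dev id> [<external origin dev>]
# Hypothetical helper: prints a 'thin' table line, appending the external
# origin only when one is given.
thin_table_line() {
    sectors=$1
    pool=$2
    id=$3
    origin=${4:-}
    echo "0 $sectors thin $pool $id${origin:+ $origin}"
}

ORIGIN=/dev/image
thin_table_line 2097152 /dev/mapper/pool 0 "$ORIGIN"   # snapshot of the external device
thin_table_line 2097152 /dev/mapper/pool 1 "$ORIGIN"   # assumed internal snapshot of device 0
```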
				236
				237	Deactivation
				238	------------
				239
				240	All devices using a pool must be deactivated before the pool itself
				241	can be.
				242
				243	    dmsetup remove thin
				244	    dmsetup remove snap
				245	    dmsetup remove pool
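
A script can encode this ordering by removing the pool last. A minimal
sketch, assuming a helper named teardown and a DMSETUP variable that can be
set to "echo dmsetup" for a dry run:

```shell
DMSETUP=${DMSETUP:-dmsetup}   # assumption: override for a dry run

# teardown <pool name> <thin name>...
# Hypothetical helper: removes every thin device first, then the pool.
# Stops at the first failure so the pool is never removed while in use.
teardown() {
    pool=$1
    shift
    for dev in "$@"; do
        $DMSETUP remove "$dev" || return 1
    done
    $DMSETUP remove "$pool"
}
```

The sequence above corresponds to teardown pool thin snap.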
				246
				247	Reference
				248	=========
				249
				250	'thin-pool' target
				251	------------------
				252
				253	i) Constructor
				254
				255	    thin-pool <metadata dev> <data dev> <data block size (sectors)> \
				256	              <low water mark (blocks)> [<number of feature args> [<arg>]*]
				257
				258	Optional feature arguments:
				259
				260	skip_block_zeroing: Skip the zeroing of newly-provisioned blocks.
				261
				262	ignore_discard: Disable discard support.
				263
				264	no_discard_passdown: Don't pass discards down to the underlying
				265	data device, but just remove the mapping.
				266
				267	read_only: Don't allow any changes to be made to the pool
				268	metadata. This mode is only available after the
				269	thin-pool has been created and first used in full
				270	read/write mode. It cannot be specified on initial
				271	thin-pool creation.
				272
				273	error_if_no_space: Error IOs, instead of queueing, if no space.
				274
				275	Data block size must be between 64KB (128 sectors) and 1GB
				276	(2097152 sectors) inclusive.
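
The feature arguments are preceded by their count. The sketch below
assembles a complete table line from the constructor arguments; pool_table
is a hypothetical helper and the device names are placeholders.

```shell
# pool_table <length> <metadata dev> <data dev> <block size> <low water> [<feature>...]
# Hypothetical helper: prints a thin-pool table line. When feature
# arguments are given, their count is inserted in front of them.
pool_table() {
    [ "$#" -ge 5 ] || return 1
    length=$1
    metadata_dev=$2
    data_dev=$3
    block_size=$4
    low_water=$5
    shift 5
    line="0 $length thin-pool $metadata_dev $data_dev $block_size $low_water"
    if [ "$#" -gt 0 ]; then
        line="$line $# $*"
    fi
    echo "$line"
}

# /dev/sdb1 and /dev/sdc1 are placeholder device names.
pool_table 20971520 /dev/sdb1 /dev/sdc1 128 32768 skip_block_zeroing error_if_no_space
```

The result ends in "2 skip_block_zeroing error_if_no_space"; with no
feature arguments the count is omitted altogether.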
				277
				278
				279	ii) Status
				280
				281	    <transaction id> <used metadata blocks>/<total metadata blocks>
				282	    <used data blocks>/<total data blocks> <held metadata root>
				283	    ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
				284	    needs_check|- metadata_low_watermark
				285
				286	transaction id:
				287	A 64-bit number used by userspace to help synchronise with metadata
				288	from volume managers.
				289
				290	used data blocks / total data blocks
				291	If the number of free blocks drops below the pool's low water mark a
				292	dm event will be sent to userspace. This event is edge-triggered and
				293	it will occur only once after each resume so volume manager writers
				294	should register for the event and then check the target's status.
				295
				296	held metadata root:
				297	The location, in blocks, of the metadata root that has been
				298	'held' for userspace read access. '-' indicates there is no
				299	held root.
				300
				301	discard_passdown|no_discard_passdown
				302	Whether or not discards are actually being passed down to the
				303	underlying device. When this is enabled when loading the table,
				304	it can get disabled if the underlying device doesn't support it.
				305
				306	ro|rw|out_of_data_space
				307	If the pool encounters certain types of device failures it will
				308	drop into a read-only metadata mode in which no changes to
				309	the pool metadata (like allocating new blocks) are permitted.
				310
				311	In serious cases where even a read-only mode is deemed unsafe
				312	no further I/O will be permitted and the status will just
				313	contain the string 'Fail'. The userspace recovery tools
				314	should then be used.
				315
				316	error_if_no_space|queue_if_no_space
				317	If the pool runs out of data or metadata space, the pool will
				318	either queue or error the IO destined to the data device. The
				319	default is to queue the IO until more space is added or the
				320	'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool
				321	module parameter can be used to change this timeout -- it
				322	defaults to 60 seconds but may be disabled using a value of 0.
				323
				324	needs_check
				325	A metadata operation has failed, resulting in the needs_check
				326	flag being set in the metadata's superblock. The metadata
				327	device must be deactivated and checked/repaired before the
				328	thin-pool can be made fully operational again. '-' indicates
				329	needs_check is not set.
				330
				331	metadata_low_watermark:
				332	Value of metadata low watermark in blocks. The kernel sets this
				333	value internally but userspace needs to know this value to
				334	determine if an event was caused by crossing this threshold.
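
As an illustration of the field order, the sketch below splits a status
string into named shell variables. parse_pool_status and the variable names
are made up for this example, and the sample values are invented. When the
string comes from 'dmsetup status', the leading start, length and target
name columns are assumed to have been removed first.

```shell
# parse_pool_status "<status string>"
# Hypothetical helper: splits the thin-pool status fields described above
# into shell variables. Returns failure if the pool reports 'Fail'.
parse_pool_status() {
    set -- $1          # split the string on whitespace
    if [ "$1" = "Fail" ]; then
        pool_mode=Fail
        return 1
    fi
    transaction_id=$1
    used_meta=${2%/*}
    total_meta=${2#*/}
    used_data=${3%/*}
    total_data=${3#*/}
    held_root=$4
    pool_mode=$5
    discard=$6
    no_space=$7
    needs_check=$8
    meta_low_water=$9
}

# Invented sample values, laid out in the order given above.
parse_pool_status "1 52/4096 100/163840 - rw discard_passdown queue_if_no_space - 1024"
echo "free data blocks: $((total_data - used_data))"
echo "free metadata blocks: $((total_meta - used_meta))"
echo "metadata low watermark: $meta_low_water"
```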
				335
				336	iii) Messages
				337
				338	create_thin <dev id>
				339
				340	Create a new thinly-provisioned device.
				341	<dev id> is an arbitrary unique 24-bit identifier chosen by
				342	the caller.
				343
				344	create_snap <dev id> <origin id>
				345
				346	Create a new snapshot of another thinly-provisioned device.
				347	<dev id> is an arbitrary unique 24-bit identifier chosen by
				348	the caller.
				349	<origin id> is the identifier of the thinly-provisioned device
				350	of which the new device will be a snapshot.
				351
				352	delete <dev id>
				353
				354	Deletes a thin device. Irreversible.
				355
				356	set_transaction_id <current id> <new id>
				357
				358	Userland volume managers, such as LVM, need a way to
				359	synchronise their external metadata with the internal metadata of the
				360	pool target. The thin-pool target offers to store an
				361	arbitrary 64-bit transaction id and return it on the target's
				362	status line. To avoid races you must provide what you think
				363	the current transaction id is when you change it with this
				364	compare-and-swap message.
				365
				366	reserve_metadata_snap
				367
				368	Reserve a copy of the data mapping btree for use by userland.
				369	This allows userland to inspect the mappings as they were when
				370	this message was executed. Use the pool's status command to
				371	get the root block associated with the metadata snapshot.
				372
				373	release_metadata_snap
				374
				375	Release a previously reserved copy of the data mapping btree.
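
As an example of the compare-and-swap style of set_transaction_id, a volume
manager could read the current id from the pool status and pass it back
together with the new one. The sketch below assumes a helper named
bump_transaction_id and a status string whose first field is the
transaction id; if another party changed the id in the meantime, the
message would be expected to fail.

```shell
DMSETUP=${DMSETUP:-dmsetup}   # assumption: override for a dry run

# bump_transaction_id <pool device> "<thin-pool status string>"
# Hypothetical helper: takes the current transaction id from the first
# status field and asks the pool to advance it by one.
bump_transaction_id() {
    pool=$1
    current=${2%% *}
    $DMSETUP message "$pool" 0 "set_transaction_id $current $((current + 1))"
}
```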
				376
				377	'thin' target
				378	-------------
				379
				380	i) Constructor
				381
				382	thin <pool dev> <dev id> [<external origin dev>]
				383
				384	pool dev:
				385	the thin-pool device, e.g. /dev/mapper/my_pool or 253:0
				386
				387	dev id:
				388	the internal device identifier of the device to be
				389	activated.
				390
				391	external origin dev:
				392	an optional block device outside the pool to be treated as a
				393	read-only snapshot origin: reads to unprovisioned areas of the
				394	thin target will be mapped to this device.
				395
				396	The pool doesn't store any size against the thin devices. If you
				397	load a thin target that is smaller than you've been using previously,
				398	then you'll have no access to blocks mapped beyond the end. If you
				399	load a target that is bigger than before, then extra blocks will be
				400	provisioned as and when needed.
				401
				402	ii) Status
				403
				404	<nr mapped sectors> <highest mapped sector>
				405
				406	If the pool has encountered device errors and failed, the status
				407	will just contain the string 'Fail'. The userspace recovery
				408	tools should then be used.
				409
				410	In the case where <nr mapped sectors> is 0, there is no highest
				411	mapped sector and the value of <highest mapped sector> is unspecified.
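
A script reading this status should handle the 'Fail' and zero cases before
using the second field. A minimal sketch, with thin_usage as a hypothetical
helper and invented sample values:

```shell
# thin_usage "<thin status string>"
# Hypothetical helper: summarises "<nr mapped sectors> <highest mapped sector>".
# The second field is ignored when nothing is mapped, since its value is
# unspecified in that case.
thin_usage() {
    set -- $1          # split the string on whitespace
    case $1 in
        Fail)
            echo "pool failed: use the userspace recovery tools"
            return 1
            ;;
        0)
            echo "nothing mapped"
            ;;
        *)
            echo "$1 sectors mapped, highest mapped sector $2"
            ;;
    esac
}

thin_usage "2048 4095"
thin_usage "0 -"
```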