				Block IO Controller
				===================
Overview
========
cgroup subsys "blkio" implements the block IO controller. There is a need
for various kinds of IO control policies (like proportional BW, max BW)
both at leaf nodes as well as at intermediate nodes in a storage hierarchy.
The plan is to use the same cgroup based management interface for the blkio
controller and, based on user options, switch IO policies in the background.

Currently two IO control policies are implemented. The first one is a
proportional weight, time based division of disk policy. It is implemented
in CFQ, hence this policy takes effect only on leaf nodes when CFQ is being
used. The second one is a throttling policy which can be used to specify
upper IO rate limits on devices. This policy is implemented in the generic
block layer and can be used on leaf nodes as well as on higher level logical
devices like device mapper.

HOWTO
=====
Proportional Weight division of bandwidth
-----------------------------------------
You can do a very simple test by running two dd threads in two different
cgroups. Here is what you can do.

- Enable Block IO controller
	CONFIG_BLK_CGROUP=y

- Enable group scheduling in CFQ
	CONFIG_CFQ_GROUP_IOSCHED=y

- Compile and boot into the kernel and mount the IO controller (blkio); see
  cgroups.txt, Why are cgroups needed?.

	mount -t tmpfs cgroup_root /sys/fs/cgroup
	mkdir /sys/fs/cgroup/blkio
	mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

- Create two cgroups
	mkdir -p /sys/fs/cgroup/blkio/test1/ /sys/fs/cgroup/blkio/test2

- Set weights of group test1 and test2
	echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight
	echo 500 > /sys/fs/cgroup/blkio/test2/blkio.weight

- Create two files of the same size (say 512MB each) on the same disk
  (file1, file2) and launch two dd threads in different cgroups to read
  those files.

	sync
	echo 3 > /proc/sys/vm/drop_caches

	dd if=/mnt/sdb/zerofile1 of=/dev/null &
	echo $! > /sys/fs/cgroup/blkio/test1/tasks
	cat /sys/fs/cgroup/blkio/test1/tasks

	dd if=/mnt/sdb/zerofile2 of=/dev/null &
	echo $! > /sys/fs/cgroup/blkio/test2/tasks
	cat /sys/fs/cgroup/blkio/test2/tasks

- At macro level, the first dd should finish first. To get more precise data,
  keep looking (with the help of a script, like the sketch below) at the
  blkio.time and blkio.sectors files of both the test1 and test2 groups.
  These tell how much disk time (in milliseconds) each group got and how
  many sectors each group dispatched to the disk. We provide fairness in
  terms of disk time, so ideally blkio.time of the cgroups should be in
  proportion to the weight.
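
  For example, a minimal watch loop (a sketch, assuming the test1/test2
  cgroups created above):

	while true; do
		grep "" /sys/fs/cgroup/blkio/test{1,2}/blkio.time
		sleep 1
	done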

Throttling/Upper Limit policy
-----------------------------
- Enable Block IO controller
	CONFIG_BLK_CGROUP=y

- Enable throttling in block layer
	CONFIG_BLK_DEV_THROTTLING=y

- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)
        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

- Specify a bandwidth rate on a particular device for the root group. The
  format for the policy is "<major>:<minor>  <bytes_per_second>".

        echo "8:16  1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device

  Above will put a limit of 1MB/second on reads happening for the root group
  on the device having major/minor number 8:16 (see the end of this section
  for a way to find a device's major/minor numbers).

- Run dd to read a file and see if the rate is throttled to 1MB/s or not.

        # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
        1024+0 records in
        1024+0 records out
        4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s

  Limits for writes can be set using the blkio.throttle.write_bps_device file.
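
  To find the major:minor numbers of a device, one option is (the device
  name /dev/sdb here is just an example):

	ls -l /dev/sdb
	lsblk -o NAME,MAJ:MIN /dev/sdb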

Hierarchical Cgroups
====================

Both CFQ and throttling implement hierarchy support; however,
throttling's hierarchy support is enabled iff "sane_behavior" is
enabled from the cgroup side, which currently is a development option and
not publicly available.

Suppose somebody created a hierarchy as follows.

			root
			/  \
		     test1 test2
			|
		     test3

CFQ by default and throttling with "sane_behavior" will handle the
hierarchy correctly.  For details on CFQ hierarchy support, refer to
Documentation/block/cfq-iosched.txt.  For throttling, all limits apply
to the whole subtree while all statistics are local to the IOs
directly generated by tasks in that cgroup.
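
Such a hierarchy maps to nested directories under the blkio mount point.
A minimal sketch, assuming the mount point used in the HOWTO above:

	mkdir -p /sys/fs/cgroup/blkio/test1/test3
	mkdir /sys/fs/cgroup/blkio/test2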

Throttling without "sane_behavior" enabled from the cgroup side will
practically treat all groups as being at the same level, as if the
hierarchy looked like the following.

				pivot
			     /  /   \  \
			root  test1 test2  test3

Various user visible config options
===================================
CONFIG_BLK_CGROUP
	- Block IO controller.

CONFIG_DEBUG_BLK_CGROUP
	- Debug help. Right now some additional stats files show up in the
	  cgroup if this option is enabled.

CONFIG_CFQ_GROUP_IOSCHED
	- Enables group scheduling in CFQ. Currently only 1 level of group
	  creation is allowed.

CONFIG_BLK_DEV_THROTTLING
	- Enable block device throttling support in the block layer.

Details of cgroup files
=======================
Proportional weight policy files
--------------------------------
- blkio.weight
	- Specifies the per cgroup weight. This is the default weight of the
	  group on all devices unless overridden by a per device rule
	  (see blkio.weight_device).
	  Currently the allowed range of weights is from 10 to 1000.
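
	  For example, to give this cgroup the maximum weight:

	  # echo 1000 > blkio.weight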

- blkio.weight_device
	- One can specify per cgroup per device rules using this interface.
	  These rules override the default value of group weight as specified
	  by blkio.weight.

	  Following is the format.

	  # echo dev_maj:dev_minor weight > blkio.weight_device
	  Configure weight=300 on /dev/sdb (8:16) in this cgroup
	  # echo 8:16 300 > blkio.weight_device
	  # cat blkio.weight_device
	  dev     weight
	  8:16    300

	  Configure weight=500 on /dev/sda (8:0) in this cgroup
	  # echo 8:0 500 > blkio.weight_device
	  # cat blkio.weight_device
	  dev     weight
	  8:0     500
	  8:16    300

	  Remove specific weight for /dev/sda in this cgroup
	  # echo 8:0 0 > blkio.weight_device
	  # cat blkio.weight_device
	  dev     weight
	  8:16    300

- blkio.leaf_weight[_device]
	- Equivalents of blkio.weight[_device] for the purpose of
	  deciding how much weight tasks in the given cgroup have while
	  competing with the cgroup's child cgroups. For details,
	  please refer to Documentation/block/cfq-iosched.txt.

- blkio.time
	- Disk time allocated to the cgroup per device in milliseconds. The
	  first two fields specify the major and minor number of the device
	  and the third field specifies the disk time allocated to the group
	  in milliseconds.
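
	  For example (values illustrative):

	  # cat blkio.time
	  8:16 2778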

- blkio.sectors
	- Number of sectors transferred to/from the disk by the group. The
	  first two fields specify the major and minor number of the device
	  and the third field specifies the number of sectors transferred by
	  the group to/from the device.

- blkio.io_service_bytes
	- Number of bytes transferred to/from the disk by the group. These
	  are further divided by the type of operation - read or write, sync
	  or async. The first two fields specify the major and minor number of
	  the device, the third field specifies the operation type and the
	  fourth field specifies the number of bytes.
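
	  For example (values illustrative):

	  # cat blkio.io_service_bytes
	  8:16 Read 1310720
	  8:16 Write 0
	  8:16 Sync 1310720
	  8:16 Async 0
	  8:16 Total 1310720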

- blkio.io_serviced
	- Number of IOs (bio) issued to the disk by the group. These
	  are further divided by the type of operation - read or write, sync
	  or async. The first two fields specify the major and minor number of
	  the device, the third field specifies the operation type and the
	  fourth field specifies the number of IOs.

- blkio.io_service_time
	- Total amount of time between request dispatch and request completion
	  for the IOs done by this cgroup. This is in nanoseconds to make it
	  meaningful for flash devices too. For devices with a queue depth of
	  1, this time represents the actual service time. When queue_depth >
	  1, that is no longer true as requests may be served out of order.
	  This may cause the service time for a given IO to include the service
	  time of multiple IOs when served out of order, which may result in
	  total io_service_time > actual time elapsed. This time is further
	  divided by the type of operation - read or write, sync or async. The
	  first two fields specify the major and minor number of the device,
	  the third field specifies the operation type and the fourth field
	  specifies the io_service_time in ns.

- blkio.io_wait_time
	- Total amount of time the IOs for this cgroup spent waiting in the
	  scheduler queues for service. This can be greater than the total time
	  elapsed since it is the cumulative io_wait_time for all IOs. It is
	  not a measure of total time the cgroup spent waiting but rather a
	  measure of the wait_time of its individual IOs. For devices with
	  queue_depth > 1, this metric does not include the time spent waiting
	  for service once the IO is dispatched to the device but till it
	  actually gets serviced (there might be a time lag here due to
	  re-ordering of requests by the device). This is in nanoseconds to
	  make it meaningful for flash devices too. This time is further
	  divided by the type of operation - read or write, sync or async. The
	  first two fields specify the major and minor number of the device,
	  the third field specifies the operation type and the fourth field
	  specifies the io_wait_time in ns.

- blkio.io_merged
	- Total number of bios/requests merged into requests belonging to this
	  cgroup. This is further divided by the type of operation - read or
	  write, sync or async.

- blkio.io_queued
	- Total number of requests queued up at any given instant for this
	  cgroup. This is further divided by the type of operation - read or
	  write, sync or async.


- blkio.avg_queue_size
	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
	  The average queue size for this cgroup over the entire time of this
	  cgroup's existence. Queue size samples are taken each time one of the
	  queues of this cgroup gets a timeslice.

- blkio.group_wait_time
	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
	  This is the amount of time the cgroup had to wait since it became
	  busy (i.e., went from 0 to 1 request queued) to get a timeslice for
	  one of its queues. This is different from io_wait_time, which is the
	  cumulative total of the amount of time spent by each IO in that
	  cgroup waiting in the scheduler queue. This is in nanoseconds. If
	  this is read when the cgroup is in a waiting (for timeslice) state,
	  the stat will only report the group_wait_time accumulated till the
	  last time it got a timeslice and will not include the current delta.

- blkio.empty_time
	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
	  This is the amount of time a cgroup spends without any pending
	  requests when not being served, i.e., it does not include any time
	  spent idling for one of the queues of the cgroup. This is in
	  nanoseconds. If this is read when the cgroup is in an empty state,
	  the stat will only report the empty_time accumulated till the last
	  time it had a pending request and will not include the current delta.

- blkio.idle_time
	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y.
	  This is the amount of time spent by the IO scheduler idling for a
	  given cgroup in anticipation of a better request than the existing
	  ones from other queues/cgroups. This is in nanoseconds. If this is
	  read when the cgroup is in an idling state, the stat will only report
	  the idle_time accumulated till the last idle period and will not
	  include the current delta.

- blkio.dequeue
	- Debugging aid only enabled if CONFIG_DEBUG_BLK_CGROUP=y. This
	  gives statistics about how many times a group was dequeued
	  from the service tree of the device. The first two fields specify
	  the major and minor number of the device and the third field
	  specifies the number of times a group was dequeued from a particular
	  device.

- blkio.*_recursive
	- Recursive version of various stats. These files show the
	  same information as their non-recursive counterparts but
	  include stats from all the descendant cgroups.

Throttling/Upper limit policy files
-----------------------------------
- blkio.throttle.read_bps_device
	- Specifies upper limit on READ rate from the device. IO rate is
	  specified in bytes per second. Rules are per device. Following is
	  the format.

  echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device

- blkio.throttle.write_bps_device
	- Specifies upper limit on WRITE rate to the device. IO rate is
	  specified in bytes per second. Rules are per device. Following is
	  the format.

  echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device

- blkio.throttle.read_iops_device
	- Specifies upper limit on READ rate from the device. IO rate is
	  specified in IOs per second. Rules are per device. Following is
	  the format.

  echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device

- blkio.throttle.write_iops_device
	- Specifies upper limit on WRITE rate to the device. IO rate is
	  specified in IOs per second. Rules are per device. Following is
	  the format.

  echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device

Note: If both BW and IOPS rules are specified for a device, then IO is
      subjected to both the constraints.
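
      For example, to cap reads on device 8:16 at both 1MB/s and 100 IOPS
      for a given cgroup (values illustrative):

	echo "8:16  1048576" > blkio.throttle.read_bps_device
	echo "8:16  100" > blkio.throttle.read_iops_device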

- blkio.throttle.io_serviced
	- Number of IOs (bio) issued to the disk by the group. These
	  are further divided by the type of operation - read or write, sync
	  or async. The first two fields specify the major and minor number of
	  the device, the third field specifies the operation type and the
	  fourth field specifies the number of IOs.

- blkio.throttle.io_service_bytes
	- Number of bytes transferred to/from the disk by the group. These
	  are further divided by the type of operation - read or write, sync
	  or async. The first two fields specify the major and minor number of
	  the device, the third field specifies the operation type and the
	  fourth field specifies the number of bytes.

Common files among various policies
-----------------------------------
- blkio.reset_stats
	- Writing an int to this file will result in resetting all the stats
	  for that cgroup.
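
	  For example:

	  # echo 1 > blkio.reset_stats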

CFQ sysfs tunable
=================
/sys/block/<disk>/queue/iosched/slice_idle
------------------------------------------
On faster hardware CFQ can be slow, especially with sequential workloads.
This happens because CFQ idles on a single queue, and a single queue might
not drive deep enough request queue depths to keep the storage busy. In such
scenarios one can try setting slice_idle=0, which switches CFQ to IOPS
(IO operations per second) mode on NCQ-supporting hardware.

That means CFQ will not idle between cfq queues of a cfq group and hence be
able to drive higher queue depths and achieve better throughput. That also
means that cfq provides fairness among groups in terms of IOPS and not in
terms of disk time.
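
For example (the disk name sdb here is just an example):

	echo 0 > /sys/block/sdb/queue/iosched/slice_idle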

/sys/block/<disk>/queue/iosched/group_idle
------------------------------------------
If one disables idling on individual cfq queues and cfq service trees by
setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
on the group in an attempt to provide fairness among groups.

By default group_idle is the same as slice_idle and does not do anything if
slice_idle is enabled.

One can experience an overall throughput drop if one has created multiple
groups and put applications in those groups which are not driving enough
IO to keep the disk busy. In that case set group_idle=0, and CFQ will not
idle on individual groups and throughput should improve.
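
For example (again, sdb is just an example disk name):

	echo 0 > /sys/block/sdb/queue/iosched/group_idle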