Central, scheduler-driven, power-performance control
(EXPERIMENTAL)

Abstract
========

The topic of a single simple power-performance tunable, one that is wholly
scheduler-centric and has well-defined and predictable properties, has come up
on several occasions in the past [1,2]. With techniques such as scheduler-driven
DVFS [3], we now have a good framework for implementing such a tunable. This
document describes the overall ideas behind its design and implementation.

Table of Contents
=================

1. Motivation
2. Introduction
3. Signal Boosting Strategy
4. OPP selection using boosted CPU utilization
5. Per task group boosting
6. Per-task wakeup-placement-strategy Selection
7. Questions and Answers
   - What about "auto" mode?
   - How are multiple groups of tasks with different boost values managed?
8. References

1. Motivation
=============

Sched-DVFS [3] was a new, event-driven cpufreq governor which allows the
scheduler to select the optimal DVFS operating point (OPP) for running a task
allocated to a CPU. Later, the cpufreq maintainers introduced a similar
governor, schedutil. The introduction of schedutil also enables running
workloads at the most energy-efficient OPPs.

However, it is sometimes desirable to intentionally boost the performance of
a workload, even if that implies a reasonable increase in energy consumption.
For example, in order to reduce the response time of a task, we may want to
run the task at a higher OPP than the one actually required by its CPU
bandwidth demand.

This last requirement is especially important if we consider that one of the
main goals of the utilization-driven governor component is to replace all
currently available CPUFreq policies. Since sched-DVFS and schedutil are event
based, as opposed to the sampling-driven governors we currently have, they are
already more responsive at selecting the optimal OPP to run tasks allocated to
a CPU. However, just tracking the actual task load demand may not be enough
from a performance standpoint. For example, it is not possible to get
behaviors similar to those provided by the "performance" and "interactive"
CPUFreq governors.

This document describes an implementation of a tunable, stacked on top of the
utilization-driven governors, which extends their functionality to support
task performance boosting.

By "performance boosting" we mean the reduction of the time required to
complete a task activation, i.e. the time elapsed from a task wakeup to its
next deactivation (e.g. because it goes back to sleep or it terminates). For
example, if we consider a simple periodic task which executes the same
workload for 5[s] every 20[s] while running at a certain OPP, a boosted
execution of that task must complete each of its activations in less than
5[s].

A previous attempt [5] to introduce such a boosting feature was not
successful, mainly because of the complexity of the proposed solution.
Previous versions of the approach described in this document exposed a single
simple interface to user-space. This single tunable knob allowed the tuning of
system-wide scheduler behaviours, ranging from energy efficiency at one end
through to incremental performance boosting at the other. This first tunable
affected all tasks. However, that is not useful for Android products, so in
this version only a more advanced extension of the concept is provided, which
uses CGroups to boost the performance of only selected tasks while using the
energy-efficient default for all others.

The rest of this document introduces in more detail the proposed solution,
which has been named SchedTune.

2. Introduction
===============

SchedTune exposes a simple user-space interface provided through a new
CGroup controller 'stune' which provides two power-performance tunables
per group:

  /<stune cgroup mount point>/schedtune.prefer_idle
  /<stune cgroup mount point>/schedtune.boost

The CGroup implementation permits arbitrary, user-space-defined task
classification to tune the scheduler for different goals depending on the
specific nature of the task, e.g. background vs. interactive vs. low-priority.

More details are given in section 5.

2.1 Boosting
============

The boost value is expressed as an integer in the range [-100..0..100].

A value of 0 (default) configures the CFS scheduler for maximum energy
efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
required to satisfy their workload demand.

A value of 100 configures the scheduler for maximum performance, which
translates to the selection of the maximum OPP on that CPU.

A value of -100 configures the scheduler for minimum performance, which
translates to the selection of the minimum OPP on that CPU.

Values between -100 and 100 can be set to suit other scenarios, for example
to satisfy interactive response requirements, or depending on other system
events (battery level, etc.).

The overall design of the SchedTune module is built on top of "Per-Entity Load
Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
Performance Point (OPP) selection.

Each time a task is allocated on a CPU, cpufreq is given the opportunity to
tune the operating frequency of that CPU to better match the workload demand.
The selection of the actual OPP being activated is influenced by the boost
value for the task CGroup.

This simple biasing approach leverages existing frameworks, which means minimal
modifications to the scheduler, and yet it allows a range of different
behaviours to be achieved, all from a single simple tunable knob.

In EAS schedulers, boosted task and CPU utilization is used for energy
calculation and energy-aware task placement.

2.2 prefer_idle
===============

This is a flag which indicates to the scheduler whether userspace would like
the scheduler to focus on energy or on performance.

A value of 0 (default) signals to the CFS scheduler that tasks in this group
can be placed according to the energy-aware wakeup strategy.

A value of 1 signals to the CFS scheduler that tasks in this group should be
placed to minimise wakeup latency.

The flag is combined with the boost value: task placement will not be boost
aware, but CPU OPP selection is still boost aware.

Android platforms typically use this flag for application tasks which the
user is currently interacting with.


3. Signal Boosting Strategy
===========================

The whole PELT machinery works based on the value of a few load tracking
signals which basically track the CPU bandwidth requirements for tasks and
the capacity of CPUs. The basic idea behind the SchedTune knob is to
artificially inflate some of these load tracking signals to make a task or RQ
appear more demanding than it actually is.

Which signals have to be inflated depends on the specific "consumer". However,
independently from the specific (signal, consumer) pair, it is important to
define a simple and possibly consistent strategy for the concept of boosting a
signal.

A boosting strategy defines how the "abstract" user-space defined
sched_cfs_boost value is translated into an internal "margin" value to be
added to a signal to get its inflated value:

  margin         := boosting_strategy(sched_cfs_boost, signal)
  boosted_signal := signal + margin

Different boosting strategies were identified and analyzed before selecting
the one found to be most effective.

Signal Proportional Compensation (SPC)
--------------------------------------

In this boosting strategy the sched_cfs_boost value is used to compute a
margin which is proportional to the complement of the original signal.
When a signal has a maximum possible value, its complement is defined as
the delta between the actual value and its possible maximum.

Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
the maximum possible value, the margin becomes:

  margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)

Using this boosting strategy:
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
- each value in the range of sched_cfs_boost effectively inflates the signal
  in question by a quantity which is proportional to the maximum value.

For example, by applying the SPC boosting strategy to the selection of the OPP
to run a task it is possible to achieve these behaviors:

- 0% boosting:   run the task at the minimum OPP required by its workload
- 100% boosting: run the task at the maximum OPP available for the CPU
- 50% boosting:  run at the half-way OPP between minimum and maximum

This means that, at 50% boosting, a task will be scheduled to run at half of
the maximum theoretically achievable performance on the specific target
platform.

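The SPC strategy above can be sketched in a few lines of illustrative
user-space Python (not kernel code). SCHED_LOAD_SCALE is assumed to be 1024,
the usual PELT scale, and since sched_cfs_boost is a percentage the sketch
divides by 100; negative boost values are handled differently in practice and
are left out of this sketch.

```python
# Sketch of the SPC boosting strategy (illustrative only).
SCHED_LOAD_SCALE = 1024  # assumed maximum signal value

def spc_margin(boost_pct: int, signal: int) -> int:
    """Margin proportional to the complement of the signal.

    boost_pct is a sched_cfs_boost value in [0..100];
    signal is a utilization value in [0..SCHED_LOAD_SCALE].
    """
    return boost_pct * (SCHED_LOAD_SCALE - signal) // 100

def boosted(boost_pct: int, signal: int) -> int:
    return signal + spc_margin(boost_pct, signal)

# 0% boost leaves the signal unchanged, 100% saturates it, and
# 50% lands midway between the signal and the upper bound.
print(boosted(0, 300))    # 300
print(boosted(100, 300))  # 1024
print(boosted(50, 300))   # 662
```

Note how, for any starting signal, a 50% boost always closes half of the
remaining headroom to SCHED_LOAD_SCALE, which is exactly the "midway" property
shown in the figure below.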
A graphical representation of an SPC boosted signal is shown in the
following figure, where:
 a) "-" represents the original signal
 b) "b" represents a  50% boosted signal
 c) "p" represents a 100% boosted signal


   ^
   |  SCHED_LOAD_SCALE
   +-----------------------------------------------------------------+
   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
   |
   |                                             boosted_signal
   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
   |
   |                                            original signal
   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
   |                                          |
   |bbbbbbbbbbbbbbbbbb                        |
   |                                          |
   |                                          |
   |                                          |
   |                  +-----------------------+
   |                  |
   |                  |
   |                  |
   |------------------+
   |
   |
   +----------------------------------------------------------------------->

The plot above shows a ramped load signal (titled 'original_signal') and its
boosted equivalent. For each step of the original signal the boosted signal
corresponding to a 50% boost is midway between the original signal and the
upper bound. Boosting by 100% generates a boosted signal which is always
saturated to the upper bound.


4. OPP selection using boosted CPU utilization
==============================================

It is worth calling out that the implementation does not introduce any new
load signals. Instead, it provides an API to tune existing signals. This
tuning is done on demand and only in scheduler code paths where it is
sensible to do so. The new API calls are defined to return either the default
signal or a boosted one, depending on the value of sched_cfs_boost. This is a
clean and non-invasive modification of the existing code paths.

The signal representing a CPU's utilization is boosted according to the
previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
(i.e. a CFS run-queue) to appear more used than it actually is.

Thus, with sched_cfs_boost enabled, we have the following main functions to
get the current utilization of a CPU:

  cpu_util()
  boosted_cpu_util()

The new boosted_cpu_util() is similar to the first but returns a boosted
utilization signal which is a function of the sched_cfs_boost value.

This function is used in the CFS scheduler code paths where sched-DVFS needs
to decide the OPP at which to run a CPU. For example, this allows selecting
the highest OPP for a CPU which has the boost value set to 100%.

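To make the effect of boosted utilization on OPP selection concrete, here is
a minimal user-space sketch in the spirit of a schedutil-like util-to-frequency
mapping. The function names, the discrete OPP table, and the ~25% headroom
factor are assumptions of this sketch, not the kernel's actual code.

```python
# Illustrative OPP selection from a boosted CPU utilization (not kernel code).
SCHED_LOAD_SCALE = 1024  # assumed maximum utilization value

def boosted_cpu_util(cpu_util: int, boost_pct: int) -> int:
    # SPC-boosted utilization, as in section 3 (non-negative boost only).
    return cpu_util + boost_pct * (SCHED_LOAD_SCALE - cpu_util) // 100

def next_opp(available_freqs, cpu_util: int, boost_pct: int) -> int:
    """Pick the lowest OPP whose capacity covers the boosted utilization."""
    util = boosted_cpu_util(cpu_util, boost_pct)
    max_freq = max(available_freqs)
    # Add ~25% headroom so the CPU does not run at 100% of the chosen OPP.
    target = max_freq * util * 5 // (4 * SCHED_LOAD_SCALE)
    for f in sorted(available_freqs):
        if f >= target:
            return f
    return max_freq

freqs = [500_000, 1_000_000, 1_500_000, 2_000_000]  # hypothetical OPPs in kHz
# A 100% boost saturates utilization, so the highest OPP is always selected;
# with no boost, a small utilization stays on the lowest OPP.
print(next_opp(freqs, 100, 100))  # 2000000
print(next_opp(freqs, 100, 0))    # 500000
```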
5. Per task group boosting
==========================

On battery-powered devices there are usually many background services which
are long running and need energy-efficient scheduling. On the other hand,
some applications are more performance sensitive and require an interactive
response and/or maximum performance, regardless of the energy cost.

To better service such scenarios, the SchedTune implementation has an
extension that provides a more fine-grained boosting interface.

A new CGroup controller, namely "schedtune", can be enabled which allows
task groups with different boosting values to be defined and configured.
Tasks that require special performance can be put into separate CGroups.
The value of the boost associated with the tasks in this group can be
specified using a single knob exposed by the CGroup controller:

  schedtune.boost

This knob allows the definition of a boost value that is to be used for
SPC boosting of all tasks attached to this group.

The current schedtune controller implementation is really simple and has
these main characteristics:

1) It is only possible to create hierarchies one level deep.

   The root control group defines the system-wide boost value to be applied
   by default to all tasks. Its direct subgroups are named "boost groups" and
   they define the boost value for a specific set of tasks.
   Further nested subgroups are not allowed since they do not have a sensible
   meaning from a user-space standpoint.

2) It is possible to define only a limited number of "boost groups".

   This number is defined at compile time and by default configured to 16.
   This is a design decision motivated by two main reasons:
   a) In a real system we do not expect utilization scenarios with more than
      a few boost groups. For example, a reasonable collection of groups
      could be just "background", "interactive" and "performance".
   b) It simplifies the implementation considerably, especially for the code
      which has to compute the per-CPU boosting once there are multiple
      RUNNABLE tasks with different boost values.

Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks.
Moreover, this interface can be easily integrated by user-space run-times
(e.g. Android, ChromeOS) to implement a QoS solution for task boosting based
on task classification, which has been a long-standing requirement.

Setup and usage
---------------

0. Use a kernel with CONFIG_SCHED_TUNE support enabled.

1. Check that the "schedtune" CGroup controller is available:

   root@linaro-nano:~# cat /proc/cgroups
   #subsys_name  hierarchy  num_cgroups  enabled
   cpuset        0          1            1
   cpu           0          1            1
   schedtune     0          1            1

2. Mount a tmpfs to create the CGroups mount point (optional):

   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup

3. Mount the "schedtune" controller:

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune

4. Create task groups and configure their specific boost value (optional).

   For example, here we create a "performance" boost group configured to
   boost all its tasks to 100%:

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost

5. Move tasks into the boost group.

   For example, the following moves the task with PID $TASKPID (and all its
   threads) into the "performance" boost group:

   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs

This simple configuration allows only the threads of the $TASKPID task to run,
when needed, at the highest OPP on the most capable CPU of the system.

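A user-space run-time could perform steps 4 and 5 programmatically rather
than from a shell. The sketch below shows one way to do that; the default
mount point mirrors the shell example above, the function names are this
sketch's own, and error handling is deliberately elided.

```python
# Illustrative helper for configuring schedtune boost groups (sketch only).
import os

STUNE_ROOT = "/sys/fs/cgroup/stune"  # assumed stune mount point

def set_boost(group: str, boost: int, root: str = STUNE_ROOT) -> None:
    """Create a boost group (if needed) and write its schedtune.boost value."""
    os.makedirs(os.path.join(root, group), exist_ok=True)
    with open(os.path.join(root, group, "schedtune.boost"), "w") as f:
        f.write(str(boost))

def move_task(group: str, pid: int, root: str = STUNE_ROOT) -> None:
    """Move a task (thread group leader PID) into a boost group."""
    # Writing the PID to cgroup.procs moves the whole thread group.
    with open(os.path.join(root, group, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```

For example, `set_boost("performance", 100)` followed by
`move_task("performance", taskpid)` reproduces steps 4 and 5 above.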
6. Per-task wakeup-placement-strategy Selection
===============================================

Many devices have a number of CFS tasks in use which require an absolute
minimum wakeup latency, and many tasks for which wakeup latency is not
important.

For touch-driven environments, removing additional wakeup latency can be
critical.

When you use the SchedTune CGroup controller, you have access to a second
parameter which allows a group to be marked such that energy-aware task
placement is bypassed for tasks belonging to that group.

  prefer_idle=0 (default - use energy-aware task placement if available)
  prefer_idle=1 (never use energy-aware task placement for these tasks)

Since the regular wakeup task placement algorithm in CFS is biased for
performance, this has the effect of restoring minimum wakeup latency
for the desired tasks whilst still allowing energy-aware wakeup placement
to save energy for other tasks.

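The effect of the flag on wakeup placement can be caricatured with a
deliberately minimal sketch. The real CFS/EAS selection logic weighs
capacity, energy, and latency in much more detail; the function and parameter
names here are this sketch's own.

```python
# Minimal caricature of prefer_idle-biased wakeup placement (not kernel code).
def select_cpu(prefer_idle: bool, idle_cpus: set, energy_aware_choice: int) -> int:
    """Pick a wakeup CPU for a task."""
    if prefer_idle and idle_cpus:
        # prefer_idle=1: bypass energy-aware placement and take an idle
        # CPU immediately to minimise wakeup latency.
        return min(idle_cpus)
    # prefer_idle=0 (or no idle CPU): defer to the energy-aware decision.
    return energy_aware_choice

# With CPUs 2 and 3 idle, a prefer_idle task goes straight to an idle CPU,
# while a default task follows the energy-aware choice (here CPU 0).
print(select_cpu(True, {2, 3}, 0))   # 2
print(select_cpu(False, {2, 3}, 0))  # 0
```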
7. Questions and Answers
========================

What about "auto" mode?
-----------------------

The "auto" mode as described in [5] can be implemented by interfacing
SchedTune with some suitable user-space element. This element could use the
exposed system-wide or cgroup-based interface.

How are multiple groups of tasks with different boost values managed?
---------------------------------------------------------------------

The current SchedTune implementation keeps track of the boosted RUNNABLE
tasks on a CPU. The CPU utilization seen by the scheduler-driven cpufreq
governors (and used to select an appropriate OPP) is boosted with a value
which is the maximum of the boost values of the currently RUNNABLE tasks in
its RQ.

This allows cpufreq to boost a CPU only while there are boosted tasks ready
to run, and to switch back to the energy-efficient mode as soon as the last
boosted task is dequeued.

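The max-aggregation rule described above can be sketched in one line
(illustrative only; the kernel tracks this per boost group rather than per
task, and the function name here is this sketch's own):

```python
# Sketch of per-CPU boost aggregation: the boost applied to a CPU is the
# maximum boost among its currently RUNNABLE tasks.
def cpu_boost(runnable_task_boosts) -> int:
    return max(runnable_task_boosts, default=0)

# Two runnable tasks from groups boosted at 10% and 60%: the CPU runs
# boosted at 60% until the 60% task is dequeued, then drops to 10%;
# with no boosted runnable tasks the CPU falls back to the default (0).
print(cpu_boost([10, 60]))  # 60
print(cpu_boost([10]))      # 10
print(cpu_boost([]))        # 0
```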
8. References
=============
[1] http://lwn.net/Articles/552889
[2] http://lkml.org/lkml/2012/5/18/91
[3] http://lkml.org/lkml/2015/6/26/620