             Central, scheduler-driven, power-performance control
                               (EXPERIMENTAL)

Abstract
========

The topic of a single simple power-performance tunable, that is wholly
scheduler centric and has well defined and predictable properties, has come up
on several occasions in the past [1,2]. With techniques such as scheduler
driven DVFS [3], we now have a good framework for implementing such a tunable.
This document describes the overall ideas behind its design and implementation.


Table of Contents
=================

1. Motivation
2. Introduction
3. Signal Boosting Strategy
4. OPP selection using boosted CPU utilization
5. Per task group boosting
6. Per-task wakeup-placement-strategy Selection
7. Questions and Answers
   - What about "auto" mode?
   - How are multiple groups of tasks with different boost values managed?
8. References


1. Motivation
=============

Sched-DVFS [3] was a new event-driven cpufreq governor which allows the
scheduler to select the optimal DVFS operating point (OPP) for running a task
allocated to a CPU. Later, the cpufreq maintainers introduced a similar
governor, schedutil. The introduction of schedutil also enables running
workloads at the most energy efficient OPPs.

However, sometimes it may be desirable to intentionally boost the performance
of a workload even if that implies a reasonable increase in energy
consumption. For example, in order to reduce the response time of a task, we
may want to run the task at a higher OPP than the one actually required by
its CPU bandwidth demand.

This last requirement is especially important if we consider that one of the
main goals of the utilization-driven governor component is to replace all
currently available CPUFreq policies. Since sched-DVFS and schedutil are event
based, as opposed to the sampling driven governors we currently have, they are
already more responsive at selecting the optimal OPP to run tasks allocated to
a CPU. However, just tracking the actual task load demand may not be enough
from a performance standpoint. For example, it is not possible to get
behaviors similar to those provided by the "performance" and "interactive"
CPUFreq governors.

This document describes an implementation of a tunable, stacked on top of the
utilization-driven governors, which extends their functionality to support
task performance boosting.

By "performance boosting" we mean the reduction of the time required to
complete a task activation, i.e. the time elapsed from a task wakeup to its
next deactivation (e.g. because it goes back to sleep or it terminates). For
example, if we consider a simple periodic task which executes the same workload
for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
that task must complete each of its activations in less than 5[s].

A previous attempt [5] to introduce such a boosting feature was not
successful, mainly because of the complexity of the proposed solution. Previous
versions of the approach described in this document exposed a single simple
interface to user-space. This single tunable knob allowed the tuning of
system-wide scheduler behaviours ranging from energy efficiency at one end
through to incremental performance boosting at the other end. This first
tunable affected all tasks. However, that is not useful for Android products,
so in this version only a more advanced extension of the concept is provided,
which uses CGroups to boost the performance of only selected tasks while using
the energy efficient default for all others.

The rest of this document introduces in more detail the proposed solution,
which has been named SchedTune.


2. Introduction
===============

SchedTune exposes a simple user-space interface provided through a new
CGroup controller 'stune' which provides two power-performance tunables
per group:

  /<stune cgroup mount point>/schedtune.prefer_idle
  /<stune cgroup mount point>/schedtune.boost

The CGroup implementation permits arbitrary user-space defined task
classification to tune the scheduler for different goals depending on the
specific nature of the task, e.g. background vs interactive vs low-priority.

More details are given in section 5.

2.1 Boosting
============

The boost value is expressed as an integer in the range [-100..0..100].

A value of 0 (default) configures the CFS scheduler for maximum energy
efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
required to satisfy their workload demand.

A value of 100 configures the scheduler for maximum performance, which
translates to the selection of the maximum OPP on that CPU.

A value of -100 configures the scheduler for minimum performance, which
translates to the selection of the minimum OPP on that CPU.

Values between -100 and 100 can be set to suit other scenarios, for example
to satisfy interactive response requirements or to react to other system
events (battery level, etc.).

The overall design of the SchedTune module is built on top of "Per-Entity Load
Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
Performance Point (OPP) selection.

Each time a task is allocated on a CPU, cpufreq is given the opportunity to
tune the operating frequency of that CPU to better match the workload demand.
The selection of the actual OPP being activated is influenced by the boost
value for the task CGroup.

This simple biasing approach leverages existing frameworks, which means minimal
modifications to the scheduler, and yet it allows a range of different
behaviours to be achieved, all from a single simple tunable knob.

In EAS schedulers, boosted task and CPU utilization is also used for energy
calculation and energy-aware task placement.

2.2 prefer_idle
===============

This is a flag which indicates whether userspace wants the scheduler to focus
on energy efficiency or on performance for tasks in this group.

A value of 0 (default) signals to the CFS scheduler that tasks in this group
can be placed according to the energy-aware wakeup strategy.

A value of 1 signals to the CFS scheduler that tasks in this group should be
placed to minimise wakeup latency.

The flag is combined with the boost value: task placement will not be
boost aware, but CPU OPP selection is still boost aware.

Android platforms typically use this flag for application tasks which the
user is currently interacting with.


3. Signal Boosting Strategy
===========================

The whole PELT machinery works based on the value of a few load tracking
signals which basically track the CPU bandwidth requirements for tasks and the
capacity of CPUs. The basic idea behind the SchedTune knob is to artificially
inflate some of these load tracking signals to make a task or RQ appear more
demanding than it actually is.

Which signals have to be inflated depends on the specific "consumer". However,
independently from the specific (signal, consumer) pair, it is important to
define a simple and possibly consistent strategy for the concept of boosting a
signal.

A boosting strategy defines how the "abstract" user-space defined
sched_cfs_boost value is translated into an internal "margin" value to be added
to a signal to get its inflated value:

  margin         := boosting_strategy(sched_cfs_boost, signal)
  boosted_signal := signal + margin

Different boosting strategies were identified and analyzed before selecting the
one found to be most effective.

Signal Proportional Compensation (SPC)
--------------------------------------

In this boosting strategy the sched_cfs_boost value is used to compute a
margin which is proportional to the complement of the original signal.
When a signal has a maximum possible value, its complement is defined as
the delta between the actual value and its possible maximum.

Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
the maximum possible value, the margin becomes:

  margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)

Using this boosting strategy:
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
- each value in the range of sched_cfs_boost inflates the signal in question
  by a quantity which is proportional to its headroom below the maximum value

For example, by applying the SPC boosting strategy to the selection of the OPP
to run a task it is possible to achieve these behaviors:

- 0% boosting: run the task at the minimum OPP required by its workload
- 100% boosting: run the task at the maximum OPP available for the CPU
- 50% boosting: run at the half-way OPP between minimum and maximum

This means that, at 50% boosting, a task will be scheduled to run at half of
the maximum theoretically achievable performance on the specific target
platform.
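
The SPC mapping above can be sketched in C. This is a minimal user-space
sketch under stated assumptions: SCHED_LOAD_SCALE is taken as 1024, the boost
is an integer percentage, and the negative-boost handling (removing a
proportional share of the signal itself) is an illustrative choice that
matches the -100 => minimum OPP behaviour of section 2.1; the function names
are not the actual kernel symbols.

```c
#include <assert.h>

#define SCHED_LOAD_SCALE 1024L  /* assumed maximum signal value */

/*
 * SPC margin: boost is a percentage in [-100..100], signal is in
 * [0..SCHED_LOAD_SCALE]. A positive boost adds a proportional share of
 * the signal's headroom; a negative boost removes a proportional share
 * of the signal itself.
 */
static long spc_margin(long boost, long signal)
{
	if (boost >= 0)
		return boost * (SCHED_LOAD_SCALE - signal) / 100;
	return boost * signal / 100;
}

static long boosted_signal(long boost, long signal)
{
	return signal + spc_margin(boost, signal);
}
```

With a 50% boost, a signal of 512 becomes 768, midway to SCHED_LOAD_SCALE; a
100% boost saturates any signal to the maximum, and -100 drives it to the
minimum, matching the OPP behaviours listed above.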

A graphical representation of an SPC boosted signal is shown in the
following figure, where:
  a) "-" represents the original signal
  b) "b" represents a 50% boosted signal
  c) "p" represents a 100% boosted signal


   ^
   |  SCHED_LOAD_SCALE
   +-----------------------------------------------------------------+
   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
   |
   |                                             boosted_signal
   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
   |
   |                                            original signal
   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
   |                                          |
   |bbbbbbbbbbbbbbbbbb                        |
   |                                          |
   |                                          |
   |                                          |
   |                  +-----------------------+
   |                  |
   |                  |
   |                  |
   |------------------+
   |
   |
   +----------------------------------------------------------------------->

The plot above shows a ramped load signal (titled 'original signal') and its
boosted equivalent. For each step of the original signal, the boosted signal
corresponding to a 50% boost is midway between the original signal and the
upper bound. Boosting by 100% generates a boosted signal which is always
saturated to the upper bound.


4. OPP selection using boosted CPU utilization
==============================================

It is worth calling out that the implementation does not introduce any new load
signals. Instead, it provides an API to tune existing signals. This tuning is
done on demand and only in scheduler code paths where it is sensible to do so.
The new API calls are defined to return either the default signal or a boosted
one, depending on the value of sched_cfs_boost. This is a clean and
non-invasive modification of the existing code paths.

The signal representing a CPU's utilization is boosted according to the
previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
(i.e. a CFS run-queue) to appear more utilized than it actually is.

Thus, with sched_cfs_boost enabled we have the following main functions to
get the current utilization of a CPU:

  cpu_util()
  boosted_cpu_util()

The new boosted_cpu_util() is similar to the first but returns a boosted
utilization signal which is a function of the sched_cfs_boost value.

This function is used in the CFS scheduler code paths where sched-DVFS needs
to decide the OPP to run a CPU at. For example, this allows selecting the
highest OPP for a CPU which has the boost value set to 100%.

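The cpu_util()/boosted_cpu_util() split can be illustrated with a small
self-contained sketch. The per-CPU arrays below stand in for the CFS
run-queue state and the SchedTune-provided boost; the stub names and values
are assumptions for the example, not the actual kernel implementation.

```c
#include <assert.h>

#define SCHED_LOAD_SCALE 1024L

/* Stub per-CPU state; in the kernel this comes from the CFS rq (PELT
 * utilization) and from the schedtune controller (current boost %). */
static long cpu_util_stub[]  = { 300, 800 };
static long cpu_boost_stub[] = { 0, 50 };

/* Plain utilization, as tracked by PELT. */
static long cpu_util(int cpu)
{
	return cpu_util_stub[cpu];
}

/* Utilization inflated by the SPC margin for the CPU's boost value. */
static long boosted_cpu_util(int cpu)
{
	long util = cpu_util(cpu);
	long margin = cpu_boost_stub[cpu] * (SCHED_LOAD_SCALE - util) / 100;

	return util + margin;
}
```

With these stub values, CPU 0 (boost 0) reports its plain utilization of 300,
while CPU 1 (boost 50%) reports 800 plus half of its 224 headroom, i.e. 912;
sched-DVFS would therefore pick a higher OPP for CPU 1 than its raw demand
alone suggests.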

5. Per task group boosting
==========================

On battery powered devices there are usually many background services which
are long running and need energy efficient scheduling. On the other hand, some
applications are more performance sensitive and require an interactive
response and/or maximum performance, regardless of the energy cost.

To better service such scenarios, the SchedTune implementation has an extension
that provides a more fine grained boosting interface.

A new CGroup controller, namely "schedtune", can be enabled which allows task
groups with different boost values to be defined and configured.
Tasks that require special performance can be put into separate CGroups.
The value of the boost associated with the tasks in this group can be specified
using a single knob exposed by the CGroup controller:

  schedtune.boost

This knob allows the definition of a boost value that is to be used for
SPC boosting of all tasks attached to this group.

The current schedtune controller implementation is really simple and has these
main characteristics:

  1) It is only possible to create 1 level depth hierarchies

     The root control group defines the system-wide boost value to be applied
     by default to all tasks. Its direct subgroups are named "boost groups" and
     they define the boost value for specific sets of tasks.
     Further nested subgroups are not allowed since they do not have a sensible
     meaning from a user-space standpoint.

  2) It is possible to define only a limited number of "boost groups"

     This number is defined at compile time and by default configured to 16.
     This is a design decision motivated by two main reasons:
     a) In a real system we do not expect utilization scenarios with more than
        a few boost groups. For example, a reasonable collection of groups
        could be just "background", "interactive" and "performance".
     b) It simplifies the implementation considerably, especially for the code
        which has to compute the per CPU boosting once there are multiple
        RUNNABLE tasks with different boost values.

Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks.
Moreover, this interface can be easily integrated by user-space run-times
(e.g. Android, ChromeOS) to implement a QoS solution for task boosting based
on task classification, which has been a long standing requirement.

Setup and usage
---------------

0. Use a kernel with CONFIG_SCHED_TUNE support enabled

1. Check that the "schedtune" CGroup controller is available:

   root@linaro-nano:~# cat /proc/cgroups
   #subsys_name  hierarchy  num_cgroups  enabled
   cpuset        0          1            1
   cpu           0          1            1
   schedtune     0          1            1

2. Mount a tmpfs to create the CGroups mount point (Optional)

   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup

3. Mount the "schedtune" controller

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune

4. Create task groups and configure their specific boost value (Optional)

   For example, here we create a "performance" boost group configured to boost
   all its tasks to 100%:

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost

5. Move tasks into the boost group

   For example, the following moves the task with PID $TASKPID (and all its
   threads) into the "performance" boost group:

   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs

This simple configuration allows only the threads of the $TASKPID task to run,
when needed, at the highest OPP on the most capable CPU of the system.


6. Per-task wakeup-placement-strategy Selection
===============================================

Many devices have a number of CFS tasks in use which require an absolute
minimum wakeup latency, and many tasks for which wakeup latency is not
important.

For touch-driven environments, removing additional wakeup latency can be
critical.

When you use the SchedTune CGroup controller, you have access to a second
parameter which allows a group to be marked such that energy-aware task
placement is bypassed for tasks belonging to that group.

  prefer_idle=0 (default - use energy-aware task placement if available)
  prefer_idle=1 (never use energy-aware task placement for these tasks)

Since the regular wakeup task placement algorithm in CFS is biased for
performance, this has the effect of restoring minimum wakeup latency
for the desired tasks whilst still allowing energy-aware wakeup placement
to save energy for other tasks.

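The effect of prefer_idle on the wakeup path can be summarised in a short
sketch. The helper below is a stand-in for the decision made in the CFS wakeup
code, not a real kernel symbol; it only captures the rule that prefer_idle=1
(or the absence of energy-aware placement) selects the latency-biased path.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative outcomes of the placement decision. */
enum placement { PLACE_ENERGY_AWARE, PLACE_MIN_LATENCY };

/*
 * prefer_idle=1 forces the latency-biased path; otherwise the
 * energy-aware path is used whenever EAS placement is available.
 */
static enum placement select_wakeup_path(bool prefer_idle, bool eas_available)
{
	if (prefer_idle || !eas_available)
		return PLACE_MIN_LATENCY;
	return PLACE_ENERGY_AWARE;
}
```

Note that this choice only affects placement; as stated in section 2.2, OPP
selection remains boost aware either way.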

7. Questions and Answers
========================

What about "auto" mode?
-----------------------

The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
with some suitable user-space element. This element could use the exposed
system-wide or cgroup based interface.

How are multiple groups of tasks with different boost values managed?
---------------------------------------------------------------------

The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
on a CPU. The CPU utilization seen by the scheduler-driven cpufreq governors
(and used to select an appropriate OPP) is boosted with a value which is the
maximum of the boost values of the currently RUNNABLE tasks in its RQ.

This allows cpufreq to boost a CPU only while there are boosted tasks ready
to run, and to switch back to the energy efficient mode as soon as the last
boosted task is dequeued.

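The "maximum of the RUNNABLE boosts" rule can be sketched as follows. The
boost_group layout and per-CPU task counts are assumptions for the example;
the kernel tracks this state incrementally at enqueue/dequeue rather than
scanning all groups, but the value it selects is the same.

```c
#include <assert.h>

/* Per boost group, per CPU: the group's boost value and how many of
 * its tasks are currently RUNNABLE on this CPU (illustrative layout). */
struct boost_group {
	int boost;	/* group boost value in [-100..100] */
	int tasks;	/* RUNNABLE tasks of this group on the CPU */
};

/* Boost applied to the CPU: max over groups with RUNNABLE tasks. */
static int schedtune_cpu_boost(const struct boost_group *bg, int nr_groups)
{
	int max_boost = 0;	/* root/default boost assumed to be 0 */
	int i;

	for (i = 0; i < nr_groups; i++)
		if (bg[i].tasks > 0 && bg[i].boost > max_boost)
			max_boost = bg[i].boost;
	return max_boost;
}

/* Example: "background" (boost 0, 2 RUNNABLE tasks), "performance"
 * (boost 100, no RUNNABLE tasks) and "interactive" (boost 50, 1 task). */
static const struct boost_group example_groups[] = {
	{ .boost = 0,   .tasks = 2 },
	{ .boost = 100, .tasks = 0 },
	{ .boost = 50,  .tasks = 1 },
};
```

Here the CPU is boosted by 50%: the 100% group contributes nothing because it
has no RUNNABLE task on this CPU, and once the last boosted task is dequeued
the CPU falls back to the energy efficient default.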

8. References
=============
[1] http://lwn.net/Articles/552889
[2] http://lkml.org/lkml/2012/5/18/91
[3] http://lkml.org/lkml/2015/6/26/620