Central, scheduler-driven, power-performance control
(EXPERIMENTAL)

Abstract
========

The topic of a single simple power-performance tunable, one that is wholly
scheduler-centric and has well-defined and predictable properties, has come up
on several occasions in the past [1,2]. With techniques such as scheduler-driven
DVFS [3], we now have a good framework for implementing such a tunable. This
document describes the overall ideas behind its design and implementation.

Table of Contents
=================

1. Motivation
2. Introduction
3. Signal Boosting Strategy
4. OPP selection using boosted CPU utilization
5. Per task group boosting
6. Per-task wakeup-placement-strategy Selection
7. Questions and Answers
   - What about "auto" mode?
   - How are multiple groups of tasks with different boost values managed?
8. References

1. Motivation
=============

Sched-DVFS [3] was a new, event-driven cpufreq governor which allows the
scheduler to select the optimal DVFS operating point (OPP) for running a task
allocated to a CPU. Later, the cpufreq maintainers introduced a similar
governor, schedutil. The introduction of schedutil also enables running
workloads at the most energy-efficient OPPs.

However, it is sometimes desirable to intentionally boost the performance of
a workload, even if that implies a reasonable increase in energy consumption.
For example, in order to reduce the response time of a task, we may want to
run the task at a higher OPP than the one actually required by its CPU
bandwidth demand.

This last requirement is especially important if we consider that one of the
main goals of the utilization-driven governor component is to replace all
currently available CPUFreq policies. Since sched-DVFS and schedutil are event
based, as opposed to the sampling-driven governors we currently have, they are
already more responsive at selecting the optimal OPP to run tasks allocated to
a CPU. However, just tracking the actual task load demand may not be enough
from a performance standpoint. For example, it is not possible to get
behaviors similar to those provided by the "performance" and "interactive"
CPUFreq governors.

This document describes an implementation of a tunable, stacked on top of the
utilization-driven governors, which extends their functionality to support
task performance boosting.

By "performance boosting" we mean the reduction of the time required to
complete a task activation, i.e. the time elapsed from a task wakeup to its
next deactivation (e.g. because it goes back to sleep or it terminates). For
example, if we consider a simple periodic task which executes the same
workload for 5[s] every 20[s] while running at a certain OPP, a boosted
execution of that task must complete each of its activations in less than
5[s].

A previous attempt [5] to introduce such a boosting feature was not
successful, mainly because of the complexity of the proposed solution.
Previous versions of the approach described in this document exposed a single
simple interface to user-space. This single tunable knob allowed the tuning of
system-wide scheduler behaviours, ranging from energy efficiency at one end
through to incremental performance boosting at the other. This first tunable
affected all tasks. However, that is not useful for Android products, so in
this version only a more advanced extension of the concept is provided, which
uses CGroups to boost the performance of only selected tasks while using the
energy-efficient default for all others.

The rest of this document introduces in more detail the proposed solution,
which has been named SchedTune.

2. Introduction
===============

SchedTune exposes a simple user-space interface provided through a new
CGroup controller 'stune' which provides two power-performance tunables
per group:

  /<stune cgroup mount point>/schedtune.prefer_idle
  /<stune cgroup mount point>/schedtune.boost

The CGroup implementation permits arbitrary, user-space-defined task
classification to tune the scheduler for different goals depending on the
specific nature of the task, e.g. background vs. interactive vs. low-priority.

More details are given in section 5.

2.1 Boosting
============

The boost value is expressed as an integer in the range [-100..0..100].

A value of 0 (default) configures the CFS scheduler for maximum energy
efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
required to satisfy their workload demand.

A value of 100 configures the scheduler for maximum performance, which
translates to the selection of the maximum OPP on that CPU.

A value of -100 configures the scheduler for minimum performance, which
translates to the selection of the minimum OPP on that CPU.

Values between -100 and 100 can be set to suit other scenarios, for example
to satisfy interactive response requirements, or depending on other system
events (battery level, etc.).

The overall design of the SchedTune module is built on top of "Per-Entity Load
Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
Performance Point (OPP) selection.

Each time a task is allocated on a CPU, cpufreq is given the opportunity to
tune the operating frequency of that CPU to better match the workload demand.
The selection of the actual OPP being activated is influenced by the boost
value for the task CGroup.

This simple biasing approach leverages existing frameworks, which means minimal
modifications to the scheduler, and yet it allows a range of different
behaviours to be achieved, all from a single simple tunable knob.

In EAS schedulers, boosted task and CPU utilization is used for energy
calculation and energy-aware task placement.

2.2 prefer_idle
===============

This is a flag which indicates to the scheduler whether userspace would like
the scheduler to focus on energy or on performance.

A value of 0 (default) signals to the CFS scheduler that tasks in this group
can be placed according to the energy-aware wakeup strategy.

A value of 1 signals to the CFS scheduler that tasks in this group should be
placed to minimise wakeup latency.

The flag is combined with the boost value: task placement will not be boost
aware, but CPU OPP selection is still boost aware.

Android platforms typically use this flag for application tasks which the
user is currently interacting with.


3. Signal Boosting Strategy
===========================

The whole PELT machinery works based on the value of a few load tracking
signals which basically track the CPU bandwidth requirements for tasks and
the capacity of CPUs. The basic idea behind the SchedTune knob is to
artificially inflate some of these load tracking signals to make a task or RQ
appear more demanding than it actually is.

Which signals have to be inflated depends on the specific "consumer". However,
independently from the specific (signal, consumer) pair, it is important to
define a simple and possibly consistent strategy for the concept of boosting a
signal.

A boosting strategy defines how the "abstract" user-space defined
sched_cfs_boost value is translated into an internal "margin" value to be
added to a signal to get its inflated value:

  margin         := boosting_strategy(sched_cfs_boost, signal)
  boosted_signal := signal + margin

Different boosting strategies were identified and analyzed before selecting
the one found to be most effective.

Signal Proportional Compensation (SPC)
--------------------------------------

In this boosting strategy the sched_cfs_boost value is used to compute a
margin which is proportional to the complement of the original signal.
When a signal has a maximum possible value, its complement is defined as
the delta between the actual value and its possible maximum.

Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
the maximum possible value, the margin becomes:

  margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)

Using this boosting strategy:
- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
- each value in the range of sched_cfs_boost effectively inflates the signal
  in question by a quantity which is proportional to the maximum value.

For example, by applying the SPC boosting strategy to the selection of the OPP
to run a task it is possible to achieve these behaviors:

- 0% boosting:   run the task at the minimum OPP required by its workload
- 100% boosting: run the task at the maximum OPP available for the CPU
- 50% boosting:  run at the half-way OPP between minimum and maximum

This means that, at 50% boosting, a task will be scheduled to run at half of
the maximum theoretically achievable performance on the specific target
platform.

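The SPC strategy above can be sketched in a few lines of illustrative
user-space Python (not kernel code). SCHED_LOAD_SCALE is assumed to be 1024,
the usual PELT scale, and since sched_cfs_boost is a percentage the sketch
divides by 100; negative boost values are handled differently in practice and
are left out of this sketch.

```python
# Sketch of the SPC boosting strategy (illustrative only).
SCHED_LOAD_SCALE = 1024  # assumed maximum signal value

def spc_margin(boost_pct: int, signal: int) -> int:
    """Margin proportional to the complement of the signal.

    boost_pct is a sched_cfs_boost value in [0..100];
    signal is a utilization value in [0..SCHED_LOAD_SCALE].
    """
    return boost_pct * (SCHED_LOAD_SCALE - signal) // 100

def boosted(boost_pct: int, signal: int) -> int:
    return signal + spc_margin(boost_pct, signal)

# 0% boost leaves the signal unchanged, 100% saturates it, and
# 50% lands midway between the signal and the upper bound.
print(boosted(0, 300))    # 300
print(boosted(100, 300))  # 1024
print(boosted(50, 300))   # 662
```

Note how, for any starting signal, a 50% boost always closes half of the
remaining headroom to SCHED_LOAD_SCALE, which is exactly the "midway" property
shown in the figure below.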
A graphical representation of an SPC boosted signal is shown in the
following figure, where:
 a) "-" represents the original signal
 b) "b" represents a  50% boosted signal
 c) "p" represents a 100% boosted signal


   ^
   |  SCHED_LOAD_SCALE
   +-----------------------------------------------------------------+
   |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
   |
   |                                             boosted_signal
   |                                          bbbbbbbbbbbbbbbbbbbbbbbb
   |
   |                                            original signal
   |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
   |                                          |
   |bbbbbbbbbbbbbbbbbb                        |
   |                                          |
   |                                          |
   |                                          |
   |                  +-----------------------+
   |                  |
   |                  |
   |                  |
   |------------------+
   |
   |
   +----------------------------------------------------------------------->

The plot above shows a ramped load signal (titled 'original_signal') and its
boosted equivalent. For each step of the original signal the boosted signal
corresponding to a 50% boost is midway between the original signal and the
upper bound. Boosting by 100% generates a boosted signal which is always
saturated to the upper bound.


4. OPP selection using boosted CPU utilization
==============================================

It is worth calling out that the implementation does not introduce any new
load signals. Instead, it provides an API to tune existing signals. This
tuning is done on demand and only in scheduler code paths where it is
sensible to do so. The new API calls are defined to return either the default
signal or a boosted one, depending on the value of sched_cfs_boost. This is a
clean and non-invasive modification of the existing code paths.

The signal representing a CPU's utilization is boosted according to the
previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
(i.e. a CFS run-queue) to appear more used than it actually is.

Thus, with sched_cfs_boost enabled, we have the following main functions to
get the current utilization of a CPU:

  cpu_util()
  boosted_cpu_util()

The new boosted_cpu_util() is similar to the first but returns a boosted
utilization signal which is a function of the sched_cfs_boost value.

This function is used in the CFS scheduler code paths where sched-DVFS needs
to decide the OPP at which to run a CPU. For example, this allows selecting
the highest OPP for a CPU which has the boost value set to 100%.

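To make the effect of boosted utilization on OPP selection concrete, here is
a minimal user-space sketch in the spirit of a schedutil-like util-to-frequency
mapping. The function names, the discrete OPP table, and the ~25% headroom
factor are assumptions of this sketch, not the kernel's actual code.

```python
# Illustrative OPP selection from a boosted CPU utilization (not kernel code).
SCHED_LOAD_SCALE = 1024  # assumed maximum utilization value

def boosted_cpu_util(cpu_util: int, boost_pct: int) -> int:
    # SPC-boosted utilization, as in section 3 (non-negative boost only).
    return cpu_util + boost_pct * (SCHED_LOAD_SCALE - cpu_util) // 100

def next_opp(available_freqs, cpu_util: int, boost_pct: int) -> int:
    """Pick the lowest OPP whose capacity covers the boosted utilization."""
    util = boosted_cpu_util(cpu_util, boost_pct)
    max_freq = max(available_freqs)
    # Add ~25% headroom so the CPU does not run at 100% of the chosen OPP.
    target = max_freq * util * 5 // (4 * SCHED_LOAD_SCALE)
    for f in sorted(available_freqs):
        if f >= target:
            return f
    return max_freq

freqs = [500_000, 1_000_000, 1_500_000, 2_000_000]  # hypothetical OPPs in kHz
# A 100% boost saturates utilization, so the highest OPP is always selected;
# with no boost, a small utilization stays on the lowest OPP.
print(next_opp(freqs, 100, 100))  # 2000000
print(next_opp(freqs, 100, 0))    # 500000
```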
5. Per task group boosting
==========================

On battery-powered devices there are usually many background services which
are long running and need energy-efficient scheduling. On the other hand,
some applications are more performance sensitive and require an interactive
response and/or maximum performance, regardless of the energy cost.

To better service such scenarios, the SchedTune implementation has an
extension that provides a more fine-grained boosting interface.

A new CGroup controller, namely "schedtune", can be enabled which allows
task groups with different boosting values to be defined and configured.
Tasks that require special performance can be put into separate CGroups.
The value of the boost associated with the tasks in this group can be
specified using a single knob exposed by the CGroup controller:

  schedtune.boost

This knob allows the definition of a boost value that is to be used for
SPC boosting of all tasks attached to this group.

The current schedtune controller implementation is really simple and has
these main characteristics:

1) It is only possible to create hierarchies one level deep.

   The root control group defines the system-wide boost value to be applied
   by default to all tasks. Its direct subgroups are named "boost groups" and
   they define the boost value for a specific set of tasks.
   Further nested subgroups are not allowed since they do not have a sensible
   meaning from a user-space standpoint.

2) It is possible to define only a limited number of "boost groups".

   This number is defined at compile time and by default configured to 16.
   This is a design decision motivated by two main reasons:
   a) In a real system we do not expect utilization scenarios with more than
      a few boost groups. For example, a reasonable collection of groups
      could be just "background", "interactive" and "performance".
   b) It simplifies the implementation considerably, especially for the code
      which has to compute the per-CPU boosting once there are multiple
      RUNNABLE tasks with different boost values.

Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks.
Moreover, this interface can be easily integrated by user-space run-times
(e.g. Android, ChromeOS) to implement a QoS solution for task boosting based
on task classification, which has been a long-standing requirement.

Setup and usage
---------------

0. Use a kernel with CONFIG_SCHED_TUNE support enabled.

1. Check that the "schedtune" CGroup controller is available:

   root@linaro-nano:~# cat /proc/cgroups
   #subsys_name  hierarchy  num_cgroups  enabled
   cpuset        0          1            1
   cpu           0          1            1
   schedtune     0          1            1

2. Mount a tmpfs to create the CGroups mount point (optional):

   root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup

3. Mount the "schedtune" controller:

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
   root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune

4. Create task groups and configure their specific boost value (optional).

   For example, here we create a "performance" boost group configured to
   boost all its tasks to 100%:

   root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
   root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost

5. Move tasks into the boost group.

   For example, the following moves the task with PID $TASKPID (and all its
   threads) into the "performance" boost group:

   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs

This simple configuration allows only the threads of the $TASKPID task to run,
when needed, at the highest OPP on the most capable CPU of the system.

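A user-space run-time could perform steps 4 and 5 programmatically rather
than from a shell. The sketch below shows one way to do that; the default
mount point mirrors the shell example above, the function names are this
sketch's own, and error handling is deliberately elided.

```python
# Illustrative helper for configuring schedtune boost groups (sketch only).
import os

STUNE_ROOT = "/sys/fs/cgroup/stune"  # assumed stune mount point

def set_boost(group: str, boost: int, root: str = STUNE_ROOT) -> None:
    """Create a boost group (if needed) and write its schedtune.boost value."""
    os.makedirs(os.path.join(root, group), exist_ok=True)
    with open(os.path.join(root, group, "schedtune.boost"), "w") as f:
        f.write(str(boost))

def move_task(group: str, pid: int, root: str = STUNE_ROOT) -> None:
    """Move a task (thread group leader PID) into a boost group."""
    # Writing the PID to cgroup.procs moves the whole thread group.
    with open(os.path.join(root, group, "cgroup.procs"), "w") as f:
        f.write(str(pid))
```

For example, `set_boost("performance", 100)` followed by
`move_task("performance", taskpid)` reproduces steps 4 and 5 above.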
6. Per-task wakeup-placement-strategy Selection
===============================================

Many devices have a number of CFS tasks in use which require an absolute
minimum wakeup latency, and many tasks for which wakeup latency is not
important.

For touch-driven environments, removing additional wakeup latency can be
critical.

When you use the SchedTune CGroup controller, you have access to a second
parameter which allows a group to be marked such that energy-aware task
placement is bypassed for tasks belonging to that group.

  prefer_idle=0 (default - use energy-aware task placement if available)
  prefer_idle=1 (never use energy-aware task placement for these tasks)

Since the regular wakeup task placement algorithm in CFS is biased for
performance, this has the effect of restoring minimum wakeup latency
for the desired tasks whilst still allowing energy-aware wakeup placement
to save energy for other tasks.

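The effect of the flag on wakeup placement can be caricatured with a
deliberately minimal sketch. The real CFS/EAS selection logic weighs
capacity, energy, and latency in much more detail; the function and parameter
names here are this sketch's own.

```python
# Minimal caricature of prefer_idle-biased wakeup placement (not kernel code).
def select_cpu(prefer_idle: bool, idle_cpus: set, energy_aware_choice: int) -> int:
    """Pick a wakeup CPU for a task."""
    if prefer_idle and idle_cpus:
        # prefer_idle=1: bypass energy-aware placement and take an idle
        # CPU immediately to minimise wakeup latency.
        return min(idle_cpus)
    # prefer_idle=0 (or no idle CPU): defer to the energy-aware decision.
    return energy_aware_choice

# With CPUs 2 and 3 idle, a prefer_idle task goes straight to an idle CPU,
# while a default task follows the energy-aware choice (here CPU 0).
print(select_cpu(True, {2, 3}, 0))   # 2
print(select_cpu(False, {2, 3}, 0))  # 0
```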
7. Questions and Answers
========================

What about "auto" mode?
-----------------------

The "auto" mode as described in [5] can be implemented by interfacing
SchedTune with some suitable user-space element. This element could use the
exposed system-wide or cgroup-based interface.

How are multiple groups of tasks with different boost values managed?
---------------------------------------------------------------------

The current SchedTune implementation keeps track of the boosted RUNNABLE
tasks on a CPU. The CPU utilization seen by the scheduler-driven cpufreq
governors (and used to select an appropriate OPP) is boosted with a value
which is the maximum of the boost values of the currently RUNNABLE tasks in
its RQ.

This allows cpufreq to boost a CPU only while there are boosted tasks ready
to run, and to switch back to the energy-efficient mode as soon as the last
boosted task is dequeued.

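The max-aggregation rule described above can be sketched in one line
(illustrative only; the kernel tracks this per boost group rather than per
task, and the function name here is this sketch's own):

```python
# Sketch of per-CPU boost aggregation: the boost applied to a CPU is the
# maximum boost among its currently RUNNABLE tasks.
def cpu_boost(runnable_task_boosts) -> int:
    return max(runnable_task_boosts, default=0)

# Two runnable tasks from groups boosted at 10% and 60%: the CPU runs
# boosted at 60% until the 60% task is dequeued, then drops to 10%;
# with no boosted runnable tasks the CPU falls back to the default (0).
print(cpu_boost([10, 60]))  # 60
print(cpu_boost([10]))      # 10
print(cpu_boost([]))        # 0
```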
8. References
=============
[1] http://lwn.net/Articles/552889
[2] http://lkml.org/lkml/2012/5/18/91
[3] http://lkml.org/lkml/2015/6/26/620