.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>`
.. |cpufreq| replace:: :doc:`CPU Performance Scaling <cpufreq>`

========================
CPU Idle Time Management
========================

:Copyright: |copy| 2018 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


Concepts
========

Modern processors are generally able to enter states in which the execution of
a program is suspended and instructions belonging to it are not fetched from
memory or executed. Those states are the *idle* states of the processor.

Since part of the processor hardware is not used in idle states, entering them
generally allows power drawn by the processor to be reduced and, in consequence,
it is an opportunity to save energy.

CPU idle time management is an energy-efficiency feature concerned with using
the idle states of processors for this purpose.

Logical CPUs
------------

CPU idle time management operates on CPUs as seen by the *CPU scheduler* (that
is the part of the kernel responsible for the distribution of computational
work in the system). In its view, CPUs are *logical* units. That is, they need
not be separate physical entities and may just be interfaces appearing to
software as individual single-core processors. In other words, a CPU is an
entity which appears to be fetching instructions that belong to one sequence
(program) from memory and executing them, but it need not work this way
physically. Generally, three different cases can be considered here.

First, if the whole processor can only follow one sequence of instructions (one
program) at a time, it is a CPU. In that case, if the hardware is asked to
enter an idle state, that applies to the processor as a whole.

Second, if the processor is multi-core, each core in it is able to follow at
least one program at a time. The cores need not be entirely independent of each
other (for example, they may share caches), but still most of the time they
work physically in parallel with each other, so if each of them executes only
one program, those programs run mostly independently of each other at the same
time. The entire cores are CPUs in that case and if the hardware is asked to
enter an idle state, that applies to the core that asked for it in the first
place, but it also may apply to a larger unit (say a "package" or a "cluster")
that the core belongs to (in fact, it may apply to an entire hierarchy of larger
units containing the core). Namely, if all of the cores in the larger unit
except for one have been put into idle states at the "core level" and the
remaining core asks the processor to enter an idle state, that may trigger it
to put the whole larger unit into an idle state which also will affect the
other cores in that unit.

Finally, each core in a multi-core processor may be able to follow more than one
program in the same time frame (that is, each core may be able to fetch
instructions from multiple locations in memory and execute them in the same time
frame, but not necessarily entirely in parallel with each other). In that case
the cores present themselves to software as "bundles" each consisting of
multiple individual single-core "processors", referred to as *hardware threads*
(or hyper-threads specifically on Intel hardware), that each can follow one
sequence of instructions. Then, the hardware threads are CPUs from the CPU idle
time management perspective and if the processor is asked to enter an idle state
by one of them, the hardware thread (or CPU) that asked for it is stopped, but
nothing more happens, unless all of the other hardware threads within the same
core also have asked the processor to enter an idle state. In that situation,
the core may be put into an idle state individually or a larger unit containing
it may be put into an idle state as a whole (if the other cores within the
larger unit are in idle states already).

Idle CPUs
---------

Logical CPUs, simply referred to as "CPUs" in what follows, are regarded as
*idle* by the Linux kernel when there are no tasks to run on them except for the
special "idle" task.

Tasks are the CPU scheduler's representation of work. Each task consists of a
sequence of instructions to execute, or code, data to be manipulated while
running that code, and some context information that needs to be loaded into the
processor every time the task's code is run by a CPU. The CPU scheduler
distributes work by assigning tasks to run to the CPUs present in the system.

Tasks can be in various states. In particular, they are *runnable* if there are
no specific conditions preventing their code from being run by a CPU as long as
there is a CPU available for that (for example, they are not waiting for any
events to occur or similar). When a task becomes runnable, the CPU scheduler
assigns it to one of the available CPUs to run and if there are no more runnable
tasks assigned to it, the CPU will load the given task's context and run its
code (from the instruction following the last one executed so far, possibly by
another CPU). [If there are multiple runnable tasks assigned to one CPU
simultaneously, they will be subject to prioritization and time sharing in order
to allow them to make some progress over time.]

The special "idle" task becomes runnable if there are no other runnable tasks
assigned to the given CPU and the CPU is then regarded as idle. In other words,
in Linux idle CPUs run the code of the "idle" task called *the idle loop*. That
code may cause the processor to be put into one of its idle states, if they are
supported, in order to save energy, but if the processor does not support any
idle states, or there is not enough time to spend in an idle state before the
next wakeup event, or there are strict latency constraints preventing any of the
available idle states from being used, the CPU will simply execute more or less
useless instructions in a loop until it is assigned a new task to run.


.. _idle-loop:

The Idle Loop
=============

The idle loop code takes two major steps in every iteration. First, it
calls into a code module referred to as the *governor* that belongs to the CPU
idle time management subsystem called ``CPUIdle`` to select an idle state for
the CPU to ask the hardware to enter. Second, it invokes another code module
from the ``CPUIdle`` subsystem, called the *driver*, to actually ask the
processor hardware to enter the idle state selected by the governor.

The role of the governor is to find an idle state most suitable for the
conditions at hand. For this purpose, idle states that the hardware can be
asked to enter by logical CPUs are represented in an abstract way independent of
the platform or the processor architecture and organized in a one-dimensional
(linear) array. That array has to be prepared and supplied by the ``CPUIdle``
driver matching the platform the kernel is running on at the initialization
time. This allows ``CPUIdle`` governors to be independent of the underlying
hardware and to work with any platforms that the Linux kernel can run on.

Each idle state present in that array is characterized by two parameters to be
taken into account by the governor, the *target residency* and the (worst-case)
*exit latency*. The target residency is the minimum time the hardware must
spend in the given state, including the time needed to enter it (which may be
substantial), in order to save more energy than it would save by entering one of
the shallower idle states instead. [The "depth" of an idle state roughly
corresponds to the power drawn by the processor in that state.] The exit
latency, in turn, is the maximum time it will take a CPU asking the processor
hardware to enter an idle state to start executing the first instruction after a
wakeup from that state. Note that in general the exit latency also must cover
the time needed to enter the given state in case the wakeup occurs when the
hardware is entering it and it must be entered completely to be exited in an
ordered manner.

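As an illustration, a ``CPUIdle`` driver could describe two idle states in such
an array roughly as follows. This is a hypothetical sketch only: the state
names, the numbers and the ``example_enter()`` callback are made up, and a real
driver sets more fields and registers the table with the ``CPUIdle`` core (for
example with ``cpuidle_register()``):

.. code-block:: c

    #include <linux/cpuidle.h>

    /* Hypothetical callback actually putting the hardware into the state. */
    static int example_enter(struct cpuidle_device *dev,
                             struct cpuidle_driver *drv, int index);

    static struct cpuidle_driver example_idle_driver = {
            .name = "example_idle",
            .states = {
                    {       /* Shallow state: cheap to enter and to exit. */
                            .name = "S1",
                            .desc = "Example shallow idle state",
                            .exit_latency = 2,      /* microseconds (worst case) */
                            .target_residency = 5,  /* microseconds (minimum useful stay) */
                            .enter = example_enter,
                    },
                    {       /* Deeper state: lower power, more costly to exit. */
                            .name = "S2",
                            .desc = "Example deep idle state",
                            .exit_latency = 50,
                            .target_residency = 200,
                            .enter = example_enter,
                    },
            },
            .state_count = 2,
    };

With a table like this, a governor predicting, say, 100 microseconds of idle
time would pick "S1", because 100 is below the target residency of "S2".
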
There are two types of information that can influence the governor's decisions.
First of all, the governor knows the time until the closest timer event. That
time is known exactly, because the kernel programs timers and it knows exactly
when they will trigger, and it is the maximum time the hardware that the given
CPU depends on can spend in an idle state, including the time necessary to enter
and exit it. However, the CPU may be woken up by a non-timer event at any time
(in particular, before the closest timer triggers) and it generally is not known
when that may happen. The governor can only see how much time the CPU actually
was idle after it has been woken up (that time will be referred to as the *idle
duration* from now on) and it can use that information somehow along with the
time until the closest timer to estimate the idle duration in future. How the
governor uses that information depends on what algorithm is implemented by it
and that is the primary reason for having more than one governor in the
``CPUIdle`` subsystem.

There are three ``CPUIdle`` governors available, ``menu``, `TEO <teo-gov_>`_
and ``ladder``. Which of them is used by default depends on the configuration
of the kernel and in particular on whether or not the scheduler tick can be
`stopped by the idle loop <idle-cpus-and-tick_>`_. It is possible to change the
governor at run time if the ``cpuidle_sysfs_switch`` command line parameter has
been passed to the kernel, but that is not safe in general, so it should not be
done on production systems (that may change in the future, though). The name of
the ``CPUIdle`` governor currently used by the kernel can be read from the
:file:`current_governor_ro` (or :file:`current_governor` if
``cpuidle_sysfs_switch`` is present in the kernel command line) file under
:file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``.

Which ``CPUIdle`` driver is used, on the other hand, usually depends on the
platform the kernel is running on, but there are platforms with more than one
matching driver. For example, there are two drivers that can work with the
majority of Intel platforms, ``intel_idle`` and ``acpi_idle``, one with
hardcoded idle states information and the other able to read that information
from the system's ACPI tables, respectively. Still, even in those cases, the
driver chosen at the system initialization time cannot be replaced later, so the
decision on which one of them to use has to be made early (on Intel platforms
the ``acpi_idle`` driver will be used if ``intel_idle`` is disabled for some
reason or if it does not recognize the processor). The name of the ``CPUIdle``
driver currently used by the kernel can be read from the :file:`current_driver`
file under :file:`/sys/devices/system/cpu/cpuidle/` in ``sysfs``.

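Both of those files are plain text and can simply be read. For example, the
following user space snippet (hypothetical, with only basic error handling)
prints the driver and governor currently in use:

.. code-block:: c

    #include <stdio.h>

    static void print_sysfs(const char *path)
    {
            char buf[64];
            FILE *f = fopen(path, "r");

            if (f) {
                    if (fgets(buf, sizeof(buf), f))
                            printf("%s: %s", path, buf);  /* buf keeps its newline */
                    fclose(f);
            } else {
                    perror(path);
            }
    }

    int main(void)
    {
            print_sysfs("/sys/devices/system/cpu/cpuidle/current_driver");
            print_sysfs("/sys/devices/system/cpu/cpuidle/current_governor_ro");
            return 0;
    }
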

.. _idle-cpus-and-tick:

Idle CPUs and The Scheduler Tick
================================

The scheduler tick is a timer that triggers periodically in order to implement
the time sharing strategy of the CPU scheduler. Of course, if there are
multiple runnable tasks assigned to one CPU at the same time, the only way to
allow them to make reasonable progress in a given time frame is to make them
share the available CPU time. Namely, in rough approximation, each task is
given a slice of the CPU time to run its code, subject to the scheduling class,
prioritization and so on and when that time slice is used up, the CPU should be
switched over to running (the code of) another task. The currently running task
may not want to give the CPU away voluntarily, however, and the scheduler tick
is there to make the switch happen regardless. That is not the only role of the
tick, but it is the primary reason for using it.

The scheduler tick is problematic from the CPU idle time management perspective,
because it triggers periodically and relatively often (depending on the kernel
configuration, the length of the tick period is between 1 ms and 10 ms).
Thus, if the tick is allowed to trigger on idle CPUs, it will not make sense
for them to ask the hardware to enter idle states with target residencies above
the tick period length. Moreover, in that case the idle duration of any CPU
will never exceed the tick period length and the energy used for entering and
exiting idle states due to the tick wakeups on idle CPUs will be wasted.

Fortunately, it is not really necessary to allow the tick to trigger on idle
CPUs, because (by definition) they have no tasks to run except for the special
"idle" one. In other words, from the CPU scheduler perspective, the only user
of the CPU time on them is the idle loop. Since the time of an idle CPU need
not be shared between multiple runnable tasks, the primary reason for using the
tick goes away if the given CPU is idle. Consequently, it is possible to stop
the scheduler tick entirely on idle CPUs in principle, even though that may not
always be worth the effort.

Whether or not it makes sense to stop the scheduler tick in the idle loop
depends on what is expected by the governor. First, if there is another
(non-tick) timer due to trigger within the tick range, stopping the tick clearly
would be a waste of time, even though the timer hardware may not need to be
reprogrammed in that case. Second, if the governor is expecting a non-timer
wakeup within the tick range, stopping the tick is not necessary and it may even
be harmful. Namely, in that case the governor will select an idle state with
the target residency within the time until the expected wakeup, so that state is
going to be relatively shallow. The governor really cannot select a deep idle
state then, as that would contradict its own expectation of a wakeup in short
order. Now, if the wakeup really occurs shortly, stopping the tick would be a
waste of time and in this case the timer hardware would need to be reprogrammed,
which is expensive. On the other hand, if the tick is stopped and the wakeup
does not occur any time soon, the hardware may spend an indefinite amount of
time in the shallow idle state selected by the governor, which will be a waste
of energy. Hence, if the governor is expecting a wakeup of any kind within the
tick range, it is better to allow the tick to trigger. Otherwise, however, the
governor will select a relatively deep idle state, so the tick should be stopped
so that it does not wake up the CPU too early.

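The resulting decision rule can be summarized by the following sketch (purely
illustrative C, not the kernel's actual code; all identifiers are made up):

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Decide whether to stop the scheduler tick before idling, following the
     * reasoning above: keep the tick if a wakeup is expected within its range,
     * stop it otherwise so that it cannot cut a deep idle state short.
     */
    static bool stop_tick(uint64_t expected_wakeup_ns, uint64_t tick_period_ns,
                          bool tick_stopped_already)
    {
            /* If the tick is already stopped, leave it that way. */
            if (tick_stopped_already)
                    return true;

            /*
             * A wakeup within the tick range means a shallow state will be
             * selected anyway, so reprogramming the timer hardware would be
             * a waste of time.
             */
            if (expected_wakeup_ns < tick_period_ns)
                    return false;

            /* A deep state will be selected: do not let the tick cut it short. */
            return true;
    }
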
In any case, the governor knows what it is expecting and the decision on whether
or not to stop the scheduler tick belongs to it. Still, if the tick has been
stopped already (in one of the previous iterations of the loop), it is better
to leave it as is and the governor needs to take that into account.

The kernel can be configured to disable stopping the scheduler tick in the idle
loop altogether. That can be done in its build-time configuration (by unsetting
the ``CONFIG_NO_HZ_IDLE`` configuration option) or by passing ``nohz=off`` on
the kernel command line. In both cases, as the stopping of the scheduler tick
is disabled, the governor's decisions regarding it are simply ignored by the
idle loop code and the tick is never stopped.

The systems that run kernels configured to allow the scheduler tick to be
stopped on idle CPUs are referred to as *tickless* systems and they are
generally regarded as more energy-efficient than the systems running kernels in
which the tick cannot be stopped. If the given system is tickless, it will use
the ``menu`` governor by default and if it is not tickless, the default
``CPUIdle`` governor on it will be ``ladder``.


.. _menu-gov:

The ``menu`` Governor
=====================

The ``menu`` governor is the default ``CPUIdle`` governor for tickless systems.
It is quite complex, but the basic principle of its design is straightforward.
Namely, when invoked to select an idle state for a CPU (i.e. an idle state that
the CPU will ask the processor hardware to enter), it attempts to predict the
idle duration and uses the predicted value for idle state selection.

It first obtains the time until the closest timer event with the assumption
that the scheduler tick will be stopped. That time, referred to as the *sleep
length* in what follows, is the upper bound on the time before the next CPU
wakeup. It is used to determine the sleep length range, which in turn is needed
to get the sleep length correction factor.

The ``menu`` governor maintains two arrays of sleep length correction factors.
One of them is used when tasks previously running on the given CPU are waiting
for some I/O operations to complete and the other one is used when that is not
the case. Each array contains several correction factor values that correspond
to different sleep length ranges organized so that each range represented in the
array is approximately 10 times wider than the previous one.

The correction factor for the given sleep length range (determined before
selecting the idle state for the CPU) is updated after the CPU has been woken
up and the closer the sleep length is to the observed idle duration, the closer
to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
The sleep length is multiplied by the correction factor for the range that it
falls into to obtain the first approximation of the predicted idle duration.

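As a rough illustration only (the actual ``menu`` code uses integer fixed-point
arithmetic and its range and decay details differ), that mechanism can be
sketched like this:

.. code-block:: c

    #include <stdint.h>

    #define NR_RANGES 6  /* each range ~10 times wider than the previous one */

    /*
     * Correction factors, all starting out as 1.0; row [1] is used when tasks
     * that ran on the given CPU are waiting for I/O, row [0] otherwise.
     */
    static double correction[2][NR_RANGES] = {
            { 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 },
            { 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 },
    };

    static int range_of(uint64_t sleep_length_us)
    {
            int r = 0;

            while (r < NR_RANGES - 1 && sleep_length_us >= 10) {
                    sleep_length_us /= 10;
                    r++;
            }
            return r;
    }

    /* First approximation of the predicted idle duration. */
    static uint64_t predict(uint64_t sleep_length_us, int io_wait)
    {
            return (uint64_t)(sleep_length_us *
                              correction[io_wait][range_of(sleep_length_us)]);
    }

    /* After a wakeup: move the factor toward observed/expected, kept in [0, 1]. */
    static void update_correction(uint64_t sleep_length_us,
                                  uint64_t idle_duration_us, int io_wait)
    {
            double observed;
            int r;

            if (!sleep_length_us)
                    return;
            r = range_of(sleep_length_us);
            observed = (double)idle_duration_us / (double)sleep_length_us;
            if (observed > 1.0)
                    observed = 1.0;
            correction[io_wait][r] = 0.75 * correction[io_wait][r] +
                                     0.25 * observed;
    }
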
Next, the governor uses a simple pattern recognition algorithm to refine its
idle duration prediction. Namely, it saves the last 8 observed idle duration
values and, when predicting the idle duration next time, it computes the average
and variance of them. If the variance is small (smaller than 400 square
milliseconds) or it is small relative to the average (the average is greater
than 6 times the standard deviation), the average is regarded as the "typical
interval" value. Otherwise, the longest of the saved observed idle duration
values is discarded and the computation is repeated for the remaining ones.
Again, if the variance of them is small (in the above sense), the average is
taken as the "typical interval" value and so on, until either the "typical
interval" is determined or too many data points are disregarded, in which case
the "typical interval" is assumed to equal "infinity" (the maximum unsigned
integer value). The "typical interval" computed this way is compared with the
sleep length multiplied by the correction factor and the minimum of the two is
taken as the predicted idle duration.

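The following sketch illustrates that procedure in floating-point C (the
governor itself uses integer arithmetic and its exact thresholds and stop
condition differ slightly):

.. code-block:: c

    #include <stdint.h>

    #define NR_RECENT 8

    /* Returns the "typical interval", or UINT64_MAX ("infinity") if none found. */
    static uint64_t typical_interval(const uint64_t recent[NR_RECENT])
    {
            uint64_t set[NR_RECENT];
            int n = NR_RECENT, i;

            for (i = 0; i < NR_RECENT; i++)
                    set[i] = recent[i];

            while (n > NR_RECENT / 2) {  /* give up when too many are dropped */
                    double avg = 0.0, var = 0.0;
                    int max_idx = 0;

                    for (i = 0; i < n; i++)
                            avg += set[i];
                    avg /= n;

                    for (i = 0; i < n; i++) {
                            double d = (double)set[i] - avg;

                            var += d * d;
                            if (set[i] > set[max_idx])
                                    max_idx = i;
                    }
                    var /= n;

                    /*
                     * Accept the average if the variance is small in absolute
                     * terms or the average exceeds 6 standard deviations.
                     */
                    if (var <= 400.0 || avg * avg > 36.0 * var)
                            return (uint64_t)avg;

                    /* Discard the longest observation and try again. */
                    set[max_idx] = set[--n];
            }
            return UINT64_MAX;
    }
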
Then, the governor computes an extra latency limit to help "interactive"
workloads. It uses the observation that if the exit latency of the selected
idle state is comparable with the predicted idle duration, the total time spent
in that state probably will be very short and the amount of energy to save by
entering it will be relatively small, so likely it is better to avoid the
overhead related to entering that state and exiting it. Thus selecting a
shallower state is likely to be a better option then. The first approximation
of the extra latency limit is the predicted idle duration itself which
additionally is divided by a value depending on the number of tasks that
previously ran on the given CPU and are now waiting for I/O operations to
complete. The result of that division is compared with the latency limit coming
from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
framework and the minimum of the two is taken as the limit for the idle states'
exit latency.

Now, the governor is ready to walk the list of idle states and choose one of
them. For this purpose, it compares the target residency of each state with
the predicted idle duration and the exit latency of it with the computed latency
limit. It selects the state with the target residency closest to the predicted
idle duration, but still below it, and exit latency that does not exceed the
limit.

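Put together, that selection step amounts to something like the sketch below
(illustrative identifiers; the real code also skips disabled states and handles
various corner cases):

.. code-block:: c

    #include <stdint.h>

    struct state_params {
            uint64_t target_residency_us;
            uint64_t exit_latency_us;
    };

    /*
     * Return the index of the deepest state (the array is ordered from the
     * shallowest to the deepest) whose target residency does not exceed the
     * predicted idle duration and whose exit latency does not exceed the
     * computed latency limit.
     */
    static int select_state(const struct state_params *states, int count,
                            uint64_t predicted_us, uint64_t latency_limit_us)
    {
            int i, idx = 0;

            for (i = 0; i < count; i++) {
                    if (states[i].target_residency_us > predicted_us)
                            break;  /* too deep for the prediction */
                    if (states[i].exit_latency_us > latency_limit_us)
                            break;  /* would violate the latency limit */
                    idx = i;
            }
            return idx;
    }
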
In the final step the governor may still need to refine the idle state selection
if it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
happens if the idle duration predicted by it is less than the tick period and
the tick has not been stopped already (in a previous iteration of the idle
loop). Then, the sleep length used in the previous computations may not reflect
the real time until the closest timer event and if it really is greater than
that time, the governor may need to select a shallower state with a suitable
target residency.


.. _teo-gov:

The Timer Events Oriented (TEO) Governor
========================================

The timer events oriented (TEO) governor is an alternative ``CPUIdle`` governor
for tickless systems. It follows the same basic strategy as the ``menu`` `one
<menu-gov_>`_: it always tries to find the deepest idle state suitable for the
given conditions. However, it applies a different approach to that problem.

First, it does not use sleep length correction factors, but instead it attempts
to correlate the observed idle duration values with the available idle states
and use that information to pick up the idle state that is most likely to
"match" the upcoming CPU idle interval. Second, it does not take the tasks
that were running on the given CPU in the past and are waiting on some I/O
operations to complete now into account at all (there is no guarantee that they
will run on the same CPU when they become runnable again) and the pattern
detection code in it avoids taking timer wakeups into account. It also only
uses idle duration values less than the current time till the closest timer
(with the scheduler tick excluded) for that purpose.

Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain
the *sleep length*, which is the time until the closest timer event with the
assumption that the scheduler tick will be stopped (that also is the upper bound
on the time until the next CPU wakeup). That value is then used to preselect an
idle state on the basis of three metrics maintained for each idle state provided
by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``.

The ``hits`` and ``misses`` metrics measure the likelihood that a given idle
state will "match" the observed (post-wakeup) idle duration if it "matches" the
sleep length. They both are subject to decay (after a CPU wakeup) every time
the target residency of the idle state corresponding to them is less than or
equal to the sleep length and the target residency of the next idle state is
greater than the sleep length (that is, when the idle state corresponding to
them "matches" the sleep length). The ``hits`` metric is increased if the
former condition is satisfied and the target residency of the given idle state
is less than or equal to the observed idle duration and the target residency of
the next idle state is greater than the observed idle duration at the same time
(that is, it is increased when the given idle state "matches" both the sleep
length and the observed idle duration). In turn, the ``misses`` metric is
increased when the given idle state "matches" the sleep length only and the
observed idle duration is too short for its target residency.

The ``early_hits`` metric measures the likelihood that a given idle state will
"match" the observed (post-wakeup) idle duration if it does not "match" the
sleep length. It is subject to decay on every CPU wakeup and it is increased
when the idle state corresponding to it "matches" the observed (post-wakeup)
idle duration and the target residency of the next idle state is less than or
equal to the sleep length (i.e. the idle state "matching" the sleep length is
deeper than the given one).

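The bookkeeping described in the last three paragraphs can be sketched as
follows (illustrative only: the decay and increment weights are made up and the
governor's real data structures differ):

.. code-block:: c

    #include <stdint.h>

    struct teo_state {
            uint64_t target_residency_us;
            unsigned int hits, misses, early_hits;
    };

    /*
     * Called after a CPU wakeup. idx_sleep is the index of the state that
     * "matches" the sleep length, idx_idle the one that "matches" the measured
     * idle duration (idx_idle <= idx_sleep, as the sleep length is the upper
     * bound on the idle duration).
     */
    static void teo_update(struct teo_state *st, int count,
                           int idx_sleep, int idx_idle)
    {
            int i;

            /* early_hits decays on every wakeup ... */
            for (i = 0; i < count; i++)
                    st[i].early_hits -= st[i].early_hits >> 2;

            /* ... hits and misses only for the state matching the sleep length. */
            st[idx_sleep].hits -= st[idx_sleep].hits >> 2;
            st[idx_sleep].misses -= st[idx_sleep].misses >> 2;

            if (idx_idle == idx_sleep)
                    st[idx_sleep].hits += 128;   /* matched the idle duration too */
            else
                    st[idx_sleep].misses += 128; /* idle duration was too short */

            if (idx_idle < idx_sleep)
                    st[idx_idle].early_hits += 128; /* "won" an early wakeup */
    }
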
The governor walks the list of idle states provided by the ``CPUIdle`` driver
and finds the last (deepest) one with the target residency less than or equal
to the sleep length. Then, the ``hits`` and ``misses`` metrics of that idle
state are compared with each other and it is preselected if the ``hits`` one is
greater (which means that that idle state is likely to "match" the observed idle
duration after CPU wakeup). If the ``misses`` one is greater, the governor
preselects the shallower idle state with the maximum ``early_hits`` metric
(or if there are multiple shallower idle states with equal ``early_hits``
metric which also is the maximum, the shallowest of them will be preselected).
[If there is a wakeup latency constraint coming from the `PM QoS framework
<cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the
target residency within the sleep length, the deepest idle state with the exit
latency within the constraint is preselected without consulting the ``hits``,
``misses`` and ``early_hits`` metrics.]

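Leaving the PM QoS case aside, the preselection can be sketched like this
(illustrative, with the same ``struct teo_state`` layout as in the previous
sketch):

.. code-block:: c

    #include <stdint.h>

    struct teo_state {
            uint64_t target_residency_us;
            unsigned int hits, misses, early_hits;
    };

    static int teo_preselect(const struct teo_state *st, int count,
                             uint64_t sleep_length_us)
    {
            int i, idx = 0, early_idx = -1;
            unsigned int max_early = 0;

            /* Deepest state with the target residency within the sleep length. */
            for (i = 0; i < count; i++) {
                    if (st[i].target_residency_us > sleep_length_us)
                            break;
                    idx = i;
            }

            /* Likely to match the observed idle duration as well? Keep it. */
            if (st[idx].hits >= st[idx].misses)
                    return idx;

            /*
             * Otherwise pick the shallower state that most often "wins" early
             * wakeups; strict '>' keeps the shallowest one in the case of a tie.
             */
            for (i = 0; i < idx; i++) {
                    if (st[i].early_hits > max_early) {
                            max_early = st[i].early_hits;
                            early_idx = i;
                    }
            }
            return early_idx >= 0 ? early_idx : idx;
    }
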
Next, the governor takes several idle duration values observed most recently
into consideration and if at least a half of them are greater than or equal to
the target residency of the preselected idle state, that idle state becomes the
final candidate to ask for. Otherwise, the average of the most recent idle
duration values below the target residency of the preselected idle state is
computed and the governor walks the idle states shallower than the preselected
one and finds the deepest of them with the target residency within that average.
That idle state is then taken as the final candidate to ask for.

Still, at this point the governor may need to refine the idle state selection if
it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That
generally happens if the target residency of the idle state selected so far is
less than the tick period and the tick has not been stopped already (in a
previous iteration of the idle loop). Then, like in the ``menu`` governor
`case <menu-gov_>`_, the sleep length used in the previous computations may not
reflect the real time until the closest timer event and if it really is greater
than that time, a shallower state with a suitable target residency may need to
be selected.


.. _idle-states-representation:

Representation of Idle States
=============================

For the CPU idle time management purposes all of the physical idle states
supported by the processor have to be represented as a one-dimensional array of
|struct cpuidle_state| objects each allowing an individual (logical) CPU to ask
the processor hardware to enter an idle state of certain properties. If there
is a hierarchy of units in the processor, one |struct cpuidle_state| object can
cover a combination of idle states supported by the units at different levels of
the hierarchy. In that case, the `target residency and exit latency parameters
of it <idle-loop_>`_ must reflect the properties of the idle state at the
deepest level (i.e. the idle state of the unit containing all of the other
units).

For example, take a processor with two cores in a larger unit referred to as
a "module" and suppose that asking the hardware to enter a specific idle state
(say "X") at the "core" level by one core will trigger the module to try to
enter a specific idle state of its own (say "MX") if the other core is in idle
state "X" already. In other words, asking for idle state "X" at the "core"
level gives the hardware a license to go as deep as to idle state "MX" at the
"module" level, but there is no guarantee that this is going to happen (the core
asking for idle state "X" may just end up in that state by itself instead).
Then, the target residency of the |struct cpuidle_state| object representing
idle state "X" must reflect the minimum time to spend in idle state "MX" of
the module (including the time needed to enter it), because that is the minimum
time the CPU needs to be idle to save any energy in case the hardware enters
that state. Analogously, the exit latency parameter of that object must cover
the exit time of idle state "MX" of the module (and usually its entry time too),
because that is the maximum delay between a wakeup signal and the time the CPU
will start to execute the first new instruction (assuming that both cores in the
module will always be ready to execute instructions as soon as the module
becomes operational as a whole).

There are processors without direct coordination between different levels of the
hierarchy of units inside them, however. In those cases asking for an idle
state at the "core" level does not automatically affect the "module" level, for
example, in any way and the ``CPUIdle`` driver is responsible for the entire
handling of the hierarchy. Then, the definition of the idle state objects is
entirely up to the driver, but still the physical properties of the idle state
that the processor hardware finally goes into must always follow the parameters
used by the governor for idle state selection (for instance, the actual exit
latency of that idle state must not exceed the exit latency parameter of the
idle state object selected by the governor).

In addition to the target residency and exit latency idle state parameters
discussed above, the objects representing idle states each contain a few other
parameters describing the idle state and a pointer to the function to run in
order to ask the hardware to enter that state. Also, for each
|struct cpuidle_state| object, there is a corresponding
:c:type:`struct cpuidle_state_usage <cpuidle_state_usage>` one containing usage
statistics of the given idle state. That information is exposed by the kernel
via ``sysfs``.

For each CPU in the system, there is a
:file:`/sys/devices/system/cpu/cpu<N>/cpuidle/` directory in ``sysfs``, where
the number ``<N>`` is assigned to the given CPU at the initialization time.
That directory contains a set of subdirectories called :file:`state0`,
:file:`state1` and so on, up to the number of idle state objects defined for the
given CPU minus one. Each of these directories corresponds to one idle state
object and the larger the number in its name, the deeper the (effective) idle
state represented by it. Each of them contains a number of files (attributes)
representing the properties of the idle state object corresponding to it, as
follows:

``above``
    Total number of times this idle state had been asked for, but the
    observed idle duration was certainly too short to match its target
    residency.

``below``
    Total number of times this idle state had been asked for, but certainly
    a deeper idle state would have been a better match for the observed idle
    duration.

``desc``
    Description of the idle state.

``disable``
    Whether or not this idle state is disabled.

``latency``
    Exit latency of the idle state in microseconds.

``name``
    Name of the idle state.

``power``
    Power drawn by hardware in this idle state in milliwatts (if specified,
    0 otherwise).

``residency``
    Target residency of the idle state in microseconds.

``time``
    Total time spent in this idle state by the given CPU (as measured by the
    kernel) in microseconds.

``usage``
    Total number of times the hardware has been asked by the given CPU to
    enter this idle state.

The :file:`desc` and :file:`name` files both contain strings. The difference
between them is that the name is expected to be more concise, while the
description may be longer and it may contain white space or special characters.
The other files listed above contain integer numbers.

The :file:`disable` attribute is the only writeable one. If it contains 1, the
given idle state is disabled for this particular CPU, which means that the
governor will never select it for this particular CPU and the ``CPUIdle``
driver will never ask the hardware to enter it for that CPU as a result.
However, disabling an idle state for one CPU does not prevent it from being
asked for by the other CPUs, so it must be disabled for all of them in order to
never be asked for by any of them. [Note that, due to the way the ``ladder``
governor is implemented, disabling an idle state prevents that governor from
selecting any idle states deeper than the disabled one too.]

If the :file:`disable` attribute contains 0, the given idle state is enabled for
this particular CPU, but it still may be disabled for some or all of the other
CPUs in the system at the same time. Writing 1 to it causes the idle state to
be disabled for this particular CPU and writing 0 to it allows the governor to
take it into consideration for the given CPU and the driver to ask for it,
unless that state was disabled globally in the driver (in which case it cannot
be used at all).

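For example, a program with sufficient privileges could disable idle state 2
for CPU 0 along these lines (the CPU number and the state index are arbitrary
here):

.. code-block:: c

    #include <stdio.h>

    int main(void)
    {
            const char *path =
                    "/sys/devices/system/cpu/cpu0/cpuidle/state2/disable";
            FILE *f = fopen(path, "w");

            if (!f) {
                    perror(path);
                    return 1;
            }
            fputs("1", f);  /* write "0" instead to enable the state again */
            fclose(f);
            return 0;
    }
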
The :file:`power` attribute is not defined very well, especially for idle state
objects representing combinations of idle states at different levels of the
hierarchy of units in the processor, and it generally is hard to obtain idle
state power numbers for complex hardware, so :file:`power` often contains 0 (not
available) and if it contains a nonzero number, that number may not be very
accurate and it should not be relied on for anything meaningful.

The number in the :file:`time` file generally may be greater than the total time
really spent by the given CPU in the given idle state, because it is measured by
the kernel and it may not cover the cases in which the hardware refused to enter
this idle state and entered a shallower one instead of it (or even did not
enter any idle state at all). The kernel can only measure the time span between
asking the hardware to enter an idle state and the subsequent wakeup of the CPU
and it cannot say what really happened in the meantime at the hardware level.
Moreover, if the idle state object in question represents a combination of idle
states at different levels of the hierarchy of units in the processor,
the kernel can never say how deep the hardware went down the hierarchy in any
particular case. For these reasons, the only reliable way to find out how
much time has been spent by the hardware in different idle states supported by
it is to use idle state residency counters in the hardware, if available.


.. _cpu-pm-qos:

Power Management Quality of Service for CPUs
============================================

The power management quality of service (PM QoS) framework in the Linux kernel
allows kernel code and user space processes to set constraints on various
energy-efficiency features of the kernel to prevent performance from dropping
below a required level. The PM QoS constraints can be set globally, in
predefined categories referred to as PM QoS classes, or against individual
devices.

CPU idle time management can be affected by PM QoS in two ways, through the
global constraint in the ``PM_QOS_CPU_DMA_LATENCY`` class and through the
resume latency constraints for individual CPUs. Kernel code (e.g. device
drivers) can set both of them with the help of special internal interfaces
provided by the PM QoS framework. User space can modify the former by opening
the :file:`cpu_dma_latency` special device file under :file:`/dev/` and writing
a binary value (interpreted as a signed 32-bit integer) to it. In turn, the
resume latency constraint for a CPU can be modified by user space by writing a
string (representing a signed 32-bit integer) to the
:file:`power/pm_qos_resume_latency_us` file under
:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs``, where the CPU number
``<N>`` is allocated at the system initialization time. Negative values
will be rejected in both cases and, also in both cases, the written integer
number will be interpreted as a requested PM QoS constraint in microseconds.

The requested value is not automatically applied as a new constraint, however,
as it may be less restrictive (greater in this particular case) than another
constraint previously requested by someone else. For this reason, the PM QoS
framework maintains a list of requests that have been made so far in each
global class and for each device, aggregates them and applies the effective
(minimum in this particular case) value as the new constraint.

In fact, opening the :file:`cpu_dma_latency` special device file causes a new
PM QoS request to be created and added to the priority list of requests in the
``PM_QOS_CPU_DMA_LATENCY`` class and the file descriptor coming from the
"open" operation represents that request. If that file descriptor is then
used for writing, the number written to it will be associated with the PM QoS
request represented by it as a new requested constraint value. Next, the
priority list mechanism will be used to determine the new effective value of
the entire list of requests and that effective value will be set as a new
constraint. Thus setting a new requested constraint value will only change the
real constraint if the effective "list" value is affected by it. In particular,
for the ``PM_QOS_CPU_DMA_LATENCY`` class it only affects the real constraint if
it is the minimum of the requested constraints in the list. The process holding
a file descriptor obtained by opening the :file:`cpu_dma_latency` special device
file controls the PM QoS request associated with that file descriptor, but it
controls this particular PM QoS request only.

Closing the :file:`cpu_dma_latency` special device file or, more precisely, the
file descriptor obtained while opening it, causes the PM QoS request associated
with that file descriptor to be removed from the ``PM_QOS_CPU_DMA_LATENCY``
class priority list and destroyed. If that happens, the priority list mechanism
will be used, again, to determine the new effective value for the whole list
and that value will become the new real constraint.

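For example, a user space program can request a 20 microsecond exit latency
limit along the following lines (the value is arbitrary and the request remains
in effect until the file descriptor is closed):

.. code-block:: c

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int32_t latency_us = 20;  /* requested constraint in microseconds */
            int fd = open("/dev/cpu_dma_latency", O_WRONLY);

            if (fd < 0) {
                    perror("/dev/cpu_dma_latency");
                    return 1;
            }
            if (write(fd, &latency_us, sizeof(latency_us)) !=
                sizeof(latency_us)) {
                    perror("write");
                    close(fd);
                    return 1;
            }

            pause();  /* keep the request alive; closing fd removes it */

            close(fd);
            return 0;
    }
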
In turn, for each CPU there is only one resume latency PM QoS request
associated with the :file:`power/pm_qos_resume_latency_us` file under
:file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs`` and writing to it causes
this single PM QoS request to be updated regardless of which user space
process does that. In other words, this PM QoS request is shared by the entire
user space, so access to the file associated with it needs to be arbitrated
to avoid confusion. [Arguably, the only legitimate use of this mechanism in
practice is to pin a process to the CPU in question and let it use the
``sysfs`` interface to control the resume latency constraint for it.] It
still only is a request, however. It is a member of a priority list used to
determine the effective value to be set as the resume latency constraint for the
CPU in question every time the list of requests is updated one way or another
(there may be other requests coming from kernel code in that list).

CPU idle time governors are expected to regard the minimum of the global
effective ``PM_QOS_CPU_DMA_LATENCY`` class constraint and the effective
resume latency constraint for the given CPU as the upper limit for the exit
latency of the idle states they can select for that CPU. They should never
select any idle states with exit latency beyond that limit.


Idle States Control Via Kernel Command Line
===========================================

In addition to the ``sysfs`` interface allowing individual idle states to be
`disabled for individual CPUs <idle-states-representation_>`_, there are kernel
command line parameters affecting CPU idle time management.

The ``cpuidle.off=1`` kernel command line option can be used to disable the
CPU idle time management entirely. It does not prevent the idle loop from
running on idle CPUs, but it prevents the CPU idle time governors and drivers
from being invoked. If it is added to the kernel command line, the idle loop
will ask the hardware to enter idle states on idle CPUs via the CPU architecture
support code that is expected to provide a default mechanism for this purpose.
That default mechanism usually is the least common denominator for all of the
processors implementing the architecture (i.e. CPU instruction set) in question,
however, so it is rather crude and not very energy-efficient. For this reason,
it is not recommended for production use.

The ``cpuidle.governor=`` kernel command line switch can be used to specify the
``CPUIdle`` governor to use. It has to be appended with a string matching the
name of an available governor (e.g. ``cpuidle.governor=menu``) and that
governor will be used instead of the default one. It is possible to force
the ``menu`` governor to be used on the systems that use the ``ladder`` governor
by default this way, for example.

The other kernel command line parameters controlling CPU idle time management
described below are only relevant for the *x86* architecture and references
to ``intel_idle`` affect Intel processors only.

The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``. The first two of them disable the ``acpi_idle`` and
``intel_idle`` drivers altogether, which effectively causes the entire
``CPUIdle`` subsystem to be disabled and makes the idle loop invoke the
architecture support code to deal with idle CPUs. How it does that depends on
which of the two parameters is added to the kernel command line. In the
``idle=halt`` case, the architecture support code will use the ``HLT``
instruction of the CPUs (which, as a rule, suspends the execution of the program
and causes the hardware to attempt to enter the shallowest available idle state)
for this purpose, and if ``idle=poll`` is used, idle CPUs will execute a
more or less "lightweight" sequence of instructions in a tight loop. [Note
that using ``idle=poll`` is somewhat drastic in many cases, as preventing idle
CPUs from saving almost any energy at all may not be the only effect of it.
For example, on Intel hardware it effectively prevents CPUs from using
P-states (see |cpufreq|) that require any number of CPUs in a package to be
idle, so it very well may hurt the performance of single-threaded computations
as well as energy-efficiency. Thus using it for performance reasons may not be
a good idea at all.]

The ``idle=nomwait`` option prevents the use of the ``MWAIT`` instruction of
the CPU to enter idle states. When this option is used, the ``acpi_idle``
driver will use the ``HLT`` instruction instead of ``MWAIT``. On systems
running Intel processors, this option disables the ``intel_idle`` driver
and forces the use of the ``acpi_idle`` driver instead. Note that in either
case, the ``acpi_idle`` driver will function only if all the information needed
by it is in the system's ACPI tables.

In addition to the architecture-level kernel command line options affecting CPU
idle time management, there are parameters affecting individual ``CPUIdle``
drivers that can be passed to them via the kernel command line. Specifically,
the ``intel_idle.max_cstate=<n>`` and ``processor.max_cstate=<n>`` parameters,
where ``<n>`` is an idle state index also used in the name of the given
state's directory in ``sysfs`` (see
`Representation of Idle States <idle-states-representation_>`_), cause the
``intel_idle`` and ``acpi_idle`` drivers, respectively, to discard all of the
idle states deeper than idle state ``<n>``. In that case, they will never ask
for any of those idle states or expose them to the governor. [The behavior of
the two drivers is different for ``<n>`` equal to ``0``. Adding
``intel_idle.max_cstate=0`` to the kernel command line disables the
``intel_idle`` driver and allows ``acpi_idle`` to be used, whereas
``processor.max_cstate=0`` is equivalent to ``processor.max_cstate=1``.
Also, the ``acpi_idle`` driver is part of the ``processor`` kernel module that
can be loaded separately and ``max_cstate=<n>`` can be passed to it as a module
parameter when it is loaded.]