.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`

=======================
CPU Performance Scaling
=======================

::

 Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The Concept of CPU Performance Scaling
======================================

The majority of modern processors are capable of operating in a number of
different clock frequency and voltage configurations, often referred to as
Operating Performance Points or P-states (in ACPI terminology).  As a rule,
the higher the clock frequency and the higher the voltage, the more instructions
can be retired by the CPU over a unit of time, but also the higher the clock
frequency and the higher the voltage, the more energy is consumed over a unit of
time (or the more power is drawn) by the CPU in the given P-state.  Therefore
there is a natural tradeoff between the CPU capacity (the number of instructions
that can be executed over a unit of time) and the power drawn by the CPU.

In some situations it is desirable or even necessary to run the program as fast
as possible and then there is no reason to use any P-states different from the
highest one (i.e. the highest-performance frequency/voltage configuration
available).  In some other cases, however, it may not be necessary to execute
instructions so quickly and maintaining the highest available CPU capacity for a
relatively long time without utilizing it entirely may be regarded as wasteful.
It also may not be physically possible to maintain maximum CPU capacity for too
long for thermal or power supply capacity reasons or similar.  To cover those
cases, there are hardware interfaces allowing CPUs to be switched between
different frequency/voltage configurations or (in the ACPI terminology) to be
put into different P-states.

Typically, they are used along with algorithms to estimate the required CPU
capacity, so as to decide which P-states to put the CPUs into.  Of course, since
the utilization of the system generally changes over time, that has to be done
repeatedly on a regular basis.  The activity by which this happens is referred
to as CPU performance scaling or CPU frequency scaling (because it involves
adjusting the CPU clock frequency).


CPU Performance Scaling in Linux
================================

The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
(CPU Frequency scaling) subsystem that consists of three layers of code: the
core, scaling governors and scaling drivers.

The ``CPUFreq`` core provides the common code infrastructure and user space
interfaces for all platforms that support CPU performance scaling.  It defines
the basic framework in which the other components operate.

Scaling governors implement algorithms to estimate the required CPU capacity.
As a rule, each governor implements one, possibly parametrized, scaling
algorithm.

Scaling drivers talk to the hardware.  They provide scaling governors with
information on the available P-states (or P-state ranges in some cases) and
access platform-specific hardware interfaces to change CPU P-states as requested
by scaling governors.

In principle, all available scaling governors can be used with every scaling
driver.  That design is based on the observation that the information used by
performance scaling algorithms for P-state selection can be represented in a
platform-independent form in the majority of cases, so it should be possible
to use the same performance scaling algorithm implemented in exactly the same
way regardless of which scaling driver is used.  Consequently, the same set of
scaling governors should be suitable for every supported platform.

However, that observation may not hold for performance scaling algorithms
based on information provided by the hardware itself, for example through
feedback registers, as that information is typically specific to the hardware
interface it comes from and may not be easily represented in an abstract,
platform-independent way.  For this reason, ``CPUFreq`` allows scaling drivers
to bypass the governor layer and implement their own performance scaling
algorithms.  That is done by the |intel_pstate| scaling driver.


``CPUFreq`` Policy Objects
==========================

In some cases the hardware interface for P-state control is shared by multiple
CPUs.  That is, for example, the same register (or set of registers) is used to
control the P-state of multiple CPUs at the same time and writing to it affects
all of those CPUs simultaneously.

Sets of CPUs sharing hardware P-state control interfaces are represented by
``CPUFreq`` as |struct cpufreq_policy| objects.  For consistency,
|struct cpufreq_policy| is also used when there is only one CPU in the given
set.

The ``CPUFreq`` core maintains a pointer to a |struct cpufreq_policy| object for
every CPU in the system, including CPUs that are currently offline.  If multiple
CPUs share the same hardware P-state control interface, all of the pointers
corresponding to them point to the same |struct cpufreq_policy| object.

``CPUFreq`` uses |struct cpufreq_policy| as its basic data type and the design
of its user space interface is based on the policy concept.
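
The pointer sharing described above can be pictured with a small user-space
model.  The sketch below is written in Python purely for illustration; the
``Policy`` class and the ``build_policy_pointers()`` helper are hypothetical
names and do not correspond to the kernel's data structures or to any
``CPUFreq`` API:

.. code-block:: python

    from dataclasses import dataclass


    @dataclass
    class Policy:
        """Stand-in for one policy object (one shared P-state interface)."""
        related_cpus: frozenset  # every CPU in the set, online or offline


    def build_policy_pointers(groups):
        """Return a per-CPU table in which CPUs of one group share one object."""
        pointers = {}
        for cpus in groups:
            policy = Policy(related_cpus=frozenset(cpus))
            for cpu in cpus:
                # Every CPU of the group refers to the very same object.
                pointers[cpu] = policy
        return pointers


    # Assumed topology: CPUs 0-1 and 2-3 share an interface, CPU 4 stands alone.
    pointers = build_policy_pointers([[0, 1], [2, 3], [4]])

In this model CPUs 0 and 1 refer to one and the same object, while CPU 4 gets
a policy of its own, which corresponds to the single-CPU case mentioned above.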


CPU Initialization
==================

First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
It is only possible to register one scaling driver at a time, so the scaling
driver is expected to be able to handle all CPUs in the system.

The scaling driver may be registered before or after CPU registration.  If
CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
take note of all of the already registered CPUs during the registration of the
scaling driver.  In turn, if any CPUs are registered after the registration of
the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
at their registration time.

In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
has not seen so far as soon as it is ready to handle that CPU.  [Note that the
logical CPU may be a physical single-core processor, or a single core in a
multicore processor, or a hardware thread in a physical processor or processor
core.  In what follows "CPU" always means "logical CPU" unless explicitly stated
otherwise and the word "processor" is used to refer to the physical part
possibly including multiple logical CPUs.]

Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
for the given CPU and if so, it skips the policy object creation.  Otherwise,
a new policy object is created and initialized, which involves the creation of
a new policy directory in ``sysfs``, and the policy pointer corresponding to
the given CPU is set to the new policy object's address in memory.

Next, the scaling driver's ``->init()`` callback is invoked with the policy
pointer of the new CPU passed to it as the argument.  That callback is expected
to initialize the performance scaling hardware interface for the given CPU (or,
more precisely, for the set of CPUs sharing the hardware interface it belongs
to, represented by its policy object) and, if the policy object it has been
called for is new, to set parameters of the policy, like the minimum and maximum
frequencies supported by the hardware, the table of available frequencies (if
the set of supported P-states is not a continuous range), and the mask of CPUs
that belong to the same policy (including both online and offline CPUs).  That
mask is then used by the core to populate the policy pointers for all of the
CPUs in it.

The next major initialization step for a new policy object is to attach a
scaling governor to it (to begin with, that is the default scaling governor
determined by the kernel configuration, but it may be changed later
via ``sysfs``).  First, a pointer to the new policy object is passed to the
governor's ``->init()`` callback which is expected to initialize all of the
data structures necessary to handle the given policy and, possibly, to add
a governor ``sysfs`` interface to it.  Next, the governor is started by
invoking its ``->start()`` callback.

That callback is expected to register per-CPU utilization update callbacks for
all of the online CPUs belonging to the given policy with the CPU scheduler.
The utilization update callbacks will be invoked by the CPU scheduler on
important events, like task enqueue and dequeue, on every iteration of the
scheduler tick or generally whenever the CPU utilization may change (from the
scheduler's perspective).  They are expected to carry out computations needed
to determine the P-state to use for the given policy going forward and to
invoke the scaling driver to make changes to the hardware in accordance with
the P-state selection.  The scaling driver may be invoked directly from
scheduler context or asynchronously, via a kernel thread or workqueue, depending
on the configuration and capabilities of the scaling driver and the governor.

Similar steps are taken for policy objects that are not new, but were "inactive"
previously, meaning that all of the CPUs belonging to them were offline.  The
only practical difference in that case is that the ``CPUFreq`` core will attempt
to use the scaling governor previously used with the policy that became
"inactive" (and is re-initialized now) instead of the default governor.

In turn, if a previously offline CPU is being brought back online, but some
other CPUs sharing the policy object with it are online already, there is no
need to re-initialize the policy object at all.  In that case, it is only
necessary to restart the scaling governor so that it can take the new online CPU
into account.  That is achieved by invoking the governor's ``->stop()`` and
``->start()`` callbacks, in this order, for the entire policy.

As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
Consequently, if |intel_pstate| is used, scaling governors are not attached to
new policy objects.  Instead, the driver's ``->setpolicy()`` callback is invoked
to register per-CPU utilization update callbacks for each policy.  These
callbacks are invoked by the CPU scheduler in the same way as for scaling
governors, but in the |intel_pstate| case they both determine the P-state to
use and change the hardware configuration accordingly in one go from scheduler
context.

The policy objects created during CPU initialization and other data structures
associated with them are torn down when the scaling driver is unregistered
(which happens when the kernel module containing it is unloaded, for example) or
when the last CPU belonging to the given policy is unregistered.


Policy Interface in ``sysfs``
=============================

During the initialization of the kernel, the ``CPUFreq`` core creates a
``sysfs`` directory (kobject) called ``cpufreq`` under
:file:`/sys/devices/system/cpu/`.

That directory contains a ``policyX`` subdirectory (where ``X`` represents an
integer number) for every policy object maintained by the ``CPUFreq`` core.
Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
that may be different from the one represented by ``X``) for all of the CPUs
associated with (or belonging to) the given policy.  The ``policyX`` directories
in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
objects (that is, for all of the CPUs associated with them).
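
The directory layout described above can be explored from user space.  The
following Python sketch relies only on the paths given in this section; the
function names and the ``root`` parameter (which allows the code to be pointed
at another directory tree, for example in a test) are made up for this
example:

.. code-block:: python

    from pathlib import Path

    SYSFS_CPU = Path("/sys/devices/system/cpu")


    def list_policies(root=SYSFS_CPU):
        """Return the names of the policyX directories, ordered by X."""
        base = Path(root) / "cpufreq"
        names = [p.name for p in base.glob("policy[0-9]*") if p.is_dir()]
        return sorted(names, key=lambda name: int(name[len("policy"):]))


    def policy_of_cpu(cpu, root=SYSFS_CPU):
        """Follow the cpuY/cpufreq symbolic link and return the policy name."""
        link = Path(root) / f"cpu{cpu}" / "cpufreq"
        return link.resolve().name

On a system where CPUs 0 and 1 share a policy, ``policy_of_cpu(0)`` and
``policy_of_cpu(1)`` would be expected to return the same name.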

Some of those attributes are generic.  They are created by the ``CPUFreq`` core
and their behavior generally does not depend on what scaling driver is in use
and what scaling governor is attached to the given policy.  Some scaling drivers
also add driver-specific attributes to the policy directories in ``sysfs`` to
control policy-specific aspects of driver behavior.

The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
are the following:

``affected_cpus``
	List of online CPUs belonging to this policy (i.e. sharing the hardware
	performance scaling interface represented by the ``policyX`` policy
	object).

``bios_limit``
	If the platform firmware (BIOS) tells the OS to apply an upper limit to
	CPU frequencies, that limit will be reported through this attribute (if
	present).

	The existence of the limit may be a result of some (often unintentional)
	BIOS settings, restrictions coming from a service processor or other
	BIOS/HW-based mechanisms.

	This does not cover ACPI thermal limitations which can be discovered
	through a generic thermal driver.

	This attribute is not present if the scaling driver in use does not
	support it.

``cpuinfo_cur_freq``
	Current frequency of the CPUs belonging to this policy as obtained from
	the hardware (in kHz).

	This is expected to be the frequency the hardware actually runs at.
	If that frequency cannot be determined, this attribute should not
	be present.

``cpuinfo_max_freq``
	Maximum possible operating frequency the CPUs belonging to this policy
	can run at (in kHz).

``cpuinfo_min_freq``
	Minimum possible operating frequency the CPUs belonging to this policy
	can run at (in kHz).

``cpuinfo_transition_latency``
	The time it takes to switch the CPUs belonging to this policy from one
	P-state to another, in nanoseconds.

	If unknown or if known to be so high that the scaling driver does not
	work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
	will be returned by reads from this attribute.

``related_cpus``
	List of all (online and offline) CPUs belonging to this policy.

``scaling_available_governors``
	List of ``CPUFreq`` scaling governors present in the kernel that can
	be attached to this policy or (if the |intel_pstate| scaling driver is
	in use) list of scaling algorithms provided by the driver that can be
	applied to this policy.

	[Note that some governors are modular and it may be necessary to load a
	kernel module for the governor held by it to become available and be
	listed by this attribute.]

``scaling_cur_freq``
	Current frequency of all of the CPUs belonging to this policy (in kHz).

	In the majority of cases, this is the frequency of the last P-state
	requested by the scaling driver from the hardware using the scaling
	interface provided by it, which may or may not reflect the frequency
	the CPU is actually running at (due to hardware design and other
	limitations).

	Some architectures (e.g. ``x86``) may attempt to provide information
	more precisely reflecting the current CPU frequency through this
	attribute, but that still may not be the exact current CPU frequency as
	seen by the hardware at the moment.

``scaling_driver``
	The scaling driver currently in use.

``scaling_governor``
	The scaling governor currently attached to this policy or (if the
	|intel_pstate| scaling driver is in use) the scaling algorithm
	provided by the driver that is currently applied to this policy.

	This attribute is read-write and writing to it will cause a new scaling
	governor to be attached to this policy or a new scaling algorithm
	provided by the scaling driver to be applied to it (in the
	|intel_pstate| case), as indicated by the string written to this
	attribute (which must be one of the names listed by the
	``scaling_available_governors`` attribute described above).

``scaling_max_freq``
	Maximum frequency the CPUs belonging to this policy are allowed to be
	running at (in kHz).

	This attribute is read-write and writing a string representing an
	integer to it will cause a new limit to be set (it must not be lower
	than the value of the ``scaling_min_freq`` attribute).

``scaling_min_freq``
	Minimum frequency the CPUs belonging to this policy are allowed to be
	running at (in kHz).

	This attribute is read-write and writing a string representing a
	non-negative integer to it will cause a new limit to be set (it must not
	be higher than the value of the ``scaling_max_freq`` attribute).

``scaling_setspeed``
	This attribute is functional only if the `userspace`_ scaling governor
	is attached to the given policy.

	It returns the last frequency requested by the governor (in kHz) or can
	be written to in order to set a new frequency for the policy.
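
The constraints on writes to ``scaling_governor`` and ``scaling_max_freq``
stated above can be checked in user space before a write is attempted.  The
Python sketch below only reads attributes; the helper names are hypothetical
and the checks reflect nothing more than the rules given in the attribute
descriptions (the kernel remains the final authority on what it accepts):

.. code-block:: python

    from pathlib import Path


    def read_attr(policy_dir, name):
        """Return the contents of one policy attribute, without the newline."""
        return (Path(policy_dir) / name).read_text().strip()


    def governor_is_available(policy_dir, governor):
        """True if the name is listed by scaling_available_governors."""
        available = read_attr(policy_dir, "scaling_available_governors").split()
        return governor in available


    def max_freq_is_acceptable(policy_dir, new_max_khz):
        """True if the proposed maximum (in kHz) is not below scaling_min_freq."""
        return new_max_khz >= int(read_attr(policy_dir, "scaling_min_freq"))

For instance, ``governor_is_available(policy_dir, "schedutil")`` may return
false until the corresponding kernel module has been loaded, in line with the
note on modular governors above.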


Generic Scaling Governors
=========================

``CPUFreq`` provides generic scaling governors that can be used with all
scaling drivers.  As stated before, each of them implements a single, possibly
parametrized, performance scaling algorithm.

Scaling governors are attached to policy objects and different policy objects
can be handled by different scaling governors at the same time (although that
may lead to suboptimal results in some cases).

The scaling governor for a given policy object can be changed at any time with
the help of the ``scaling_governor`` policy attribute in ``sysfs``.

Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
algorithms implemented by them.  Those attributes, referred to as governor
tunables, can be either global (system-wide) or per-policy, depending on the
scaling driver in use.  If the driver requires governor tunables to be
per-policy, they are located in a subdirectory of each policy directory.
Otherwise, they are located in a subdirectory under
:file:`/sys/devices/system/cpu/cpufreq/`.  In either case the name of the
subdirectory containing the governor tunables is the name of the governor
providing them.

``performance``
---------------

When attached to a policy object, this governor causes the highest frequency,
within the ``scaling_max_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``powersave``
-------------

When attached to a policy object, this governor causes the lowest frequency,
within the ``scaling_min_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``userspace``
-------------

This governor does not do anything by itself.  Instead, it allows user space
to set the CPU frequency for the policy it is attached to by writing to the
``scaling_setspeed`` attribute of that policy.

``schedutil``
-------------

This governor uses CPU utilization data available from the CPU scheduler.  It
generally is regarded as a part of the CPU scheduler, so it can access the
scheduler's internal data structures directly.

It runs entirely in scheduler context, although in some cases it may need to
invoke the scaling driver asynchronously when it decides that the CPU frequency
should be changed for a given policy (that depends on whether or not the driver
is capable of changing the CPU frequency from scheduler context).

The actions of this governor for a particular CPU depend on the scheduling class
invoking its utilization update callback for that CPU.  If it is invoked by the
RT or deadline scheduling classes, the governor will increase the frequency to
the allowed maximum (that is, the ``scaling_max_freq`` policy limit).  In turn,
if it is invoked by the CFS scheduling class, the governor will use the
Per-Entity Load Tracking (PELT) metric for the root control group of the
given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_
LWN.net article for a description of the PELT mechanism).  Then, the new
CPU frequency to apply is computed in accordance with the formula

	f = 1.25 * ``f_0`` * ``util`` / ``max``

where ``util`` is the PELT number, ``max`` is the theoretical maximum of
``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
policy (if the PELT number is frequency-invariant), or the current CPU frequency
(otherwise).
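
To make the formula above concrete, the following Python sketch evaluates it
for two sets of made-up input values.  The function name and the numbers are
assumptions chosen for the example; the sketch reproduces the arithmetic only
and is not the governor's implementation:

.. code-block:: python

    def schedutil_target(util, max_util, f_0):
        """Evaluate f = 1.25 * f_0 * util / max for the given inputs."""
        if max_util <= 0:
            raise ValueError("max_util must be positive")
        return 1.25 * f_0 * util / max_util


    # Assumed inputs: f_0 = 2000000 kHz and a utilization scale of 0..1000.
    half_busy = schedutil_target(util=500, max_util=1000, f_0=2_000_000)
    mostly_busy = schedutil_target(util=800, max_util=1000, f_0=2_000_000)

With these inputs ``half_busy`` evaluates to 1250000 kHz and ``mostly_busy``
to 2000000 kHz.  In other words, the 1.25 factor makes the formula return
``f_0`` once the utilization reaches 80% of its maximum, which leaves some
headroom above the measured utilization.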

This governor also employs a mechanism allowing it to temporarily bump up the
CPU frequency for tasks that have been waiting on I/O most recently, called
"IO-wait boosting".  That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
is passed by the scheduler to the governor callback which causes the frequency
to go up to the allowed maximum immediately and then draw back to the value
returned by the above formula over time.

This governor exposes only one tunable:

``rate_limit_us``
	Minimum time (in microseconds) that has to pass between two consecutive
	runs of governor computations (default: 1000 times the scaling driver's
	transition latency).

	The purpose of this tunable is to reduce the scheduler context overhead
	of the governor which might be excessive without it.

This governor generally is regarded as a replacement for the older `ondemand`_
and `conservative`_ governors (described below), as it is simpler and more
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
switches and similar is less significant, and it uses the scheduler's own CPU
utilization metric, so in principle its decisions should not contradict the
decisions made by the other parts of the scheduler.

``ondemand``
------------

This governor uses CPU load as a CPU frequency selection metric.

In order to estimate the current CPU load, it measures the time elapsed between
consecutive invocations of its worker routine and computes the fraction of that
time in which the given CPU was not idle.  The ratio of the non-idle (active)
time to the total CPU time is taken as an estimate of the load.

If this governor is attached to a policy shared by multiple CPUs, the load is
estimated for all of them and the greatest result is taken as the load estimate
for the entire policy.
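
The load estimate described above amounts to simple arithmetic, which the
Python sketch below spells out.  The function names and the sample values are
assumptions made for the example, and the way the elapsed and idle times are
obtained is deliberately left out:

.. code-block:: python

    def cpu_load(elapsed_us, idle_us):
        """Fraction of the elapsed time in which the CPU was not idle."""
        if elapsed_us <= 0:
            raise ValueError("elapsed_us must be positive")
        return (elapsed_us - idle_us) / elapsed_us


    def policy_load(samples):
        """Greatest per-CPU load among (elapsed_us, idle_us) samples."""
        return max(cpu_load(elapsed, idle) for elapsed, idle in samples)


    # Two CPUs sharing a policy, sampled over the same 10 ms interval:
    # the first was idle for 7.5 ms, the second for 2 ms.
    load = policy_load([(10_000, 7_500), (10_000, 2_000)])

Here the per-CPU loads are 0.25 and 0.8, so the estimate for the whole policy
is 0.8, that is, the load of the busiest CPU.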

The worker routine of this governor has to run in process context, so it is
invoked asynchronously (via a workqueue) and CPU P-states are updated from
there if necessary.  As a result, the scheduler context overhead from this
governor is minimal, but it causes additional CPU context switches to happen
relatively often and the CPU P-state updates triggered by it can be relatively
irregular.  Also, it affects its own CPU load metric by running code that
reduces the CPU idle time (even though the CPU idle time is only reduced very
slightly by it).

It generally selects CPU frequencies proportional to the estimated load, so that
the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
corresponds to the load of 0, unless the load exceeds a (configurable)
speedup threshold, in which case it will go straight for the highest frequency
it is allowed to use (the ``scaling_max_freq`` policy limit).
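
The selection rule described above can be sketched in a few lines of Python.
The function name, the parameter names and the frequencies used below are
assumptions for illustration; the load and the speedup threshold are both
expressed in percent, and the sketch does not apply the ``scaling_min_freq``
and ``scaling_max_freq`` limits to the proportional result:

.. code-block:: python

    def ondemand_target(load, cpuinfo_min, cpuinfo_max, scaling_max,
                        up_threshold):
        """Pick a frequency (kHz) for a load and a threshold given in percent."""
        if load > up_threshold:
            # Above the speedup threshold: go straight to the allowed maximum.
            return scaling_max
        # Otherwise map 0..100% linearly onto cpuinfo_min..cpuinfo_max.
        return cpuinfo_min + (cpuinfo_max - cpuinfo_min) * load / 100


    # Assumed hardware range of 800000..2800000 kHz, a policy limit of
    # 2400000 kHz and a speedup threshold of 80%.
    moderate = ondemand_target(50, 800_000, 2_800_000, 2_400_000, 80)
    heavy = ondemand_target(90, 800_000, 2_800_000, 2_400_000, 80)

With these numbers a load of 50% maps to 1800000 kHz, the midpoint of the
hardware range, while a load of 90% exceeds the threshold and yields the
2400000 kHz policy limit.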

This governor exposes the following tunables:

``sampling_rate``
	This is how often the governor's worker routine should run, in
	microseconds.

	Typically, it is set to values of the order of 10000 (10 ms).  Its
	default value is equal to the value of ``cpuinfo_transition_latency``
	for each policy this governor is attached to (but since the unit here
	is greater by 1000, this means that the time represented by
	``sampling_rate`` is 1000 times greater than the transition latency by
	default).

	If this tunable is per-policy, the following shell command sets the time
	represented by it to be 750 times as high as the transition latency::

		# echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate

``up_threshold``
	If the estimated CPU load is above this value (in percent), the governor
	will set the frequency to the maximum value allowed for the policy.
	Otherwise, the selected frequency will be proportional to the estimated
	CPU load.

``ignore_nice_load``
	If set to 1 (default 0), it will cause the CPU load estimation code to
	treat the CPU time spent on executing tasks with "nice" levels greater
	than 0 as CPU idle time.

	This may be useful if there are tasks in the system that should not be
	taken into account when deciding what frequency to run the CPUs at.
	Then, to make that happen it is sufficient to increase the "nice" level
	of those tasks above 0 and set this attribute to 1.

``sampling_down_factor``
	Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
	the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.

	This causes the next execution of the governor's worker routine (after
	setting the frequency to the allowed maximum) to be delayed, so the
	frequency stays at the maximum level for a longer time.

	Frequency fluctuations in some bursty workloads may be avoided this way
	at the cost of additional energy spent on maintaining the maximum CPU
	capacity.

``powersave_bias``
	Reduction factor to apply to the original frequency target of the
	governor (including the maximum value used when the ``up_threshold``
	value is exceeded by the estimated CPU load) or sensitivity threshold
	for the AMD frequency sensitivity powersave bias driver
	(:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
	inclusive.

	If the AMD frequency sensitivity powersave bias driver is not loaded,
	the effective frequency to apply is given by

		f * (1 - ``powersave_bias`` / 1000)

	where f is the governor's original frequency target.  The default value
	of this attribute is 0 in that case.

	If the AMD frequency sensitivity powersave bias driver is loaded, the
	value of this attribute is 400 by default and it is used in a different
	way.

	On Family 16h (and later) AMD processors there is a mechanism to get a
	measured workload sensitivity, between 0 and 100% inclusive, from the
	hardware.  That value can be used to estimate how the performance of the
	workload running on a CPU will change in response to frequency changes.

	The performance of a workload with the sensitivity of 0 (memory-bound or
	IO-bound) is not expected to increase at all as a result of increasing
	the CPU frequency, whereas workloads with the sensitivity of 100%
	(CPU-bound) are expected to perform much better if the CPU frequency is
	increased.

	If the workload sensitivity is less than the threshold represented by
	the ``powersave_bias`` value, the sensitivity powersave bias driver
	will cause the governor to select a frequency lower than its original
	target, so as to avoid over-provisioning workloads that will not benefit
	from running at higher CPU frequencies.
 | 546 |  | 
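As a worked illustration of the ``powersave_bias`` formula above (the case in
which the AMD frequency sensitivity powersave bias driver is not loaded), the
following Python sketch computes the effective frequency from the governor's
original target.  The function name ``biased_target`` and the sample
frequencies are hypothetical, and the sketch does not model any rounding to
frequencies actually supported by the hardware.

```python
def biased_target(f_khz: int, powersave_bias: int) -> int:
    """Apply f * (1 - powersave_bias / 1000) to a frequency target in kHz.

    Integer arithmetic is used here (an assumption of this sketch), so the
    result is rounded down to a whole number of kHz.
    """
    if not 0 <= powersave_bias <= 1000:
        raise ValueError("powersave_bias must be between 0 and 1000 inclusive")
    # f * (1 - bias / 1000) rewritten as f * (1000 - bias) / 1000
    return f_khz * (1000 - powersave_bias) // 1000


if __name__ == "__main__":
    original_target = 2_400_000  # hypothetical governor target, in kHz
    for bias in (0, 100, 250, 1000):
        print(bias, biased_target(original_target, bias))
```

With the default value of 0 the target is left unchanged, whereas a value of
100 lowers it by 10 percent.
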
 | 547 | ``conservative`` | 
 | 548 | ---------------- | 
 | 549 |  | 
 | 550 | This governor uses CPU load as a CPU frequency selection metric. | 
 | 551 |  | 
 | 552 | It estimates the CPU load in the same way as the `ondemand`_ governor described | 
 | 553 | above, but the CPU frequency selection algorithm implemented by it is different. | 
 | 554 |  | 
 | 555 | Namely, it avoids significant frequency changes over short time intervals, as | 
 | 556 | those may not be suitable for systems with limited power supply capacity (e.g. | 
 | 557 | battery-powered).  To achieve that, it changes the frequency in relatively | 
 | 558 | small steps, one step at a time, up or down - depending on whether or not a | 
 | 559 | (configurable) threshold has been exceeded by the estimated CPU load. | 
 | 560 |  | 
 | 561 | This governor exposes the following tunables: | 
 | 562 |  | 
 | 563 | ``freq_step`` | 
 | 564 | 	Frequency step in percent of the maximum frequency the governor is | 
 | 565 | 	allowed to set (the ``scaling_max_freq`` policy limit), between 0 and | 
 | 566 | 	100 (5 by default). | 
 | 567 |  | 
 | 568 | 	This is how much the frequency is allowed to change in one go.  Setting | 
 | 569 | 	it to 0 will cause the default frequency step (5 percent) to be used | 
 | 570 | 	and setting it to 100 effectively causes the governor to periodically | 
 | 571 | 	switch the frequency between the ``scaling_min_freq`` and | 
 | 572 | 	``scaling_max_freq`` policy limits. | 
 | 573 |  | 
 | 574 | ``down_threshold`` | 
 | 575 | 	Threshold value (in percent, 20 by default) used to determine the | 
 | 576 | 	frequency change direction. | 
 | 577 |  | 
 | 578 | 	If the estimated CPU load is greater than this value, the frequency will | 
 | 579 | 	go up (by ``freq_step``).  If the load is less than this value (and the | 
 | 580 | 	``sampling_down_factor`` mechanism is not in effect), the frequency will | 
 | 581 | 	go down.  Otherwise, the frequency will not be changed. | 
 | 582 |  | 
 | 583 | ``sampling_down_factor`` | 
 | 584 | 	Frequency decrease deferral factor, between 1 (default) and 10 | 
 | 585 | 	inclusive. | 
 | 586 |  | 
 | 587 | 	It effectively causes the frequency to go down ``sampling_down_factor`` | 
 | 588 | 	times slower than it ramps up. | 
 | 589 |  | 
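To make the ``freq_step`` arithmetic described above concrete, the following
Python sketch converts a ``freq_step`` percentage into a step size in kHz.  It
follows the description in this document only and is not a transcription of
the governor's code; the function name ``step_khz`` and the sample
``scaling_max_freq`` value are hypothetical.

```python
DEFAULT_FREQ_STEP = 5  # percent, the default stated in the text above


def step_khz(scaling_max_freq_khz: int, freq_step: int = DEFAULT_FREQ_STEP) -> int:
    """Return the size of one frequency step in kHz.

    freq_step is a percentage of the scaling_max_freq policy limit; per the
    text above, a value of 0 selects the default step.
    """
    if not 0 <= freq_step <= 100:
        raise ValueError("freq_step must be between 0 and 100")
    if freq_step == 0:
        freq_step = DEFAULT_FREQ_STEP
    return scaling_max_freq_khz * freq_step // 100


if __name__ == "__main__":
    max_freq = 3_000_000  # hypothetical scaling_max_freq, in kHz
    for pct in (0, 5, 20, 100):
        print(pct, step_khz(max_freq, pct))
```

With ``freq_step`` set to 100 the step equals the whole ``scaling_max_freq``
value, which is why the governor then effectively switches between the two
policy limits.
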
 | 590 |  | 
 | 591 | Frequency Boost Support | 
 | 592 | ======================= | 
 | 593 |  | 
 | 594 | Background | 
 | 595 | ---------- | 
 | 596 |  | 
 | 597 | Some processors support a mechanism to raise the operating frequency of some | 
 | 598 | cores in a multicore package temporarily (and above the sustainable frequency | 
 | 599 | threshold for the whole package) under certain conditions, for example if the | 
 | 600 | whole chip is not fully utilized and below its intended thermal or power budget. | 
 | 601 |  | 
 | 602 | Different names are used by different vendors to refer to this functionality. | 
 | 603 | For Intel processors it is referred to as "Turbo Boost", AMD calls it | 
 | 604 | "Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on. | 
 | 605 | As a rule, it also is implemented differently by different vendors.  The simple | 
 | 606 | term "frequency boost" is used here for brevity to refer to all of those | 
 | 607 | implementations. | 
 | 608 |  | 
 | 609 | The frequency boost mechanism may be either hardware-based or software-based. | 
 | 610 | If it is hardware-based (e.g. on x86), the decision to trigger the boosting is | 
 | 611 | made by the hardware (although in general it requires the hardware to be put | 
 | 612 | into a special state in which it can control the CPU frequency within certain | 
 | 613 | limits).  If it is software-based (e.g. on ARM), the scaling driver decides | 
 | 614 | whether or not to trigger boosting and when to do that. | 
 | 615 |  | 
 | 616 | The ``boost`` File in ``sysfs`` | 
 | 617 | ------------------------------- | 
 | 618 |  | 
 | 619 | This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls | 
 | 620 | the "boost" setting for the whole system.  It is not present if the underlying | 
 | 621 | scaling driver does not support the frequency boost mechanism (or supports it, | 
 | 622 | but provides a driver-specific interface for controlling it, like | 
 | 623 | |intel_pstate|). | 
 | 624 |  | 
 | 625 | If the value in this file is 1, the frequency boost mechanism is enabled.  This | 
 | 626 | means that either the hardware can be put into states in which it is able to | 
 | 627 | trigger boosting (in the hardware-based case), or the software is allowed to | 
 | 628 | trigger boosting (in the software-based case).  It does not mean that boosting | 
 | 629 | is actually in use at the moment on any CPUs in the system.  It only means a | 
 | 630 | permission to use the frequency boost mechanism (which still may never be used | 
 | 631 | for other reasons). | 
 | 632 |  | 
 | 633 | If the value in this file is 0, the frequency boost mechanism is disabled and | 
 | 634 | cannot be used at all. | 
 | 635 |  | 
 | 636 | The only values that can be written to this file are 0 and 1. | 
 | 637 |  | 
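The ``boost`` file semantics described above can be exercised from user space
with a few lines of Python.  This is a minimal sketch: the helper names
``read_boost`` and ``write_boost`` are hypothetical, writing to the file
normally requires root privileges, and the file may be absent altogether for
the reasons given above.

```python
from pathlib import Path

# Location of the global knob as given in the text above.
BOOST_PATH = Path("/sys/devices/system/cpu/cpufreq/boost")


def read_boost(path: Path = BOOST_PATH):
    """Return True if boost is enabled (1), False if disabled (0).

    Returns None if the file does not exist, for example because the scaling
    driver does not support boost or uses a driver-specific interface.
    """
    try:
        return path.read_text().strip() == "1"
    except FileNotFoundError:
        return None


def write_boost(enabled: bool, path: Path = BOOST_PATH) -> None:
    """Write 1 or 0, the only values that the file accepts."""
    path.write_text("1" if enabled else "0")


if __name__ == "__main__":
    state = read_boost()
    if state is None:
        print("global boost knob not present")
    else:
        print("boost enabled" if state else "boost disabled")
```

A value of 1 read back this way only means that boosting is permitted, not
that any CPU is actually running at a boosted frequency at that moment.
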
 | 638 | Rationale for Boost Control Knob | 
 | 639 | -------------------------------- | 
 | 640 |  | 
 | 641 | The frequency boost mechanism is generally intended to help to achieve optimum | 
 | 642 | CPU performance on time scales below software resolution (e.g. below the | 
 | 643 | scheduler tick interval) and it is demonstrably suitable for many workloads, but | 
 | 644 | it may lead to problems in certain situations. | 
 | 645 |  | 
 | 646 | For this reason, many systems make it possible to disable the frequency boost | 
 | 647 | mechanism in the platform firmware (BIOS) setup, but that requires the system to | 
 | 648 | be restarted for the setting to be adjusted as desired, which may not be | 
 | 649 | practical at least in some cases.  For example: | 
 | 650 |  | 
 | 651 |   1. Boosting means overclocking the processor, although under controlled | 
 | 652 |      conditions.  Generally, the processor's energy consumption increases | 
 | 653 |      as a result of increasing its frequency and voltage, even temporarily. | 
 | 654 |      That may not be desirable on systems that switch to power sources of | 
 | 655 |      limited capacity, such as batteries, so the ability to disable the boost | 
 | 656 |      mechanism while the system is running may help there (but that depends on | 
 | 657 |      the workload too). | 
 | 658 |  | 
 | 659 |   2. In some situations deterministic behavior is more important than | 
 | 660 |      performance or energy consumption (or both) and the ability to disable | 
 | 661 |      boosting while the system is running may be useful then. | 
 | 662 |  | 
 | 663 |   3. To examine the impact of the frequency boost mechanism itself, it is useful | 
 | 664 |      to be able to run tests with and without boosting, preferably without | 
 | 665 |      restarting the system in the meantime. | 
 | 666 |  | 
 | 667 |   4. Reproducible results are important when running benchmarks.  Since | 
 | 668 |      the boosting functionality depends on the load of the whole package, | 
 | 669 |      single-thread performance may vary because of it, which may lead to | 
 | 670 |      unreproducible results sometimes.  That can be avoided by disabling the | 
 | 671 |      frequency boost mechanism before running benchmarks sensitive to that | 
 | 672 |      issue. | 
 | 673 |  | 
 | 674 | Legacy AMD ``cpb`` Knob | 
 | 675 | ----------------------- | 
 | 676 |  | 
 | 677 | The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to | 
 | 678 | the global ``boost`` one.  It is used for disabling/enabling the "Core | 
 | 679 | Performance Boost" feature of some AMD processors. | 
 | 680 |  | 
 | 681 | If present, that knob is located in every ``CPUFreq`` policy directory in | 
 | 682 | ``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called | 
 | 683 | ``cpb``, which suggests a more fine-grained control interface.  The actual | 
 | 684 | implementation, however, works on a system-wide basis and setting that knob | 
 | 685 | for one policy causes the same value of it to be set for all of the other | 
 | 686 | policies at the same time. | 
 | 687 |  | 
 | 688 | That knob is still supported on AMD processors that support its underlying | 
 | 689 | hardware feature, but it may be configured out of the kernel (via the | 
 | 690 | :c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global | 
 | 691 | ``boost`` knob is present regardless.  Thus it is always possible to use the | 
 | 692 | ``boost`` knob instead of the ``cpb`` one, which is highly recommended, as that | 
 | 693 | is more consistent with what all of the other systems do (and the ``cpb`` knob | 
 | 694 | may not be supported any more in the future). | 
 | 695 |  | 
 | 696 | The ``cpb`` knob is never present for any processors without the underlying | 
 | 697 | hardware feature (e.g. all Intel ones), even if the | 
 | 698 | :c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set. | 
 | 699 |  | 
 | 700 |  | 
 | 701 | .. _Per-entity load tracking: https://lwn.net/Articles/531853/ |