|  | ========================================== | 
|  | Reducing OS jitter due to per-cpu kthreads | 
|  | ========================================== | 
|  |  | 
|  | This document lists per-CPU kthreads in the Linux kernel and presents | 
|  | options to control their OS jitter.  Note that non-per-CPU kthreads are | 
|  | not listed here.  To reduce OS jitter from non-per-CPU kthreads, bind | 
|  | them to a "housekeeping" CPU dedicated to such work. | 
|  |  | 
|  | References | 
|  | ========== | 
|  |  | 
|  | -	Documentation/IRQ-affinity.txt:  Binding interrupts to sets of CPUs. | 
|  |  | 
|  | -	Documentation/cgroup-v1:  Using cgroups to bind tasks to sets of CPUs. | 
|  |  | 
|  | -	man taskset:  Using the taskset command to bind tasks to sets | 
|  | of CPUs. | 
|  |  | 
|  | -	man sched_setaffinity:  Using the sched_setaffinity() system | 
|  | call to bind tasks to sets of CPUs. | 
|  |  | 
|  | -	/sys/devices/system/cpu/cpuN/online:  Control CPU N's hotplug state, | 
|  | writing "0" to offline and "1" to online. | 
|  |  | 
|  | -	In order to locate kernel-generated OS jitter on CPU N: | 
|  |  | 
|  | cd /sys/kernel/debug/tracing | 
|  | echo 1 > max_graph_depth # Increase the "1" for more detail | 
|  | echo function_graph > current_tracer | 
|  | # run workload | 
|  | cat per_cpu/cpuN/trace | 
|  |  | 
|  | kthreads | 
|  | ======== | 
|  |  | 
|  | Name: | 
|  | ehca_comp/%u | 
|  |  | 
|  | Purpose: | 
|  | Periodically process Infiniband-related work. | 
|  |  | 
|  | To reduce its OS jitter, do any of the following: | 
|  |  | 
|  | 1.	Don't use eHCA Infiniband hardware, instead choosing hardware | 
|  | that does not require per-CPU kthreads.  This will prevent these | 
|  | kthreads from being created in the first place.  (This will | 
|  | work for most people, as this hardware, though important, is | 
|  | relatively old and is produced in relatively low unit volumes.) | 
|  | 2.	Do all eHCA-Infiniband-related work on other CPUs, including | 
|  | interrupts. | 
|  | 3.	Rework the eHCA driver so that its per-CPU kthreads are | 
|  | provisioned only on selected CPUs. | 
|  |  | 
|  |  | 
|  | Name: | 
|  | irq/%d-%s | 
|  |  | 
|  | Purpose: | 
|  | Handle threaded interrupts. | 
|  |  | 
|  | To reduce its OS jitter, do the following: | 
|  |  | 
|  | 1.	Use irq affinity to force the irq threads to execute on | 
|  | some other CPU. | 
|  |  | 
|  | Name: | 
|  | kcmtpd_ctr_%d | 
|  |  | 
|  | Purpose: | 
|  | Handle Bluetooth work. | 
|  |  | 
|  | To reduce its OS jitter, do one of the following: | 
|  |  | 
|  | 1.	Don't use Bluetooth, in which case these kthreads won't be | 
|  | created in the first place. | 
|  | 2.	Use irq affinity to force Bluetooth-related interrupts to | 
|  | occur on some other CPU and furthermore initiate all | 
|  | Bluetooth activity on some other CPU. | 
|  |  | 
|  | Name: | 
|  | ksoftirqd/%u | 
|  |  | 
|  | Purpose: | 
|  | Execute softirq handlers when threaded or when under heavy load. | 
|  |  | 
|  | To reduce its OS jitter, each softirq vector must be handled | 
|  | separately as follows: | 
|  |  | 
|  | TIMER_SOFTIRQ | 
|  | ------------- | 
|  |  | 
|  | Do all of the following: | 
|  |  | 
|  | 1.	To the extent possible, keep the CPU out of the kernel when it | 
|  | is non-idle, for example, by avoiding system calls and by forcing | 
|  | both kernel threads and interrupts to execute elsewhere. | 
|  | 2.	Build with CONFIG_HOTPLUG_CPU=y.  After boot completes, force | 
|  | the CPU offline, then bring it back online.  This forces | 
|  | recurring timers to migrate elsewhere.	If you are concerned | 
|  | with multiple CPUs, force them all offline before bringing the | 
|  | first one back online.  Once you have onlined the CPUs in question, | 
|  | do not offline any other CPUs, because doing so could force the | 
|  | timer back onto one of the CPUs in question. | 
|  |  | 
|  | NET_TX_SOFTIRQ and NET_RX_SOFTIRQ | 
|  | --------------------------------- | 
|  |  | 
|  | Do all of the following: | 
|  |  | 
|  | 1.	Force networking interrupts onto other CPUs. | 
|  | 2.	Initiate any network I/O on other CPUs. | 
|  | 3.	Once your application has started, prevent CPU-hotplug operations | 
|  | from being initiated from tasks that might run on the CPU to | 
|  | be de-jittered.  (It is OK to force this CPU offline and then | 
|  | bring it back online before you start your application.) | 
|  |  | 
|  | BLOCK_SOFTIRQ | 
|  | ------------- | 
|  |  | 
|  | Do all of the following: | 
|  |  | 
|  | 1.	Force block-device interrupts onto some other CPU. | 
|  | 2.	Initiate any block I/O on other CPUs. | 
|  | 3.	Once your application has started, prevent CPU-hotplug operations | 
|  | from being initiated from tasks that might run on the CPU to | 
|  | be de-jittered.  (It is OK to force this CPU offline and then | 
|  | bring it back online before you start your application.) | 
|  |  | 
|  | IRQ_POLL_SOFTIRQ | 
|  | ---------------- | 
|  |  | 
|  | Do all of the following: | 
|  |  | 
|  | 1.	Force block-device interrupts onto some other CPU. | 
|  | 2.	Initiate any block I/O and block-I/O polling on other CPUs. | 
|  | 3.	Once your application has started, prevent CPU-hotplug operations | 
|  | from being initiated from tasks that might run on the CPU to | 
|  | be de-jittered.  (It is OK to force this CPU offline and then | 
|  | bring it back online before you start your application.) | 
|  |  | 
|  | TASKLET_SOFTIRQ | 
|  | --------------- | 
|  |  | 
|  | Do one or more of the following: | 
|  |  | 
|  | 1.	Avoid use of drivers that use tasklets.  (Such drivers will contain | 
|  | calls to things like tasklet_schedule().) | 
|  | 2.	Convert all drivers that you must use from tasklets to workqueues. | 
|  | 3.	Force interrupts for drivers using tasklets onto other CPUs, | 
|  | and also do I/O involving these drivers on other CPUs. | 
|  |  | 
|  | SCHED_SOFTIRQ | 
|  | ------------- | 
|  |  | 
|  | Do all of the following: | 
|  |  | 
|  | 1.	Avoid sending scheduler IPIs to the CPU to be de-jittered, | 
|  | for example, ensure that at most one runnable kthread is present | 
|  | on that CPU.  If a thread that expects to run on the de-jittered | 
|  | CPU awakens, the scheduler will send an IPI that can result in | 
|  | a subsequent SCHED_SOFTIRQ. | 
|  | 2.	CONFIG_NO_HZ_FULL=y and ensure that the CPU to be de-jittered | 
|  | is marked as an adaptive-ticks CPU using the "nohz_full=" | 
|  | boot parameter.  This reduces the number of scheduler-clock | 
|  | interrupts that the de-jittered CPU receives, minimizing its | 
|  | chances of being selected to do the load balancing work that | 
|  | runs in SCHED_SOFTIRQ context. | 
|  | 3.	To the extent possible, keep the CPU out of the kernel when it | 
|  | is non-idle, for example, by avoiding system calls and by | 
|  | forcing both kernel threads and interrupts to execute elsewhere. | 
|  | This further reduces the number of scheduler-clock interrupts | 
|  | received by the de-jittered CPU. | 
|  |  | 
|  | HRTIMER_SOFTIRQ | 
|  | --------------- | 
|  |  | 
|  | Do all of the following: | 
|  |  | 
|  | 1.	To the extent possible, keep the CPU out of the kernel when it | 
|  | is non-idle.  For example, avoid system calls and force both | 
|  | kernel threads and interrupts to execute elsewhere. | 
|  | 2.	Build with CONFIG_HOTPLUG_CPU=y.  Once boot completes, force the | 
|  | CPU offline, then bring it back online.  This forces recurring | 
|  | timers to migrate elsewhere.  If you are concerned with multiple | 
|  | CPUs, force them all offline before bringing the first one | 
|  | back online.  Once you have onlined the CPUs in question, do not | 
|  | offline any other CPUs, because doing so could force the timer | 
|  | back onto one of the CPUs in question. | 
|  |  | 
|  | RCU_SOFTIRQ | 
|  | ----------- | 
|  |  | 
|  | Do at least one of the following: | 
|  |  | 
|  | 1.	Offload callbacks and keep the CPU in either dyntick-idle or | 
|  | adaptive-ticks state by doing all of the following: | 
|  |  | 
|  | a.	CONFIG_NO_HZ_FULL=y and ensure that the CPU to be | 
|  | de-jittered is marked as an adaptive-ticks CPU using the | 
|  | "nohz_full=" boot parameter.  Bind the rcuo kthreads to | 
|  | housekeeping CPUs, which can tolerate OS jitter. | 
|  | b.	To the extent possible, keep the CPU out of the kernel | 
|  | when it is non-idle, for example, by avoiding system | 
|  | calls and by forcing both kernel threads and interrupts | 
|  | to execute elsewhere. | 
|  |  | 
|  | 2.	Enable RCU to do its processing remotely via dyntick-idle by | 
|  | doing all of the following: | 
|  |  | 
|  | a.	Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y. | 
|  | b.	Ensure that the CPU goes idle frequently, allowing other | 
|  | CPUs to detect that it has passed through an RCU quiescent | 
|  | state.	If the kernel is built with CONFIG_NO_HZ_FULL=y, | 
|  | userspace execution also allows other CPUs to detect that | 
|  | the CPU in question has passed through a quiescent state. | 
|  | c.	To the extent possible, keep the CPU out of the kernel | 
|  | when it is non-idle, for example, by avoiding system | 
|  | calls and by forcing both kernel threads and interrupts | 
|  | to execute elsewhere. | 
|  |  | 
|  | Name: | 
|  | kworker/%u:%d%s (cpu, id, priority) | 
|  |  | 
|  | Purpose: | 
|  | Execute workqueue requests | 
|  |  | 
|  | To reduce its OS jitter, do any of the following: | 
|  |  | 
|  | 1.	Run your workload at a real-time priority, which will allow | 
|  | preempting the kworker daemons. | 
|  | 2.	A given workqueue can be made visible in the sysfs filesystem | 
|  | by passing the WQ_SYSFS to that workqueue's alloc_workqueue(). | 
|  | Such a workqueue can be confined to a given subset of the | 
|  | CPUs using the ``/sys/devices/virtual/workqueue/*/cpumask`` sysfs | 
|  | files.	The set of WQ_SYSFS workqueues can be displayed using | 
|  | "ls sys/devices/virtual/workqueue".  That said, the workqueues | 
|  | maintainer would like to caution people against indiscriminately | 
|  | sprinkling WQ_SYSFS across all the workqueues.	The reason for | 
|  | caution is that it is easy to add WQ_SYSFS, but because sysfs is | 
|  | part of the formal user/kernel API, it can be nearly impossible | 
|  | to remove it, even if its addition was a mistake. | 
|  | 3.	Do any of the following needed to avoid jitter that your | 
|  | application cannot tolerate: | 
|  |  | 
|  | a.	Build your kernel with CONFIG_SLUB=y rather than | 
|  | CONFIG_SLAB=y, thus avoiding the slab allocator's periodic | 
|  | use of each CPU's workqueues to run its cache_reap() | 
|  | function. | 
|  | b.	Avoid using oprofile, thus avoiding OS jitter from | 
|  | wq_sync_buffer(). | 
|  | c.	Limit your CPU frequency so that a CPU-frequency | 
|  | governor is not required, possibly enlisting the aid of | 
|  | special heatsinks or other cooling technologies.  If done | 
|  | correctly, and if you CPU architecture permits, you should | 
|  | be able to build your kernel with CONFIG_CPU_FREQ=n to | 
|  | avoid the CPU-frequency governor periodically running | 
|  | on each CPU, including cs_dbs_timer() and od_dbs_timer(). | 
|  |  | 
|  | WARNING:  Please check your CPU specifications to | 
|  | make sure that this is safe on your particular system. | 
|  | d.	As of v3.18, Christoph Lameter's on-demand vmstat workers | 
|  | commit prevents OS jitter due to vmstat_update() on | 
|  | CONFIG_SMP=y systems.  Before v3.18, is not possible | 
|  | to entirely get rid of the OS jitter, but you can | 
|  | decrease its frequency by writing a large value to | 
|  | /proc/sys/vm/stat_interval.  The default value is HZ, | 
|  | for an interval of one second.	Of course, larger values | 
|  | will make your virtual-memory statistics update more | 
|  | slowly.  Of course, you can also run your workload at | 
|  | a real-time priority, thus preempting vmstat_update(), | 
|  | but if your workload is CPU-bound, this is a bad idea. | 
|  | However, there is an RFC patch from Christoph Lameter | 
|  | (based on an earlier one from Gilad Ben-Yossef) that | 
|  | reduces or even eliminates vmstat overhead for some | 
|  | workloads at https://lkml.org/lkml/2013/9/4/379. | 
|  | e.	Boot with "elevator=noop" to avoid workqueue use by | 
|  | the block layer. | 
|  | f.	If running on high-end powerpc servers, build with | 
|  | CONFIG_PPC_RTAS_DAEMON=n.  This prevents the RTAS | 
|  | daemon from running on each CPU every second or so. | 
|  | (This will require editing Kconfig files and will defeat | 
|  | this platform's RAS functionality.)  This avoids jitter | 
|  | due to the rtas_event_scan() function. | 
|  | WARNING:  Please check your CPU specifications to | 
|  | make sure that this is safe on your particular system. | 
|  | g.	If running on Cell Processor, build your kernel with | 
|  | CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from | 
|  | spu_gov_work(). | 
|  | WARNING:  Please check your CPU specifications to | 
|  | make sure that this is safe on your particular system. | 
|  | h.	If running on PowerMAC, build your kernel with | 
|  | CONFIG_PMAC_RACKMETER=n to disable the CPU-meter, | 
|  | avoiding OS jitter from rackmeter_do_timer(). | 
|  |  | 
|  | Name: | 
|  | rcuc/%u | 
|  |  | 
|  | Purpose: | 
|  | Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels. | 
|  |  | 
|  | To reduce its OS jitter, do at least one of the following: | 
|  |  | 
|  | 1.	Build the kernel with CONFIG_PREEMPT=n.  This prevents these | 
|  | kthreads from being created in the first place, and also obviates | 
|  | the need for RCU priority boosting.  This approach is feasible | 
|  | for workloads that do not require high degrees of responsiveness. | 
|  | 2.	Build the kernel with CONFIG_RCU_BOOST=n.  This prevents these | 
|  | kthreads from being created in the first place.  This approach | 
|  | is feasible only if your workload never requires RCU priority | 
|  | boosting, for example, if you ensure frequent idle time on all | 
|  | CPUs that might execute within the kernel. | 
|  | 3.	Build with CONFIG_RCU_NOCB_CPU=y and boot with the rcu_nocbs= | 
|  | boot parameter offloading RCU callbacks from all CPUs susceptible | 
|  | to OS jitter.  This approach prevents the rcuc/%u kthreads from | 
|  | having any work to do, so that they are never awakened. | 
|  | 4.	Ensure that the CPU never enters the kernel, and, in particular, | 
|  | avoid initiating any CPU hotplug operations on this CPU.  This is | 
|  | another way of preventing any callbacks from being queued on the | 
|  | CPU, again preventing the rcuc/%u kthreads from having any work | 
|  | to do. | 
|  |  | 
|  | Name: | 
|  | rcuob/%d, rcuop/%d, and rcuos/%d | 
|  |  | 
|  | Purpose: | 
|  | Offload RCU callbacks from the corresponding CPU. | 
|  |  | 
|  | To reduce its OS jitter, do at least one of the following: | 
|  |  | 
|  | 1.	Use affinity, cgroups, or other mechanism to force these kthreads | 
|  | to execute on some other CPU. | 
|  | 2.	Build with CONFIG_RCU_NOCB_CPU=n, which will prevent these | 
|  | kthreads from being created in the first place.  However, please | 
|  | note that this will not eliminate OS jitter, but will instead | 
|  | shift it to RCU_SOFTIRQ. | 
|  |  | 
|  | Name: | 
|  | watchdog/%u | 
|  |  | 
|  | Purpose: | 
|  | Detect software lockups on each CPU. | 
|  |  | 
|  | To reduce its OS jitter, do at least one of the following: | 
|  |  | 
|  | 1.	Build with CONFIG_LOCKUP_DETECTOR=n, which will prevent these | 
|  | kthreads from being created in the first place. | 
|  | 2.	Boot with "nosoftlockup=0", which will also prevent these kthreads | 
|  | from being created.  Other related watchdog and softlockup boot | 
|  | parameters may be found in Documentation/admin-guide/kernel-parameters.rst | 
|  | and Documentation/watchdog/watchdog-parameters.txt. | 
|  | 3.	Echo a zero to /proc/sys/kernel/watchdog to disable the | 
|  | watchdog timer. | 
|  | 4.	Echo a large number of /proc/sys/kernel/watchdog_thresh in | 
|  | order to reduce the frequency of OS jitter due to the watchdog | 
|  | timer down to a level that is acceptable for your workload. |