===========================================================
Clock sources, Clock events, sched_clock() and delay timers
===========================================================

This document tries to briefly explain some basic kernel timekeeping
abstractions. It partly pertains to the drivers usually found in
drivers/clocksource in the kernel tree, but the code may be spread out
across the kernel.

If you grep through the kernel source you will find a number of
architecture-specific implementations of clock sources, clockevents and
several likewise architecture-specific overrides of the sched_clock()
function and some delay timers.

To provide timekeeping for your platform, the clock source provides
the basic timeline, whereas clock events fire interrupts at certain
points on this timeline, providing facilities such as high-resolution
timers. sched_clock() is used for scheduling and timestamping, and delay
timers provide an accurate delay source using hardware counters.


Clock sources
-------------

The purpose of the clock source is to provide a timeline for the system that
tells you where you are in time. For example, issuing the command 'date' on
a Linux system will eventually read the clock source to determine exactly
what time it is.

Typically the clock source is a monotonic, atomic counter which will provide
n bits which count from 0 to (2^n)-1, then wrap around to 0 and start over.
It will ideally NEVER stop ticking as long as the system is running, though
it may stop during system suspend.

The clock source shall have as high a resolution as possible, and its
frequency shall be as stable and correct as possible compared to a
real-world wall clock. It should not move unpredictably back and forth in
time or miss a few cycles here and there.

It must be immune to the kind of effects that occur in hardware where e.g.
the counter register is read in two phases on the bus: the lowest 16 bits
first and the higher 16 bits in a second bus cycle, with the counter bits
potentially being updated in between, leading to the risk of very strange
values from the counter.

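Drivers for such split counters commonly work around this by re-reading the
halves until two reads of the high half agree. A minimal sketch, assuming
hypothetical HI_REG/LO_REG offsets for a 32-bit counter exposed as two
16-bit registers::

  static u32 read_split_counter(void __iomem *base)
  {
          u16 hi, lo, hi2;

          do {
                  hi  = readw(base + HI_REG);   /* hypothetical offsets */
                  lo  = readw(base + LO_REG);
                  hi2 = readw(base + HI_REG);
                  /* If the high half changed, the low half wrapped: retry. */
          } while (hi != hi2);

          return ((u32)hi << 16) | lo;
  }
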
When the wall-clock accuracy of the clock source isn't satisfactory, there
are various quirks and layers in the timekeeping code for e.g. synchronizing
the user-visible time to RTC clocks in the system or against networked time
servers using NTP, but all they basically do is update an offset against
the clock source, which provides the fundamental timeline for the system.
These measures do not affect the clock source per se; they only adapt the
system to its shortcomings.

The clock source struct shall provide means to translate the provided counter
into a nanosecond value as an unsigned long long (unsigned 64 bit) number.
Since this operation may be invoked very often, doing this in a strict
mathematical sense is not desirable: instead the number is taken as close as
possible to a nanosecond value using only the arithmetic operations
multiply and shift, so in clocksource_cyc2ns() you find::

  ns ~= (clocksource * mult) >> shift

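As an illustration only (a standalone sketch of the arithmetic, not the
exact kernel helper), the conversion is just a multiply and a shift::

  /* Convert a cycle count to nanoseconds using precomputed mult/shift. */
  static inline u64 cyc2ns(u64 cycles, u32 mult, u32 shift)
  {
          return (cycles * mult) >> shift;
  }

  /*
   * Example: for a 1 MHz counter one cycle is 1000 ns, so mult = 1000 and
   * shift = 0 is exact; larger shift values trade range for precision.
   */
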
You will find a number of helper functions in the clock source code intended
to aid in providing these mult and shift values, such as
clocksource_khz2mult() and clocksource_hz2mult(), which help determine the
mult factor from a fixed shift, and clocksource_register_hz() and
clocksource_register_khz(), which will help out assigning both shift and mult
factors using the frequency of the clock source as the only input.

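As a rough sketch of how a driver might use these helpers (the register
layout, names and 24 MHz rate here are made up for illustration)::

  #include <linux/clocksource.h>
  #include <linux/io.h>

  /* Hypothetical timer base; a real driver gets this via ioremap()/DT. */
  static void __iomem *timer_base;

  static u64 my_timer_read(struct clocksource *cs)
  {
          return readl(timer_base);   /* free-running 32-bit up-counter */
  }

  static struct clocksource my_clocksource = {
          .name   = "my-timer",
          .rating = 300,
          .read   = my_timer_read,
          .mask   = CLOCKSOURCE_MASK(32),
          .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
  };

  static int __init my_timer_clocksource_init(void)
  {
          /* The core derives mult and shift from the 24 MHz clock rate. */
          return clocksource_register_hz(&my_clocksource, 24000000);
  }
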
For really simple clock sources accessed from a single I/O memory location
there is nowadays even clocksource_mmio_init(), which will take a memory
location, bit width, a parameter telling whether the counter in the
register counts up or down, and the timer clock rate, and then conjure all
necessary parameters.

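For such a counter the whole registration above can collapse into a single
call; a sketch, again with the hypothetical base address and rate::

  /* 32-bit up-counting register at timer_base, clocked at 24 MHz. */
  clocksource_mmio_init(timer_base, "my-timer", 24000000, 300, 32,
                        clocksource_mmio_readl_up);
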
Since a 32-bit counter at, say, 100 MHz will wrap around to zero after some
43 seconds, the code handling the clock source will have to compensate for
this. That is the reason why the clock source struct also contains a 'mask'
member telling how many bits of the source are valid. This way the timekeeping
code knows when the counter will wrap around and can insert the necessary
compensation code on both sides of the wrap point so that the system timeline
remains monotonic.


Clock events
------------

Clock events are the conceptual reverse of clock sources: they take a
desired time specification value and calculate the values to poke into
hardware timer registers.

Clock events are orthogonal to clock sources. The same hardware
and register range may be used for the clock event, but it is essentially
a different thing. The hardware driving clock events has to be able to
fire interrupts, so as to trigger events on the system timeline. On an SMP
system, it is ideal (and customary) to have one such event-driving timer per
CPU core, so that each core can trigger events independently of any other
core.

You will notice that the clock event device code is based on the same basic
idea of translating counters to nanoseconds using mult and shift
arithmetic, and you find the same family of helper functions again for
assigning these values. The clock event driver does not need a 'mask'
attribute, however: the system will not try to plan events beyond the time
horizon of the clock event.

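A rough sketch of a clock event device as a driver might register it (the
register offset, names and 24 MHz rate are again hypothetical; the callback
and clockevents_config_and_register() are the standard interface)::

  #include <linux/clockchips.h>
  #include <linux/io.h>

  #define MY_TIMER_CMP	0x08   /* hypothetical comparator register */

  static int my_timer_set_next_event(unsigned long cycles,
                                     struct clock_event_device *evt)
  {
          /* Fire an interrupt 'cycles' timer ticks into the future. */
          writel(readl(timer_base) + cycles, timer_base + MY_TIMER_CMP);
          return 0;
  }

  static struct clock_event_device my_clockevent = {
          .name           = "my-timer",
          .features       = CLOCK_EVT_FEAT_ONESHOT,
          .rating         = 300,
          .set_next_event = my_timer_set_next_event,
  };

  static void __init my_timer_clockevent_init(void)
  {
          my_clockevent.cpumask = cpumask_of(0);
          /* min/max delta bound how far ahead events may be programmed. */
          clockevents_config_and_register(&my_clockevent, 24000000,
                                          0xf, 0x7fffffff);
  }
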

sched_clock()
-------------

In addition to the clock sources and clock events there is a special weak
function in the kernel called sched_clock(). This function shall return the
number of nanoseconds since the system was started. An architecture may or
may not provide an implementation of sched_clock() on its own. If a local
implementation is not provided, the system jiffy counter will be used as
sched_clock().

As the name suggests, sched_clock() is used for scheduling the system, for
example determining the absolute timeslice for a certain process in the CFS
scheduler. It is also used for printk timestamps when you have selected to
include time information in printk for things like bootcharts.

Compared to clock sources, sched_clock() has to be very fast: it is called
much more often, especially by the scheduler. If you have to trade off
accuracy against speed compared to the clock source, you may sacrifice
accuracy for speed in sched_clock(). It does, however, require some of the
same basic characteristics as the clock source, i.e. it should be monotonic.

The sched_clock() function may wrap only on unsigned long long boundaries,
i.e. after 64 bits. Since this is a nanosecond value this will mean it wraps
after circa 585 years. (For most practical systems this means "never".)

If an architecture does not provide its own implementation of this function,
it will fall back to using jiffies, making its maximum resolution one jiffy,
i.e. 1/HZ seconds for the architecture's HZ setting. This will affect
scheduling accuracy and will likely show up in system benchmarks.

The clock driving sched_clock() may stop or reset to zero during system
suspend/sleep. This does not matter for the purpose it serves, namely
scheduling events on the system. However, it may result in interesting
timestamps in printk().

The sched_clock() function should be callable in any context, be IRQ- and
NMI-safe, and return a sane value whenever it is called.

Some architectures may have a limited set of time sources and lack a nice
counter to derive a 64-bit nanosecond value, so for example on the ARM
architecture, special helper functions have been created to provide a
sched_clock() nanosecond base from a 16- or 32-bit counter. Sometimes the
same counter that is used as the clock source is also used for this purpose.

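The generic helper in include/linux/sched_clock.h can then be fed with the
raw counter; a minimal sketch, reusing the hypothetical 32-bit counter at
24 MHz from the clock source example::

  #include <linux/sched_clock.h>
  #include <linux/io.h>

  static u64 notrace my_sched_clock_read(void)
  {
          return readl_relaxed(timer_base);
  }

  static void __init my_sched_clock_init(void)
  {
          /* 32 valid bits at 24 MHz; the core extends this to 64-bit ns. */
          sched_clock_register(my_sched_clock_read, 32, 24000000);
  }
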
On SMP systems, it is crucial for performance that sched_clock() can be called
independently on each CPU without any synchronization performance hits.
Some hardware (such as the x86 TSC) will cause the sched_clock() function to
drift between the CPUs on the system. The kernel can work around this by
enabling the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK option. This is another aspect
that makes sched_clock() different from the ordinary clock source.


Delay timers (some architectures only)
--------------------------------------

On systems with variable CPU frequency, the various kernel delay() functions
will sometimes behave strangely. Basically these delays usually use a hard
loop to delay a certain number of jiffy fractions using a "lpj" (loops per
jiffy) value, calibrated on boot.

Let's hope that your system is running at maximum frequency when this value
is calibrated: as an effect, when the frequency is geared down to half the
full frequency, any delay() will take twice as long. Usually this does not
hurt, as you're commonly requesting that amount of delay *or more*. But
basically the semantics are quite unpredictable on such systems.

Enter timer-based delays. Using these, a timer read may be used instead of
a hard-coded loop for providing the desired delay.

This is done by declaring a struct delay_timer and assigning the appropriate
function pointers and rate settings for this delay timer.

This is available on some architectures like OpenRISC or ARM.
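
On ARM, for example, this looks roughly as follows (reusing the hypothetical
24 MHz counter from the sketches above; struct delay_timer and
register_current_timer_delay() come from the architecture code)::

  #include <linux/delay.h>
  #include <linux/io.h>

  static unsigned long my_read_current_timer(void)
  {
          return readl_relaxed(timer_base);
  }

  static struct delay_timer my_delay_timer = {
          .read_current_timer = my_read_current_timer,
          .freq               = 24000000,   /* counter rate in Hz */
  };

  static void __init my_timer_delay_init(void)
  {
          register_current_timer_delay(&my_delay_timer);
  }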