hrtimers - subsystem for high-resolution kernel timers

This patch introduces a new subsystem for high-resolution kernel timers.

One might ask the question: we already have a timer subsystem
(kernel/timers.c), why do we need two timer subsystems? After a lot of
back and forth trying to integrate high-resolution and high-precision
features into the existing timer framework, and after testing various
such high-resolution timer implementations in practice, we came to the
conclusion that the timer wheel code is fundamentally not suitable for
such an approach. We initially didn't believe this ('there must be a way
to solve this'), and spent a considerable effort trying to integrate
things into the timer wheel, but we failed. In hindsight, there are
several reasons why such integration is hard/impossible:

- the forced handling of low-resolution and high-resolution timers in
  the same way leads to a lot of compromises, macro magic and #ifdef
  mess. The timers.c code is very "tightly coded" around jiffies and
  32-bitness assumptions, and has been honed and micro-optimized for a
  relatively narrow use case (jiffies in a relatively narrow HZ range)
  for many years - and thus even small extensions to it easily break
  the wheel concept, leading to even worse compromises. The timer wheel
  code is very good and tight code, and there are zero problems with it
  in its current usage - but it is simply not suitable to be extended
  for high-res timers.

- the unpredictable [O(N)] overhead of cascading leads to delays which
  necessitate a more complex handling of high resolution timers, which
  in turn decreases robustness. Such a design still led to rather large
  timing inaccuracies. Cascading is a fundamental property of the timer
  wheel concept; it cannot be 'designed out' without inevitably
  degrading other portions of the timers.c code in an unacceptable way.

- the implementation of the current posix-timer subsystem on top of
  the timer wheel has already introduced a quite complex handling of
  the required readjusting of absolute CLOCK_REALTIME timers at
  settimeofday or NTP time - further underlining our experience by
  example: that the timer wheel data structure is too rigid for high-res
  timers.

- the timer wheel code is best suited to use cases which can be
  identified as "timeouts". Such timeouts are usually set up to cover
  error conditions in various I/O paths, such as networking and block
  I/O. The vast majority of those timers never expire and are rarely
  recascaded because the expected correct event arrives in time so they
  can be removed from the timer wheel before any further processing of
  them becomes necessary. Thus the users of these timeouts can accept
  the granularity and precision tradeoffs of the timer wheel, and
  largely expect the timer subsystem to have near-zero overhead.
  Accurate timing for them is not a core purpose - in fact most of the
  timeout values used are ad-hoc. For them it is at most a necessary
  evil to guarantee the processing of actual timeout completions
  (because most of the timeouts are deleted before completion), which
  should thus be as cheap and unintrusive as possible.

The primary users of precision timers are user-space applications that
utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
users like drivers and subsystems which require precise timed events
(e.g. multimedia) can benefit from the availability of a separate
high-resolution timer subsystem as well.

While this subsystem does not offer high-resolution clock sources just
yet, the hrtimer subsystem can be easily extended with high-resolution
clock capabilities, and patches for that exist and are maturing quickly.
The increasing demand for realtime and multimedia applications along
with other potential users for precise timers gives another reason to
separate the "timeout" and "precise timer" subsystems.

Another potential benefit is that such a separation allows even more
special-purpose optimization of the existing timer wheel for the low
resolution and low precision use cases - once the precision-sensitive
APIs are separated from the timer wheel and are migrated over to
hrtimers. E.g. we could decrease the frequency of the timeout subsystem
from 250 Hz to 100 Hz (or even lower).

hrtimer subsystem implementation details

The basic design considerations were:

- simplicity

- data structure not bound to jiffies or any other granularity. All the
  kernel logic works at 64-bit nanoseconds resolution - no compromises.

- simplification of existing, timing related kernel code

Another basic requirement was the immediate enqueueing and ordering of
timers at activation time. After looking at several possible solutions
such as radix trees and hashes, we chose the red black tree as the basic
data structure. Rbtrees are available as a library in the kernel and are
used in various performance-critical areas of e.g. memory management and
file systems. The rbtree is solely used for time sorted ordering, while
a separate list is used to give the expiry code fast access to the
queued timers, without having to walk the rbtree.

(This separate list is also useful for later when we'll introduce
high-resolution clocks, where we need separate pending and expired
queues while keeping the time-order intact.)
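
The pairing of a time-sorted tree with a fast path to the earliest timer
can be sketched in user space. This is only an illustration, not the
kernel code: a plain, unbalanced binary search tree stands in for the
kernel's rbtree library, and a single cached pointer stands in for the
separate list. All names (toy_timer, toy_base, toy_enqueue,
toy_next_expiry) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy types; not the kernel's struct hrtimer. */
struct toy_timer {
	int64_t expires_ns;              /* absolute expiry in nanoseconds */
	struct toy_timer *left, *right;  /* plain BST links (kernel: rbtree) */
};

struct toy_base {
	struct toy_timer *root;   /* time-sorted tree of queued timers */
	struct toy_timer *first;  /* cached earliest timer for the expiry path */
};

/* Insert at activation time, so the queue is ordered at all times. */
static void toy_enqueue(struct toy_base *base, struct toy_timer *t)
{
	struct toy_timer **link = &base->root;

	t->left = t->right = NULL;
	while (*link) {
		if (t->expires_ns < (*link)->expires_ns)
			link = &(*link)->left;
		else
			link = &(*link)->right;
	}
	*link = t;

	/* Keep the shortcut to the earliest timer up to date. */
	if (!base->first || t->expires_ns < base->first->expires_ns)
		base->first = t;
}

/* The expiry path reads the cached pointer; no tree walk is needed. */
static int64_t toy_next_expiry(const struct toy_base *base)
{
	return base->first ? base->first->expires_ns : INT64_MAX;
}
```

Enqueueing timers that expire at 300, 100 and 200 ns in that order
leaves toy_next_expiry() reporting 100 without walking the tree. Removal
and rebalancing are left out of the sketch on purpose.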

Time-ordered enqueueing is not purely for the purposes of
high-resolution clocks though, it also simplifies the handling of
absolute timers based on a low-resolution CLOCK_REALTIME. The existing
implementation needed to keep an extra list of all armed absolute
CLOCK_REALTIME timers along with complex locking. In case of
settimeofday and NTP, all the timers (!) had to be dequeued, the
time-changing code had to fix them up one by one, and all of them had to
be enqueued again. The time-ordered enqueueing and the storage of the
expiry time in absolute time units removes all this complex and poorly
scaling code from the posix-timer implementation - the clock can simply
be set without having to touch the rbtree. This also makes the handling
of posix-timers simpler in general.
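
Why a clock change need not touch the queue can be shown with a small
user-space sketch. It assumes a queue that already holds absolute expiry
times in ascending order; the type and function names (toy_queue,
toy_count_expired) are hypothetical and the array stands in for the
rbtree.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy queue: absolute expiry times, sorted ascending. */
struct toy_queue {
	const int64_t *expires_ns;
	size_t count;
};

/*
 * Number of timers due at 'now_ns'. The current clock reading is the
 * only varying input; the stored expiry times are never rewritten.
 */
static size_t toy_count_expired(const struct toy_queue *q, int64_t now_ns)
{
	size_t n = 0;

	while (n < q->count && q->expires_ns[n] <= now_ns)
		n++;
	return n;
}
```

Setting the clock forward or backward then amounts to calling
toy_count_expired() with a different now_ns: the same unmodified queue
simply yields a different number of due timers.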

The locking and per-CPU behavior of hrtimers was mostly taken from the
existing timer wheel code, as it is mature and well suited. Sharing code
was not really a win, due to the different data structures. Also, the
hrtimer functions now have clearer behavior and clearer names - such as
hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
equivalent to del_timer() and del_timer_sync()] - so there's no direct
1:1 mapping between them on the algorithmic level, and thus no real
potential for code sharing either.

Basic data types: every time value, absolute or relative, is in a
special nanosecond-resolution type: ktime_t. The kernel-internal
representation of ktime_t values and operations is implemented via
macros and inline functions, and can be switched between a "hybrid
union" type and a plain "scalar" 64bit nanoseconds representation (at
compile time). The hybrid union type optimizes time conversions on 32bit
CPUs. This build-time-selectable ktime_t storage format was implemented
to avoid the performance impact of 64-bit multiplications and divisions
on 32bit CPUs. Such operations are frequently necessary to convert
between the storage formats provided by kernel and userspace interfaces
and the internal time format. (See include/linux/ktime.h for further
information.)
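
The trade-off between the two storage formats can be sketched in user
space. The following is an assumption-laden illustration, not the
contents of include/linux/ktime.h: the type names, field layout and
helper names are all hypothetical, and only non-negative time values are
handled.

```c
#include <stdint.h>
#include <time.h>

#define TOY_NSEC_PER_SEC 1000000000LL

/* Scalar form: one signed 64-bit nanosecond count. */
typedef int64_t toy_ktime_scalar;

/* Hybrid form: seconds and nanoseconds kept apart, overlaid on 64 bits. */
typedef union {
	int64_t tv64;
	struct {
		int32_t sec;
		int32_t nsec;
	} tv;
} toy_ktime_hybrid;

/* Scalar conversion needs a 64-bit multiply ... */
static toy_ktime_scalar toy_scalar_from_timespec(struct timespec ts)
{
	return (int64_t)ts.tv_sec * TOY_NSEC_PER_SEC + ts.tv_nsec;
}

/* ... and converting back needs a 64-bit divide and modulo. */
static struct timespec toy_scalar_to_timespec(toy_ktime_scalar kt)
{
	struct timespec ts;

	ts.tv_sec = (time_t)(kt / TOY_NSEC_PER_SEC);
	ts.tv_nsec = (long)(kt % TOY_NSEC_PER_SEC);
	return ts;
}

/* Hybrid conversion only copies the two fields. */
static toy_ktime_hybrid toy_hybrid_from_timespec(struct timespec ts)
{
	toy_ktime_hybrid kt;

	kt.tv.sec = (int32_t)ts.tv_sec;
	kt.tv.nsec = (int32_t)ts.tv_nsec;
	return kt;
}
```

On a 64-bit CPU the multiply and divide in the scalar helpers are cheap,
so the scalar form is the natural choice there; the hybrid form trades
slightly more complex arithmetic for conversions that avoid those
operations.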

hrtimers - rounding of timer values

The hrtimer code will round timer events to lower-resolution clocks
because it has to. Otherwise it will do no artificial rounding at all.

One question is what resolution value should be returned to the user by
the clock_getres() interface. This will return whatever real resolution
a given clock has - be it low-res, high-res, or artificially-low-res.
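
From user space, the reported resolution can be inspected with the POSIX
clock_getres() call. The helper below is a minimal sketch;
print_clock_resolution is a hypothetical name, and the values printed
depend entirely on the running kernel and hardware.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Print the resolution reported for one clock; returns 0 on success. */
static int print_clock_resolution(const char *name, clockid_t id)
{
	struct timespec res;

	if (clock_getres(id, &res) != 0) {
		perror(name);
		return -1;
	}
	printf("%s: %lld s %ld ns\n", name, (long long)res.tv_sec, res.tv_nsec);
	return 0;
}
```

A typical use is to call it once for CLOCK_REALTIME and once for
CLOCK_MONOTONIC and compare the two lines of output.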

hrtimers - testing and verification

We used the high-resolution clock subsystem on top of hrtimers to verify
the hrtimer implementation details in practice, and we also ran the
posix timer tests in order to ensure specification compliance. We also
ran tests on low-resolution clocks.

The hrtimer patch converts the following kernel functionality to use
hrtimers:

 - nanosleep
 - itimers
 - posix-timers

The conversion of nanosleep and posix-timers enabled the unification of
nanosleep and clock_nanosleep.
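
Seen from user space, the two interfaces express the same relative
sleep. The sketch below only shows the two POSIX calls side by side and
says nothing about how the kernel shares code between them; the wrapper
names are hypothetical, CLOCK_MONOTONIC is chosen for the example, and
nsec must be below one second.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Relative sleep through nanosleep(); returns 0 on success, -1 on error. */
static int sleep_with_nanosleep(long nsec)
{
	struct timespec req = { .tv_sec = 0, .tv_nsec = nsec };

	return nanosleep(&req, NULL);
}

/*
 * The same relative sleep through clock_nanosleep(); flags == 0 means
 * relative. Returns 0 on success or a positive error number.
 */
static int sleep_with_clock_nanosleep(long nsec)
{
	struct timespec req = { .tv_sec = 0, .tv_nsec = nsec };

	return clock_nanosleep(CLOCK_MONOTONIC, 0, &req, NULL);
}
```

Passing TIMER_ABSTIME instead of 0 to clock_nanosleep() would turn the
request into an absolute wakeup time on the chosen clock.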

The code was successfully compiled for the following platforms:

 i386, x86_64, ARM, PPC, PPC64, IA64

The code was run-tested on the following platforms:

 i386(UP/SMP), x86_64(UP/SMP), ARM, PPC

hrtimers were also integrated into the -rt tree, along with a
hrtimers-based high-resolution clock implementation, so the hrtimers
code got a healthy amount of testing and use in practice.

        Thomas Gleixner, Ingo Molnar