hrtimers - subsystem for high-resolution kernel timers
------------------------------------------------------

This patch introduces a new subsystem for high-resolution kernel timers.

One might ask the question: we already have a timer subsystem
(kernel/timers.c), why do we need two timer subsystems? After a lot of
back and forth trying to integrate high-resolution and high-precision
features into the existing timer framework, and after testing various
such high-resolution timer implementations in practice, we came to the
conclusion that the timer wheel code is fundamentally not suitable for
such an approach. We initially didn't believe this ('there must be a way
to solve this'), and spent a considerable effort trying to integrate
things into the timer wheel, but we failed. In hindsight, there are
several reasons why such integration
is hard/impossible:

- the forced handling of low-resolution and high-resolution timers in
  the same way leads to a lot of compromises, macro magic and #ifdef
  mess. The timers.c code is very "tightly coded" around jiffies and
  32-bitness assumptions, and has been honed and micro-optimized for a
  relatively narrow use case (jiffies in a relatively narrow HZ range)
  for many years - and thus even small extensions to it easily break
  the wheel concept, leading to even worse compromises. The timer wheel
  code is very good and tight code, there are zero problems with it in
  its current usage - but it is simply not suitable to be extended for
  high-res timers.

- the unpredictable [O(N)] overhead of cascading leads to delays which
  necessitate a more complex handling of high resolution timers, which
  in turn decreases robustness. Such a design still led to rather large
  timing inaccuracies. Cascading is a fundamental property of the timer
  wheel concept, it cannot be 'designed out' without inevitably
  degrading other portions of the timers.c code in an unacceptable way.

- the implementation of the current posix-timer subsystem on
  top of the timer wheel has already introduced a quite complex handling
  of the required readjusting of absolute CLOCK_REALTIME timers at
  settimeofday or NTP time - further underlining our experience by
  example: that the timer wheel data structure is too rigid for high-res
  timers.

- the timer wheel code is best suited to use cases which can be
  identified as "timeouts". Such timeouts are usually set up to cover
  error conditions in various I/O paths, such as networking and block
  I/O. The vast majority of those timers never expire and are rarely
  recascaded because the expected correct event arrives in time so they
  can be removed from the timer wheel before any further processing of
  them becomes necessary. Thus the users of these timeouts can accept
  the granularity and precision tradeoffs of the timer wheel, and
  largely expect the timer subsystem to have near-zero overhead.
  Accurate timing for them is not a core purpose - in fact most of the
  timeout values used are ad-hoc. For them it is at most a necessary
  evil to guarantee the processing of actual timeout completions
  (because most of the timeouts are deleted before completion), which
  should thus be as cheap and unintrusive as possible.

The primary users of precision
timers are user-space applications that
utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
users like drivers and subsystems which require precise timed events
(e.g. multimedia) can benefit from the availability of a separate
high-resolution timer subsystem as well.

While this subsystem does not offer high-resolution clock sources just
yet, the hrtimer subsystem can be easily extended with high-resolution
clock capabilities, and patches for that exist and are maturing quickly.
The increasing demand for realtime and multimedia applications along
with other potential users for precise timers gives another reason to
separate the "timeout" and "precise timer" subsystems.

Another potential benefit is that such a separation
allows even more
special-purpose optimization of the existing timer wheel for the low
resolution and low precision use cases - once the precision-sensitive
APIs are separated from the timer wheel and are migrated over to
hrtimers. E.g. we could decrease the frequency of the timeout subsystem
from 250 Hz to 100 Hz (or even lower).

hrtimer subsystem implementation details
----------------------------------------

The basic design considerations were:

- simplicity

- data structure not bound to jiffies or any other granularity. All the
  kernel logic works at 64-bit nanoseconds resolution - no compromises.

- simplification of existing, timing related kernel code

Another basic requirement was the immediate enqueueing and ordering of
timers at activation time. After looking at several possible solutions
such as radix trees and hashes, we chose the red black tree as the basic
data structure. Rbtrees are available as a library in the kernel and are
used in various performance-critical areas of e.g. memory management and
file systems. The rbtree is solely used for time sorted ordering, while
a separate list is used to give the expiry code fast access to the
queued timers, without having to walk the rbtree.

(This separate list is also useful for later when we'll
introduce high-resolution
clocks, where we need separate pending and expired queues while
keeping the time-order intact.)

Time-ordered enqueueing is not purely for the purposes of
high-resolution clocks though, it also simplifies the handling of
absolute timers based on a low-resolution
CLOCK_REALTIME. The existing implementation needed to keep an extra
list of all armed absolute CLOCK_REALTIME timers along with complex
locking. In case of settimeofday and NTP, all the timers (!) had to be
dequeued, the time-changing code had to fix them up one by one, and all
of them had to be enqueued again. The time-ordered enqueueing and the
storage of the expiry time in absolute time units remove all this
complex and poorly scaling code from the posix-timer implementation -
the clock can simply be set without having to touch the rbtree. This
also makes the handling of posix-timers simpler in general.

The locking and per-CPU behavior of hrtimers was mostly taken from the
existing timer wheel code, as it is mature and well suited. Sharing code
was not really a win, due to the different data structures. Also, the
hrtimer functions now have clearer behavior and clearer names - such as
hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
equivalent to del_timer() and del_timer_sync()] - so there's no direct
1:1 mapping between them on the algorithmic level, and thus no real
potential for code sharing either.

Basic data types: every time value, absolute or relative, is in a
special nanosecond-resolution type: ktime_t. The kernel-internal
representation of ktime_t values and operations is implemented via
macros and inline functions, and can be switched between a "hybrid
union" type and a plain "scalar" 64bit nanoseconds representation (at
compile time). The hybrid union type optimizes time conversions on 32bit
CPUs. This build-time-selectable ktime_t storage format was implemented
to avoid the performance impact of 64-bit multiplications and divisions
on 32bit CPUs. Such operations are frequently necessary to convert
between the storage formats provided by kernel and userspace interfaces
and the internal time format. (See include/linux/ktime.h for further
details.)

hrtimers - rounding of timer values
-----------------------------------

The hrtimer code will round timer events to lower-resolution
clocks because it has to. Otherwise it will do no artificial rounding
at all.

One question is what resolution value should be returned to the user by
the clock_getres() interface. This will return whatever real resolution
a given clock has - be it low-res, high-res, or artificially-low-res.

hrtimers - testing and verification
-----------------------------------

We used the high-resolution
clock subsystem on top of hrtimers to verify the hrtimer implementation
details in practice, and we also ran the posix timer tests in order to
ensure specification compliance. We also ran tests on low-resolution
clocks.

The hrtimer patch converts the following kernel functionality to use
hrtimers:

 - nanosleep
 - itimers
 - posix-timers

The conversion of nanosleep and posix-timers enabled the unification of
nanosleep and clock_nanosleep.

The code was successfully compiled for the following platforms:

 i386, x86_64, ARM, PPC, PPC64, IA64

The code was run-tested on the following platforms:

 i386(UP/SMP), x86_64(UP/SMP), ARM, PPC

hrtimers were also integrated into the -rt tree, along with a
hrtimers-based high-resolution clock implementation, so the hrtimers
code got a healthy amount of testing and use in practice.

        Thomas Gleixner, Ingo Molnar
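
Appendix (illustrative sketch, not part of the patch): the time-ordered
enqueueing with absolute expiry times described under the implementation
details can be modelled in a few lines of user-space C. This is only a
sketch under stated assumptions: a sorted singly linked list stands in
for the rbtree (so insertion is O(N) here, unlike the real structure),
sketch_ktime_t is a hypothetical stand-in for the "scalar" 64-bit
nanosecond form of ktime_t, and every sketch_* name is invented for
illustration and does not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-in for the "scalar" ktime_t: signed 64-bit nanoseconds. */
typedef int64_t sketch_ktime_t;

struct sketch_timer {
	sketch_ktime_t expires;		/* absolute expiry time, in nanoseconds */
	struct sketch_timer *next;	/* next timer in time order */
};

/* Build a time value from seconds and nanoseconds. */
static sketch_ktime_t sketch_ktime_set(int64_t secs, int64_t nsecs)
{
	return secs * SKETCH_NSEC_PER_SEC + nsecs;
}

/*
 * Insert a timer at activation time, keeping the queue sorted by
 * absolute expiry. Timers with equal expiry keep their insertion order.
 * The sorted list is an assumption of this sketch; it replaces the rbtree.
 */
static void sketch_enqueue(struct sketch_timer **head, struct sketch_timer *t)
{
	assert(t != NULL);
	while (*head != NULL && (*head)->expires <= t->expires)
		head = &(*head)->next;
	t->next = *head;
	*head = t;
}

/*
 * Remove every timer whose expiry is at or before 'now' and return how
 * many were removed. Because expiry times are absolute, a changed clock
 * only changes the 'now' passed in; queued timers are never re-sorted.
 */
static int sketch_expire(struct sketch_timer **head, sketch_ktime_t now)
{
	int fired = 0;

	while (*head != NULL && (*head)->expires <= now) {
		struct sketch_timer *t = *head;

		*head = t->next;
		t->next = NULL;
		fired++;
	}
	return fired;
}
```

A caller would enqueue timers in any order and then call
sketch_expire() with the current time. If the clock is set forward, the
next call simply passes the new time and the already-sorted queue is
consumed from the front, which is the property the text relies on when
it says the clock can be set without touching the rbtree.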