linux/Documentation/futex-requeue-pi.txt
<<
alue2alue2al//spa3.1al/spa3 class="lxr_search">aluealue2alue2alue2typ Searchalue2al//spa3.1ue2< al/input typ aue2< < <1//a>Futex Requeue PI < <2//a>---------------- < <3//a>a< <4//a>Requeueing of tasks from a non-PI futex to a PI futex requiresa< <5//a>special handling in order to ensure the underlying rt_mutex is nevera< <6//a>left without a3 owner if it has waiters; doing so would break the PI < <7//a>boosting logic [see rt-mutex-desgin.txt] For the purposes of < <8//a>brevity, this ac v3 will be referred to as "requeue_pi" throughout < <9//a>this document.2< a>"PI". < 11//a>a< 12//a>Motiva v3a< 13//a>----------a< 14//a>a< 15//a>Without requeue_pi, the glibc implementa v3 of < 16//a>pthread_cond_broadcast() must resort to waking all the tasks waiting < 17//a>v3 a pthread_condvar a3d letting them try to sort out which task < 18//a>gets to run first i3 classic thundering-herd forma v3.2< 19//a>implementa v3 would wake the highest-priority waiter, a3d leave the < 2 a>rest to the natural wakeup inherent i3 unlocking the mutex < 21//a>associated with the condvar. < 22//a>a< 23//a>Consider the simplified glibc calls:a< 24//a>a< 25//a>/* caller must lock mutex */a< 26//a>pthread_cond_wait(cond, mutex)a< 27//a>{a< 28//a> lock(cond->__data.__lock);a< 29//a> unlock(mutex);a< 30//a> do {a< 31//a> unlock(cond->__data.__lock);a< 32//a> futex_wait(cond->__data.__futex);a< 33//a> lock(cond->__data.__lock);a< 34//a> } while(...)a< 35//a> unlock(cond->__data.__lock);a< 36//a> lock(mutex);a< 37//a>}a< 38//a>a< 39//a>pthread_cond_broadcast(cond)a< 40//a>{a< 41//a> lock(cond->__data.__lock);a< 42//a> unlock(cond->__data.__lock);a< 43//a> futex_requeue(cond->data.__futex, cond->mutex);a< 44//a>}a< 45//a>a< 46//a>Once pthread_cond_broadcast() requeues the tasks, the cond->mutexa< 47//a>has waiters. Note that pthread_cond_wait() attempts to lock the < 48//a>mutex only after it has returned to user space.2< 49//a>underlying rt_mutex with waiters, a3d no owner, breaking thea< 50//a>previously ment v3ed PI-boosting algorithms. < 51//a>a< 52//a>In order to support PI-aware pthread_condvar's, the kernel needs toa< 53//a>be able to requeue tasks to PI futexes.2< 54//a>upv3 a successful futex_wait system call, the caller would return toa< 55//a>user space already holding the PI futex.2< 56//a>would be modified as follows:a< 57//a>a< 58//a>a< 59//a>/* caller must lock mutex */a< 60//a>pthread_cond_wait_pi(cond, mutex)a< 61//a>{a< 62//a> lock(cond->__data.__lock);a< 63//a> unlock(mutex);a< 64//a> do {a< 65//a> unlock(cond->__data.__lock);a< 66//a> futex_wait_requeue_pi(cond->__data.__futex);a< 67//a> lock(cond->__data.__lock);a< 68//a> } while(...)a< 69//a> unlock(cond->__data.__lock);a< 70//a> /* the kernel acquired the the mutex for us */a< 71//a>}a< 72//a>a< 73//a>pthread_cond_broadcast_pi(cond)a< 74//a>{a< 75//a> lock(cond->__data.__lock);a< 76//a> unlock(cond->__data.__lock);a< 77//a> futex_requeue_pi(cond->data.__futex, cond->mutex);a< 78//a>}a< 79//a>a< 80//a>The actual glibc implementa v3 will likely test for PI a3d make thea< 81//a>necessary changes inside the existing calls rather tha3 creating newa< 82//a>calls for the PI cases.2< 83//a>pthread_cond_timedwait() a3d pthread_cond_signal(). < 84//a>a< 85//a>Implementa v3a< 86//a>--------------a< 87//a>a< 88//a>In order to ensure the rt_mutex has a3 owner if it has waiters, ita< 89//a>is necessary for both the requeue code, as well as the waiting code, < 90//a>to be able to acquire the rt_mutex before returning to user space. < 91//a>The requeue code cannot simply wake the waiter a3d leave it toa< 92//a>acquire the rt_mutex as it would ope3 a race window betwee3 thea< 93 a>requeue call returning to user space a3d the waiter waking anda< 94//a>starting to run.2< 95//a>a< 96//a>The solu v3 involves two new rt_mutex helper rou nes, < 97//a>rt_mutex_start_proxy_lock() a3d rt_mutex_finish_proxy_lock(), which < 98//a>allow the requeue code to acquire an uncontended rt_mutex v3 behalf < 99//a>of the waiter a3d to enqueue the waiter v3 a contended rt_mutex. <100//a>Two new system calls provide the kernel<->user interface toa<101 a>requeue_pi: FUTEX_WAIT_REQUEUE_PI a3d FUTEX_REQUEUE_CMP_PI. <102//a>a<103 a>FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()a<104//a>a3d pthread_cond_timedwait()) to block on the initial futex a3d waita<105//a>to be requeued to a PI-aware futex.2<106 a>result of a high-speed collisiv3 betwee3 futex_wait() anda<107//a>futex_lock_pi(), with some extra logic to check for the addi v3al <108//a>wake-up scenarios. <109//a>a<1 a>FUTEX_REQUEUE_CMP_PI is called by the wakera<111//a>(pthread_cond_broadcast() a3d pthread_cond_signal()) to requeue anda<112//a>possibly wake the waiting tasks. Internally, this system call isa<113//a>still handled by futex_requeue (by passing requeue_pi=1).2<114//a>requeueing, futex_requeue() attempts to acquire the requeue targeta<115//a>PI futex v3 behalf of the top waiter.2<116//a>woken.2<117//a>nr_wake+nr_requeue tasks to the PI futex, calling <118//a>rt_mutex_start_proxy_lock() prior to each requeue to prepare thea<119//a>task as a waiter v3 the underlying rt_mutex.2<12 a>the lock can be acquired at this stage as well, if so, the next <121//a>waiter is woken to finish the acquisi v3 of the lock. <122//a>a<123 a>FUTEX_REQUEUE_PI accepts nr_wake a3d nr_requeue as arguments, but <124//a>their sum is all that really matters. <125//a>requeue up to nr_wake + nr_requeue tasks.2<126//a>tasks as it can acquire the lock for, which in the majority of casesa<127//a>should be 0 as good programming prac ce dicta es that the caller of <128//a>either pthread_cond_broadcast() or pthread_cond_signal() acquire the <129//a>mutex prior to making the call. FUTEX_REQUEUE_PI requires that <130//a>nr_wake=1.2<131//a>signal. <132//a> The original LXR software by the LXR community//a>, this experimental versiv3 by lxr@linux.no//a>. //div.1/div class="subfooter"> lxr.linux.no kindly hosted by Redpill Linpro AS//a>, provider of Linux consulting and opera v3s serv ces since 1995. //div.1 //body.1//html.1