Concurrency Managed Workqueue (cmwq)

September, 2010		Tejun Heo <tj@kernel.org>
			Florian Mickler <florian@mickler.org>

CONTENTS

1. Introduction
2. Why cmwq?
3. The Design
4. Application Programming Interface (API)
5. Example Execution Scenarios
6. Guidelines
7. Debugging


1. Introduction

There are many cases where an asynchronous process execution context
is needed, and the workqueue (wq) API is the most commonly used
mechanism for such cases.

When such an asynchronous execution context is needed, a work item
describing which function to execute is put on a queue.  An
independent thread serves as the asynchronous execution context.  The
queue is called the workqueue and the thread is called a worker.

While there are work items on the workqueue, the worker executes the
functions associated with the work items one after the other.  When
there is no work item left on the workqueue, the worker becomes idle.
When a new work item gets queued, the worker begins executing again.
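
For illustration, here is a minimal sketch of this model using the wq
API; the names hello_fn, hello_work, hello_init and hello_exit are
hypothetical, not part of the API:

	#include <linux/module.h>
	#include <linux/workqueue.h>

	/* The function that a worker will execute asynchronously. */
	static void hello_fn(struct work_struct *work)
	{
		pr_info("hello from a worker thread\n");
	}

	/* The work item: it points at hello_fn. */
	static DECLARE_WORK(hello_work, hello_fn);

	static int __init hello_init(void)
	{
		/* Put the work item on the default system workqueue. */
		schedule_work(&hello_work);
		return 0;
	}

	static void __exit hello_exit(void)
	{
		/* Wait for the work item to finish before unloading. */
		flush_work(&hello_work);
	}

	module_init(hello_init);
	module_exit(hello_exit);
	MODULE_LICENSE("GPL");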


2. Why cmwq?

In the original wq implementation, a multi threaded (MT) wq had one
worker thread per CPU and a single threaded (ST) wq had one worker
thread system-wide.  A single MT wq needed to keep around the same
number of workers as the number of CPUs.  The kernel grew a lot of MT
wq users over the years and, with the number of CPU cores continuously
rising, some systems saturated the default 32k PID space just booting
up.

Although MT wq wasted a lot of resources, the level of concurrency
provided was unsatisfactory.  The limitation was common to both ST and
MT wq, albeit less severe on MT.  Each wq maintained its own separate
worker pool.  An MT wq could provide only one execution context per
CPU while an ST wq provided one for the whole system.  Work items had
to compete for those very limited execution contexts, leading to
various problems including proneness to deadlocks around the single
execution context.

The tension between the provided level of concurrency and resource
usage also forced its users to make unnecessary tradeoffs, like libata
choosing to use ST wq for polling PIOs and accepting the unnecessary
limitation that no two polling PIOs can progress at the same time.  As
MT wq didn't provide much better concurrency, users which required a
higher level of concurrency, like async or fscache, had to implement
their own thread pools.

Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with
focus on the following goals.

* Maintain compatibility with the original workqueue API.

* Use per-CPU unified worker pools shared by all wq to provide a
  flexible level of concurrency on demand without wasting a lot of
  resources.

* Automatically regulate worker pools and the level of concurrency so
  that the API users don't need to worry about such details.


3. The Design

In order to ease the asynchronous execution of functions, a new
abstraction, the work item, is introduced.

A work item is a simple struct that holds a pointer to the function
that is to be executed asynchronously.  Whenever a driver or subsystem
wants a function to be executed asynchronously, it has to set up a
work item pointing to that function and queue that work item on a
workqueue.
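
In practice, the work item is usually embedded in a larger, subsystem
specific structure so the work function can recover its context via
container_of().  A minimal sketch, with hypothetical names
(frob_request, frob_fn, frob_submit):

	#include <linux/kernel.h>
	#include <linux/slab.h>
	#include <linux/workqueue.h>

	struct frob_request {
		int id;				/* driver specific state */
		struct work_struct work;	/* the embedded work item */
	};

	static void frob_fn(struct work_struct *work)
	{
		/* Recover the enclosing structure from the work pointer. */
		struct frob_request *req =
			container_of(work, struct frob_request, work);

		pr_info("processing request %d\n", req->id);
		kfree(req);
	}

	static int frob_submit(int id)
	{
		struct frob_request *req = kzalloc(sizeof(*req), GFP_KERNEL);

		if (!req)
			return -ENOMEM;
		req->id = id;
		INIT_WORK(&req->work, frob_fn);
		schedule_work(&req->work);	/* queue it on a workqueue */
		return 0;
	}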

Special purpose threads, called worker threads, execute the functions
off of the queue, one after the other.  If no work is queued, the
worker threads become idle.  These worker threads are managed in
so-called thread-pools.

The cmwq design differentiates between the user-facing workqueues that
subsystems and drivers queue work items on and the backend mechanism
which manages thread-pools and processes the queued work items.

The backend is called gcwq.  There is one gcwq for each possible CPU
and one gcwq to serve work items queued on unbound workqueues.  Each
gcwq has two thread-pools - one for normal work items and the other
for high priority ones.

Subsystems and drivers can create and queue work items through special
workqueue API functions as they see fit.  They can influence some
aspects of the way the work items are executed by setting flags on the
workqueue they are putting the work item on.  These flags include
things like CPU locality, reentrancy, concurrency limits, priority and
more.  To get a detailed overview, refer to the API description of
alloc_workqueue() below.

When a work item is queued to a workqueue, the target gcwq and
thread-pool are determined according to the queue parameters and
workqueue attributes, and the work item is appended to the shared
worklist of the thread-pool.  For example, unless specifically
overridden, a work item of a bound workqueue will be queued on the
worklist of either the normal or the highpri thread-pool of the gcwq
that is associated with the CPU the issuer is running on.

For any worker pool implementation, managing the concurrency level
(how many execution contexts are active) is an important issue.  cmwq
tries to keep the concurrency at a minimal but sufficient level.
Minimal to save resources and sufficient in that the system is used at
its full capacity.

Each thread-pool bound to an actual CPU implements concurrency
management by hooking into the scheduler.  The thread-pool is notified
whenever an active worker wakes up or sleeps and keeps track of the
number of currently runnable workers.  Generally, work items are not
expected to hog a CPU and consume many cycles.  That means maintaining
just enough concurrency to prevent work processing from stalling
should be optimal.  As long as there are one or more runnable workers
on the CPU, the thread-pool doesn't start execution of a new work
item, but, when the last running worker goes to sleep, it immediately
schedules a new worker so that the CPU doesn't sit idle while there
are pending work items.  This allows using a minimal number of workers
without losing execution bandwidth.

Keeping idle workers around doesn't cost anything other than the
memory space for kthreads, so cmwq holds onto idle ones for a while
before killing them.

For an unbound wq, the above concurrency management doesn't apply and
the thread-pools for the pseudo unbound CPU try to start executing all
work items as soon as possible.  The responsibility for regulating the
concurrency level is on the users.  There is also a flag to mark a
bound wq to ignore the concurrency management.  Please refer to the
API section for details.

Forward progress guarantees rely on workers being created when more
execution contexts are necessary, which in turn is guaranteed through
the use of rescue workers.  All work items which might be used on code
paths that handle memory reclaim are required to be queued on wq's
that have a rescue-worker reserved for execution under memory
pressure.  Otherwise, it is possible that the thread-pool deadlocks
waiting for execution contexts to free up.


4. Application Programming Interface (API)

alloc_workqueue() allocates a wq.  The original create_*workqueue()
functions are deprecated and scheduled for removal.  alloc_workqueue()
takes three arguments - @name, @flags and @max_active.  @name is the
name of the wq and is also used as the name of the rescuer thread if
there is one.

A wq no longer manages execution resources but serves as a domain for
forward progress guarantees, flush and work item attributes.  @flags
and @max_active control how work items are assigned execution
resources, scheduled and executed.  An example call using some of the
flags is sketched after the flag descriptions below.

@flags:

  WQ_NON_REENTRANT

	By default, a wq guarantees non-reentrance only on the same
	CPU.  A work item may not be executed concurrently on the same
	CPU by multiple workers but is allowed to be executed
	concurrently on multiple CPUs.  This flag makes sure
	non-reentrance is enforced across all CPUs.  Work items queued
	to a non-reentrant wq are guaranteed to be executed by at most
	one worker system-wide at any given time.

  WQ_UNBOUND

	Work items queued to an unbound wq are served by a special
	gcwq which hosts workers that are not bound to any specific
	CPU.  This makes the wq behave as a simple execution context
	provider without concurrency management.  The unbound gcwq
	tries to start execution of work items as soon as possible.
	An unbound wq sacrifices locality but is useful for the
	following cases.

	* Wide fluctuation in the concurrency level requirement is
	  expected and using a bound wq may end up creating a large
	  number of mostly unused workers across different CPUs as the
	  issuer hops through different CPUs.

	* Long running CPU intensive workloads which can be better
	  managed by the system scheduler.

  WQ_FREEZABLE

	A freezable wq participates in the freeze phase of the system
	suspend operations.  Work items on the wq are drained and no
	new work item starts execution until thawed.

  WQ_MEM_RECLAIM

	All wq which might be used in the memory reclaim paths _MUST_
	have this flag set.  The wq is guaranteed to have at least one
	execution context regardless of memory pressure.

  WQ_HIGHPRI

	Work items of a highpri wq are queued to the highpri
	thread-pool of the target gcwq.  Highpri thread-pools are
	served by worker threads with an elevated nice level.

	Note that normal and highpri thread-pools don't interact with
	each other.  Each maintains its own separate pool of workers
	and implements concurrency management among its workers.

  WQ_CPU_INTENSIVE

	Work items of a CPU intensive wq do not contribute to the
	concurrency level.  In other words, runnable CPU intensive
	work items will not prevent other work items in the same
	thread-pool from starting execution.  This is useful for bound
	work items which are expected to hog CPU cycles so that their
	execution is regulated by the system scheduler.

	Although CPU intensive work items don't contribute to the
	concurrency level, the start of their execution is still
	regulated by the concurrency management, and runnable
	non-CPU-intensive work items can delay execution of CPU
	intensive work items.

	This flag is meaningless for unbound wq.
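
As an illustration of the flags, the following sketch allocates a wq
whose work items are guaranteed an execution context under memory
pressure and are served by highpri workers.  The frob_* names are
hypothetical:

	static void frob_fn(struct work_struct *work);	/* work function */
	static DECLARE_WORK(frob_work, frob_fn);
	static struct workqueue_struct *frob_wq;

	static int frob_setup(void)
	{
		/* @max_active of 0 selects the default limit (see below). */
		frob_wq = alloc_workqueue("frob_wq",
					  WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
		if (!frob_wq)
			return -ENOMEM;

		/* Queue on this wq rather than the system workqueue. */
		queue_work(frob_wq, &frob_work);
		return 0;
	}

	static void frob_teardown(void)
	{
		/* Drains pending work items and releases the rescuer. */
		destroy_workqueue(frob_wq);
	}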

@max_active:

@max_active determines the maximum number of execution contexts per
CPU which can be assigned to the work items of a wq.  For example,
with @max_active of 16, at most 16 work items of the wq can be
executing at the same time per CPU.

Currently, for a bound wq, the maximum limit for @max_active is 512
and the default value used when 0 is specified is 256.  For an unbound
wq, the limit is the higher of 512 and 4 * num_possible_cpus().  These
values are chosen sufficiently high such that they are not the
limiting factor while providing protection in runaway cases.

The number of active work items of a wq is usually regulated by the
users of the wq, more specifically, by how many work items the users
may queue at the same time.  Unless there is a specific need for
throttling the number of active work items, specifying '0' is
recommended.

Some users depend on the strict execution ordering of ST wq.  The
combination of @max_active of 1 and WQ_UNBOUND is used to achieve this
behavior.  Work items on such a wq are always queued to the unbound
gcwq and only one work item can be active at any given time, thus
achieving the same ordering property as ST wq.
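
A sketch of such an ordered wq, using a hypothetical name:

	static struct workqueue_struct *seq_wq;

	static int seq_setup(void)
	{
		/*
		 * WQ_UNBOUND plus @max_active of 1: at most one work
		 * item from this wq executes at any given time, in
		 * queueing order.
		 */
		seq_wq = alloc_workqueue("seq_wq", WQ_UNBOUND, 1);
		return seq_wq ? 0 : -ENOMEM;
	}

The alloc_ordered_workqueue() convenience macro wraps the same
combination.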


5. Example Execution Scenarios

The following example execution scenarios try to illustrate how cmwq
behaves under different configurations.

 Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU.
 w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms
 again before finishing.  w1 and w2 burn CPU for 5ms then sleep for
 10ms.

Ignoring all other tasks, work items and processing overhead, and
assuming simple FIFO scheduling, the following is one highly
simplified version of possible sequences of events with the original
wq.

 TIME IN MSECS	EVENT
 0		w0 starts and burns CPU
 5		w0 sleeps
 15		w0 wakes up and burns CPU
 20		w0 finishes
 20		w1 starts and burns CPU
 25		w1 sleeps
 35		w1 wakes up and finishes
 35		w2 starts and burns CPU
 40		w2 sleeps
 50		w2 wakes up and finishes

And with cmwq with @max_active >= 3,

 TIME IN MSECS	EVENT
 0		w0 starts and burns CPU
 5		w0 sleeps
 5		w1 starts and burns CPU
 10		w1 sleeps
 10		w2 starts and burns CPU
 15		w2 sleeps
 15		w0 wakes up and burns CPU
 20		w0 finishes
 20		w1 wakes up and finishes
 25		w2 wakes up and finishes

If @max_active == 2,

 TIME IN MSECS	EVENT
 0		w0 starts and burns CPU
 5		w0 sleeps
 5		w1 starts and burns CPU
 10		w1 sleeps
 15		w0 wakes up and burns CPU
 20		w0 finishes
 20		w1 wakes up and finishes
 20		w2 starts and burns CPU
 25		w2 sleeps
 35		w2 wakes up and finishes

Now, let's assume w1 and w2 are queued to a different wq q1 which has
WQ_CPU_INTENSIVE set,

 TIME IN MSECS	EVENT
 0		w0 starts and burns CPU
 5		w0 sleeps
 5		w1 and w2 start and burn CPU
 10		w1 sleeps
 15		w2 sleeps
 15		w0 wakes up and burns CPU
 20		w0 finishes
 20		w1 wakes up and finishes
 25		w2 wakes up and finishes


6. Guidelines

* Do not forget to use WQ_MEM_RECLAIM if a wq may process work items
  which are used during memory reclaim.  Each wq with WQ_MEM_RECLAIM
  set has an execution context reserved for it.  If there is a
  dependency among multiple work items used during memory reclaim,
  they should be queued to separate wq, each with WQ_MEM_RECLAIM (see
  the sketch after these guidelines).

* Unless strict ordering is required, there is no need to use ST wq.

* Unless there is a specific need, using 0 for @max_active is
  recommended.  In most use cases, the concurrency level usually
  stays well under the default limit.

* A wq serves as a domain for forward progress guarantee
  (WQ_MEM_RECLAIM), flushing and work item attributes.  Work items
  which are not involved in memory reclaim, don't need to be flushed
  as part of a group of work items, and don't require any special
  attribute can use one of the system wq.  There is no difference in
  execution characteristics between using a dedicated wq and a system
  wq.

* Unless work items are expected to consume a huge amount of CPU
  cycles, using a bound wq is usually beneficial due to the increased
  level of locality in wq operations and work item execution.
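
As referenced in the first guideline above, a sketch of giving two
mutually dependent reclaim-time work item classes their own
rescuer-backed wq (names hypothetical):

	static struct workqueue_struct *stage_a_wq, *stage_b_wq;

	static int reclaim_path_setup(void)
	{
		/*
		 * Work items of stage B wait on work items of stage A.
		 * Each wq gets its own WQ_MEM_RECLAIM rescuer so stage
		 * A can always make progress and unblock stage B, even
		 * when no new workers can be created under memory
		 * pressure.
		 */
		stage_a_wq = alloc_workqueue("stage_a", WQ_MEM_RECLAIM, 0);
		if (!stage_a_wq)
			return -ENOMEM;
		stage_b_wq = alloc_workqueue("stage_b", WQ_MEM_RECLAIM, 0);
		if (!stage_b_wq) {
			destroy_workqueue(stage_a_wq);
			return -ENOMEM;
		}
		return 0;
	}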


7. Debugging

Because the work functions are executed by generic worker threads,
there are a few tricks needed to shed some light on misbehaving
workqueue users.

Worker threads show up in the process list as:

root      5671  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/0:1]
root      5672  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/1:2]
root      5673  0.0  0.0      0     0 ?        S    12:12   0:00 [kworker/0:0]
root      5674  0.0  0.0      0     0 ?        S    12:13   0:00 [kworker/1:0]

If kworkers are going crazy (using too much cpu), there are two types
of possible problems:

	1. Something being scheduled in rapid succession
	2. A single work item that consumes lots of cpu cycles

The first one can be tracked using tracing:

	$ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
	$ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
	(wait a few secs)
	^C

If something is busy looping on work queueing, it would be dominating
the output and the offender can be identified from the work item
function.

For the second type of problem it should be possible to just check
the stack trace of the offending worker thread.

	$ cat /proc/THE_OFFENDING_KWORKER/stack

The work item's function should be trivially visible in the stack
trace.