linux/Documentation/accounting/taskstats.txt

Per-task statistics interface
-----------------------------


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct.  per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is thread group
leader - a process is deemed alive as long as it has any task belonging to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks allows the data received by one listener
to be limited and assists in flow control over the netlink interface and is
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+


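As a rough illustration of the message layout, the sketch below packs a
request in Python using only the struct module. It assumes the usual sizes of
struct nlmsghdr (16 bytes) and struct genlmsghdr (4 bytes), in which case the
pad is empty. build_message() and its parameters are illustrative names, not
part of any kernel or library interface. family_id is a placeholder: the
generic netlink family id used by taskstats is assigned at run time and has to
be looked up from the generic netlink controller first.

```python
import struct

# struct nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSG_HDR = struct.Struct("=IHHII")
# struct genlmsghdr: cmd, version, reserved
GENL_HDR = struct.Struct("=BBH")

NLM_F_REQUEST = 0x01
# Assumed values; taskstats.h is authoritative.
TASKSTATS_CMD_GET = 1
TASKSTATS_GENL_VERSION = 1


def build_message(family_id, payload, seq=1, port_id=0):
    """Frame an already-encoded attribute payload as a taskstats request.

    family_id is hypothetical here: it must be resolved at run time.
    """
    total = NLMSG_HDR.size + GENL_HDR.size + len(payload)
    nl_hdr = NLMSG_HDR.pack(total, family_id, NLM_F_REQUEST, seq, port_id)
    genl_hdr = GENL_HDR.pack(TASKSTATS_CMD_GET, TASKSTATS_GENL_VERSION, 0)
    return nl_hdr + genl_hdr + payload
```

The payload argument is the encoded attribute data; how it is built depends on
which of the payload kinds is being sent.
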
The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
the task/process for which userspace wants statistics.

Commands to register/deregister interest in exit data from a set of cpus
consist of one attribute, of type
TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK and contain a cpumask in the
attribute payload. The cpumask is specified as an ascii string of
comma-separated cpu ranges e.g. to listen to exit data from cpus 1,2,3,5,7,8
the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest
in cpus before closing the listening socket, the kernel cleans up its interest
set over time. However, for the sake of efficiency, an explicit deregistration
is advisable.

2. Response for a command: sent from the kernel in response to a userspace
command. The payload is a series of three attributes of type:

a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute containing no payload; it indicates
that a pid/tgid will be followed by some stats.

b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
are being returned.

c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits. The payload consists of a
   series of attributes of the following type:

a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
b) TASKSTATS_TYPE_PID: contains exiting task's pid
c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process


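The attribute encoding used by the commands and responses listed above can be
sketched in Python as follows. This is a minimal illustration under stated
assumptions, not a reference implementation: it assumes the generic netlink
attribute layout (a 16-bit length covering header and payload, a 16-bit type,
then the payload padded to a 4-byte boundary), and the numeric attribute types
are the values believed to be enumerated in taskstats.h, which remains
authoritative. The function names are illustrative. The cpumask string is sent
with a terminating NUL, following what getdelays.c does.

```python
import struct

NLA_HDR = struct.Struct("=HH")  # nla_len (header + payload), nla_type

# Assumed values; taskstats.h is authoritative.
TASKSTATS_CMD_ATTR_PID = 1
TASKSTATS_CMD_ATTR_TGID = 2
TASKSTATS_CMD_ATTR_REGISTER_CPUMASK = 3
TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK = 4


def nla_align(length):
    """Round length up to the 4-byte attribute alignment."""
    return (length + 3) & ~3


def pack_attr(attr_type, data):
    """Encode one attribute: header, payload, then zero padding."""
    length = NLA_HDR.size + len(data)
    padding = b"\0" * (nla_align(length) - length)
    return NLA_HDR.pack(length, attr_type) + data + padding


def get_pid_stats(pid):
    """Command payload asking for the stats of one task (u32 pid)."""
    return pack_attr(TASKSTATS_CMD_ATTR_PID, struct.pack("=I", pid))


def register_cpumask(mask):
    """Command payload registering interest in exits on the given cpus."""
    data = mask.encode("ascii") + b"\0"
    return pack_attr(TASKSTATS_CMD_ATTR_REGISTER_CPUMASK, data)


def walk_attrs(buf):
    """Return (type, payload) for each well-formed attribute in buf."""
    attrs = []
    offset = 0
    while offset + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, offset)
        if length < NLA_HDR.size or offset + length > len(buf):
            break  # truncated or malformed attribute: stop parsing
        attrs.append((attr_type, buf[offset + NLA_HDR.size:offset + length]))
        offset += nla_align(length)
    return attrs
```

getdelays.c reads the pid/tgid and stats attributes out of the payload of the
AGGR attribute. If that holds for the kernel in use, walk_attrs() can be
applied a second time to that payload. This is an observation about the sample
utility rather than something this document specifies.
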
per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process, in addition to per-task stats, within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the process level data accumulated also
gets sent to userspace (along with the per-task data).

When a user queries to get per-tgid data, the sum of all other live threads in
the group is added up and added to the accumulated total for previously exited
threads of the same thread group.

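The bookkeeping described above can be modelled in a few lines of Python. This
is only a toy model of the idea, not kernel code: tgid_totals() is an
illustrative name, the counter names in the usage note are examples, and the
sketch assumes every statistic is a simple additive counter.

```python
from collections import Counter


def tgid_totals(exited_total, live_threads):
    """Combine the total for exited threads with the live threads' stats.

    exited_total: dict of counters accumulated from threads that have exited.
    live_threads: iterable of dicts, one per thread that is still alive.
    """
    total = Counter(exited_total)  # copy, so the caller's dict is untouched
    for stats in live_threads:
        total.update(stats)        # Counter.update adds to existing counts
    return dict(total)
```

For example, with an exited total of {"cpu_delay_total": 30} and two live
threads reporting 10 and 2, the per-tgid value of cpu_delay_total is 42.
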
Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

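From the userspace side, both extension paths come down to a small amount of
defensive parsing, sketched below in Python. The sketch assumes that the
version is a 16-bit field at the very start of struct taskstats and that the
attribute type values are as enumerated in taskstats.h; check both against the
header. The function names and the (type, payload) pair format are
illustrative.

```python
import struct

# Assumed values of TASKSTATS_TYPE_PID, _TGID and _STATS; taskstats.h is
# authoritative.
KNOWN_TYPES = {1: "pid", 2: "tgid", 3: "stats"}


def stats_version(stats_payload):
    """Read the version from the start of a struct taskstats payload."""
    return struct.unpack_from("=H", stats_payload, 0)[0]


def usable_version(stats_payload, client_version):
    """Path 1: use only fields that both kernel and client know about."""
    return min(stats_version(stats_payload), client_version)


def known_attrs(attrs):
    """Path 2: keep attributes of known type and skip the rest.

    attrs is a list of (type, payload) pairs.
    """
    return [(KNOWN_TYPES[t], p) for t, p in attrs if t in KNOWN_TYPES]
```

A client built against an older header therefore keeps working in both cases:
it reads a prefix of a longer structure, or it skips attribute types that were
added later.
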
Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
each listener. In the extreme case, there could be one listener for each cpu.
Users may also consider setting the cpu affinity of the listener to the subset
of cpus to which it listens, especially if they are listening to just one cpu.

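Spreading cpus over several listeners means producing one cpumask string per
listener, in the comma-separated range format used by the register command.
The Python sketch below shows one way to do that. format_cpumask() and
split_cpus() are illustrative helpers, and the even, contiguous split is just
one possible policy.

```python
def format_cpumask(cpus):
    """Render cpu numbers as comma-separated ranges, e.g. "1-3,5,7-8"."""
    cpus = sorted(set(cpus))
    parts = []
    i = 0
    while i < len(cpus):
        j = i
        # Extend the run while the next cpu number is consecutive.
        while j + 1 < len(cpus) and cpus[j + 1] == cpus[j] + 1:
            j += 1
        if i == j:
            parts.append(str(cpus[i]))
        else:
            parts.append("%d-%d" % (cpus[i], cpus[j]))
        i = j + 1
    return ",".join(parts)


def split_cpus(cpus, n_listeners):
    """Divide cpus into at most n_listeners contiguous groups."""
    cpus = sorted(set(cpus))
    if not cpus or n_listeners < 1:
        return []
    size = -(-len(cpus) // n_listeners)  # ceiling division
    return [cpus[i:i + size] for i in range(0, len(cpus), size)]
```

With eight cpus and two listeners, the groups format as "0-3" and "4-7". Each
listener would register its own string, and could be pinned to the same cpus
with a call such as os.sched_setaffinity().
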
Despite these measures, if userspace receives ENOBUFS errors indicating
overflow of receive buffers, it should take measures to handle the
loss of data.

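A minimal Python sketch of the socket-level part of this advice follows. It
uses only the standard SO_RCVBUF socket option and errno.ENOBUFS. The helper
names are illustrative, and what to do once loss is detected (for example,
counting the event or querying live tasks again) is left to the caller.

```python
import errno
import socket


def grow_rcvbuf(sock, nbytes):
    """Request a larger receive buffer and return the size actually in effect.

    The kernel may adjust or cap the requested value, so read it back.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def recv_or_note_loss(sock, bufsize=65536):
    """Receive one message; report ENOBUFS as data loss instead of raising.

    Returns (data, lost). lost is True when the receive buffer overflowed
    and some exit records were dropped.
    """
    try:
        return sock.recv(bufsize), False
    except OSError as exc:
        if exc.errno == errno.ENOBUFS:
            return b"", True
        raise
```
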
----