Per-task statistics interface
-----------------------------


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is thread group
leader - a process is deemed alive as long as it has any task belonging to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener.
Using cpumasks allows the data received by one listener
to be limited and assists in flow control over the netlink interface. This is
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+


The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
the task/process for which userspace wants statistics.

Commands to register/deregister interest in exit data from a set of cpus
consist of one attribute, of type
TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK, and contain a cpumask in the
attribute payload. The cpumask is specified as an ascii string of
comma-separated cpu ranges e.g. to listen to exit data from cpus 1,2,3,5,7,8
the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest
in cpus before closing the listening socket, the kernel cleans up its interest
set over time. However, for the sake of efficiency, an explicit deregistration
is advisable.

2. Response for a command: sent from the kernel in response to a userspace
command. The payload is a series of three attributes of type:

a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute containing no payload; it indicates
that a pid/tgid will follow, and then some stats.

b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
are being returned.

c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits. The payload consists of a
   series of attributes of the following type:

a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
b) TASKSTATS_TYPE_PID: contains exiting task's pid
c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process


per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process stats, in addition to per-task stats, within
the kernel has space and time overheads.
To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the process level data accumulated also
gets sent to userspace (along with the per-task data).

When a user queries to get per-tgid data, the stats of all live threads in
the group are summed and added to the accumulated total for previously exited
threads of the same thread group.

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes.
But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener. In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.

Despite these measures, if userspace receives ENOBUFS error messages
indicating overflow of receive buffers, it should take measures to handle the
loss of data.
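
The flow-control measures above can be sketched in C roughly as follows. This
is a minimal illustration under stated assumptions, not code taken from
getdelays.c: the helper names grow_rcvbuf, format_cpu_range and
rcvbuf_overflowed are hypothetical, and the sketch only covers enlarging a
socket's receive buffer, building the cpumask string for one listener's share
of the cpus, and recognising ENOBUFS after a failed receive.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: ask for a larger receive buffer on a listener
 * socket. The kernel may adjust or cap the requested size, so the value
 * actually in effect is read back and returned (-1 on error).
 */
int grow_rcvbuf(int fd, int bytes)
{
	int granted = 0;
	socklen_t len = sizeof(granted);

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
		return -1;
	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0)
		return -1;
	return granted;
}

/*
 * Hypothetical helper: write the cpumask string for a listener that
 * covers cpus first..last, using the comma-separated range format
 * described in this document (a single range such as "4-7", or "5"
 * for one cpu). Returns 0 on success, -1 on a bad range or short buffer.
 */
int format_cpu_range(char *buf, size_t len, int first, int last)
{
	int n;

	if (first < 0 || last < first)
		return -1;
	if (first == last)
		n = snprintf(buf, len, "%d", first);
	else
		n = snprintf(buf, len, "%d-%d", first, last);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: returns 1 if the errno saved after a failed
 * recv() reports receive buffer overflow, i.e. exit data was lost.
 */
int rcvbuf_overflowed(int recv_errno)
{
	return recv_errno == ENOBUFS;
}
```

A listener might call grow_rcvbuf() on its netlink socket before registering,
and send the string produced by format_cpu_range() as the payload of a
TASKSTATS_CMD_ATTR_REGISTER_CPUMASK attribute. How to recover once
rcvbuf_overflowed() reports a loss is left to the application.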