		ftrace - Function Tracer
		========================

Copyright 2008 Red Hat Inc.
   Author:   Steven Rostedt <srostedt@redhat.com>
  License:   The GNU Free Documentation License, Version 1.2
               (dual licensed under the GPL v2)
Reviewers:   Elias Oltmanns, Randy Dunlap, Andrew Morton,
             John Kacur, and David Teigland.

Written for: 2.6.27-rc1

Introduction
------------

Ftrace is an internal tracer designed to help out developers and
designers of systems to find what is going on inside the kernel.
It can be used for debugging or analyzing latencies and performance
issues that take place outside of user-space.

Although ftrace is the function tracer, it also includes an
infrastructure that allows for other types of tracing. Some of the
tracers that are currently in ftrace include a tracer to trace
context switches, the time it takes for a high priority task to
run after it was woken up, the time interrupts are disabled, and
more (ftrace allows for tracer plugins, which means that the list of
tracers can always grow).


The File System
---------------

Ftrace uses the debugfs file system to hold the control files as well
as the files to display output.

To mount the debugfs system:

  # mkdir /debug
  # mount -t debugfs nodev /debug

(Note: it is more common to mount at /sys/kernel/debug, but for simplicity
 this document will use /debug)

That's it! (assuming that you have ftrace configured into your kernel)

After mounting the debugfs, you can see a directory called
"tracing".  This directory contains the control and output files
of ftrace. Here is a list of some of the key files (a short C sketch
showing how these files can be driven follows the list):


 Note: all time values are in microseconds.

  current_tracer : This is used to set or display the current tracer
		that is configured.

  available_tracers : This holds the different types of tracers that
		have been compiled into the kernel. The tracers
		listed here can be configured by echoing their name
		into current_tracer.

  tracing_enabled : This sets or displays whether the current_tracer
		is activated and tracing or not. Echo 0 into this
		file to disable the tracer or 1 to enable it.

  trace : This file holds the output of the trace in a human readable
		format (described below).

  latency_trace : This file shows the same trace but the information
		is organized more to display possible latencies
		in the system (described below).

  trace_pipe : The output is the same as the "trace" file but this
		file is meant to be streamed with live tracing.
		Reads from this file will block until new data
		is retrieved. Unlike the "trace" and "latency_trace"
		files, this file is a consumer. This means reading
		from this file causes sequential reads to display
		more current data. Once data is read from this
		file, it is consumed, and will not be read
		again with a sequential read. The "trace" and
		"latency_trace" files are static, and if the
		tracer is not adding more data, they will display
		the same information every time they are read.

  iter_ctrl : This file lets the user control the amount of data
		that is displayed in one of the above output
		files.

  tracing_max_latency : Some of the tracers record the max latency.
		For example, the time interrupts are disabled.
		This time is saved in this file. The max trace
		will also be stored, and displayed by either
		"trace" or "latency_trace".  A new max trace will
		only be recorded if the latency is greater than
		the value in this file. (in microseconds)

  trace_entries : This sets or displays the number of trace
		entries each CPU buffer can hold. The tracer buffers
		are the same size for each CPU. The displayed number
		is the size of the CPU buffer and not total size. The
		trace buffers are allocated in pages (blocks of memory
		that the kernel uses for allocation, usually 4 KB in size).
		Since each entry is smaller than a page, if the last
		allocated page has room for more entries than were
		requested, the rest of the page is used to allocate
		entries.

		This can only be updated when the current_tracer
		is set to "none".

		NOTE: It is planned on changing the allocated buffers
		      from being the number of possible CPUS to
		      the number of online CPUS.

  tracing_cpumask : This is a mask that lets the user only trace
		on specified CPUS. The format is a hex string
		representing the CPUS.

  set_ftrace_filter : When dynamic ftrace is configured in (see the
		section below "dynamic ftrace"), the code is dynamically
		modified (code text rewrite) to disable calling of the
		function profiler (mcount). This lets tracing be configured
		in with practically no overhead in performance.  This also
		has a side effect of enabling or disabling specific functions
		to be traced.  Echoing names of functions into this file
		will limit the trace to only those functions.

  set_ftrace_notrace: This has an effect opposite to that of
		set_ftrace_filter. Any function that is added here will not
		be traced. If a function exists in both set_ftrace_filter
		and set_ftrace_notrace, the function will _not_ be traced.

  available_filter_functions : When a function is encountered the first
		time by the dynamic tracer, it is recorded and
		later the call is converted into a nop. This file
		lists the functions that have been recorded
		by the dynamic tracer and these functions can
		be used to set the ftrace filter by the above
		"set_ftrace_filter" file. (See the section "dynamic ftrace"
		below for more details).
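
All of the above are ordinary files, so they can be driven from a
program as well as from the shell. Below is a minimal C sketch (not
part of ftrace; it assumes debugfs is mounted at /debug as above)
that prints the available tracers and then selects and enables one
of them:

#include <stdio.h>

/* Write a short string into one of the tracing control files. */
static int tracing_write(const char *file, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/debug/tracing/%s", file);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	char line[256];
	FILE *f = fopen("/debug/tracing/available_tracers", "r");

	if (f) {
		/* e.g. "ftrace sched_switch irqsoff ... none" */
		while (fgets(line, sizeof(line), f))
			fputs(line, stdout);
		fclose(f);
	}

	/* Same effect as the echo commands used throughout this document. */
	tracing_write("current_tracer", "sched_switch");
	tracing_write("tracing_enabled", "1");
	/* ... run the workload of interest ... */
	tracing_write("tracing_enabled", "0");
	return 0;
}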

The Tracers
-----------

Here is the list of current tracers that may be configured.

  ftrace - function tracer that uses mcount to trace all functions.

  sched_switch - traces the context switches between tasks.

  irqsoff - traces the areas that disable interrupts and saves
		the trace with the longest max latency.
		See tracing_max_latency. When a new max is recorded,
		it replaces the old trace. It is best to view this
		trace via the latency_trace file.

  preemptoff - Similar to irqsoff but traces and records the amount of
		time for which preemption is disabled.

  preemptirqsoff - Similar to irqsoff and preemptoff, but traces and
		records the largest time for which irqs and/or preemption
		is disabled.

  wakeup - Traces and records the max latency that it takes for
		the highest priority task to get scheduled after
		it has been woken up.

  none - This is not a tracer. To remove all tracers from tracing
		simply echo "none" into current_tracer.


Examples of using the tracer
----------------------------

Here are typical examples of using the tracers when controlling them only
with the debugfs interface (without using any user-land utilities).

Output format:
--------------

Here is an example of the output format of the file "trace"

                             --------
# tracer: ftrace
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |
            bash-4251  [01] 10152.583854: path_put <-path_walk
            bash-4251  [01] 10152.583855: dput <-path_put
            bash-4251  [01] 10152.583855: _atomic_dec_and_lock <-dput
                             --------

A header is printed with the tracer name that is represented by the trace.
In this case the tracer is "ftrace". Then a header showing the format:
task name "bash", the task PID "4251", the CPU that it was running on
"01", the timestamp in <secs>.<usecs> format, the function name that was
traced "path_put" and the parent function that called this function
"path_walk". The timestamp is the time at which the function was
entered.

The sched_switch tracer also includes tracing of task wakeups and
context switches.

     ksoftirqd/1-7     [01]  1453.070013:      7:115:R   +  2916:115:S
     ksoftirqd/1-7     [01]  1453.070013:      7:115:R   +    10:115:S
     ksoftirqd/1-7     [01]  1453.070013:      7:115:R ==>    10:115:R
        events/1-10    [01]  1453.070013:     10:115:S ==>  2916:115:R
    kondemand/1-2916   [01]  1453.070013:   2916:115:S ==>     7:115:R
     ksoftirqd/1-7     [01]  1453.070013:      7:115:S ==>     0:140:R

Wake ups are represented by a "+" and the context switches are shown as
"==>".  The format is:

 Context switches:

       Previous task              Next Task

  <pid>:<prio>:<state>  ==>  <pid>:<prio>:<state>

 Wake ups:

       Current task               Task waking up

  <pid>:<prio>:<state>    +  <pid>:<prio>:<state>

The prio is the internal kernel priority, which is the inverse of the
priority that is usually displayed by user-space tools. Zero represents
the highest priority (99). Prio 100 starts the "nice" priorities with
100 being equal to nice -20 and 139 being nice 19. The prio "140" is
reserved for the idle task which is the lowest priority thread (pid 0).
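
For post-processing a trace in a program, the fixed layout of these
lines is easy to pick apart. The following is only an illustrative
sketch (it is not part of ftrace; the layout is simply what the
examples above show):

#include <stdio.h>
#include <string.h>

struct trace_entry {
	char	comm[64];
	int	pid;
	int	cpu;
	double	timestamp;
	char	func[128];
	char	parent[128];
};

/* Parse a line such as:
 *   "bash-4251 [01] 10152.583854: path_put <-path_walk"
 */
static int parse_trace_line(const char *line, struct trace_entry *e)
{
	char task[96];
	char *dash;

	if (sscanf(line, " %95s [%d] %lf: %127s <-%127s",
		   task, &e->cpu, &e->timestamp, e->func, e->parent) != 5)
		return -1;

	/* Split "comm-pid" at the last '-', since the task name can
	 * itself contain a '-' (e.g. "ksoftirqd/1-7").
	 */
	dash = strrchr(task, '-');
	if (!dash || sscanf(dash + 1, "%d", &e->pid) != 1)
		return -1;
	*dash = '\0';
	strncpy(e->comm, task, sizeof(e->comm) - 1);
	e->comm[sizeof(e->comm) - 1] = '\0';
	return 0;
}

int main(void)
{
	struct trace_entry e;
	char line[512];

	while (fgets(line, sizeof(line), stdin))
		if (!parse_trace_line(line, &e))
			printf("%s (pid %d) on cpu %d: %s\n",
			       e.comm, e.pid, e.cpu, e.func);
	return 0;
}

The sched_switch wakeup and context switch lines do not match this
pattern; their PID:PRIO:STATE fields would need their own handling.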

Latency trace format
--------------------

For traces that display latency times, the latency_trace file gives
somewhat more information to see why a latency happened. Here is a typical
trace.

# tracer: irqsoff
#
irqsoff latency trace v1.1.5 on 2.6.26-rc8
--------------------------------------------------------------------
 latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: apic_timer_interrupt
 => ended at:   do_softirq

#                _------=> CPU#
#               / _-----=> irqs-off
#              | / _----=> need-resched
#              || / _---=> hardirq/softirq
#              ||| / _--=> preempt-depth
#              |||| /
#              |||||     delay
#  cmd     pid ||||| time  |   caller
#     \   /    |||||   \   |   /
  <idle>-0     0d..1    0us+: trace_hardirqs_off_thunk (apic_timer_interrupt)
  <idle>-0     0d.s.   97us : __do_softirq (do_softirq)
  <idle>-0     0d.s1   98us : trace_hardirqs_on (do_softirq)


This shows that the current tracer is "irqsoff" tracing the time for which
interrupts were disabled. It gives the trace version and the version
of the kernel upon which this was executed (2.6.26-rc8). Then it displays
the max latency in microsecs (97 us). The number of trace entries displayed
and the total number recorded (both are three: #3/3). The type of
preemption that was used (PREEMPT). VP, KP, SP, and HP are always zero
and are reserved for later use. #P is the number of online CPUS (#P:2).

The task is the process that was running when the latency occurred.
(swapper pid: 0).

The start and stop (the functions in which the interrupts were disabled and
enabled respectively) that caused the latencies:

  apic_timer_interrupt is where the interrupts were disabled.
  do_softirq is where they were enabled again.

The next lines after the header are the trace itself. The header
explains which is which.

  cmd: The name of the process in the trace.

  pid: The PID of that process.

  CPU#: The CPU which the process was running on.

  irqs-off: 'd' interrupts are disabled. '.' otherwise.

  need-resched: 'N' task need_resched is set, '.' otherwise.

  hardirq/softirq:
	'H' - hard irq occurred inside a softirq.
	'h' - hard irq is running
	's' - soft irq is running
	'.' - normal context.

  preempt-depth: The level of the preempt_disable counter.

The above is mostly meaningful for kernel developers.

  time: This differs from the trace file output. The trace file output
	includes an absolute timestamp. The timestamp used by the
	latency_trace file is relative to the start of the trace.

  delay: This is just to help catch your eye a bit better. And
	needs to be fixed to be only relative to the same CPU.
	The marks are determined by the difference between this
	current trace and the next trace.
	 '!' - greater than preempt_mark_thresh (default 100)
	 '+' - greater than 1 microsecond
	 ' ' - less than or equal to 1 microsecond.

  The rest is the same as the 'trace' file.
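
The flag characters that follow the CPU number (for example "d.s1"
in the "0d.s1" field above) can be decoded mechanically. The
following is only an illustrative sketch based on the meanings
listed above (it is not part of ftrace):

#include <stdio.h>

/* Decode the four characters after the CPU number in a latency_trace
 * line, e.g. "d.s1" taken from "0d.s1".
 */
static void decode_latency_flags(const char *f)
{
	printf("irqs-off:      %s\n", f[0] == 'd' ? "yes" : "no");
	printf("need-resched:  %s\n", f[1] == 'N' ? "yes" : "no");

	switch (f[2]) {
	case 'H': printf("hard irq inside a softirq\n"); break;
	case 'h': printf("hard irq running\n");          break;
	case 's': printf("soft irq running\n");          break;
	default:  printf("normal context\n");            break;
	}

	/* In the examples above, '.' is shown where the preempt
	 * depth is zero.
	 */
	printf("preempt-depth: %c\n", f[3] == '.' ? '0' : f[3]);
}

int main(void)
{
	decode_latency_flags("d.s1");	/* from the example trace above */
	return 0;
}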

iter_ctrl
---------

The iter_ctrl file is used to control what gets printed in the trace
output. To see what is available, simply cat the file:

  cat /debug/tracing/iter_ctrl
  print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
  noblock nostacktrace nosched-tree

To disable one of the options, echo in the option prepended with "no".

  echo noprint-parent > /debug/tracing/iter_ctrl

To enable an option, leave off the "no".

  echo sym-offset > /debug/tracing/iter_ctrl

Here are the available options:

  print-parent - On function traces, display the calling function
		as well as the function being traced.

  print-parent:
   bash-4000  [01]  1477.606694: simple_strtoul <-strict_strtoul

  noprint-parent:
   bash-4000  [01]  1477.606694: simple_strtoul


  sym-offset - Display not only the function name, but also the offset
		in the function. For example, instead of seeing just
		"ktime_get", you will see "ktime_get+0xb/0x20".

  sym-offset:
   bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0

  sym-addr - this will also display the function address as well as
		the function name.

  sym-addr:
   bash-4000  [01]  1477.606694: simple_strtoul <c0339346>

  verbose - This deals with the latency_trace file.

    bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
    (+0.000ms): simple_strtoul (strict_strtoul)

  raw - This will display raw numbers. This option is best for use with
	user applications that can translate the raw numbers better than
	having it done in the kernel.

  hex - Similar to raw, but the numbers will be in a hexadecimal format.

  bin - This will print out the formats in raw binary.

  block - TBD (needs update)

  stacktrace - This is one of the options that changes the trace itself.
		When a trace is recorded, so is the stack of functions.
		This allows for back traces of trace sites.

  sched-tree - TBD (any users??)


sched_switch
------------

This tracer simply records schedule switches. Here is an example
of how to use it.

 # echo sched_switch > /debug/tracing/current_tracer
 # echo 1 > /debug/tracing/tracing_enabled
 # sleep 1
 # echo 0 > /debug/tracing/tracing_enabled
 # cat /debug/tracing/trace

# tracer: sched_switch
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |
        bash-3997  [01]   240.132281:   3997:120:R   +  4055:120:R
        bash-3997  [01]   240.132284:   3997:120:R ==>  4055:120:R
       sleep-4055  [01]   240.132371:   4055:120:S ==>  3997:120:R
        bash-3997  [01]   240.132454:   3997:120:R   +  4055:120:S
        bash-3997  [01]   240.132457:   3997:120:R ==>  4055:120:R
       sleep-4055  [01]   240.132460:   4055:120:D ==>  3997:120:R
        bash-3997  [01]   240.132463:   3997:120:R   +  4055:120:D
        bash-3997  [01]   240.132465:   3997:120:R ==>  4055:120:R
      <idle>-0     [00]   240.132589:      0:140:R   +     4:115:S
      <idle>-0     [00]   240.132591:      0:140:R ==>     4:115:R
 ksoftirqd/0-4     [00]   240.132595:      4:115:S ==>     0:140:R
      <idle>-0     [00]   240.132598:      0:140:R   +     4:115:S
      <idle>-0     [00]   240.132599:      0:140:R ==>     4:115:R
 ksoftirqd/0-4     [00]   240.132603:      4:115:S ==>     0:140:R
       sleep-4055  [01]   240.133058:   4055:120:S ==>  3997:120:R
  [...]


As discussed previously, the header shows the name of the tracer and
the column format. The "FUNCTION" column is a misnomer since here it
represents the wake ups and context switches.

The sched_switch file only lists the wake ups (represented with '+')
and context switches ('==>') with the previous task or current task
first followed by the next task or task waking up. The format for both
of these is PID:KERNEL-PRIO:TASK-STATE. Remember that the KERNEL-PRIO
is the inverse of the actual priority with zero (0) being the highest
priority and the nice values starting at 100 (nice -20). Below is
a quick chart to map the kernel priority to user land priorities
(a small C sketch of this mapping follows at the end of this section).

  Kernel priority: 0 to 99    ==> user RT priority 99 to 0
  Kernel priority: 100 to 139 ==> user nice -20 to 19
  Kernel priority: 140        ==> idle task priority

The task states are:

 R - running : wants to run, may not actually be running
 S - sleep : process is waiting to be woken up (handles signals)
 D - disk sleep (uninterruptible sleep) : process must be woken up
			(ignores signals)
 T - stopped : process suspended
 t - traced : process is being traced (with something like gdb)
 Z - zombie : process waiting to be cleaned up
 X - unknown
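
Here is a small illustrative sketch of the mapping in the chart above
(it is not part of ftrace); it converts the kernel priority shown in
the trace into what user-space tools would report:

#include <stdio.h>

static void print_user_prio(int kprio)
{
	if (kprio < 100)		/* RT range */
		printf("kernel %3d -> RT priority %d\n", kprio, 99 - kprio);
	else if (kprio < 140)		/* nice range */
		printf("kernel %3d -> nice %d\n", kprio, kprio - 120);
	else				/* the idle task */
		printf("kernel %3d -> idle task\n", kprio);
}

int main(void)
{
	print_user_prio(120);	/* bash and sleep in the example above */
	print_user_prio(115);	/* ksoftirqd/0 */
	print_user_prio(140);	/* the idle task */
	return 0;
}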

ftrace_enabled
--------------

The following tracers give different output depending on whether or not
the sysctl ftrace_enabled is set. To set ftrace_enabled, one can either
use the sysctl function or set it via the proc file system interface.

  sysctl kernel.ftrace_enabled=1

 or

  echo 1 > /proc/sys/kernel/ftrace_enabled

To disable ftrace_enabled simply replace the '1' with '0' in
the above commands.

When ftrace_enabled is set the tracers will also record the functions
that are within the trace. The descriptions of the tracers
will also show an example with ftrace enabled.


irqsoff
-------

When interrupts are disabled, the CPU can not react to any other
external event (besides NMIs and SMIs). This prevents the timer
interrupt from triggering or the mouse interrupt from letting the
kernel know of a new mouse event. The result is a latency with the
reaction time.

The irqsoff tracer tracks the time for which interrupts are disabled.
When a new maximum latency is hit, the tracer saves the trace leading up
to that latency point so that every time a new maximum is reached, the old
saved trace is discarded and the new trace is saved.

To reset the maximum, echo 0 into tracing_max_latency. Here is an
example:

 # echo irqsoff > /debug/tracing/current_tracer
 # echo 0 > /debug/tracing/tracing_max_latency
 # echo 1 > /debug/tracing/tracing_enabled
 # ls -ltr
 [...]
 # echo 0 > /debug/tracing/tracing_enabled
 # cat /debug/tracing/latency_trace
# tracer: irqsoff
#
irqsoff latency trace v1.1.5 on 2.6.26
--------------------------------------------------------------------
 latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: sys_setpgid
 => ended at:   sys_setpgid

#                _------=> CPU#
#               / _-----=> irqs-off
#              | / _----=> need-resched
#              || / _---=> hardirq/softirq
#              ||| / _--=> preempt-depth
#              |||| /
#              |||||     delay
#  cmd     pid ||||| time  |   caller
#     \   /    |||||   \   |   /
    bash-3730  1d...    0us : _write_lock_irq (sys_setpgid)
    bash-3730  1d..1    1us+: _write_unlock_irq (sys_setpgid)
    bash-3730  1d..2   14us : trace_hardirqs_on (sys_setpgid)


Here we see that we had a latency of 12 microsecs (which is
very good). The _write_lock_irq in sys_setpgid disabled interrupts.
The difference between the 12 and the displayed timestamp 14us occurred
because the clock was incremented between the time of recording the max
latency and the time of recording the function that had that latency.

Note the above example had ftrace_enabled not set. If we set
ftrace_enabled, we get a much larger output:

# tracer: irqsoff
#
irqsoff latency trace v1.1.5 on 2.6.26-rc8
--------------------------------------------------------------------
 latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: __alloc_pages_internal
 => ended at:   __alloc_pages_internal

#                _------=> CPU#
#               / _-----=> irqs-off
#              | / _----=> need-resched
#              || / _---=> hardirq/softirq
#              ||| / _--=> preempt-depth
#              |||| /
#              |||||     delay
#  cmd     pid ||||| time  |   caller
#     \   /    |||||   \   |   /
      ls-4339  0...1    0us+: get_page_from_freelist (__alloc_pages_internal)
      ls-4339  0d..1    3us : rmqueue_bulk (get_page_from_freelist)
      ls-4339  0d..1    3us : _spin_lock (rmqueue_bulk)
      ls-4339  0d..1    4us : add_preempt_count (_spin_lock)
      ls-4339  0d..2    4us : __rmqueue (rmqueue_bulk)
      ls-4339  0d..2    5us : __rmqueue_smallest (__rmqueue)
      ls-4339  0d..2    5us : __mod_zone_page_state (__rmqueue_smallest)
      ls-4339  0d..2    6us : __rmqueue (rmqueue_bulk)
      ls-4339  0d..2    6us : __rmqueue_smallest (__rmqueue)
      ls-4339  0d..2    7us : __mod_zone_page_state (__rmqueue_smallest)
      ls-4339  0d..2    7us : __rmqueue (rmqueue_bulk)
      ls-4339  0d..2    8us : __rmqueue_smallest (__rmqueue)
[...]
      ls-4339  0d..2   46us : __rmqueue_smallest (__rmqueue)
      ls-4339  0d..2   47us : __mod_zone_page_state (__rmqueue_smallest)
      ls-4339  0d..2   47us : __rmqueue (rmqueue_bulk)
      ls-4339  0d..2   48us : __rmqueue_smallest (__rmqueue)
      ls-4339  0d..2   48us : __mod_zone_page_state (__rmqueue_smallest)
      ls-4339  0d..2   49us : _spin_unlock (rmqueue_bulk)
      ls-4339  0d..2   49us : sub_preempt_count (_spin_unlock)
      ls-4339  0d..1   50us : get_page_from_freelist (__alloc_pages_internal)
      ls-4339  0d..2   51us : trace_hardirqs_on (__alloc_pages_internal)


Here we traced a 50 microsecond latency. But we also see all the
functions that were called during that time. Note that by enabling
function tracing, we incur an added overhead. This overhead may
extend the latency times. But nevertheless, this trace has provided
some very helpful debugging information.
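
The reset/run/read cycle above can also be driven from a program.
Below is a minimal illustrative sketch (not part of ftrace; it
assumes debugfs at /debug and uses "ls -ltr" as a stand-in workload):

#include <stdio.h>
#include <stdlib.h>

static void tracing_write(const char *file, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/debug/tracing/%s", file);
	f = fopen(path, "w");
	if (f) {
		fputs(val, f);
		fclose(f);
	}
}

int main(void)
{
	char max[64];
	FILE *f;

	tracing_write("current_tracer", "irqsoff");
	tracing_write("tracing_max_latency", "0");	/* reset the max */
	tracing_write("tracing_enabled", "1");

	system("ls -ltr > /dev/null");			/* the workload */

	tracing_write("tracing_enabled", "0");

	f = fopen("/debug/tracing/tracing_max_latency", "r");
	if (f && fgets(max, sizeof(max), f))
		printf("max irqs-off latency (usecs): %s", max);
	if (f)
		fclose(f);
	return 0;
}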

preemptoff
----------

When preemption is disabled, we may be able to receive interrupts but
the task cannot be preempted and a higher priority task must wait
for preemption to be enabled again before it can preempt a lower
priority task.

The preemptoff tracer traces the places that disable preemption.
Like the irqsoff tracer, it records the maximum latency for which
preemption was disabled. The control of the preemptoff tracer is much
like that of the irqsoff tracer.

 # echo preemptoff > /debug/tracing/current_tracer
 # echo 0 > /debug/tracing/tracing_max_latency
 # echo 1 > /debug/tracing/tracing_enabled
 # ls -ltr
 [...]
 # echo 0 > /debug/tracing/tracing_enabled
 # cat /debug/tracing/latency_trace
# tracer: preemptoff
#
preemptoff latency trace v1.1.5 on 2.6.26-rc8
--------------------------------------------------------------------
 latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: do_IRQ
 => ended at:   __do_softirq

#                _------=> CPU#
#               / _-----=> irqs-off
#              | / _----=> need-resched
#              || / _---=> hardirq/softirq
#              ||| / _--=> preempt-depth
#              |||| /
#              |||||     delay
#  cmd     pid ||||| time  |   caller
#     \   /    |||||   \   |   /
    sshd-4261  0d.h.    0us+: irq_enter (do_IRQ)
    sshd-4261  0d.s.   29us : _local_bh_enable (__do_softirq)
    sshd-4261  0d.s1   30us : trace_preempt_on (__do_softirq)


This has some more changes. Preemption was disabled when an interrupt
came in (notice the 'h'), and was enabled while doing a softirq
(notice the 's'). But we also see that interrupts have been disabled
when entering the preempt off section and leaving it (the 'd').
We do not know if interrupts were enabled in the meantime.

# tracer: preemptoff
#
preemptoff latency trace v1.1.5 on 2.6.26-rc8
--------------------------------------------------------------------
 latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: remove_wait_queue
 => ended at:   __do_softirq

#                _------=> CPU#
#               / _-----=> irqs-off
#              | / _----=> need-resched
#              || / _---=> hardirq/softirq
#              ||| / _--=> preempt-depth
#              |||| /
#              |||||     delay
#  cmd     pid ||||| time  |   caller
#     \   /    |||||   \   |   /
    sshd-4261  0d..1    0us : _spin_lock_irqsave (remove_wait_queue)
    sshd-4261  0d..1    1us : _spin_unlock_irqrestore (remove_wait_queue)
    sshd-4261  0d..1    2us : do_IRQ (common_interrupt)
    sshd-4261  0d..1    2us : irq_enter (do_IRQ)
    sshd-4261  0d..1    2us : idle_cpu (irq_enter)
    sshd-4261  0d..1    3us : add_preempt_count (irq_enter)
    sshd-4261  0d.h1    3us : idle_cpu (irq_enter)
    sshd-4261  0d.h.    4us : handle_fasteoi_irq (do_IRQ)
[...]
    sshd-4261  0d.h.   12us : add_preempt_count (_spin_lock)
    sshd-4261  0d.h1   12us : ack_ioapic_quirk_irq (handle_fasteoi_irq)
    sshd-4261  0d.h1   13us : move_native_irq (ack_ioapic_quirk_irq)
    sshd-4261  0d.h1   13us : _spin_unlock (handle_fasteoi_irq)
    sshd-4261  0d.h1   14us : sub_preempt_count (_spin_unlock)
    sshd-4261  0d.h1   14us : irq_exit (do_IRQ)
    sshd-4261  0d.h1   15us : sub_preempt_count (irq_exit)
    sshd-4261  0d..2   15us : do_softirq (irq_exit)
    sshd-4261  0d...   15us : __do_softirq (do_softirq)
    sshd-4261  0d...   16us : __local_bh_disable (__do_softirq)
    sshd-4261  0d...   16us+: add_preempt_count (__local_bh_disable)
    sshd-4261  0d.s4   20us : add_preempt_count (__local_bh_disable)
    sshd-4261  0d.s4   21us : sub_preempt_count (local_bh_enable)
    sshd-4261  0d.s5   21us : sub_preempt_count (local_bh_enable)
[...]
    sshd-4261  0d.s6   41us : add_preempt_count (__local_bh_disable)
    sshd-4261  0d.s6   42us : sub_preempt_count (local_bh_enable)
    sshd-4261  0d.s7   42us : sub_preempt_count (local_bh_enable)
    sshd-4261  0d.s5   43us : add_preempt_count (__local_bh_disable)
    sshd-4261  0d.s5   43us : sub_preempt_count (local_bh_enable_ip)
    sshd-4261  0d.s6   44us : sub_preempt_count (local_bh_enable_ip)
    sshd-4261  0d.s5   44us : add_preempt_count (__local_bh_disable)
    sshd-4261  0d.s5   45us : sub_preempt_count (local_bh_enable)
[...]
    sshd-4261  0d.s.   63us : _local_bh_enable (__do_softirq)
    sshd-4261  0d.s1   64us : trace_preempt_on (__do_softirq)


The above is an example of the preemptoff trace with ftrace_enabled
set. Here we see that interrupts were disabled the entire time.
The irq_enter code lets us know that we entered an interrupt 'h'.
Before that, the functions being traced still show that it is not
in an interrupt, but we can see from the functions themselves that
this is not the case.

Notice that __do_softirq when called does not have a preempt_count.
It may seem that we missed a preempt enabling. What really happened
is that the preempt count is held on the thread's stack and we
switched to the softirq stack (4K stacks in effect). The code
does not copy the preempt count, but because interrupts are disabled,
we do not need to worry about it. Having a tracer like this is good
for letting people know what really happens inside the kernel.


preemptirqsoff
--------------

Knowing the locations that have interrupts disabled or preemption
disabled for the longest times is helpful. But sometimes we would
like to know the time for which either preemption or interrupts
(or both) are disabled.

Consider the following code:

    local_irq_disable();
    call_function_with_irqs_off();
    preempt_disable();
    call_function_with_irqs_and_preemption_off();
    local_irq_enable();
    call_function_with_preemption_off();
    preempt_enable();

The irqsoff tracer will record the total length of
call_function_with_irqs_off() and
call_function_with_irqs_and_preemption_off().

The preemptoff tracer will record the total length of
call_function_with_irqs_and_preemption_off() and
call_function_with_preemption_off().

But neither will trace the total time for which interrupts and/or
preemption are disabled. This total time is the time during which we
can not schedule. To record this time, use the preemptirqsoff tracer.

Again, using this tracer is much like the irqsoff and preemptoff tracers.

 # echo preemptirqsoff > /debug/tracing/current_tracer
 # echo 0 > /debug/tracing/tracing_max_latency
 # echo 1 > /debug/tracing/tracing_enabled
 # ls -ltr
 [...]
 # echo 0 > /debug/tracing/tracing_enabled
 # cat /debug/tracing/latency_trace
# tracer: preemptirqsoff
#
preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
--------------------------------------------------------------------
 latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: apic_timer_interrupt
 => ended at:   __do_softirq

#                _------=> CPU#
#               / _-----=> irqs-off
#              | / _----=> need-resched
#              || / _---=> hardirq/softirq
#              ||| / _--=> preempt-depth
#              |||| /
#              |||||     delay
#  cmd     pid ||||| time  |   caller
#     \   /    |||||   \   |   /
      ls-4860  0d...    0us!: trace_hardirqs_off_thunk (apic_timer_interrupt)
      ls-4860  0d.s.  294us : _local_bh_enable (__do_softirq)
      ls-4860  0d.s1  294us : trace_preempt_on (__do_softirq)



The trace_hardirqs_off_thunk is called from assembly on x86 when
interrupts are disabled in the assembly code. Without the function
tracing, we do not know if interrupts were enabled within the preemption
points. We do see that it started with preemption enabled.

Here is a trace with ftrace_enabled set:


# tracer: preemptirqsoff
#
preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
--------------------------------------------------------------------
 latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
    -----------------
 => started at: write_chan
 => ended at:   __do_softirq

#                _------=> CPU#
#               / _-----=> irqs-off
#              | / _----=> need-resched
#              || / _---=> hardirq/softirq
#              ||| / _--=> preempt-depth
#              |||| /
#              |||||     delay
#  cmd     pid ||||| time  |   caller
#     \   /    |||||   \   |   /
      ls-4473  0.N..    0us : preempt_schedule (write_chan)
      ls-4473  0dN.1    1us : _spin_lock (schedule)
      ls-4473  0dN.1    2us : add_preempt_count (_spin_lock)
      ls-4473  0d..2    2us : put_prev_task_fair (schedule)
[...]
      ls-4473  0d..2   13us : set_normalized_timespec (ktime_get_ts)
      ls-4473  0d..2   13us : __switch_to (schedule)
    sshd-4261  0d..2   14us : finish_task_switch (schedule)
    sshd-4261  0d..2   14us : _spin_unlock_irq (finish_task_switch)
    sshd-4261  0d..1   15us : add_preempt_count (_spin_lock_irqsave)
    sshd-4261  0d..2   16us : _spin_unlock_irqrestore (hrtick_set)
    sshd-4261  0d..2   16us : do_IRQ (common_interrupt)
    sshd-4261  0d..2   17us : irq_enter (do_IRQ)
    sshd-4261  0d..2   17us : idle_cpu (irq_enter)
    sshd-4261  0d..2   18us : add_preempt_count (irq_enter)
    sshd-4261  0d.h2   18us : idle_cpu (irq_enter)
    sshd-4261  0d.h.   18us : handle_fasteoi_irq (do_IRQ)
    sshd-4261  0d.h.   19us : _spin_lock (handle_fasteoi_irq)
    sshd-4261  0d.h.   19us : add_preempt_count (_spin_lock)
    sshd-4261  0d.h1   20us : _spin_unlock (handle_fasteoi_irq)
    sshd-4261  0d.h1   20us : sub_preempt_count (_spin_unlock)
[...]
    sshd-4261  0d.h1   28us : _spin_unlock (handle_fasteoi_irq)
    sshd-4261  0d.h1   29us : sub_preempt_count (_spin_unlock)
    sshd-4261  0d.h2   29us : irq_exit (do_IRQ)
    sshd-4261  0d.h2   29us : sub_preempt_count (irq_exit)
    sshd-4261  0d..3   30us : do_softirq (irq_exit)
    sshd-4261  0d...   30us : __do_softirq (do_softirq)
    sshd-4261  0d...   31us : __local_bh_disable (__do_softirq)
    sshd-4261  0d...   31us+: add_preempt_count (__local_bh_disable)
    sshd-4261  0d.s4   34us : add_preempt_count (__local_bh_disable)
[...]
    sshd-4261  0d.s3   43us : sub_preempt_count (local_bh_enable_ip)
    sshd-4261  0d.s4   44us : sub_preempt_count (local_bh_enable_ip)
    sshd-4261  0d.s3   44us : smp_apic_timer_interrupt (apic_timer_interrupt)
    sshd-4261  0d.s3   45us : irq_enter (smp_apic_timer_interrupt)
    sshd-4261  0d.s3   45us : idle_cpu (irq_enter)
    sshd-4261  0d.s3   46us : add_preempt_count (irq_enter)
    sshd-4261  0d.H3   46us : idle_cpu (irq_enter)
    sshd-4261  0d.H3   47us : hrtimer_interrupt (smp_apic_timer_interrupt)
    sshd-4261  0d.H3   47us : ktime_get (hrtimer_interrupt)
[...]
    sshd-4261  0d.H3   81us : tick_program_event (hrtimer_interrupt)
    sshd-4261  0d.H3   82us : ktime_get (tick_program_event)
    sshd-4261  0d.H3   82us : ktime_get_ts (ktime_get)
    sshd-4261  0d.H3   83us : getnstimeofday (ktime_get_ts)
    sshd-4261  0d.H3   83us : set_normalized_timespec (ktime_get_ts)
    sshd-4261  0d.H3   84us : clockevents_program_event (tick_program_event)
    sshd-4261  0d.H3   84us : lapic_next_event (clockevents_program_event)
    sshd-4261  0d.H3   85us : irq_exit (smp_apic_timer_interrupt)
    sshd-4261  0d.H3   85us : sub_preempt_count (irq_exit)
    sshd-4261  0d.s4   86us : sub_preempt_count (irq_exit)
    sshd-4261  0d.s3   86us : add_preempt_count (__local_bh_disable)
[...]
    sshd-4261  0d.s1   98us : sub_preempt_count (net_rx_action)
    sshd-4261  0d.s.   99us : add_preempt_count (_spin_lock_irq)
    sshd-4261  0d.s1   99us+: _spin_unlock_irq (run_timer_softirq)
    sshd-4261  0d.s.  104us : _local_bh_enable (__do_softirq)
    sshd-4261  0d.s.  104us : sub_preempt_count (_local_bh_enable)
    sshd-4261  0d.s.  105us : _local_bh_enable (__do_softirq)
    sshd-4261  0d.s1  105us : trace_preempt_on (__do_softirq)


This is a very interesting trace. It started with the preemption of
the ls task. We see that the task had the "need_resched" bit set
via the 'N' in the trace. Interrupts were disabled before the spin_lock
at the beginning of the trace. We see that a schedule took place to run
sshd. When the interrupts were enabled, we took an interrupt.
On return from the interrupt handler, the softirq ran.
We took another interrupt while running the softirq as we see from
the capital 'H'.


wakeup
------

In a Real-Time environment it is very important to know the time it
takes from when the highest priority task is woken up to when it
actually executes. This is also known as "schedule latency".
I stress the point that this is about RT tasks. It is also important
to know the scheduling latency of non-RT tasks, but for non-RT tasks
the average schedule latency is a more useful measurement. Tools like
LatencyTop are more appropriate for such measurements.

Real-Time environments are interested in the worst case latency.
That is the longest latency it takes for something to happen, and
not the average. We can have a very fast scheduler that may only
have a large latency once in a while, but that would not work well
with Real-Time tasks. The wakeup tracer was designed to record
the worst case wakeups of RT tasks. Non-RT tasks are not recorded
because the tracer only records one worst case and tracing non-RT
tasks that are unpredictable will overwrite the worst case latency
of RT tasks.

Since this tracer only deals with RT tasks, we will run this slightly
differently than we did with the previous tracers. Instead of performing
an 'ls', we will run 'sleep 1' under 'chrt' which changes the
priority of the task.

 # echo wakeup > /debug/tracing/current_tracer
 # echo 0 > /debug/tracing/tracing_max_latency
 # echo 1 > /debug/tracing/tracing_enabled
 # chrt -f 5 sleep 1
 # echo 0 > /debug/tracing/tracing_enabled
 # cat /debug/tracing/latency_trace
# tracer: wakeup
#
wakeup latency trace v1.1.5 on 2.6.26-rc8
--------------------------------------------------------------------
 latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5)
    -----------------

#                _------=> CPU#
#               / _-----=> irqs-off
#              | / _----=> need-resched
#              || / _---=> hardirq/softirq
#              ||| / _--=> preempt-depth
#              |||| /
#              |||||     delay
#  cmd     pid ||||| time  |   caller
#     \   /    |||||   \   |   /
  <idle>-0     1d.h4    0us+: try_to_wake_up (wake_up_process)
  <idle>-0     1d..4    4us : schedule (cpu_idle)



Running this on an idle system, we see that it only took 4 microseconds
to perform the task switch.  Note, since the trace marker in the
schedule is before the actual "switch", we stop the tracing when
the recorded task is about to schedule in. This may change if
we add a new marker at the end of the scheduler.

Notice that the recorded task is 'sleep' with the PID of 4901 and it
has an rt_prio of 5. This priority is user-space priority and not
the internal kernel priority. The policy is 1 for SCHED_FIFO and 2
for SCHED_RR.

Doing the same with chrt -r 5 and ftrace_enabled set:

# tracer: wakeup
#
wakeup latency trace v1.1.5 on 2.6.26-rc8
--------------------------------------------------------------------
 latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
    -----------------
    | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5)
    -----------------

#                _------=> CPU#
#               / _-----=> irqs-off
#              | / _----=> need-resched
#              || / _---=> hardirq/softirq
#              ||| / _--=> preempt-depth
#              |||| /
#              |||||     delay
#  cmd     pid ||||| time  |   caller
#     \   /    |||||   \   |   /
ksoftirq-7     1d.H3    0us : try_to_wake_up (wake_up_process)
ksoftirq-7     1d.H4    1us : sub_preempt_count (marker_probe_cb)
ksoftirq-7     1d.H3    2us : check_preempt_wakeup (try_to_wake_up)
ksoftirq-7     1d.H3    3us : update_curr (check_preempt_wakeup)
ksoftirq-7     1d.H3    4us : calc_delta_mine (update_curr)
ksoftirq-7     1d.H3    5us : __resched_task (check_preempt_wakeup)
ksoftirq-7     1d.H3    6us : task_wake_up_rt (try_to_wake_up)
ksoftirq-7     1d.H3    7us : _spin_unlock_irqrestore (try_to_wake_up)
[...]
ksoftirq-7     1d.H2   17us : irq_exit (smp_apic_timer_interrupt)
ksoftirq-7     1d.H2   18us : sub_preempt_count (irq_exit)
ksoftirq-7     1d.s3   19us : sub_preempt_count (irq_exit)
ksoftirq-7     1..s2   20us : rcu_process_callbacks (__do_softirq)
[...]
ksoftirq-7     1..s2   26us : __rcu_process_callbacks (rcu_process_callbacks)
ksoftirq-7     1d.s2   27us : _local_bh_enable (__do_softirq)
ksoftirq-7     1d.s2   28us : sub_preempt_count (_local_bh_enable)
ksoftirq-7     1.N.3   29us : sub_preempt_count (ksoftirqd)
ksoftirq-7     1.N.2   30us : _cond_resched (ksoftirqd)
ksoftirq-7     1.N.2   31us : __cond_resched (_cond_resched)
ksoftirq-7     1.N.2   32us : add_preempt_count (__cond_resched)
ksoftirq-7     1.N.2   33us : schedule (__cond_resched)
ksoftirq-7     1.N.2   33us : add_preempt_count (schedule)
ksoftirq-7     1.N.3   34us : hrtick_clear (schedule)
ksoftirq-7     1dN.3   35us : _spin_lock (schedule)
ksoftirq-7     1dN.3   36us : add_preempt_count (_spin_lock)
ksoftirq-7     1d..4   37us : put_prev_task_fair (schedule)
ksoftirq-7     1d..4   38us : update_curr (put_prev_task_fair)
[...]
ksoftirq-7     1d..5   47us : _spin_trylock (tracing_record_cmdline)
ksoftirq-7     1d..5   48us : add_preempt_count (_spin_trylock)
ksoftirq-7     1d..6   49us : _spin_unlock (tracing_record_cmdline)
ksoftirq-7     1d..6   49us : sub_preempt_count (_spin_unlock)
ksoftirq-7     1d..4   50us : schedule (__cond_resched)

The interrupt went off while running ksoftirqd. This task runs at
SCHED_OTHER. Why did we not see the 'N' set early? This may be
a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K stacks
configured, the interrupt and softirq run with their own stack.
Some information is held on the top of the task's stack (need_resched
and preempt_count are both stored there). The setting of the NEED_RESCHED
bit is done directly to the task's stack, but the reading of the
NEED_RESCHED flag is done by looking at the current stack, which in this
case is the stack for the hard interrupt. This hides the fact that
NEED_RESCHED has been set. We do not see the 'N' until we switch back
to the task's assigned stack.
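
If chrt is not available, the same kind of RT wakeup can be produced
by a small program that gives itself an RT priority before sleeping.
A minimal illustrative sketch (not part of ftrace; it simply mirrors
the 'chrt -f 5 sleep 1' example above):

#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 5 };
	struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };

	/* The equivalent of "chrt -f 5": SCHED_FIFO at RT priority 5. */
	if (sched_setscheduler(0, SCHED_FIFO, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}

	/* Sleeping and being woken up is what the wakeup tracer measures. */
	nanosleep(&ts, NULL);
	return 0;
}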

ftrace
------

ftrace is not only the name of the tracing infrastructure, but it
is also the name of one of the tracers. The tracer is the function
tracer. Enabling the function tracer can be done from the
debug file system. Make sure that ftrace_enabled is set; otherwise
this tracer is a nop.

 # sysctl kernel.ftrace_enabled=1
 # echo ftrace > /debug/tracing/current_tracer
 # echo 1 > /debug/tracing/tracing_enabled
 # usleep 1
 # echo 0 > /debug/tracing/tracing_enabled
 # cat /debug/tracing/trace
# tracer: ftrace
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |
            bash-4003  [00]   123.638713: finish_task_switch <-schedule
            bash-4003  [00]   123.638714: _spin_unlock_irq <-finish_task_switch
            bash-4003  [00]   123.638714: sub_preempt_count <-_spin_unlock_irq
            bash-4003  [00]   123.638715: hrtick_set <-schedule
            bash-4003  [00]   123.638715: _spin_lock_irqsave <-hrtick_set
            bash-4003  [00]   123.638716: add_preempt_count <-_spin_lock_irqsave
            bash-4003  [00]   123.638716: _spin_unlock_irqrestore <-hrtick_set
            bash-4003  [00]   123.638717: sub_preempt_count <-_spin_unlock_irqrestore
            bash-4003  [00]   123.638717: hrtick_clear <-hrtick_set
            bash-4003  [00]   123.638718: sub_preempt_count <-schedule
            bash-4003  [00]   123.638718: sub_preempt_count <-preempt_schedule
            bash-4003  [00]   123.638719: wait_for_completion <-__stop_machine_run
            bash-4003  [00]   123.638719: wait_for_common <-wait_for_completion
            bash-4003  [00]   123.638720: _spin_lock_irq <-wait_for_common
            bash-4003  [00]   123.638720: add_preempt_count <-_spin_lock_irq
[...]


Note: ftrace uses ring buffers to store the above entries. The newest data
may overwrite the oldest data. Sometimes using echo to stop the trace
is not sufficient because the tracing could have overwritten the data
that you wanted to record. For this reason, it is sometimes better to
disable tracing directly from a program. This allows you to stop the
tracing at the point that you hit the part that you are interested in.
To disable the tracing directly from a C program, something like the
following code snippet can be used:

#include <fcntl.h>
#include <unistd.h>

int trace_fd;
[...]
int main(int argc, char *argv[]) {
	[...]
	trace_fd = open("/debug/tracing/tracing_enabled", O_WRONLY);
	[...]
	if (condition_hit()) {
		write(trace_fd, "0", 1);
	}
	[...]
}

Note: Here we hard coded the path name. The debugfs mount is not
guaranteed to be at /debug (and is more commonly at /sys/kernel/debug).
For simple one time traces, the above is sufficient. For anything else,
a search through /proc/mounts may be needed to find where the debugfs
file-system is mounted.
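
Such a search can be done with the standard getmntent() interface.
Here is a minimal illustrative sketch (not part of ftrace) that
locates where debugfs is mounted:

#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Copy the debugfs mount point into 'buf'; return -1 if not mounted. */
static int find_debugfs(char *buf, size_t len)
{
	struct mntent *m;
	FILE *f = setmntent("/proc/mounts", "r");

	if (!f)
		return -1;
	while ((m = getmntent(f))) {
		if (strcmp(m->mnt_type, "debugfs") == 0) {
			snprintf(buf, len, "%s", m->mnt_dir);
			endmntent(f);
			return 0;
		}
	}
	endmntent(f);
	return -1;
}

int main(void)
{
	char path[256];

	if (find_debugfs(path, sizeof(path)) == 0)
		printf("debugfs is mounted at %s\n", path);
	else
		printf("debugfs does not appear to be mounted\n");
	return 0;
}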

dynamic ftrace
--------------

If CONFIG_DYNAMIC_FTRACE is set, the system will run with
virtually no overhead when function tracing is disabled. The way
this works is that the mcount function call (placed at the start of
every kernel function, produced by the -pg switch in gcc) starts
off pointing to a simple return. (Enabling FTRACE will include the
-pg switch in the compiling of the kernel.)

When dynamic ftrace is initialized, it calls kstop_machine to make
the machine act like a uniprocessor so that it can freely modify code
without worrying about other processors executing that same code. At
initialization, the mcount calls are changed to call a "record_ip"
function. After this, the first time a kernel function is called,
it has the calling address saved in a hash table.

Later on the ftraced kernel thread is awoken and will again call
kstop_machine if new functions have been recorded. The ftraced thread
will change all calls to mcount to "nop". Just calling mcount
and having mcount return has shown a 10% overhead. By converting
it to a nop, there is no measurable overhead to the system.

One special side-effect of the recording of the functions being
traced is that we can now selectively choose which functions we
wish to trace and which ones we want the mcount calls to remain as
nops.

Two files are used, one for enabling and one for disabling the tracing
of specified functions. They are:

  set_ftrace_filter

and

  set_ftrace_notrace

A list of available functions that you can add to these files is
listed in:

  available_filter_functions

 # cat /debug/tracing/available_filter_functions
put_prev_task_idle
kmem_cache_create
pick_next_task_rt
get_online_cpus
pick_next_task_fair
mutex_lock
[...]

If I am only interested in sys_nanosleep and hrtimer_interrupt:

 # echo sys_nanosleep hrtimer_interrupt \
		> /debug/tracing/set_ftrace_filter
 # echo ftrace > /debug/tracing/current_tracer
 # echo 1 > /debug/tracing/tracing_enabled
 # usleep 1
 # echo 0 > /debug/tracing/tracing_enabled
 # cat /debug/tracing/trace
# tracer: ftrace
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |
          usleep-4134  [00]  1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt
          usleep-4134  [00]  1317.070111: sys_nanosleep <-syscall_call
          <idle>-0     [00]  1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt

To see which functions are being traced, you can cat the file:

 # cat /debug/tracing/set_ftrace_filter
hrtimer_interrupt
sys_nanosleep


Perhaps this is not enough. The filters also allow simple wild cards.
Only the following are currently available:

  <match>*  - will match functions that begin with <match>
  *<match>  - will match functions that end with <match>
  *<match>* - will match functions that have <match> in it

These are the only wild cards which are supported.

  <match>*<match> will not work.

 # echo hrtimer_* > /debug/tracing/set_ftrace_filter

Produces:

# tracer: ftrace
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |
            bash-4003  [00]  1480.611794: hrtimer_init <-copy_process
            bash-4003  [00]  1480.611941: hrtimer_start <-hrtick_set
            bash-4003  [00]  1480.611956: hrtimer_cancel <-hrtick_clear
            bash-4003  [00]  1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel
          <idle>-0     [00]  1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt
          <idle>-0     [00]  1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt
          <idle>-0     [00]  1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt
          <idle>-0     [00]  1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt
          <idle>-0     [00]  1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt


Notice that we lost the sys_nanosleep.

 # cat /debug/tracing/set_ftrace_filter
hrtimer_run_queues
hrtimer_run_pending
hrtimer_init
hrtimer_cancel
hrtimer_try_to_cancel
hrtimer_forward
hrtimer_start
hrtimer_reprogram
hrtimer_force_reprogram
hrtimer_get_next_event
hrtimer_interrupt
hrtimer_nanosleep
hrtimer_wakeup
hrtimer_get_remaining
hrtimer_get_res
hrtimer_init_sleeper


This is because the '>' and '>>' act just like they do in bash.
To rewrite the filters, use '>'.
To append to the filters, use '>>'.

To clear out a filter so that all functions will be recorded again:

 # echo > /debug/tracing/set_ftrace_filter
 # cat /debug/tracing/set_ftrace_filter
 #

Again, now we want to append.

 # echo sys_nanosleep > /debug/tracing/set_ftrace_filter
 # cat /debug/tracing/set_ftrace_filter
sys_nanosleep
 # echo hrtimer_* >> /debug/tracing/set_ftrace_filter
 # cat /debug/tracing/set_ftrace_filter
hrtimer_run_queues
hrtimer_run_pending
hrtimer_init
hrtimer_cancel
hrtimer_try_to_cancel
hrtimer_forward
hrtimer_start
hrtimer_reprogram
hrtimer_force_reprogram
hrtimer_get_next_event
hrtimer_interrupt
sys_nanosleep
hrtimer_nanosleep
hrtimer_wakeup
hrtimer_get_remaining
hrtimer_get_res
hrtimer_init_sleeper


The set_ftrace_notrace prevents those functions from being traced.

 # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace

Produces:

# tracer: ftrace
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |
            bash-4043  [01]   115.281644: finish_task_switch <-schedule
            bash-4043  [01]   115.281645: hrtick_set <-schedule
            bash-4043  [01]   115.281645: hrtick_clear <-hrtick_set
            bash-4043  [01]   115.281646: wait_for_completion <-__stop_machine_run
            bash-4043  [01]   115.281647: wait_for_common <-wait_for_completion
            bash-4043  [01]   115.281647: kthread_stop <-stop_machine_run
            bash-4043  [01]   115.281648: init_waitqueue_head <-kthread_stop
            bash-4043  [01]   115.281648: wake_up_process <-kthread_stop
            bash-4043  [01]   115.281649: try_to_wake_up <-wake_up_process

We can see that there's no more lock or preempt tracing.


ftraced
-------

As mentioned above, when dynamic ftrace is configured in, a kernel
thread wakes up once a second and checks to see if there are mcount
calls that need to be converted into nops. If there are not any, then
it simply goes back to sleep. But if there are some, it will call
kstop_machine to convert the calls to nops.

There may be a case in which you do not want this added latency.
Perhaps you are doing some audio recording and this activity might
cause skips in the playback. There is an interface to disable
and enable the "ftraced" kernel thread.

 # echo 0 > /debug/tracing/ftraced_enabled

This will disable the calling of kstop_machine to update the
mcount calls to nops. Remember that there is a large overhead
to calling mcount. Without this kernel thread, that overhead will
exist.

If there are recorded calls to mcount, any write to the ftraced_enabled
file will cause the kstop_machine to run. This means that a
user can manually perform the updates when they want to by simply
echoing a '0' into the ftraced_enabled file.

The updates are also done at the beginning of enabling a tracer
that uses ftrace function recording.


trace_pipe
----------

The trace_pipe outputs the same content as the trace file, but the effect
on the tracing is different. Every read from trace_pipe is consumed.
This means that subsequent reads will be different. The trace
is live.

 # echo ftrace > /debug/tracing/current_tracer
 # cat /debug/tracing/trace_pipe > /tmp/trace.out &
[1] 4153
 # echo 1 > /debug/tracing/tracing_enabled
 # usleep 1
 # echo 0 > /debug/tracing/tracing_enabled
 # cat /debug/tracing/trace
# tracer: ftrace
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |

 #
 # cat /tmp/trace.out
            bash-4043  [00]    41.267106: finish_task_switch <-schedule
            bash-4043  [00]    41.267106: hrtick_set <-schedule
            bash-4043  [00]    41.267107: hrtick_clear <-hrtick_set
            bash-4043  [00]    41.267108: wait_for_completion <-__stop_machine_run
            bash-4043  [00]    41.267108: wait_for_common <-wait_for_completion
            bash-4043  [00]    41.267109: kthread_stop <-stop_machine_run
            bash-4043  [00]    41.267109: init_waitqueue_head <-kthread_stop
            bash-4043  [00]    41.267110: wake_up_process <-kthread_stop
            bash-4043  [00]    41.267110: try_to_wake_up <-wake_up_process
            bash-4043  [00]    41.267111: select_task_rq_rt <-try_to_wake_up


Note, reading the trace_pipe file will block until more input is added.
By changing the tracer, trace_pipe will issue an EOF. We needed
to set the ftrace tracer _before_ we "cat" the trace_pipe file.
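
Because trace_pipe blocks and consumes what it returns, it lends
itself to a simple streaming reader. A minimal illustrative sketch
(not part of ftrace; it assumes debugfs at /debug):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t n;
	int fd = open("/debug/tracing/trace_pipe", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Each read blocks until the tracer produces new entries;
	 * whatever is returned here is consumed from the buffer.
	 */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);

	close(fd);
	return 0;
}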

trace entries
-------------

Having too much or not enough data can be troublesome in diagnosing
an issue in the kernel. The file trace_entries is used to modify
the size of the internal trace buffers. The number listed
is the number of entries that can be recorded per CPU. To know
the full size, multiply the number of possible CPUS by the
number of entries.

 # cat /debug/tracing/trace_entries
65620

Note, to modify this, you must have tracing completely disabled. To do that,
echo "none" into the current_tracer. If the current_tracer is not set
to "none", an EINVAL error will be returned.

 # echo none > /debug/tracing/current_tracer
 # echo 100000 > /debug/tracing/trace_entries
 # cat /debug/tracing/trace_entries
100045


Notice that we echoed in 100,000 but the size is 100,045. The entries
are held in individual pages. It allocates the number of pages it takes
to fulfill the request. If more entries can fit on the last page,
they will be added.

 # echo 1 > /debug/tracing/trace_entries
 # cat /debug/tracing/trace_entries
85

This shows us that 85 entries can fit in a single page.

The number of pages which will be allocated is limited to a percentage
of available memory. Allocating too much will produce an error.

 # echo 1000000000000 > /debug/tracing/trace_entries
-bash: echo: write error: Cannot allocate memory
 # cat /debug/tracing/trace_entries
85
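
As a rough check of the total capacity, the per-CPU entry count can
be read back and multiplied by the number of CPUs. A minimal
illustrative sketch (not part of ftrace; note that sysconf() reports
the configured processors, which may differ from the kernel's notion
of "possible" CPUs):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long entries = 0, cpus;
	FILE *f = fopen("/debug/tracing/trace_entries", "r");

	if (!f || fscanf(f, "%ld", &entries) != 1) {
		perror("trace_entries");
		return 1;
	}
	fclose(f);

	cpus = sysconf(_SC_NPROCESSORS_CONF);

	/* Per-CPU buffer size times the number of CPUs. */
	printf("%ld entries per CPU x %ld CPUs = %ld entries total\n",
	       entries, cpus, entries * cpus);
	return 0;
}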