linux/Documentation/trace/ftrace.txt
<<
>>
Prefs
   1                ftrace - Function Tracer
   2                ========================
   3
   4Copyright 2008 Red Hat Inc.
   5   Author:   Steven Rostedt <srostedt@redhat.com>
   6  License:   The GNU Free Documentation License, Version 1.2
   7               (dual licensed under the GPL v2)
   8Reviewers:   Elias Oltmanns, Randy Dunlap, Andrew Morton,
   9             John Kacur, and David Teigland.
  10Written for: 2.6.28-rc2
  11
  12Introduction
  13------------
  14
  15Ftrace is an internal tracer designed to help out developers and
  16designers of systems to find what is going on inside the kernel.
  17It can be used for debugging or analyzing latencies and
  18performance issues that take place outside of user-space.
  19
  20Although ftrace is the function tracer, it also includes an
  21infrastructure that allows for other types of tracing. Some of
  22the tracers that are currently in ftrace include a tracer to
  23trace context switches, the time it takes for a high priority
  24task to run after it was woken up, the time interrupts are
  25disabled, and more (ftrace allows for tracer plugins, which
  26means that the list of tracers can always grow).
  27
  28
  29Implementation Details
  30----------------------
  31
  32See ftrace-design.txt for details for arch porters and such.
  33
  34
  35The File System
  36---------------
  37
  38Ftrace uses the debugfs file system to hold the control files as
  39well as the files to display output.
  40
  41When debugfs is configured into the kernel (which selecting any ftrace
  42option will do) the directory /sys/kernel/debug will be created. To mount
  43this directory, you can add to your /etc/fstab file:
  44
  45 debugfs       /sys/kernel/debug          debugfs defaults        0       0
  46
  47Or you can mount it at run time with:
  48
  49 mount -t debugfs nodev /sys/kernel/debug
  50
  51For quicker access to that directory you may want to make a soft link to
  52it:
  53
  54 ln -s /sys/kernel/debug /debug
  55
  56Any selected ftrace option will also create a directory called tracing
  57within the debugfs. The rest of the document will assume that you are in
  58the ftrace directory (cd /sys/kernel/debug/tracing) and will only concentrate
  59on the files within that directory and not distract from the content with
  60the extended "/sys/kernel/debug/tracing" path name.
  61
  62That's it! (assuming that you have ftrace configured into your kernel)
  63
  64After mounting the debugfs, you can see a directory called
  65"tracing".  This directory contains the control and output files
  66of ftrace. Here is a list of some of the key files:
  67
  68
  69 Note: all time values are in microseconds.
  70
  71  current_tracer:
  72
  73        This is used to set or display the current tracer
  74        that is configured.
  75
  76  available_tracers:
  77
  78        This holds the different types of tracers that
  79        have been compiled into the kernel. The
  80        tracers listed here can be configured by
  81        echoing their name into current_tracer.
  82
  83  tracing_on:
  84
  85        This sets or displays whether writing to the trace
  86        ring buffer is enabled. Echo 0 into this file to disable
  87        the tracer or 1 to enable it.
  88
  89  trace:
  90
  91        This file holds the output of the trace in a human
  92        readable format (described below).
  93
  94  trace_pipe:
  95
  96        The output is the same as the "trace" file but this
  97        file is meant to be streamed with live tracing.
  98        Reads from this file will block until new data is
  99        retrieved.  Unlike the "trace" file, this file is a
 100        consumer. This means reading from this file causes
 101        sequential reads to display more current data. Once
 102        data is read from this file, it is consumed, and
 103        will not be read again with a sequential read. The
 104        "trace" file is static, and if the tracer is not
 105        adding more data,they will display the same
 106        information every time they are read.
 107
 108  trace_options:
 109
 110        This file lets the user control the amount of data
 111        that is displayed in one of the above output
 112        files.
 113
 114  tracing_max_latency:
 115
 116        Some of the tracers record the max latency.
 117        For example, the time interrupts are disabled.
 118        This time is saved in this file. The max trace
 119        will also be stored, and displayed by "trace".
 120        A new max trace will only be recorded if the
 121        latency is greater than the value in this
 122        file. (in microseconds)
 123
 124  buffer_size_kb:
 125
 126        This sets or displays the number of kilobytes each CPU
 127        buffer can hold. The tracer buffers are the same size
 128        for each CPU. The displayed number is the size of the
 129        CPU buffer and not total size of all buffers. The
 130        trace buffers are allocated in pages (blocks of memory
 131        that the kernel uses for allocation, usually 4 KB in size).
 132        If the last page allocated has room for more bytes
 133        than requested, the rest of the page will be used,
 134        making the actual allocation bigger than requested.
 135        ( Note, the size may not be a multiple of the page size
 136          due to buffer management overhead. )
 137
 138        This can only be updated when the current_tracer
 139        is set to "nop".
 140
 141  tracing_cpumask:
 142
 143        This is a mask that lets the user only trace
 144        on specified CPUS. The format is a hex string
 145        representing the CPUS.
 146
 147  set_ftrace_filter:
 148
 149        When dynamic ftrace is configured in (see the
 150        section below "dynamic ftrace"), the code is dynamically
 151        modified (code text rewrite) to disable calling of the
 152        function profiler (mcount). This lets tracing be configured
 153        in with practically no overhead in performance.  This also
 154        has a side effect of enabling or disabling specific functions
 155        to be traced. Echoing names of functions into this file
 156        will limit the trace to only those functions.
 157
 158        This interface also allows for commands to be used. See the
 159        "Filter commands" section for more details.
 160
 161  set_ftrace_notrace:
 162
 163        This has an effect opposite to that of
 164        set_ftrace_filter. Any function that is added here will not
 165        be traced. If a function exists in both set_ftrace_filter
 166        and set_ftrace_notrace, the function will _not_ be traced.
 167
 168  set_ftrace_pid:
 169
 170        Have the function tracer only trace a single thread.
 171
 172  set_graph_function:
 173
 174        Set a "trigger" function where tracing should start
 175        with the function graph tracer (See the section
 176        "dynamic ftrace" for more details).
 177
 178  available_filter_functions:
 179
 180        This lists the functions that ftrace
 181        has processed and can trace. These are the function
 182        names that you can pass to "set_ftrace_filter" or
 183        "set_ftrace_notrace". (See the section "dynamic ftrace"
 184        below for more details.)
 185
 186
 187The Tracers
 188-----------
 189
 190Here is the list of current tracers that may be configured.
 191
 192  "function"
 193
 194        Function call tracer to trace all kernel functions.
 195
 196  "function_graph"
 197
 198        Similar to the function tracer except that the
 199        function tracer probes the functions on their entry
 200        whereas the function graph tracer traces on both entry
 201        and exit of the functions. It then provides the ability
 202        to draw a graph of function calls similar to C code
 203        source.
 204
 205  "irqsoff"
 206
 207        Traces the areas that disable interrupts and saves
 208        the trace with the longest max latency.
 209        See tracing_max_latency. When a new max is recorded,
 210        it replaces the old trace. It is best to view this
 211        trace with the latency-format option enabled.
 212
 213  "preemptoff"
 214
 215        Similar to irqsoff but traces and records the amount of
 216        time for which preemption is disabled.
 217
 218  "preemptirqsoff"
 219
 220        Similar to irqsoff and preemptoff, but traces and
 221        records the largest time for which irqs and/or preemption
 222        is disabled.
 223
 224  "wakeup"
 225
 226        Traces and records the max latency that it takes for
 227        the highest priority task to get scheduled after
 228        it has been woken up.
 229
 230  "hw-branch-tracer"
 231
 232        Uses the BTS CPU feature on x86 CPUs to traces all
 233        branches executed.
 234
 235  "nop"
 236
 237        This is the "trace nothing" tracer. To remove all
 238        tracers from tracing simply echo "nop" into
 239        current_tracer.
 240
 241
 242Examples of using the tracer
 243----------------------------
 244
 245Here are typical examples of using the tracers when controlling
 246them only with the debugfs interface (without using any
 247user-land utilities).
 248
 249Output format:
 250--------------
 251
 252Here is an example of the output format of the file "trace"
 253
 254                             --------
 255# tracer: function
 256#
 257#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
 258#              | |      |          |         |
 259            bash-4251  [01] 10152.583854: path_put <-path_walk
 260            bash-4251  [01] 10152.583855: dput <-path_put
 261            bash-4251  [01] 10152.583855: _atomic_dec_and_lock <-dput
 262                             --------
 263
 264A header is printed with the tracer name that is represented by
 265the trace. In this case the tracer is "function". Then a header
 266showing the format. Task name "bash", the task PID "4251", the
 267CPU that it was running on "01", the timestamp in <secs>.<usecs>
 268format, the function name that was traced "path_put" and the
 269parent function that called this function "path_walk". The
 270timestamp is the time at which the function was entered.
 271
 272Latency trace format
 273--------------------
 274
 275When the latency-format option is enabled, the trace file gives
 276somewhat more information to see why a latency happened.
 277Here is a typical trace.
 278
 279# tracer: irqsoff
 280#
 281irqsoff latency trace v1.1.5 on 2.6.26-rc8
 282--------------------------------------------------------------------
 283 latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
 284    -----------------
 285    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
 286    -----------------
 287 => started at: apic_timer_interrupt
 288 => ended at:   do_softirq
 289
 290#                _------=> CPU#
 291#               / _-----=> irqs-off
 292#              | / _----=> need-resched
 293#              || / _---=> hardirq/softirq
 294#              ||| / _--=> preempt-depth
 295#              |||| /
 296#              |||||     delay
 297#  cmd     pid ||||| time  |   caller
 298#     \   /    |||||   \   |   /
 299  <idle>-0     0d..1    0us+: trace_hardirqs_off_thunk (apic_timer_interrupt)
 300  <idle>-0     0d.s.   97us : __do_softirq (do_softirq)
 301  <idle>-0     0d.s1   98us : trace_hardirqs_on (do_softirq)
 302
 303
 304This shows that the current tracer is "irqsoff" tracing the time
 305for which interrupts were disabled. It gives the trace version
 306and the version of the kernel upon which this was executed on
 307(2.6.26-rc8). Then it displays the max latency in microsecs (97
 308us). The number of trace entries displayed and the total number
 309recorded (both are three: #3/3). The type of preemption that was
 310used (PREEMPT). VP, KP, SP, and HP are always zero and are
 311reserved for later use. #P is the number of online CPUS (#P:2).
 312
 313The task is the process that was running when the latency
 314occurred. (swapper pid: 0).
 315
 316The start and stop (the functions in which the interrupts were
 317disabled and enabled respectively) that caused the latencies:
 318
 319  apic_timer_interrupt is where the interrupts were disabled.
 320  do_softirq is where they were enabled again.
 321
 322The next lines after the header are the trace itself. The header
 323explains which is which.
 324
 325  cmd: The name of the process in the trace.
 326
 327  pid: The PID of that process.
 328
 329  CPU#: The CPU which the process was running on.
 330
 331  irqs-off: 'd' interrupts are disabled. '.' otherwise.
 332            Note: If the architecture does not support a way to
 333                  read the irq flags variable, an 'X' will always
 334                  be printed here.
 335
 336  need-resched: 'N' task need_resched is set, '.' otherwise.
 337
 338  hardirq/softirq:
 339        'H' - hard irq occurred inside a softirq.
 340        'h' - hard irq is running
 341        's' - soft irq is running
 342        '.' - normal context.
 343
 344  preempt-depth: The level of preempt_disabled
 345
 346The above is mostly meaningful for kernel developers.
 347
 348  time: When the latency-format option is enabled, the trace file
 349        output includes a timestamp relative to the start of the
 350        trace. This differs from the output when latency-format
 351        is disabled, which includes an absolute timestamp.
 352
 353  delay: This is just to help catch your eye a bit better. And
 354         needs to be fixed to be only relative to the same CPU.
 355         The marks are determined by the difference between this
 356         current trace and the next trace.
 357          '!' - greater than preempt_mark_thresh (default 100)
 358          '+' - greater than 1 microsecond
 359          ' ' - less than or equal to 1 microsecond.
 360
 361  The rest is the same as the 'trace' file.
 362
 363
 364trace_options
 365-------------
 366
 367The trace_options file is used to control what gets printed in
 368the trace output. To see what is available, simply cat the file:
 369
 370  cat trace_options
 371  print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
 372  noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj
 373
 374To disable one of the options, echo in the option prepended with
 375"no".
 376
 377  echo noprint-parent > trace_options
 378
 379To enable an option, leave off the "no".
 380
 381  echo sym-offset > trace_options
 382
 383Here are the available options:
 384
 385  print-parent - On function traces, display the calling (parent)
 386                 function as well as the function being traced.
 387
 388  print-parent:
 389   bash-4000  [01]  1477.606694: simple_strtoul <-strict_strtoul
 390
 391  noprint-parent:
 392   bash-4000  [01]  1477.606694: simple_strtoul
 393
 394
 395  sym-offset - Display not only the function name, but also the
 396               offset in the function. For example, instead of
 397               seeing just "ktime_get", you will see
 398               "ktime_get+0xb/0x20".
 399
 400  sym-offset:
 401   bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0
 402
 403  sym-addr - this will also display the function address as well
 404             as the function name.
 405
 406  sym-addr:
 407   bash-4000  [01]  1477.606694: simple_strtoul <c0339346>
 408
 409  verbose - This deals with the trace file when the
 410            latency-format option is enabled.
 411
 412    bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
 413    (+0.000ms): simple_strtoul (strict_strtoul)
 414
 415  raw - This will display raw numbers. This option is best for
 416        use with user applications that can translate the raw
 417        numbers better than having it done in the kernel.
 418
 419  hex - Similar to raw, but the numbers will be in a hexadecimal
 420        format.
 421
 422  bin - This will print out the formats in raw binary.
 423
 424  block - TBD (needs update)
 425
 426  stacktrace - This is one of the options that changes the trace
 427               itself. When a trace is recorded, so is the stack
 428               of functions. This allows for back traces of
 429               trace sites.
 430
 431  userstacktrace - This option changes the trace. It records a
 432                   stacktrace of the current userspace thread.
 433
 434  sym-userobj - when user stacktrace are enabled, look up which
 435                object the address belongs to, and print a
 436                relative address. This is especially useful when
 437                ASLR is on, otherwise you don't get a chance to
 438                resolve the address to object/file/line after
 439                the app is no longer running
 440
 441                The lookup is performed when you read
 442                trace,trace_pipe. Example:
 443
 444                a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
 445x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
 446
 447  sched-tree - trace all tasks that are on the runqueue, at
 448               every scheduling event. Will add overhead if
 449               there's a lot of tasks running at once.
 450
 451  latency-format - This option changes the trace. When
 452                   it is enabled, the trace displays
 453                   additional information about the
 454                   latencies, as described in "Latency
 455                   trace format".
 456
 457  overwrite - This controls what happens when the trace buffer is
 458              full. If "1" (default), the oldest events are
 459              discarded and overwritten. If "0", then the newest
 460              events are discarded.
 461
 462ftrace_enabled
 463--------------
 464
 465The following tracers (listed below) give different output
 466depending on whether or not the sysctl ftrace_enabled is set. To
 467set ftrace_enabled, one can either use the sysctl function or
 468set it via the proc file system interface.
 469
 470  sysctl kernel.ftrace_enabled=1
 471
 472 or
 473
 474  echo 1 > /proc/sys/kernel/ftrace_enabled
 475
 476To disable ftrace_enabled simply replace the '1' with '0' in the
 477above commands.
 478
 479When ftrace_enabled is set the tracers will also record the
 480functions that are within the trace. The descriptions of the
 481tracers will also show an example with ftrace enabled.
 482
 483
 484irqsoff
 485-------
 486
 487When interrupts are disabled, the CPU can not react to any other
 488external event (besides NMIs and SMIs). This prevents the timer
 489interrupt from triggering or the mouse interrupt from letting
 490the kernel know of a new mouse event. The result is a latency
 491with the reaction time.
 492
 493The irqsoff tracer tracks the time for which interrupts are
 494disabled. When a new maximum latency is hit, the tracer saves
 495the trace leading up to that latency point so that every time a
 496new maximum is reached, the old saved trace is discarded and the
 497new trace is saved.
 498
 499To reset the maximum, echo 0 into tracing_max_latency. Here is
 500an example:
 501
 502 # echo irqsoff > current_tracer
 503 # echo latency-format > trace_options
 504 # echo 0 > tracing_max_latency
 505 # echo 1 > tracing_on
 506 # ls -ltr
 507 [...]
 508 # echo 0 > tracing_on
 509 # cat trace
 510# tracer: irqsoff
 511#
 512irqsoff latency trace v1.1.5 on 2.6.26
 513--------------------------------------------------------------------
 514 latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
 515    -----------------
 516    | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0)
 517    -----------------
 518 => started at: sys_setpgid
 519 => ended at:   sys_setpgid
 520
 521#                _------=> CPU#
 522#               / _-----=> irqs-off
 523#              | / _----=> need-resched
 524#              || / _---=> hardirq/softirq
 525#              ||| / _--=> preempt-depth
 526#              |||| /
 527#              |||||     delay
 528#  cmd     pid ||||| time  |   caller
 529#     \   /    |||||   \   |   /
 530    bash-3730  1d...    0us : _write_lock_irq (sys_setpgid)
 531    bash-3730  1d..1    1us+: _write_unlock_irq (sys_setpgid)
 532    bash-3730  1d..2   14us : trace_hardirqs_on (sys_setpgid)
 533
 534
 535Here we see that that we had a latency of 12 microsecs (which is
 536very good). The _write_lock_irq in sys_setpgid disabled
 537interrupts. The difference between the 12 and the displayed
 538timestamp 14us occurred because the clock was incremented
 539between the time of recording the max latency and the time of
 540recording the function that had that latency.
 541
 542Note the above example had ftrace_enabled not set. If we set the
 543ftrace_enabled, we get a much larger output:
 544
 545# tracer: irqsoff
 546#
 547irqsoff latency trace v1.1.5 on 2.6.26-rc8
 548--------------------------------------------------------------------
 549 latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
 550    -----------------
 551    | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0)
 552    -----------------
 553 => started at: __alloc_pages_internal
 554 => ended at:   __alloc_pages_internal
 555
 556#                _------=> CPU#
 557#               / _-----=> irqs-off
 558#              | / _----=> need-resched
 559#              || / _---=> hardirq/softirq
 560#              ||| / _--=> preempt-depth
 561#              |||| /
 562#              |||||     delay
 563#  cmd     pid ||||| time  |   caller
 564#     \   /    |||||   \   |   /
 565      ls-4339  0...1    0us+: get_page_from_freelist (__alloc_pages_internal)
 566      ls-4339  0d..1    3us : rmqueue_bulk (get_page_from_freelist)
 567      ls-4339  0d..1    3us : _spin_lock (rmqueue_bulk)
 568      ls-4339  0d..1    4us : add_preempt_count (_spin_lock)
 569      ls-4339  0d..2    4us : __rmqueue (rmqueue_bulk)
 570      ls-4339  0d..2    5us : __rmqueue_smallest (__rmqueue)
 571      ls-4339  0d..2    5us : __mod_zone_page_state (__rmqueue_smallest)
 572      ls-4339  0d..2    6us : __rmqueue (rmqueue_bulk)
 573      ls-4339  0d..2    6us : __rmqueue_smallest (__rmqueue)
 574      ls-4339  0d..2    7us : __mod_zone_page_state (__rmqueue_smallest)
 575      ls-4339  0d..2    7us : __rmqueue (rmqueue_bulk)
 576      ls-4339  0d..2    8us : __rmqueue_smallest (__rmqueue)
 577[...]
 578      ls-4339  0d..2   46us : __rmqueue_smallest (__rmqueue)
 579      ls-4339  0d..2   47us : __mod_zone_page_state (__rmqueue_smallest)
 580      ls-4339  0d..2   47us : __rmqueue (rmqueue_bulk)
 581      ls-4339  0d..2   48us : __rmqueue_smallest (__rmqueue)
 582      ls-4339  0d..2   48us : __mod_zone_page_state (__rmqueue_smallest)
 583      ls-4339  0d..2   49us : _spin_unlock (rmqueue_bulk)
 584      ls-4339  0d..2   49us : sub_preempt_count (_spin_unlock)
 585      ls-4339  0d..1   50us : get_page_from_freelist (__alloc_pages_internal)
 586      ls-4339  0d..2   51us : trace_hardirqs_on (__alloc_pages_internal)
 587
 588
 589
 590Here we traced a 50 microsecond latency. But we also see all the
 591functions that were called during that time. Note that by
 592enabling function tracing, we incur an added overhead. This
 593overhead may extend the latency times. But nevertheless, this
 594trace has provided some very helpful debugging information.
 595
 596
 597preemptoff
 598----------
 599
 600When preemption is disabled, we may be able to receive
 601interrupts but the task cannot be preempted and a higher
 602priority task must wait for preemption to be enabled again
 603before it can preempt a lower priority task.
 604
 605The preemptoff tracer traces the places that disable preemption.
 606Like the irqsoff tracer, it records the maximum latency for
 607which preemption was disabled. The control of preemptoff tracer
 608is much like the irqsoff tracer.
 609
 610 # echo preemptoff > current_tracer
 611 # echo latency-format > trace_options
 612 # echo 0 > tracing_max_latency
 613 # echo 1 > tracing_on
 614 # ls -ltr
 615 [...]
 616 # echo 0 > tracing_on
 617 # cat trace
 618# tracer: preemptoff
 619#
 620preemptoff latency trace v1.1.5 on 2.6.26-rc8
 621--------------------------------------------------------------------
 622 latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
 623    -----------------
 624    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
 625    -----------------
 626 => started at: do_IRQ
 627 => ended at:   __do_softirq
 628
 629#                _------=> CPU#
 630#               / _-----=> irqs-off
 631#              | / _----=> need-resched
 632#              || / _---=> hardirq/softirq
 633#              ||| / _--=> preempt-depth
 634#              |||| /
 635#              |||||     delay
 636#  cmd     pid ||||| time  |   caller
 637#     \   /    |||||   \   |   /
 638    sshd-4261  0d.h.    0us+: irq_enter (do_IRQ)
 639    sshd-4261  0d.s.   29us : _local_bh_enable (__do_softirq)
 640    sshd-4261  0d.s1   30us : trace_preempt_on (__do_softirq)
 641
 642
 643This has some more changes. Preemption was disabled when an
 644interrupt came in (notice the 'h'), and was enabled while doing
 645a softirq. (notice the 's'). But we also see that interrupts
 646have been disabled when entering the preempt off section and
 647leaving it (the 'd'). We do not know if interrupts were enabled
 648in the mean time.
 649
 650# tracer: preemptoff
 651#
 652preemptoff latency trace v1.1.5 on 2.6.26-rc8
 653--------------------------------------------------------------------
 654 latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
 655    -----------------
 656    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
 657    -----------------
 658 => started at: remove_wait_queue
 659 => ended at:   __do_softirq
 660
 661#                _------=> CPU#
 662#               / _-----=> irqs-off
 663#              | / _----=> need-resched
 664#              || / _---=> hardirq/softirq
 665#              ||| / _--=> preempt-depth
 666#              |||| /
 667#              |||||     delay
 668#  cmd     pid ||||| time  |   caller
 669#     \   /    |||||   \   |   /
 670    sshd-4261  0d..1    0us : _spin_lock_irqsave (remove_wait_queue)
 671    sshd-4261  0d..1    1us : _spin_unlock_irqrestore (remove_wait_queue)
 672    sshd-4261  0d..1    2us : do_IRQ (common_interrupt)
 673    sshd-4261  0d..1    2us : irq_enter (do_IRQ)
 674    sshd-4261  0d..1    2us : idle_cpu (irq_enter)
 675    sshd-4261  0d..1    3us : add_preempt_count (irq_enter)
 676    sshd-4261  0d.h1    3us : idle_cpu (irq_enter)
 677    sshd-4261  0d.h.    4us : handle_fasteoi_irq (do_IRQ)
 678[...]
 679    sshd-4261  0d.h.   12us : add_preempt_count (_spin_lock)
 680    sshd-4261  0d.h1   12us : ack_ioapic_quirk_irq (handle_fasteoi_irq)
 681    sshd-4261  0d.h1   13us : move_native_irq (ack_ioapic_quirk_irq)
 682    sshd-4261  0d.h1   13us : _spin_unlock (handle_fasteoi_irq)
 683    sshd-4261  0d.h1   14us : sub_preempt_count (_spin_unlock)
 684    sshd-4261  0d.h1   14us : irq_exit (do_IRQ)
 685    sshd-4261  0d.h1   15us : sub_preempt_count (irq_exit)
 686    sshd-4261  0d..2   15us : do_softirq (irq_exit)
 687    sshd-4261  0d...   15us : __do_softirq (do_softirq)
 688    sshd-4261  0d...   16us : __local_bh_disable (__do_softirq)
 689    sshd-4261  0d...   16us+: add_preempt_count (__local_bh_disable)
 690    sshd-4261  0d.s4   20us : add_preempt_count (__local_bh_disable)
 691    sshd-4261  0d.s4   21us : sub_preempt_count (local_bh_enable)
 692    sshd-4261  0d.s5   21us : sub_preempt_count (local_bh_enable)
 693[...]
 694    sshd-4261  0d.s6   41us : add_preempt_count (__local_bh_disable)
 695    sshd-4261  0d.s6   42us : sub_preempt_count (local_bh_enable)
 696    sshd-4261  0d.s7   42us : sub_preempt_count (local_bh_enable)
 697    sshd-4261  0d.s5   43us : add_preempt_count (__local_bh_disable)
 698    sshd-4261  0d.s5   43us : sub_preempt_count (local_bh_enable_ip)
 699    sshd-4261  0d.s6   44us : sub_preempt_count (local_bh_enable_ip)
 700    sshd-4261  0d.s5   44us : add_preempt_count (__local_bh_disable)
 701    sshd-4261  0d.s5   45us : sub_preempt_count (local_bh_enable)
 702[...]
 703    sshd-4261  0d.s.   63us : _local_bh_enable (__do_softirq)
 704    sshd-4261  0d.s1   64us : trace_preempt_on (__do_softirq)
 705
 706
 707The above is an example of the preemptoff trace with
 708ftrace_enabled set. Here we see that interrupts were disabled
 709the entire time. The irq_enter code lets us know that we entered
 710an interrupt 'h'. Before that, the functions being traced still
 711show that it is not in an interrupt, but we can see from the
 712functions themselves that this is not the case.
 713
 714Notice that __do_softirq when called does not have a
 715preempt_count. It may seem that we missed a preempt enabling.
 716What really happened is that the preempt count is held on the
 717thread's stack and we switched to the softirq stack (4K stacks
 718in effect). The code does not copy the preempt count, but
 719because interrupts are disabled, we do not need to worry about
 720it. Having a tracer like this is good for letting people know
 721what really happens inside the kernel.
 722
 723
 724preemptirqsoff
 725--------------
 726
 727Knowing the locations that have interrupts disabled or
 728preemption disabled for the longest times is helpful. But
 729sometimes we would like to know when either preemption and/or
 730interrupts are disabled.
 731
 732Consider the following code:
 733
 734    local_irq_disable();
 735    call_function_with_irqs_off();
 736    preempt_disable();
 737    call_function_with_irqs_and_preemption_off();
 738    local_irq_enable();
 739    call_function_with_preemption_off();
 740    preempt_enable();
 741
 742The irqsoff tracer will record the total length of
 743call_function_with_irqs_off() and
 744call_function_with_irqs_and_preemption_off().
 745
 746The preemptoff tracer will record the total length of
 747call_function_with_irqs_and_preemption_off() and
 748call_function_with_preemption_off().
 749
 750But neither will trace the time that interrupts and/or
 751preemption is disabled. This total time is the time that we can
 752not schedule. To record this time, use the preemptirqsoff
 753tracer.
 754
 755Again, using this trace is much like the irqsoff and preemptoff
 756tracers.
 757
 758 # echo preemptirqsoff > current_tracer
 759 # echo latency-format > trace_options
 760 # echo 0 > tracing_max_latency
 761 # echo 1 > tracing_on
 762 # ls -ltr
 763 [...]
 764 # echo 0 > tracing_on
 765 # cat trace
 766# tracer: preemptirqsoff
 767#
 768preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
 769--------------------------------------------------------------------
 770 latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
 771    -----------------
 772    | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0)
 773    -----------------
 774 => started at: apic_timer_interrupt
 775 => ended at:   __do_softirq
 776
 777#                _------=> CPU#
 778#               / _-----=> irqs-off
 779#              | / _----=> need-resched
 780#              || / _---=> hardirq/softirq
 781#              ||| / _--=> preempt-depth
 782#              |||| /
 783#              |||||     delay
 784#  cmd     pid ||||| time  |   caller
 785#     \   /    |||||   \   |   /
 786      ls-4860  0d...    0us!: trace_hardirqs_off_thunk (apic_timer_interrupt)
 787      ls-4860  0d.s.  294us : _local_bh_enable (__do_softirq)
 788      ls-4860  0d.s1  294us : trace_preempt_on (__do_softirq)
 789
 790
 791
 792The trace_hardirqs_off_thunk is called from assembly on x86 when
 793interrupts are disabled in the assembly code. Without the
 794function tracing, we do not know if interrupts were enabled
 795within the preemption points. We do see that it started with
 796preemption enabled.
 797
 798Here is a trace with ftrace_enabled set:
 799
 800
 801# tracer: preemptirqsoff
 802#
 803preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
 804--------------------------------------------------------------------
 805 latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
 806    -----------------
 807    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
 808    -----------------
 809 => started at: write_chan
 810 => ended at:   __do_softirq
 811
 812#                _------=> CPU#
 813#               / _-----=> irqs-off
 814#              | / _----=> need-resched
 815#              || / _---=> hardirq/softirq
 816#              ||| / _--=> preempt-depth
 817#              |||| /
 818#              |||||     delay
 819#  cmd     pid ||||| time  |   caller
 820#     \   /    |||||   \   |   /
 821      ls-4473  0.N..    0us : preempt_schedule (write_chan)
 822      ls-4473  0dN.1    1us : _spin_lock (schedule)
 823      ls-4473  0dN.1    2us : add_preempt_count (_spin_lock)
 824      ls-4473  0d..2    2us : put_prev_task_fair (schedule)
 825[...]
 826      ls-4473  0d..2   13us : set_normalized_timespec (ktime_get_ts)
 827      ls-4473  0d..2   13us : __switch_to (schedule)
 828    sshd-4261  0d..2   14us : finish_task_switch (schedule)
 829    sshd-4261  0d..2   14us : _spin_unlock_irq (finish_task_switch)
 830    sshd-4261  0d..1   15us : add_preempt_count (_spin_lock_irqsave)
 831    sshd-4261  0d..2   16us : _spin_unlock_irqrestore (hrtick_set)
 832    sshd-4261  0d..2   16us : do_IRQ (common_interrupt)
 833    sshd-4261  0d..2   17us : irq_enter (do_IRQ)
 834    sshd-4261  0d..2   17us : idle_cpu (irq_enter)
 835    sshd-4261  0d..2   18us : add_preempt_count (irq_enter)
 836    sshd-4261  0d.h2   18us : idle_cpu (irq_enter)
 837    sshd-4261  0d.h.   18us : handle_fasteoi_irq (do_IRQ)
 838    sshd-4261  0d.h.   19us : _spin_lock (handle_fasteoi_irq)
 839    sshd-4261  0d.h.   19us : add_preempt_count (_spin_lock)
 840    sshd-4261  0d.h1   20us : _spin_unlock (handle_fasteoi_irq)
 841    sshd-4261  0d.h1   20us : sub_preempt_count (_spin_unlock)
 842[...]
 843    sshd-4261  0d.h1   28us : _spin_unlock (handle_fasteoi_irq)
 844    sshd-4261  0d.h1   29us : sub_preempt_count (_spin_unlock)
 845    sshd-4261  0d.h2   29us : irq_exit (do_IRQ)
 846    sshd-4261  0d.h2   29us : sub_preempt_count (irq_exit)
 847    sshd-4261  0d..3   30us : do_softirq (irq_exit)
 848    sshd-4261  0d...   30us : __do_softirq (do_softirq)
 849    sshd-4261  0d...   31us : __local_bh_disable (__do_softirq)
 850    sshd-4261  0d...   31us+: add_preempt_count (__local_bh_disable)
 851    sshd-4261  0d.s4   34us : add_preempt_count (__local_bh_disable)
 852[...]
 853    sshd-4261  0d.s3   43us : sub_preempt_count (local_bh_enable_ip)
 854    sshd-4261  0d.s4   44us : sub_preempt_count (local_bh_enable_ip)
 855    sshd-4261  0d.s3   44us : smp_apic_timer_interrupt (apic_timer_interrupt)
 856    sshd-4261  0d.s3   45us : irq_enter (smp_apic_timer_interrupt)
 857    sshd-4261  0d.s3   45us : idle_cpu (irq_enter)
 858    sshd-4261  0d.s3   46us : add_preempt_count (irq_enter)
 859    sshd-4261  0d.H3   46us : idle_cpu (irq_enter)
 860    sshd-4261  0d.H3   47us : hrtimer_interrupt (smp_apic_timer_interrupt)
 861    sshd-4261  0d.H3   47us : ktime_get (hrtimer_interrupt)
 862[...]
 863    sshd-4261  0d.H3   81us : tick_program_event (hrtimer_interrupt)
 864    sshd-4261  0d.H3   82us : ktime_get (tick_program_event)
 865    sshd-4261  0d.H3   82us : ktime_get_ts (ktime_get)
 866    sshd-4261  0d.H3   83us : getnstimeofday (ktime_get_ts)
 867    sshd-4261  0d.H3   83us : set_normalized_timespec (ktime_get_ts)
 868    sshd-4261  0d.H3   84us : clockevents_program_event (tick_program_event)
 869    sshd-4261  0d.H3   84us : lapic_next_event (clockevents_program_event)
 870    sshd-4261  0d.H3   85us : irq_exit (smp_apic_timer_interrupt)
 871    sshd-4261  0d.H3   85us : sub_preempt_count (irq_exit)
 872    sshd-4261  0d.s4   86us : sub_preempt_count (irq_exit)
 873    sshd-4261  0d.s3   86us : add_preempt_count (__local_bh_disable)
 874[...]
 875    sshd-4261  0d.s1   98us : sub_preempt_count (net_rx_action)
 876    sshd-4261  0d.s.   99us : add_preempt_count (_spin_lock_irq)
 877    sshd-4261  0d.s1   99us+: _spin_unlock_irq (run_timer_softirq)
 878    sshd-4261  0d.s.  104us : _local_bh_enable (__do_softirq)
 879    sshd-4261  0d.s.  104us : sub_preempt_count (_local_bh_enable)
 880    sshd-4261  0d.s.  105us : _local_bh_enable (__do_softirq)
 881    sshd-4261  0d.s1  105us : trace_preempt_on (__do_softirq)
 882
 883
 884This is a very interesting trace. It started with the preemption
 885of the ls task. We see that the task had the "need_resched" bit
 886set via the 'N' in the trace.  Interrupts were disabled before
 887the spin_lock at the beginning of the trace. We see that a
 888schedule took place to run sshd.  When the interrupts were
 889enabled, we took an interrupt. On return from the interrupt
 890handler, the softirq ran. We took another interrupt while
 891running the softirq as we see from the capital 'H'.
 892
 893
 894wakeup
 895------
 896
 897In a Real-Time environment it is very important to know the
 898wakeup time it takes for the highest priority task that is woken
 899up to the time that it executes. This is also known as "schedule
 900latency". I stress the point that this is about RT tasks. It is
 901also important to know the scheduling latency of non-RT tasks,
 902but the average schedule latency is better for non-RT tasks.
 903Tools like LatencyTop are more appropriate for such
 904measurements.
 905
 906Real-Time environments are interested in the worst case latency.
 907That is the longest latency it takes for something to happen,
 908and not the average. We can have a very fast scheduler that may
 909only have a large latency once in a while, but that would not
 910work well with Real-Time tasks.  The wakeup tracer was designed
 911to record the worst case wakeups of RT tasks. Non-RT tasks are
 912not recorded because the tracer only records one worst case and
 913tracing non-RT tasks that are unpredictable will overwrite the
 914worst case latency of RT tasks.
 915
 916Since this tracer only deals with RT tasks, we will run this
 917slightly differently than we did with the previous tracers.
 918Instead of performing an 'ls', we will run 'sleep 1' under
 919'chrt' which changes the priority of the task.
 920
 921 # echo wakeup > current_tracer
 922 # echo latency-format > trace_options
 923 # echo 0 > tracing_max_latency
 924 # echo 1 > tracing_on
 925 # chrt -f 5 sleep 1
 926 # echo 0 > tracing_on
 927 # cat trace
 928# tracer: wakeup
 929#
 930wakeup latency trace v1.1.5 on 2.6.26-rc8
 931--------------------------------------------------------------------
 932 latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
 933    -----------------
 934    | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5)
 935    -----------------
 936
 937#                _------=> CPU#
 938#               / _-----=> irqs-off
 939#              | / _----=> need-resched
 940#              || / _---=> hardirq/softirq
 941#              ||| / _--=> preempt-depth
 942#              |||| /
 943#              |||||     delay
 944#  cmd     pid ||||| time  |   caller
 945#     \   /    |||||   \   |   /
 946  <idle>-0     1d.h4    0us+: try_to_wake_up (wake_up_process)
 947  <idle>-0     1d..4    4us : schedule (cpu_idle)
 948
 949
 950Running this on an idle system, we see that it only took 4
 951microseconds to perform the task switch.  Note, since the trace
 952marker in the schedule is before the actual "switch", we stop
 953the tracing when the recorded task is about to schedule in. This
 954may change if we add a new marker at the end of the scheduler.
 955
 956Notice that the recorded task is 'sleep' with the PID of 4901
 957and it has an rt_prio of 5. This priority is user-space priority
 958and not the internal kernel priority. The policy is 1 for
 959SCHED_FIFO and 2 for SCHED_RR.
 960
 961Doing the same with chrt -r 5 and ftrace_enabled set.
 962
 963# tracer: wakeup
 964#
 965wakeup latency trace v1.1.5 on 2.6.26-rc8
 966--------------------------------------------------------------------
 967 latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
 968    -----------------
 969    | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5)
 970    -----------------
 971
 972#                _------=> CPU#
 973#               / _-----=> irqs-off
 974#              | / _----=> need-resched
 975#              || / _---=> hardirq/softirq
 976#              ||| / _--=> preempt-depth
 977#              |||| /
 978#              |||||     delay
 979#  cmd     pid ||||| time  |   caller
 980#     \   /    |||||   \   |   /
 981ksoftirq-7     1d.H3    0us : try_to_wake_up (wake_up_process)
 982ksoftirq-7     1d.H4    1us : sub_preempt_count (marker_probe_cb)
 983ksoftirq-7     1d.H3    2us : check_preempt_wakeup (try_to_wake_up)
 984ksoftirq-7     1d.H3    3us : update_curr (check_preempt_wakeup)
 985ksoftirq-7     1d.H3    4us : calc_delta_mine (update_curr)
 986ksoftirq-7     1d.H3    5us : __resched_task (check_preempt_wakeup)
 987ksoftirq-7     1d.H3    6us : task_wake_up_rt (try_to_wake_up)
 988ksoftirq-7     1d.H3    7us : _spin_unlock_irqrestore (try_to_wake_up)
 989[...]
 990ksoftirq-7     1d.H2   17us : irq_exit (smp_apic_timer_interrupt)
 991ksoftirq-7     1d.H2   18us : sub_preempt_count (irq_exit)
 992ksoftirq-7     1d.s3   19us : sub_preempt_count (irq_exit)
 993ksoftirq-7     1..s2   20us : rcu_process_callbacks (__do_softirq)
 994[...]
 995ksoftirq-7     1..s2   26us : __rcu_process_callbacks (rcu_process_callbacks)
 996ksoftirq-7     1d.s2   27us : _local_bh_enable (__do_softirq)
 997ksoftirq-7     1d.s2   28us : sub_preempt_count (_local_bh_enable)
 998ksoftirq-7     1.N.3   29us : sub_preempt_count (ksoftirqd)
 999ksoftirq-7     1.N.2   30us : _cond_resched (ksoftirqd)
1000ksoftirq-7     1.N.2   31us : __cond_resched (_cond_resched)
1001ksoftirq-7     1.N.2   32us : add_preempt_count (__cond_resched)
1002ksoftirq-7     1.N.2   33us : schedule (__cond_resched)
1003ksoftirq-7     1.N.2   33us : add_preempt_count (schedule)
1004ksoftirq-7     1.N.3   34us : hrtick_clear (schedule)
1005ksoftirq-7     1dN.3   35us : _spin_lock (schedule)
1006ksoftirq-7     1dN.3   36us : add_preempt_count (_spin_lock)
1007ksoftirq-7     1d..4   37us : put_prev_task_fair (schedule)
1008ksoftirq-7     1d..4   38us : update_curr (put_prev_task_fair)
1009[...]
1010ksoftirq-7     1d..5   47us : _spin_trylock (tracing_record_cmdline)
1011ksoftirq-7     1d..5   48us : add_preempt_count (_spin_trylock)
1012ksoftirq-7     1d..6   49us : _spin_unlock (tracing_record_cmdline)
1013ksoftirq-7     1d..6   49us : sub_preempt_count (_spin_unlock)
1014ksoftirq-7     1d..4   50us : schedule (__cond_resched)
1015
1016The interrupt went off while running ksoftirqd. This task runs
1017at SCHED_OTHER. Why did not we see the 'N' set early? This may
1018be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K
1019stacks configured, the interrupt and softirq run with their own
1020stack. Some information is held on the top of the task's stack
1021(need_resched and preempt_count are both stored there). The
1022setting of the NEED_RESCHED bit is done directly to the task's
1023stack, but the reading of the NEED_RESCHED is done by looking at
1024the current stack, which in this case is the stack for the hard
1025interrupt. This hides the fact that NEED_RESCHED has been set.
1026We do not see the 'N' until we switch back to the task's
1027assigned stack.
1028
1029function
1030--------
1031
1032This tracer is the function tracer. Enabling the function tracer
1033can be done from the debug file system. Make sure the
1034ftrace_enabled is set; otherwise this tracer is a nop.
1035
1036 # sysctl kernel.ftrace_enabled=1
1037 # echo function > current_tracer
1038 # echo 1 > tracing_on
1039 # usleep 1
1040 # echo 0 > tracing_on
1041 # cat trace
1042# tracer: function
1043#
1044#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1045#              | |      |          |         |
1046            bash-4003  [00]   123.638713: finish_task_switch <-schedule
1047            bash-4003  [00]   123.638714: _spin_unlock_irq <-finish_task_switch
1048            bash-4003  [00]   123.638714: sub_preempt_count <-_spin_unlock_irq
1049            bash-4003  [00]   123.638715: hrtick_set <-schedule
1050            bash-4003  [00]   123.638715: _spin_lock_irqsave <-hrtick_set
1051            bash-4003  [00]   123.638716: add_preempt_count <-_spin_lock_irqsave
1052            bash-4003  [00]   123.638716: _spin_unlock_irqrestore <-hrtick_set
1053            bash-4003  [00]   123.638717: sub_preempt_count <-_spin_unlock_irqrestore
1054            bash-4003  [00]   123.638717: hrtick_clear <-hrtick_set
1055            bash-4003  [00]   123.638718: sub_preempt_count <-schedule
1056            bash-4003  [00]   123.638718: sub_preempt_count <-preempt_schedule
1057            bash-4003  [00]   123.638719: wait_for_completion <-__stop_machine_run
1058            bash-4003  [00]   123.638719: wait_for_common <-wait_for_completion
1059            bash-4003  [00]   123.638720: _spin_lock_irq <-wait_for_common
1060            bash-4003  [00]   123.638720: add_preempt_count <-_spin_lock_irq
1061[...]
1062
1063
1064Note: function tracer uses ring buffers to store the above
1065entries. The newest data may overwrite the oldest data.
1066Sometimes using echo to stop the trace is not sufficient because
1067the tracing could have overwritten the data that you wanted to
1068record. For this reason, it is sometimes better to disable
1069tracing directly from a program. This allows you to stop the
1070tracing at the point that you hit the part that you are
1071interested in. To disable the tracing directly from a C program,
1072something like following code snippet can be used:
1073
1074int trace_fd;
1075[...]
1076int main(int argc, char *argv[]) {
1077        [...]
1078        trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
1079        [...]
1080        if (condition_hit()) {
1081                write(trace_fd, "0", 1);
1082        }
1083        [...]
1084}
1085
1086
1087Single thread tracing
1088---------------------
1089
1090By writing into set_ftrace_pid you can trace a
1091single thread. For example:
1092
1093# cat set_ftrace_pid
1094no pid
1095# echo 3111 > set_ftrace_pid
1096# cat set_ftrace_pid
10973111
1098# echo function > current_tracer
1099# cat trace | head
1100 # tracer: function
1101 #
1102 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
1103 #              | |       |          |         |
1104     yum-updatesd-3111  [003]  1637.254676: finish_task_switch <-thread_return
1105     yum-updatesd-3111  [003]  1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
1106     yum-updatesd-3111  [003]  1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
1107     yum-updatesd-3111  [003]  1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
1108     yum-updatesd-3111  [003]  1637.254685: fget_light <-do_sys_poll
1109     yum-updatesd-3111  [003]  1637.254686: pipe_poll <-do_sys_poll
1110# echo -1 > set_ftrace_pid
1111# cat trace |head
1112 # tracer: function
1113 #
1114 #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
1115 #              | |       |          |         |
1116 ##### CPU 3 buffer started ####
1117     yum-updatesd-3111  [003]  1701.957688: free_poll_entry <-poll_freewait
1118     yum-updatesd-3111  [003]  1701.957689: remove_wait_queue <-free_poll_entry
1119     yum-updatesd-3111  [003]  1701.957691: fput <-free_poll_entry
1120     yum-updatesd-3111  [003]  1701.957692: audit_syscall_exit <-sysret_audit
1121     yum-updatesd-3111  [003]  1701.957693: path_put <-audit_syscall_exit
1122
1123If you want to trace a function when executing, you could use
1124something like this simple program:
1125
1126#include <stdio.h>
1127#include <stdlib.h>
1128#include <sys/types.h>
1129#include <sys/stat.h>
1130#include <fcntl.h>
1131#include <unistd.h>
1132#include <string.h>
1133
1134#define _STR(x) #x
1135#define STR(x) _STR(x)
1136#define MAX_PATH 256
1137
1138const char *find_debugfs(void)
1139{
1140       static char debugfs[MAX_PATH+1];
1141       static int debugfs_found;
1142       char type[100];
1143       FILE *fp;
1144
1145       if (debugfs_found)
1146               return debugfs;
1147
1148       if ((fp = fopen("/proc/mounts","r")) == NULL) {
1149               perror("/proc/mounts");
1150               return NULL;
1151       }
1152
1153       while (fscanf(fp, "%*s %"
1154                     STR(MAX_PATH)
1155                     "s %99s %*s %*d %*d\n",
1156                     debugfs, type) == 2) {
1157               if (strcmp(type, "debugfs") == 0)
1158                       break;
1159       }
1160       fclose(fp);
1161
1162       if (strcmp(type, "debugfs") != 0) {
1163               fprintf(stderr, "debugfs not mounted");
1164               return NULL;
1165       }
1166
1167       strcat(debugfs, "/tracing/");
1168       debugfs_found = 1;
1169
1170       return debugfs;
1171}
1172
1173const char *tracing_file(const char *file_name)
1174{
1175       static char trace_file[MAX_PATH+1];
1176       snprintf(trace_file, MAX_PATH, "%s/%s", find_debugfs(), file_name);
1177       return trace_file;
1178}
1179
1180int main (int argc, char **argv)
1181{
1182        if (argc < 1)
1183                exit(-1);
1184
1185        if (fork() > 0) {
1186                int fd, ffd;
1187                char line[64];
1188                int s;
1189
1190                ffd = open(tracing_file("current_tracer"), O_WRONLY);
1191                if (ffd < 0)
1192                        exit(-1);
1193                write(ffd, "nop", 3);
1194
1195                fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
1196                s = sprintf(line, "%d\n", getpid());
1197                write(fd, line, s);
1198
1199                write(ffd, "function", 8);
1200
1201                close(fd);
1202                close(ffd);
1203
1204                execvp(argv[1], argv+1);
1205        }
1206
1207        return 0;
1208}
1209
1210
1211hw-branch-tracer (x86 only)
1212---------------------------
1213
1214This tracer uses the x86 last branch tracing hardware feature to
1215collect a branch trace on all cpus with relatively low overhead.
1216
1217The tracer uses a fixed-size circular buffer per cpu and only
1218traces ring 0 branches. The trace file dumps that buffer in the
1219following format:
1220
1221# tracer: hw-branch-tracer
1222#
1223# CPU#        TO  <-  FROM
1224   0  scheduler_tick+0xb5/0x1bf   <-  task_tick_idle+0x5/0x6
1225   2  run_posix_cpu_timers+0x2b/0x72a     <-  run_posix_cpu_timers+0x25/0x72a
1226   0  scheduler_tick+0x139/0x1bf          <-  scheduler_tick+0xed/0x1bf
1227   0  scheduler_tick+0x17c/0x1bf          <-  scheduler_tick+0x148/0x1bf
1228   2  run_posix_cpu_timers+0x9e/0x72a     <-  run_posix_cpu_timers+0x5e/0x72a
1229   0  scheduler_tick+0x1b6/0x1bf          <-  scheduler_tick+0x1aa/0x1bf
1230
1231
1232The tracer may be used to dump the trace for the oops'ing cpu on
1233a kernel oops into the system log. To enable this,
1234ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one
1235can either use the sysctl function or set it via the proc system
1236interface.
1237
1238  sysctl kernel.ftrace_dump_on_oops=n
1239
1240or
1241
1242  echo n > /proc/sys/kernel/ftrace_dump_on_oops
1243
1244If n = 1, ftrace will dump buffers of all CPUs, if n = 2 ftrace will
1245only dump the buffer of the CPU that triggered the oops.
1246
1247Here's an example of such a dump after a null pointer
1248dereference in a kernel module:
1249
1250[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
1251[57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops]
1252[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0
1253[57848.106019] Oops: 0002 [#1] SMP
1254[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus
1255[57848.106019] Dumping ftrace buffer:
1256[57848.106019] ---------------------------------
1257[...]
1258[57848.106019]    0  chrdev_open+0xe6/0x165       <-  cdev_put+0x23/0x24
1259[57848.106019]    0  chrdev_open+0x117/0x165      <-  chrdev_open+0xfa/0x165
1260[57848.106019]    0  chrdev_open+0x120/0x165      <-  chrdev_open+0x11c/0x165
1261[57848.106019]    0  chrdev_open+0x134/0x165      <-  chrdev_open+0x12b/0x165
1262[57848.106019]    0  open+0x0/0x14 [oops]         <-  chrdev_open+0x144/0x165
1263[57848.106019]    0  page_fault+0x0/0x30          <-  open+0x6/0x14 [oops]
1264[57848.106019]    0  error_entry+0x0/0x5b         <-  page_fault+0x4/0x30
1265[57848.106019]    0  error_kernelspace+0x0/0x31   <-  error_entry+0x59/0x5b
1266[57848.106019]    0  error_sti+0x0/0x1    <-  error_kernelspace+0x2d/0x31
1267[57848.106019]    0  page_fault+0x9/0x30          <-  error_sti+0x0/0x1
1268[57848.106019]    0  do_page_fault+0x0/0x881      <-  page_fault+0x1a/0x30
1269[...]
1270[57848.106019]    0  do_page_fault+0x66b/0x881    <-  is_prefetch+0x1ee/0x1f2
1271[57848.106019]    0  do_page_fault+0x6e0/0x881    <-  do_page_fault+0x67a/0x881
1272[57848.106019]    0  oops_begin+0x0/0x96          <-  do_page_fault+0x6e0/0x881
1273[57848.106019]    0  trace_hw_branch_oops+0x0/0x2d        <-  oops_begin+0x9/0x96
1274[...]
1275[57848.106019]    0  ds_suspend_bts+0x2a/0xe3     <-  ds_suspend_bts+0x1a/0xe3
1276[57848.106019] ---------------------------------
1277[57848.106019] CPU 0
1278[57848.106019] Modules linked in: oops
1279[57848.106019] Pid: 5542, comm: cat Tainted: G        W  2.6.28 #23
1280[57848.106019] RIP: 0010:[<ffffffffa0000006>]  [<ffffffffa0000006>] open+0x6/0x14 [oops]
1281[57848.106019] RSP: 0018:ffff880235457d48  EFLAGS: 00010246
1282[...]
1283
1284
1285function graph tracer
1286---------------------------
1287
1288This tracer is similar to the function tracer except that it
1289probes a function on its entry and its exit. This is done by
1290using a dynamically allocated stack of return addresses in each
1291task_struct. On function entry the tracer overwrites the return
1292address of each function traced to set a custom probe. Thus the
1293original return address is stored on the stack of return address
1294in the task_struct.
1295
1296Probing on both ends of a function leads to special features
1297such as:
1298
1299- measure of a function's time execution
1300- having a reliable call stack to draw function calls graph
1301
1302This tracer is useful in several situations:
1303
1304- you want to find the reason of a strange kernel behavior and
1305  need to see what happens in detail on any areas (or specific
1306  ones).
1307
1308- you are experiencing weird latencies but it's difficult to
1309  find its origin.
1310
1311- you want to find quickly which path is taken by a specific
1312  function
1313
1314- you just want to peek inside a working kernel and want to see
1315  what happens there.
1316
1317# tracer: function_graph
1318#
1319# CPU  DURATION                  FUNCTION CALLS
1320# |     |   |                     |   |   |   |
1321
1322 0)               |  sys_open() {
1323 0)               |    do_sys_open() {
1324 0)               |      getname() {
1325 0)               |        kmem_cache_alloc() {
1326 0)   1.382 us    |          __might_sleep();
1327 0)   2.478 us    |        }
1328 0)               |        strncpy_from_user() {
1329 0)               |          might_fault() {
1330 0)   1.389 us    |            __might_sleep();
1331 0)   2.553 us    |          }
1332 0)   3.807 us    |        }
1333 0)   7.876 us    |      }
1334 0)               |      alloc_fd() {
1335 0)   0.668 us    |        _spin_lock();
1336 0)   0.570 us    |        expand_files();
1337 0)   0.586 us    |        _spin_unlock();
1338
1339
1340There are several columns that can be dynamically
1341enabled/disabled. You can use every combination of options you
1342want, depending on your needs.
1343
1344- The cpu number on which the function executed is default
1345  enabled.  It is sometimes better to only trace one cpu (see
1346  tracing_cpu_mask file) or you might sometimes see unordered
1347  function calls while cpu tracing switch.
1348
1349        hide: echo nofuncgraph-cpu > trace_options
1350        show: echo funcgraph-cpu > trace_options
1351
1352- The duration (function's time of execution) is displayed on
1353  the closing bracket line of a function or on the same line
1354  than the current function in case of a leaf one. It is default
1355  enabled.
1356
1357        hide: echo nofuncgraph-duration > trace_options
1358        show: echo funcgraph-duration > trace_options
1359
1360- The overhead field precedes the duration field in case of
1361  reached duration thresholds.
1362
1363        hide: echo nofuncgraph-overhead > trace_options
1364        show: echo funcgraph-overhead > trace_options
1365        depends on: funcgraph-duration
1366
1367  ie:
1368
1369  0)               |    up_write() {
1370  0)   0.646 us    |      _spin_lock_irqsave();
1371  0)   0.684 us    |      _spin_unlock_irqrestore();
1372  0)   3.123 us    |    }
1373  0)   0.548 us    |    fput();
1374  0) + 58.628 us   |  }
1375
1376  [...]
1377
1378  0)               |      putname() {
1379  0)               |        kmem_cache_free() {
1380  0)   0.518 us    |          __phys_addr();
1381  0)   1.757 us    |        }
1382  0)   2.861 us    |      }
1383  0) ! 115.305 us  |    }
1384  0) ! 116.402 us  |  }
1385
1386  + means that the function exceeded 10 usecs.
1387  ! means that the function exceeded 100 usecs.
1388
1389
1390- The task/pid field displays the thread cmdline and pid which
1391  executed the function. It is default disabled.
1392
1393        hide: echo nofuncgraph-proc > trace_options
1394        show: echo funcgraph-proc > trace_options
1395
1396  ie:
1397
1398  # tracer: function_graph
1399  #
1400  # CPU  TASK/PID        DURATION                  FUNCTION CALLS
1401  # |    |    |           |   |                     |   |   |   |
1402  0)    sh-4802     |               |                  d_free() {
1403  0)    sh-4802     |               |                    call_rcu() {
1404  0)    sh-4802     |               |                      __call_rcu() {
1405  0)    sh-4802     |   0.616 us    |                        rcu_process_gp_end();
1406  0)    sh-4802     |   0.586 us    |                        check_for_new_grace_period();
1407  0)    sh-4802     |   2.899 us    |                      }
1408  0)    sh-4802     |   4.040 us    |                    }
1409  0)    sh-4802     |   5.151 us    |                  }
1410  0)    sh-4802     | + 49.370 us   |                }
1411
1412
1413- The absolute time field is an absolute timestamp given by the
1414  system clock since it started. A snapshot of this time is
1415  given on each entry/exit of functions
1416
1417        hide: echo nofuncgraph-abstime > trace_options
1418        show: echo funcgraph-abstime > trace_options
1419
1420  ie:
1421
1422  #
1423  #      TIME       CPU  DURATION                  FUNCTION CALLS
1424  #       |         |     |   |                     |   |   |   |
1425  360.774522 |   1)   0.541 us    |                                          }
1426  360.774522 |   1)   4.663 us    |                                        }
1427  360.774523 |   1)   0.541 us    |                                        __wake_up_bit();
1428  360.774524 |   1)   6.796 us    |                                      }
1429  360.774524 |   1)   7.952 us    |                                    }
1430  360.774525 |   1)   9.063 us    |                                  }
1431  360.774525 |   1)   0.615 us    |                                  journal_mark_dirty();
1432  360.774527 |   1)   0.578 us    |                                  __brelse();
1433  360.774528 |   1)               |                                  reiserfs_prepare_for_journal() {
1434  360.774528 |   1)               |                                    unlock_buffer() {
1435  360.774529 |   1)               |                                      wake_up_bit() {
1436  360.774529 |   1)               |                                        bit_waitqueue() {
1437  360.774530 |   1)   0.594 us    |                                          __phys_addr();
1438
1439
1440You can put some comments on specific functions by using
1441trace_printk() For example, if you want to put a comment inside
1442the __might_sleep() function, you just have to include
1443<linux/ftrace.h> and call trace_printk() inside __might_sleep()
1444
1445trace_printk("I'm a comment!\n")
1446
1447will produce:
1448
1449 1)               |             __might_sleep() {
1450 1)               |                /* I'm a comment! */
1451 1)   1.449 us    |             }
1452
1453
1454You might find other useful features for this tracer in the
1455following "dynamic ftrace" section such as tracing only specific
1456functions or tasks.
1457
1458dynamic ftrace
1459--------------
1460
1461If CONFIG_DYNAMIC_FTRACE is set, the system will run with
1462virtually no overhead when function tracing is disabled. The way
1463this works is the mcount function call (placed at the start of
1464every kernel function, produced by the -pg switch in gcc),
1465starts of pointing to a simple return. (Enabling FTRACE will
1466include the -pg switch in the compiling of the kernel.)
1467
1468At compile time every C file object is run through the
1469recordmcount.pl script (located in the scripts directory). This
1470script will process the C object using objdump to find all the
1471locations in the .text section that call mcount. (Note, only the
1472.text section is processed, since processing other sections like
1473.init.text may cause races due to those sections being freed).
1474
1475A new section called "__mcount_loc" is created that holds
1476references to all the mcount call sites in the .text section.
1477This section is compiled back into the original object. The
1478final linker will add all these references into a single table.
1479
1480On boot up, before SMP is initialized, the dynamic ftrace code
1481scans this table and updates all the locations into nops. It
1482also records the locations, which are added to the
1483available_filter_functions list.  Modules are processed as they
1484are loaded and before they are executed.  When a module is
1485unloaded, it also removes its functions from the ftrace function
1486list. This is automatic in the module unload code, and the
1487module author does not need to worry about it.
1488
1489When tracing is enabled, kstop_machine is called to prevent
1490races with the CPUS executing code being modified (which can
1491cause the CPU to do undesirable things), and the nops are
1492patched back to calls. But this time, they do not call mcount
1493(which is just a function stub). They now call into the ftrace
1494infrastructure.
1495
1496One special side-effect to the recording of the functions being
1497traced is that we can now selectively choose which functions we
1498wish to trace and which ones we want the mcount calls to remain
1499as nops.
1500
1501Two files are used, one for enabling and one for disabling the
1502tracing of specified functions. They are:
1503
1504  set_ftrace_filter
1505
1506and
1507
1508  set_ftrace_notrace
1509
1510A list of available functions that you can add to these files is
1511listed in:
1512
1513   available_filter_functions
1514
1515 # cat available_filter_functions
1516put_prev_task_idle
1517kmem_cache_create
1518pick_next_task_rt
1519get_online_cpus
1520pick_next_task_fair
1521mutex_lock
1522[...]
1523
1524If I am only interested in sys_nanosleep and hrtimer_interrupt:
1525
1526 # echo sys_nanosleep hrtimer_interrupt \
1527                > set_ftrace_filter
1528 # echo function > current_tracer
1529 # echo 1 > tracing_on
1530 # usleep 1
1531 # echo 0 > tracing_on
1532 # cat trace
1533# tracer: ftrace
1534#
1535#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1536#              | |      |          |         |
1537          usleep-4134  [00]  1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt
1538          usleep-4134  [00]  1317.070111: sys_nanosleep <-syscall_call
1539          <idle>-0     [00]  1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt
1540
1541To see which functions are being traced, you can cat the file:
1542
1543 # cat set_ftrace_filter
1544hrtimer_interrupt
1545sys_nanosleep
1546
1547
1548Perhaps this is not enough. The filters also allow simple wild
1549cards. Only the following are currently available
1550
1551  <match>*  - will match functions that begin with <match>
1552  *<match>  - will match functions that end with <match>
1553  *<match>* - will match functions that have <match> in it
1554
1555These are the only wild cards which are supported.
1556
1557  <match>*<match> will not work.
1558
1559Note: It is better to use quotes to enclose the wild cards,
1560      otherwise the shell may expand the parameters into names
1561      of files in the local directory.
1562
1563 # echo 'hrtimer_*' > set_ftrace_filter
1564
1565Produces:
1566
1567# tracer: ftrace
1568#
1569#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1570#              | |      |          |         |
1571            bash-4003  [00]  1480.611794: hrtimer_init <-copy_process
1572            bash-4003  [00]  1480.611941: hrtimer_start <-hrtick_set
1573            bash-4003  [00]  1480.611956: hrtimer_cancel <-hrtick_clear
1574            bash-4003  [00]  1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel
1575          <idle>-0     [00]  1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt
1576          <idle>-0     [00]  1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt
1577          <idle>-0     [00]  1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt
1578          <idle>-0     [00]  1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt
1579          <idle>-0     [00]  1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt
1580
1581
1582Notice that we lost the sys_nanosleep.
1583
1584 # cat set_ftrace_filter
1585hrtimer_run_queues
1586hrtimer_run_pending
1587hrtimer_init
1588hrtimer_cancel
1589hrtimer_try_to_cancel
1590hrtimer_forward
1591hrtimer_start
1592hrtimer_reprogram
1593hrtimer_force_reprogram
1594hrtimer_get_next_event
1595hrtimer_interrupt
1596hrtimer_nanosleep
1597hrtimer_wakeup
1598hrtimer_get_remaining
1599hrtimer_get_res
1600hrtimer_init_sleeper
1601
1602
1603This is because the '>' and '>>' act just like they do in bash.
1604To rewrite the filters, use '>'
1605To append to the filters, use '>>'
1606
1607To clear out a filter so that all functions will be recorded
1608again:
1609
1610 # echo > set_ftrace_filter
1611 # cat set_ftrace_filter
1612 #
1613
1614Again, now we want to append.
1615
1616 # echo sys_nanosleep > set_ftrace_filter
1617 # cat set_ftrace_filter
1618sys_nanosleep
1619 # echo 'hrtimer_*' >> set_ftrace_filter
1620 # cat set_ftrace_filter
1621hrtimer_run_queues
1622hrtimer_run_pending
1623hrtimer_init
1624hrtimer_cancel
1625hrtimer_try_to_cancel
1626hrtimer_forward
1627hrtimer_start
1628hrtimer_reprogram
1629hrtimer_force_reprogram
1630hrtimer_get_next_event
1631hrtimer_interrupt
1632sys_nanosleep
1633hrtimer_nanosleep
1634hrtimer_wakeup
1635hrtimer_get_remaining
1636hrtimer_get_res
1637hrtimer_init_sleeper
1638
1639
1640The set_ftrace_notrace prevents those functions from being
1641traced.
1642
1643 # echo '*preempt*' '*lock*' > set_ftrace_notrace
1644
1645Produces:
1646
1647# tracer: ftrace
1648#
1649#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1650#              | |      |          |         |
1651            bash-4043  [01]   115.281644: finish_task_switch <-schedule
1652            bash-4043  [01]   115.281645: hrtick_set <-schedule
1653            bash-4043  [01]   115.281645: hrtick_clear <-hrtick_set
1654            bash-4043  [01]   115.281646: wait_for_completion <-__stop_machine_run
1655            bash-4043  [01]   115.281647: wait_for_common <-wait_for_completion
1656            bash-4043  [01]   115.281647: kthread_stop <-stop_machine_run
1657            bash-4043  [01]   115.281648: init_waitqueue_head <-kthread_stop
1658            bash-4043  [01]   115.281648: wake_up_process <-kthread_stop
1659            bash-4043  [01]   115.281649: try_to_wake_up <-wake_up_process
1660
1661We can see that there's no more lock or preempt tracing.
1662
1663
1664Dynamic ftrace with the function graph tracer
1665---------------------------------------------
1666
1667Although what has been explained above concerns both the
1668function tracer and the function-graph-tracer, there are some
1669special features only available in the function-graph tracer.
1670
1671If you want to trace only one function and all of its children,
1672you just have to echo its name into set_graph_function:
1673
1674 echo __do_fault > set_graph_function
1675
1676will produce the following "expanded" trace of the __do_fault()
1677function:
1678
1679 0)               |  __do_fault() {
1680 0)               |    filemap_fault() {
1681 0)               |      find_lock_page() {
1682 0)   0.804 us    |        find_get_page();
1683 0)               |        __might_sleep() {
1684 0)   1.329 us    |        }
1685 0)   3.904 us    |      }
1686 0)   4.979 us    |    }
1687 0)   0.653 us    |    _spin_lock();
1688 0)   0.578 us    |    page_add_file_rmap();
1689 0)   0.525 us    |    native_set_pte_at();
1690 0)   0.585 us    |    _spin_unlock();
1691 0)               |    unlock_page() {
1692 0)   0.541 us    |      page_waitqueue();
1693 0)   0.639 us    |      __wake_up_bit();
1694 0)   2.786 us    |    }
1695 0) + 14.237 us   |  }
1696 0)               |  __do_fault() {
1697 0)               |    filemap_fault() {
1698 0)               |      find_lock_page() {
1699 0)   0.698 us    |        find_get_page();
1700 0)               |        __might_sleep() {
1701 0)   1.412 us    |        }
1702 0)   3.950 us    |      }
1703 0)   5.098 us    |    }
1704 0)   0.631 us    |    _spin_lock();
1705 0)   0.571 us    |    page_add_file_rmap();
1706 0)   0.526 us    |    native_set_pte_at();
1707 0)   0.586 us    |    _spin_unlock();
1708 0)               |    unlock_page() {
1709 0)   0.533 us    |      page_waitqueue();
1710 0)   0.638 us    |      __wake_up_bit();
1711 0)   2.793 us    |    }
1712 0) + 14.012 us   |  }
1713
1714You can also expand several functions at once:
1715
1716 echo sys_open > set_graph_function
1717 echo sys_close >> set_graph_function
1718
1719Now if you want to go back to trace all functions you can clear
1720this special filter via:
1721
1722 echo > set_graph_function
1723
1724
1725Filter commands
1726---------------
1727
1728A few commands are supported by the set_ftrace_filter interface.
1729Trace commands have the following format:
1730
1731<function>:<command>:<parameter>
1732
1733The following commands are supported:
1734
1735- mod
1736  This command enables function filtering per module. The
1737  parameter defines the module. For example, if only the write*
1738  functions in the ext3 module are desired, run:
1739
1740   echo 'write*:mod:ext3' > set_ftrace_filter
1741
1742  This command interacts with the filter in the same way as
1743  filtering based on function names. Thus, adding more functions
1744  in a different module is accomplished by appending (>>) to the
1745  filter file. Remove specific module functions by prepending
1746  '!':
1747
1748   echo '!writeback*:mod:ext3' >> set_ftrace_filter
1749
1750- traceon/traceoff
1751  These commands turn tracing on and off when the specified
1752  functions are hit. The parameter determines how many times the
1753  tracing system is turned on and off. If unspecified, there is
1754  no limit. For example, to disable tracing when a schedule bug
1755  is hit the first 5 times, run:
1756
1757   echo '__schedule_bug:traceoff:5' > set_ftrace_filter
1758
1759  These commands are cumulative whether or not they are appended
1760  to set_ftrace_filter. To remove a command, prepend it by '!'
1761  and drop the parameter:
1762
1763   echo '!__schedule_bug:traceoff' > set_ftrace_filter
1764
1765
1766trace_pipe
1767----------
1768
1769The trace_pipe outputs the same content as the trace file, but
1770the effect on the tracing is different. Every read from
1771trace_pipe is consumed. This means that subsequent reads will be
1772different. The trace is live.
1773
1774 # echo function > current_tracer
1775 # cat trace_pipe > /tmp/trace.out &
1776[1] 4153
1777 # echo 1 > tracing_on
1778 # usleep 1
1779 # echo 0 > tracing_on
1780 # cat trace
1781# tracer: function
1782#
1783#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
1784#              | |      |          |         |
1785
1786 #
1787 # cat /tmp/trace.out
1788            bash-4043  [00] 41.267106: finish_task_switch <-schedule
1789            bash-4043  [00] 41.267106: hrtick_set <-schedule
1790            bash-4043  [00] 41.267107: hrtick_clear <-hrtick_set
1791            bash-4043  [00] 41.267108: wait_for_completion <-__stop_machine_run
1792            bash-4043  [00] 41.267108: wait_for_common <-wait_for_completion
1793            bash-4043  [00] 41.267109: kthread_stop <-stop_machine_run
1794            bash-4043  [00] 41.267109: init_waitqueue_head <-kthread_stop
1795            bash-4043  [00] 41.267110: wake_up_process <-kthread_stop
1796            bash-4043  [00] 41.267110: try_to_wake_up <-wake_up_process
1797            bash-4043  [00] 41.267111: select_task_rq_rt <-try_to_wake_up
1798
1799
1800Note, reading the trace_pipe file will block until more input is
1801added. By changing the tracer, trace_pipe will issue an EOF. We
1802needed to set the function tracer _before_ we "cat" the
1803trace_pipe file.
1804
1805
1806trace entries
1807-------------
1808
1809Having too much or not enough data can be troublesome in
1810diagnosing an issue in the kernel. The file buffer_size_kb is
1811used to modify the size of the internal trace buffers. The
1812number listed is the number of entries that can be recorded per
1813CPU. To know the full size, multiply the number of possible CPUS
1814with the number of entries.
1815
1816 # cat buffer_size_kb
18171408 (units kilobytes)
1818
1819Note, to modify this, you must have tracing completely disabled.
1820To do that, echo "nop" into the current_tracer. If the
1821current_tracer is not set to "nop", an EINVAL error will be
1822returned.
1823
1824 # echo nop > current_tracer
1825 # echo 10000 > buffer_size_kb
1826 # cat buffer_size_kb
182710000 (units kilobytes)
1828
1829The number of pages which will be allocated is limited to a
1830percentage of available memory. Allocating too much will produce
1831an error.
1832
1833 # echo 1000000000000 > buffer_size_kb
1834-bash: echo: write error: Cannot allocate memory
1835 # cat buffer_size_kb
183685
1837
1838-----------
1839
1840More details can be found in the source code, in the
1841kernel/trace/*.c files.
1842