linux/Documentation/thermal/intel_powerclamp.txt
                         =======================
                         INTEL POWERCLAMP DRIVER
                         =======================
By: Arjan van de Ven <arjan@linux.intel.com>
    Jacob Pan <jacob.jun.pan@linux.intel.com>

Contents:
        (*) Introduction
            - Goals and Objectives

        (*) Theory of Operation
            - Idle Injection
            - Calibration

        (*) Performance Analysis
            - Effectiveness and Limitations
            - Power vs Performance
            - Scalability
            - Calibration
            - Comparison with Alternative Techniques

        (*) Usage and Interfaces
            - Generic Thermal Layer (sysfs)
            - Kernel APIs (TBD)

============
INTRODUCTION
============

Consider the situation where a system's power consumption must be
reduced at runtime, due to power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software managed
passive power reduction must be performed to prevent the hardware
actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they are only used opportunistically, based on workload. The
intel_powerclamp driver introduces a method of synchronizing idle
injection across all online CPU threads. The goal is to achieve
forced and controllable C-state residency.

Testing and analysis have been carried out in the areas of power,
performance, scalability, and user experience. In many cases, a clear
advantage is shown over taking the CPU offline or modulating the CPU
clock.


===================
THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package level C-state
residency is available in MSRs, thus also available to the kernel.

These MSRs are:
      #define MSR_PKG_C2_RESIDENCY      0x60D
      #define MSR_PKG_C3_RESIDENCY      0x3F8
      #define MSR_PKG_C6_RESIDENCY      0x3F9
      #define MSR_PKG_C7_RESIDENCY      0x3FA

If the kernel can also inject idle time into the system, then a
closed-loop control system can be established that manages package
level C-state residency. The intel_powerclamp driver is conceived as
such a control system, where the target set point is a user-selected
idle ratio (chosen for the desired power reduction), and the error is
the difference between the actual package level C-state residency
ratio and the target idle ratio.
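
As a rough illustration of the measurement side of this loop, the
following user-space sketch reads the counters and computes the
residency ratio and the control error. It assumes that the msr module
is loaded (so that /dev/cpu/N/msr exists), that it runs as root, that
the residency counters tick at the TSC rate, and that a counter which
cannot be read may simply be skipped. The function names are
hypothetical, and this is not the driver's own code.

```python
import os
import struct

# Package C-state residency MSRs listed above.
PKG_CSTATE_MSRS = (0x60D, 0x3F8, 0x3F9, 0x3FA)
MSR_TSC = 0x10  # IA32_TIME_STAMP_COUNTER


def read_msr(addr, cpu=0, path=None):
    """Read one 64-bit MSR through the msr character device."""
    path = path or "/dev/cpu/%d/msr" % cpu
    fd = os.open(path, os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, addr))[0]
    finally:
        os.close(fd)


def sample(cpu=0):
    """Return (sum of package C-state counters, TSC) for one CPU."""
    total = 0
    for addr in PKG_CSTATE_MSRS:
        try:
            total += read_msr(addr, cpu)
        except OSError:
            pass  # assumption: counter not implemented on this CPU model
    return total, read_msr(MSR_TSC, cpu)


def residency_pct(before, after):
    """Percentage of the interval spent in package C-states."""
    d_res = after[0] - before[0]
    d_tsc = after[1] - before[1]
    return 100.0 * d_res / d_tsc if d_tsc > 0 else 0.0


def control_error(target_pct, actual_pct):
    """Positive when the package idled less than requested."""
    return target_pct - actual_pct
```

Calling sample() twice, a second or so apart, and passing both
results to residency_pct() gives the actual ratio from which the
error is computed.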

Injection is controlled by high priority kernel threads, spawned for
each online CPU.

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle start time and duration by rounding
jiffies, which prevents accumulated error and avoids jitter. Threads
are also bound to their CPU so that they cannot be migrated, unless
the CPU is taken offline. In that case, threads belonging to the
offlined CPUs are terminated immediately.

Running as SCHED_FIFO at a relatively high priority also allows this
scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability for HZ
values. This effect can be better visualized using a Perf timechart.
The following diagram shows the behavior of kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks,
until the next time interval.

The scheduler tick is stopped (NOHZ) during idle time, but interrupts
are not masked. Tests show that the extra wakeups from the scheduler
tick have a dramatic impact on the effectiveness of the powerclamp
driver on large scale systems (Westmere system with 80 processors).

CPU0
                  ____________          ____________
kidle_inject/0   |   sleep    |  mwait |  sleep     |
        _________|            |________|            |_______
                               duration
CPU1
                  ____________          ____________
kidle_inject/1   |   sleep    |  mwait |  sleep     |
        _________|            |________|            |_______
                              ^
                              |
                              |
                              roundup(jiffies, interval)
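
The alignment shown in the diagram can be illustrated with a small
sketch. It assumes, as one plausible reading of the text, that the
interval is derived from the idle duration and the target ratio so
that duration divided by interval equals the ratio. The helper names
are hypothetical and jiffies is modeled as a plain integer.

```python
def roundup(x, y):
    """Round x up to the next multiple of y (x itself if already aligned)."""
    return ((x + y - 1) // y) * y


def injection_interval(duration_jiffies, target_ratio_pct):
    """Interval length such that duration / interval ~= target ratio."""
    if target_ratio_pct <= 0:
        raise ValueError("target ratio must be positive")
    return max(duration_jiffies * 100 // target_ratio_pct, duration_jiffies)


def next_idle_start(now_jiffies, interval):
    """Jiffy at which a thread should begin its next idle period."""
    return roundup(now_jiffies, interval)


interval = injection_interval(duration_jiffies=6, target_ratio_pct=25)
# Threads that wake at slightly different times still agree on the target.
starts = [next_idle_start(now, interval) for now in (1001, 1003, 1007)]
print(interval, starts)  # prints: 24 [1008, 1008, 1008]
```

Because every thread rounds up to the same multiple of the interval,
no thread has to communicate its schedule to the others, and an error
in one wake-up time does not carry over into the next period.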

Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors the BSP (bootstrap processor), taking into account
the possibility of CPU hot-plug.

In terms of the dynamics of the idle control system, package level
idle time is considered largely as a non-causal system, where its
behavior cannot be based on the past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and
adjusts the next injection accordingly to avoid over/under
correction.

When used in a causal control system, such as temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback. For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.
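
A user-space controller of this kind might look like the sketch
below. Everything in it is an assumption made for illustration: the
gains, the sampling period, the class name, and the mapping of the
PID output onto the 0 to 50 idle ratio range. Reading the temperature
and writing the cooling state are left to the caller.

```python
class PowerclampPid:
    """Map a temperature error onto an idle injection ratio (percent)."""

    def __init__(self, target_c, kp=2.0, ki=0.1, kd=0.5, max_ratio=50):
        self.target_c = target_c
        self.kp, self.ki, self.kd = kp, ki, kd
        self.max_ratio = max_ratio
        self.integral = 0.0
        self.prev_error = None

    def update(self, temp_c, dt=1.0):
        """Return the idle ratio to request after one temperature sample."""
        error = temp_c - self.target_c  # positive when too hot
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        candidate = self.integral + error * dt
        output = self.kp * error + self.ki * candidate + self.kd * derivative
        # Anti-windup: only accumulate while the output is not saturated.
        if 0 <= output <= self.max_ratio:
            self.integral = candidate
        return int(min(max(output, 0), self.max_ratio))
```

The output is clamped to the range the driver accepts, and the
integral term is frozen while the output is saturated so that a long
hot period does not keep the clamp engaged after the temperature has
recovered.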


Calibration
-----------
During scalability testing, it is observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented. The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation for each target ratio consists of two parts:

        a) steady state error compensation
        This is to offset the error occurring when the system can
        enter idle without extra wakeups (such as external interrupts).

        b) dynamic error compensation
        When an excessive number of wakeups occurs during idle, an
        additional idle ratio can be added to quiet interrupts, by
        slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results. For example, on a Westmere system:
[jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
controlling cpu: 0
pct confidence steady dynamic (compensation)
0       0       0       0
1       1       0       0
2       1       1       0
3       3       1       0
4       3       1       0
5       3       1       0
6       3       1       0
7       3       1       0
8       3       1       0
...
30      3       2       0
31      3       2       0
32      3       1       0
33      3       2       0
34      3       1       0
35      3       2       0
36      3       1       0
37      3       2       0
38      3       1       0
39      3       2       0
40      3       3       0
41      3       1       0
42      3       2       0
43      3       1       0
44      3       1       0
45      3       2       0
46      3       3       0
47      3       0       0
48      3       2       0
49      3       3       0
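
For scripted inspection, output in this shape can be parsed with a
few lines of code. The sketch below assumes the layout shown in the
sample above (a "controlling cpu" line, a header line, then rows of
four integers); the function name, the row type, and the confidence
threshold of 3 used in the example are illustrative assumptions.
Reading the real file requires debugfs to be mounted and, typically,
root privileges.

```python
from collections import namedtuple

CalibRow = namedtuple("CalibRow", "pct confidence steady dynamic")


def parse_powerclamp_calib(text):
    """Parse powerclamp_calib text into (controlling_cpu, {pct: CalibRow})."""
    controlling_cpu = None
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("controlling cpu:"):
            controlling_cpu = int(line.split(":", 1)[1])
            continue
        fields = line.split()
        # Data rows are exactly four integers; headers and "..." are skipped.
        if len(fields) == 4 and all(f.isdigit() for f in fields):
            row = CalibRow(*map(int, fields))
            rows[row.pct] = row
    return controlling_cpu, rows


SAMPLE = """controlling cpu: 0
pct confidence steady dynamic (compensation)
0       0       0       0
1       1       0       0
2       1       1       0
3       3       1       0
"""

cpu, rows = parse_powerclamp_calib(SAMPLE)
confident = [pct for pct, r in rows.items() if r.confidence >= 3]
print(cpu, confident)  # prints: 0 [3]
```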

Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when the confidence levels of
all adjacent ratios have reached a satisfactory level. A confidence
level is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.

To compensate for an excessive number of wakeups during idle,
additional idle time is injected when such a condition is detected.
Currently, a simple algorithm doubles the injection ratio. A possible
enhancement might be to throttle the offending IRQ, such as delaying
EOI for level triggered interrupts. But it is a challenge to be
non-intrusive to the scheduler or the IRQ core code.
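
The two kinds of compensation can be sketched as follows. This is an
illustration of the rules described above, not the driver's
implementation: the confidence threshold, the averaging over the
neighbouring ratios, the point at which the cap is applied, and all
names are assumptions.

```python
CONFIDENCE_OK = 3   # assumption: matches the highest value in the sample table
MAX_RATIO = 50      # injection cap stated in this document


def steady_compensation(calib, ratio):
    """Average steady-state compensation of a ratio and its neighbours.

    calib maps ratio -> (confidence, steady_comp). Returns 0 unless the
    ratio and both adjacent ratios have reached CONFIDENCE_OK.
    """
    neighbours = [calib.get(r) for r in (ratio - 1, ratio, ratio + 1)]
    if any(n is None or n[0] < CONFIDENCE_OK for n in neighbours):
        return 0
    return round(sum(n[1] for n in neighbours) / 3)


def effective_ratio(calib, target, excessive_wakeups=False):
    """Ratio actually injected for a requested target ratio."""
    comp = steady_compensation(calib, target)
    if excessive_wakeups:
        comp = target  # dynamic compensation: doubles the injected ratio
    return min(target + comp, MAX_RATIO)  # assumption: cap applied last
```

With calibration data such as {30: (3, 2), 31: (3, 2), 32: (3, 1)},
a target of 31 is raised to 33 in the steady state, and to the cap of
50 when excessive wakeups are detected.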


CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even if they are migrated
to other CPUs after a CPU offline event.


====================
PERFORMANCE ANALYSIS
====================
This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------
The maximum idle injection ratio is capped at 50 percent. As
mentioned earlier, since interrupts are allowed during forced idle
time, excessive interrupts could result in less effectiveness. The
extreme case would be running ping -f to generate a flood of network
interrupts with little CPU acknowledgement. In this case, little can
be done by the idle injection threads. In most normal cases, such as
copying a large file with scp, applications can be throttled by the
powerclamp driver, since slowing down the CPU also slows down network
protocol processing, which in turn reduces interrupts.

When the controlling CPU changes control parameters at runtime, it
may take an additional period for the rest of the CPUs to catch up
with the changes. During this time, idle injection is out of sync,
and the system is not able to enter package C-states at the expected
ratio. But this effect is minor, because in most cases the target
ratio changes much less frequently than idle is injected.

Scalability
-----------
Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same target idle
ratio. The compensation also increases as the idle ratio gets larger.
This is why the calibration code is needed.

On the IVB 8P system, compared to taking a CPU offline, powerclamp
can achieve up to 40% better performance per watt (measured by a spin
counter summed over per-CPU counting threads spawned for all running
CPUs).

====================
USAGE AND INTERFACES
====================
The powerclamp driver is registered to the generic thermal layer as a
cooling device. Currently, it is not bound to any thermal zones.

jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
cur_state:0
max_state:50
type:intel_powerclamp

Example usage (the cooling device index differs between systems; use
the one whose type is intel_powerclamp):
- To inject 25% idle time
$ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"

If the system is not busy and already has more than 25% idle time,
then the powerclamp driver will not start idle injection. In that
case, top will not show the idle injection kernel threads.

If the system is busy (spin test below) and has less than 25% natural
idle time, the powerclamp kernel threads perform idle injection and
appear as running to the scheduler, but the overall system idle time
is still reported. In this example, 24.1% idle is shown. This helps
the system admin or user determine the cause of a slowdown when the
powerclamp driver is in action.

Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
 3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
 3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
 3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
 3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
 2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
 1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
 2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz
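
The idle figure that top reports can also be computed directly from
two snapshots of /proc/stat, which is convenient for checking whether
a given target ratio would cause any injection at all. The sketch
below assumes the usual layout of the aggregate "cpu" line (user,
nice, system, idle, iowait, irq, softirq, steal) and counts only the
idle field, to match the "id" column. The function names are
hypothetical, and natural idle is only meaningful when measured while
cur_state is 0.

```python
def cpu_times(stat_text):
    """Return (idle, total) jiffies from the aggregate "cpu" line."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:9]]  # user .. steal
            return values[3], sum(values)
    raise ValueError("no aggregate cpu line found")


def idle_pct(before_text, after_text):
    """Idle percentage between two /proc/stat snapshots."""
    idle0, total0 = cpu_times(before_text)
    idle1, total1 = cpu_times(after_text)
    d_total = total1 - total0
    return 100.0 * (idle1 - idle0) / d_total if d_total else 0.0


def needs_injection(natural_idle_pct, target_pct):
    """Injection only has an effect when natural idle is below the target."""
    return natural_idle_pct < target_pct
```

Reading /proc/stat twice, a second or so apart, and passing both
strings to idle_pct() yields the percentage to compare against the
requested idle ratio.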

Tests have shown that by using the powerclamp driver as a cooling
device, a PID-based userspace thermal controller can control CPU
temperature effectively, when no other thermal influence is added.
For example, an Ultrabook user can compile the kernel while staying
under a certain temperature (below most active trip points).