PCI Power Management

Copyright (c) 2010 Rafael J. Wysocki, Novell Inc.

An overview of concepts and the Linux kernel's interfaces related to PCI power
management.  Based on previous work by Patrick Mochel
(and others).

This document only covers the aspects of power management specific to PCI
devices.  For a general description of the kernel's interfaces related to device
power management refer to Documentation/power/devices.txt and

1. Hardware and Platform Support for PCI Power Management
2. PCI Subsystem and Device Power Management
3. PCI Device Drivers and Power Management
4. Resources

1. Hardware and Platform Support for PCI Power Management

1.1. Native and Platform-Based Power Management

In general, power management is a feature allowing one to save energy by putting
devices into states in which they draw less power (low-power states) at the
price of reduced functionality or performance.

Usually, a device is put into a low-power state when it is underutilized or
completely inactive.  However, when it is necessary to use the device once
again, it has to be put back into the "fully functional" state (full-power
state).  This may happen when there are some data for the device to handle or
as a result of an external event requiring the device to be active, which may
be signaled by the device itself.

PCI devices may be put into low-power states in two ways: by using the device
capabilities introduced by the PCI Bus Power Management Interface Specification,
or with the help of platform firmware, such as an ACPI BIOS.  In the first
approach, which is referred to as the native PCI power management (native PCI
PM) in what follows, the device power state is changed as a result of writing a
specific value into one of its standard configuration registers.  The second
approach requires the platform firmware to provide special methods that may be
used by the kernel to change the device's power state.

Devices supporting the native PCI PM usually can generate wakeup signals called
Power Management Events (PMEs) to let the kernel know about external events
requiring the device to be active.  After receiving a PME the kernel is supposed
to put the device that sent it into the full-power state.  However, the PCI Bus
Power Management Interface Specification doesn't define any standard method of
delivering the PME from the device to the CPU and the operating system kernel.
It is assumed that the platform firmware will perform this task and therefore,
even though a PCI device is set up to generate PMEs, it also may be necessary to
prepare the platform firmware for notifying the CPU of the PMEs coming from the
device (e.g. by generating interrupts).

In turn, if the methods provided by the platform firmware are used for changing
the power state of a device, usually the platform also provides a method for
preparing the device to generate wakeup signals.  In that case, however, it
often also is necessary to prepare the device for generating PMEs using the
native PCI PM mechanism, because the method provided by the platform depends on
that.

Thus in many situations both the native and the platform-based power management
mechanisms have to be used simultaneously to obtain the desired result.

1.2. Native PCI Power Management

The PCI Bus Power Management Interface Specification (PCI PM Spec) was
introduced between the PCI 2.1 and PCI 2.2 Specifications.  It defined a
standard interface for performing various operations related to power
management.

The implementation of the PCI PM Spec is optional for conventional PCI devices,
but it is mandatory for PCI Express devices.  If a device supports the PCI PM
Spec, it has an 8 byte power management capability field in its PCI
configuration space.  This field is used to describe and control the standard
features related to the native PCI power management.

The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses
(B0-B3).  The higher the number, the less power is drawn by the device or bus
in that state.  However, the higher the number, the longer the latency for
the device or bus to return to the full-power state (D0 or B0, respectively).

There are two variants of the D3 state defined by the specification.  The first
one is D3hot, referred to as the software accessible D3, because devices can be
programmed to go into it.  The second one, D3cold, is the state that PCI devices
are in when the supply voltage (Vcc) is removed from them.  It is not possible
to program a PCI device to go into D3cold, although there may be a programmable
interface for putting the bus the device is on into a state in which Vcc is
removed from all devices on the bus.

PCI bus power management, however, is not supported by the Linux kernel at the
time of this writing and therefore it is not covered by this document.

Note that every PCI device can be in the full-power state (D0) or in D3cold,
regardless of whether or not it implements the PCI PM Spec.  In addition to
that, if the PCI PM Spec is implemented by the device, it must support D3hot
as well as D0.  The support for the D1 and D2 power states is optional.

PCI devices supporting the PCI PM Spec can be programmed to go to any of the
supported low-power states (except for D3cold).  While in D1-D3hot the
standard configuration registers of the device must be accessible to software
(i.e. the device is required to respond to PCI configuration accesses), although
its I/O and memory spaces are then disabled.  This allows the device to be
programmatically put into D0.  Thus the kernel can switch the device back and
forth between D0 and the supported low-power states (except for D3cold) and the
possible power state transitions the device can undergo are the following:

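+----------------------------+
| Current State | New State  |
+----------------------------+
| D0            | D1, D2, D3 |
+----------------------------+
| D1            | D2, D3     |
+----------------------------+
| D2            | D3         |
+----------------------------+
| D1, D2, D3    | D0         |
+----------------------------+
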
The transition from D3cold to D0 occurs when the supply voltage is provided to
the device (i.e. power is restored).  In that case the device returns to D0 with
a full power-on reset sequence and the power-on defaults are restored to the
device by hardware just as at initial power up.

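The transition rules above can be summarized in a few lines of C.  This is a
minimal user-space sketch, not kernel code: the enum and the function name
sw_transition_allowed() are made up for illustration, and "D3" in the table is
taken to mean D3hot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; not the kernel's pci_power_t. */
enum dstate { D0, D1, D2, D3HOT, D3COLD };

/* Returns true if software can program the device from 'from' to 'to'. */
bool sw_transition_allowed(enum dstate from, enum dstate to)
{
	assert(from <= D3COLD && to <= D3COLD);

	if (from == D3COLD || to == D3COLD)
		return false;	/* D3cold is entered or left via Vcc only */
	if (from == to)
		return false;	/* not a transition at all */
	if (to == D0)
		return true;	/* D1, D2, D3hot -> D0 */
	return to > from;	/* otherwise only deeper states are reachable */
}
```

For example, sw_transition_allowed(D2, D1) is false, because a device in D2
has to go through D0 before it can be put into D1.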
PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
while in a low-power state (D1-D3), but they are not required to be capable
of generating PMEs from all supported low-power states.  In particular, the
capability of generating PMEs from D3cold is optional and depends on the
presence of additional voltage (3.3Vaux) allowing the device to remain
sufficiently active to generate a wakeup signal.

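Which states a device supports, and from which of them it can generate PMEs,
is advertised in the 16-bit Power Management Capabilities (PMC) register of
the capability field.  The sketch below decodes such a value in user space.
The bit positions are given as I understand them from the PCI PM Spec and
should be checked against the specification; struct pm_caps, decode_pmc() and
pme_capable() are hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of the PMC register (verify against the PCI PM Spec). */
#define PMC_D1_SUPPORT	0x0200u	/* D1 is supported */
#define PMC_D2_SUPPORT	0x0400u	/* D2 is supported */
#define PMC_PME_MASK	0xF800u	/* PME support, one bit per state */
#define PMC_PME_SHIFT	11

struct pm_caps {			/* hypothetical container */
	bool d1_support;
	bool d2_support;
	unsigned int pme_support;	/* bit n set: PME possible from state n */
};

struct pm_caps decode_pmc(uint16_t pmc)
{
	struct pm_caps caps = {
		.d1_support = (pmc & PMC_D1_SUPPORT) != 0,
		.d2_support = (pmc & PMC_D2_SUPPORT) != 0,
		.pme_support = (pmc & PMC_PME_MASK) >> PMC_PME_SHIFT,
	};

	return caps;
}

/* state: 0 = D0, 1 = D1, 2 = D2, 3 = D3hot, 4 = D3cold */
bool pme_capable(const struct pm_caps *caps, unsigned int state)
{
	assert(state <= 4);
	return (caps->pme_support & (1u << state)) != 0;
}
```

The five-bit mask produced here has the same meaning as the pme_support field
of struct pci_dev shown in Section 2.1.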
1.3. ACPI Device Power Management

The platform firmware support for the power management of PCI devices is
system-specific.  However, if the system in question is compliant with the
Advanced Configuration and Power Interface (ACPI) Specification, like the
majority of x86-based systems, it is supposed to implement device power
management interfaces defined by the ACPI standard.

For this purpose the ACPI BIOS provides special functions called "control
methods" that may be executed by the kernel to perform specific tasks, such as
putting a device into a low-power state.  These control methods are encoded
using a special byte-code language called the ACPI Machine Language (AML) and
stored in the machine's BIOS.  The kernel loads them from the BIOS and executes
them as needed using an AML interpreter that translates the AML byte code into
computations and memory or I/O space accesses.  This way, in theory, a BIOS
writer can provide the kernel with a means to perform actions depending
on the system design in a system-specific fashion.

ACPI control methods may be divided into global control methods, which are not
associated with any particular devices, and device control methods, which have
to be defined separately for each device supposed to be handled with the help of
the platform.  This means, in particular, that ACPI device control methods can
only be used to handle devices that the BIOS writer knew about in advance.  The
ACPI methods used for device power management fall into that category.

The ACPI specification assumes that devices can be in one of four power states
labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
D0-D3 states (although the difference between D3hot and D3cold is not taken
into account by ACPI).  Moreover, for each power state of a device there is a
set of power resources that have to be enabled for the device to be put into
that state.  These power resources are controlled (i.e. enabled or disabled)
with the help of their own control methods, _ON and _OFF, that have to be
defined individually for each of them.

To put a device into the ACPI power state Dx (where x is a number between 0 and
3 inclusive) the kernel is supposed to (1) enable the power resources required
by the device in this state using their _ON control methods and (2) execute the
_PSx control method defined for the device.  In addition to that, if the device
is going to be put into a low-power state (D1-D3) and is supposed to generate
wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI
3.0) control method defined for it has to be executed before _PSx.  Power
resources that are not required by the device in the target power state and are
not required any more by any other device should be disabled (by executing their
_OFF control methods).  If the current power state of the device is D3, it can
only be put into D0 this way.

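The ordering constraints in the paragraph above can be illustrated with a
small user-space model that only records the names of the control methods in
the order in which they would be executed.  Nothing here calls into ACPI;
log_step() and acpi_set_dstate_order() are hypothetical helpers, and running
_OFF last is an assumption of this sketch, since the text above does not fix
its position relative to the other steps.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LOG_SIZE 128

/* Appends one step name to a space-separated log of LOG_SIZE bytes. */
void log_step(char *log, const char *step)
{
	assert(strlen(log) + strlen(step) + 2 <= LOG_SIZE);
	if (log[0] != '\0')
		strcat(log, " ");
	strcat(log, step);
}

/*
 * Records the order of control methods used to put a device into Dx.
 * 'wakeup' is nonzero if the device should signal wakeup from Dx.
 */
void acpi_set_dstate_order(char *log, int x, int wakeup)
{
	char psx[8];

	assert(x >= 0 && x <= 3);
	log[0] = '\0';

	log_step(log, "_ON");		/* (1) power resources needed in Dx */
	if (x > 0 && wakeup)
		log_step(log, "_DSW");	/* arm wakeup before _PSx */
	snprintf(psx, sizeof(psx), "_PS%d", x);
	log_step(log, psx);		/* (2) switch the device itself */
	log_step(log, "_OFF");		/* resources nobody needs any more */
}
```

With a LOG_SIZE-byte buffer, a request for D3 with wakeup enabled records
"_ON _DSW _PS3 _OFF", while a request for D0 never records _DSW.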
However, quite often the power states of devices are changed during a
system-wide transition into a sleep state or back into the working state.  ACPI
defines four system sleep states, S1, S2, S3, and S4, and denotes the system
working state as S0.  In general, the target system sleep (or working) state
determines the highest power (lowest number) state the device can be put
into and the kernel is supposed to obtain this information by executing the
device's _SxD control method (where x is a number between 0 and 4 inclusive).
If the device is required to wake up the system from the target sleep state, the
lowest power (highest number) state it can be put into is also determined by the
target state of the system.  The kernel is then supposed to use the device's
_SxW control method to obtain the number of that state.  It also is supposed to
use the device's _PRW control method to learn which power resources need to be
enabled for the device to be able to generate wakeup signals.

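The way _SxD and _SxW bound the device state can be sketched as follows.  This
is a simplified user-space model under stated assumptions: the function name
choose_sleep_dstate() is hypothetical, -1 stands for "method not present",
and always picking the lowest-power state in the allowed range is a policy
chosen for the example rather than something the text above prescribes.

```c
#include <assert.h>

/*
 * sxd: value returned by _SxD (highest-power state allowed), -1 if absent.
 * sxw: value returned by _SxW (lowest-power state that can still wake the
 *      system), -1 if absent.
 * Returns the number of the D-state to use.
 */
int choose_sleep_dstate(int sxd, int sxw, int wakeup)
{
	int d_min = 0;	/* D0: highest power */
	int d_max = 3;	/* D3: lowest power */

	assert(sxd >= -1 && sxd <= 3 && sxw >= -1 && sxw <= 3);

	if (sxd >= 0)
		d_min = sxd;
	if (wakeup && sxw >= 0)
		d_max = sxw;
	if (d_max < d_min)
		d_max = d_min;	/* inconsistent data: assumption of this sketch */

	return d_max;		/* prefer the lowest-power state in range */
}
```

For instance, with _SxD returning 1 and _SxW returning 2, a device that has to
wake up the system ends up in D2, while the same device without wakeup
enabled may go all the way to D3.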
1.4. Wakeup Signaling

Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
a result of the execution of the _DSW (or _PSW) ACPI control method before
putting the device into a low-power state, have to be caught and handled as
appropriate.  If they are sent while the system is in the working state
(ACPI S0), they should be translated into interrupts so that the kernel can
put the devices generating them into the full-power state and take care of the
events that triggered them.  In turn, if they are sent while the system is
sleeping, they should cause the system's core logic to trigger wakeup.

On ACPI-based systems wakeup signals sent by conventional PCI devices are
converted into ACPI General-Purpose Events (GPEs) which are hardware signals
from the system core logic generated in response to various events that need to
be acted upon.  Every GPE is associated with one or more sources of potentially
interesting events.  In particular, a GPE may be associated with a PCI device
capable of signaling wakeup.  The information on the connections between GPEs
and event sources is recorded in the system's ACPI BIOS from where it can be
read by the kernel.

If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
associated with it (if there is one) is triggered.  The GPEs associated with PCI
bridges may also be triggered in response to a wakeup signal from one of the
devices below the bridge (this also is the case for root bridges) and, for
example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be
handled this way.

A GPE may be triggered when the system is sleeping (i.e. when it is in one of
the ACPI S1-S4 states), in which case system wakeup is started by its core logic
(the device that was the source of the signal causing the system wakeup to occur
may be identified later).  The GPEs used in such situations are referred to as
wakeup GPEs.

Usually, however, GPEs are also triggered when the system is in the working
state (ACPI S0) and in that case the system's core logic generates a System
Control Interrupt (SCI) to notify the kernel of the event.  Then, the SCI
handler identifies the GPE that caused the interrupt to be generated which,
in turn, allows the kernel to identify the source of the event (that may be
a PCI device signaling wakeup).  The GPEs used for notifying the kernel of
events occurring while the system is in the working state are referred to as
runtime GPEs.

Unfortunately, there is no standard way of handling wakeup signals sent by
conventional PCI devices on systems that are not ACPI-based, but there is one
for PCI Express devices.  Namely, the PCI Express Base Specification introduced
a native mechanism for converting native PCI PMEs into interrupts generated by
root ports.  For conventional PCI devices native PMEs are out-of-band, so they
are routed separately and they need not pass through bridges (in principle they
may be routed directly to the system's core logic), but for PCI Express devices
they are in-band messages that have to pass through the PCI Express hierarchy,
including the root port on the path from the device to the Root Complex.  Thus
it was possible to introduce a mechanism by which a root port generates an
interrupt whenever it receives a PME message from one of the devices below it.
The PCI Express Requester ID of the device that sent the PME message is then
recorded in one of the root port's configuration registers from where it may be
read by the interrupt handler allowing the device to be identified.  [PME
messages sent by PCI Express endpoints integrated with the Root Complex don't
pass through root ports, but instead they cause a Root Complex Event Collector
(if there is one) to generate interrupts.]

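To make the identification step concrete, the following user-space sketch
decodes a 32-bit value of the kind an interrupt handler might read from the
root port.  It assumes the register in question is the PCI Express Root Status
register, with the Requester ID in bits 15:0 and the PME Status flag in bit
16; that layout should be checked against the PCI Express Base Specification.
struct pme_source and decode_root_status() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed Root Status layout (verify against the PCIe Base Specification). */
#define RTSTA_PME_REQ_ID	0x0000ffffu	/* PME Requester ID */
#define RTSTA_PME_STATUS	0x00010000u	/* a PME has been recorded */

struct pme_source {		/* hypothetical type for this sketch */
	unsigned int bus;
	unsigned int dev;
	unsigned int fn;
};

/* Returns true and fills *src if the value reports a recorded PME. */
bool decode_root_status(uint32_t rtsta, struct pme_source *src)
{
	uint16_t req_id;

	assert(src != NULL);

	if (!(rtsta & RTSTA_PME_STATUS))
		return false;

	req_id = rtsta & RTSTA_PME_REQ_ID;
	src->bus = req_id >> 8;			/* bits 15:8 */
	src->dev = (req_id >> 3) & 0x1f;	/* bits 7:3 */
	src->fn = req_id & 0x07;		/* bits 2:0 */
	return true;
}
```

A Requester ID of 0x0308 decodes to bus 3, device 1, function 0, which is the
usual bus/device/function split of a PCI address.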
In principle the native PCI Express PME signaling may also be used on ACPI-based
systems along with the GPEs, but to use it the kernel has to ask the system's
ACPI BIOS to release control of root port configuration registers.  The ACPI
BIOS, however, is not required to allow the kernel to control these registers
and if it doesn't do that, the kernel must not modify their contents.  Of course
the native PCI Express PME signaling cannot be used by the kernel in that case.

2. PCI Subsystem and Device Power Management

2.1. Device Power Management Callbacks

The PCI Subsystem participates in the power management of PCI devices in a
number of ways.  First of all, it provides an intermediate code layer between
the device power management core (PM core) and PCI device drivers.
Specifically, the pm field of the PCI subsystem's struct bus_type object,
pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
pointers to several device power management callbacks:

const struct dev_pm_ops pci_dev_pm_ops = {
        .prepare = pci_pm_prepare,
        .complete = pci_pm_complete,
        .suspend = pci_pm_suspend,
        .resume = pci_pm_resume,
        .freeze = pci_pm_freeze,
        .thaw = pci_pm_thaw,
        .poweroff = pci_pm_poweroff,
        .restore = pci_pm_restore,
        .suspend_noirq = pci_pm_suspend_noirq,
        .resume_noirq = pci_pm_resume_noirq,
        .freeze_noirq = pci_pm_freeze_noirq,
        .thaw_noirq = pci_pm_thaw_noirq,
        .poweroff_noirq = pci_pm_poweroff_noirq,
        .restore_noirq = pci_pm_restore_noirq,
        .runtime_suspend = pci_pm_runtime_suspend,
        .runtime_resume = pci_pm_runtime_resume,
        .runtime_idle = pci_pm_runtime_idle,
};

These callbacks are executed by the PM core in various situations related to
device power management and they, in turn, execute power management callbacks
provided by PCI device drivers.  They also perform power management operations
involving some standard configuration registers of PCI devices that device
drivers need not know or care about.

The structure representing a PCI device, struct pci_dev, contains several fields
that these callbacks operate on:

struct pci_dev {
        ...
        pci_power_t     current_state;  /* Current operating state. */
        int             pm_cap;         /* PM capability offset in the
                                           configuration space */
        unsigned int    pme_support:5;  /* Bitmask of states from which PME#
                                           can be generated */
        unsigned int    pme_interrupt:1;/* Is native PCIe PME signaling used? */
        unsigned int    d1_support:1;   /* Low power state D1 is supported */
        unsigned int    d2_support:1;   /* Low power state D2 is supported */
        unsigned int    no_d1d2:1;      /* D1 and D2 are forbidden */
        unsigned int    wakeup_prepared:1;  /* Device prepared for wake up */
        unsigned int    d3_delay;       /* D3->D0 transition time in ms */
        ...
};

They also indirectly use some fields of the struct device that is embedded in
struct pci_dev.

2.2. Device Initialization

The PCI subsystem's first task related to device power management is to
prepare the device for power management and initialize the fields of struct
pci_dev used for this purpose.  This happens in two functions defined in
drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().

The first of these functions checks if the device supports native PCI PM
and if that's the case the offset of its power management capability structure
in the configuration space is stored in the pm_cap field of the device's struct
pci_dev object.  Next, the function checks which PCI low-power states are
supported by the device and from which low-power states the device can generate
native PCI PMEs.  The power management fields of the device's struct pci_dev and
the struct device embedded in it are updated accordingly and the generation of
PMEs by the device is disabled.

The second function checks if the device can be prepared to signal wakeup with
the help of the platform firmware, such as the ACPI BIOS.  If that is the case,
the function updates the wakeup fields in struct device embedded in the
device's struct pci_dev and uses the firmware-provided method to prevent the
device from signaling wakeup.

At this point the device is ready for power management.  For driverless devices,
however, this functionality is limited to a few basic operations carried out
during system-wide transitions to a sleep state and back to the working state.

2.3. Runtime Device Power Management

The PCI subsystem plays a vital role in the runtime power management of PCI
devices.  For this purpose it uses the general runtime power management
(runtime PM) framework described in Documentation/power/runtime_pm.txt.
Namely, it provides subsystem-level callbacks:

        pci_pm_runtime_suspend()
        pci_pm_runtime_resume()
        pci_pm_runtime_idle()

that are executed by the core runtime PM routines.  It also implements the
entire mechanics necessary for handling runtime wakeup signals from PCI devices
in low-power states, which at the time of this writing works for both the native
PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in
Section 1.

First, a PCI device is put into a low-power state, or suspended, with the help
of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call
pci_pm_runtime_suspend() to do the actual job.  For this to work, the device's
driver has to provide a pm->runtime_suspend() callback (see below), which is
run by pci_pm_runtime_suspend() as the first action.  If the driver's callback
returns successfully, the device's standard configuration registers are saved,
the device is prepared to generate wakeup signals and, finally, it is put into
the target low-power state.

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup.  The exact method of signaling wakeup is
system-dependent and is determined by the PCI subsystem on the basis of the
reported capabilities of the device and the platform firmware.  To prepare the
device for signaling wakeup and put it into the selected low-power state, the
PCI subsystem can use the platform firmware as well as the device's native PCI
PM capabilities, if supported.

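The selection rule "lowest-power state from which the device can signal
wakeup" can be sketched for the native PCI PM case alone.  The function below
is a user-space illustration with a hypothetical name; it ignores the platform
firmware entirely and falls back to D0 when no suitable low-power state
exists, which is an assumption of this example.

```c
#include <assert.h>

/*
 * pme_support: bit n set if the device can signal PME from state n
 *              (0 = D0, 1 = D1, 2 = D2, 3 = D3hot, 4 = D3cold).
 * d1_ok, d2_ok: nonzero if the device supports D1 and D2, respectively.
 * Returns the deepest software-reachable state that can still signal PME.
 */
int pick_wakeup_state(unsigned int pme_support, int d1_ok, int d2_ok)
{
	int state;

	assert((pme_support & ~0x1fu) == 0);

	/* Walk from D3hot towards D1 and stop at the first usable state. */
	for (state = 3; state > 0; state--) {
		if (state == 1 && !d1_ok)
			continue;
		if (state == 2 && !d2_ok)
			continue;
		if (pme_support & (1u << state))
			return state;
	}
	return 0;	/* no low-power state can signal wakeup */
}
```

A device whose mask only has the D1 and D2 bits set is put into D2 if it
supports that state and into D1 otherwise.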
It is expected that the device driver's pm->runtime_suspend() callback will
not attempt to prepare the device for signaling wakeup or to put it into a
low-power state.  The driver ought to leave these tasks to the PCI subsystem
that has all of the information necessary to perform them.

A suspended device is brought back into the "active" state, or resumed,
with the help of pm_request_resume() or pm_runtime_resume() which both call
pci_pm_runtime_resume() for PCI devices.  Again, this only works if the device's
driver provides a pm->runtime_resume() callback (see below).  However, before
the driver's callback is executed, pci_pm_runtime_resume() brings the device
back into the full-power state, prevents it from signaling wakeup while in that
state and restores its standard configuration registers.  Thus the driver's
callback need not worry about the PCI-specific aspects of the device resume.

Note that generally pci_pm_runtime_resume() may be called in two different
situations.  First, it may be called at the request of the device's driver, for
example if there are some data for it to process.  Second, it may be called
as a result of a wakeup signal from the device itself (this sometimes is
referred to as "remote wakeup").  Of course, for this purpose the wakeup signal
is handled in one of the ways described in Section 1 and finally converted into
a notification for the PCI subsystem after the source device has been
identified.

The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle()
and pm_request_idle(), executes the device driver's pm->runtime_idle()
callback, if defined, and if that callback doesn't return an error code (or is
not present at all), suspends the device with the help of pm_runtime_suspend().
Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for
example, it is called right after the device has just been resumed), in which
case it is expected to suspend the device if that makes sense.  Usually,
however, the PCI subsystem doesn't really know if the device really can be
suspended, so it lets the device's driver decide by running its
pm->runtime_idle() callback.

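The decision made in the idle path can be modeled in user space as follows.
This is only a sketch of the rule stated above, not the kernel's
implementation: struct drv_model stands in for the driver's callback table,
and all names in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct drv_model {			/* hypothetical driver description */
	int (*runtime_idle)(void);	/* may be NULL if not provided */
};

/* Returns true if the subsystem should go on to suspend the device. */
bool idle_should_suspend(const struct drv_model *drv)
{
	assert(drv != NULL);

	if (drv->runtime_idle == NULL)
		return true;		/* no callback: nothing objects */

	return drv->runtime_idle() == 0;	/* an error code vetoes suspend */
}

/* Two sample callbacks for illustration. */
int always_busy(void) { return -1; }	/* arbitrary nonzero error code */
int always_idle(void) { return 0; }
```

A driver that still has work queued for its device would return an error code
from its callback, which keeps the device in the active state.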
2.4. System-Wide Power Transitions

There are a few different types of system-wide power transitions, described in
Documentation/power/devices.txt.  Each of them requires devices to be handled
in a specific way and the PM core executes subsystem-level power management
callbacks for this purpose.  They are executed in phases such that each phase
involves executing the same subsystem-level callback for every device belonging
to the given subsystem before the next phase begins.  These phases always run
after tasks have been frozen.

2.4.1. System Suspend

When the system is going into a sleep state in which the contents of memory will
be preserved, such as one of the ACPI sleep states S1-S3, the phases are:

        prepare, suspend, suspend_noirq.

The following PCI bus type's callbacks, respectively, are used in these phases:

        pci_pm_prepare()
        pci_pm_suspend()
        pci_pm_suspend_noirq()

The pci_pm_prepare() routine first puts the device into the "fully functional"
state with the help of pm_runtime_resume().  Then, it executes the device
driver's pm->prepare() callback if defined (i.e. if the driver's struct
dev_pm_ops object is present and the prepare pointer in that object is valid).

The pci_pm_suspend() routine first checks if the device's driver implements
legacy PCI suspend routines (see Section 3), in which case the driver's legacy
suspend callback is executed, if present, and its result is returned.  Next, if
the device's driver doesn't provide a struct dev_pm_ops object (containing
pointers to the driver's callbacks), pci_pm_default_suspend() is called, which
simply turns off the device's bus master capability and runs
pcibios_disable_device() to disable it, unless the device is a bridge (PCI
bridges are ignored by this routine).  Next, the device driver's pm->suspend()
callback is executed, if defined, and its result is returned if it fails.
Finally, pci_fixup_device() is called to apply hardware suspend quirks related
to the device if necessary.

Note that the suspend phase is carried out asynchronously for PCI devices, so
the pci_pm_suspend() callback may be executed in parallel for any pair of PCI
devices that don't depend on each other in a known way (i.e. none of the paths
in the device tree from the root bridge to a leaf device contains both of them).

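The independence condition in the note above (no root-to-leaf path contains
both devices) is the same as saying that neither device is an ancestor of the
other.  A minimal user-space sketch of that check, using a hypothetical
struct node with a parent pointer in place of real device structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {				/* hypothetical stand-in for a device */
	const struct node *parent;	/* NULL for the root bridge */
};

/* Returns true if 'a' lies on the path from the root down to 'b'. */
bool is_ancestor(const struct node *a, const struct node *b)
{
	assert(a != NULL && b != NULL);

	for (b = b->parent; b != NULL; b = b->parent)
		if (b == a)
			return true;
	return false;
}

/* Two distinct devices are independent if neither is above the other. */
bool may_suspend_in_parallel(const struct node *a, const struct node *b)
{
	return a != b && !is_ancestor(a, b) && !is_ancestor(b, a);
}
```

Two endpoints behind different bridges pass this check, whereas a bridge and
any device below it do not, so the device below has to be suspended first.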
The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has
been called, which means that the device driver's interrupt handler won't be
invoked while this routine is running.  It first checks if the device's driver
implements legacy PCI suspend routines (Section 3), in which case the legacy
late suspend routine is called and its result is returned (the standard
configuration registers of the device are saved if the driver's callback hasn't
done that).  Second, if the device driver's struct dev_pm_ops object is not
present, the device's standard configuration registers are saved and the routine
returns success.  Otherwise the device driver's pm->suspend_noirq() callback is
executed, if present, and its result is returned if it fails.  Next, if the
device's standard configuration registers haven't been saved yet (one of the
device driver's callbacks executed before might do that), pci_pm_suspend_noirq()
saves them, prepares the device to signal wakeup (if necessary) and puts it into
a low-power state.

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup while the system is in the target sleep
state.  Just like in the runtime PM case described above, the mechanism of
signaling wakeup is system-dependent and determined by the PCI subsystem, which
is also responsible for preparing the device to signal wakeup from the system's
target sleep state as appropriate.

PCI device drivers (that don't implement legacy power management callbacks) are
generally not expected to prepare devices for signaling wakeup or to put them
into low-power states.  However, if one of the driver's suspend callbacks
(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration
registers, pci_pm_suspend_noirq() will assume that the device has been prepared
to signal wakeup and put into a low-power state by the driver (the driver is
then assumed to have used the helper functions provided by the PCI subsystem for
this purpose).  PCI device drivers are not encouraged to do that, but in some
rare cases doing that in the driver may be the optimum approach.

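The "saved registers mean the driver took over" convention can be shown with a
small user-space model.  The structure and function below are hypothetical and
only mirror the behavior described above; in particular, the state_saved flag
is an assumed way of recording that the standard configuration registers have
already been saved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dev_model {		/* hypothetical stand-in for struct pci_dev */
	bool state_saved;	/* standard registers already saved? */
	bool wakeup_prepared;	/* device armed to signal wakeup? */
	int current_state;	/* 0 = D0 ... 3 = D3hot */
};

/* Models the final step of the suspend_noirq phase for one device. */
void model_suspend_noirq(struct dev_model *dev, int target, bool may_wakeup)
{
	assert(dev != NULL && target >= 0 && target <= 3);

	if (dev->state_saved)
		return;		/* driver did the work; leave the device alone */

	dev->state_saved = true;		/* save the standard registers */
	dev->wakeup_prepared = may_wakeup;	/* arm wakeup if needed */
	dev->current_state = target;		/* enter the low-power state */
}
```

A driver that saves the registers itself but forgets to change the power state
therefore leaves its device in D0 for the whole sleep period, which is one
reason why doing this in the driver is discouraged.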
2.4.2. System Resume

When the system is undergoing a transition from a sleep state in which the
contents of memory have been preserved, such as one of the ACPI sleep states
S1-S3, into the working state (ACPI S0), the phases are:

        resume_noirq, resume, complete.

The following PCI bus type's callbacks, respectively, are executed in these
phases:

        pci_pm_resume_noirq()
        pci_pm_resume()
        pci_pm_complete()

The pci_pm_resume_noirq() routine first puts the device into the full-power
state, restores its standard configuration registers and applies early resume
hardware quirks related to the device, if necessary.  This is done
unconditionally, regardless of whether or not the device's driver implements
legacy PCI power management callbacks (this way all PCI devices are in the
full-power state and their standard configuration registers have been restored
when their interrupt handlers are invoked for the first time during resume,
which allows the kernel to avoid problems with the handling of shared interrupts
by drivers whose devices are still suspended).  If legacy PCI power management
callbacks (see Section 3) are implemented by the device's driver, the legacy
early resume callback is executed and its result is returned.  Otherwise, the
device driver's pm->resume_noirq() callback is executed, if defined, and its
result is returned.

The pci_pm_resume() routine first checks if the device's standard configuration
registers have been restored and restores them if that's not the case (this
only is necessary in the error path during a failing suspend).  Next, resume
hardware quirks related to the device are applied, if necessary, and if the
device's driver implements legacy PCI power management callbacks (see
Section 3), the driver's legacy resume callback is executed and its result is
returned.  Otherwise, the device's wakeup signaling mechanisms are blocked and
its driver's pm->resume() callback is executed, if defined (the callback's
result is then returned).

The resume phase is carried out asynchronously for PCI devices, like the
suspend phase described above, which means that if two PCI devices don't depend
on each other in a known way, the pci_pm_resume() routine may be executed for
both of them in parallel.

The pci_pm_complete() routine only executes the device driver's pm->complete()
callback, if defined.

2.4.3. System Hibernation

System hibernation is more complicated than system suspend, because it requires
a system image to be created and written into a persistent storage medium.  The
image is created atomically and all devices are quiesced, or frozen, before that
happens.

The freezing of devices is carried out after enough memory has been freed (at
the time of this writing the image creation requires at least 50% of system RAM
to be free) in the following three phases:

        prepare, freeze, freeze_noirq

that correspond to the PCI bus type's callbacks:

        pci_pm_prepare()
        pci_pm_freeze()
        pci_pm_freeze_noirq()

This means that the prepare phase is exactly the same as for system suspend.
The other two phases, however, are different.


The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs
the device driver's pm->freeze() callback, if defined, instead of pm->suspend(),
and it doesn't apply the suspend-related hardware quirks.  It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.

The pci_pm_freeze_noirq() routine, in turn, is similar to
pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq()
routine instead of pm->suspend_noirq().  It also doesn't attempt to prepare the
device for signaling wakeup and put it into a low-power state.  Still, it saves
the device's standard configuration registers if they haven't been saved by one
of the driver's callbacks.

Once the image has been created, it has to be saved.  However, at this point all
devices are frozen and they cannot handle I/O, while their ability to handle
I/O is obviously necessary for the image saving.  Thus they have to be brought
back to the fully functional state and this is done in the following phases:

        thaw_noirq, thaw, complete

using the following PCI bus type's callbacks:

        pci_pm_thaw_noirq()
        pci_pm_thaw()
        pci_pm_complete()

respectively.

The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(),
but it doesn't put the device into the full power state and doesn't attempt to
restore its standard configuration registers.  It also executes the device
driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq().

The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device
driver's pm->thaw() callback instead of pm->resume().  It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.

The complete phase is the same as for system resume.

After saving the image, devices need to be powered down before the system can
enter the target sleep state (ACPI S4 for ACPI-based systems).  This is done in
three phases:

        prepare, poweroff, poweroff_noirq

where the prepare phase is exactly the same as for system suspend.  The other
two phases are analogous to the suspend and suspend_noirq phases, respectively.
The PCI subsystem-level callbacks they correspond to

        pci_pm_poweroff()
        pci_pm_poweroff_noirq()

work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively,
although they don't attempt to save the device's standard configuration
registers.
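
The whole hibernation sequence described in this section can be summarized as
one ordered table.  The sketch below only restates the phase and callback
names given above; the table type and the lookup helper are hypothetical and
exist for illustration only.

```c
#include <stddef.h>

/* Hypothetical table; it only restates the phase lists given in the text. */
struct model_phase {
	const char *phase;	/* PM core phase name */
	const char *pci_cb;	/* PCI bus type callback handling it */
};

static const struct model_phase hibernation_phases[] = {
	/* Freeze devices before the image is created. */
	{ "prepare",        "pci_pm_prepare" },
	{ "freeze",         "pci_pm_freeze" },
	{ "freeze_noirq",   "pci_pm_freeze_noirq" },
	/* Image created: thaw devices so that it can be saved. */
	{ "thaw_noirq",     "pci_pm_thaw_noirq" },
	{ "thaw",           "pci_pm_thaw" },
	{ "complete",       "pci_pm_complete" },
	/* Image saved: power devices down. */
	{ "prepare",        "pci_pm_prepare" },
	{ "poweroff",       "pci_pm_poweroff" },
	{ "poweroff_noirq", "pci_pm_poweroff_noirq" },
};

#define HIBERNATION_PHASE_COUNT \
	(sizeof(hibernation_phases) / sizeof(hibernation_phases[0]))

/* Return the callback name for step number idx, or NULL past the end. */
static const char *model_hibernation_cb(size_t idx)
{
	if (idx >= HIBERNATION_PHASE_COUNT)
		return NULL;
	return hibernation_phases[idx].pci_cb;
}
```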

2.4.4. System Restore

System restore requires a hibernation image to be loaded into memory and the
pre-hibernation memory contents to be restored before the pre-hibernation system
activity can be resumed.

As described in Documentation/power/devices.txt, the hibernation image is loaded
into memory by a fresh instance of the kernel, called the boot kernel, which in
turn is loaded and run by a boot loader in the usual way.  After the boot kernel
has loaded the image, it needs to replace its own code and data with the code
and data of the "hibernated" kernel stored within the image, called the image
kernel.  For this purpose all devices are frozen just like before creating
the image during hibernation, in the

        prepare, freeze, freeze_noirq

phases described above.  However, the devices affected by these phases are only
those having drivers in the boot kernel; other devices will still be in whatever
state the boot loader left them.

Should the restoration of the pre-hibernation memory contents fail, the boot
kernel would go through the "thawing" procedure described above, using the
thaw_noirq, thaw, and complete phases (that will only affect the devices having
drivers in the boot kernel), and then continue running normally.

If the pre-hibernation memory contents are restored successfully, which is the
usual situation, control is passed to the image kernel, which then becomes
responsible for bringing the system back to the working state.  To achieve this,
it must restore the devices' pre-hibernation functionality, which is done much
like waking up from the memory sleep state, although it involves different
phases:

        restore_noirq, restore, complete

The first two of these are analogous to the resume_noirq and resume phases
described above, respectively, and correspond to the following PCI subsystem
callbacks:

        pci_pm_restore_noirq()
        pci_pm_restore()

These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(),
respectively, but they execute the device driver's pm->restore_noirq() and
pm->restore() callbacks, if available.

The complete phase is carried out in exactly the same way as during system
resume.
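
The two possible outcomes of system restore differ only in the phases that
follow the freezing of devices by the boot kernel.  A minimal sketch of that
branch is given below; the array names and model_restore_phases() are
hypothetical, and only the phase names come from the text.

```c
#include <stdbool.h>
#include <stddef.h>

/* Phases run when the image has been restored successfully. */
static const char *const restore_ok[] = {
	"prepare", "freeze", "freeze_noirq",	/* boot kernel */
	"restore_noirq", "restore", "complete",	/* image kernel */
	NULL
};

/* Phases run when restoring the memory contents has failed. */
static const char *const restore_failed[] = {
	"prepare", "freeze", "freeze_noirq",	/* boot kernel */
	"thaw_noirq", "thaw", "complete",	/* boot kernel recovers */
	NULL
};

/* Return the NULL-terminated phase list for the given outcome. */
static const char *const *model_restore_phases(bool image_restored)
{
	return image_restored ? restore_ok : restore_failed;
}
```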

3. PCI Device Drivers and Power Management

3.1. Power Management Callbacks

PCI device drivers participate in power management by providing callbacks to be
executed by the PCI subsystem's power management routines described above and by
controlling the runtime power management of their devices.

At the time of this writing there are two ways to define power management
callbacks for a PCI device driver: the recommended one, based on using a
dev_pm_ops structure described in Documentation/power/devices.txt, and the
"legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and
.resume() callbacks from struct pci_driver are used.  The legacy approach,
however, doesn't allow one to define runtime power management callbacks and is
not really suitable for any new drivers.  Therefore it is not covered by this
document (refer to the source code to learn more about it).

It is recommended that all PCI device drivers define a struct dev_pm_ops object
containing pointers to power management (PM) callbacks that will be executed by
the PCI subsystem's PM routines in various circumstances.  A pointer to the
driver's struct dev_pm_ops object has to be assigned to the appropriate field in
its struct pci_driver object.  Once that has happened, the "legacy" PM callbacks
in struct pci_driver are ignored (even if they are not NULL).
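
The precedence rule stated above (legacy callbacks are ignored once a
struct dev_pm_ops pointer has been assigned) can be modelled in a few lines.
The structures below are simplified userspace stand-ins, the member name pm
is an assumption of this sketch, and model_uses_legacy_pm() is a hypothetical
helper, not a kernel function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct dev_pm_ops. */
struct model_pm_ops {
	int (*suspend)(void);
	int (*resume)(void);
};

/* Simplified stand-in for struct pci_driver. */
struct model_pci_driver {
	/* "Legacy" callbacks named in the text. */
	int (*suspend)(void);
	int (*suspend_late)(void);
	int (*resume_early)(void);
	int (*resume)(void);
	/* Pointer to the recommended callbacks (member name assumed). */
	const struct model_pm_ops *pm;
};

/* Example callback used to populate the structures. */
static int demo_cb(void)
{
	return 0;
}

/* Legacy callbacks are only used when no dev_pm_ops-style object is set. */
static bool model_uses_legacy_pm(const struct model_pci_driver *drv)
{
	bool has_legacy = drv && (drv->suspend || drv->suspend_late ||
				  drv->resume_early || drv->resume);

	return has_legacy && !drv->pm;
}
```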

The PM callbacks in struct dev_pm_ops are not mandatory and if they are not
defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI
subsystem will handle the device in a simplified default manner.  If they are
defined, though, they are expected to behave as described in the following
subsections.

3.1.1. prepare()

The prepare() callback is executed during system suspend, during hibernation
(when a hibernation image is about to be created), during power-off after
saving a hibernation image and during system restore, when a hibernation image
has just been loaded into memory.

This callback is only necessary if the driver's device has children that in
general may be registered at any time.  In that case the role of the prepare()
callback is to prevent new children of the device from being registered until
one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run.

In addition to that the prepare() callback may carry out some operations
preparing the device to be suspended, although it should not allocate memory
(if additional memory is required to suspend the device, it has to be
preallocated earlier, for example in a suspend/hibernate notifier as described
in Documentation/power/notifiers.txt).
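
One way to picture the "no new children" rule is a flag that prepare() sets
and resume_noirq() clears.  The sketch below is a userspace model under that
assumption; struct model_parent and all function names are hypothetical, the
-EBUSY return value is a choice made for this sketch, and a real driver would
also need locking around the flag.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical parent device that may gain children at any time. */
struct model_parent {
	bool children_blocked;	/* set between prepare() and resume_noirq() */
	int nr_children;
};

/* prepare(): stop accepting new children. */
static int model_prepare(struct model_parent *p)
{
	p->children_blocked = true;
	return 0;
}

/* resume_noirq() (or thaw_noirq()/restore_noirq()): accept children again. */
static int model_resume_noirq(struct model_parent *p)
{
	p->children_blocked = false;
	return 0;
}

/* Registration path: refuse new children while a transition is under way. */
static int model_register_child(struct model_parent *p)
{
	if (p->children_blocked)
		return -EBUSY;

	p->nr_children++;
	return 0;
}
```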

3.1.2. suspend()

The suspend() callback is only executed during system suspend, after prepare()
callbacks have been executed for all devices in the system.

This callback is expected to quiesce the device and prepare it to be put into a
low-power state by the PCI subsystem.  It is not required (in fact it is not
even recommended) that a PCI driver's suspend() callback save the standard
configuration registers of the device, prepare it for waking up the system, or
put it into a low-power state.  All of these operations can very well be taken
care of by the PCI subsystem, without the driver's participation.

However, in some rare cases it is convenient to carry out these operations in
a PCI driver.  Then, pci_save_state(), pci_prepare_to_sleep(), and
pci_set_power_state() should be used to save the device's standard configuration
registers, to prepare it for system wakeup (if necessary), and to put it into a
low-power state, respectively.  Moreover, if the driver calls pci_save_state(),
the PCI subsystem will not execute either pci_prepare_to_sleep() or
pci_set_power_state() for its device, so the driver is then responsible for
handling the device as appropriate.
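
The consequence of calling pci_save_state() from a driver can be illustrated
with a small userspace model.  It is only a sketch: the structure, the
state_saved flag, the choice of D3hot as the low-power state, and all function
names are assumptions made for this example, not the kernel's implementation.

```c
#include <stdbool.h>

enum model_power { MODEL_D0, MODEL_D3HOT };

struct model_dev {
	bool state_saved;	/* standard registers have been saved */
	bool wakeup_prepared;	/* device prepared for system wakeup */
	enum model_power power;
};

/* Driver that saves the state itself but forgets the remaining steps. */
static void model_driver_suspend_incomplete(struct model_dev *dev)
{
	dev->state_saved = true;	/* models pci_save_state() */
}

/* Driver that takes over completely, as the text requires in that case. */
static void model_driver_suspend_manual(struct model_dev *dev)
{
	dev->state_saved = true;	/* models pci_save_state() */
	dev->wakeup_prepared = true;	/* models pci_prepare_to_sleep() */
	dev->power = MODEL_D3HOT;	/* models pci_set_power_state() */
}

/*
 * Subsystem-level step run after the driver's callback.  Returns true if
 * the subsystem handled the device, false if it left it to the driver.
 */
static bool model_subsystem_suspend(struct model_dev *dev)
{
	if (dev->state_saved)
		return false;	/* the driver saved the state: hands off */

	dev->state_saved = true;
	dev->wakeup_prepared = true;
	dev->power = MODEL_D3HOT;
	return true;
}
```

In this model a device handled by model_driver_suspend_incomplete() stays in
the full-power state, which is why leaving all three operations to the PCI
subsystem is the recommended default.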

While the suspend() callback is being executed, the driver's interrupt handler
can be invoked to handle an interrupt from the device, so all suspend-related
operations relying on the driver's ability to handle interrupts should be
carried out in this callback.

3.1.3. suspend_noirq()

The suspend_noirq() callback is only executed during system suspend, after
suspend() callbacks have been executed for all devices in the system and
after device interrupts have been disabled by the PM core.

The difference between suspend_noirq() and suspend() is that the driver's
interrupt handler will not be invoked while suspend_noirq() is running.  Thus
suspend_noirq() can carry out operations that would cause race conditions to
arise if they were performed in suspend().

3.1.4. freeze()

The freeze() callback is hibernation-specific and is executed in two situations:
during hibernation, after prepare() callbacks have been executed for all devices
in preparation for the creation of a system image, and during restore,
after a system image has been loaded into memory from persistent storage and the
prepare() callbacks have been executed for all devices.

The role of this callback is analogous to the role of the suspend() callback
described above.  In fact, they only need to be different in the rare cases when
the driver takes the responsibility for putting the device into a low-power
state.

In those cases the freeze() callback should not prepare the device for system
wakeup or put it into a low-power state.  Still, either it or freeze_noirq()
should save the device's standard configuration registers using
pci_save_state().

3.1.5. freeze_noirq()

The freeze_noirq() callback is hibernation-specific.  It is executed during
hibernation, after prepare() and freeze() callbacks have been executed for all
devices in preparation for the creation of a system image, and during restore,
after a system image has been loaded into memory and after prepare() and
freeze() callbacks have been executed for all devices.  It is always executed
after device interrupts have been disabled by the PM core.

The role of this callback is analogous to the role of the suspend_noirq()
callback described above and it is very rarely necessary to define it.

The difference between freeze_noirq() and freeze() is analogous to the
difference between suspend_noirq() and suspend().

3.1.6. poweroff()

The poweroff() callback is hibernation-specific.  It is executed when the system
is about to be powered off after saving a hibernation image to persistent
storage.  prepare() callbacks are executed for all devices before poweroff() is
executed.

The role of this callback is analogous to the role of the suspend() and freeze()
callbacks described above, although it does not need to save the contents of
the device's registers.  In particular, if the driver wants to put the device
into a low-power state itself instead of allowing the PCI subsystem to do that,
the poweroff() callback should use pci_prepare_to_sleep() and
pci_set_power_state() to prepare the device for system wakeup and to put it
into a low-power state, respectively, but it need not save the device's standard
configuration registers.

3.1.7. poweroff_noirq()

The poweroff_noirq() callback is hibernation-specific.  It is executed after
poweroff() callbacks have been executed for all devices in the system.

The role of this callback is analogous to the role of the suspend_noirq() and
freeze_noirq() callbacks described above, but it does not need to save the
contents of the device's registers.

The difference between poweroff_noirq() and poweroff() is analogous to the
difference between suspend_noirq() and suspend().

3.1.8. resume_noirq()

The resume_noirq() callback is only executed during system resume, after the
PM core has enabled the non-boot CPUs.  The driver's interrupt handler will not
be invoked while resume_noirq() is running, so this callback can carry out
operations that might race with the interrupt handler.

Since the PCI subsystem unconditionally puts all devices into the full power
state in the resume_noirq phase of system resume and restores their standard
configuration registers, resume_noirq() is usually not necessary.  In general
it should only be used for performing operations that would lead to race
conditions if carried out by resume().

3.1.9. resume()

The resume() callback is only executed during system resume, after
resume_noirq() callbacks have been executed for all devices in the system and
device interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-suspend configuration of the
device and bringing it back to the fully functional state.  The device should be
able to process I/O in a usual way after resume() has returned.

3.1.10. thaw_noirq()

The thaw_noirq() callback is hibernation-specific.  It is executed after a
system image has been created and the non-boot CPUs have been enabled by the PM
core, in the thaw_noirq phase of hibernation.  It also may be executed if the
loading of a hibernation image fails during system restore (it is then executed
after enabling the non-boot CPUs).  The driver's interrupt handler will not be
invoked while thaw_noirq() is running.

The role of this callback is analogous to the role of resume_noirq().  The
difference between these two callbacks is that thaw_noirq() is executed after
freeze() and freeze_noirq(), so in general it does not need to modify the
contents of the device's registers.

3.1.11. thaw()

The thaw() callback is hibernation-specific.  It is executed after thaw_noirq()
callbacks have been executed for all devices in the system and after device
interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-freeze configuration of
the device, so that it will work in a usual way after thaw() has returned.

3.1.12. restore_noirq()

The restore_noirq() callback is hibernation-specific.  It is executed in the
restore_noirq phase of system restore, when the boot kernel has passed control
to the image kernel and the non-boot CPUs have been enabled by the image
kernel's PM core.

This callback is analogous to resume_noirq() with the exception that it cannot
make any assumption on the previous state of the device, even if the BIOS (or
generally the platform firmware) is known to preserve that state over a
suspend-resume cycle.

For the vast majority of PCI device drivers there is no difference between
resume_noirq() and restore_noirq().

3.1.13. restore()

The restore() callback is hibernation-specific.  It is executed after
restore_noirq() callbacks have been executed for all devices in the system and
after the PM core has enabled device drivers' interrupt handlers to be invoked.

This callback is analogous to resume(), just like restore_noirq() is analogous
to resume_noirq().  Consequently, the difference between restore_noirq() and
restore() is analogous to the difference between resume_noirq() and resume().

For the vast majority of PCI device drivers there is no difference between
resume() and restore().

3.1.14. complete()

The complete() callback is executed in the following situations:
  - during system resume, after resume() callbacks have been executed for all
    devices,
  - during hibernation, before saving the system image, after thaw() callbacks
    have been executed for all devices,
  - during system restore, when the system is going back to its pre-hibernation
    state, after restore() callbacks have been executed for all devices.
It also may be executed if the loading of a hibernation image into memory fails
(in that case it is run after thaw() callbacks have been executed for all
devices that have drivers in the boot kernel).

This callback is entirely optional, although it may be necessary if the
prepare() callback performs operations that need to be reversed.

3.1.15. runtime_suspend()

The runtime_suspend() callback is specific to device runtime power management
(runtime PM).  It is executed by the PM core's runtime PM framework when the
device is about to be suspended (i.e. quiesced and put into a low-power state)
at run time.

This callback is responsible for freezing the device and preparing it to be
put into a low-power state, but it must allow the PCI subsystem to perform all
of the PCI-specific actions necessary for suspending the device.

3.1.16. runtime_resume()

The runtime_resume() callback is specific to device runtime PM.  It is executed
by the PM core's runtime PM framework when the device is about to be resumed
(i.e. put into the full-power state and programmed to process I/O normally) at
run time.

This callback is responsible for restoring the normal functionality of the
device after it has been put into the full-power state by the PCI subsystem.
The device is expected to be able to process I/O in the usual way after
runtime_resume() has returned.

3.1.17. runtime_idle()

The runtime_idle() callback is specific to device runtime PM.  It is executed
by the PM core's runtime PM framework whenever it may be desirable to suspend
the device according to the PM core's information.  In particular, it is
automatically executed right after runtime_resume() has returned in case the
resume of the device has happened as a result of a spurious event.

This callback is optional, but if it is not implemented or if it returns 0, the
PCI subsystem will call pm_runtime_suspend() for the device, which in turn will
cause the driver's runtime_suspend() callback to be executed.
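
The idle rule above (a missing runtime_idle() callback, or one that returns 0,
leads to a suspend attempt) can be sketched as follows.  This is a userspace
model with hypothetical names; model_runtime_suspend() merely stands in for
pm_runtime_suspend(), and returning -ENOSYS when no runtime_suspend() callback
exists is a choice made for this sketch only.

```c
#include <errno.h>
#include <stddef.h>

struct model_dev;

struct model_rpm_ops {
	int (*runtime_idle)(struct model_dev *dev);	/* optional */
	int (*runtime_suspend)(struct model_dev *dev);
};

struct model_dev {
	const struct model_rpm_ops *ops;
	int busy;	/* nonzero while the driver still needs the device */
	int suspended;	/* set once the device has been runtime-suspended */
};

/* Stand-in for pm_runtime_suspend(): run the driver callback, then suspend. */
static int model_runtime_suspend(struct model_dev *dev)
{
	int ret;

	if (!dev->ops || !dev->ops->runtime_suspend)
		return -ENOSYS;	/* choice made for this model only */

	ret = dev->ops->runtime_suspend(dev);
	if (ret)
		return ret;

	dev->suspended = 1;
	return 0;
}

/* Idle check as described above: a missing callback or a 0 result suspends. */
static int model_pci_runtime_idle(struct model_dev *dev)
{
	if (dev->ops && dev->ops->runtime_idle) {
		int ret = dev->ops->runtime_idle(dev);

		if (ret)
			return ret;	/* the driver vetoed the suspend */
	}
	return model_runtime_suspend(dev);
}

/* Example driver callbacks. */
static int demo_runtime_idle(struct model_dev *dev)
{
	return dev->busy ? -EBUSY : 0;
}

static int demo_runtime_suspend(struct model_dev *dev)
{
	(void)dev;
	return 0;
}
```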

3.1.18. Pointing Multiple Callback Pointers to One Routine

Although in principle each of the callbacks described in the previous
subsections can be defined as a separate function, it is often convenient to
point two or more members of struct dev_pm_ops to the same routine.  There are
a few convenience macros that can be used for this purpose.

The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one
suspend routine pointed to by the .suspend(), .freeze(), and .poweroff()
members and one resume routine pointed to by the .resume(), .thaw(), and
.restore() members.  The other function pointers in this struct dev_pm_ops are
unset.

The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it
additionally sets the .runtime_resume() pointer to the same value as
.resume() (and .thaw() and .restore()) and the .runtime_suspend() pointer to
the same value as .suspend() (and .freeze() and .poweroff()).

The SET_SYSTEM_SLEEP_PM_OPS macro can be used inside of a declaration of struct
dev_pm_ops to indicate that one suspend routine is to be pointed to by the
.suspend(), .freeze(), and .poweroff() members and one resume routine is to
be pointed to by the .resume(), .thaw(), and .restore() members.
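
The effect of these macros can be reproduced in a self-contained userspace
sketch.  The MODEL_* macros and struct model_dev_pm_ops below are hypothetical
stand-ins written for this example; the argument lists and definitions of the
real kernel macros may differ, so refer to the kernel headers for those.

```c
#include <stddef.h>

struct model_dev;
typedef int (*model_pm_cb)(struct model_dev *dev);

/* Reduced stand-in for struct dev_pm_ops. */
struct model_dev_pm_ops {
	model_pm_cb suspend, resume, freeze, thaw, poweroff, restore;
	model_pm_cb runtime_suspend, runtime_resume, runtime_idle;
};

/* One suspend routine and one resume routine for all system sleep members. */
#define MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	.suspend = suspend_fn, \
	.resume = resume_fn, \
	.freeze = suspend_fn, \
	.thaw = resume_fn, \
	.poweroff = suspend_fn, \
	.restore = resume_fn,

/* Declares an object with only the system sleep members set. */
#define MODEL_SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
	const struct model_dev_pm_ops name = { \
		MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
	}

/* Same as above, plus the runtime suspend and resume members. */
#define MODEL_UNIVERSAL_DEV_PM_OPS(name, suspend_fn, resume_fn) \
	const struct model_dev_pm_ops name = { \
		MODEL_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
		.runtime_suspend = suspend_fn, \
		.runtime_resume = resume_fn, \
	}

/* Example driver routines. */
static int demo_suspend(struct model_dev *dev)
{
	(void)dev;
	return 0;
}

static int demo_resume(struct model_dev *dev)
{
	(void)dev;
	return 0;
}

static MODEL_SIMPLE_DEV_PM_OPS(demo_simple_ops, demo_suspend, demo_resume);
static MODEL_UNIVERSAL_DEV_PM_OPS(demo_universal_ops, demo_suspend, demo_resume);
```

Members that are not named in an initializer are left NULL, which is how the
"other function pointers are unset" behaviour falls out of the C language
rules rather than from anything the macros do explicitly.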

3.2. Device Runtime Power Management

In addition to providing device power management callbacks PCI device drivers
are responsible for controlling the runtime power management (runtime PM) of
their devices.

The PCI device runtime PM is optional, but it is recommended that PCI device
drivers implement it at least in the cases where there is a reliable way of
verifying that the device is not used (like when the network cable is detached
from an Ethernet adapter or there are no devices attached to a USB controller).

To support the PCI runtime PM the driver first needs to implement the
runtime_suspend() and runtime_resume() callbacks.  It also may need to implement
the runtime_idle() callback to prevent the device from being suspended again
every time right after the runtime_resume() callback has returned
(alternatively, the runtime_suspend() callback will have to check if the
device should really be suspended and return -EAGAIN if that is not the case).
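
The alternative mentioned in parentheses, a runtime_suspend() callback that
refuses to suspend a device still in use, might look like the sketch below.
It is a userspace model built around the Ethernet example from the text;
struct model_nic and both functions are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical Ethernet adapter state. */
struct model_nic {
	bool link_up;	/* a cable is attached, so the device is in use */
	bool quiesced;	/* the device has been prepared for low power */
};

/* Refuse to suspend while the link is up, as the text suggests. */
static int model_nic_runtime_suspend(struct model_nic *nic)
{
	if (nic->link_up)
		return -EAGAIN;

	nic->quiesced = true;
	return 0;
}

/* Bring the device back to normal operation. */
static int model_nic_runtime_resume(struct model_nic *nic)
{
	nic->quiesced = false;
	return 0;
}
```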

The runtime PM of PCI devices is disabled by default.  It is also blocked by
pci_pm_init() that runs the pm_runtime_forbid() helper function.  If a PCI
driver implements the runtime PM callbacks and intends to use the runtime PM
framework provided by the PM core and the PCI subsystem, it should enable this
feature by executing the pm_runtime_enable() helper function.  However, the
driver should not call the pm_runtime_allow() helper function unblocking
the runtime PM of the device.  Instead, it should allow user space or some
platform-specific code to do that (user space can do it via sysfs), although
once it has called pm_runtime_enable(), it must be prepared to handle the
runtime PM of the device correctly as soon as pm_runtime_allow() is called
(which may happen at any time).  [It also is possible that user space causes
pm_runtime_allow() to be called via sysfs before the driver is loaded, so in
fact the driver has to be prepared to handle the runtime PM of the device as
soon as it calls pm_runtime_enable().]
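
The interplay of the two independent switches described above can be modelled
with two flags.  This is a deliberately simplified userspace sketch: the
structure and the model_* functions are hypothetical, and the model says
nothing about how the real framework tracks these conditions internally.

```c
#include <stdbool.h>

/* Two independent conditions gate the runtime PM of a device. */
struct model_rpm {
	bool enabled;	/* controlled by the driver */
	bool allowed;	/* controlled by user space or platform code */
};

/* Models the initial state: disabled by default and blocked (forbidden). */
static void model_pci_pm_init(struct model_rpm *r)
{
	r->enabled = false;
	r->allowed = false;
}

/* Models the driver calling pm_runtime_enable(). */
static void model_pm_runtime_enable(struct model_rpm *r)
{
	r->enabled = true;
}

/* Models user space (via sysfs) causing pm_runtime_allow() to be called. */
static void model_pm_runtime_allow(struct model_rpm *r)
{
	r->allowed = true;
}

/* Models pm_runtime_forbid() blocking runtime PM again. */
static void model_pm_runtime_forbid(struct model_rpm *r)
{
	r->allowed = false;
}

/* The device may only be runtime-suspended when both conditions hold. */
static bool model_can_runtime_suspend(const struct model_rpm *r)
{
	return r->enabled && r->allowed;
}
```

If the allow step happens before the driver is loaded, the device becomes
eligible for runtime suspend at the very moment the enable step runs, which is
the situation the bracketed remark above warns about.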

The runtime PM framework works by processing requests to suspend or resume
devices, or to check if they are idle (in which case it is reasonable to
subsequently request that they be suspended).  These requests are represented
by work items put into the power management workqueue, pm_wq.  Although there
are a few situations in which power management requests are automatically
queued by the PM core (for example, after processing a request to resume a
device the PM core automatically queues a request to check if the device is
idle), device drivers are generally responsible for queuing power management
requests for their devices.  For this purpose they should use the runtime PM
helper functions provided by the PM core, discussed in

Devices can also be suspended and resumed synchronously, without placing a
request into pm_wq.  In the majority of cases this also is done by their
drivers that use helper functions provided by the PM core for this purpose.

For more information on the runtime PM of devices refer to

4. Resources

PCI Local Bus Specification, Rev. 3.0
PCI Bus Power Management Interface Specification, Rev. 1.2
Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b
PCI Express Base Specification, Rev. 2.0