                      PCI Bus EEH Error Recovery
                      --------------------------
                           Linas Vepstas
                          12 January 2005

The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions.  These features
go under the name of "EEH", for "Extended Error Handling".  The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.

This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to data corruption (of both user data and kernel data),
hung/unresponsive adapters, or system crashes/lockups.  Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by protecting it from PCI errors, and giving
the OS the ability to "reboot"/recover individual PCI devices.

Future systems from other vendors, based on the PCI-E specification,
may contain similar features.


Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections.  The vast majority of EEH errors seen in
"real life" are due to either poorly seated PCI cards or,
unfortunately quite commonly, device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card.  Catching such accesses is a
powerful feature, as it prevents what would otherwise have been
silent memory corruption caused by the bad DMA.  A number of device
driver bugs have been found and fixed in this way over the past few
years.  Other possible causes of EEH errors include data or
address line parity errors (for example, due to poor electrical
connectivity caused by a poorly seated card), and PCI-X split-completion
errors (due to software, device firmware, or device PCI hardware bugs).
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.


Detection and Recovery
----------------------
The following discussion first gives a generic overview of how to
detect and recover from EEH errors, and then an overview of how the
current implementation in the Linux kernel does it.  The actual
implementation is subject to change, and some of the finer points
are still being debated.  These may in turn be swayed if or when
other architectures implement similar functionality.

When a PCI Host Bridge (PHB, the bus controller connecting the
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card.  Isolation
will block all writes (either to the card from the system, or
from the card to the system), and it will cause all reads to
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device were physically unplugged from the slot.
Isolation applies to PCI memory, I/O space, and PCI config
space.  Interrupts, however, will continue to be delivered.

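The isolation behavior just described can be modeled in a few lines of
user-space C.  This is only an illustrative sketch: struct sim_slot,
sim_read32() and sim_write32() are hypothetical names invented for this
example, and are not kernel or firmware interfaces.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_REGS 4

/* Hypothetical model of one PCI slot behind a PHB. */
struct sim_slot {
    bool isolated;            /* set once the PHB has isolated the slot */
    uint32_t regs[SIM_REGS];  /* stand-in for the device's registers */
};

/* Reads from an isolated slot return all ones, as if the card were unplugged. */
uint32_t sim_read32(const struct sim_slot *slot, size_t idx)
{
    if (slot->isolated || idx >= SIM_REGS)
        return 0xffffffffu;
    return slot->regs[idx];
}

/* Writes to an isolated slot are silently dropped; returns true if stored. */
bool sim_write32(struct sim_slot *slot, size_t idx, uint32_t val)
{
    if (slot->isolated || idx >= SIM_REGS)
        return false;
    slot->regs[idx] = val;
    return true;
}
```

Note that an all-ones read is ambiguous in this model, just as on real
hardware: a healthy device may legitimately return 0xffffffff, so a
separate query is needed to confirm that the slot is isolated.
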
Detection and recovery are performed with the aid of ppc64
firmware.  The programming interfaces in the Linux kernel
into the firmware are referred to as RTAS (Run-Time Abstraction
Services).  The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because
there are a number of different chipsets out there, each with
different interfaces and quirks.  The firmware provides a
uniform abstraction layer that will work with all pSeries
and iSeries hardware (and be forwards-compatible).

If the OS or device driver suspects that a PCI slot has been
EEH-isolated, there is a firmware call it can make to determine if
this is the case.  If so, then the device driver should put itself
into a consistent state (given that it won't be able to complete any
pending work) and start recovery of the card.  Recovery normally
would consist of resetting the PCI device (asserting the PCI #RST
line for two seconds), followed by setting up the device
config space (the base address registers (BARs), latency timer,
cache line size, interrupt line, and so on).  This is followed by a
reinitialization of the device driver.  In a worst-case scenario,
the power to the card can be toggled, at least on hot-plug-capable
slots.  In principle, layers far above the device driver probably
do not need to know that the PCI card has been "rebooted" in this
way; ideally, there should be at most a pause in Ethernet/disk/USB
I/O while the card is being reset.

If the card cannot be recovered after three or four resets, the
kernel/device driver should assume the worst-case scenario, that the
card has died completely, and report this error to the sysadmin.
In addition, error messages are reported through RTAS and also through
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
The correct way to deal with failed adapters is to use the standard
PCI hotplug tools to remove and replace the dead card.

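The "give up after a few resets" policy can be sketched as a bounded
retry loop.  This is a user-space illustration under stated assumptions:
MAX_RESETS is set to 3 here (the text says "three or four"), and
recover_card(), reset_fn and flaky_reset() are hypothetical names, not
existing kernel functions.

```c
#include <stdbool.h>

#define MAX_RESETS 3   /* assumed limit; the text says "three or four" */

/* Hypothetical callback: reset and re-initialize the card; true on success. */
typedef bool (*reset_fn)(void *ctx);

/*
 * Try to recover a card by resetting it up to MAX_RESETS times.
 * Returns the number of resets used on success, or -1 if the card
 * should be treated as dead and reported to the sysadmin.
 */
int recover_card(reset_fn reset, void *ctx)
{
    for (int attempt = 1; attempt <= MAX_RESETS; attempt++) {
        if (reset(ctx))
            return attempt;
    }
    return -1;
}

/* Demo callback: fails while *ctx (an int count of remaining failures) > 0. */
bool flaky_reset(void *ctx)
{
    int *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        return false;
    }
    return true;
}
```

A caller would report the dead card (for example via syslog) when
recover_card() returns -1, and leave replacement to the hotplug tools.
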

Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism has been implemented,
so that individual device drivers do not need to be modified to support
EEH recovery.  This generic mechanism piggy-backs on the PCI hotplug
infrastructure, and percolates events up through the userspace/udev
infrastructure.  Following is a detailed description of how this is
accomplished.

EEH must be enabled in the PHBs very early during the boot process,
and again whenever a PCI slot is hot-plugged.  The former is performed
by eeh_init() in arch/powerpc/platforms/pseries/eeh.c, and the latter by
drivers/pci/hotplug/pSeries_pci.c calling in to the eeh.c code.
EEH must be enabled before a PCI scan of the device can proceed.
Current Power5 hardware will not work unless EEH is enabled,
although older Power4 can run with it disabled.  Effectively,
EEH can no longer be turned off.  PCI devices *must* be
registered with the EEH code; the EEH code needs to know about
the I/O address ranges of the PCI device in order to detect an
error.  Given an arbitrary address, the routine
pci_get_device_by_addr() will find the PCI device associated
with that address (if any).

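The idea of mapping an arbitrary address back to its owning device can
be illustrated with a small table of registered ranges.  This is a
simplified user-space sketch, not the kernel's data structure: struct
addr_range and lookup_dev_by_addr() are hypothetical names, and a linear
scan is used purely for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one I/O address range registered for a device. */
struct addr_range {
    uint64_t start;    /* first address in the range */
    uint64_t len;      /* length of the range in bytes */
    const char *dev;   /* device name, e.g. a PCI bus address string */
};

/* Return the device owning addr, or NULL if no registered range covers it. */
const char *lookup_dev_by_addr(const struct addr_range *tbl, size_t n,
                               uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        /* Written as a subtraction so start + len cannot overflow. */
        if (addr >= tbl[i].start && addr - tbl[i].start < tbl[i].len)
            return tbl[i].dev;
    }
    return NULL;
}
```

A lookup that returns NULL corresponds to the "(if any)" case above: the
address does not belong to any registered PCI device.
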
The default arch/powerpc/include/asm/io.h macros readb(), inb(), insb(),
etc. include a check to see if the I/O read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error.  If it is not, processing continues as normal.  The grand
total number of these false alarms or "false positives" can be
seen in /proc/ppc64/eeh (subject to change).  Normally, almost
all of these occur during boot, when the PCI bus is scanned, where
a large number of 0xff reads are part of the bus scan procedure.

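The check-on-read pattern can be sketched as follows.  This is a
user-space illustration only: check_read32(), struct eeh_stats and
demo_slot_frozen() are hypothetical names, and the callback stands in
for the firmware query; none of this is the actual io.h or eeh.c code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the firmware query; true means slot is frozen. */
typedef bool (*slot_frozen_fn)(void *slot);

struct eeh_stats {
    unsigned long false_positives;  /* all-ones reads that were not errors */
    unsigned long frozen_detected;  /* all-ones reads confirmed as errors */
};

/*
 * Examine a value just read from a device.  Only an all-ones value
 * triggers the firmware query, so the common case costs one compare.
 * Returns true if the slot is confirmed frozen.
 */
bool check_read32(uint32_t val, slot_frozen_fn frozen, void *slot,
                  struct eeh_stats *st)
{
    if (val != 0xffffffffu)
        return false;           /* common case: no firmware call */
    if (frozen(slot)) {
        st->frozen_detected++;
        return true;
    }
    st->false_positives++;      /* the device really returned all ones */
    return false;
}

/* Demo query: treats slot as a pointer to a bool "is frozen" flag. */
bool demo_slot_frozen(void *slot)
{
    return *(bool *)slot;
}
```

The false_positives counter plays the role of the false-alarm total
mentioned above.
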
If a frozen slot is detected, code in
arch/powerpc/platforms/pseries/eeh.c will print a stack trace to
syslog (/var/log/messages).  This stack trace has proven to be very
useful to device-driver authors for finding out at what point the EEH
error was detected, as the error itself usually occurs slightly
beforehand.

Next, it uses the Linux kernel notifier chain/work queue mechanism to
allow any interested parties to find out about the failure.  Device
drivers, or other parts of the kernel, can use
eeh_register_notifier(struct notifier_block *) to find out about EEH
events.  The event will include a pointer to the PCI device, the
device node and some state info.  Receivers of the event can "do as
they wish"; the default handler will be described further in this
section.

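The notifier-chain idea (register a callback, then have every
registered callback invoked when an event occurs) can be shown with a
minimal user-space model.  This is a sketch loosely inspired by the
kernel's notifier_block, not the kernel API itself: struct sim_notifier,
sim_register(), sim_notify() and demo_receiver() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical, simplified notifier: one callback plus a list link. */
struct sim_notifier {
    void (*call)(struct sim_notifier *self, unsigned long event, void *data);
    struct sim_notifier *next;
};

/* Prepend a notifier to the chain. */
void sim_register(struct sim_notifier **head, struct sim_notifier *nb)
{
    nb->next = *head;
    *head = nb;
}

/* Deliver an event to every registered notifier; returns how many ran. */
int sim_notify(struct sim_notifier *head, unsigned long event, void *data)
{
    int n = 0;

    for (struct sim_notifier *nb = head; nb != NULL; nb = nb->next) {
        nb->call(nb, event, data);
        n++;
    }
    return n;
}

/* Demo receiver: adds the event code to the unsigned long behind data. */
void demo_receiver(struct sim_notifier *self, unsigned long event, void *data)
{
    (void)self;
    *(unsigned long *)data += event;
}
```

In this model the data pointer would carry what the text describes: the
affected PCI device, its device node and some state information.
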
To assist in the recovery of the device, eeh.c exports the
following functions:

rtas_set_slot_reset() -- assert the PCI #RST line for 1/8th of a second
rtas_configure_bridge() -- ask firmware to configure any PCI bridges
   located topologically under the PCI slot.
eeh_save_bars() and eeh_restore_bars() -- save and restore the PCI
   config-space info for a device and any devices under it.

A handler for the EEH notifier_block events is implemented in
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
It saves the device BARs and then calls rpaphp_unconfig_pci_adapter().
This last call causes the device driver for the card to be stopped,
which causes uevents to go out to user space.  This triggers
user-space scripts that might issue commands such as "ifdown eth0"
for ethernet cards, and so on.  This handler then sleeps for 5 seconds,
hoping to give the user-space scripts enough time to complete.
It then resets the PCI card, reconfigures the device BARs, and
any bridges underneath.  It then calls rpaphp_enable_pci_slot(),
which restarts the device driver and triggers more user-space
events (for example, calling "ifup eth0" for ethernet cards).

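The order of operations in the handler can be summarized as a fixed
sequence of steps.  The sketch below is a user-space model of that
ordering only: enum recovery_step, run_recovery() and demo_step() are
hypothetical names, and stopping at the first failed step is an
assumption of this example rather than something the text specifies.

```c
#include <stdbool.h>

/* The steps named in the text, in the order the handler performs them. */
enum recovery_step {
    STEP_SAVE_BARS,
    STEP_UNCONFIG_ADAPTER,   /* stops the driver, sends uevents */
    STEP_WAIT_USERSPACE,     /* the 5 second sleep */
    STEP_RESET_SLOT,
    STEP_RESTORE_BARS,
    STEP_CONFIGURE_BRIDGES,
    STEP_ENABLE_SLOT,        /* restarts the driver, more uevents */
    STEP_COUNT
};

/* Hypothetical callback that performs one step; true on success. */
typedef bool (*step_fn)(enum recovery_step step, void *ctx);

/*
 * Run all steps in order.  Returns STEP_COUNT if every step succeeded,
 * otherwise the index of the first step that failed (assumed policy).
 */
int run_recovery(step_fn do_step, void *ctx)
{
    for (int s = 0; s < STEP_COUNT; s++) {
        if (!do_step((enum recovery_step)s, ctx))
            return s;
    }
    return STEP_COUNT;
}

/* Demo state: records the steps seen and fails at fail_at (-1 = never). */
struct demo_log {
    int seen[STEP_COUNT];
    int count;
    int fail_at;
};

bool demo_step(enum recovery_step step, void *ctx)
{
    struct demo_log *log = ctx;

    log->seen[log->count++] = (int)step;
    return (int)step != log->fail_at;
}
```
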

Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a PCI slot is unconfigured,
focusing on how the device driver gets shut down, and on how the
events get delivered to user-space scripts.

Following is an example sequence of calls that causes a device driver
close function to be called during the first phase of an EEH reset.
The example uses the pcnet32 device driver.

    rpaphp_unconfig_pci_adapter (struct slot *)  // in rpaphp_pci.c
    {
      calls
      pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c
      {
        calls
        pci_destroy_dev (struct pci_dev *)
        {
          calls
          device_unregister (&dev->dev) // in /drivers/base/core.c
          {
            calls
            device_del (struct device *)
            {
              calls
              bus_remove_device() // in /drivers/base/bus.c
              {
                calls
                device_release_driver()
                {
                  calls
                  struct device_driver->remove() which is just
                  pci_device_remove()  // in /drivers/pci/pci_driver.c
                  {
                    calls
                    struct pci_driver->remove() which is just
                    pcnet32_remove_one() // in /drivers/net/pcnet32.c
                    {
                      calls
                      unregister_netdev() // in /net/core/dev.c
                      {
                        calls
                        dev_close()  // in /net/core/dev.c
                        {
                           calls dev->stop();
                           which is just pcnet32_close() // in pcnet32.c
                           {
                             which does what you wanted
                             to stop the device
                           }
                        }
                     }
                   which
                   frees pcnet32 device driver memory
                }
     }}}}}}}}

    in drivers/pci/pci_driver.c,
    struct device_driver->remove() is just pci_device_remove()
    which calls struct pci_driver->remove() which is pcnet32_remove_one()
    which calls unregister_netdev()  (in net/core/dev.c)
    which calls dev_close()  (in net/core/dev.c)
    which calls dev->stop() which is pcnet32_close()
    which then does the appropriate shutdown.

Following is the analogous stack trace for events sent to user-space
when the PCI device is unconfigured.

rpaphp_unconfig_pci_adapter() {              // in rpaphp_pci.c
  calls
  pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c
    calls
    pci_destroy_dev (struct pci_dev *) {
      calls
      device_unregister (&dev->dev) {        // in /drivers/base/core.c
        calls
        device_del(struct device * dev) {    // in /drivers/base/core.c
          calls
          kobject_del() {                    // in /lib/kobject.c
            calls
            kobject_uevent() {               // in /lib/kobject.c
              calls
              kset_uevent() {                // in /lib/kobject.c
                calls
                kset->uevent_ops->uevent()   // which is really just
                a call to
                dev_uevent() {               // in /drivers/base/core.c
                  calls
                  dev->bus->uevent() which is really just a call to
                  pci_uevent () {            // in drivers/pci/hotplug.c
                    which prints device name, etc....
                 }
               }
               then kobject_uevent() sends a netlink uevent to userspace
               --> userspace uevent
               (during early boot, nobody listens to netlink events and
               kobject_uevent() executes uevent_helper[], which runs the
               event process /sbin/hotplug)
           }
         }
         kobject_del() then calls sysfs_remove_dir(), which would
         trigger any user-space daemon that was watching /sysfs
         to notice the delete event.


Pros and Cons of the Current Design
-----------------------------------
There are several issues with the current EEH software recovery design,
which may be addressed in future revisions.  But first, note that the
big plus of the current design is that no changes need to be made to
individual device drivers, so that the current design throws a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.

-- A minor complaint is that resetting the network card causes
   user-space back-to-back ifdown/ifup burps that potentially disturb
   network daemons that didn't even need to know that the PCI
   card was being rebooted.

-- A more serious concern is that the same reset, for SCSI devices,
   causes havoc to mounted file systems.  Scripts cannot post-facto
   unmount a file system without flushing pending buffers, but this
   is impossible, because I/O has already been stopped.  Thus,
   ideally, the reset should happen at or below the block layer,
   so that the file systems are not disturbed.

   Reiserfs does not tolerate errors returned from the block device.
   Ext3fs seems to be tolerant, retrying reads/writes until it does
   succeed.  Both have been only lightly tested in this scenario.

   The SCSI-generic subsystem already has built-in code for performing
   SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
   (HBA) resets.  These are cascaded into a chain of attempted
   resets if a SCSI command fails.  These are completely hidden
   from the block layer.  It would be very natural to add an EEH
   reset into this chain of events.

-- If a SCSI error occurs for the root device, all is lost unless
   the sysadmin had the foresight to run /bin, /sbin, /etc, /var
   and so on, out of ramdisk/tmpfs.


There's forward progress ...