linux/Documentation/PCI/pcieaer-howto.txt
   The PCI Express Advanced Error Reporting Driver Guide HOWTO
                T. Long Nguyen  <t.m.l.nguyen@intel.com>
                Yanmin Zhang    <yanmin.zhang@intel.com>
                                07/29/2006


1. Overview

1.1 About this guide

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.

1.2 Copyright (C) Intel Corporation 2006.

1.3 What is the PCI Express AER Driver?

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components providing a minimum defined
set of error reporting requirements. Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure providing more robust error reporting.

The PCI Express AER driver provides the infrastructure to support PCI
Express Advanced Error Reporting capability. The PCI Express AER
driver provides three basic functions:

-       Gathers comprehensive error information when errors occur.
-       Reports errors to the users.
-       Performs error recovery actions.

The AER driver attaches only to root ports that support the PCI
Express AER capability.


2. User Guide

2.1 Include the PCI Express AER Root Driver into the Linux Kernel

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. If a user wants to use it, the driver
has to be compiled. Option CONFIG_PCIEAER supports this capability.
It depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y
and CONFIG_PCIEAER=y.

2.2 Load PCI Express AER Root Driver
There is a case where a system has AER support in BIOS. Enabling the AER
Root driver and having AER support in BIOS may result in unpredictable
behavior. To avoid this conflict, a successful load of the AER Root driver
requires ACPI _OSC support in the BIOS to allow the AER Root driver to
request native control of AER. See the PCI FW 3.0 Specification for
details regarding _OSC usage. Currently, much firmware does not provide
_OSC support even though it uses PCI Express. To support such firmware,
forceload, a parameter of type bool, allows AER to be initialized
even though the firmware has no _OSC support. To enable the
workaround, please add aerdriver.forceload=y to the kernel boot
parameter line when booting the kernel. Note that forceload=n by default.

nosourceid, another parameter of type bool, can be used when broken
hardware (mostly chipsets) has root ports that cannot obtain the reporting
source ID. nosourceid=n by default.

2.3 AER error output
When a PCI-E AER error is captured, an error message is output to the
console. If it is a correctable error, it is output as a warning.
Otherwise, it is printed as an error. Users can therefore choose a
different log level to filter out correctable error messages.

Below is an example:
0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
0000:50:00.0:   device [8086:0329] error status/mask=00100000/00000000
0000:50:00.0:    [20] Unsupported Request    (First)
0000:50:00.0:   TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sends
the error message to the root port. Please
refer to the PCI Express specifications for
other fields.


3. Developer Guide

Enabling AER aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks.

To support AER well, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable. Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When AER is enabled, a PCI Express device will automatically send an
error message to the PCIe root port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Error information being logged includes storing
the error reporting agent's requestor ID into the Error Source
Identification Registers and setting the error bits of the Root Error
Status Register accordingly.
If AER error reporting is enabled in the Root
Error Command Register, the Root Port generates an interrupt if an
error is detected.

Note that the errors as described above are related to the PCI Express
hierarchy and links. These errors do not include any device specific
errors because device specific errors will still get sent directly to
the device driver.

3.1 Configure the AER capability structure

AER aware drivers of PCI Express components need to change the device
control registers to enable AER. They can also change AER registers,
including mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See
section 3.3.

3.2. Provide callbacks

3.2.1 callback reset_link to reset pci express link

This callback is used to reset the PCI Express physical link when a
fatal error happens. The root port AER service driver provides a
default reset_link function, but different upstream ports might
have different specifications for resetting the PCI Express link, so
all upstream ports should provide their own reset_link functions.

In struct pcie_port_service_driver, a new pointer, reset_link, is
added.

pci_ers_result_t (*reset_link) (struct pci_dev *dev);

Section 3.2.2.2 provides more detailed info on when to call
reset_link.

3.2.2 PCI error-recovery callbacks

The PCI Express AER Root driver uses error callbacks to coordinate
with downstream device drivers associated with a hierarchy in question
when performing error recovery actions.

Data struct pci_driver has a pointer, err_handler, to point to
pci_error_handlers, which consists of a set of callback function
pointers.
The AER driver follows the rules defined in
pci-error-recovery.txt except for PCI Express specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

The sections below specify when the error callback functions are called.

3.2.2.1 Correctable errors

Correctable errors have no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

3.2.2.2 Non-correctable (non-fatal and fatal) errors

If an error message indicates a non-fatal error, performing link reset
at upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) for all drivers associated with the hierarchy in
question. For example,
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
If Upstream port A captures an AER error, the hierarchy consists of
Downstream port B and EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover or whether the AER driver should call
mmio_enabled next.

If an error message indicates a fatal error, the kernel will broadcast
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. Then, performing link reset at upstream is
necessary. As different kinds of devices might use different approaches
to reset the link, the AER port service driver is required to provide
the function to reset the link. First, the kernel checks whether the
upstream component has an AER driver.
If it does, the kernel uses the reset_link
callback of the AER driver. If the upstream component has no AER driver
and the port is a downstream port, we will perform a hot reset as the
default by setting the Secondary Bus Reset bit of the Bridge Control
register associated with the downstream port. As for upstream ports,
they should provide their own AER service drivers with a reset_link
function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
to mmio_enabled.

3.3 helper functions

3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
pci_enable_pcie_error_reporting enables the device to send error
messages to the root port when an error is detected. Note that devices
don't enable error reporting by default, so device drivers need to
call this function to enable it.

3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
pci_disable_pcie_error_reporting stops the device from sending error
messages to the root port when an error is detected.

3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable
error status register.

3.4 Frequently Asked Questions

Q: What happens if a PCI Express device driver does not provide an
error recovery handler (pci_driver->err_handler is equal to NULL)?

A: The devices attached to the driver won't be recovered. If the
error is fatal, the kernel will print out warning messages. Please refer
to section 3 for more information.

Q: What happens if an upstream port service driver does not provide
callback reset_link?

A: Fatal error recovery will fail if the errors are reported by the
upstream ports to which the service driver is attached.

Q: How does this infrastructure deal with a driver that is not PCI
Express aware?

A: This infrastructure calls the error callback functions of the
driver when an error happens. But if the driver is not aware of
PCI Express, the device might not report its own errors to the root
port.

Q: What modifications will that driver need to make it compatible
with the PCI Express AER Root driver?

A: It could call the helper functions to enable AER in devices and
clean up the uncorrectable status register. Please refer to section 3.3.


4. Software error injection

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software based error
injection can be used to fake various kinds of PCIe errors.

First you should enable PCIe AER software error injection in the kernel
configuration, that is, the following item should be in your .config.

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device
file named /dev/aer_inject should be created.

Then, you need a user space tool named aer-inject, which can be obtained
from:
    http://www.kernel.org/pub/linux/utils/pci/aer-inject/

More information about aer-inject can be found in the document that
comes with its source code.
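
When comparing the result of an injected error with the console output
described in section 2.3, it can help to decode the status word by hand.
Below is a minimal user-space sketch in C, not kernel code. The helper
names aer_lowest_set_bit and aer_uncor_bit_name are hypothetical, and the
table is only an illustrative subset of the Uncorrectable Error Status
bits, with bit positions assumed to follow the PCI Express specification.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative subset of the AER Uncorrectable Error Status bits.
 * Bit positions are assumed to follow the PCI Express specification;
 * the names are spelled out for readability, not copied from the kernel.
 */
struct aer_bit_name {
	int bit;
	const char *name;
};

static const struct aer_bit_name aer_uncor_bits[] = {
	{  4, "Data Link Protocol Error" },
	{ 12, "Poisoned TLP" },
	{ 13, "Flow Control Protocol Error" },
	{ 14, "Completion Timeout" },
	{ 15, "Completer Abort" },
	{ 16, "Unexpected Completion" },
	{ 17, "Receiver Overflow" },
	{ 18, "Malformed TLP" },
	{ 19, "ECRC Error" },
	{ 20, "Unsupported Request" },
};

/* Hypothetical helper: lowest set bit of a status word, or -1 if none. */
static int aer_lowest_set_bit(uint32_t status)
{
	int bit;

	for (bit = 0; bit < 32; bit++)
		if (status & (UINT32_C(1) << bit))
			return bit;
	return -1;
}

/* Hypothetical helper: name of a bit listed in the table, or NULL. */
static const char *aer_uncor_bit_name(int bit)
{
	size_t i;

	for (i = 0; i < sizeof(aer_uncor_bits) / sizeof(aer_uncor_bits[0]); i++)
		if (aer_uncor_bits[i].bit == bit)
			return aer_uncor_bits[i].name;
	return NULL;
}
```

For the status word 00100000 from the example in section 2.3,
aer_lowest_set_bit returns 20 and aer_uncor_bit_name(20) returns
"Unsupported Request", which matches the "[20] Unsupported Request"
line of that output.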