.. SPDX-License-Identifier: GPL-2.0

==============================
How To Write Linux PCI Drivers
==============================

:Authors: - Martin Mares <mj@ucw.cz>
          - Grant Grundler <grundler@parisc-linux.org>

The world of PCI is vast and full of (mostly unpleasant) surprises.
Since each CPU architecture implements different chip-sets and PCI devices
have different requirements (erm, "features"), the result is that PCI support
in the Linux kernel is not as trivial as one would wish. This short paper
tries to introduce all potential driver authors to Linux APIs for
PCI device drivers.

A more complete resource is the third edition of "Linux Device Drivers"
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
LDD3 is available for free (under Creative Commons License) from:
https://lwn.net/Kernel/LDD3/.

However, keep in mind that all documents are subject to "bit rot".
Refer to the source code if things are not working as described here.

Please send questions/comments/patches about Linux PCI API to the
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.


Structure of PCI drivers
========================
PCI drivers "discover" PCI devices in a system via pci_register_driver().
Actually, it's the other way around. When the PCI generic code discovers
a new device, the driver with a matching "description" will be notified.
Details on this below.

pci_register_driver() leaves most of the probing for devices to
the PCI layer and supports online insertion/removal of devices [thus
supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
The pci_register_driver() call requires passing in a table of function
pointers and thus dictates the high level structure of a driver.

Once the driver knows about a PCI device and takes ownership, the
driver generally needs to perform the following initialization:

  - Enable the device
  - Request MMIO/IOP resources
  - Set the DMA mask size (for both coherent and streaming DMA)
  - Allocate and initialize shared control data (dma_alloc_coherent())
  - Access device configuration space (if needed)
  - Register IRQ handler (request_irq())
  - Initialize non-PCI parts of the chip (i.e. LAN/SCSI/etc)
  - Enable DMA/processing engines

When the driver is done using the device (for example, because the module
is being unloaded), it needs to take the following steps:

  - Disable the device from generating IRQs
  - Release the IRQ (free_irq())
  - Stop all DMA activity
  - Release DMA buffers (both streaming and coherent)
  - Unregister from other subsystems (e.g. scsi or netdev)
  - Release MMIO/IOP resources
  - Disable the device

Most of these topics are covered in the following sections.
For the rest look at LDD3 or <linux/pci.h>.

If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
the PCI functions described below are defined as inline functions that are
either completely empty or just return an appropriate error code, to avoid
lots of ifdefs in the drivers.


pci_register_driver() call
==========================

PCI device drivers call ``pci_register_driver()`` during their
initialization with a pointer to a structure describing the driver
(``struct pci_driver``):

.. kernel-doc:: include/linux/pci.h
   :functions: pci_driver

The ID table is an array of ``struct pci_device_id`` entries ending with an
all-zero entry.  Definitions with static const are generally preferred.

.. kernel-doc:: include/linux/mod_devicetable.h
   :functions: pci_device_id

Most drivers only need ``PCI_DEVICE()`` or ``PCI_DEVICE_CLASS()`` to set up
a pci_device_id table.
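
As a rough illustration of how one table entry is compared with a device,
here is a small user-space model. It is a sketch, not kernel code:
``model_id``, ``model_dev``, ``model_match_one()`` and ``model_match()`` are
hypothetical names, the structs are simplified stand-ins for
``struct pci_device_id`` and ``struct pci_dev``, and the comparison is assumed
to treat PCI_ANY_ID as a wildcard and to compare the class only under
``class_mask``.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_ANY_ID 0xffffffffu	/* plays the role of PCI_ANY_ID */

/* Hypothetical, simplified stand-in for struct pci_device_id. */
struct model_id {
	uint32_t vendor, device;
	uint32_t subvendor, subdevice;
	uint32_t class_code, class_mask;
};

/* Hypothetical stand-in for the identity fields of a discovered device. */
struct model_dev {
	uint32_t vendor, device;
	uint32_t subvendor, subdevice;
	uint32_t class_code;
};

/* A table field matches if it is the wildcard or equals the device field. */
static int field_matches(uint32_t want, uint32_t have)
{
	return want == MODEL_ANY_ID || want == have;
}

/* Return 1 if the entry matches the device, 0 otherwise. */
static int model_match_one(const struct model_id *id,
			   const struct model_dev *dev)
{
	return field_matches(id->vendor, dev->vendor) &&
	       field_matches(id->device, dev->device) &&
	       field_matches(id->subvendor, dev->subvendor) &&
	       field_matches(id->subdevice, dev->subdevice) &&
	       !((id->class_code ^ dev->class_code) & id->class_mask);
}

/* Walk a table ending with an all-zero entry; return first match or NULL. */
static const struct model_id *model_match(const struct model_id *table,
					  const struct model_dev *dev)
{
	for (; table->vendor || table->subvendor || table->class_mask; table++)
		if (model_match_one(table, dev))
			return table;
	return NULL;
}
```

An entry whose ``class_mask`` is 0 ignores the class entirely, so such an
entry matches on the vendor, device and subsystem fields alone.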

New PCI IDs may be added to a device driver's pci_ids table at runtime
as shown below::

  echo "vendor device subvendor subdevice class class_mask driver_data" > \
  /sys/bus/pci/drivers/{driver}/new_id

All fields are passed in as hexadecimal values (no leading 0x).
The vendor and device fields are mandatory; the others are optional. Users
need pass only as many optional fields as necessary:

  - subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
  - class and class_mask fields default to 0
  - driver_data defaults to 0UL.

Note that driver_data must match the value used by any of the pci_device_id
entries defined in the driver. This makes the driver_data field mandatory
if all the pci_device_id entries have a non-zero driver_data value.
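
The defaulting rules above can be modelled in a few lines of user-space C.
This is only a sketch of the parsing behaviour as described here, not the
kernel's implementation; ``model_new_id`` and ``model_parse_new_id()`` are
hypothetical names.

```c
#include <stdio.h>

#define MODEL_ANY_ID 0xffffffffu	/* plays the role of PCI_ANY_ID */

/* Hypothetical container for the seven new_id fields. */
struct model_new_id {
	unsigned int vendor, device;
	unsigned int subvendor, subdevice;
	unsigned int class_code, class_mask;
	unsigned long driver_data;
};

/*
 * Parse one new_id line of whitespace-separated hexadecimal fields.
 * Returns 0 on success, or -1 if the two mandatory fields (vendor and
 * device) are not both present.
 */
static int model_parse_new_id(const char *line, struct model_new_id *out)
{
	int fields;

	/* Defaults for the optional fields, as listed above. */
	out->subvendor = MODEL_ANY_ID;
	out->subdevice = MODEL_ANY_ID;
	out->class_code = 0;
	out->class_mask = 0;
	out->driver_data = 0UL;

	/* Fields missing from the line keep the defaults set above. */
	fields = sscanf(line, "%x %x %x %x %x %x %lx",
			&out->vendor, &out->device,
			&out->subvendor, &out->subdevice,
			&out->class_code, &out->class_mask,
			&out->driver_data);
	return fields < 2 ? -1 : 0;
}
```

With this model, the line ``8086 100e`` yields wildcard subsystem IDs and a
zero class, class_mask and driver_data, while a line holding only a vendor
ID is rejected.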

Once added, the driver probe routine will be invoked for any unclaimed
PCI devices listed in its (newly updated) pci_ids list.

When the driver exits, it just calls pci_unregister_driver() and the PCI layer
automatically calls the remove hook for all devices handled by the driver.
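
Putting the pieces together, a minimal driver skeleton might look like the
sketch below. It is kernel code and builds only as part of a kernel module.
The vendor and device IDs, the ``example_`` names, and the driver name string
are placeholders chosen for illustration, not real hardware.

```c
#include <linux/module.h>
#include <linux/pci.h>

/* Hypothetical IDs, for illustration only. */
#define EXAMPLE_VENDOR_ID 0x1234
#define EXAMPLE_DEVICE_ID 0x5678

static const struct pci_device_id example_ids[] = {
	{ PCI_DEVICE(EXAMPLE_VENDOR_ID, EXAMPLE_DEVICE_ID) },
	{ }	/* all-zero entry terminates the table */
};
MODULE_DEVICE_TABLE(pci, example_ids);

/* Called by the PCI layer for each unclaimed device matching the table. */
static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	dev_info(&pdev->dev, "claimed by example driver\n");
	return 0;	/* a negative errno here declines the device */
}

/* Called on hot-unplug and from pci_unregister_driver(). */
static void example_remove(struct pci_dev *pdev)
{
	dev_info(&pdev->dev, "released by example driver\n");
}

static struct pci_driver example_driver = {
	.name     = "example_pci",
	.id_table = example_ids,
	.probe    = example_probe,
	.remove   = example_remove,
};

static int __init example_init(void)
{
	return pci_register_driver(&example_driver);
}

static void __exit example_exit(void)
{
	pci_unregister_driver(&example_driver);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");
```

Drivers whose init and exit functions do nothing beyond registering and
unregistering can usually replace that last part with the
``module_pci_driver()`` helper.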


"Attributes" for driver functions/data
--------------------------------------

Please mark the initialization and cleanup functions where appropriate
(the corresponding macros are defined in <linux/init.h>):

        ======          =================================================
        __init          Initialization code. Thrown away after the driver
                        initializes.
        __exit          Exit code. Ignored for non-modular drivers.
        ======          =================================================

Tips on when/where to use the above attributes:
        - The module_init()/module_exit() functions (and all
          initialization functions called _only_ from these)
          should be marked __init/__exit.

        - Do not mark the struct pci_driver.

        - Do NOT mark a function if you are not sure which mark to use.
          Better to not mark the function than mark the function wrong.


How to find PCI devices manually
================================

PCI drivers should have a really good reason for not using the
pci_register_driver() interface to search for PCI devices.
The main reason PCI devices are controlled by multiple drivers
is that one PCI device implements several different HW services,
e.g. a combined serial/parallel port/floppy controller.

A manual search may be performed using the following constructs:

Searching by vendor and device ID::

        struct pci_dev *dev = NULL;
        while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)))
                configure_device(dev);

Searching by class ID (iterate in a similar way)::

        pci_get_class(CLASS_ID, dev)

Searching by both vendor/device and subsystem vendor/device ID::

        pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev)

You can use the constant PCI_ANY_ID as a wildcard replacement for
VENDOR_ID or DEVICE_ID.  This allows searching for any device from a
specific vendor, for example.

These functions are hotplug-safe. They increment the reference count on
the pci_dev that they return. You must eventually (possibly at module unload)
decrement the reference count on these devices by calling pci_dev_put().
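
A sketch of the reference counting rules follows. It assumes that the
iteration helpers drop the reference on the device passed in as the starting
point, so that only a device kept after leaving the loop needs an explicit
pci_dev_put(). ``example_find_enabled()`` and ``example_report()`` are
hypothetical names; this is kernel code and builds only inside a kernel tree.

```c
#include <linux/pci.h>

/* Return the first enabled matching device with a reference held, or NULL. */
static struct pci_dev *example_find_enabled(unsigned int vendor,
					    unsigned int device)
{
	struct pci_dev *dev = NULL;	/* NULL starts at the head of the list */

	while ((dev = pci_get_device(vendor, device, dev))) {
		if (pci_is_enabled(dev))
			break;	/* leave the loop still holding a reference */
	}
	return dev;
}

static void example_report(void)
{
	/* The IDs are placeholders, not real hardware. */
	struct pci_dev *dev = example_find_enabled(0x1234, 0x5678);

	if (!dev)
		return;

	dev_info(&dev->dev, "found enabled device %s\n", pci_name(dev));
	pci_dev_put(dev);	/* balance the reference taken by the search */
}
```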


Device Initialization Steps
===========================

As noted in the introduction, most PCI drivers need the following steps
for device initialization:

  - Enable the device
  - Request MMIO/IOP resources
  - Set the DMA mask size (for both coherent and streaming DMA)
  - Allocate and initialize shared control data (dma_alloc_coherent())
  - Access device configuration space (if needed)
  - Register IRQ handler (request_irq())
  - Initialize non-PCI parts of the chip (i.e. LAN/SCSI/etc)
  - Enable DMA/processing engines.

The driver can access PCI config space registers at any time.
(Well, almost. When running BIST, config space can go away...but
that will just result in a PCI Bus Master Abort and config reads
will return garbage).


Enable the PCI device
---------------------
Before touching any device registers, the driver needs to enable
the PCI device by calling pci_enable_device(). This will:

  - wake up the device if it was in suspended state,
  - allocate I/O and memory regions of the device (if BIOS did not),
  - allocate an IRQ (if BIOS did not).

.. note::
   pci_enable_device() can fail! Check the return value.

.. warning::
   OS BUG: we don't check resource allocations before enabling those
   resources. The sequence would make more sense if we called
   pci_request_regions() before calling pci_enable_device().
   Currently, the device drivers can't detect the bug when two
   devices have been allocated the same range. This is not a common
   problem and unlikely to get fixed soon.

   This has been discussed before but not changed as of 2.6.19:
   https://lore.kernel.org/r/20060302180025.GC28895@flint.arm.linux.org.uk/


pci_set_master() will enable DMA by setting the bus master bit
in the PCI_COMMAND register. It also fixes the latency timer value if
it's set to something bogus by the BIOS.  pci_clear_master() will
disable DMA by clearing the bus master bit.

If the PCI device can use the PCI Memory-Write-Invalidate transaction,
call pci_set_mwi().  This enables the PCI_COMMAND bit for Mem-Wr-Inval
and also ensures that the cache line size register is set correctly.
Check the return value of pci_set_mwi() as not all architectures
or chip-sets may support Memory-Write-Invalidate.  Alternatively,
if Mem-Wr-Inval would be nice to have but is not required, call
pci_try_set_mwi() to have the system do its best effort at enabling
Mem-Wr-Inval.


Request MMIO/IOP resources
--------------------------
Memory (MMIO) and I/O port addresses should NOT be read directly
from the PCI device config space. Use the values in the pci_dev structure
as the PCI "bus address" might have been remapped to a "host physical"
address by the arch/chip-set specific kernel support.

See Documentation/driver-api/io-mapping.rst for how to access device registers
or device memory.

The device driver needs to call pci_request_region() to verify
no other device is already using the same address resource.
Conversely, drivers should call pci_release_region() AFTER
calling pci_disable_device().
The idea is to prevent two devices colliding on the same address range.

.. tip::
   See OS BUG comment above. Currently (2.6.19), the driver can only
   determine MMIO and IO Port resource availability _after_ calling
   pci_enable_device().

Generic flavors of pci_request_region() are request_mem_region()
(for MMIO ranges) and request_region() (for IO Port ranges).
Use these for address resources that are not described by "normal" PCI
BARs.

Also see pci_request_selected_regions().
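
The enable and request steps can be sketched as the start of a probe routine.
This is a hedged example, not a template for every device: the use of BAR 0,
the ``example_priv`` structure and the ``"example_pci"`` name are assumptions,
and the code builds only inside a kernel tree.

```c
#include <linux/device.h>
#include <linux/pci.h>

/* Hypothetical per-device state; a real driver would hold much more. */
struct example_priv {
	void __iomem *regs;
};

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct example_priv *priv;
	int err;

	priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
	if (!priv)
		return -ENOMEM;

	err = pci_enable_device(pdev);		/* can fail: check it */
	if (err)
		return err;

	/* Claim BAR 0 so that no other driver can use the same range. */
	err = pci_request_region(pdev, 0, "example_pci");
	if (err)
		goto err_disable;

	priv->regs = pci_iomap(pdev, 0, 0);	/* length 0 maps the whole BAR */
	if (!priv->regs) {
		err = -ENOMEM;
		goto err_release;
	}

	pci_set_master(pdev);			/* only if the device does DMA */
	pci_set_drvdata(pdev, priv);
	return 0;

err_release:
	pci_disable_device(pdev);
	pci_release_region(pdev, 0);		/* released after disabling */
	return err;
err_disable:
	pci_disable_device(pdev);
	return err;
}
```

The error paths undo only what has already succeeded, and the region is
released after the device is disabled, matching the ordering given above.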


Set the DMA mask size
---------------------
.. note::
   If anything below doesn't make sense, please refer to
   Documentation/core-api/dma-api.rst. This section is just a reminder that
   drivers need to indicate DMA capabilities of the device and is not
   an authoritative source for DMA interfaces.

While all drivers should explicitly indicate the DMA capability
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than
32-bit bus master capability for streaming data need the driver
to "register" this capability by calling pci_set_dma_mask() with
appropriate parameters.  In general this allows more efficient DMA
on systems where System RAM exists above 4G _physical_ address.

Drivers for all PCI-X and PCIe compliant devices must call
pci_set_dma_mask() as they are 64-bit DMA devices.

Similarly, drivers must also "register" this capability if the device
can directly address "consistent memory" in System RAM above 4G physical
address by calling pci_set_consistent_dma_mask().
Again, this includes drivers for all PCI-X and PCIe compliant devices.
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
64-bit DMA capable for payload ("streaming") data but not control
("consistent") data.
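
What a DMA mask expresses can be shown with plain arithmetic. The user-space
sketch below mirrors the usual "low n bits set" definition of a mask and
checks whether a buffer is reachable under it. ``model_dma_bit_mask()`` and
``model_dma_reachable()`` are hypothetical names, and the real kernel performs
further checks that are not modelled here.

```c
#include <stdint.h>

/* Build a mask with the low n bits set; n of 64 or more means all bits. */
static uint64_t model_dma_bit_mask(unsigned int n)
{
	return n >= 64 ? ~0ULL : (1ULL << n) - 1;
}

/* Return 1 if the buffer [addr, addr + len) lies entirely within the mask. */
static int model_dma_reachable(uint64_t mask, uint64_t addr, uint64_t len)
{
	if (len == 0 || addr + len - 1 < addr)	/* empty or wrapping range */
		return 0;
	return addr + len - 1 <= mask;
}
```

A device limited to a 32-bit mask cannot reach a buffer placed at or above
4G, which is why drivers for 64-bit capable devices register a wider mask.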


Setup shared control data
-------------------------
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
memory.  See Documentation/core-api/dma-api.rst for a full description of
the DMA APIs. This section is just a reminder that it needs to be done
before enabling DMA on the device.


Initialize device registers
---------------------------
Some drivers will need specific "capability" fields programmed
or other "vendor specific" registers initialized or reset,
e.g. clearing pending interrupts.


Register IRQ handler
--------------------
While calling request_irq() is the last step described here,
this is often just another intermediate step to initialize a device.
This step can often be deferred until the device is opened for use.

All interrupt handlers for IRQ lines should be registered with IRQF_SHARED
and use the devid to map IRQs to devices (remember that all PCI IRQ lines
can be shared).

request_irq() will associate an interrupt handler and device handle
with an interrupt number. Historically interrupt numbers represent
IRQ lines which run from the PCI device to the Interrupt controller.
With MSI and MSI-X (more below) the interrupt number is a CPU "vector".

request_irq() also enables the interrupt. Make sure the device is
quiesced and does not have any interrupts pending before registering
the interrupt handler.

MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts"
which deliver interrupts to the CPU via a DMA write to a Local APIC.
The fundamental difference between MSI and MSI-X is how multiple
"vectors" get allocated. MSI requires contiguous blocks of vectors
while MSI-X can allocate several individual ones.

MSI capability can be enabled by calling pci_alloc_irq_vectors() with the
PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). This
causes the PCI support to program CPU vector data into the PCI device
capability registers. Many architectures, chip-sets, or BIOSes do NOT
support MSI or MSI-X and a call to pci_alloc_irq_vectors() with just
the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always
specify PCI_IRQ_LEGACY as well.

Drivers that have different interrupt handlers for MSI/MSI-X and
legacy INTx should choose the right one based on the msi_enabled
and msix_enabled flags in the pci_dev structure after calling
pci_alloc_irq_vectors().
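
A sketch of the vector allocation and handler registration described above,
assuming a single vector is enough. ``example_irq()`` and
``example_setup_irq()`` are hypothetical names, and the flag names are the
ones used in this document; other kernel versions may spell the legacy flag
differently. The code builds only inside a kernel tree.

```c
#include <linux/interrupt.h>
#include <linux/pci.h>

static irqreturn_t example_irq(int irq, void *dev_id)
{
	struct pci_dev *pdev = dev_id;

	/*
	 * On a shared INTx line a real handler must first check, in a
	 * device specific way, that its own device raised the interrupt
	 * and return IRQ_NONE if it did not. That check is omitted here.
	 */
	dev_dbg(&pdev->dev, "interrupt %d\n", irq);
	return IRQ_HANDLED;
}

static int example_setup_irq(struct pci_dev *pdev)
{
	int nvec, err;

	/* Ask for one vector; MSI-X, MSI and legacy INTx are all acceptable. */
	nvec = pci_alloc_irq_vectors(pdev, 1, 1,
				     PCI_IRQ_MSIX | PCI_IRQ_MSI | PCI_IRQ_LEGACY);
	if (nvec < 0)
		return nvec;

	/* pci_irq_vector() translates vector index 0 into a Linux IRQ number. */
	err = request_irq(pci_irq_vector(pdev, 0), example_irq, IRQF_SHARED,
			  "example_pci", pdev);
	if (err)
		pci_free_irq_vectors(pdev);
	return err;
}
```

Passing ``pdev`` as the device handle means the same pointer must later be
given to free_irq().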

There are (at least) two really good reasons for using MSI:

1) MSI is an exclusive interrupt vector by definition.
   This means the interrupt handler doesn't have to verify
   its device caused the interrupt.

2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed
   to be visible to the host CPU(s) when the MSI is delivered. This
   is important for both data coherency and avoiding stale control data.
   This guarantee allows the driver to omit MMIO reads to flush
   the DMA stream.

See drivers/infiniband/hw/mthca/ or drivers/net/ethernet/broadcom/tg3.c
for examples of MSI/MSI-X usage.


PCI device shutdown
===================

When a PCI device driver is being unloaded, most of the following
steps need to be performed:

  - Disable the device from generating IRQs
  - Release the IRQ (free_irq())
  - Stop all DMA activity
  - Release DMA buffers (both streaming and consistent)
  - Unregister from other subsystems (e.g. scsi or netdev)
  - Disable device from responding to MMIO/IO Port addresses
  - Release MMIO/IO Port resource(s)


Stop IRQs on the device
-----------------------
How to do this is chip/device specific. If it's not done, it opens
the possibility of a "screaming interrupt" if (and only if)
the IRQ is shared with another device.

When the shared IRQ handler is "unhooked", the remaining devices
using the same IRQ line will still need the IRQ enabled. Thus if the
"unhooked" device asserts the IRQ line, the system will respond assuming
it was one of the remaining devices that asserted the IRQ line. Since none
of the other devices will handle the IRQ, the system will "hang" until
it decides the IRQ isn't going to get handled and masks the IRQ (100,000
iterations later). Once the shared IRQ is masked, the remaining devices
will stop functioning properly. Not a nice situation.

This is another reason to use MSI or MSI-X if it's available.
MSI and MSI-X are defined to be exclusive interrupts and thus
are not susceptible to the "screaming interrupt" problem.


Release the IRQ
---------------
Once the device is quiesced (no more IRQs), one can call free_irq().
This function will return control once any pending IRQs are handled,
"unhook" the driver's IRQ handler from that IRQ, and finally release
the IRQ if no one else is using it.


Stop all DMA activity
---------------------
It's extremely important to stop all DMA operations BEFORE attempting
to deallocate DMA control data. Failure to do so can result in memory
corruption, hangs, and on some chip-sets a hard crash.

Stopping DMA after stopping the IRQs can avoid races where the
IRQ handler might restart DMA engines.

While this step sounds obvious and trivial, several "mature" drivers
didn't get this step right in the past.


Release DMA buffers
-------------------
Once DMA is stopped, clean up streaming DMA first.
I.e. unmap data buffers and return buffers to "upstream"
owners if there are any.

Then clean up "consistent" buffers which contain the control data.

See Documentation/core-api/dma-api.rst for details on unmapping interfaces.


Unregister from other subsystems
--------------------------------
Most low level PCI device drivers support some other subsystem
like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your
driver isn't losing resources from that other subsystem.
If this happens, typically the symptom is an Oops (panic) when
the subsystem attempts to call into a driver that has been unloaded.


Disable Device from responding to MMIO/IO Port addresses
--------------------------------------------------------
iounmap() MMIO or IO Port resources and then call pci_disable_device().
This is the symmetric opposite of pci_enable_device().
Do not access device registers after calling pci_disable_device().


Release MMIO/IO Port Resource(s)
--------------------------------
Call pci_release_region() to mark the MMIO or IO Port range as available.
Failure to do so usually results in the inability to reload the driver.
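
The shutdown steps above can be sketched as a remove routine. This is an
outline under stated assumptions rather than a complete implementation:
``example_priv``, ``example_hw_mask_irqs()`` and ``example_hw_stop_dma()`` are
hypothetical, the device is assumed to use one interrupt vector registered
with the ``pci_dev`` pointer as its device handle, and only BAR 0 is assumed
to be mapped. The code builds only inside a kernel tree.

```c
#include <linux/interrupt.h>
#include <linux/pci.h>

/* Hypothetical per-device state, assumed to have been set up in probe. */
struct example_priv {
	void __iomem *regs;
};

/* Hypothetical device specific helpers; real ones would program registers. */
static void example_hw_mask_irqs(struct example_priv *priv) { }
static void example_hw_stop_dma(struct example_priv *priv) { }

static void example_remove(struct pci_dev *pdev)
{
	struct example_priv *priv = pci_get_drvdata(pdev);

	example_hw_mask_irqs(priv);		/* stop IRQs at the device */
	free_irq(pci_irq_vector(pdev, 0), pdev);	/* release the IRQ */
	pci_free_irq_vectors(pdev);

	example_hw_stop_dma(priv);		/* stop all DMA activity */
	/* Unmap streaming buffers, then free consistent buffers, here. */
	/* Unregister from the upper subsystem (netdev, SCSI, ...) here. */

	pci_iounmap(pdev, priv->regs);		/* unmap before disabling */
	pci_disable_device(pdev);
	pci_release_region(pdev, 0);		/* make the range available */
}
```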


How to access PCI config space
==============================

You can use `pci_(read|write)_config_(byte|word|dword)` to access the config
space of a device represented by `struct pci_dev *`. All these functions return
0 when successful or an error code (`PCIBIOS_...`) which can be translated to a
negative errno value by pcibios_err_to_errno(). Most drivers expect that
accesses to valid PCI devices don't fail.

If you don't have a struct pci_dev available, you can call
`pci_bus_(read|write)_config_(byte|word|dword)` to access a given device
and function on that bus.

If you access fields in the standard portion of the config header, please
use symbolic names of locations and bits declared in <linux/pci.h>.

If you need to access Extended PCI Capability registers, just call
pci_find_capability() for the particular capability and it will find the
corresponding register block for you.
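
A short sketch of these accessors, using symbolic register names instead of
raw offsets. ``example_dump_config()`` is a hypothetical helper and the choice
of registers is arbitrary; the code builds only inside a kernel tree.

```c
#include <linux/pci.h>

static int example_dump_config(struct pci_dev *pdev)
{
	u16 vendor, command;
	int pos, ret;

	ret = pci_read_config_word(pdev, PCI_VENDOR_ID, &vendor);
	if (ret)
		return pcibios_err_to_errno(ret);

	ret = pci_read_config_word(pdev, PCI_COMMAND, &command);
	if (ret)
		return pcibios_err_to_errno(ret);

	dev_info(&pdev->dev, "vendor %#06x, bus mastering %s\n", vendor,
		 (command & PCI_COMMAND_MASTER) ? "on" : "off");

	/* A return value of 0 means the capability is not present. */
	pos = pci_find_capability(pdev, PCI_CAP_ID_MSI);
	if (pos)
		dev_info(&pdev->dev, "MSI capability at offset %#x\n", pos);

	return 0;
}
```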


Other interesting functions
===========================

=============================   ================================================
pci_get_domain_bus_and_slot()   Find pci_dev corresponding to given domain,
                                bus, and slot number. If the device is
                                found, its reference count is increased.
pci_set_power_state()           Set PCI Power Management state (0=D0 ... 3=D3)
pci_find_capability()           Find specified capability in device's capability
                                list.
pci_resource_start()            Returns bus start address for a given PCI region
pci_resource_end()              Returns bus end address for a given PCI region
pci_resource_len()              Returns the byte length of a PCI region
pci_set_drvdata()               Set private driver data pointer for a pci_dev
pci_get_drvdata()               Return private driver data pointer for a pci_dev
pci_set_mwi()                   Enable Memory-Write-Invalidate transactions.
pci_clear_mwi()                 Disable Memory-Write-Invalidate transactions.
=============================   ================================================


Miscellaneous hints
===================

When displaying PCI device names to the user (for example when a driver wants
to tell the user which card it has found), please use pci_name(pci_dev).

Always refer to the PCI devices by a pointer to the pci_dev structure.
All PCI layer functions use this identification and it's the only
reasonable one. Don't use bus/slot/function numbers except for very
special purposes -- on systems with multiple primary buses their semantics
can be pretty complex.

Don't try to turn on Fast Back to Back writes in your driver.  All devices
on the bus need to be capable of doing it, so this is something which needs
to be handled by platform and generic code, not individual drivers.


Vendor and device identifications
=================================

Do not add new device or vendor IDs to include/linux/pci_ids.h unless they
are shared across multiple drivers.  You can add private definitions in
your driver if they're helpful, or just use plain hex constants.

The device IDs are arbitrary hex numbers (vendor controlled) and normally used
only in a single location, the pci_device_id table.

Please DO submit new vendor/device IDs to https://pci-ids.ucw.cz/.
There's a mirror of the pci.ids file at https://github.com/pciutils/pciids.


Obsolete functions
==================

There are several functions which you might come across when trying to
port an old driver to the new PCI interface.  They are no longer present
in the kernel as they aren't compatible with hotplug or PCI domains or
having sane locking.

=================       ===========================================
pci_find_device()       Superseded by pci_get_device()
pci_find_subsys()       Superseded by pci_get_subsys()
pci_find_slot()         Superseded by pci_get_domain_bus_and_slot()
pci_get_slot()          Superseded by pci_get_domain_bus_and_slot()
=================       ===========================================

The alternative is the traditional PCI device driver that walks PCI
device lists. This is still possible but discouraged.


MMIO Space and "Write Posting"
==============================

Converting a driver from using I/O Port space to using MMIO space
often requires some additional changes. Specifically, "write posting"
needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2)
already do this. I/O Port space guarantees write transactions reach the PCI
device before the CPU can continue. Writes to MMIO space allow the CPU
to continue before the transaction reaches the PCI device. HW weenies
call this "Write Posting" because the write completion is "posted" to
the CPU before the transaction has reached its destination.

Thus, timing sensitive code should add readl() where the CPU is
expected to wait before doing other work.  The classic "bit banging"
sequence works fine for I/O Port space::

       for (i = 0; i < 8; i++, val >>= 1) {
               outb(val & 1, ioport_reg);      /* write bit */
               udelay(10);
       }

The same sequence for MMIO space should be::

       for (i = 0; i < 8; i++, val >>= 1) {
               writeb(val & 1, mmio_reg);      /* write bit */
               readb(safe_mmio_reg);           /* flush posted write */
               udelay(10);
       }

It is important that "safe_mmio_reg" not have any side effects that
interfere with the correct operation of the device.

Another case to watch out for is when resetting a PCI device. Use PCI
Configuration space reads to flush the writel(). This will gracefully
handle the PCI master abort on all platforms if the PCI device is
expected to not respond to a readl().  Most x86 platforms will allow
MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
(e.g. ~0). But many RISC platforms will crash (a.k.a. "Hard Fail").

 575expected to not respond to a readl().  Most x86 platforms will allow
 576MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
 577(e.g. ~0). But many RISC platforms will crash (a.k.a."Hard Fail").
 578