VFIO - "Virtual Function I/O"[1]
-------------------------------------------------------------------------------
Many modern systems now provide DMA and interrupt remapping facilities
to help ensure I/O devices behave within the boundaries they've been
allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
agnostic framework for exposing direct device access to userspace, in
a secure, IOMMU protected environment.  In other words, this allows
safe[2], non-privileged, userspace drivers.

Why do we want that?  Virtual machines often make use of direct device
access ("device assignment") when configured for the highest possible
I/O performance.  From a device and host perspective, this simply
turns the VM into a userspace driver, with the benefits of
significantly reduced latency, higher bandwidth, and direct use of
bare-metal device drivers[3].

Some applications, particularly in the high performance computing
field, also benefit from low-overhead, direct device access from
userspace.  Examples include network adapters (often non-TCP/IP based)
and compute accelerators.  Prior to VFIO, these drivers had to either
go through the full development cycle to become a proper upstream
driver, be maintained out of tree, or make use of the UIO framework,
which has no notion of IOMMU protection, limited interrupt support,
and requires root privileges to access things like PCI configuration
space.

The VFIO driver framework intends to unify these, replacing the
KVM PCI specific device assignment code as well as providing a more
secure, more featureful userspace driver environment than UIO.

Groups, Devices, and IOMMUs
-------------------------------------------------------------------------------

Devices are the main target of any I/O driver.  Devices typically
create a programming interface made up of I/O access, interrupts,
and DMA.  Without going into the details of each of these, DMA is
by far the most critical aspect for maintaining a secure environment,
as allowing a device read-write access to system memory imposes the
greatest risk to the overall system integrity.

To help mitigate this risk, many modern IOMMUs now incorporate
isolation properties into what was, in many cases, an interface only
meant for translation (i.e. solving the addressing problems of devices
with limited address spaces).  With this, devices can now be isolated
from each other and from arbitrary memory access, thus allowing
things like secure direct assignment of devices into virtual machines.

This isolation is not always at the granularity of a single device
though.  Even when an IOMMU is capable of this, properties of devices,
interconnects, and IOMMU topologies can each reduce this isolation.
For instance, an individual device may be part of a larger
multi-function enclosure.  While the IOMMU may be able to distinguish
between devices within the enclosure, the enclosure may not require
transactions between devices to reach the IOMMU.  Examples of this
could be anything from a multi-function PCI device with backdoors
between functions to a non-PCI-ACS (Access Control Services) capable
bridge allowing redirection without reaching the IOMMU.  Topology
can also play a factor in terms of hiding devices.  A PCIe-to-PCI
bridge masks the devices behind it, making transactions appear as if
from the bridge itself.  Obviously IOMMU design plays a major factor
as well.

Therefore, while for the most part an IOMMU may have device level
granularity, any system is susceptible to reduced granularity.  The
IOMMU API therefore supports a notion of IOMMU groups.  A group is
a set of devices which is isolatable from all other devices in the
system.  Groups are therefore the unit of ownership used by VFIO.

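As an illustration only, group membership can be inspected from
userspace.  The sketch below assumes the kernel exposes each group under
/sys/kernel/iommu_groups/$GROUP/devices, with one entry per member
device.  The function name iommu_group_count_devices() is hypothetical
and is not part of any VFIO or IOMMU API.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Count the devices that belong to an IOMMU group by listing the
 * group's "devices" directory, for example
 * /sys/kernel/iommu_groups/26/devices (path layout is an assumption).
 * Prints each member name and returns the number of members, or -1 if
 * the directory cannot be opened.
 */
int iommu_group_count_devices(const char *devices_dir)
{
	struct dirent *entry;
	int count = 0;
	DIR *dir;

	assert(devices_dir != NULL);

	dir = opendir(devices_dir);
	if (!dir)
		return -1;

	while ((entry = readdir(dir)) != NULL) {
		/* Skip the directory's own "." and ".." entries. */
		if (!strcmp(entry->d_name, ".") ||
		    !strcmp(entry->d_name, ".."))
			continue;
		printf("group member: %s\n", entry->d_name);
		count++;
	}

	closedir(dir);
	return count;
}
```

A result greater than one means the device shares its group, so every
member has to be accounted for before the group can be handed to a user.
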
While the group is the minimum granularity that must be used to
ensure secure user access, it's not necessarily the preferred
granularity.  In IOMMUs which make use of page tables, it may be
possible to share a set of page tables between different groups,
reducing the overhead both to the platform (reduced TLB thrashing,
reduced duplicate page tables), and to the user (programming only
a single set of translations).  For this reason, VFIO makes use of
a container class, which may hold one or more groups.  A container
is created by simply opening the /dev/vfio/vfio character device.

On its own, the container provides little functionality, with all
but a couple of version and extension query interfaces locked away.
The user needs to add a group into the container for the next level
of functionality.  To do this, the user first needs to identify the
group associated with the desired device.  This can be done using
the sysfs links described in the example below.  By unbinding the
device from the host driver and binding it to a VFIO driver, a new
VFIO group will appear for the group as /dev/vfio/$GROUP, where
$GROUP is the IOMMU group number of which the device is a member.
If the IOMMU group contains multiple devices, each will need to
be bound to a VFIO driver before operations on the VFIO group
are allowed (it's also sufficient to only unbind the device from
host drivers if a VFIO driver is unavailable; this will make the
group available, but not that particular device).  TBD - interface
for disabling driver probing/locking a device.

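The lookup of $GROUP can also be done programmatically.  The following
is a minimal sketch, assuming the device's sysfs directory carries an
iommu_group symbolic link whose final path component is the group
number.  Both function names are hypothetical helpers, not existing
kernel or library interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Extract the group number from an iommu_group link target.  Returns -1
 * if the final path component is not a non-negative decimal number.
 */
int iommu_group_from_link(const char *target)
{
	const char *base;
	char *end;
	long id;

	assert(target != NULL);

	base = strrchr(target, '/');
	base = base ? base + 1 : target;
	if (*base < '0' || *base > '9')
		return -1;

	errno = 0;
	id = strtol(base, &end, 10);
	if (errno || *end != '\0' || id > INT_MAX)
		return -1;

	return (int)id;
}

/*
 * Look up the IOMMU group of a PCI device named like "0000:06:0d.0".
 * Returns -1 if the device has no readable iommu_group link.
 */
int iommu_group_of_pci_device(const char *bdf)
{
	char path[256], target[256];
	ssize_t len;

	assert(bdf != NULL);

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/iommu_group", bdf);

	len = readlink(path, target, sizeof(target) - 1);
	if (len < 0)
		return -1;
	target[len] = '\0';	/* readlink() does not terminate the string */

	return iommu_group_from_link(target);
}
```

The returned number is what gets substituted for $GROUP when building
the /dev/vfio/$GROUP path.
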
Once the group is ready, it may be added to the container by opening
the VFIO group character device (/dev/vfio/$GROUP) and using the
VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the
previously opened container file.  If desired and if the IOMMU driver
supports sharing the IOMMU context between groups, multiple groups may
be set to the same container.  If a group fails to set to a container
with existing groups, a new empty container will need to be used
instead.

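The fallback described above might be sketched as follows.  This is
only an outline under stated assumptions: it relies on the definitions
in <linux/vfio.h>, the helper name vfio_group_attach() is hypothetical,
and returning -EPERM for a non-viable group is an arbitrary choice made
for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/*
 * Open a group and attach it to *container.  If the group cannot share
 * that container, open a new empty container, attach to it instead and
 * store the new descriptor in *container (the caller still owns the old
 * one).  Returns the group file descriptor or a negative errno value.
 */
int vfio_group_attach(int *container, const char *group_path)
{
	struct vfio_group_status status = { .argsz = sizeof(status) };
	int group, fresh, ret;

	assert(container != NULL && group_path != NULL);

	group = open(group_path, O_RDWR);
	if (group < 0)
		return -errno;

	/* The group is only usable once all of its devices are accounted for. */
	if (ioctl(group, VFIO_GROUP_GET_STATUS, &status) < 0) {
		ret = -errno;
		goto err;
	}
	if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
		ret = -EPERM;	/* assumption: error code chosen for this sketch */
		goto err;
	}

	/* First try to share the caller's container. */
	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, container) == 0)
		return group;

	/* Sharing was refused: fall back to a new, empty container. */
	fresh = open("/dev/vfio/vfio", O_RDWR);
	if (fresh < 0) {
		ret = -errno;
		goto err;
	}
	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &fresh) < 0) {
		ret = -errno;
		close(fresh);
		goto err;
	}
	*container = fresh;
	return group;

err:
	close(group);
	return ret;
}
```

A caller that attaches several groups would pass the same container
pointer each time and compare the descriptor before and after the call
to learn whether sharing succeeded.
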
With a group (or groups) attached to a container, the remaining
ioctls become available, enabling access to the VFIO IOMMU interfaces.
Additionally, it now becomes possible to get file descriptors for each
device within a group using an ioctl on the VFIO group file descriptor.

The VFIO device API includes ioctls for describing the device, the I/O
regions and their read/write/mmap offsets on the device descriptor, as
well as mechanisms for describing and registering interrupt
notifications.

VFIO Usage Example
-------------------------------------------------------------------------------

Assume user wants to access PCI device 0000:06:0d.0

$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
../../../../kernel/iommu_groups/26

This device is therefore in IOMMU group 26.  This device is on the
pci bus, therefore the user will make use of vfio-pci to manage the
group:

# modprobe vfio-pci

Binding this device to the vfio-pci driver creates the VFIO group
character devices for this group:

$ lspci -n -s 0000:06:0d.0
06:0d.0 0401: 1102:0002 (rev 08)
# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id

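A management tool may prefer to perform the two sysfs writes above
without spawning a shell.  The sketch below shows one way to do so.
The helper name sysfs_write() is hypothetical, and the function does
nothing more than write a string to an attribute file that already
exists.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a string to an existing sysfs attribute.  Returns 0 on success
 * or a negative errno value.  The file is never created, since sysfs
 * attributes are provided by the kernel.
 */
int sysfs_write(const char *attr_path, const char *value)
{
	ssize_t written;
	size_t len;
	int fd, ret = 0;

	assert(attr_path != NULL && value != NULL);

	fd = open(attr_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	len = strlen(value);
	written = write(fd, value, len);
	if (written < 0)
		ret = -errno;
	else if ((size_t)written != len)
		ret = -EIO;	/* a short write is treated as a failure */

	close(fd);
	return ret;
}
```

With this helper the commands above correspond to calling
sysfs_write("/sys/bus/pci/devices/0000:06:0d.0/driver/unbind",
"0000:06:0d.0") followed by
sysfs_write("/sys/bus/pci/drivers/vfio-pci/new_id", "1102 0002").
Both calls require root privileges, as the shell prompts indicate.
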
Now we need to look at what other devices are in the group to free
it for use by VFIO:

$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
total 0
lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
	../../../../devices/pci0000:00/0000:00:1e.0
lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
	../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
	../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1

This device is behind a PCIe-to-PCI bridge[4], therefore we also
need to add device 0000:06:0d.1 to the group following the same
procedure as above.  Device 0000:00:1e.0 is a bridge that does
not currently have a host driver, therefore it's not required to
bind this device to the vfio-pci driver (vfio-pci does not currently
support PCI bridges).

The final step is to provide the user with access to the group if
unprivileged operation is desired.  Note that /dev/vfio/vfio provides
no capabilities on its own and is therefore expected to be set to
mode 0666 by the system.

# chown user:user /dev/vfio/26

The user now has full access to all the devices and the iommu for this
group and can access them as follows:

	int container, group, device, i;
	struct vfio_group_status group_status =
					{ .argsz = sizeof(group_status) };
	struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
	struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

	/* Create a new container */
	container = open("/dev/vfio/vfio", O_RDWR);

	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
		/* Unknown API version */
	}

	if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
		/* Doesn't support the IOMMU driver we want. */
	}

	/* Open the group */
	group = open("/dev/vfio/26", O_RDWR);

	/* Test the group is viable and available */
	ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);

	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
		/* Group is not viable (ie, not all devices bound for vfio) */
	}

	/* Add the group to the container */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	/* Enable the IOMMU model we want */
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	/* Get additional IOMMU info */
	ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);

	/* Allocate some space and setup a DMA mapping */
	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
	dma_map.size = 1024 * 1024;
	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

	/* Get a file descriptor for the device */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

	/* Test and setup the device */
	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

	for (i = 0; i < device_info.num_regions; i++) {
		struct vfio_region_info reg = { .argsz = sizeof(reg) };

		reg.index = i;

		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

		/* Setup mappings... read/write offsets, mmaps
		 * For PCI devices, config space is a region */
	}

	for (i = 0; i < device_info.num_irqs; i++) {
		struct vfio_irq_info irq = { .argsz = sizeof(irq) };

		irq.index = i;

		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);

		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
	}

	/* Gratuitous device reset and go... */
	ioctl(device, VFIO_DEVICE_RESET);

VFIO User API
-------------------------------------------------------------------------------

Please see include/linux/vfio.h for complete API documentation.

VFIO bus driver API
-------------------------------------------------------------------------------

VFIO bus drivers, such as vfio-pci make use of only a few interfaces
into VFIO core.  When devices are bound and unbound to the driver,
the driver should call vfio_add_group_dev() and vfio_del_group_dev()
respectively:

extern int vfio_add_group_dev(struct iommu_group *iommu_group,
                              struct device *dev,
                              const struct vfio_device_ops *ops,
                              void *device_data);

extern void *vfio_del_group_dev(struct device *dev);

vfio_add_group_dev() indicates to the core to begin tracking the
specified iommu_group and register the specified dev as owned by
a VFIO bus driver.  The driver provides an ops structure for callbacks
similar to a file operations structure:

struct vfio_device_ops {
	int	(*open)(void *device_data);
	void	(*release)(void *device_data);
	ssize_t	(*read)(void *device_data, char __user *buf,
			size_t count, loff_t *ppos);
	ssize_t	(*write)(void *device_data, const char __user *buf,
			 size_t size, loff_t *ppos);
	long	(*ioctl)(void *device_data, unsigned int cmd,
			 unsigned long arg);
	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
};

Each function is passed the device_data that was originally registered
in the vfio_add_group_dev() call above.  This allows the bus driver
an easy place to store its opaque, private data.  The open/release
callbacks are issued when a new file descriptor is created for a
device (via VFIO_GROUP_GET_DEVICE_FD).  The ioctl interface provides
a direct pass through for VFIO_DEVICE_* ioctls.  The read/write/mmap
interfaces implement the device region access defined by the device's
own VFIO_DEVICE_GET_REGION_INFO ioctl.

-------------------------------------------------------------------------------

[1] VFIO was originally an acronym for "Virtual Function I/O" in its
initial implementation by Tom Lyon while at Cisco.  We've since
outgrown the acronym, but it's catchy.

[2] "safe" also depends upon a device being "well behaved".  It's
possible for multi-function devices to have backdoors between
functions and even for single function devices to have alternative
access to things like PCI config space through MMIO registers.  To
guard against the former we can include additional precautions in the
IOMMU driver to group multi-function PCI devices together
(iommu=group_mf).  The latter we can't prevent, but the IOMMU should
still provide isolation.  For PCI, SR-IOV Virtual Functions are the
best indicator of "well behaved", as these are designed for
virtualization usage models.

[3] As always there are trade-offs to virtual machine device
assignment that are beyond the scope of VFIO.  It's expected that
future IOMMU technologies will reduce some, but maybe not all, of
these trade-offs.

[4] In this case the device is below a PCI bridge, so transactions
from either function of the device are indistinguishable to the iommu:

-[0000:00]-+-1e.0-[06]--+-0d.0
                        \-0d.1

00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)