   1[Generated file: see http://ozlabs.org/~rusty/virtio-spec/]
   2Virtio PCI Card Specification
   3v0.9.5 DRAFT
   4-
   5
   6Rusty Russell <rusty@rustcorp.com.au> IBM Corporation (Editor)
   7
   82012 May 7.
   9
  10Purpose and Description
  11
This document describes the specifications of the “virtio” family
of PCI devices. These devices are found in virtual environments,
yet by design they are not all that different from physical PCI
devices, and this document treats them as such. This allows the
guest to use standard PCI drivers and discovery mechanisms.
  18
  19The purpose of virtio and this specification is that virtual
  20environments and guests should have a straightforward, efficient,
  21standard and extensible mechanism for virtual devices, rather
  22than boutique per-environment or per-OS mechanisms.
  23
  24  Straightforward: Virtio PCI devices use normal PCI mechanisms
  25  of interrupts and DMA which should be familiar to any device
  26  driver author. There is no exotic page-flipping or COW
  27  mechanism: it's just a PCI device.[footnote:
  28This lack of page-sharing implies that the implementation of the
  29device (e.g. the hypervisor or host) needs full access to the
  30guest memory. Communication with untrusted parties (i.e.
  31inter-guest communication) requires copying.
  32]
  33
  34  Efficient: Virtio PCI devices consist of rings of descriptors
  35  for input and output, which are neatly separated to avoid cache
  36  effects from both guest and device writing to the same cache
  37  lines.
  38
  39  Standard: Virtio PCI makes no assumptions about the environment
  40  in which it operates, beyond supporting PCI. In fact the virtio
  41  devices specified in the appendices do not require PCI at all:
  42  they have been implemented on non-PCI buses.[footnote:
  43The Linux implementation further separates the PCI virtio code
  44from the specific virtio drivers: these drivers are shared with
  45the non-PCI implementations (currently lguest and S/390).
  46]
  47
  48  Extensible: Virtio PCI devices contain feature bits which are
  49  acknowledged by the guest operating system during device setup.
  50  This allows forwards and backwards compatibility: the device
  51  offers all the features it knows about, and the driver
  52  acknowledges those it understands and wishes to use.
  53
  54  Virtqueues
  55
  56The mechanism for bulk data transport on virtio PCI devices is
  57pretentiously called a virtqueue. Each device can have zero or
  58more virtqueues: for example, the network device has one for
  59transmit and one for receive.
  60
  61Each virtqueue occupies two or more physically-contiguous pages
  62(defined, for the purposes of this specification, as 4096 bytes),
  63and consists of three parts:
  64
  65
  66+-------------------+-----------------------------------+-----------+
  67| Descriptor Table  |   Available Ring     (padding)    | Used Ring |
  68+-------------------+-----------------------------------+-----------+
  69
  70
  71When the driver wants to send a buffer to the device, it fills in
  72a slot in the descriptor table (or chains several together), and
  73writes the descriptor index into the available ring. It then
  74notifies the device. When the device has finished a buffer, it
  75writes the descriptor into the used ring, and sends an interrupt.
  76
  77Specification
  78
  79  PCI Discovery
  80
  81Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000
  82through 0x103F inclusive is a virtio device[footnote:
  83The actual value within this range is ignored
  84]. The device must also have a Revision ID of 0 to match this
  85specification.
  86
  87The Subsystem Device ID indicates which virtio device is
  88supported by the device. The Subsystem Vendor ID should reflect
  89the PCI Vendor ID of the environment (it's currently only used
  90for informational purposes by the guest).
  91
  92
  93+----------------------+--------------------+---------------+
  94| Subsystem Device ID  |   Virtio Device    | Specification |
  95+----------------------+--------------------+---------------+
  96+----------------------+--------------------+---------------+
  97|          1           |   network card     |  Appendix C   |
  98+----------------------+--------------------+---------------+
  99|          2           |   block device     |  Appendix D   |
 100+----------------------+--------------------+---------------+
 101|          3           |      console       |  Appendix E   |
 102+----------------------+--------------------+---------------+
 103|          4           |  entropy source    |  Appendix F   |
 104+----------------------+--------------------+---------------+
 105|          5           | memory ballooning  |  Appendix G   |
 106+----------------------+--------------------+---------------+
 107|          6           |     ioMemory       |       -       |
 108+----------------------+--------------------+---------------+
 109|          7           |       rpmsg        |  Appendix H   |
 110+----------------------+--------------------+---------------+
 111|          8           |     SCSI host      |  Appendix I   |
 112+----------------------+--------------------+---------------+
 113|          9           |   9P transport     |       -       |
 114+----------------------+--------------------+---------------+
 115|         10           |   mac80211 wlan    |       -       |
 116+----------------------+--------------------+---------------+
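
The discovery rules and the table above can be sketched as a pair of
helpers. This is only an illustration: the struct pci_ids container,
the macro names and the function names are hypothetical, and how the
ID fields are read from PCI configuration space is left to the
environment.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_PCI_VENDOR_ID      0x1AF4
#define VIRTIO_PCI_DEVICE_ID_MIN  0x1000
#define VIRTIO_PCI_DEVICE_ID_MAX  0x103F

/* Hypothetical container for the PCI identification fields. */
struct pci_ids {
    uint16_t vendor;
    uint16_t device;
    uint8_t  revision;
    uint16_t subsystem_device;
};

/* True if the IDs match the discovery rules of this specification. */
static bool is_virtio_pci(const struct pci_ids *ids)
{
    return ids->vendor == VIRTIO_PCI_VENDOR_ID &&
           ids->device >= VIRTIO_PCI_DEVICE_ID_MIN &&
           ids->device <= VIRTIO_PCI_DEVICE_ID_MAX &&
           ids->revision == 0;
}

/* Map the Subsystem Device ID to the device type named in the table. */
static const char *virtio_device_name(uint16_t subsystem_device)
{
    static const char *const names[] = {
        NULL, "network card", "block device", "console", "entropy source",
        "memory ballooning", "ioMemory", "rpmsg", "SCSI host",
        "9P transport", "mac80211 wlan",
    };

    if (subsystem_device == 0 ||
        subsystem_device >= sizeof(names) / sizeof(names[0]))
        return "unknown";
    return names[subsystem_device];
}
```

Note that the helper only looks at the Subsystem Device ID to pick the
device type; the Device ID itself is merely range-checked.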
 117
 118
 119  Device Configuration
 120
 121To configure the device, we use the first I/O region of the PCI
 122device. This contains a virtio header followed by a
 123device-specific region.
 124
There may be different widths of accesses to the I/O region; the
“natural” access method for each field in the virtio header must
be used (i.e. 32-bit accesses for 32-bit fields, etc.), but the
device-specific region can be accessed using any width accesses,
and should obtain the same results.
 130
 131Note that this is possible because while the virtio header is PCI
 132(i.e. little) endian, the device-specific region is encoded in
 133the native endian of the guest (where such distinction is
 134applicable).
 135
 136  Device Initialization Sequence<sub:Device-Initialization-Sequence>
 137
We start with an overview of device initialization, then expand
on the details of the device and how each step is performed.
 140
  1. Reset the device. This is not required on initial start up.

  2. The ACKNOWLEDGE status bit is set: we have noticed the device.

  3. The DRIVER status bit is set: we know how to drive the device.

  4. Device-specific setup, including reading the Device Feature
     Bits, discovery of virtqueues for the device, optional MSI-X
     setup, and reading and possibly writing the virtio
     configuration space.

  5. The subset of Device Feature Bits understood by the driver is
     written to the device.

  6. The DRIVER_OK status bit is set.

  7. The device can now be used (i.e. buffers added to the
     virtqueues)[footnote:
Historically, drivers have used the device before steps 5 and 6.
This is only allowed if the driver does not use any features
which would alter this early use of the device.
]
 163
 164If any of these steps go irrecoverably wrong, the guest should
 165set the FAILED status bit to indicate that it has given up on the
 166device (it can reset the device later to restart if desired).
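
As a rough sketch of the sequence, the following walks the status and
feature steps against an in-memory stand-in for the virtio header. The
struct virtio_regs type, the VIRTIO_STATUS_* names and virtio_init()
are hypothetical; a real driver would use I/O accesses of the natural
width for each field, and would do its virtqueue and MSI-X setup where
the comment indicates.

```c
#include <stdint.h>

/* Device Status bits, with the values given in this specification. */
#define VIRTIO_STATUS_ACKNOWLEDGE 1
#define VIRTIO_STATUS_DRIVER      2
#define VIRTIO_STATUS_DRIVER_OK   4
#define VIRTIO_STATUS_FAILED      128

/* Hypothetical in-memory stand-in for part of the virtio header. */
struct virtio_regs {
    uint32_t device_features;  /* read-only for the driver */
    uint32_t guest_features;   /* written by the driver */
    uint8_t  status;           /* Device Status */
};

/*
 * Walk the status and feature steps. `supported` is the set of feature
 * bits this (hypothetical) driver understands. Returns the subset
 * written back to the device.
 */
static uint32_t virtio_init(struct virtio_regs *regs, uint32_t supported)
{
    uint32_t offered;

    regs->status = 0;                              /* reset */
    regs->status |= VIRTIO_STATUS_ACKNOWLEDGE;     /* noticed the device */
    regs->status |= VIRTIO_STATUS_DRIVER;          /* we can drive it */

    offered = regs->device_features;               /* device-specific setup */
    /* ... virtqueue discovery and optional MSI-X setup go here ... */

    regs->guest_features = offered & supported;    /* acknowledge features */
    regs->status |= VIRTIO_STATUS_DRIVER_OK;       /* ready for use */
    return regs->guest_features;
}
```

On an unrecoverable error at any point, the driver would instead set
VIRTIO_STATUS_FAILED in the status field.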
 167
 168We now cover the fields required for general setup in detail.
 169
 170  Virtio Header
 171
 172The virtio header looks as follows:
 173
 174
 175+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
 176| Bits       || 32                  | 32                  | 32       | 16     | 16      | 16      | 8       | 8      |
 177+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
 178| Read/Write || R                   | R+W                 | R+W      | R      | R+W     | R+W     | R+W     | R      |
 179+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
 180| Purpose    || Device              | Guest               | Queue    | Queue  | Queue   | Queue   | Device  | ISR    |
 181|            || Features bits 0:31  | Features bits 0:31  | Address  | Size   | Select  | Notify  | Status  | Status |
 182+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
 183
 184
 185If MSI-X is enabled for the device, two additional fields
 186immediately follow this header:[footnote:
 187ie. once you enable MSI-X on the device, the other fields move.
 188If you turn it off again, they move back!
 189]
 190
 191
+------------++----------------+--------+
| Bits       || 16             | 16     |
+------------++----------------+--------+
| Read/Write || R+W            | R+W    |
+------------++----------------+--------+
| Purpose    || Configuration  | Queue  |
| (MSI-X)    || Vector         | Vector |
+------------++----------------+--------+
 201
 202
 203Immediately following these general headers, there may be
 204device-specific headers:
 205
 206
+------------++--------------------+
| Bits       || Device Specific    |
+------------++--------------------+
| Read/Write || Device Specific    |
+------------++--------------------+
| Purpose    || Device Specific... |
|            ||                    |
+------------++--------------------+
 216
 217
 218  Device Status
 219
 220The Device Status field is updated by the guest to indicate its
 221progress. This provides a simple low-level diagnostic: it's most
 222useful to imagine them hooked up to traffic lights on the console
 223indicating the status of each device.
 224
 225The device can be reset by writing a 0 to this field, otherwise
 226at least one bit should be set:
 227
 228  ACKNOWLEDGE (1) Indicates that the guest OS has found the
 229  device and recognized it as a valid virtio device.
 230
 231  DRIVER (2) Indicates that the guest OS knows how to drive the
 232  device. Under Linux, drivers can be loadable modules so there
 233  may be a significant (or infinite) delay before setting this
 234  bit.
 235
 236  DRIVER_OK (4) Indicates that the driver is set up and ready to
 237  drive the device.
 238
 239  FAILED (128) Indicates that something went wrong in the guest,
 240  and it has given up on the device. This could be an internal
 241  error, or the driver didn't like the device for some reason, or
 242  even a fatal error during device operation. The device must be
 243  reset before attempting to re-initialize.
 244
 245  Feature Bits<sub:Feature-Bits>
 246
The first configuration field indicates the features that the
device supports. The bits are allocated as follows:
 249
 250  0 to 23 Feature bits for the specific device type
 251
  24 to 31 Feature bits reserved for extensions to the queue and
  feature negotiation mechanisms
 254
 255For example, feature bit 0 for a network device (i.e. Subsystem
 256Device ID 1) indicates that the device supports checksumming of
 257packets.
 258
 259The feature bits are negotiated: the device lists all the
 260features it understands in the Device Features field, and the
 261guest writes the subset that it understands into the Guest
 262Features field. The only way to renegotiate is to reset the
 263device.
 264
 265In particular, new fields in the device configuration header are
 266indicated by offering a feature bit, so the guest can check
 267before accessing that part of the configuration space.
 268
 269This allows for forwards and backwards compatibility: if the
 270device is enhanced with a new feature bit, older guests will not
 271write that feature bit back to the Guest Features field and it
 272can go into backwards compatibility mode. Similarly, if a guest
 273is enhanced with a feature that the device doesn't support, it
 274will not see that feature bit in the Device Features field and
 275can go into backwards compatibility mode (or, for poor
 276implementations, set the FAILED Device Status bit).
 277
 278  Configuration/Queue Vectors
 279
 280When MSI-X capability is present and enabled in the device
 281(through standard PCI configuration space) 4 bytes at byte offset
 28220 are used to map configuration change and queue interrupts to
 283MSI-X vectors. In this case, the ISR Status field is unused, and
 284device specific configuration starts at byte offset 24 in virtio
 285header structure. When MSI-X capability is not enabled, device
 286specific configuration starts at byte offset 20 in virtio header.
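
The byte offsets follow from the field widths in the header tables. A
sketch of the resulting layout is given here; the VIRTIO_HDR_* names
and the helper are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets implied by the field widths in the header tables. */
enum {
    VIRTIO_HDR_DEVICE_FEATURES = 0,   /* 32 bits, R   */
    VIRTIO_HDR_GUEST_FEATURES  = 4,   /* 32 bits, R+W */
    VIRTIO_HDR_QUEUE_ADDRESS   = 8,   /* 32 bits, R+W */
    VIRTIO_HDR_QUEUE_SIZE      = 12,  /* 16 bits, R   */
    VIRTIO_HDR_QUEUE_SELECT    = 14,  /* 16 bits, R+W */
    VIRTIO_HDR_QUEUE_NOTIFY    = 16,  /* 16 bits, R+W */
    VIRTIO_HDR_DEVICE_STATUS   = 18,  /*  8 bits, R+W */
    VIRTIO_HDR_ISR_STATUS      = 19,  /*  8 bits, R   */
    VIRTIO_HDR_CONFIG_VECTOR   = 20,  /* 16 bits, only with MSI-X enabled */
    VIRTIO_HDR_QUEUE_VECTOR    = 22,  /* 16 bits, only with MSI-X enabled */
};

/* Offset at which the device-specific configuration begins. */
static size_t virtio_device_config_offset(bool msix_enabled)
{
    return msix_enabled ? 24 : 20;
}
```

A driver has to recompute this offset whenever it enables or disables
MSI-X, since the device-specific fields move with it.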
 287
 288Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of
 289Configuration/Queue Vector registers, maps interrupts triggered
 290by the configuration change/selected queue events respectively to
 291the corresponding MSI-X vector. To disable interrupts for a
 292specific event type, unmap it by writing a special NO_VECTOR
 293value:
 294
 295/* Vector value used to disable MSI for queue */
 296
 297#define VIRTIO_MSI_NO_VECTOR            0xffff
 298
 299Reading these registers returns vector mapped to a given event,
 300or NO_VECTOR if unmapped. All queue and configuration change
 301events are unmapped by default.
 302
Note that mapping an event to a vector might require allocating
internal device resources, and might fail. Devices report such
failures by returning the NO_VECTOR value when the relevant
Vector field is read. After mapping an event to a vector, the
driver must verify success by reading the Vector field value: on
success, the previously written value is returned, and on
failure, NO_VECTOR is returned. If a mapping failure is detected,
the driver can retry mapping with fewer vectors, or disable MSI-X.
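
The write-then-read-back check might look like the following sketch.
The Vector register is simulated by a small struct so the example is
self-contained: struct vector_reg, its capacity limit and the helper
names are all hypothetical, and a real driver would perform 16-bit
accesses to the Configuration Vector or Queue Vector field instead.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_MSI_NO_VECTOR 0xffff

/*
 * Hypothetical simulated Vector register: the device can back only
 * `capacity` vectors and reports NO_VECTOR for anything else.
 */
struct vector_reg {
    uint16_t value;
    uint16_t capacity;
};

static void vector_write(struct vector_reg *r, uint16_t v)
{
    r->value = (v < r->capacity) ? v : VIRTIO_MSI_NO_VECTOR;
}

static uint16_t vector_read(const struct vector_reg *r)
{
    return r->value;
}

/* Map an event to vector v; true only if the device accepted it. */
static bool map_vector(struct vector_reg *r, uint16_t v)
{
    vector_write(r, v);
    return vector_read(r) == v;
}
```

When map_vector() returns false, the caller is expected to fall back,
for example by sharing one vector between several queues or by
disabling MSI-X.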
 311
 312  Virtqueue Configuration<sec:Virtqueue-Configuration>
 313
 314As a device can have zero or more virtqueues for bulk data
 315transport (for example, the network driver has two), the driver
 316needs to configure them as part of the device-specific
 317configuration.
 318
 319This is done as follows, for each virtqueue a device has:
 320
 321  Write the virtqueue index (first queue is 0) to the Queue
 322  Select field.
 323
 324  Read the virtqueue size from the Queue Size field, which is
 325  always a power of 2. This controls how big the virtqueue is
 326  (see below). If this field is 0, the virtqueue does not exist.
 327
 328  Allocate and zero virtqueue in contiguous physical memory, on a
 329  4096 byte alignment. Write the physical address, divided by
 330  4096 to the Queue Address field.[footnote:
 331The 4096 is based on the x86 page size, but it's also large
 332enough to ensure that the separate parts of the virtqueue are on
 333separate cache lines.
 334]
 335
 336  Optionally, if MSI-X capability is present and enabled on the
 337  device, select a vector to use to request interrupts triggered
 338  by virtqueue events. Write the MSI-X Table entry number
 339  corresponding to this vector in Queue Vector field. Read the
 340  Queue Vector field: on success, previously written value is
 341  returned; on failure, NO_VECTOR value is returned.
 342
 343The Queue Size field controls the total number of bytes required
 344for the virtqueue according to the following formula:
 345
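#define ALIGN(x) (((x) + 4095) & ~4095)

static inline unsigned vring_size(unsigned int qsz)
{
     /* Descriptor table, then the available ring: flags, idx,
        ring[qsz] and used_event, i.e. (3 + qsz) u16 values. */
     return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(3 + qsz))
          /* Used ring: flags, idx and avail_event, plus ring[qsz]. */
          + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz);
}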
 358
 359This currently wastes some space with padding, but also allows
 360future extensions. The virtqueue layout structure looks like this
 361(qsz is the Queue Size field, which is a variable, so this code
 362won't compile):
 363
struct vring {
    /* The actual descriptors (16 bytes each) */
    struct vring_desc desc[qsz];

    /* A ring of available descriptor heads with free-running index. */
    struct vring_avail avail;

    // Padding to the next 4096 boundary.
    char pad[];

    // A ring of used descriptor heads with free-running index.
    struct vring_used used;
};
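
Since the structure above cannot be compiled, a driver typically
computes the offsets of the three parts by hand. The sketch below does
so, counting every field of the ring structures shown in this
specification (16-byte descriptors, 8-byte used elements, and three
u16 fields around each ring). The struct vring_layout type and the
function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define VRING_PAGE 4096u

/* Round x up to the next multiple of VRING_PAGE. */
static size_t vring_align(size_t x)
{
    return (x + VRING_PAGE - 1) & ~(size_t)(VRING_PAGE - 1);
}

/* Hypothetical summary of where each part of a virtqueue starts. */
struct vring_layout {
    size_t desc_off;   /* descriptor table: always 0 */
    size_t avail_off;  /* available ring: right after the descriptors */
    size_t used_off;   /* used ring: on the next 4096-byte boundary */
    size_t total;      /* bytes to allocate, a multiple of 4096 */
};

static struct vring_layout vring_compute_layout(unsigned int qsz)
{
    /* 16 bytes per descriptor */
    const size_t desc_sz  = 16 * (size_t)qsz;
    /* flags, idx, ring[qsz], used_event */
    const size_t avail_sz = sizeof(uint16_t) * (3 + (size_t)qsz);
    /* flags, idx, avail_event, plus 8 bytes per used element */
    const size_t used_sz  = sizeof(uint16_t) * 3 + 8 * (size_t)qsz;
    struct vring_layout l;

    l.desc_off  = 0;
    l.avail_off = desc_sz;
    l.used_off  = vring_align(desc_sz + avail_sz);
    l.total     = l.used_off + vring_align(used_sz);
    return l;
}
```

For a Queue Size of 256 this gives an available ring at offset 4096, a
used ring at offset 8192 and 12288 bytes in total (three pages); for a
Queue Size of 8 the used ring starts at offset 4096 and the virtqueue
occupies two pages.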
 390
 391  A Note on Virtqueue Endianness
 392
 393Note that the endian of these fields and everything else in the
 394virtqueue is the native endian of the guest, not little-endian as
 395PCI normally is. This makes for simpler guest code, and it is
 396assumed that the host already has to be deeply aware of the guest
 397endian so such an “endian-aware” device is not a significant
 398issue.
 399
 400  Descriptor Table
 401
 402The descriptor table refers to the buffers the guest is using for
 403the device. The addresses are physical addresses, and the buffers
 404can be chained via the next field. Each descriptor describes a
 405buffer which is read-only or write-only, but a chain of
 406descriptors can contain both read-only and write-only buffers.
 407
No descriptor chain may be more than 2^32 bytes long in total.

struct vring_desc {
    /* Address (guest-physical). */
    u64 addr;

    /* Length. */
    u32 len;

/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT   1
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE     2
/* This means the buffer contains a list of buffer descriptors. */
#define VRING_DESC_F_INDIRECT   4

    /* The flags as indicated above. */
    u16 flags;

    /* Next field if flags & NEXT */
    u16 next;
};
 440
 441The number of descriptors in the table is specified by the Queue
 442Size field for this virtqueue.
 443
 444  <sub:Indirect-Descriptors>Indirect Descriptors
 445
 446Some devices benefit by concurrently dispatching a large number
 447of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be
 448used to allow this (see [cha:Reserved-Feature-Bits]). To increase
 449ring capacity it is possible to store a table of indirect
 450descriptors anywhere in memory, and insert a descriptor in main
 451virtqueue (with flags&INDIRECT on) that refers to memory buffer
 452containing this indirect descriptor table; fields addr and len
 453refer to the indirect table address and length in bytes,
 454respectively. The indirect table layout structure looks like this
 455(len is the length of the descriptor that refers to this table,
 456which is a variable, so this code won't compile):
 457
struct indirect_descriptor_table {
    /* The actual descriptors (16 bytes each) */
    struct vring_desc desc[len / 16];
};
 465
 466The first indirect descriptor is located at start of the indirect
 467descriptor table (index 0), additional indirect descriptors are
 468chained by next field. An indirect descriptor without next field
 469(with flags&NEXT off) signals the end of the indirect descriptor
 470table, and transfers control back to the main virtqueue. An
 471indirect descriptor can not refer to another indirect descriptor
 472table (flags&INDIRECT must be off). A single indirect descriptor
 473table can include both read-only and write-only descriptors;
 474write-only flag (flags&WRITE) in the descriptor that refers to it
 475is ignored.
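
A sketch of building such a table is given here. The function
fill_indirect() and its parameters are hypothetical; in particular,
table_phys is assumed to be the guest-physical address of the table,
which the caller has to obtain in an environment-specific way.

```c
#include <stdint.h>

#define VRING_DESC_F_NEXT      1
#define VRING_DESC_F_WRITE     2
#define VRING_DESC_F_INDIRECT  4

struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/*
 * Fill `table` with n chained descriptors (the first n_read are
 * read-only, the rest write-only) and make `main_desc`, a slot in the
 * main descriptor table, refer to it.
 */
static void fill_indirect(struct vring_desc *main_desc,
                          struct vring_desc *table, uint64_t table_phys,
                          const uint64_t *addrs, const uint32_t *lens,
                          unsigned n, unsigned n_read)
{
    for (unsigned i = 0; i < n; i++) {
        table[i].addr  = addrs[i];
        table[i].len   = lens[i];
        table[i].flags = (i >= n_read) ? VRING_DESC_F_WRITE : 0;
        if (i + 1 < n) {
            /* Chain to the following entry of the indirect table. */
            table[i].flags |= VRING_DESC_F_NEXT;
            table[i].next = (uint16_t)(i + 1);
        } else {
            /* Last entry: NEXT stays off, ending the table. */
            table[i].next = 0;
        }
    }

    main_desc->addr  = table_phys;
    main_desc->len   = (uint32_t)(n * sizeof(struct vring_desc));
    main_desc->flags = VRING_DESC_F_INDIRECT;
    main_desc->next  = 0;
}
```

The whole request then consumes a single slot of the main descriptor
table, whatever the value of n.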
 476
 477  Available Ring
 478
The available ring refers to what descriptors we are offering the
device: it refers to the head of a descriptor chain. The “flags”
field is currently 0 or 1: 1 indicating that we do not need an
interrupt when the device consumes a descriptor from the
available ring. Alternatively, the guest can ask the device to
delay interrupts until an entry with an index specified by the
“used_event” field is written in the used ring (equivalently,
until the idx field in the used ring reaches the value
used_event + 1). The method employed by the device is controlled
by the VIRTIO_RING_F_EVENT_IDX feature bit (see
[cha:Reserved-Feature-Bits]). This interrupt suppression is merely
an optimization; it may not suppress interrupts entirely.
 491
 492The “idx” field indicates where we would put the next descriptor
 493entry (modulo the ring size). This starts at 0, and increases.
 494
struct vring_avail {
#define VRING_AVAIL_F_NO_INTERRUPT      1
   u16 flags;
   u16 idx;
   u16 ring[qsz]; /* qsz is the Queue Size field read from device */
   u16 used_event;
};
 509
 510  Used Ring
 511
The used ring is where the device returns buffers once it is done
with them. The flags field can be used by the device to hint that
no notification is necessary when the guest adds to the available
ring. Alternatively, the “avail_event” field can be used by the
device to hint that no notification is necessary until an entry
with an index specified by the “avail_event” field is written in
the available ring (equivalently, until the idx field in the
available ring reaches the value avail_event + 1). The method
employed by the device is controlled by the guest through the
VIRTIO_RING_F_EVENT_IDX feature bit (see
[cha:Reserved-Feature-Bits]).[footnote:
These fields are kept here because this is the only part of the
virtqueue written by the device.
]
 526
Each entry in the ring is a pair: the head entry of the
descriptor chain describing the buffer (this matches an entry
placed in the available ring by the guest earlier), and the total
number of bytes written into the buffer. The latter is extremely
useful for guests using untrusted buffers: if you do not know
exactly how much has been written by the device, you usually have
to zero the buffer to ensure no data leakage occurs.
 534
/* u32 is used here for ids for padding reasons. */
struct vring_used_elem {
    /* Index of start of used descriptor chain. */
    u32 id;

    /* Total length of the descriptor chain which was used (written to) */
    u32 len;
};

struct vring_used {
#define VRING_USED_F_NO_NOTIFY  1
    u16 flags;
    u16 idx;
    struct vring_used_elem ring[qsz];
    u16 avail_event;
};
 565
 566  Helpers for Managing Virtqueues
 567
 568The Linux Kernel Source code contains the definitions above and
 569helper routines in a more usable form, in
 570include/linux/virtio_ring.h. This was explicitly licensed by IBM
 571and Red Hat under the (3-clause) BSD license so that it can be
 572freely used by all other projects, and is reproduced (with slight
 573variation to remove Linux assumptions) in Appendix A.
 574
 575  Device Operation<sec:Device-Operation>
 576
 577There are two parts to device operation: supplying new buffers to
 578the device, and processing used buffers from the device. As an
 579example, the virtio network device has two virtqueues: the
 580transmit virtqueue and the receive virtqueue. The driver adds
 581outgoing (read-only) packets to the transmit virtqueue, and then
 582frees them after they are used. Similarly, incoming (write-only)
 583buffers are added to the receive virtqueue, and processed after
 584they are used.
 585
 586  Supplying Buffers to The Device
 587
 588Actual transfer of buffers from the guest OS to the device
 589operates as follows:
 590
  1. Place the buffer(s) into free descriptor(s).

     (a) If there are no free descriptors, the guest may choose to
         notify the device even if notifications are suppressed (to
         reduce latency).[footnote:
The Linux drivers do this only for read-only buffers: for
write-only buffers, it is assumed that the driver is merely
trying to keep the receive buffer ring full, and no notification
of this expected condition is necessary.
]

  2. Place the id of the buffer in the next ring entry of the
     available ring.

  3. The steps (1) and (2) may be performed repeatedly if batching
     is possible.

  4. A memory barrier should be executed to ensure the device sees
     the updated descriptor table and available ring before the next
     step.

  5. The available “idx” field should be increased by the number of
     entries added to the available ring.

  6. A memory barrier should be executed to ensure that we update
     the idx field before checking for notification suppression.

  7. If notifications are not suppressed, the device should be
     notified of the new buffers.
 620
Note that the above procedure does not take precautions against
the available ring buffer wrapping around: this is not possible
since the ring buffer is the same size as the descriptor table, so
step (1) will prevent such a condition.
 625
 626In addition, the maximum queue size is 32768 (it must be a power
 627of 2 which fits in 16 bits), so the 16-bit “idx” value can always
 628distinguish between a full and empty buffer.
 629
 630Here is a description of each stage in more detail.
 631
 632  Placing Buffers Into The Descriptor Table
 633
 634A buffer consists of zero or more read-only physically-contiguous
 635elements followed by zero or more physically-contiguous
 636write-only elements (it must have at least one element). This
 637algorithm maps it into the descriptor table:
 638
  for each buffer element, b:

  1. Get the next free descriptor table entry, d

  2. Set d.addr to the physical address of the start of b

  3. Set d.len to the length of b.

  4. If b is write-only, set d.flags to VRING_DESC_F_WRITE,
     otherwise 0.

  5. If there is a buffer element after this:

     (a) Set d.next to the index of the next free descriptor element.

     (b) Set the VRING_DESC_F_NEXT bit in d.flags.
 655
 656In practice, the d.next fields are usually used to chain free
 657descriptors, and a separate count kept to check there are enough
 658free descriptors before beginning the mappings.
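
The algorithm, together with the free-list convention just described,
might be sketched as follows. The struct desc_pool and struct buf_elem
types and both functions are hypothetical driver-side bookkeeping;
they are not part of the virtqueue layout.

```c
#include <stdint.h>

#define VRING_DESC_F_NEXT   1
#define VRING_DESC_F_WRITE  2

struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Hypothetical driver-side bookkeeping for free descriptors. */
struct desc_pool {
    struct vring_desc *desc;  /* the descriptor table */
    uint16_t free_head;       /* first free descriptor, chained via next */
    uint16_t num_free;        /* separate count of free descriptors */
};

/* One physically-contiguous element of a buffer. */
struct buf_elem {
    uint64_t phys;
    uint32_t len;
    int write_only;
};

/* Chain all descriptors into the free list through their next fields. */
static void pool_init(struct desc_pool *p, struct vring_desc *desc,
                      uint16_t qsz)
{
    p->desc = desc;
    p->free_head = 0;
    p->num_free = qsz;
    for (uint16_t i = 0; i < qsz; i++)
        desc[i].next = (uint16_t)(i + 1);  /* last link is never followed */
}

/* Map n elements into a chain; returns the head index, or -1. */
static int map_buffer(struct desc_pool *p, const struct buf_elem *b,
                      uint16_t n)
{
    uint16_t head, i;

    if (n == 0 || n > p->num_free)
        return -1;                         /* check before mapping anything */

    head = i = p->free_head;
    for (uint16_t k = 0; k < n; k++) {
        struct vring_desc *d = &p->desc[i];

        d->addr  = b[k].phys;
        d->len   = b[k].len;
        d->flags = b[k].write_only ? VRING_DESC_F_WRITE : 0;
        i = d->next;                       /* the next free descriptor */
        if (k + 1 < n)
            d->flags |= VRING_DESC_F_NEXT; /* d->next already points at it */
    }
    p->free_head = i;
    p->num_free -= n;
    return head;
}
```

Because the free descriptors are already linked through their next
fields, taking n of them leaves the chain in place and only the flags
need to be filled in.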
 659
 660  Updating The Available Ring
 661
 662The head of the buffer we mapped is the first d in the algorithm
 663above. A naive implementation would do the following:
 664
 665avail->ring[avail->idx % qsz] = head;
 666
 667However, in general we can add many descriptors before we update
 668the “idx” field (at which point they become visible to the
 669device), so we keep a counter of how many we've added:
 670
 671avail->ring[(avail->idx + added++) % qsz] = head;
 672
 673  Updating The Index Field
 674
 675Once the idx field of the virtqueue is updated, the device will
 676be able to access the descriptor entries we've created and the
 677memory they refer to. This is why a memory barrier is generally
 678used before the idx update, to ensure it sees the most up-to-date
 679copy.
 680
 681The idx field always increments, and we let it wrap naturally at
 68265536:
 683
 684avail->idx += added;
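
Putting the last two subsections together, a batching driver might
look like this sketch. QSZ, struct avail_batch and both functions are
hypothetical. The C11 fences only mark where the memory barriers
belong: a real guest would use whatever barrier its platform requires
for memory shared with the device.

```c
#include <stdatomic.h>
#include <stdint.h>

#define QSZ 8  /* example Queue Size; a real driver reads it from the device */

struct vring_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[QSZ];
    uint16_t used_event;
};

/* Hypothetical driver-side state for batching. */
struct avail_batch {
    struct vring_avail *avail;
    uint16_t added;   /* entries written but not yet published */
};

/* Queue one descriptor chain head without exposing it to the device. */
static void avail_add(struct avail_batch *b, uint16_t head)
{
    b->avail->ring[(uint16_t)(b->avail->idx + b->added++) % QSZ] = head;
}

/* Publish everything queued so far by moving idx forward. */
static void avail_publish(struct avail_batch *b)
{
    /* Descriptors and ring entries must be visible before idx moves. */
    atomic_thread_fence(memory_order_release);
    b->avail->idx = (uint16_t)(b->avail->idx + b->added);
    b->added = 0;
    /* idx must be visible before notification suppression is checked. */
    atomic_thread_fence(memory_order_seq_cst);
}
```

Starting from idx 65534 with a Queue Size of 8, adding three heads
fills ring slots 6, 7 and 0, and publishing wraps idx around to 1.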
 685
 686  <sub:Notifying-The-Device>Notifying The Device
 687
 688Device notification occurs by writing the 16-bit virtqueue index
 689of this virtqueue to the Queue Notify field of the virtio header
 690in the first I/O region of the PCI device. This can be expensive,
 691however, so the device can suppress such notifications if it
 692doesn't need them. We have to be careful to expose the new idx
 693value before checking the suppression flag: it's OK to notify
 694gratuitously, but not to omit a required notification. So again,
 695we use a memory barrier here before reading the flags or the
 696avail_event field.
 697
If the VIRTIO_RING_F_EVENT_IDX feature is not negotiated, and if
the VRING_USED_F_NO_NOTIFY flag is not set, we go ahead and write
to the Queue Notify field.

If the VIRTIO_RING_F_EVENT_IDX feature is negotiated, we read the
avail_event field in the used ring structure. If the available
index crossed the avail_event field value since the last
notification, we go ahead and write to the Queue Notify field.
The avail_event field wraps naturally at 65536 as well:
 707
 708(u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx)
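
The comparison relies on unsigned 16-bit wraparound, which is easiest
to see with concrete values. The helper below wraps the expression in
a function; the name mirrors the vring_need_event() helper in the
Linux virtio_ring.h header, but the code here is only a sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if moving the index from old_idx to new_idx stepped past
 * event_idx, i.e. the entry with index event_idx was among those
 * just written.
 */
static bool vring_need_event(uint16_t event_idx, uint16_t new_idx,
                             uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

With old_idx = 10 and new_idx = 12, entries 10 and 11 were written, so
an event index of 10 or 11 requires a notification while 9 or 12 does
not. The same holds across the wrap: with old_idx = 65535 and
new_idx = 1, event indexes 65535 and 0 trigger and 1 does not.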
 709
 710  <sub:Receiving-Used-Buffers>Receiving Used Buffers From The
 711  Device
 712
 713Once the device has used a buffer (read from or written to it, or
 714parts of both, depending on the nature of the virtqueue and the
 715device), it sends an interrupt, following an algorithm very
 716similar to the algorithm used for the driver to send the device a
 717buffer:
 718
  1. Write the head descriptor number to the next field in the used
     ring.

  2. Update the used ring idx.

  3. Determine whether an interrupt is necessary:

     (a) If the VIRTIO_RING_F_EVENT_IDX feature is not negotiated:
         check if the VRING_AVAIL_F_NO_INTERRUPT flag is not set in
         avail->flags

     (b) If the VIRTIO_RING_F_EVENT_IDX feature is negotiated: check
         whether the used index crossed the used_event field value
         since the last update. The used_event field wraps naturally
         at 65536 as well:

         (u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx)

  4. If an interrupt is necessary:

     (a) If MSI-X capability is disabled:

         i. Set the lower bit of the ISR Status field for the device.

         ii. Send the appropriate PCI interrupt for the device.

     (b) If MSI-X capability is enabled:

         i. Request the appropriate MSI-X interrupt message for the
            device; the Queue Vector field sets the MSI-X Table
            entry number.

         ii. If the Queue Vector field value is NO_VECTOR, no
             interrupt message is requested for this event.
 751
 752The guest interrupt handler should:
 753
 754  If MSI-X capability is disabled: read the ISR Status field,
 755  which will reset it to zero. If the lower bit is zero, the
 756  interrupt was not for this device. Otherwise, the guest driver
 757  should look through the used rings of each virtqueue for the
 758  device, to see if any progress has been made by the device
 759  which requires servicing.
 760
 761  If MSI-X capability is enabled: look through the used rings of
 762  each virtqueue mapped to the specific MSI-X vector for the
 763  device, to see if any progress has been made by the device
 764  which requires servicing.
 765
For each ring, the guest should then disable interrupts by writing
the VRING_AVAIL_F_NO_INTERRUPT flag in the avail structure, if
required. It can then process used ring entries, finally enabling
interrupts by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or
updating the used_event field in the available structure. The
guest should then execute a memory barrier, and then recheck the
ring empty condition. This is necessary to handle the case where,
after the last check and before enabling interrupts, an interrupt
has been suppressed by the device:
 775
vring_disable_interrupts(vq);

for (;;) {
    if (vq->last_seen_used == vring->used.idx) {
        /* Ring looks empty: re-enable interrupts, then recheck. */
        vring_enable_interrupts(vq);
        mb();

        if (vq->last_seen_used == vring->used.idx)
            break;

        /* The device added a buffer in the window: keep polling. */
        vring_disable_interrupts(vq);
    }

    struct vring_used_elem *e =
        &vring->used.ring[vq->last_seen_used % qsz];
    process_buffer(e);
    vq->last_seen_used++;
}
 800
 801  Dealing With Configuration Changes<sub:Dealing-With-Configuration>
 802
 803Some virtio PCI devices can change the device configuration
 804state, as reflected in the virtio header in the PCI configuration
 805space. In this case:
 806
  If MSI-X capability is disabled: an interrupt is delivered and
  the second lowest bit is set in the ISR Status field to
  indicate that the driver should re-examine the configuration
  space. Note that a single interrupt can indicate both that one
  or more virtqueues have been used and that the configuration
  space has changed: even if the config bit is set, virtqueues
  must be scanned.
 814
 815  If MSI-X capability is enabled: an interrupt message is
 816  requested. The Configuration Vector field sets the MSI-X Table
 817  entry number to use. If Configuration Vector field value is
 818  NO_VECTOR, no interrupt message is requested for this event.
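
The non-MSI-X handling of the ISR Status field can be sketched as
follows. This is only an illustration: the names ISR_QUEUE_BIT,
ISR_CONFIG_BIT, struct isr_actions and decode_isr() are
hypothetical, and the sketch assumes the configuration-change bit
has the value 0x2 (the bit next to the lowest one).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the bit values are an assumption of this sketch. */
#define ISR_QUEUE_BIT   0x1u  /* lowest bit: the interrupt is for this device */
#define ISR_CONFIG_BIT  0x2u  /* next bit: configuration space has changed */

struct isr_actions {
        bool ours;          /* interrupt was raised by this device */
        bool scan_queues;   /* look through the used rings */
        bool reread_config; /* re-examine the configuration space */
};

/* Decode one value read from ISR Status (the read itself resets it). */
static struct isr_actions decode_isr(uint8_t isr)
{
        struct isr_actions a = { false, false, false };

        if ((isr & ISR_QUEUE_BIT) == 0)
                return a;       /* lower bit clear: not for this device */
        a.ours = true;
        /* Virtqueues are scanned even when the config bit is set. */
        a.scan_queues = true;
        a.reread_config = (isr & ISR_CONFIG_BIT) != 0;
        return a;
}
```

A handler would call decode_isr() once per interrupt with the
value it read, since a second read would return zero.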
 819
 820Creating New Device Types
 821
 822Various considerations are necessary when creating a new device
 823type:
 824
 825  How Many Virtqueues?
 826
It is possible that a very simple device will operate entirely
through its configuration space, but most will need at least one
virtqueue in which the driver will place requests. A device with
both input and output (eg. the console and network devices
described here) needs two queues: one which the driver fills with
buffers to receive input, and one in which the driver places
buffers to transmit output.
 834
 835  What Configuration Space Layout?
 836
 837Configuration space is generally used for rarely-changing or
 838initialization-time parameters. But it is a limited resource, so
 839it might be better to use a virtqueue to update configuration
 840information (the network device does this for filtering,
 841otherwise the table in the config space could potentially be very
 842large).
 843
 844Note that this space is generally the guest's native endian,
 845rather than PCI's little-endian.
 846
 847  What Device Number?
 848
 849Currently device numbers are assigned quite freely: a simple
 850request mail to the author of this document or the Linux
 851virtualization mailing list[footnote:
 852
 853https://lists.linux-foundation.org/mailman/listinfo/virtualization
 854] will be sufficient to secure a unique one.
 855
 856Meanwhile for experimental drivers, use 65535 and work backwards.
 857
 858  How many MSI-X vectors?
 859
Using the optional MSI-X capability, devices can speed up
interrupt processing by removing the need for the guest driver to
read the ISR Status register (which might be an expensive
operation), reducing interrupt sharing between devices and queues
within the device, and handling interrupts from multiple CPUs.
However, some systems impose a limit (which might be as low as
256) on the total number of MSI-X vectors that can be allocated to
all devices. Devices and/or device drivers should take this into
account, limiting the number of vectors used unless the device is
expected to cause a high volume of interrupts. Devices can
control the number of vectors used by limiting the MSI-X Table
Size or not presenting the MSI-X capability in PCI configuration
space. Drivers can control this by mapping events to as small a
number of vectors as possible, or disabling the MSI-X capability
altogether.
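
One possible way for a driver to map events onto a small number
of vectors is sketched below. The policy (entry 0 for
configuration changes, queues sharing the remaining entries
round-robin) and the helper name queue_vector() are assumptions
of this example, as is the value 0xffff for NO_VECTOR.

```c
#include <stdint.h>

#define NO_VECTOR 0xffffu /* assumed value of the NO_VECTOR marker */

/* Pick the MSI-X Table entry for virtqueue `queue` when only
 * `nvectors` entries are available. Entry 0 is kept for
 * configuration changes; queues share the rest round-robin. */
static uint16_t queue_vector(unsigned int queue, unsigned int nvectors)
{
        if (nvectors == 0)
                return NO_VECTOR;       /* no interrupt message requested */
        if (nvectors == 1)
                return 0;               /* everything shares entry 0 */
        return (uint16_t)(1 + queue % (nvectors - 1));
}
```

With three vectors, queues 0 and 2 share entry 1 and queue 1 uses
entry 2, so the handler for entry 1 must scan both of its queues.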
 875
 876  Message Framing
 877
The descriptors used for a buffer should not affect the semantics
 879of the message, except for the total length of the buffer. For
 880example, a network buffer consists of a 10 byte header followed
 881by the network packet. Whether this is presented in the ring
 882descriptor chain as (say) a 10 byte buffer and a 1514 byte
 883buffer, or a single 1524 byte buffer, or even three buffers,
 884should have no effect.
 885
 886In particular, no implementation should use the descriptor
 887boundaries to determine the size of any header in a request.[footnote:
 888The current qemu device implementations mistakenly insist that
 889the first descriptor cover the header in these cases exactly, so
 890a cautious driver should arrange it so.
 891]
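
A device-side sketch of boundary-independent header parsing: the
header is gathered by byte count, never by descriptor. The struct
seg type and chain_read() are hypothetical helpers standing in
for an already-mapped descriptor chain.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One element of an already-mapped descriptor chain (hypothetical). */
struct seg {
        const uint8_t *base;
        size_t len;
};

/* Copy the first `want` bytes of the chain into `out`, crossing
 * segment boundaries as needed. Returns the number of bytes copied,
 * which is less than `want` only if the chain is too short. */
static size_t chain_read(const struct seg *segs, size_t nsegs,
                         uint8_t *out, size_t want)
{
        size_t done = 0;

        for (size_t i = 0; i < nsegs && done < want; i++) {
                size_t n = segs[i].len;

                if (n > want - done)
                        n = want - done;
                memcpy(out + done, segs[i].base, n);
                done += n;
        }
        return done;
}
```

Reading a 10 byte header this way gives the same bytes whether
the chain is one buffer or several.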
 892
 893  Device Improvements
 894
 895Any change to configuration space, or new virtqueues, or
 896behavioural changes, should be indicated by negotiation of a new
 897feature bit. This establishes clarity[footnote:
 898Even if it does mean documenting design or implementation
 899mistakes!
 900] and avoids future expansion problems.
 901
 902Clusters of functionality which are always implemented together
 903can use a single bit, but if one feature makes sense without the
 904others they should not be gratuitously grouped together to
 905conserve feature bits. We can always extend the spec when the
 906first person needs more than 24 feature bits for their device.
 907
 909
 910Appendix A: virtio_ring.h
 911
#ifndef VIRTIO_RING_H
#define VIRTIO_RING_H

/* An interface for efficient virtio implementation.
 *
 * This header is BSD licensed so anyone can use the definitions
 * to implement compatible drivers/servers.
 *
 * Copyright 2007, 2009, IBM Corporation
 * Copyright 2011, Red Hat, Inc
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of IBM nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */
 997
 998
 999
/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT       1
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE      2

/* The Host uses this in used->flags to advise the Guest: don't kick me
 * when you add a buffer.  It's unreliable, so it's simply an
 * optimization.  Guest will still kick if it's out of buffers. */
#define VRING_USED_F_NO_NOTIFY  1
/* The Guest uses this in avail->flags to advise the Host: don't
 * interrupt me when you consume a buffer.  It's unreliable, so it's
 * simply an optimization.  */
#define VRING_AVAIL_F_NO_INTERRUPT      1
1028
1029
1030
/* Virtio ring descriptors: 16 bytes.
 * These can chain together via "next". */
struct vring_desc {
        /* Address (guest-physical). */
        uint64_t addr;
        /* Length. */
        uint32_t len;
        /* The flags as indicated above. */
        uint16_t flags;
        /* We chain unused descriptors via this, too */
        uint16_t next;
};

struct vring_avail {
        uint16_t flags;
        uint16_t idx;
        uint16_t ring[];
        /* uint16_t used_event; follows ring[num] in memory. */
};

/* u32 is used here for ids for padding reasons. */
struct vring_used_elem {
        /* Index of start of used descriptor chain. */
        uint32_t id;
        /* Total length of the descriptor chain which was written to. */
        uint32_t len;
};

struct vring_used {
        uint16_t flags;
        uint16_t idx;
        struct vring_used_elem ring[];
        /* uint16_t avail_event; follows ring[num] in memory. */
};
1099
1100
1101
1102struct vring {
1103
1104        unsigned int num;
1105
1106
1107
1108        struct vring_desc *desc;
1109
1110        struct vring_avail *avail;
1111
1112        struct vring_used *used;
1113
1114};
1115
1116
1117
/* The standard layout for the ring is a continuous chunk of memory which
 * looks like this.  We assume num is a power of 2.
 *
 * struct vring {
 *      // The actual descriptors (16 bytes each)
 *      struct vring_desc desc[num];
 *
 *      // A ring of available descriptor heads with free-running index.
 *      __u16 avail_flags;
 *      __u16 avail_idx;
 *      __u16 available[num];
 *
 *      // Padding to the next align boundary.
 *      char pad[];
 *
 *      // A ring of used descriptor heads with free-running index.
 *      __u16 used_flags;
 *      __u16 used_idx;
 *      struct vring_used_elem used[num];
 * };
 * Note: for virtio PCI, align is 4096.
 */
1164
static inline void vring_init(struct vring *vr, unsigned int num, void *p,
                              unsigned long align)
{
        vr->num = num;
        vr->desc = p;
        vr->avail = p + num*sizeof(struct vring_desc);
        vr->used = (void *)(((unsigned long)&vr->avail->ring[num]
                              + align-1)
                            & ~(align - 1));
}

static inline unsigned vring_size(unsigned int num, unsigned long align)
{
        return ((sizeof(struct vring_desc)*num + sizeof(uint16_t)*(2+num)
                 + align - 1) & ~(align - 1))
                + sizeof(uint16_t)*3 + sizeof(struct vring_used_elem)*num;
}

static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx,
                                   uint16_t old_idx)
{
         return (uint16_t)(new_idx - event_idx - 1) <
                (uint16_t)(new_idx - old_idx);
}
1214
1215#endif /* VIRTIO_RING_H */
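
As a worked example of the two helpers in the header above, the
sketch below repeats their arithmetic in standalone form. The
names ring_bytes() and need_event() and the two size macros are
local to this example; the sizes are the 16 byte descriptor and 8
byte used element from the structures above.

```c
#include <stdint.h>

#define DESC_SIZE       16u  /* sizeof(struct vring_desc) */
#define USED_ELEM_SIZE  8u   /* sizeof(struct vring_used_elem) */

/* Same arithmetic as vring_size() in the header above. */
static unsigned int ring_bytes(unsigned int num, unsigned long align)
{
        unsigned long first = DESC_SIZE * num + sizeof(uint16_t) * (2 + num);

        /* Round the descriptor table plus available ring up to `align`. */
        first = (first + align - 1) & ~(align - 1);
        return (unsigned int)(first + sizeof(uint16_t) * 3
                              + USED_ELEM_SIZE * num);
}

/* Same comparison as vring_need_event() in the header above: true when
 * event_idx lies in the half-open window [old_idx, new_idx), computed
 * modulo 65536 so that index wraparound is handled. */
static int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
        return (uint16_t)(new_idx - event_idx - 1) <
               (uint16_t)(new_idx - old_idx);
}
```

For num = 256 and align = 4096 the first part is 4096 + 516 = 4612
bytes, which rounds up to 8192; adding 6 + 2048 bytes for the used
ring gives 10246 bytes in total.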
1216
1217<cha:Reserved-Feature-Bits>Appendix B: Reserved Feature Bits
1218
1219Currently there are five device-independent feature bits defined:
1220
1221  VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature
1222  indicates that the driver wants an interrupt if the device runs
1223  out of available descriptors on a virtqueue, even though
1224  interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT
1225  flag or the used_event field. An example of this is the
1226  networking driver: it doesn't need to know every time a packet
1227  is transmitted, but it does need to free the transmitted
1228  packets a finite time after they are transmitted. It can avoid
1229  using a timer if the device interrupts it when all the packets
1230  are transmitted.
1231
  VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature
  indicates that the driver can use descriptors with the
  VRING_DESC_F_INDIRECT flag set, as described in
  [sub:Indirect-Descriptors].

  VIRTIO_F_RING_EVENT_IDX (29) This feature enables the used_event
1238  and the avail_event fields. If set, it indicates that the
1239  device should ignore the flags field in the available ring
1240  structure. Instead, the used_event field in this structure is
1241  used by guest to suppress device interrupts. Further, the
1242  driver should ignore the flags field in the used ring
1243  structure. Instead, the avail_event field in this structure is
1244  used by the device to suppress notifications. If unset, the
1245  driver should ignore the used_event field; the device should
1246  ignore the avail_event field; the flags field is used
1247
1248Appendix C: Network Device
1249
1250The virtio network device is a virtual ethernet card, and is the
most complex of the devices supported so far by virtio. It has been
enhanced rapidly and demonstrates clearly how support for new
1253features should be added to an existing device. Empty buffers are
1254placed in one virtqueue for receiving packets, and outgoing
1255packets are enqueued into another for transmission in that order.
1256A third command queue is used to control advanced filtering
1257features.
1258
1259  Configuration
1260
1261  Subsystem Device ID 1
1262
1263  Virtqueues 0:receiveq. 1:transmitq. 2:controlq[footnote:
1264Only if VIRTIO_NET_F_CTRL_VQ set
1265]
1266
1267  Feature bits
1268
1269  VIRTIO_NET_F_CSUM (0) Device handles packets with partial
1270    checksum
1271
1272  VIRTIO_NET_F_GUEST_CSUM (1) Guest handles packets with partial
1273    checksum
1274
1275  VIRTIO_NET_F_MAC (5) Device has given MAC address.
1276
1277  VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets with
1278    any GSO type.[footnote:
1279It was supposed to indicate segmentation offload support, but
1280upon further investigation it became clear that multiple bits
1281were required.
1282]
1283
1284  VIRTIO_NET_F_GUEST_TSO4 (7) Guest can receive TSOv4.
1285
1286  VIRTIO_NET_F_GUEST_TSO6 (8) Guest can receive TSOv6.
1287
1288  VIRTIO_NET_F_GUEST_ECN (9) Guest can receive TSO with ECN.
1289
1290  VIRTIO_NET_F_GUEST_UFO (10) Guest can receive UFO.
1291
1292  VIRTIO_NET_F_HOST_TSO4 (11) Device can receive TSOv4.
1293
1294  VIRTIO_NET_F_HOST_TSO6 (12) Device can receive TSOv6.
1295
1296  VIRTIO_NET_F_HOST_ECN (13) Device can receive TSO with ECN.
1297
1298  VIRTIO_NET_F_HOST_UFO (14) Device can receive UFO.
1299
1300  VIRTIO_NET_F_MRG_RXBUF (15) Guest can merge receive buffers.
1301
1302  VIRTIO_NET_F_STATUS (16) Configuration status field is
1303    available.
1304
1305  VIRTIO_NET_F_CTRL_VQ (17) Control channel is available.
1306
1307  VIRTIO_NET_F_CTRL_RX (18) Control channel RX mode support.
1308
1309  VIRTIO_NET_F_CTRL_VLAN (19) Control channel VLAN filtering.
1310
  VIRTIO_NET_F_GUEST_ANNOUNCE (21) Guest can send gratuitous
    packets.
1313
  Device configuration layout Two configuration fields are
  currently defined. The mac address field always exists (though
  it is only valid if VIRTIO_NET_F_MAC is set), and the status
  field only exists if VIRTIO_NET_F_STATUS is set. Two read-only
  bits are currently defined for the status field:
  VIRTIO_NET_S_LINK_UP and VIRTIO_NET_S_ANNOUNCE.

#define VIRTIO_NET_S_LINK_UP    1
#define VIRTIO_NET_S_ANNOUNCE   2

struct virtio_net_config {
    u8 mac[6];
    u16 status;
};
1332
1333  Device Initialization
1334
1335  The initialization routine should identify the receive and
1336  transmission virtqueues.
1337
  If the VIRTIO_NET_F_MAC feature bit is set, the configuration
  space “mac” entry indicates the “physical” address of the
  network card, otherwise a private MAC address should be
  assigned. All guests are expected to negotiate this feature if
  it is set.
1343
1344  If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, identify
1345  the control virtqueue.
1346
1347  If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link
1348  status can be read from the bottom bit of the “status” config
1349  field. Otherwise, the link should be assumed active.
1350
1351  The receive virtqueue should be filled with receive buffers.
1352  This is described in detail below in “Setting Up Receive
1353  Buffers”.
1354
  A driver can indicate that it will generate checksumless
  packets by negotiating the VIRTIO_NET_F_CSUM feature. This
  “checksum offload” is a common feature on modern network cards.
1358
1359  If that feature is negotiated[footnote:
ie. VIRTIO_NET_F_HOST_TSO* and VIRTIO_NET_F_HOST_UFO are
dependent on VIRTIO_NET_F_CSUM; a device which offers the offload
features must offer the checksum feature, and a driver which
accepts the offload features must accept the checksum feature.
Similar logic applies to the VIRTIO_NET_F_GUEST_TSO4 features
depending on VIRTIO_NET_F_GUEST_CSUM.
1366], a driver can use TCP or UDP segmentation offload by
1367  negotiating the VIRTIO_NET_F_HOST_TSO4 (IPv4 TCP),
1368  VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and VIRTIO_NET_F_HOST_UFO
1369  (UDP fragmentation) features. It should not send TCP packets
1370  requiring segmentation offload which have the Explicit
1371  Congestion Notification bit set, unless the
1372  VIRTIO_NET_F_HOST_ECN feature is negotiated.[footnote:
1373This is a common restriction in real, older network cards.
1374]
1375
1376  The converse features are also available: a driver can save the
1377  virtual device some work by negotiating these features.[footnote:
1378For example, a network packet transported between two guests on
1379the same system may not require checksumming at all, nor
1380segmentation, if both guests are amenable.
1381] The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially
1382  checksummed packets can be received, and if it can do that then
1383  the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
1384  VIRTIO_NET_F_GUEST_UFO and VIRTIO_NET_F_GUEST_ECN are the input
1385  equivalents of the features described above. See “Receiving
1386  Packets” below.
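
The offload dependencies described above can be checked
mechanically. The following is a sketch only: the shortened macro
names and net_features_consistent() are hypothetical, the bit
numbers come from the feature list earlier in this appendix, and
treating GUEST_TSO6 and GUEST_UFO like GUEST_TSO4 is this
example's reading of “similar logic”.

```c
#include <stdbool.h>
#include <stdint.h>

#define NET_BIT(n) (UINT32_C(1) << (n))

/* Bit numbers from the feature list of this appendix. */
#define NET_F_CSUM        NET_BIT(0)
#define NET_F_GUEST_CSUM  NET_BIT(1)
#define NET_F_GUEST_TSO4  NET_BIT(7)
#define NET_F_GUEST_TSO6  NET_BIT(8)
#define NET_F_GUEST_UFO   NET_BIT(10)
#define NET_F_HOST_TSO4   NET_BIT(11)
#define NET_F_HOST_TSO6   NET_BIT(12)
#define NET_F_HOST_UFO    NET_BIT(14)

/* True if the feature set respects the dependencies: host offloads
 * need CSUM, guest offloads need GUEST_CSUM. */
static bool net_features_consistent(uint32_t f)
{
        const uint32_t host_offloads =
                NET_F_HOST_TSO4 | NET_F_HOST_TSO6 | NET_F_HOST_UFO;
        const uint32_t guest_offloads =
                NET_F_GUEST_TSO4 | NET_F_GUEST_TSO6 | NET_F_GUEST_UFO;

        if ((f & host_offloads) != 0 && (f & NET_F_CSUM) == 0)
                return false;
        if ((f & guest_offloads) != 0 && (f & NET_F_GUEST_CSUM) == 0)
                return false;
        return true;
}
```

A driver could run such a check on the set it is about to accept,
and a device on the set it offers.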
1387
1388  Device Operation
1389
1390Packets are transmitted by placing them in the transmitq, and
1391buffers for incoming packets are placed in the receiveq. In each
case, the packet itself is preceded by a header:
1393
struct virtio_net_hdr {
#define VIRTIO_NET_HDR_F_NEEDS_CSUM    1
        u8 flags;
#define VIRTIO_NET_HDR_GSO_NONE        0
#define VIRTIO_NET_HDR_GSO_TCPV4       1
#define VIRTIO_NET_HDR_GSO_UDP         3
#define VIRTIO_NET_HDR_GSO_TCPV6       4
#define VIRTIO_NET_HDR_GSO_ECN      0x80
        u8 gso_type;
        u16 hdr_len;
        u16 gso_size;
        u16 csum_start;
        u16 csum_offset;
/* Only if VIRTIO_NET_F_MRG_RXBUF: */
        u16 num_buffers;
};
1425
1426The controlq is used to control device features such as
1427filtering.
1428
1429  Packet Transmission
1430
1431Transmitting a single packet is simple, but varies depending on
1432the different features the driver negotiated.
1433
1434  If the driver negotiated VIRTIO_NET_F_CSUM, and the packet has
1435  not been fully checksummed, then the virtio_net_hdr's fields
1436  are set as follows. Otherwise, the packet must be fully
1437  checksummed, and flags is zero.
1438
1439  flags has the VIRTIO_NET_HDR_F_NEEDS_CSUM set,
1440
1441  <ite:csum_start-is-set>csum_start is set to the offset within
1442    the packet to begin checksumming, and
1443
1444  csum_offset indicates how many bytes after the csum_start the
1445    new (16 bit ones' complement) checksum should be placed.[footnote:
1446For example, consider a partially checksummed TCP (IPv4) packet.
1447It will have a 14 byte ethernet header and 20 byte IP header
1448followed by the TCP header (with the TCP checksum field 16 bytes
1449into that header). csum_start will be 14+20 = 34 (the TCP
1450checksum includes the header), and csum_offset will be 16. The
1451value in the TCP checksum field should be initialized to the sum
1452of the TCP pseudo header, so that replacing it by the ones'
1453complement checksum of the TCP header and body will give the
1454correct result.
1455]
1456
1457  <enu:If-the-driver>If the driver negotiated
1458  VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires
1459  TCP segmentation or UDP fragmentation, then the “gso_type”
1460  field is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6 or UDP.
1461  (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). In this
1462  case, packets larger than 1514 bytes can be transmitted: the
1463  metadata indicates how to replicate the packet header to cut it
1464  into smaller packets. The other gso fields are set:
1465
1466  hdr_len is a hint to the device as to how much of the header
1467    needs to be kept to copy into each packet, usually set to the
1468    length of the headers, including the transport header.[footnote:
1469Due to various bugs in implementations, this field is not useful
1470as a guarantee of the transport header size.
1471]
1472
1473  gso_size is the maximum size of each packet beyond that header
1474    (ie. MSS).
1475
1476  If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the
1477    VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as well,
1478    indicating that the TCP packet has the ECN bit set.[footnote:
1479This case is not handled by some older hardware, so is called out
1480specifically in the protocol.
1481]
1482
1483  If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,
1484  the num_buffers field is set to zero.
1485
1486  The header and packet are added as one output buffer to the
1487  transmitq, and the device is notified of the new entry (see [sub:Notifying-The-Device]
1488  ).[footnote:
1489Note that the header will be two bytes longer for the
1490VIRTIO_NET_F_MRG_RXBUF case.
1491]
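
The transmit steps above can be illustrated for the TCP/IPv4 case
used in the checksum footnote. This sketch assumes that
VIRTIO_NET_F_CSUM and VIRTIO_NET_F_HOST_TSO4 were negotiated, that
VIRTIO_NET_F_MRG_RXBUF was not (so there is no num_buffers
field), and that the IP and TCP headers carry no options. The
helper tcp4_tx_hdr() and the *_HLEN macros are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1
#define VIRTIO_NET_HDR_GSO_NONE     0
#define VIRTIO_NET_HDR_GSO_TCPV4    1

struct virtio_net_hdr {
        uint8_t  flags;
        uint8_t  gso_type;
        uint16_t hdr_len;
        uint16_t gso_size;
        uint16_t csum_start;
        uint16_t csum_offset;
};

#define ETH_HLEN        14u
#define IPV4_HLEN       20u  /* assumes no IP options */
#define TCP_HLEN        20u  /* assumes no TCP options */
#define TCP_CSUM_FIELD  16u  /* checksum field offset in the TCP header */

/* Header for a partially checksummed TCP/IPv4 packet. A non-zero
 * mss additionally requests segmentation by the device. */
static struct virtio_net_hdr tcp4_tx_hdr(uint16_t mss)
{
        struct virtio_net_hdr h;

        memset(&h, 0, sizeof(h));
        h.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
        h.csum_start = ETH_HLEN + IPV4_HLEN;    /* 34 */
        h.csum_offset = TCP_CSUM_FIELD;         /* 16 */
        h.gso_type = VIRTIO_NET_HDR_GSO_NONE;
        if (mss != 0) {
                h.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
                h.hdr_len = ETH_HLEN + IPV4_HLEN + TCP_HLEN;
                h.gso_size = mss;
        }
        return h;
}
```

The resulting 10 byte header is what precedes the packet in the
output buffer added to the transmitq.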
1492
1493  Packet Transmission Interrupt
1494
1495Often a driver will suppress transmission interrupts using the
1496VRING_AVAIL_F_NO_INTERRUPT flag (see [sub:Receiving-Used-Buffers]
1497) and check for used packets in the transmit path of following
1498packets. However, it will still receive interrupts if the
1499VIRTIO_F_NOTIFY_ON_EMPTY feature is negotiated, indicating that
1500the transmission queue is completely emptied.
1501
The normal behavior in this interrupt handler is to retrieve any
new descriptors from the used ring and free the corresponding
headers and packets.
1505
1506  Setting Up Receive Buffers
1507
1508It is generally a good idea to keep the receive virtqueue as
1509fully populated as possible: if it runs out, network performance
1510will suffer.
1511
1512If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or
1513VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to
1514accept packets of up to 65550 bytes long (the maximum size of a
1515TCP or UDP packet, plus the 14 byte ethernet header), otherwise
15161514 bytes. So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every
1517buffer in the receive queue needs to be at least this length [footnote:
1518Obviously each one can be split across multiple descriptor
1519elements.
1520].
1521
1522If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at
1523least the size of the struct virtio_net_hdr.
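
The sizing rule can be written down as a small helper. This is a
sketch under assumptions: min_rx_buffer_len() and the macros are
hypothetical names, and adding the 10 byte header on top of the
packet size in the non-mergeable case is this example's
interpretation of “this length”.

```c
#include <stdbool.h>
#include <stddef.h>

#define NET_HDR_LEN      10u     /* struct virtio_net_hdr, no num_buffers */
#define NET_HDR_MRG_LEN  12u     /* with the num_buffers field */
#define MAX_FRAME        1514u   /* ordinary ethernet frame */
#define MAX_GSO_FRAME    65550u  /* largest TCP or UDP packet plus header */

/* Smallest acceptable receive buffer for the negotiated features. */
static size_t min_rx_buffer_len(bool guest_gso, bool mrg_rxbuf)
{
        if (mrg_rxbuf)
                return NET_HDR_MRG_LEN; /* packets may span several buffers */
        return NET_HDR_LEN + (guest_gso ? MAX_GSO_FRAME : MAX_FRAME);
}
```

Without any offloads this gives 1524 bytes, matching the single
buffer example in “Message Framing”.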
1524
1525  Packet Receive Interrupt
1526
1527When a packet is copied into a buffer in the receiveq, the
1528optimal path is to disable further interrupts for the receiveq
1529(see [sub:Receiving-Used-Buffers]) and process packets until no
1530more are found, then re-enable them.
1531
Processing a packet involves:
1533
1534  If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,
1535  then the “num_buffers” field indicates how many descriptors
1536  this packet is spread over (including this one). This allows
1537  receipt of large packets without having to allocate large
1538  buffers. In this case, there will be at least “num_buffers” in
1539  the used ring, and they should be chained together to form a
1540  single packet. The other buffers will not begin with a struct
1541  virtio_net_hdr.
1542
1543  If the VIRTIO_NET_F_MRG_RXBUF feature was not negotiated, or
1544  the “num_buffers” field is one, then the entire packet will be
1545  contained within this buffer, immediately following the struct
1546  virtio_net_hdr.
1547
  If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the
  VIRTIO_NET_HDR_F_NEEDS_CSUM bit in the “flags” field may be
  set: if so, the checksum on the packet is incomplete and the
  “csum_start” and “csum_offset” fields indicate how to calculate
  it (see [ite:csum_start-is-set]).
1553
1554  If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were
1555  negotiated, then the “gso_type” may be something other than
1556  VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the
1557  desired MSS (see [enu:If-the-driver]).
1558
1559  Control Virtqueue
1560
The driver uses the control virtqueue (if VIRTIO_NET_F_CTRL_VQ is
1562negotiated) to send commands to manipulate various features of
1563the device which would not easily map into the configuration
1564space.
1565
1566All commands are of the following form:
1567
1568struct virtio_net_ctrl {
1569
1570        u8 class;
1571
1572        u8 command;
1573
1574        u8 command-specific-data[];
1575
1576        u8 ack;
1577
1578};
1579
1580
1581
1582/* ack values */
1583
1584#define VIRTIO_NET_OK     0
1585
1586#define VIRTIO_NET_ERR    1
1587
The class, command and command-specific-data are set by the
driver, and the device sets the ack byte. There is little the
driver can do except issue a diagnostic if the ack byte is not
VIRTIO_NET_OK.
1592
1593  Packet Receive Filtering
1594
1595If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can
1596send control commands for promiscuous mode, multicast receiving,
1597and filtering of MAC addresses.
1598
1599Note that in general, these commands are best-effort: unwanted
1600packets may still arrive.
1601
1602  Setting Promiscuous Mode
1603
1604#define VIRTIO_NET_CTRL_RX    0
1605
1606 #define VIRTIO_NET_CTRL_RX_PROMISC      0
1607
1608 #define VIRTIO_NET_CTRL_RX_ALLMULTI     1
1609
1610The class VIRTIO_NET_CTRL_RX has two commands:
1611VIRTIO_NET_CTRL_RX_PROMISC turns promiscuous mode on and off, and
1612VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and
1613off. The command-specific-data is one byte containing 0 (off) or
16141 (on).
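
A sketch of how such a command could be laid out in memory before
it is added to the controlq. The helper build_rx_cmd() is
hypothetical, and pre-setting the ack byte to VIRTIO_NET_ERR is a
choice of this example so that a command the device never
answered is not mistaken for success.

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_CTRL_RX           0
#define VIRTIO_NET_CTRL_RX_PROMISC   0
#define VIRTIO_NET_CTRL_RX_ALLMULTI  1

#define VIRTIO_NET_OK   0
#define VIRTIO_NET_ERR  1

/* Lay out a VIRTIO_NET_CTRL_RX command in `buf` (at least 4 bytes):
 * class, command, one data byte, then the ack byte that the device
 * overwrites. Returns the offset of the ack byte. */
static size_t build_rx_cmd(uint8_t *buf, uint8_t command, int on)
{
        buf[0] = VIRTIO_NET_CTRL_RX;
        buf[1] = command;
        buf[2] = on ? 1 : 0;
        buf[3] = VIRTIO_NET_ERR;        /* until the device says otherwise */
        return 3;
}
```

Since the device writes only the ack byte, a driver would
presumably place the first three bytes in read-only descriptors
and the ack byte in a write-only one.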
1615
1616  Setting MAC Address Filtering
1617
1618struct virtio_net_ctrl_mac {
1619
1620        u32 entries;
1621
1622        u8 macs[entries][ETH_ALEN];
1623
1624};
1625
1626
1627
1628#define VIRTIO_NET_CTRL_MAC    1
1629
1630 #define VIRTIO_NET_CTRL_MAC_TABLE_SET        0
1631
1632The device can filter incoming packets by any number of
1633destination MAC addresses.[footnote:
Since there are no guarantees, it can use a hash filter or
silently switch to allmulti or promiscuous mode if it is given
too many addresses.
1637] This table is set using the class VIRTIO_NET_CTRL_MAC and the
1638command VIRTIO_NET_CTRL_MAC_TABLE_SET. The command-specific-data
1639is two variable length tables of 6-byte MAC addresses. The first
1640table contains unicast addresses, and the second contains
1641multicast addresses.
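
The two-table payload can be serialized as sketched below. The
helpers put_mac_table() and build_mac_tables() are hypothetical,
and writing the entry counts in the guest's native byte order is
an assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Write one struct virtio_net_ctrl_mac (u32 count, then the
 * addresses) at `out`; returns the number of bytes written. */
static size_t put_mac_table(uint8_t *out, const uint8_t (*macs)[ETH_ALEN],
                            uint32_t entries)
{
        memcpy(out, &entries, sizeof(entries));
        if (entries != 0)
                memcpy(out + sizeof(entries), macs,
                       (size_t)entries * ETH_ALEN);
        return sizeof(entries) + (size_t)entries * ETH_ALEN;
}

/* Command-specific-data for VIRTIO_NET_CTRL_MAC_TABLE_SET: the
 * unicast table followed directly by the multicast table. */
static size_t build_mac_tables(uint8_t *out,
                               const uint8_t (*uni)[ETH_ALEN], uint32_t nuni,
                               const uint8_t (*multi)[ETH_ALEN],
                               uint32_t nmulti)
{
        size_t n = put_mac_table(out, uni, nuni);

        return n + put_mac_table(out + n, multi, nmulti);
}
```

The caller must supply an output buffer of at least
8 + 6 * (nuni + nmulti) bytes.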
1642
1643  VLAN Filtering
1644
If the driver negotiates the VIRTIO_NET_F_CTRL_VLAN feature, it
1646can control a VLAN filter table in the device.
1647
1648#define VIRTIO_NET_CTRL_VLAN       2
1649
1650 #define VIRTIO_NET_CTRL_VLAN_ADD             0
1651
1652 #define VIRTIO_NET_CTRL_VLAN_DEL             1
1653
Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL
commands take a 16-bit VLAN id as the command-specific-data.
1656
1657  Gratuitous Packet Sending
1658
If the driver negotiates the VIRTIO_NET_F_GUEST_ANNOUNCE feature
(which depends on VIRTIO_NET_F_CTRL_VQ), the device can ask the
guest to send gratuitous packets; this is usually done after the
guest has been physically migrated, and needs to announce its
presence on the new network links. (As the hypervisor does not
know the guest's network configuration (eg. tagged vlan), it is
simplest to prod the guest in this way.)
1666
1667#define VIRTIO_NET_CTRL_ANNOUNCE       3
1668
1669 #define VIRTIO_NET_CTRL_ANNOUNCE_ACK             0
1670
The Guest needs to check the VIRTIO_NET_S_ANNOUNCE bit in the
status field when it notices a change of device configuration.
The command VIRTIO_NET_CTRL_ANNOUNCE_ACK is used to indicate that
the driver has received the notification; the device clears the
VIRTIO_NET_S_ANNOUNCE bit in the status field after it receives
this command.
1677
1678Processing this notification involves:
1679
  Sending the gratuitous packets, or marking that there are
  pending gratuitous packets and letting a deferred routine
  send them.

  Sending the VIRTIO_NET_CTRL_ANNOUNCE_ACK command through the
  control vq.
1688
1689Appendix D: Block Device
1690
1691The virtio block device is a simple virtual block device (ie.
1692disk). Read and write requests (and other exotic requests) are
1693placed in the queue, and serviced (probably out of order) by the
1694device except where noted.
1695
1696  Configuration
1697
1698  Subsystem Device ID 2
1699
1700  Virtqueues 0:requestq.
1701
1702  Feature bits
1703
1704  VIRTIO_BLK_F_BARRIER (0) Host supports request barriers.
1705
1706  VIRTIO_BLK_F_SIZE_MAX (1) Maximum size of any single segment is
1707    in “size_max”.
1708
1709  VIRTIO_BLK_F_SEG_MAX (2) Maximum number of segments in a
1710    request is in “seg_max”.
1711
  VIRTIO_BLK_F_GEOMETRY (4) Disk-style geometry specified in
    “geometry”.
1714
1715  VIRTIO_BLK_F_RO (5) Device is read-only.
1716
1717  VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.
1718
1719  VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.
1720
1721  VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
1722
  Device configuration layout The capacity of the device
  (expressed in 512-byte sectors) is always present. The
  availability of the others depends on various feature bits
  as indicated above.

struct virtio_blk_config {
        u64 capacity;
        u32 size_max;
        u32 seg_max;
        struct virtio_blk_geometry {
                u16 cylinders;
                u8 heads;
                u8 sectors;
        } geometry;
        u32 blk_size;
};
1749
1750  Device Initialization
1751
  The device size should be read from the “capacity”
  configuration field. No requests should be submitted which go
  beyond this limit.

  If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, the
  blk_size field can be read to determine the optimal sector size
  for the driver to use. This does not affect the units used in
  the protocol (always 512 bytes), but awareness of the correct
  value can affect performance.
1761
1762  If the VIRTIO_BLK_F_RO feature is set by the device, any write
1763  requests will fail.
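
  The capacity limit can be enforced with a check along these
  lines. The helper blk_req_in_range() is hypothetical; it
  assumes request lengths are whole multiples of the 512-byte
  protocol unit and treats a zero-length request as in range.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512u

/* True if a request of `nbytes` starting at `sector` stays within a
 * device of `capacity` 512-byte sectors. */
static bool blk_req_in_range(uint64_t capacity, uint64_t sector,
                             uint64_t nbytes)
{
        uint64_t nsect;

        if (nbytes % SECTOR_SIZE != 0)
                return false;
        nsect = nbytes / SECTOR_SIZE;
        /* Compare against the remaining space so that sector + nsect
         * cannot overflow. */
        return sector <= capacity && nsect <= capacity - sector;
}
```

  A driver would reject or split requests for which this returns
  false before queueing them.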
1764
1765  Device Operation
1766
1767The driver queues requests to the virtqueue, and they are used by
1768the device (not necessarily in order). Each request is of form:
1769
1770struct virtio_blk_req {
1771
1772
1773
1774        u32 type;
1775
1776        u32 ioprio;
1777
1778        u64 sector;
1779
1780        char data[][512];
1781
1782        u8 status;
1783
1784};
1785
If the device has VIRTIO_BLK_F_SCSI feature, it can also support
scsi packet command requests. Each of these requests is of form:

struct virtio_scsi_pc_req {
        u32 type;
        u32 ioprio;
        u64 sector;
        char cmd[];
        char data[][512];
#define SCSI_SENSE_BUFFERSIZE   96
        u8 sense[SCSI_SENSE_BUFFERSIZE];
        u32 errors;
        u32 data_len;
        u32 sense_len;
        u32 residual;
        u8 status;
};
1814
1815The type of the request is either a read (VIRTIO_BLK_T_IN), a
1816write (VIRTIO_BLK_T_OUT), a scsi packet command
1817(VIRTIO_BLK_T_SCSI_CMD or VIRTIO_BLK_T_SCSI_CMD_OUT[footnote:
1818the SCSI_CMD and SCSI_CMD_OUT types are equivalent, the device
1819does not distinguish between them
1820]) or a flush (VIRTIO_BLK_T_FLUSH or VIRTIO_BLK_T_FLUSH_OUT[footnote:
1821the FLUSH and FLUSH_OUT types are equivalent, the device does not
1822distinguish between them
]). If the device has the VIRTIO_BLK_F_BARRIER feature, the high bit
(VIRTIO_BLK_T_BARRIER) indicates that this request acts as a
barrier: all preceding requests must be complete before
this one, and all following requests must not be started until
this one is complete. Note that a barrier does not flush caches in
the underlying backend device in the host, and thus does not serve as
a data consistency guarantee. The driver must use a FLUSH request to
flush the host cache.
1831
1832#define VIRTIO_BLK_T_IN           0
1833
1834#define VIRTIO_BLK_T_OUT          1
1835
1836#define VIRTIO_BLK_T_SCSI_CMD     2
1837
1838#define VIRTIO_BLK_T_SCSI_CMD_OUT 3
1839
1840#define VIRTIO_BLK_T_FLUSH        4
1841
1842#define VIRTIO_BLK_T_FLUSH_OUT    5
1843
1844#define VIRTIO_BLK_T_BARRIER     0x80000000
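
As an illustration of how the type field combines an operation with
the barrier bit, the following sketch decodes it. The helper names
(blk_req_op, blk_req_is_barrier, blk_req_is_flush) are hypothetical;
only the constants come from this specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_BLK_T_IN           0
#define VIRTIO_BLK_T_OUT          1
#define VIRTIO_BLK_T_SCSI_CMD     2
#define VIRTIO_BLK_T_SCSI_CMD_OUT 3
#define VIRTIO_BLK_T_FLUSH        4
#define VIRTIO_BLK_T_FLUSH_OUT    5
#define VIRTIO_BLK_T_BARRIER      0x80000000

/* Hypothetical helper: the operation with the barrier bit masked off. */
uint32_t blk_req_op(uint32_t type)
{
    return type & ~(uint32_t)VIRTIO_BLK_T_BARRIER;
}

/* True if the high bit marks this request as a barrier. */
bool blk_req_is_barrier(uint32_t type)
{
    return (type & (uint32_t)VIRTIO_BLK_T_BARRIER) != 0;
}

/* FLUSH and FLUSH_OUT are equivalent, so both count as a flush. */
bool blk_req_is_flush(uint32_t type)
{
    uint32_t op = blk_req_op(type);

    return op == VIRTIO_BLK_T_FLUSH || op == VIRTIO_BLK_T_FLUSH_OUT;
}
```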
1845
1846The ioprio field is a hint about the relative priorities of
1847requests to the device: higher numbers indicate more important
1848requests.
1849
1850The sector number indicates the offset (multiplied by 512) where
1851the read or write is to occur. This field is unused and set to 0
1852for scsi packet commands and for flush commands.
1853
1854The cmd field is only present for scsi packet command requests,
1855and indicates the command to perform. This field must reside in a
1856single, separate read-only buffer; command length can be derived
1857from the length of this buffer.
1858
1859Note that these first three (four for scsi packet commands)
1860fields are always read-only: the data field is either read-only
1861or write-only, depending on the request. The size of the read or
1862write can be derived from the total size of the request buffers.
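
For a plain read or write, deriving the data size from the total size
of the request buffers could look like the sketch below. It assumes
the layout of struct virtio_blk_req given earlier (a 16-byte header
and a 1-byte status) and does not cover scsi packet commands; the name
blk_data_len is hypothetical.

```c
#include <stdint.h>

#define BLK_HDR_LEN    16u /* type (4) + ioprio (4) + sector (8) */
#define BLK_STATUS_LEN 1u  /* trailing status byte */

/*
 * Hypothetical helper: returns the number of data bytes in a request
 * whose buffers total total_len bytes, or -1 if the total cannot hold
 * the header, a whole number of 512-byte sectors and the status byte.
 */
int64_t blk_data_len(uint32_t total_len)
{
    uint32_t data_len;

    if (total_len < BLK_HDR_LEN + BLK_STATUS_LEN)
        return -1;
    data_len = total_len - BLK_HDR_LEN - BLK_STATUS_LEN;
    if (data_len % 512u != 0)
        return -1;
    return (int64_t)data_len;
}
```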
1863
1864The sense field is only present for scsi packet command requests,
1865and indicates the buffer for scsi sense data.
1866
The data_len field is only present for scsi packet command
requests. This field is deprecated and should be ignored by the
driver. Historically, devices copied the data length there.
1870
1871The sense_len field is only present for scsi packet command
1872requests and indicates the number of bytes actually written to
1873the sense buffer.
1874
1875The residual field is only present for scsi packet command
1876requests and indicates the residual size, calculated as data
1877length - number of bytes actually transferred.
1878
The final status byte is written by the device: either
VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for host or guest
error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:

#define VIRTIO_BLK_S_OK        0
1882
1883#define VIRTIO_BLK_S_IOERR     1
1884
1885#define VIRTIO_BLK_S_UNSUPP    2
1886
Historically, devices assumed that the fields type, ioprio and
sector reside in a single, separate read-only buffer; the sense
field in a separate write-only buffer of size 96 bytes, by itself;
the fields errors, data_len, sense_len and residual in a single,
separate write-only buffer; and the status field in a separate
write-only buffer of size 1 byte, by itself.
1895
1896Appendix E: Console Device
1897
1898The virtio console device is a simple device for data input and
1899output. A device may have one or more ports. Each port has a pair
1900of input and output virtqueues. Moreover, a device has a pair of
1901control IO virtqueues. The control virtqueues are used to
1902communicate information between the device and the driver about
1903ports being opened and closed on either side of the connection,
1904indication from the host about whether a particular port is a
1905console port, adding new ports, port hot-plug/unplug, etc., and
1906indication from the guest about whether a port or a device was
successfully added, port open/close, etc. For data IO, one or
1908more empty buffers are placed in the receive queue for incoming
1909data and outgoing characters are placed in the transmit queue.
1910
1911  Configuration
1912
1913  Subsystem Device ID 3
1914
1915  Virtqueues 0:receiveq(port0). 1:transmitq(port0), 2:control
1916  receiveq[footnote:
1917Ports 2 onwards only if VIRTIO_CONSOLE_F_MULTIPORT is set
1918], 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1),
1919  ...
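
The virtqueue numbering above can be expressed as a small mapping from
port number to queue index. This sketch assumes the pairwise pattern
continues beyond port1 as the “...” suggests; the function names are
hypothetical.

```c
#include <stdint.h>

/*
 * Port 0 uses virtqueues 0 and 1. Virtqueues 2 and 3 are the control
 * queues, so port n (n >= 1) uses 2n+2 for receive and 2n+3 for
 * transmit.
 */
uint32_t console_port_rx_vq(uint32_t port)
{
    return port == 0 ? 0 : 2 * (port + 1);
}

/* The transmitq always follows the receiveq of the same port. */
uint32_t console_port_tx_vq(uint32_t port)
{
    return console_port_rx_vq(port) + 1;
}
```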
1920
1921  Feature bits
1922
1923  VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields
1924    are valid.
1925
  VIRTIO_CONSOLE_F_MULTIPORT (1) Device has support for multiple
    ports; configuration field max_nr_ports is valid and control
    virtqueues will be used.
1929
  Device configuration layout The size of the console is supplied
  in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature
  is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature
  is set, the maximum number of ports supported by the device can
  be fetched.

struct virtio_console_config {
1935
1936        u16 cols;
1937
1938        u16 rows;
1939
1940
1941
1942        u32 max_nr_ports;
1943
1944};
1945
1946  Device Initialization
1947
1948  If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver
1949  can read the console dimensions from the configuration fields.
1950
1951  If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, the
1952  driver can spawn multiple ports, not all of which may be
1953  attached to a console. Some could be generic ports. In this
1954  case, the control virtqueues are enabled and according to the
1955  max_nr_ports configuration-space value, the appropriate number
1956  of virtqueues are created. A control message indicating the
1957  driver is ready is sent to the host. The host can then send
1958  control messages for adding new ports to the device. After
1959  creating and initializing each port, a
1960  VIRTIO_CONSOLE_PORT_READY control message is sent to the host
1961  for that port so the host can let us know of any additional
1962  configuration options set for that port.
1963
1964  The receiveq for each port is populated with one or more
1965  receive buffers.
1966
1967  Device Operation
1968
1969  For output, a buffer containing the characters is placed in the
1970  port's transmitq.[footnote:
1971Because this is high importance and low bandwidth, the current
1972Linux implementation polls for the buffer to be used, rather than
1973waiting for an interrupt, simplifying the implementation
1974significantly. However, for generic serial ports with the
1975O_NONBLOCK flag set, the polling limitation is relaxed and the
1976consumed buffers are freed upon the next write or poll call or
1977when a port is closed or hot-unplugged.
1978]
1979
  When a buffer is used in the receiveq (signalled by an
  interrupt), its contents are the input to the port associated
  with the virtqueue for which the notification was received.
1983
1984  If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a
1985  configuration change interrupt may occur. The updated size can
1986  be read from the configuration fields.
1987
1988  If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT
1989  feature, active ports are announced by the host using the
1990  VIRTIO_CONSOLE_PORT_ADD control message. The same message is
1991  used for port hot-plug as well.
1992
1993  If the host specified a port `name', a sysfs attribute is
1994  created with the name filled in, so that udev rules can be
1995  written that can create a symlink from the port's name to the
1996  char device for port discovery by applications in the guest.
1997
  Changes to ports' state are effected by control messages.
  Appropriate action is taken on the port indicated in the
  control message. The layout of the control buffer structure and
  the associated events are:

struct virtio_console_control {
2002
2003        uint32_t id;    /* Port number */
2004
2005        uint16_t event; /* The kind of control event */
2006
2007        uint16_t value; /* Extra information for the event */
2008
2009};
2010
2011
2012
2013/* Some events for the internal messages (control packets) */
2014
2015
2016
2017#define VIRTIO_CONSOLE_DEVICE_READY     0
2018
2019#define VIRTIO_CONSOLE_PORT_ADD         1
2020
2021#define VIRTIO_CONSOLE_PORT_REMOVE      2
2022
2023#define VIRTIO_CONSOLE_PORT_READY       3
2024
2025#define VIRTIO_CONSOLE_CONSOLE_PORT     4
2026
2027#define VIRTIO_CONSOLE_RESIZE           5
2028
2029#define VIRTIO_CONSOLE_PORT_OPEN        6
2030
2031#define VIRTIO_CONSOLE_PORT_NAME        7
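
As a sketch of how a driver might build one of these control messages
before placing it in the control transmitq, consider the following.
The helper console_ctrl_msg is hypothetical, and the meaning of the
value argument for any given event is not defined in this text, so it
is passed through unchanged.

```c
#include <stdint.h>

#define VIRTIO_CONSOLE_PORT_READY 3

struct virtio_console_control {
    uint32_t id;    /* Port number */
    uint16_t event; /* The kind of control event */
    uint16_t value; /* Extra information for the event */
};

/* Hypothetical helper: fill in a control message for one port. */
struct virtio_console_control console_ctrl_msg(uint32_t port,
                                               uint16_t event,
                                               uint16_t value)
{
    struct virtio_console_control msg = {
        .id = port,
        .event = event,
        .value = value,
    };

    return msg;
}
```

The structure has no padding on common ABIs, so the 8-byte object can
be used directly as the buffer contents.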
2032
2033Appendix F: Entropy Device
2034
2035The virtio entropy device supplies high-quality randomness for
2036guest use.
2037
2038  Configuration
2039
2040  Subsystem Device ID 4
2041
2042  Virtqueues 0:requestq.
2043
2044  Feature bits None currently defined
2045
2046  Device configuration layout None currently defined.
2047
2048  Device Initialization
2049
  The virtqueue is initialized.
2051
2052  Device Operation
2053
When the driver requires random bytes, it places the descriptor
of one or more buffers in the queue. Each buffer will be completely
filled with random data by the device.
2057
2058Appendix G: Memory Balloon Device
2059
2060The virtio memory balloon device is a primitive device for
2061managing guest memory: the device asks for a certain amount of
2062memory, and the guest supplies it (or withdraws it, if the device
2063has more than it asks for). This allows the guest to adapt to
2064changes in allowance of underlying physical memory. If the
2065feature is negotiated, the device can also be used to communicate
2066guest memory statistics to the host.
2067
2068  Configuration
2069
2070  Subsystem Device ID 5
2071
  Virtqueues 0:inflateq. 1:deflateq. 2:statsq.[footnote:
Only if VIRTIO_BALLOON_F_STATS_VQ set
]
2075
2076  Feature bits
2077
2078  VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before
2079    pages from the balloon are used.
2080
2081  VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest
2082    memory statistics is present.
2083
  Device configuration layout Both fields of this configuration
  are always available. Note that they are little endian, despite
  the convention that device fields are guest endian:

struct virtio_balloon_config {
2087
2088        u32 num_pages;
2089
2090        u32 actual;
2091
2092};
2093
2094  Device Initialization
2095
2096  The inflate and deflate virtqueues are identified.
2097
2098  If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated:
2099
2100  Identify the stats virtqueue.
2101
2102  Add one empty buffer to the stats virtqueue and notify the
2103    host.
2104
2105Device operation begins immediately.
2106
2107  Device Operation
2108
2109  Memory Ballooning The device is driven by the receipt of a
2110  configuration change interrupt.
2111
2112  The “num_pages” configuration field is examined. If this is
2113  greater than the “actual” number of pages, memory must be given
2114  to the balloon. If it is less than the “actual” number of
2115  pages, memory may be taken back from the balloon for general
2116  use.
2117
  To supply memory to the balloon (a.k.a. inflate):
2119
2120  The driver constructs an array of addresses of unused memory
2121    pages. These addresses are divided by 4096[footnote:
2122This is historical, and independent of the guest page size
2123] and the descriptor describing the resulting 32-bit array is
2124    added to the inflateq.
2125
  To remove memory from the balloon (a.k.a. deflate):
2127
2128  The driver constructs an array of addresses of memory pages it
2129    has previously given to the balloon, as described above. This
2130    descriptor is added to the deflateq.
2131
2132  If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is set, the
2133    guest may not use these requested pages until that descriptor
2134    in the deflateq has been used by the device.
2135
  Otherwise, the guest may begin to re-use pages previously given
    to the balloon before the device has acknowledged their
    withdrawal. [footnote:
2139In this case, deflation advice is merely a courtesy
2140]
2141
2142  In either case, once the device has completed the inflation or
2143  deflation, the “actual” field of the configuration should be
2144  updated to reflect the new number of pages in the balloon.[footnote:
2145As updates to configuration space are not atomic, this field
2146isn't particularly reliable, but can be used to diagnose buggy
2147guests.
2148]
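
The ballooning steps above can be sketched as two helpers: one that
compares “num_pages” with “actual”, and one that converts page
addresses into the 32-bit array placed on the inflateq or deflateq.
This is a minimal sketch under the assumption that addresses are
4096-byte aligned; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define BALLOON_PFN_SHIFT 12 /* addresses are divided by 4096 */

/* Positive: pages to give to the balloon; negative: pages that may
 * be taken back. */
int64_t balloon_delta(uint32_t num_pages, uint32_t actual)
{
    return (int64_t)num_pages - (int64_t)actual;
}

/*
 * Hypothetical helper: converts n page addresses into 32-bit entries.
 * Returns n on success, or 0 if an address is not 4096-byte aligned
 * or does not fit in 32 bits after the division.
 */
size_t balloon_fill_pfns(const uint64_t *addrs, size_t n, uint32_t *pfns)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t pfn = addrs[i] >> BALLOON_PFN_SHIFT;

        if ((addrs[i] & 0xfffu) != 0 || pfn > UINT32_MAX)
            return 0;
        pfns[i] = (uint32_t)pfn;
    }
    return n;
}
```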
2149
2150  Memory Statistics
2151
2152The stats virtqueue is atypical because communication is driven
2153by the device (not the driver). The channel becomes active at
2154driver initialization time when the driver adds an empty buffer
2155and notifies the device. A request for memory statistics proceeds
2156as follows:
2157
2158  The device pushes the buffer onto the used ring and sends an
2159  interrupt.
2160
2161  The driver pops the used buffer and discards it.
2162
2163  The driver collects memory statistics and writes them into a
2164  new buffer.
2165
2166  The driver adds the buffer to the virtqueue and notifies the
2167  device.
2168
2169  The device pops the buffer (retaining it to initiate a
2170  subsequent request) and consumes the statistics.
2171
2172  Memory Statistics Format Each statistic consists of a 16 bit
2173  tag and a 64 bit value. Both quantities are represented in the
2174  native endian of the guest. All statistics are optional and the
2175  driver may choose which ones to supply. To guarantee backwards
2176  compatibility, unsupported statistics should be omitted.
2177
2178  struct virtio_balloon_stat {
2179
2180#define VIRTIO_BALLOON_S_SWAP_IN  0
2181
2182#define VIRTIO_BALLOON_S_SWAP_OUT 1
2183
2184#define VIRTIO_BALLOON_S_MAJFLT   2
2185
2186#define VIRTIO_BALLOON_S_MINFLT   3
2187
2188#define VIRTIO_BALLOON_S_MEMFREE  4
2189
2190#define VIRTIO_BALLOON_S_MEMTOT   5
2191
2192        u16 tag;
2193
2194        u64 val;
2195
2196} __attribute__((packed));
2197
2198  Tags
2199
2200  VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been
2201  swapped in (in bytes).
2202
2203  VIRTIO_BALLOON_S_SWAP_OUT The amount of memory that has been
2204  swapped out to disk (in bytes).
2205
2206  VIRTIO_BALLOON_S_MAJFLT The number of major page faults that
2207  have occurred.
2208
2209  VIRTIO_BALLOON_S_MINFLT The number of minor page faults that
2210  have occurred.
2211
2212  VIRTIO_BALLOON_S_MEMFREE The amount of memory not being used
2213  for any purpose (in bytes).
2214
2215  VIRTIO_BALLOON_S_MEMTOT The total amount of memory available
2216  (in bytes).
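
A driver filling the statistics buffer might proceed as in the sketch
below, appending only the tags it supports. The helper
balloon_stat_add is hypothetical, and the packed attribute syntax
assumes a GCC-compatible compiler.

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_BALLOON_S_MEMFREE 4
#define VIRTIO_BALLOON_S_MEMTOT  5

struct virtio_balloon_stat {
    uint16_t tag;
    uint64_t val;
} __attribute__((packed)); /* 10 bytes per entry, no padding */

/*
 * Hypothetical helper: appends one statistic if there is room in a
 * buffer holding up to cap entries. Returns the new entry count.
 */
size_t balloon_stat_add(struct virtio_balloon_stat *buf, size_t count,
                        size_t cap, uint16_t tag, uint64_t val)
{
    if (count >= cap)
        return count; /* buffer full: the statistic is omitted */
    buf[count].tag = tag;
    buf[count].val = val;
    return count + 1;
}
```

The number of bytes to queue on the stats virtqueue is then the entry
count multiplied by sizeof(struct virtio_balloon_stat).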
2217
2218Appendix H: Rpmsg: Remote Processor Messaging
2219
2220Virtio rpmsg devices represent remote processors on the system
2221which run in asymmetric multi-processing (AMP) configuration, and
2222which are usually used to offload cpu-intensive tasks from the
2223main application processor (a typical SoC methodology).
2224
2225Virtio is being used to communicate with those remote processors;
2226empty buffers are placed in one virtqueue for receiving messages,
2227and non-empty buffers, containing outbound messages, are enqueued
2228in a second virtqueue for transmission.
2229
2230Numerous communication channels can be multiplexed over those two
2231virtqueues, so different entities, running on the application and
2232remote processor, can directly communicate in a point-to-point
2233fashion.
2234
2235  Configuration
2236
2237  Subsystem Device ID 7
2238
2239  Virtqueues 0:receiveq. 1:transmitq.
2240
2241  Feature bits
2242
  VIRTIO_RPMSG_F_NS (0) Device sends (and is capable of receiving)
    name service messages announcing the creation (or
    destruction) of a channel:

/**

 * struct rpmsg_ns_msg - dynamic name service announcement message

 * @name: name of remote service that is published

 * @addr: address of remote service that is published

 * @flags: indicates whether service is created or destroyed

 *

 * This message is sent across to publish a new service (or announce

 * about its removal). When we receive these messages, an appropriate

 * rpmsg channel (i.e. device) is created/destroyed.

 */

struct rpmsg_ns_msg {
2269
2270        char name[RPMSG_NAME_SIZE];
2271
2272        u32 addr;
2273
2274        u32 flags;
2275
2276} __packed;
2277
2278
2279
2280/**
2281
2282 * enum rpmsg_ns_flags - dynamic name service announcement flags
2283
2284 *
2285
2286 * @RPMSG_NS_CREATE: a new remote service was just created
2287
2288 * @RPMSG_NS_DESTROY: a remote service was just destroyed
2289
2290 */
2291
2292enum rpmsg_ns_flags {
2293
2294        RPMSG_NS_CREATE = 0,
2295
2296        RPMSG_NS_DESTROY = 1,
2297
2298};
2299
2300  Device configuration layout
2301
At this point none are currently defined.
2303
2304  Device Initialization
2305
2306  The initialization routine should identify the receive and
2307  transmission virtqueues.
2308
2309  The receive virtqueue should be filled with receive buffers.
2310
2311  Device Operation
2312
Messages are transmitted by placing them in the transmitq, and
buffers for inbound messages are placed in the receiveq. In any
case, messages are always preceded by the following header:

/**
2316
2317 * struct rpmsg_hdr - common header for all rpmsg messages
2318
2319 * @src: source address
2320
2321 * @dst: destination address
2322
2323 * @reserved: reserved for future use
2324
2325 * @len: length of payload (in bytes)
2326
2327 * @flags: message flags
2328
2329 * @data: @len bytes of message payload data
2330
2331 *
2332
 * Every message sent(/received) on the rpmsg bus begins with this header.
2335
2336 */
2337
2338struct rpmsg_hdr {
2339
2340        u32 src;
2341
2342        u32 dst;
2343
2344        u32 reserved;
2345
2346        u16 len;
2347
2348        u16 flags;
2349
2350        u8 data[0];
2351
2352} __packed;
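
To illustrate how a message is laid out behind this header, the sketch
below packs a payload into a transmit buffer. It assumes fields are
stored in the native byte order of the sender, which this text does
not specify, and it uses a C99 flexible array member in place of
data[0]. The function rpmsg_pack is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct rpmsg_hdr {
    uint32_t src;
    uint32_t dst;
    uint32_t reserved;
    uint16_t len;
    uint16_t flags;
    uint8_t data[]; /* payload follows the 16-byte header */
} __attribute__((packed));

/*
 * Hypothetical helper: writes the header followed by len payload
 * bytes into buf. Returns the total number of bytes written, or 0 if
 * buf_len is too small.
 */
size_t rpmsg_pack(void *buf, size_t buf_len, uint32_t src, uint32_t dst,
                  const void *payload, uint16_t len)
{
    struct rpmsg_hdr hdr = { .src = src, .dst = dst, .len = len };

    if (buf_len < sizeof hdr + len)
        return 0;
    memcpy(buf, &hdr, sizeof hdr);
    if (len > 0)
        memcpy((uint8_t *)buf + sizeof hdr, payload, len);
    return sizeof hdr + len;
}
```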
2353
2354Appendix I: SCSI Host Device
2355
2356The virtio SCSI host device groups together one or more virtual
2357logical units (such as disks), and allows communicating to them
2358using the SCSI protocol. An instance of the device represents a
2359SCSI host to which many targets and LUNs are attached.
2360
2361The virtio SCSI device services two kinds of requests:
2362
2363  command requests for a logical unit;
2364
2365  task management functions related to a logical unit, target or
2366  command.
2367
2368The device is also able to send out notifications about added and
2369removed logical units. Together, these capabilities provide a
2370SCSI transport protocol that uses virtqueues as the transfer
2371medium. In the transport protocol, the virtio driver acts as the
2372initiator, while the virtio SCSI host provides one or more
2373targets that receive and process the requests.
2374
2375  Configuration
2376
2377  Subsystem Device ID 8
2378
2379  Virtqueues 0:controlq; 1:eventq; 2..n:request queues.
2380
2381  Feature bits
2382
2383  VIRTIO_SCSI_F_INOUT (0) A single request can include both
2384    read-only and write-only data buffers.
2385
2386  VIRTIO_SCSI_F_HOTPLUG (1) The host should enable
2387    hot-plug/hot-unplug of new LUNs and targets on the SCSI bus.
2388
  Device configuration layout All fields of this configuration
  are always available. sense_size and cdb_size are writable by
  the guest.

struct virtio_scsi_config {
2392
2393    u32 num_queues;
2394
2395    u32 seg_max;
2396
2397    u32 max_sectors;
2398
2399    u32 cmd_per_lun;
2400
2401    u32 event_info_size;
2402
2403    u32 sense_size;
2404
2405    u32 cdb_size;
2406
2407    u16 max_channel;
2408
2409    u16 max_target;
2410
2411    u32 max_lun;
2412
2413};
2414
2415  num_queues is the total number of request virtqueues exposed by
2416    the device. The driver is free to use only one request queue,
2417    or it can use more to achieve better performance.
2418
2419  seg_max is the maximum number of segments that can be in a
2420    command. A bidirectional command can include seg_max input
2421    segments and seg_max output segments.
2422
2423  max_sectors is a hint to the guest about the maximum transfer
2424    size it should use.
2425
2426  cmd_per_lun is a hint to the guest about the maximum number of
2427    linked commands it should send to one LUN. The actual value
2428    to be used is the minimum of cmd_per_lun and the virtqueue
2429    size.
2430
  event_info_size is the maximum size that the device will fill
    for buffers that the driver places in the eventq. The driver
    should always provide buffers of at least this size. It is
    written by the device depending on the set of negotiated
    features.
2436
2437  sense_size is the maximum size of the sense data that the
2438    device will write. The default value is written by the device
2439    and will always be 96, but the driver can modify it. It is
2440    restored to the default when the device is reset.
2441
2442  cdb_size is the maximum size of the CDB that the driver will
2443    write. The default value is written by the device and will
2444    always be 32, but the driver can likewise modify it. It is
2445    restored to the default when the device is reset.
2446
  max_channel, max_target and max_lun can be used by the driver
    as hints to constrain scanning the logical units on the
    host.
2450
2451  Device Initialization
2452
2453The initialization routine should first of all discover the
2454device's virtqueues.
2455
If the driver uses the eventq, it should then place at least one
buffer in the eventq.
2458
2459The driver can immediately issue requests (for example, INQUIRY
2460or REPORT LUNS) or task management functions (for example, I_T
2461RESET).
2462
2463  Device Operation: request queues
2464
2465The driver queues requests to an arbitrary request queue, and
2466they are used by the device on that same queue. It is the
2467responsibility of the driver to ensure strict request ordering
2468for commands placed on different queues, because they will be
2469consumed with no order constraints.
2470
2471Requests have the following format:
2472
2473struct virtio_scsi_req_cmd {
2474
2475    // Read-only
2476
2477    u8 lun[8];
2478
2479    u64 id;
2480
2481    u8 task_attr;
2482
2483    u8 prio;
2484
2485    u8 crn;
2486
2487    char cdb[cdb_size];
2488
2489    char dataout[];
2490
2491    // Write-only part
2492
2493    u32 sense_len;
2494
2495    u32 residual;
2496
2497    u16 status_qualifier;
2498
2499    u8 status;
2500
2501    u8 response;
2502
2503    u8 sense[sense_size];
2504
2505    char datain[];
2506
2507};
2508
2509
2510
2511/* command-specific response values */
2512
2513#define VIRTIO_SCSI_S_OK                0
2514
2515#define VIRTIO_SCSI_S_OVERRUN           1
2516
2517#define VIRTIO_SCSI_S_ABORTED           2
2518
2519#define VIRTIO_SCSI_S_BAD_TARGET        3
2520
2521#define VIRTIO_SCSI_S_RESET             4
2522
2523#define VIRTIO_SCSI_S_BUSY              5
2524
2525#define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6
2526
2527#define VIRTIO_SCSI_S_TARGET_FAILURE    7
2528
2529#define VIRTIO_SCSI_S_NEXUS_FAILURE     8
2530
2531#define VIRTIO_SCSI_S_FAILURE           9
2532
2533
2534
2535/* task_attr */
2536
2537#define VIRTIO_SCSI_S_SIMPLE            0
2538
2539#define VIRTIO_SCSI_S_ORDERED           1
2540
2541#define VIRTIO_SCSI_S_HEAD              2
2542
2543#define VIRTIO_SCSI_S_ACA               3
2544
2545The lun field addresses a target and logical unit in the
2546virtio-scsi device's SCSI domain. The only supported format for
2547the LUN field is: first byte set to 1, second byte set to target,
2548third and fourth byte representing a single level LUN structure,
2549followed by four zero bytes. With this representation, a
2550virtio-scsi device can serve up to 256 targets and 16384 LUNs per
2551target.
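
The LUN format described above could be encoded as in the following
sketch. The text does not spell out the bit layout of the “single
level LUN structure”; this sketch assumes the SAM flat-space
addressing method (0x40 in the top two bits of the third byte,
followed by a 14-bit LUN), which matches the stated limit of 16384
LUNs per target. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: fills the 8-byte lun field for a given target
 * and logical unit. Returns false if lun exceeds 14 bits.
 */
bool virtio_scsi_encode_lun(uint8_t out[8], uint8_t target, uint16_t lun)
{
    if (lun >= 16384)
        return false;

    memset(out, 0, 8);                     /* last four bytes stay zero */
    out[0] = 1;                            /* first byte set to 1 */
    out[1] = target;                       /* second byte: target */
    out[2] = (uint8_t)((lun >> 8) | 0x40); /* assumed flat-space method */
    out[3] = (uint8_t)(lun & 0xff);
    return true;
}
```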
2552
2553The id field is the command identifier (“tag”).
2554
task_attr, prio and crn should be set to zero. task_attr defines
the task attribute as in the table above, but all task attributes
may be mapped to SIMPLE by the device; crn may also be provided
by clients, but is generally expected to be 0. The maximum CRN
value defined by the protocol is 255, since CRN is stored in an
8-bit integer.
2561
All of these fields are defined in SAM. They are always
read-only, as are the cdb and dataout fields. The cdb_size is
taken from the configuration space.

sense_len and subsequent fields are always write-only. The sense_len
2567field indicates the number of bytes actually written to the sense
2568buffer. The residual field indicates the residual size,
2569calculated as “data_length - number_of_transferred_bytes”, for
2570read or write operations. For bidirectional commands, the
2571number_of_transferred_bytes includes both read and written bytes.
2572A residual field that is less than the size of datain means that
2573the dataout field was processed entirely. A residual field that
2574exceeds the size of datain means that the dataout field was
2575processed partially and the datain field was not processed at
2576all.
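
The residual rule for bidirectional commands can be worked through in
code. The sketch below splits a residual into the bytes actually
processed from dataout and datain, following the two cases described
above; the structure and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct scsi_xfer {
    uint32_t dataout_done; /* bytes of dataout processed */
    uint32_t datain_done;  /* bytes of datain processed */
};

/*
 * Hypothetical helper: residual is data_length minus the number of
 * transferred bytes, where data_length covers dataout plus datain.
 * Returns false if residual exceeds the total data length.
 */
bool scsi_split_residual(uint32_t dataout_len, uint32_t datain_len,
                         uint32_t residual, struct scsi_xfer *x)
{
    if ((uint64_t)residual > (uint64_t)dataout_len + datain_len)
        return false;

    if (residual <= datain_len) {
        /* dataout was processed entirely; datain fell short. */
        x->dataout_done = dataout_len;
        x->datain_done = datain_len - residual;
    } else {
        /* datain was not processed; dataout only partially. */
        x->datain_done = 0;
        x->dataout_done = dataout_len - (residual - datain_len);
    }
    return true;
}
```

For example, with 1024 bytes of dataout and 4096 bytes of datain, a
residual of 512 means 3584 bytes of datain were filled, while a
residual of 4608 means only 512 bytes of dataout were processed.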
2577
2578The status byte is written by the device to be the status code as
2579defined in SAM.
2580
2581The response byte is written by the device to be one of the
2582following:
2583
2584  VIRTIO_SCSI_S_OK when the request was completed and the status
2585  byte is filled with a SCSI status code (not necessarily
2586  "GOOD").
2587
2588  VIRTIO_SCSI_S_OVERRUN if the content of the CDB requires
2589  transferring more data than is available in the data buffers.
2590
2591  VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an
2592  ABORT TASK or ABORT TASK SET task management function.
2593
2594  VIRTIO_SCSI_S_BAD_TARGET if the request was never processed
2595  because the target indicated by the lun field does not exist.
2596
2597  VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus
2598  or device reset (including a task management function).
2599
2600  VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed due to a
2601  problem in the connection between the host and the target
2602  (severed link).
2603
2604  VIRTIO_SCSI_S_TARGET_FAILURE if the target is suffering a
2605  failure and the guest should not retry on other paths.
2606
2607  VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is suffering a failure
2608  but retrying on other paths might yield a different result.
2609
2610  VIRTIO_SCSI_S_BUSY if the request failed but retrying on the
2611  same path should work.
2612
2613  VIRTIO_SCSI_S_FAILURE for other host or guest error. In
2614  particular, if neither dataout nor datain is empty, and the
2615  VIRTIO_SCSI_F_INOUT feature has not been negotiated, the
2616  request will be immediately returned with a response equal to
2617  VIRTIO_SCSI_S_FAILURE.
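
The retry guidance attached to these response values can be
summarized in a lookup such as the one below. The enum scsi_retry and
the function scsi_retry_hint are hypothetical, and responses for which
the text gives no retry advice are conservatively treated here as not
retryable, which is an assumption of this sketch.

```c
#include <stdint.h>

#define VIRTIO_SCSI_S_OK                0
#define VIRTIO_SCSI_S_BUSY              5
#define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6
#define VIRTIO_SCSI_S_TARGET_FAILURE    7
#define VIRTIO_SCSI_S_NEXUS_FAILURE     8

enum scsi_retry {
    SCSI_COMPLETED,        /* check the SCSI status byte next */
    SCSI_RETRY_SAME_PATH,
    SCSI_RETRY_OTHER_PATH,
    SCSI_NO_RETRY
};

/* Hypothetical helper: maps a response byte to a retry decision. */
enum scsi_retry scsi_retry_hint(uint8_t response)
{
    switch (response) {
    case VIRTIO_SCSI_S_OK:
        return SCSI_COMPLETED;
    case VIRTIO_SCSI_S_BUSY:
        return SCSI_RETRY_SAME_PATH;  /* same path should work */
    case VIRTIO_SCSI_S_NEXUS_FAILURE:
        return SCSI_RETRY_OTHER_PATH; /* other paths may differ */
    case VIRTIO_SCSI_S_TARGET_FAILURE:
        return SCSI_NO_RETRY;         /* do not retry on other paths */
    default:
        return SCSI_NO_RETRY;         /* assumption: no advice given */
    }
}
```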
2618
2619  Device Operation: controlq
2620
2621The controlq is used for other SCSI transport operations.
2622Requests have the following format:
2623
2624struct virtio_scsi_ctrl {
2625
2626    u32 type;
2627
2628    ...
2629
2630    u8 response;
2631
2632};
2633
2634
2635
2636/* response values valid for all commands */
2637
2638#define VIRTIO_SCSI_S_OK                       0
2639
2640#define VIRTIO_SCSI_S_BAD_TARGET               3
2641
2642#define VIRTIO_SCSI_S_BUSY                     5
2643
2644#define VIRTIO_SCSI_S_TRANSPORT_FAILURE        6
2645
2646#define VIRTIO_SCSI_S_TARGET_FAILURE           7
2647
2648#define VIRTIO_SCSI_S_NEXUS_FAILURE            8
2649
2650#define VIRTIO_SCSI_S_FAILURE                  9
2651
2652#define VIRTIO_SCSI_S_INCORRECT_LUN            12
2653
2654The type identifies the remaining fields.
2655
2656The following commands are defined:
2657
2658  Task management function
2659#define VIRTIO_SCSI_T_TMF                      0
2660
2661
2662
2663#define VIRTIO_SCSI_T_TMF_ABORT_TASK           0
2664
2665#define VIRTIO_SCSI_T_TMF_ABORT_TASK_SET       1
2666
2667#define VIRTIO_SCSI_T_TMF_CLEAR_ACA            2
2668
2669#define VIRTIO_SCSI_T_TMF_CLEAR_TASK_SET       3
2670
2671#define VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET      4
2672
2673#define VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET   5
2674
2675#define VIRTIO_SCSI_T_TMF_QUERY_TASK           6
2676
2677#define VIRTIO_SCSI_T_TMF_QUERY_TASK_SET       7
2678
2679
2680
2681struct virtio_scsi_ctrl_tmf
2682
2683{
2684
2685    // Read-only part
2686
2687    u32 type;
2688
2689    u32 subtype;
2690
2691    u8 lun[8];
2692
2693    u64 id;
2694
2695    // Write-only part
2696
2697    u8 response;
2698
};
2700
2701
2702
2703/* command-specific response values */
2704
2705#define VIRTIO_SCSI_S_FUNCTION_COMPLETE        0
2706
2707#define VIRTIO_SCSI_S_FUNCTION_SUCCEEDED       10
2708
2709#define VIRTIO_SCSI_S_FUNCTION_REJECTED        11
2710
  The type is VIRTIO_SCSI_T_TMF. All fields except response are
  filled by the driver. The subtype field must always be specified
  and identifies the requested task management function.
2715
2716  Other fields may be irrelevant for the requested TMF; if so,
2717  they are ignored but they should still be present. The lun
2718  field is in the same format specified for request queues; the
2719  single level LUN is ignored when the task management function
2720  addresses a whole I_T nexus. When relevant, the value of the id
2721  field is matched against the id values passed on the requestq.
2722
2723  The outcome of the task management function is written by the
2724  device in the response field. The command-specific response
2725  values map 1-to-1 with those defined in SAM.
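
As a sketch of the driver side, the read-only part of a task
management request that aborts one command might be filled in as
follows. Only the driver-filled fields are modelled; the response byte
would live in a separate write-only buffer. The names tmf_req and
tmf_abort_task are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define VIRTIO_SCSI_T_TMF            0
#define VIRTIO_SCSI_T_TMF_ABORT_TASK 0

/* Read-only (driver-filled) part of struct virtio_scsi_ctrl_tmf. */
struct tmf_req {
    uint32_t type;
    uint32_t subtype;
    uint8_t  lun[8];
    uint64_t id;
};

/*
 * Hypothetical helper: builds an ABORT TASK request for the command
 * that was queued on a request queue with the given lun and id.
 */
struct tmf_req tmf_abort_task(const uint8_t lun[8], uint64_t id)
{
    struct tmf_req r;

    memset(&r, 0, sizeof r);
    r.type = VIRTIO_SCSI_T_TMF;
    r.subtype = VIRTIO_SCSI_T_TMF_ABORT_TASK;
    memcpy(r.lun, lun, sizeof r.lun);
    r.id = id; /* matched against the id passed on the request queue */
    return r;
}
```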
2726
2727  Asynchronous notification query
2728#define VIRTIO_SCSI_T_AN_QUERY                    1
2729
2730
2731
2732struct virtio_scsi_ctrl_an {
2733
2734    // Read-only part
2735
2736    u32 type;
2737
2738    u8  lun[8];
2739
2740    u32 event_requested;
2741
2742    // Write-only part
2743
2744    u32 event_actual;
2745
2746    u8  response;
2747
};
2749
2750
2751
2752#define VIRTIO_SCSI_EVT_ASYNC_OPERATIONAL_CHANGE  2
2753
2754#define VIRTIO_SCSI_EVT_ASYNC_POWER_MGMT          4
2755
2756#define VIRTIO_SCSI_EVT_ASYNC_EXTERNAL_REQUEST    8
2757
2758#define VIRTIO_SCSI_EVT_ASYNC_MEDIA_CHANGE        16
2759
2760#define VIRTIO_SCSI_EVT_ASYNC_MULTI_HOST          32
2761
2762#define VIRTIO_SCSI_EVT_ASYNC_DEVICE_BUSY         64
2763
2764  By sending this command, the driver asks the device which
2765  events the given LUN can report, as described in paragraphs 6.6
2766  and A.6 of the SCSI MMC specification. The driver writes the
  events it is interested in into event_requested; the device
2768  responds by writing the events that it supports into
2769  event_actual.
2770
2771  The type is VIRTIO_SCSI_T_AN_QUERY. The lun and event_requested
2772  fields are written by the driver. The event_actual and response
2773  fields are written by the device.
2774
2775  No command-specific values are defined for the response byte.
2776
2777  Asynchronous notification subscription
#define VIRTIO_SCSI_T_AN_SUBSCRIBE                2

struct virtio_scsi_ctrl_an {
    // Read-only part
    u32 type;
    u8  lun[8];
    u32 event_requested;
    // Write-only part
    u32 event_actual;
    u8  response;
}
2799
2800  By sending this command, the driver asks the specified LUN to
2801  report events for its physical interface, again as described in
2802  the SCSI MMC specification. The driver writes the events it is
  interested in into event_requested; the device responds by
2804  writing the events that it supports into event_actual.
2805
2806  Event types are the same as for the asynchronous notification
2807  query message.
2808
2809  The type is VIRTIO_SCSI_T_AN_SUBSCRIBE. The lun and
2810  event_requested fields are written by the driver. The
2811  event_actual and response fields are written by the device.
2812
2813  No command-specific values are defined for the response byte.
2814
2815  Device Operation: eventq
2816
2817The eventq is used by the device to report information on logical
2818units that are attached to it. The driver should always leave a
2819few buffers ready in the eventq. In general, the device will not
2820queue events to cope with an empty eventq, and will end up
2821dropping events if it finds no buffer ready. However, when
2822reporting events for many LUNs (e.g. when a whole target
2823disappears), the device can throttle events to avoid dropping
2824them. For this reason, placing 10-15 buffers on the event queue
2825should be enough.
2826
2827Buffers are placed in the eventq and filled by the device when
2828interesting events occur. The buffers should be strictly
2829write-only (device-filled) and the size of the buffers should be
2830at least the value given in the device's configuration
2831information.
2832
2833Buffers returned by the device on the eventq will be referred to
2834as "events" in the rest of this section. Events have the
2835following format:
2836
#define VIRTIO_SCSI_T_EVENTS_MISSED   0x80000000

struct virtio_scsi_event {
    // Write-only part
    u32 event;
    ...
}
2850
2851If bit 31 is set in the event field, the device failed to report
2852an event due to missing buffers. In this case, the driver should
2853poll the logical units for unit attention conditions, and/or do
2854whatever form of bus scan is appropriate for the guest operating
2855system.
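
One way a driver might separate the "events missed" flag from the
rest of the event field is sketched below in C. The decoded_event
structure and the decode_event helper are hypothetical names, and
treating the remaining 31 bits as the event type is an assumption
of this sketch rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000u

/* Hypothetical decoded form of the event field. */
struct decoded_event {
    bool     missed; /* bit 31: at least one event was dropped */
    uint32_t type;   /* remaining bits, assumed to be the event type */
};

static void decode_event(uint32_t event, struct decoded_event *out)
{
    assert(out);
    out->missed = (event & VIRTIO_SCSI_T_EVENTS_MISSED) != 0;
    out->type   = event & ~VIRTIO_SCSI_T_EVENTS_MISSED;
}
```

When missed is set, the driver would start its poll or bus scan
before handling the decoded type.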
2856
2857Other data that the device writes to the buffer depends on the
2858contents of the event field. The following events are defined:
2859
2860  No event
2861#define VIRTIO_SCSI_T_NO_EVENT         0
2862
2863  This event is fired in the following cases:
2864
2865  When the device detects in the eventq a buffer that is shorter
2866    than what is indicated in the configuration field, it might
2867    use it immediately and put this dummy value in the event
2868    field. A well-written driver will never observe this
2869    situation.
2870
2871  When events are dropped, the device may signal this event as
    soon as the driver makes a buffer available, in order to
2873    request action from the driver. In this case, of course, this
2874    event will be reported with the VIRTIO_SCSI_T_EVENTS_MISSED
2875    flag.
2876
2877  Transport reset
#define VIRTIO_SCSI_T_TRANSPORT_RESET  1

struct virtio_scsi_event_reset {
    // Write-only part
    u32 event;
    u8  lun[8];
    u32 reason;
}

#define VIRTIO_SCSI_EVT_RESET_HARD         0
#define VIRTIO_SCSI_EVT_RESET_RESCAN       1
#define VIRTIO_SCSI_EVT_RESET_REMOVED      2
2901
2902  By sending this event, the device signals that a logical unit
2903  on a target has been reset, including the case of a new device
  appearing or disappearing on the bus. The device fills in all
2905  fields. The event field is set to
2906  VIRTIO_SCSI_T_TRANSPORT_RESET. The lun field addresses a
2907  logical unit in the SCSI host.
2908
2909  The reason value is one of the three #define values appearing
2910  above:
2911
2912  VIRTIO_SCSI_EVT_RESET_REMOVED (“LUN/target removed”) is used if
2913    the target or logical unit is no longer able to receive
2914    commands.
2915
2916  VIRTIO_SCSI_EVT_RESET_HARD (“LUN hard reset”) is used if the
2917    logical unit has been reset, but is still present.
2918
2919  VIRTIO_SCSI_EVT_RESET_RESCAN (“rescan LUN/target”) is used if a
2920    target or logical unit has just appeared on the device.
2921
2922  The “removed” and “rescan” events, when sent for LUN 0, may
2923  apply to the entire target. After receiving them the driver
2924  should ask the initiator to rescan the target, in order to
2925  detect the case when an entire target has appeared or
2926  disappeared. These two events will never be reported unless the
2927  VIRTIO_SCSI_F_HOTPLUG feature was negotiated between the host
2928  and the guest.
2929
2930  Events will also be reported via sense codes (this obviously
2931  does not apply to newly appeared buses or targets, since the
2932  application has never discovered them):
2933
2934  “LUN/target removed” maps to sense key ILLEGAL REQUEST, asc
2935    0x25, ascq 0x00 (LOGICAL UNIT NOT SUPPORTED)
2936
2937  “LUN hard reset” maps to sense key UNIT ATTENTION, asc 0x29
2938    (POWER ON, RESET OR BUS DEVICE RESET OCCURRED)
2939
2940  “rescan LUN/target” maps to sense key UNIT ATTENTION, asc 0x3f,
2941    ascq 0x0e (REPORTED LUNS DATA HAS CHANGED)
2942
2943  The preferred way to detect transport reset is always to use
2944  events, because sense codes are only seen by the driver when it
2945  sends a SCSI command to the logical unit or target. However, in
2946  case events are dropped, the initiator will still be able to
2947  synchronize with the actual state of the controller if the
  driver asks the initiator to rescan the SCSI bus. During the
  rescan, the initiator will be able to observe the above sense
  codes, and it will process them as if the driver had
  received the equivalent event.
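
  The correspondence between reset reasons and sense codes listed
  above can be written as a small lookup table. The C sketch below
  is illustrative only: the sense_triple type and the
  reset_reason_to_sense helper are hypothetical, the sense key
  numbers (0x05 and 0x06) are the usual SCSI values for ILLEGAL
  REQUEST and UNIT ATTENTION, and ascq 0x00 for the hard reset case
  is assumed from the quoted description, since only the asc is
  given above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_SCSI_EVT_RESET_HARD     0
#define VIRTIO_SCSI_EVT_RESET_RESCAN   1
#define VIRTIO_SCSI_EVT_RESET_REMOVED  2

#define SENSE_KEY_ILLEGAL_REQUEST 0x05
#define SENSE_KEY_UNIT_ATTENTION  0x06

/* Hypothetical container for a sense key / asc / ascq triple. */
struct sense_triple {
    uint8_t key;
    uint8_t asc;
    uint8_t ascq;
};

/* Returns false if the reason is not one of the three defined values. */
static bool reset_reason_to_sense(uint32_t reason, struct sense_triple *out)
{
    static const struct sense_triple table[] = {
        [VIRTIO_SCSI_EVT_RESET_HARD]    = { SENSE_KEY_UNIT_ATTENTION,  0x29, 0x00 },
        [VIRTIO_SCSI_EVT_RESET_RESCAN]  = { SENSE_KEY_UNIT_ATTENTION,  0x3f, 0x0e },
        [VIRTIO_SCSI_EVT_RESET_REMOVED] = { SENSE_KEY_ILLEGAL_REQUEST, 0x25, 0x00 },
    };

    assert(out);
    if (reason >= sizeof(table) / sizeof(table[0]))
        return false;
    *out = table[reason];
    return true;
}
```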
2952
2953  Asynchronous notification
#define VIRTIO_SCSI_T_ASYNC_NOTIFY     2

struct virtio_scsi_event_an {
    // Write-only part
    u32 event;
    u8  lun[8];
    u32 reason;
}
2969
2970  By sending this event, the device signals that an asynchronous
2971  event was fired from a physical interface.
2972
2973  All fields are written by the device. The event field is set to
2974  VIRTIO_SCSI_T_ASYNC_NOTIFY. The lun field addresses a logical
2975  unit in the SCSI host. The reason field is a subset of the
2976  events that the driver has subscribed to via the "Asynchronous
2977  notification subscription" command.
2978
2979  When dropped events are reported, the driver should poll for
2980  asynchronous events manually using SCSI commands.
2981
2982Appendix X: virtio-mmio
2983
Virtual environments without PCI support (a common situation in
embedded device models) might use a simple memory mapped device
(“virtio-mmio”) instead of the PCI device.
2987
The memory mapped virtio device behaviour is based on the PCI
device specification. Therefore most operations, such as device
initialization, queue configuration and buffer transfers, are
nearly identical. Existing differences are described in the
following sections.
2993
2994  Device Initialization
2995
Instead of using the PCI IO space for the virtio header, the
“virtio-mmio” device provides a set of memory mapped control
registers, all 32 bits wide, followed by device-specific
configuration space. The following list presents their layout:
3000
3001  Offset from the device base address | Direction | Name
3002 Description
3003
3004  0x000 | R | MagicValue
3005 “virt” string.
3006
3007  0x004 | R | Version
3008 Device version number. Currently must be 1.
3009
3010  0x008 | R | DeviceID
 Virtio Subsystem Device ID (e.g. 1 for network card).
3012
3013  0x00c | R | VendorID
3014 Virtio Subsystem Vendor ID.
3015
3016  0x010 | R | HostFeatures
3017 Flags representing features the device supports.
 Reading from this register returns 32 consecutive flag bits,
  first bit depending on the last value written to the
  HostFeaturesSel register. Access to this register returns bits
  HostFeaturesSel*32 to (HostFeaturesSel*32)+31, e.g. feature
  bits 0 to 31 if HostFeaturesSel is set to 0 and feature bits 32
  to 63 if HostFeaturesSel is set to 1. Also see
  [sub:Feature-Bits]
3026
3027  0x014 | W | HostFeaturesSel
3028 Device (Host) features word selection.
3029 Writing to this register selects a set of 32 device feature bits
3030  accessible by reading from HostFeatures register. Device driver
3031  must write a value to the HostFeaturesSel register before
3032  reading from the HostFeatures register.
3033
3034  0x020 | W | GuestFeatures
3035 Flags representing device features understood and activated by
3036  the driver.
 Writing to this register sets 32 consecutive flag bits, first
  bit depending on the last value written to the
  GuestFeaturesSel register. Access to this register sets bits
  GuestFeaturesSel*32 to (GuestFeaturesSel*32)+31, e.g. feature
  bits 0 to 31 if GuestFeaturesSel is set to 0 and feature bits
  32 to 63 if GuestFeaturesSel is set to 1. Also see
  [sub:Feature-Bits]
3045
3046  0x024 | W | GuestFeaturesSel
3047 Activated (Guest) features word selection.
3048 Writing to this register selects a set of 32 activated feature
3049  bits accessible by writing to the GuestFeatures register.
3050  Device driver must write a value to the GuestFeaturesSel
3051  register before writing to the GuestFeatures register.
3052
3053  0x028 | W | GuestPageSize
3054 Guest page size.
3055 Device driver must write the guest page size in bytes to the
3056  register during initialization, before any queues are used.
3057  This value must be a power of 2 and is used by the Host to
3058  calculate Guest address of the first queue page (see QueuePFN).
3059
3060  0x030 | W | QueueSel
3061 Virtual queue index (first queue is 0).
3062 Writing to this register selects the virtual queue that the
3063  following operations on QueueNum, QueueAlign and QueuePFN apply
3064  to.
3065
3066  0x034 | R | QueueNumMax
3067 Maximum virtual queue size.
3068 Reading from the register returns the maximum size of the queue
3069  the Host is ready to process or zero (0x0) if the queue is not
3070  available. This applies to the queue selected by writing to
3071  QueueSel and is allowed only when QueuePFN is set to zero
3072  (0x0), so when the queue is not actively used.
3073
3074  0x038 | W | QueueNum
3075 Virtual queue size.
 Queue size is the number of elements in the queue, and
  therefore the size of the descriptor table and of both the
  available and used rings.
3078 Writing to this register notifies the Host what size of the
3079  queue the Guest will use. This applies to the queue selected by
3080  writing to QueueSel.
3081
3082  0x03c | W | QueueAlign
3083 Used Ring alignment in the virtual queue.
3084 Writing to this register notifies the Host about alignment
3085  boundary of the Used Ring in bytes. This value must be a power
3086  of 2 and applies to the queue selected by writing to QueueSel.
3087
3088  0x040 | RW | QueuePFN
3089 Guest physical page number of the virtual queue.
3090 Writing to this register notifies the host about location of the
3091  virtual queue in the Guest's physical address space. This value
3092  is the index number of a page starting with the queue
3093  Descriptor Table. Value zero (0x0) means physical address zero
3094  (0x00000000) and is illegal. When the Guest stops using the
3095  queue it must write zero (0x0) to this register.
3096 Reading from this register returns the currently used page
3097  number of the queue, therefore a value other than zero (0x0)
3098  means that the queue is in use.
3099 Both read and write accesses apply to the queue selected by
3100  writing to QueueSel.
3101
3102  0x050 | W | QueueNotify
3103 Queue notifier.
3104 Writing a queue index to this register notifies the Host that
3105  there are new buffers to process in the queue.
3106
  0x060 | R | InterruptStatus
 Interrupt status.
 Reading from this register returns a bit mask of interrupts
  asserted by the device. An interrupt is asserted if the
  corresponding bit is set, i.e. equals one (1).
3112
3113  Bit 0 | Used Ring Update
3114This interrupt is asserted when the Host has updated the Used
3115    Ring in at least one of the active virtual queues.
3116
3117  Bit 1 | Configuration change
3118This interrupt is asserted when configuration of the device has
3119    changed.
3120
3121  0x064 | W | InterruptACK
3122 Interrupt acknowledge.
3123 Writing to this register notifies the Host that the Guest
3124  finished handling interrupts. Set bits in the value clear the
3125  corresponding bits of the InterruptStatus register.
3126
3127  0x070 | RW | Status
3128 Device status.
3129 Reading from this register returns the current device status
3130  flags.
3131 Writing non-zero values to this register sets the status flags,
3132  indicating the Guest progress. Writing zero (0x0) to this
3133  register triggers a device reset.
3134 Also see [sub:Device-Initialization-Sequence]
3135
3136  0x100+ | RW | Config
 Device-specific configuration space starts at offset 0x100
  and is accessed with byte alignment. Its meaning and size
  depend on the device and the driver.
3140
Virtual queue size is the number of elements in the queue, and
therefore the size of the descriptor table and of both the
available and used rings.
3144
3145The endianness of the registers follows the native endianness of
3146the Guest. Writing to registers described as “R” and reading from
3147registers described as “W” is not permitted and can cause
3148undefined behavior.
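
To illustrate the select-then-read pattern that the HostFeaturesSel
and HostFeatures descriptions call for, here is a minimal C sketch
that assembles a 64-bit feature mask from two 32-bit reads. The
VIRTIO_MMIO_* macro names, the fake_dev structure and the
dev_read/dev_write accessors are assumptions made for this
example; fake_dev is only an in-memory stand-in so that the sketch
can run without a device, and a real driver would use its
platform's MMIO accessors instead.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets taken from the register list above. */
#define VIRTIO_MMIO_HOST_FEATURES     0x010
#define VIRTIO_MMIO_HOST_FEATURES_SEL 0x014

/* Hypothetical in-memory stand-in for a device with 64 feature bits. */
struct fake_dev {
    uint32_t features_sel; /* last value written to HostFeaturesSel */
    uint64_t features;     /* feature bits the fake device offers */
};

static uint32_t dev_read(const struct fake_dev *d, uint32_t off)
{
    if (off == VIRTIO_MMIO_HOST_FEATURES && d->features_sel < 2)
        return (uint32_t)(d->features >> (32 * d->features_sel));
    return 0;
}

static void dev_write(struct fake_dev *d, uint32_t off, uint32_t val)
{
    if (off == VIRTIO_MMIO_HOST_FEATURES_SEL)
        d->features_sel = val;
}

/* Select each 32-bit word in turn, then read it back. */
static uint64_t read_host_features(struct fake_dev *d)
{
    uint64_t lo, hi;

    assert(d);
    dev_write(d, VIRTIO_MMIO_HOST_FEATURES_SEL, 0);
    lo = dev_read(d, VIRTIO_MMIO_HOST_FEATURES);
    dev_write(d, VIRTIO_MMIO_HOST_FEATURES_SEL, 1);
    hi = dev_read(d, VIRTIO_MMIO_HOST_FEATURES);
    return lo | (hi << 32);
}
```

The selector is always written before the read, so the result does
not depend on whatever value a previous access left in
HostFeaturesSel.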
3149
The device initialization is performed as described in
[sub:Device-Initialization-Sequence] with one exception: the
Guest must notify the Host about its page size by writing the
size in bytes to the GuestPageSize register before the
initialization is finished.
3154
The memory mapped virtio devices generate a single interrupt only,
therefore no special configuration is required.
3157
3158  Virtqueue Configuration
3159
3160The virtual queue configuration is performed in a similar way to
3161the one described in [sec:Virtqueue-Configuration] with a few
3162additional operations:
3163
  Select the queue by writing its index (first queue is 0) to the
3165  QueueSel register.
3166
  Check that the queue is not already in use: read the QueuePFN
  register; the returned value should be zero (0x0).
3169
3170  Read maximum queue size (number of elements) from the
3171  QueueNumMax register. If the returned value is zero (0x0) the
3172  queue is not available.
3173
3174  Allocate and zero the queue pages in contiguous virtual memory,
3175  aligning the Used Ring to an optimal boundary (usually page
3176  size). Size of the allocated queue may be smaller than or equal
3177  to the maximum size returned by the Host.
3178
3179  Notify the Host about the queue size by writing the size to
3180  QueueNum register.
3181
3182  Notify the Host about the used alignment by writing its value
3183  in bytes to QueueAlign register.
3184
3185  Write the physical number of the first page of the queue to the
3186  QueuePFN register.
3187
3188The queue and the device are ready to begin normal operations
3189now.
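
The register accesses in the steps above can be sketched in C as
follows. This is an illustration under stated assumptions, not a
reference implementation: the VIRTIO_MMIO_* macro names, fake_dev,
dev_read/dev_write, queue_max_size and queue_activate are all
hypothetical, fake_dev is an in-memory stand-in for the device, and
allocating and zeroing the queue pages is left to the caller, who
passes the resulting page number as pfn.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets taken from the register list above. */
#define VIRTIO_MMIO_QUEUE_SEL     0x030
#define VIRTIO_MMIO_QUEUE_NUM_MAX 0x034
#define VIRTIO_MMIO_QUEUE_NUM     0x038
#define VIRTIO_MMIO_QUEUE_ALIGN   0x03c
#define VIRTIO_MMIO_QUEUE_PFN     0x040

#define FAKE_NQUEUES 2

/* Hypothetical in-memory stand-in for the per-queue registers. */
struct fake_dev {
    uint32_t sel;
    uint32_t num_max[FAKE_NQUEUES];
    uint32_t num[FAKE_NQUEUES];
    uint32_t align[FAKE_NQUEUES];
    uint32_t pfn[FAKE_NQUEUES];
};

static uint32_t dev_read(const struct fake_dev *d, uint32_t off)
{
    if (d->sel >= FAKE_NQUEUES)
        return 0;
    switch (off) {
    case VIRTIO_MMIO_QUEUE_NUM_MAX: return d->num_max[d->sel];
    case VIRTIO_MMIO_QUEUE_PFN:     return d->pfn[d->sel];
    default:                        return 0;
    }
}

static void dev_write(struct fake_dev *d, uint32_t off, uint32_t val)
{
    if (off == VIRTIO_MMIO_QUEUE_SEL) {
        d->sel = val;
        return;
    }
    if (d->sel >= FAKE_NQUEUES)
        return;
    switch (off) {
    case VIRTIO_MMIO_QUEUE_NUM:   d->num[d->sel] = val;   break;
    case VIRTIO_MMIO_QUEUE_ALIGN: d->align[d->sel] = val; break;
    case VIRTIO_MMIO_QUEUE_PFN:   d->pfn[d->sel] = val;   break;
    default:                      break;
    }
}

/* Steps 1-3: select, check for use, read the maximum size.
 * Returns 0 if the queue is in use or not available. */
static uint32_t queue_max_size(struct fake_dev *d, uint32_t index)
{
    assert(d);
    dev_write(d, VIRTIO_MMIO_QUEUE_SEL, index);
    if (dev_read(d, VIRTIO_MMIO_QUEUE_PFN) != 0)
        return 0;
    return dev_read(d, VIRTIO_MMIO_QUEUE_NUM_MAX);
}

/* Steps 5-7: report size and alignment, then publish the page number. */
static void queue_activate(struct fake_dev *d, uint32_t index,
                           uint32_t num, uint32_t align, uint32_t pfn)
{
    assert(d && num != 0 && pfn != 0);
    dev_write(d, VIRTIO_MMIO_QUEUE_SEL, index);
    dev_write(d, VIRTIO_MMIO_QUEUE_NUM, num);
    dev_write(d, VIRTIO_MMIO_QUEUE_ALIGN, align);
    dev_write(d, VIRTIO_MMIO_QUEUE_PFN, pfn);
}
```

A caller would invoke queue_max_size first, allocate a queue of at
most that many elements, and only then call queue_activate; QueuePFN
is written last because a non-zero value marks the queue as in use.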
3190
3191  Device Operation
3192
3193The memory mapped virtio device behaves in the same way as
3194described in [sec:Device-Operation], with the following
3195exceptions:
3196
  The device is notified about new buffers available in a queue
  by writing the queue index to the QueueNotify register instead
  of the virtio header in PCI I/O space
  ([sub:Notifying-The-Device]).
3200
  The memory mapped virtio device uses a single, dedicated
3202  interrupt signal, which is raised when at least one of the
3203  interrupts described in the InterruptStatus register
3204  description is asserted. After receiving an interrupt, the
3205  driver must read the InterruptStatus register to check what
3206  caused the interrupt (see the register description). After the
3207  interrupt is handled, the driver must acknowledge it by writing
3208  a bit mask corresponding to the serviced interrupt to the
3209  InterruptACK register.
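
The interrupt sequence described above (read InterruptStatus,
service each asserted cause, acknowledge through InterruptACK)
might look like the following C sketch. The macro names, fake_dev,
irq_stats and handle_irq are hypothetical, the counters merely
stand in for real used-ring and configuration processing, and
fake_dev is an in-memory stand-in for the device.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets and bit positions from the register list above. */
#define VIRTIO_MMIO_INTERRUPT_STATUS 0x060
#define VIRTIO_MMIO_INTERRUPT_ACK    0x064
#define VIRTIO_MMIO_INT_VRING        (1u << 0) /* Used Ring Update */
#define VIRTIO_MMIO_INT_CONFIG       (1u << 1) /* Configuration change */

/* Hypothetical in-memory stand-in for the interrupt registers. */
struct fake_dev {
    uint32_t int_status;
};

static uint32_t dev_read(const struct fake_dev *d, uint32_t off)
{
    return off == VIRTIO_MMIO_INTERRUPT_STATUS ? d->int_status : 0;
}

static void dev_write(struct fake_dev *d, uint32_t off, uint32_t val)
{
    /* Set bits in the written value clear the matching status bits. */
    if (off == VIRTIO_MMIO_INTERRUPT_ACK)
        d->int_status &= ~val;
}

/* Hypothetical counters standing in for the driver's real work. */
struct irq_stats {
    unsigned vring;
    unsigned config;
};

/* Returns 1 if an interrupt cause was found and acknowledged, else 0. */
static int handle_irq(struct fake_dev *d, struct irq_stats *stats)
{
    uint32_t status;

    assert(d && stats);
    status = dev_read(d, VIRTIO_MMIO_INTERRUPT_STATUS);
    if (status == 0)
        return 0;
    if (status & VIRTIO_MMIO_INT_VRING)
        stats->vring++;  /* a driver would process the used rings here */
    if (status & VIRTIO_MMIO_INT_CONFIG)
        stats->config++; /* a driver would re-read the configuration here */
    dev_write(d, VIRTIO_MMIO_INTERRUPT_ACK, status);
    return 1;
}
```

Only the bits that were actually read are acknowledged, so a cause
raised between the read and the acknowledge is not lost.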
3210
3211