linux/Documentation/networking/spider_net.txt

            The Spidernet Device Driver
            ===========================

Written by Linas Vepstas <linas@austin.ibm.com>

Version of 7 June 2007

Abstract
========
This document sketches the structure of portions of the spidernet
device driver in the Linux kernel tree. The spidernet is a gigabit
ethernet device built into the Toshiba southbridge commonly used
in the SONY Playstation 3 and the IBM QS20 Cell blade.

The Structure of the RX Ring.
=============================
The receive (RX) ring is a circular linked list of RX descriptors,
together with three pointers into the ring that are used to manage its
contents.

The elements of the ring are called "descriptors" or "descrs"; they
describe the received data. This includes a pointer to a buffer
containing the received data, the buffer size, and various status bits.
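
As a rough illustration of the description above, a descriptor ring
can be sketched in C as follows. The names rx_descr and ring_link are
hypothetical and the layout is simplified; the real descriptor format
is dictated by the hardware and is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified descriptor; not the driver's real layout. */
struct rx_descr {
    void            *buf;      /* buffer that receives the packet data */
    uint32_t         buf_size; /* size of that buffer, in bytes */
    uint32_t         status;   /* status bits ("empty", "full", ...) */
    struct rx_descr *next;     /* next descr in the circular list */
};

/* Link an array of n descrs into a ring: the last points at the first. */
static void ring_link(struct rx_descr *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i].next = &d[(i + 1) % n];
}
```

Linking the last descr back to the first is what makes the list
circular, so a pointer that advances through the ring never runs off
the end.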

There are three primary states that a descriptor can be in: "empty",
"full" and "not-in-use".  An "empty" or "ready" descriptor is ready
to receive data from the hardware. A "full" descriptor has data in it,
and is waiting to be emptied and processed by the OS. A "not-in-use"
descriptor is neither empty nor full; it is simply not ready. It may
not even have a data buffer in it, or may be otherwise unusable.

During normal operation, on device startup, the OS (specifically, the
spidernet device driver) allocates a set of RX descriptors and RX
buffers. These are all marked "empty", ready to receive data. This
ring is handed off to the hardware, which sequentially fills in the
buffers, and marks them "full". The OS follows up, taking the full
buffers, processing them, and re-marking them empty.

This filling and emptying is managed by three pointers: the "head"
and "tail" pointers, managed by the OS, and a hardware current
descriptor pointer (GDACTDPA). The GDACTDPA points at the descr
currently being filled. When this descr is filled, the hardware
marks it full, and advances the GDACTDPA by one.  Thus, when there is
flowing RX traffic, every descr behind it should be marked "full",
and everything in front of it should be "empty".  If the hardware
discovers that the current descr is not empty, it will signal an
interrupt, and halt processing.
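
The hardware side of this protocol can be modelled in a few lines of
C. This is only a software model for illustration, not driver code:
rx_model, hw_fill_one and the tiny ring size are all hypothetical, and
the GDACTDPA register is represented by a plain array index.

```c
#include <stddef.h>

enum descr_state { DESCR_EMPTY, DESCR_FULL, DESCR_NOT_IN_USE };

#define RING_SIZE 4  /* tiny, for illustration only */

struct rx_model {
    enum descr_state state[RING_SIZE];
    size_t hw;       /* models GDACTDPA as an index into state[] */
};

/*
 * Model the arrival of one packet. Returns 0 if the current descr was
 * empty and has now been filled, or -1 if it was not empty, in which
 * case the real hardware would signal an interrupt and halt.
 */
static int hw_fill_one(struct rx_model *m)
{
    if (m->state[m->hw] != DESCR_EMPTY)
        return -1;
    m->state[m->hw] = DESCR_FULL;
    m->hw = (m->hw + 1) % RING_SIZE;
    return 0;
}
```

Filling every descr in this model brings the index back around to a
descr that is already full, at which point hw_fill_one() reports the
halt condition.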

The tail pointer tails or trails the hardware pointer. When the
hardware is ahead, the tail pointer will be pointing at a "full"
descr. The OS will process this descr, and then mark it "not-in-use",
and advance the tail pointer.  Thus, when there is flowing RX traffic,
all of the descrs in front of the tail pointer should be "full", and
all of those behind it should be "not-in-use". When RX traffic is not
flowing, then the tail pointer can catch up to the hardware pointer.
The OS will then note that the current tail is "empty", and halt
processing.
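
The tail-side processing described above might look roughly like the
following loop. It is a sketch over a hypothetical model (rx_model,
tail_process, and the budget parameter are not taken from the driver),
and the actual packet handling is left as a comment.

```c
#include <stddef.h>

enum descr_state { DESCR_EMPTY, DESCR_FULL, DESCR_NOT_IN_USE };

#define RING_SIZE 8  /* illustrative size */

struct rx_model {
    enum descr_state state[RING_SIZE];
    size_t tail;     /* index of the next descr the OS will examine */
};

/*
 * Process "full" descrs starting at the tail, at most `budget` of
 * them. Stops at the first descr that is not full, which is how the
 * OS notices that it has caught up. Returns the number processed.
 */
static size_t tail_process(struct rx_model *m, size_t budget)
{
    size_t done = 0;

    while (done < budget && m->state[m->tail] == DESCR_FULL) {
        /* ... pass the received data up the network stack here ... */
        m->state[m->tail] = DESCR_NOT_IN_USE;
        m->tail = (m->tail + 1) % RING_SIZE;
        done++;
    }
    return done;
}
```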

The head pointer (somewhat mis-named) follows after the tail pointer.
When traffic is flowing, then the head pointer will be pointing at
a "not-in-use" descr. The OS will perform various housekeeping duties
on this descr. This includes allocating a new data buffer and
dma-mapping it so as to make it visible to the hardware. The OS will
then mark the descr as "empty", ready to receive data. Thus, when there
is flowing RX traffic, everything in front of the head pointer should
be "not-in-use", and everything behind it should be "empty". If no
RX traffic is flowing, then the head pointer can catch up to the tail
pointer, at which point the OS will notice that the head descr is
"empty", and it will halt processing.
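
The head-side housekeeping can be sketched in the same spirit. In this
hypothetical model, malloc() merely stands in for the driver's buffer
allocation and DMA mapping, which are not shown; rx_descr and
head_refill are illustrative names.

```c
#include <stddef.h>
#include <stdlib.h>

enum descr_state { DESCR_EMPTY, DESCR_FULL, DESCR_NOT_IN_USE };

struct rx_descr {
    void            *buf;
    enum descr_state state;
};

/*
 * Starting at `head`, give each "not-in-use" descr a fresh buffer and
 * mark it "empty". Stops at the first descr that is not "not-in-use",
 * or when allocation fails. Returns the new head index.
 */
static size_t head_refill(struct rx_descr *d, size_t n, size_t head,
                          size_t buf_size)
{
    while (d[head].state == DESCR_NOT_IN_USE) {
        void *buf = malloc(buf_size);  /* stand-in for alloc + DMA map */

        if (!buf)
            break;                     /* try again on a later pass */
        d[head].buf = buf;
        d[head].state = DESCR_EMPTY;
        head = (head + 1) % n;
    }
    return head;
}
```

Stopping on allocation failure leaves the descr "not-in-use", so a
later pass can retry from the same position.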

Thus, in an idle system, the GDACTDPA, tail and head pointers will
all be pointing at the same descr, which should be "empty". All of the
other descrs in the ring should be "empty" as well.

The show_rx_chain() routine will print out the locations of the
GDACTDPA, tail and head pointers. It will also summarize the contents
of the ring, starting at the tail pointer, and listing the status
of the descrs that follow.

A typical example of the output, for a nearly idle system, might be:

net eth1: Total number of descrs=256
net eth1: Chain tail located at descr=20
net eth1: Chain head is at 20
net eth1: HW curr desc (GDACTDPA) is at 21
net eth1: Have 1 descrs with stat=x40800101
net eth1: HW next desc (GDACNEXTDA) is at 22
net eth1: Last 255 descrs with stat=xa0800000

In the above, the hardware has filled in one descr, number 20. Both
head and tail are pointing at 20, because it has not yet been emptied.
Meanwhile, hw is pointing at 21, which is free.

The "Have nnn descrs" refers to the descr starting at the tail: in this
case, nnn=1 descr, starting at descr 20. The "Last nnn descrs" refers
to all of the rest of the descrs, from the last status change. The "nnn"
is a count of how many descrs have exactly the same status.
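
The counting behind the "Have nnn descrs" lines amounts to a
run-length summary of the status words, walked once around the ring
from the tail. The sketch below shows that idea only; it is not the
code of show_rx_chain(), and status_run and summarize_ring are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct status_run {
    size_t   count;   /* how many consecutive descrs share the status */
    uint32_t status;  /* the shared status word */
};

/*
 * Walk the ring once, starting at `tail`, and collapse consecutive
 * descrs with identical status into runs. Writes at most `max_runs`
 * entries and returns the number written.
 */
static size_t summarize_ring(const uint32_t *status, size_t n, size_t tail,
                             struct status_run *runs, size_t max_runs)
{
    size_t nruns = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t s = status[(tail + i) % n];

        if (nruns > 0 && runs[nruns - 1].status == s) {
            runs[nruns - 1].count++;
        } else {
            if (nruns == max_runs)
                break;
            runs[nruns].count = 1;
            runs[nruns].status = s;
            nruns++;
        }
    }
    return nruns;
}
```

Applied to the nearly idle example (256 descrs, tail at 20, only descr
20 holding x40800101), this yields two runs: one descr with x40800101
followed by 255 descrs with xa0800000.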

The status x4... corresponds to "full" and status xa... corresponds
to "empty". The actual value printed is RXCOMST_A.

In the device driver source code, a different set of names is
used for these same concepts, so that

"empty"      == SPIDER_NET_DESCR_CARDOWNED  == 0xa
"full"       == SPIDER_NET_DESCR_FRAME_END  == 0x4
"not-in-use" == SPIDER_NET_DESCR_NOT_IN_USE == 0xf
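
Reading the state out of a printed status word is a matter of taking
its leading hex digit. The helper below is a small sketch of that
decoding; the enum and function names are illustrative, and the values
are simply the single-digit codes listed above, not the driver's own
macro definitions.

```c
#include <stdint.h>

/* State codes as listed above; these enum names are illustrative. */
enum descr_state_code {
    STATE_EMPTY      = 0xa,
    STATE_FULL       = 0x4,
    STATE_NOT_IN_USE = 0xf
};

/* Extract the leading hex digit (top four bits) of a 32-bit status. */
static unsigned int status_code(uint32_t status)
{
    return (unsigned int)(status >> 28);
}

/* Map a status word to the state name used in this document. */
static const char *status_name(uint32_t status)
{
    switch (status_code(status)) {
    case STATE_EMPTY:      return "empty";
    case STATE_FULL:       return "full";
    case STATE_NOT_IN_USE: return "not-in-use";
    default:               return "other";
    }
}
```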


The RX RAM full bug/feature
===========================

As long as the OS can empty out the RX buffers at a rate faster than
the hardware can fill them, there is no problem. If, for some reason,
the OS fails to empty the RX ring fast enough, the hardware GDACTDPA
pointer will catch up to the head, notice the not-empty condition,
and stop. However, RX packets may still continue arriving on the wire.
The spidernet chip can save some limited number of these in local RAM.
When this local RAM fills up, the spider chip will issue an interrupt
indicating this (GHIINT0STS will show ERRINT, and the GRMFLLINT bit
will be set in GHIINT1STS).  When the RX RAM full condition occurs,
a certain bug/feature is triggered that has to be specially handled.
This section describes the special handling for this condition.

When the OS finally has a chance to run, it will empty out the RX ring.
In particular, it will clear the descriptor on which the hardware had
stopped. However, once the hardware has decided that a certain
descriptor is invalid, it will not restart at that descriptor; instead
it will restart at the next descr. This potentially will lead to a
deadlock condition, as the tail pointer will be pointing at this descr,
which, from the OS point of view, is empty; the OS will be waiting for
this descr to be filled. However, the hardware has skipped this descr,
and is filling the next descrs. Since the OS doesn't see this, there
is a potential deadlock, with the OS waiting for one descr to fill,
while the hardware is waiting for a different set of descrs to become
empty.

A call to show_rx_chain() at this point indicates the nature of the
problem. A typical print when the network is hung shows the following:

net eth1: Spider RX RAM full, incoming packets might be discarded!
net eth1: Total number of descrs=256
net eth1: Chain tail located at descr=255
net eth1: Chain head is at 255
net eth1: HW curr desc (GDACTDPA) is at 0
net eth1: Have 1 descrs with stat=xa0800000
net eth1: HW next desc (GDACNEXTDA) is at 1
net eth1: Have 127 descrs with stat=x40800101
net eth1: Have 1 descrs with stat=x40800001
net eth1: Have 126 descrs with stat=x40800101
net eth1: Last 1 descrs with stat=xa0800000

Both the tail and head pointers are pointing at descr 255, which is
marked xa... which is "empty". Thus, from the OS point of view, there
is nothing to be done. In particular, there is the implicit assumption
that everything in front of the "empty" descr must surely also be empty,
as explained in the previous section. The OS is waiting for descr 255 to
become non-empty, which, in this case, will never happen.

The HW pointer is at descr 0. This descr is marked x4... or "full".
Since it's already full, the hardware can do nothing more, and thus has
halted processing. Notice that descrs 0 through 253 are all marked
"full", while descrs 254 and 255 are empty. (The "Last 1 descrs" is
descr 254, since tail was at 255.) Thus, the system is deadlocked,
and there can be no forward progress; the OS thinks there's nothing
to do, and the hardware has nowhere to put incoming data.

This bug/feature is worked around with the spider_net_resync_head_ptr()
routine. When the driver receives RX interrupts, but an examination
of the RX chain seems to show it is empty, then it is probable that
the hardware has skipped a descr or two (sometimes dozens under heavy
network conditions). The spider_net_resync_head_ptr() subroutine will
search the ring for the next full descr, and the driver will resume
operations there.  Since this will leave "holes" in the ring, there
is also a spider_net_resync_tail_ptr() that will skip over such holes.

As of this writing, the spider_net_resync() strategy seems to work very
well, even under heavy network loads.
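
The search step of the workaround can be sketched as below. This is
not the code of spider_net_resync_head_ptr(); find_next_full() is a
hypothetical helper over a plain array of status words, and it assumes
that "full" is identified by a leading hex digit of 4, as in the
printed output.

```c
#include <stddef.h>
#include <stdint.h>

#define STATE_FULL 0x4u  /* leading hex digit of a "full" status */

/*
 * Search the ring, starting at `start` and wrapping around, for the
 * first descr whose status marks it "full". Returns its index, or n
 * if no descr in the ring is full.
 */
static size_t find_next_full(const uint32_t *status, size_t n, size_t start)
{
    for (size_t i = 0; i < n; i++) {
        size_t idx = (start + i) % n;

        if ((status[idx] >> 28) == STATE_FULL)
            return idx;
    }
    return n;
}
```

For the hung ring shown in the output above, a search starting at
descr 255 skips that empty descr and lands on descr 0, where the
skipped data begins.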


The TX ring
===========
The TX ring uses a low-watermark interrupt scheme to make sure that
the TX queue is appropriately serviced for large packet sizes.

For packet sizes greater than about 1 KByte, the kernel can fill
the TX ring quicker than the device can drain it. Once the ring
is full, the netdev is stopped. When there is room in the ring,
the netdev needs to be reawakened, so that more TX packets are placed
in the ring. The hardware can empty the ring about four times per jiffy,
so it is not appropriate to wait for the poll routine to refill, since
the poll routine runs only once per jiffy.  The low-watermark mechanism
marks a descr about one quarter of the way from the bottom of the queue,
so that an interrupt is generated when the descr is processed. This
interrupt wakes up the netdev, which can then refill the queue.
For large packets, this mechanism generates a relatively small number
of interrupts, about 1K/sec. For smaller packets, this will drop to zero
interrupts, as the hardware can empty the queue faster than the kernel
can fill it.
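
One way to picture the watermark placement is the index calculation
below. It is a sketch under a stated assumption: that the "bottom" of
the queue is its most recently queued end, so the marked descr sits
three quarters of the way in from the oldest queued descr. The
function name and parameters are hypothetical and are not taken from
the driver.

```c
#include <stddef.h>

/*
 * Given the index of the oldest queued descr (`tail`), the number of
 * descrs currently queued, and the ring size, return the index of the
 * descr to flag for a low-watermark interrupt: three quarters of the
 * way through the queue, i.e. one quarter from its newest end.
 */
static size_t low_watermark_index(size_t tail, size_t queued,
                                  size_t ring_size)
{
    return (tail + (queued * 3) / 4) % ring_size;
}
```

If the kernel tick rate were 250 Hz (an assumption; this document does
not state it), four ring drains per jiffy would give roughly
4 * 250 = 1000 watermark interrupts per second, which would be
consistent with the "about 1K/sec" figure above.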


 ======= END OF DOCUMENT ========