Scaling in the Linux Networking Stack


Introduction
============

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

  RSS: Receive Side Scaling
  RPS: Receive Packet Steering
  RFS: Receive Flow Steering
  Accelerated Receive Flow Steering
  XPS: Transmit Packet Steering


RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as “Receive-side Scaling” (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but
that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers -- for example, a 4-tuple hash over
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by masking out the low order seven bits of the computed hash for the
packet (usually a Toeplitz hash), taking this number as a key into the
indirection table and reading the corresponding value.

Some advanced NICs allow steering packets to queues based on
programmable filters. For example, webserver bound TCP port 80 packets
can be directed to their own receive queue. Such “n-tuple” filters can
be configured from ethtool (--config-ntuple).

==== RSS Configuration

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The
default mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.

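The ethtool operations mentioned above can be combined into a short session. This is a sketch, not a definitive recipe: it assumes a device named eth0, an ethtool build that supports the weight form of --set-rxfh-indir, and that queue 2 is the queue chosen for web traffic.

```shell
# Show the current indirection table (masked hash -> receive queue).
ethtool --show-rxfh-indir eth0

# Reweight the table: give queue 0 three times the share of queue 1.
ethtool --set-rxfh-indir eth0 weight 3 1

# n-tuple filter: steer TCP port 80 packets to their own queue (here, 2).
ethtool --config-ntuple eth0 flow-type tcp4 dst-port 80 action 2
```

These commands require root privileges and a driver that implements the corresponding ethtool operations; otherwise they fail with "Operation not supported".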
== RSS IRQ Configuration

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
that can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems
will be running irqbalance, a daemon that dynamically optimizes IRQ
assignments and as a result may override any manual settings.

== Suggested Configuration

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-cpu load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.
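Manual IRQ affinity adjustment can be sketched as below. The IRQ number (30) and interrupt naming (eth0-rx-0) are assumptions; both vary by driver, so check /proc/interrupts on the target system first.

```shell
# Find the IRQ numbers assigned to each receive queue (names vary by driver).
grep eth0 /proc/interrupts

# Pin the IRQ for queue rx-0 (assumed to be IRQ 30 here) to CPU 2.
# smp_affinity takes a hexadecimal bitmap of CPUs; 0x4 selects CPU 2.
echo 4 > /proc/irq/30/smp_affinity

# Verify the new affinity mask.
cat /proc/irq/30/smp_affinity
```

If irqbalance is running it may rewrite these masks; stop it, or configure it to ban the relevant IRQs, before relying on manual settings.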


RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU’s backlog queue and waking up the CPU for processing.
RPS has some advantages over RSS: 1) it can be used with any NIC,
2) software filters can easily be added to hash over new protocols,
3) it does not increase hardware device interrupt rate (although it does
introduce inter-processor interrupts (IPIs)).

RPS is called during bottom half of the receive interrupt handler, when
a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which
selects the queue that should process a packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
associated flow of the packet. The hash is either provided by hardware
or will be computed in the stack. Capable hardware can pass the hash in
the receive descriptor for the packet; this would usually be the same
hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
skb->rx_hash and can be used elsewhere in the stack as a hash of the
packet’s flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. 
For each received packet,
an index into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.

==== RPS Configuration

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry:

 /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to
the bitmap.

== Suggested Configuration

For a single queue device, a typical RPS configuration would be to set
the rps_cpus to the CPUs in the same memory domain of the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since that already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. 
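As a minimal sketch of the rps_cpus setting described above (assuming a device named eth0 with a queue rx-0 on a four-CPU system):

```shell
# Enable RPS on queue rx-0 of eth0, steering to CPUs 0-3.
# rps_cpus is a hexadecimal bitmap of CPUs; 0xf == CPUs 0,1,2,3.
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus

# Writing zero (the default) disables RPS again for this queue.
echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
```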
If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.


RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take into account
application locality. This is accomplished by Receive Flow Steering
(RFS). The goal of RFS is to increase datacache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed. The flow hash
(see RPS section above) is used to calculate the index into this table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. 
Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
and tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue. A tail counter
is computed as head counter + queue length. 
In other words, the counter
in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out of order packets: when selecting the
CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU’s backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

- The current CPU's queue head counter >= the recorded tail counter
  value in rps_dev_flow[i]
- The current CPU is unset (equal to RPS_NO_CPU)
- The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.

==== RFS Configuration

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured. 
The number of entries in the global flow table is set through:

 /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table are set through:

 /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt

== Suggested Configuration

Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
suggested flow count depends on the expected number of active connections
at any given time, which may be significantly less than the number of open
connections. We have found that a value of 32768 for rps_sock_flow_entries
works fairly well on a moderately loaded server.

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.


Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. 
The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread’s CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware
queue for packets matching a particular flow. The network stack
automatically calls this function every time a flow entry in
rps_dev_flow_table is updated. The driver in turn uses a device specific
method to program the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.

==== Accelerated RFS Configuration

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. 
The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.

== Suggested Configuration

This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.


XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. To accomplish this, a mapping from CPU to hardware queue(s) is
recorded. The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice
provides two benefits. First, contention on the device queue lock is
significantly reduced since fewer CPUs contend for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue). Secondly, cache miss rate on transmit completion is
reduced, in particular for data cache lines that hold the sk_buff
structures.

XPS is configured per transmit queue by setting a bitmap of CPUs that
may use that queue to transmit. The reverse mapping, from CPUs to
transmit queues, is computed and maintained for each network device.
When transmitting the first packet in a flow, the function
get_xps_queue() is called to select a queue. This function uses the ID
of the running CPU as a key into the CPU-to-queue lookup table. If the
ID matches a single queue, that is used for transmission. 
If multiple
queues match, one is selected by using the flow hash to compute an index
into the set.

The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out of order (ooo) packets. The choice also amortizes the cost
of calling get_xps_queues() over all packets in the flow. To avoid
ooo packets, the queue for a flow can subsequently only be changed if
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out of order packets. The
transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.

==== XPS Configuration

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). The functionality remains disabled until explicitly
configured. To enable XPS, the bitmap of CPUs that may use a transmit
queue is configured using the sysfs file entry:

/sys/class/net/<dev>/queues/tx-<n>/xps_cpus

== Suggested Configuration

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. 
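The exclusive one-queue-per-CPU mapping described above can be set through the xps_cpus files; this is a sketch assuming a device named eth0 with four transmit queues on a four-CPU system.

```shell
# Pair each transmit queue with exactly one CPU.
# xps_cpus is a hexadecimal bitmap of CPUs, so each mask has one bit set.
echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus   # CPU 0
echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus   # CPU 1
echo 4 > /sys/class/net/eth0/queues/tx-2/xps_cpus   # CPU 2
echo 8 > /sys/class/net/eth0/queues/tx-3/xps_cpus   # CPU 3
```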
If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).


Further Information
===================
RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
2.6.38. Original patches were submitted by Tom Herbert.

Accelerated RFS was introduced in 2.6.35. Original patches were
submitted by Ben Hutchings.

Authors:
Tom Herbert
Willem de Bruijn