From: Randy Dunlap <rdunlap@xenotime.net>
Subject: Re: [PATCH] net: add Documentation/networking/scaling.txt
Date: Mon, 1 Aug 2011 11:41:42 -0700
Message-ID: <20110801114142.544a109d.rdunlap@xenotime.net>
Cc: linux-doc@vger.kernel.org, davem@davemloft.net, netdev@vger.kernel.org, willemb@google.com
To: Tom Herbert <therbert@google.com>
List-Id: netdev.vger.kernel.org

On Sun, 31 Jul 2011 23:56:26 -0700 (PDT) Tom Herbert <therbert@google.com> wrote:

> Describes RSS, RPS, RFS, accelerated RFS, and XPS.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  Documentation/networking/scaling.txt |  346 ++++++++++++++++++++++++++++++++++
>  1 files changed, 346 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/networking/scaling.txt
> 
> diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
> new file mode 100644
> index 0000000..aa51f0f
> --- /dev/null
> +++ b/Documentation/networking/scaling.txt
> @@ -0,0 +1,346 @@
> +Scaling in the Linux Networking Stack
> +
> +
> +Introduction
> +============
> +
> +This document describes a set of complementary techniques in the Linux
> +networking stack to increase parallelism and improve performance (in
> +throughput, latency, CPU utilization, etc.) for multi-processor systems.
> +
> +The following technologies are described:
> +
> +  RSS: Receive Side Scaling
> +  RPS: Receive Packet Steering
> +  RFS: Receive Flow Steering
> +  Accelerated Receive Flow Steering
> +  XPS: Transmit Packet Steering
> +
> +
> +RSS: Receive Side Scaling
> +=========================
> +
> +Contemporary NICs support multiple receive queues (multi-queue), which
> +can be used to distribute packets amongst CPUs for processing. The NIC
> +distributes packets by applying a filter to each packet to assign it to
> +one of a small number of logical flows. Packets for each flow are
> +steered to a separate receive queue, which in turn can be processed by
> +separate CPUs. This mechanism is generally known as “Receive-side
> +Scaling” (RSS).
> +
> +The filter used in RSS is typically a hash function over the network or
> +transport layer headers-- for example, a 4-tuple hash over IP addresses
> +and TCP ports of a packet. The most common hardware implementation of
> +RSS uses a 128 entry indirection table where each entry stores a queue

            128-entry

> +number. The receive queue for a packet is determined by masking out the
> +low order seven bits of the computed hash for the packet (usually a
> +Toeplitz hash), taking this number as a key into the indirection table
> +and reading the corresponding value.
> +
> +Some advanced NICs allow steering packets to queues based on
> +programmable filters. For example, webserver bound TCP port 80 packets
> +can be directed to their own receive queue. Such “n-tuple” filters can
> +be configured from ethtool (--config-ntuple).
> +
> +== RSS Configuration
> +
> +The driver for a multi-queue capable NIC typically provides a module
> +parameter specifying the number of hardware queues to configure. In the
> +bnx2x driver, for instance, this parameter is called num_queues. A
> +typical RSS configuration would be to have one receive queue for each
> +CPU if the device supports enough queues, or otherwise at least one for
> +each cache domain at a particular cache level (L1, L2, etc.).
> +
> +The indirection table of an RSS device, which resolves a queue by masked
> +hash, is usually programmed by the driver at initialization. The
> +default mapping is to distribute the queues evenly in the table, but the
> +indirection table can be retrieved and modified at runtime using ethtool
> +commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
> +indirection table could be done to to give different queues different
                                   ^^drop one "to"
> +relative weights.

Drop trailing whitespace above and anywhere else that it's found.  (5 places)

I thought (long ago :) that multiple RX queues were for prioritizing traffic,
but there is nothing here about using multi-queues for priorities.
Is that (no longer) done?
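
A short usage example under "== RSS Configuration" might also help readers.
Something like this, perhaps (untested sketch; eth0, the queue count and the
weights are only placeholders, and the equal/weight keywords are the
--set-rxfh-indir arguments of newer ethtool):

    # Spread the indirection table evenly over the first 4 receive queues:
    ethtool --set-rxfh-indir eth0 equal 4

    # Or give queue 0 twice the weight of queues 1-3:
    ethtool --set-rxfh-indir eth0 weight 2 1 1 1

    # Inspect the resulting indirection table:
    ethtool --show-rxfh-indir eth0
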
> +
> +== RSS IRQ Configuration
> +
> +Each receive queue has a separate IRQ associated with it. The NIC
> +triggers this to notify a CPU when new packets arrive on the given
> +queue. The signaling path for PCIe devices uses message signaled
> +interrupts (MSI-X), that can route each interrupt to a particular CPU.
> +The active mapping of queues to IRQs can be determined from
> +/proc/interrupts. By default, all IRQs are routed to CPU0. Because a
> +non-negligible part of packet processing takes place in receive
> +interrupt handling, it is advantageous to spread receive interrupts
> +between CPUs. To manually adjust the IRQ affinity of each interrupt see
> +Documentation/IRQ-affinity. On some systems, the irqbalance daemon is
> +running and will try to dynamically optimize this setting.

or (avoid a split infinitive):  will try to optimize this setting dynamically.
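
An example of spreading the receive IRQs might be useful too, e.g.
(illustrative only; the device name, IRQ number and CPU mask are made up,
smp_affinity usage as described in Documentation/IRQ-affinity.txt):

    # Find the IRQs used by the device's receive queues:
    grep eth0 /proc/interrupts

    # Route the IRQ of rx queue 0 (say IRQ 41) to CPU 2 (mask 0x4):
    echo 4 > /proc/irq/41/smp_affinity
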
> +
> +
> +RPS: Receive Packet Steering
> +============================
> +
> +Receive Packet Steering (RPS) is logically a software implementation of
> +RSS. Being in software, it is necessarily called later in the datapath.
> +Whereas RSS selects the queue and hence CPU that will run the hardware
> +interrupt handler, RPS selects the CPU to perform protocol processing
> +above the interrupt handler. This is accomplished by placing the packet
> +on the desired CPU’s backlog queue and waking up the CPU for processing.
> +RPS has some advantages over RSS: 1) it can be used with any NIC, 2)
> +software filters can easily be added to handle new protocols, 3) it does
> +not increase hardware device interrupt rate (but does use IPIs).
> +
> +RPS is called during bottom half of the receive interrupt handler, when
> +a driver sends a packet up the network stack with netif_rx() or
> +netif_receive_skb(). These call the get_rps_cpu() function, which
> +selects the queue that should process a packet.
> +
> +The first step in determining the target CPU for RPS is to calculate a
> +flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
> +depending on the protocol). This serves as a consistent hash of the
> +associated flow of the packet. The hash is either provided by hardware
> +or will be computed in the stack. Capable hardware can pass the hash in
> +the receive descriptor for the packet, this would usually be the same
                                   packet;
> +hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
> +skb->rx_hash and can be used elsewhere in the stack as a hash of the
> +packet’s flow.
> +
> +Each receive hardware qeueue has associated list of CPUs which can

                             has an associated list (?)

> +process packets received on the queue for RPS. For each received
> +packet, an index into the list is computed from the flow hash modulo the
> +size of the list. The indexed CPU is the target for processing the
> +packet, and the packet is queued to the tail of that CPU’s backlog
> +queue. At the end of the bottom half routine, inter-processor interrupts
> +(IPIs) are sent to any CPUs for which packets have been queued to their
> +backlog queue. The IPI wakes backlog processing on the remote CPU, and
> +any queued packets are then processed up the networking stack. Note that
> +the list of CPUs can be configured separately for each hardware receive
> +queue.
> +
> +== RPS Configuration
> +
> +RPS requires a kernel compiled with the CONFIG_RPS flag (on by default

                                               s/flag/kconfig symbol/

> +for smp). Even when compiled in, it is disabled without any

       for SMP).

> +configuration. The list of CPUs to which RPS may forward traffic can be
> +configured for each receive queue using the sysfs file entry:
> +
> + /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
> +
> +This file implements a bitmap of CPUs. RPS is disabled when it is zero
> +(the default), in which case packets are processed on the interrupting
> +CPU. IRQ-affinity.txt explains how CPUs are assigned to the bitmap.
> +
> +For a single queue device, a typical RPS configuration would be to set
> +the rps_cpus to the CPUs in the same cache domain of the interrupting
> +CPU for a queue. If NUMA locality is not an issue, this could also be
> +all CPUs in the system. At high interrupt rate, it might wise to exclude

                                                   it might be wise

> +the interrupting CPU from the map since that already performs much work.
> +
> +For a multi-queue system, if RSS is configured so that a receive queue
> +is mapped to each CPU, then RPS is probably redundant and unnecessary.
> +If there are fewer queues than CPUs, then RPS might be beneficial if the
> +rps_cpus for each queue are the ones that share the same cache domain as
> +the interrupting CPU for the queue.
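
Maybe add a concrete example as well, e.g. (eth0 and the CPU mask are
placeholders):

    # Let CPUs 0-3 handle RPS for receive queue 0 of eth0:
    echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus

    # Writing 0 (the default) disables RPS for that queue again:
    echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus
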
> +
> +RFS: Receive Flow Steering
> +==========================
> +
> +While RPS steers packet solely based on hash, and thus generally

             steers packets

> +provides good load distribution, it does not take into account
> +application locality. This is accomplished by Receive Flow Steering
> +(RFS). The goal of RFS is to increase datacache hitrate by steering
> +kernel processing of packets to the CPU where the application thread
> +consuming the packet is running. RFS relies on the same RPS mechanisms
> +to enqueue packets onto the backlog of another CPU and to wake that CPU.
> +
> +In RFS, packets are not forwarded directly by the value of their hash,
> +but the hash is used as index into a flow lookup table. This table maps
> +flows to the CPUs where those flows are being processed. The flow hash
> +(see RPS section above) is used to calculate the index into this table.
> +The CPU recorded in each entry is the one which last processed the flow,
> +and if there is not a valid CPU for an entry, then packets mapped to
> +that entry are steered using plain RPS.
> +
> +To avoid out of order packets (ie. when scheduler moves a thread with

                               (i.e., when the scheduler moves a thread that

> +outstanding receive packets on) there are two levels of flow tables used

   has outstanding receive packets),

> +by RFS: rps_sock_flow_table and rps_dev_flow_table.
> +
> +rps_sock_table is a global flow table. Each table value is a CPU index
> +and is populated by recvmsg and sendmsg (specifically, inet_recvmsg(),
> +inet_sendmsg(), inet_sendpage() and tcp_splice_read()). This table
> +contains the *desired* CPUs for flows.
> +
> +rps_dev_flow_table is specific to each hardware receive queue of each
> +device. Each table value stores a CPU index and a counter. The CPU
> +index represents the *current* CPU that is assigned to processing the
> +matching flows.
> +
> +The counter records the length of this CPU's backlog when a packet in
> +this flow was last enqueued. Each backlog queue has a head counter that
> +is incremented on dequeue. A tail counter is computed as head counter +
> +queue length. In other words, the counter in rps_dev_flow_table[i]
> +records the last element in flow i that has been enqueued onto the
> +currently designated CPU for flow i (of course, entry i is actually
> +selected by hash and multiple flows may hash to the same entry i).
> +
> +And now the trick for avoiding out of order packets: when selecting the
> +CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
> +and the rps_dev_flow table of the queue that the packet was received on
> +are compared. If the desired CPU for the flow (found in the
> +rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
> +table), the packet is enqueud onto that CPU’s backlog. If they differ,

                         enqueued

> +the current cpu is updated to match the desired CPU if one of the

   s/cpu/CPU/  (globally as needed)

> +following is true:
> +
> +- The current CPU's queue head counter >= the recorded tail counter
> +  value in rps_dev_flow[i]
> +- The current CPU is unset (equal to NR_CPUS)
> +- The current CPU is offline
> +
> +After this check, the packet is sent to the (possibly updated) current
> +CPU. These rules aim to ensure that a flow only moves to a new CPU when
> +there are no packets outstanding on the old CPU, as the outstanding
> +packets could arrive later than those about to be processed on the new
> +CPU.
> +
> +== RFS Configuration
> +
> +RFS is only available if the kernel flag CONFIG_RFS is enabled (on by

                                      s/flag/kconfig symbol/

> +default for smp). The functionality is disabled without any

            s/smp/SMP/

> +configuration. The number of entries in the global flow table is set
> +through:
> +
> + /proc/sys/net/core/rps_sock_flow_entries
> +
> +The number of entries in the per queue flow table are set through:

                              per-queue

> +
> + /sys/class/net/<dev>/queues/tx-<n>/rps_flow_cnt
> +
> +Both of these need to be set before RFS is enabled for a receive queue.
> +Values for both of these are rounded up to the nearest power of two. The
> +suggested flow count depends on the expected number active connections

                                              number of

> +at any given time, which may be significantly less than the number of
> +open connections. We have found that a value of 32768 for
> +rps_sock_flow_entries works fairly well on a moderately loaded server.
> +
> +For a single queue device, the rps_flow_cnt value for the single queue
> +would normally be configured to the same value as rps_sock_flow_entries.
> +For a multi-queue device, the rps_flow_cnt for each queue might be
> +configured as rps_sock_flow_entries / N, where N is the number of
> +queues. So for instance, if rps_flow_entries is set to 32768 and there
> +are 16 configured receive queues, rps_flow_cnt for each queue might be
> +configured as 2048.
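
A worked example of the suggested sizing could go here, e.g. (sketch only;
eth0 and the 16-queue split are placeholders, and I'm assuming the per-queue
rps_flow_cnt file sits under the rx-<n> queue directories like rps_cpus):

    # Global socket flow table:
    echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

    # 16 receive queues: 32768 / 16 = 2048 entries per queue flow table:
    for f in /sys/class/net/eth0/queues/rx-*/rps_flow_cnt; do
        echo 2048 > $f
    done
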
> +
> +
> +Accelerated RFS
> +===============
> +
> +Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated
> +load balancing mechanism that uses soft state to steer flows based on
> +where the thread consuming the packets of each flow is running.
> +Accelerated RFS should perform better than RFS since packets are sent
> +directly to a CPU local to the thread consuming the data. The target CPU
> +will either be the same CPU where the application runs, or at least a
> +CPU which is local to the application thread’s CPU in the cache
> +hierarchy.
> +
> +To enable accelerated RFS, the networking stack calls the
> +ndo_rx_flow_steer driver function to communicate the desired hardware
> +queue for packets matching a particular flow. The network stack
> +automatically calls this function every time a flow entry in
> +rps_dev_flow_table is updated. The driver in turn uses a device specific

                                                         device-specific

> +method to program the NIC to steer the packets.
> +
> +The hardware queue for a flow is derived from the CPU recorded in
> +rps_dev_flow_table. The stack consults a CPU to hardware queue map which

                                          CPU-to-hardware-queue map

> +is maintained by the NIC driver. This is an autogenerated reverse map of
> +the IRQ affinity table shown by /proc/interrupts. Drivers can use
> +functions in the cpu_rmap (“cpu affinitiy reverse map”) kernel library
> +to populate the map. For each CPU, the corresponding queue in the map is
> +set to be one whose processing CPU is closest in cache locality.
> +
> +== Accelerated RFS Configuration
> +
> +Accelerated RFS is only available if the kernel is compiled with
> +CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
> +It also requires that ntuple filtering is enabled via ethtool. The map
> +of CPU to queues is automatically deduced from the IRQ affinities
> +configured for each receive queue by the driver, so no additional
> +configuration should be necessary.
> +
> +XPS: Transmit Packet Steering
> +=============================
> +
> +Transmit Packet Steering is a mechanism for intelligently selecting
> +which transmit queue to use when transmitting a packet on a multi-queue
> +device. To accomplish this, a mapping from CPU to hardware queue(s) is
> +recorded. The goal of this mapping is usually to assign queues
> +exclusively to a subset of CPUs, where the transmit completions for
> +these queues are processed on a CPU within this set. This choice
> +provides two benefits. First, contention on the device queue lock is
> +significantly reduced since fewer CPUs contend for the same queue
> +(contention can be eliminated completely if each CPU has its own
> +transmit queue). Secondly, cache miss rate on transmit completion is
> +reduced, in particular for data cache lines that hold the sk_buff
> +structures.
> +
> +XPS is configured per transmit queue by setting a bitmap of CPUs that
> +may use that queue to transmit. The reverse mapping, from CPUs to
> +transmit queues, is computed and maintained for each network device.
> +When transmitting the first packet in a flow, the function
> +get_xps_queue() is called to select a queue. This function uses the ID
> +of the running CPU as a key into the CPU to queue lookup table. If the

                                      CPU-to-queue

> +ID matches a single queue, that is used for transmission. If multiple
> +queues match, one is selected by using the flow hash to compute an index
> +into the set.
> +
> +The queue chosen for transmitting a particular flow is saved in the
> +corresponding socket structure for the flow (e.g. a TCP connection).
> +This transmit queue is used for subsequent packets sent on the flow to
> +prevent out of order (ooo) packets. The choice also amortizes the cost
> +of calling get_xps_queues() over all packets in the connection. To avoid
> +ooo packets, the queue for a flow can subsequently only be changed if
> +skb->ooo_okay is set for a packet in the flow. This flag indicates that
> +there are no outstanding packets in the flow, so the transmit queue can
> +change without the risk of generating out of order packets. The
> +transport layer is responsible for setting ooo_okay appropriately. TCP,
> +for instance, sets the flag when all data for a connection has been
> +acknowledged.
> +
> +
> +== XPS Configuration
> +
> +XPS is only available if the kernel flag CONFIG_XPS is enabled (on by

                                      s/flag/kconfig symbol/

> +default for smp). The functionality is disabled without any

            s/smp/SMP/

> +configuration, in which case the the transmit queue for a packet is
> +selected by using a flow hash as an index into the set of all transmit
> +queues for the device. To enable XPS, the bitmap of CPUs that may use a
> +transmit queue is configured using the sysfs file entry:
> +
> +/sys/class/net/<dev>/queues/tx-<n>/xps_cpus
> +
> +XPS is disabled when it is zero (the default). IRQ-affinity.txt explains
> +how CPUs are assigned to the bitmap.
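
Perhaps show a small example here as well (eth0, the queue numbers and the
CPU masks are placeholders):

    # CPUs 0-1 may transmit on tx queue 0, CPUs 2-3 on tx queue 1:
    echo 3 > /sys/class/net/eth0/queues/tx-0/xps_cpus
    echo c > /sys/class/net/eth0/queues/tx-1/xps_cpus
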
> +
> +For a network device with a single transmission queue, XPS configuration
> +has no effect, since there is no choice in this case. In a multi-queue
> +system, XPS is usually configured so that each CPU maps onto one queue.
> +If there are as many queues as there are CPUs in the system, then each
> +queue can also map onto one CPU, resulting in exclusive pairings that
> +experience no contention. If there are fewer queues than CPUs, then the
> +best CPUs to share a given queue are probably those that share the cache
> +with the CPU that processes transmit completions for that queue
> +(transmit interrupts).
> +
> +
> +Further Information
> +===================
> +RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
> +2.6.38. Original patches were submitted by Tom Herbert
> +(therbert@google.com)
> +
> +
> +Accelerated RFS was introduced in 2.6.35. Original patches were
> +submitted by Ben Hutchings (bhutchings@solarflare.com)
> +
> +Authors:
> +Tom Herbert (therbert@google.com)
> +Willem de Bruijn (willemb@google.com)
> +
> -- 

Very nice writeup.  Thanks.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***