From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rick Jones
Subject: Re: [PATCH v2] net: add Documentation/networking/scaling.txt
Date: Thu, 11 Aug 2011 11:02:18 -0700
Message-ID: <4E44192A.2070204@hp.com>
References: <1312899648.5889.14.camel@gopher.nyc.corp.google.com> <4E418030.2010102@hp.com> <1313080279.3261.17.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Will de Bruijn , rdunlap@xenotime.net, linux-doc@vger.kernel.org, davem@davemloft.net, netdev@vger.kernel.org, therbert@google.com
To: Eric Dumazet
Return-path:
In-Reply-To: <1313080279.3261.17.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>
Sender: linux-doc-owner@vger.kernel.org
List-Id: netdev.vger.kernel.org

On 08/11/2011 09:31 AM, Eric Dumazet wrote:
> On Thursday 11 August 2011 at 10:26 -0400, Will de Bruijn wrote:
>
>>
>> I'll be happy to revise it once more. This version also lacks the
>> required one-line description in Documentation/networking/00-INDEX, so
>> I will have to resubmit, either way.
>>
>
> Well, patch was already accepted by David in net tree two days ago ;)

Didn't see the customary "Applied" email - mailer glitch somewhere?

Anyhow, regardless of how further changes are made, or if they are made,
here are the bits I was considering might be a matter of opinion, or perhaps
simply stripes on the bikeshed...

> +== Suggested Configuration
> +
> +RSS should be enabled when latency is a concern or whenever receive
> +interrupt processing forms a bottleneck. Spreading load between CPUs
> +decreases queue length. For low latency networking, the optimal setting
> +is to allocate as many queues as there are CPUs in the system (or the
> +NIC maximum, if lower).
> +Because the aggregate number of interrupts grows
> +with each additional queue, the most efficient high-rate configuration
> +is likely the one with the smallest number of receive queues where no
> +CPU that processes receive interrupts reaches 100% utilization. Per-cpu
> +load can be observed using the mpstat utility.

Whether it lowers latency in the absence of an interrupt processing
bottleneck depends on whether or not the application(s) receiving the
data are able/allowed to run on the CPU(s) to which the IRQs of the
queues are directed, right?

Also, what mpstat and its ilk show as CPUs could be HW threads - is it
indeed the case that one is optimal when there are as many queues as
there are HW threads, or is it when there are as many queues as there
are discrete cores?

If I have disabled interrupt coalescing in the name of latency, does the
number of queues actually affect the number of interrupts?

Certainly any CPU processing interrupts that stays below 100%
utilization is less likely to be a bottleneck, but if there are
algorithms/heuristics that get more efficient under load, staying below
the 100% CPU utilization mark doesn't mean that peak efficiency has been
reached. If there is something that processes more and more packets per
lock grab/release then it is actually most efficient in terms of packets
processed per unit CPU consumption once one gets to the ragged edge of
saturation.

Is utilization of the rx ring associated with the queue the more
accurate, albeit unavailable, measure of saturation?

> +== Suggested Configuration
> +
> +For a single queue device, a typical RPS configuration would be to set
> +the rps_cpus to the CPUs in the same cache domain of the interrupting
> +CPU. If NUMA locality is not an issue, this could also be all CPUs in
> +the system.
> +At high interrupt rate, it might be wise to exclude the
> +interrupting CPU from the map since that already performs much work.
> +
> +For a multi-queue system, if RSS is configured so that a hardware
> +receive queue is mapped to each CPU, then RPS is probably redundant
> +and unnecessary. If there are fewer hardware queues than CPUs, then
> +RPS might be beneficial if the rps_cpus for each queue are the ones that
> +share the same cache domain as the interrupting CPU for that queue.

This isn't the first mention of "cache domain" - there is actually one
above it in the RSS Configuration section, but is the anticipated
audience reasonably expected to already know what a cache domain is,
particularly as it may relate/differ from NUMA locality?

A very simplistic search for "cache domain" against Documentation/
doesn't find that term used anywhere else.

> +When the scheduler moves a thread to a new CPU while it has outstanding
> +receive packets on the old CPU, packets may arrive out of order. To
> +avoid this, RFS uses a second flow table to track outstanding packets
> +for each flow: rps_dev_flow_table is a table specific to each hardware
> +receive queue of each device. Each table value stores a CPU index and a
> +counter. The CPU index represents the *current* CPU onto which packets
> +for this flow are enqueued for further kernel processing. Ideally, kernel
> +and userspace processing occur on the same CPU, and hence the CPU index
> +in both tables is identical. This is likely false if the scheduler has
> +recently migrated a userspace thread while the kernel still has packets
> +enqueued for kernel processing on the old CPU.

This one is more drift than critique of the documentation itself, but
just how often is the scheduler shuffling a thread of execution around
anyway? I would have thought that was happening on a timescale that
would seem positively glacial compared to packet arrival rates.
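
For anyone following along, the ordering rule in the quoted RFS paragraph
can be sketched roughly as below. This is only an illustration in Python,
not the kernel code: the function name and all four parameter names are
hypothetical, and the comparison of a per-CPU queue head counter against
the counter stored with the flow is an assumption about how the "CPU index
and a counter" pair is used, since the quoted text does not spell it out.

```python
def may_switch_cpu(current_cpu, desired_cpu, queue_head, last_qtail):
    """Decide whether a flow may be steered to desired_cpu without reordering.

    All names here are hypothetical, chosen for illustration only.
    current_cpu -- CPU index stored for the flow (None if never set)
    desired_cpu -- CPU where the consuming thread was last seen running
    queue_head  -- count of packets already dequeued on current_cpu
    last_qtail  -- counter recorded when the flow's last packet was enqueued
    """
    if current_cpu is None:
        return True   # no packets were ever queued elsewhere
    if current_cpu == desired_cpu:
        return False  # already on the desired CPU, nothing to switch
    # Assumption: counters are wrapping unsigned 32-bit values, so compare
    # them through their modular difference rather than directly.
    delta = (queue_head - last_qtail) & 0xFFFFFFFF
    # The old CPU has drained past this flow's last packet when the
    # difference, read as a signed 32-bit number, is non-negative.
    return delta < 0x80000000


if __name__ == "__main__":
    # The flow's last packet (position 12) is still queued on CPU 1.
    print(may_switch_cpu(1, 2, queue_head=10, last_qtail=12))
    # CPU 1 has drained up to position 12, so the flow can move.
    print(may_switch_cpu(1, 2, queue_head=12, last_qtail=12))
```

Under this sketch a flow whose packets keep arriving faster than the old
CPU drains them never gets to switch, which is the "stuck" behaviour that
matters when thread migration is rare relative to packet arrival.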
> +== Suggested Configuration
> +
> +This technique should be enabled whenever one wants to use RFS and the
> +NIC supports hardware acceleration.

Again, drifting from critique simply of the documentation, but if
accelerated RFS is indeed goodness when RFS is being used and the NIC HW
supports it, shouldn't it be enabled automagically? And then drifting
back to the documentation itself, if accelerated RFS isn't enabled
automagically with RFS today, does the reason suggest a caveat to the
suggested configuration?

> +The queue chosen for transmitting a particular flow is saved in the
> +corresponding socket structure for the flow (e.g. a TCP connection).
> +This transmit queue is used for subsequent packets sent on the flow to
> +prevent out of order (ooo) packets. The choice also amortizes the cost
> +of calling get_xps_queues() over all packets in the connection. To avoid
> +ooo packets, the queue for a flow can subsequently only be changed if
> +skb->ooo_okay is set for a packet in the flow. This flag indicates that
> +there are no outstanding packets in the flow, so the transmit queue can
> +change without the risk of generating out of order packets. The
> +transport layer is responsible for setting ooo_okay appropriately. TCP,
> +for instance, sets the flag when all data for a connection has been
> +acknowledged.

I'd probably go with "over all packets in the flow" as that part is in
the "generic" discussion space rather than the specific example of a TCP
connection.

And I'm curious/confused about rates of thread migration vs packets - it
seems like the mechanisms in place to avoid OOO packets have a property
that the queue selected can remain "stuck" when the packet rates are
sufficiently high. If being stuck isn't likely, it suggests that
"normal" processing is enough to get packets drained - that the thread
of execution is (at least in the context of sending and receiving
traffic) going idle.
Is that then consistent with that thread of
execution being bounced from CPU to CPU by the scheduler in the first
place?

In the specific example of TCP, I see where ACK of data is sufficient to
guarantee no OOO on outbound when migrating, but all that is really
necessary is transmit completion by the NIC, no? Admittedly, getting
that information to TCP is probably undesired overhead, but doesn't
using the ACK "penalize" the thread/TCP talking to more remote (in terms
of RTT) destinations?

rick jones