From: starlight@binnacle.cx
To: Eric Dumazet
Cc: linux-kernel@vger.kernel.org, netdev, Peter Zijlstra,
 Christoph Lameter, Willy Tarreau, Ingo Molnar, Stephen Hemminger,
 Benjamin LaHaise, Joe Perches, Chetan Loke, Con Kolivas,
 Serge Belyshev
Subject: Re: big picture UDP/IP performance question re 2.6.18 -> 2.6.32
Date: Fri, 07 Oct 2011 02:13:42 -0400
Message-ID: <6.2.5.6.2.20111007020308.039be248@binnacle.cx>
In-Reply-To: <1317966007.3457.47.camel@edumazet-laptop>
References: <6.2.5.6.2.20111006231958.039bb570@binnacle.cx>
 <1317966007.3457.47.camel@edumazet-laptop>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Sender: linux-kernel-owner@vger.kernel.org
List-Id: netdev.vger.kernel.org

At 07:40 AM 10/7/2011 +0200, Eric Dumazet wrote:
>
>Thats exactly the opposite : Your old kernel is not fast enough
>to enter/exit NAPI on every incoming frame.
>
>Instead of one IRQ per incoming frame, you have less interrupts:
>A napi run processes more than 1 frame.

Please look at the data I posted.  Batching appears to give 80us
*better* latency in this case--with the old kernel.

>Now increase your incoming rate, and you'll discover a new
>kernel will be able to process more frames without losses.

Data loss correlates mainly with CPU utilization per message rate,
since buffers are configured huge at every point.  So newer kernels
break down at significantly lower message rates than older kernels.
I determined this last year when testing SLES 11 and unmodified
2.6.27.  I can run a max-rate comparison for 2.6.39.4 if you like.

>About your thread model :
>
>You have one thread that reads the incoming
>frame, and do a distribution on several queues
>based on some flow parameters. Then you wakeup
>a second thread.
>
>This kind of model is very expensive and triggers
>lot of false sharing.

Please note my use of the word "nominal" and the overall context.
Both thread-per-socket and dual-thread handoff handling were tested,
with the clear observation that the former is the production model
and works best at maximum load.  However, at 50% load (the test
here) the dual-thread handoff model is the clear winner over *all*
other scenarios.  A rough sketch of the handoff model appears below.

>New kernels are able to perform this fanout in kernel land.

Yes, of course I am interested in Intel's flow director and similar
solutions, netfilter especially.  2.6.32 is only recently available
in commercial deployment and I will be looking at that next.

Mainly I'll be looking at complete kernel bypass with 10G.  Myricom
looks like it might be good.  I tested Solarflare last year and it
was a bust for high-volume UDP (one thread), but I've heard they
fixed that and will revisit.

Regarding the observation above that lower CPU-per-packet is best
for avoiding data loss: it also correlates somewhat (though not
always) with better latency.

I've used the 'e1000e' 1G network interfaces for these tests because
they work better than the multi-queue 'igb' (Intel 82576) and
'ixgbe' (Intel 82599) in all scenarios other than maximum-stress
load.  The reason is apparently that the old 'e1000e' driver has
shorter, more efficient code paths, while 'igb' and 'ixgbe' use
significantly more CPU to process the same number of packets.  I can
quantify that if it is of interest.
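For clarity, here is a rough sketch of the dual-thread handoff model
referred to above.  It is illustrative only -- the queue count and
depth, the flow hash, the port number, and the process_message()
stub are stand-ins, not the production code -- but it shows the
shape of the design: one thread receives and distributes on flow
parameters, a second thread is woken to consume.

/*
 * Rough sketch of the dual-thread handoff model: one reader thread
 * pulls datagrams off the socket, distributes them onto per-flow
 * queues keyed on the source address, and wakes a single worker
 * thread that drains the queues.  Queue count/depth, the flow hash
 * and the port are illustrative; error handling is omitted.
 *
 *   gcc -O2 -pthread -o handoff handoff.c
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sys/socket.h>

#define NQUEUES 8
#define QDEPTH  1024                    /* slots per flow queue */
#define DGMAX   2048                    /* max datagram handled */

struct dgram { int len; char buf[DGMAX]; };

struct flowq {
    pthread_mutex_t mtx;
    unsigned head, tail;                /* enqueue / dequeue counts */
    struct dgram slot[QDEPTH];
};

static struct flowq qs[NQUEUES];
static pthread_mutex_t wake_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake_cv  = PTHREAD_COND_INITIALIZER;
static int pending;

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&wake_mtx);  /* sleep until the reader signals */
        while (!pending)
            pthread_cond_wait(&wake_cv, &wake_mtx);
        pending = 0;
        pthread_mutex_unlock(&wake_mtx);

        for (int i = 0; i < NQUEUES; i++) {
            struct flowq *q = &qs[i];
            for (;;) {
                struct dgram d;
                pthread_mutex_lock(&q->mtx);
                if (q->tail == q->head) {
                    pthread_mutex_unlock(&q->mtx);
                    break;
                }
                d = q->slot[q->tail % QDEPTH];
                q->tail++;
                pthread_mutex_unlock(&q->mtx);
                /* process_message(d.buf, d.len) would go here */
                (void)d;
            }
        }
    }
    return NULL;
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(5000) }; /* illustrative */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    for (int i = 0; i < NQUEUES; i++)
        pthread_mutex_init(&qs[i].mtx, NULL);

    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);

    for (;;) {                          /* reader / distributor thread */
        struct sockaddr_in src;
        socklen_t slen = sizeof(src);
        struct dgram d;
        d.len = (int)recvfrom(fd, d.buf, DGMAX, 0,
                              (struct sockaddr *)&src, &slen);
        if (d.len <= 0)
            continue;

        /* stand-in flow hash on source address and port */
        unsigned h = (ntohl(src.sin_addr.s_addr) ^ ntohs(src.sin_port))
                     % NQUEUES;
        struct flowq *q = &qs[h];

        pthread_mutex_lock(&q->mtx);
        if (q->head - q->tail < QDEPTH) {
            q->slot[q->head % QDEPTH] = d;
            q->head++;
        }                               /* else drop on overflow */
        pthread_mutex_unlock(&q->mtx);

        pthread_mutex_lock(&wake_mtx);  /* hand off to the worker */
        pending = 1;
        pthread_cond_signal(&wake_cv);
        pthread_mutex_unlock(&wake_mtx);
    }
}

The per-flow queue indices and the wakeup flag are the state both
threads touch for every message, and that shared state is where the
false-sharing cost you mention would show up.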
At present the only place where multi-queue NICs beat four 1G NICs
is at breaking-point traffic loads, where asymmetries in the traffic
can't easily be redistributed by the kernel and the resulting
hot-spots become the weakest-link breakpoints.

Please understand that I am not a curmudgeonly Luddite.  I realize
that sometimes it is necessary to trade efficiency for scalability.
All I'm doing here is trying to quantify the current state of
affairs and make recommendations in a commercial environment.  For
the moment, all the excellent enhancements designed to permit
extreme scalability cost too much in efficiency to be worth using
in production.

When/if Tilera delivers their 100-core CPU in volume, this state of
affairs will likely change.  I imagine both Intel and AMD have
many-core solutions in the pipeline as well, though it will be
interesting to see whether Tilera holds the essential patents and
can surpass the two majors in the market and the courts.