From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Dumazet
Subject: Re: big picture UDP/IP performance question re 2.6.18 -> 2.6.32
Date: Fri, 07 Oct 2011 07:40:07 +0200
Message-ID: <1317966007.3457.47.camel@edumazet-laptop>
References: <6.2.5.6.2.20111006231958.039bb570@binnacle.cx>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: linux-kernel@vger.kernel.org, netdev, Peter Zijlstra, Christoph Lameter, Willy Tarreau, Ingo Molnar, Stephen Hemminger, Benjamin LaHaise, Joe Perches, Chetan Loke, Con Kolivas, Serge Belyshev
To: starlight@binnacle.cx
In-Reply-To: <6.2.5.6.2.20111006231958.039bb570@binnacle.cx>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: netdev.vger.kernel.org

On Thursday, 06 October 2011 at 23:27 -0400, starlight@binnacle.cx wrote:

> After writing the last post, the large
> difference in IRQ rate between the older
> and newer kernels caught my eye.
>
> I wonder if the hugely lower rate in the older
> kernels reflects a more agile shifting
> into and out of NAPI mode by the network
> bottom half.
>
> In this test the sending system
> pulses data out on millisecond boundaries
> due to the behavior of nsleep(), which
> is used to establish the playback pace.
>
> If the older kernels are switching into NAPI
> for much of the surge and then switching out
> once the pulse falls off, it might
> conceivably result in much better latency
> and overall performance.
>
> All tests were run with Intel 82571
> network interfaces and the 'e1000e'
> device driver. Some used the driver
> packaged with the kernel, some used
> the Intel driver compiled from the source
> found on sourceforge.net. I never could
> detect any difference between the two.
>
> Since data in the production environment
> also tends to arrive in bursts, I don't find
> the pulsing playback behavior a detriment.
>

That's exactly the opposite: your old kernel is not fast enough to
enter/exit NAPI on every incoming frame. Instead of one IRQ per incoming
frame, you get fewer interrupts: a single NAPI run processes more than
one frame.

Now increase your incoming rate, and you'll discover that a new kernel
is able to process more frames without losses.

About your thread model: you have one thread that reads the incoming
frames and distributes them onto several queues based on some flow
parameters, then wakes up a second thread. This kind of model is very
expensive and triggers a lot of false sharing.

New kernels are able to perform this fanout in kernel land. You really
should take a look at Documentation/networking/scaling.txt

[ Another way of doing this fanout is using some iptables rules: check
the following commit changelog for an idea ]

commit e8648a1fdb54da1f683784b36a17aa65ea56e931
Author: Eric Dumazet
Date:   Fri Jul 23 12:59:36 2010 +0200

    netfilter: add xt_cpu match

    In some situations a CPU match permits a better spreading of
    connections, or selects targets only for a given cpu.

    With Receive Packet Steering or a multiqueue NIC and appropriate
    IRQ affinities, we can distribute traffic on available cpus, per
    session. (all RX packets for a given flow are handled by a given
    cpu)

    Some legacy applications being not SMP friendly, one way to scale
    a server is to run multiple copies of them.

    Instead of randomly choosing an instance, we can use the cpu
    number as a key, so that the softirq handler for a whole instance
    runs on a single cpu, maximizing cache effects in the TCP/UDP
    stacks.
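To make the scaling.txt suggestion concrete, here is a minimal sketch
of enabling RPS/RFS from userspace; the interface name (eth0), queue
number, and CPU mask are placeholders for your setup, not values taken
from this thread:

```shell
# Spread RX protocol processing of eth0 queue 0 across CPUs 0-3
# (bitmask 0xf). Requires root and an RPS-capable kernel (2.6.35+).
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus

# Optionally size the global flow table so RFS can steer packets
# to the CPU where the consuming application runs.
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
```

This does the fanout per-flow in softirq context, so a single reader
thread per CPU can consume its share without the cross-CPU wakeups and
false sharing of a userspace distributor thread.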
    Using NAT for example, a four-way machine might run four copies of
    the server application, using a separate listening port for each
    instance, but still presenting a unique external port:

    iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 0 \
        -j REDIRECT --to-port 8080

    iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 1 \
        -j REDIRECT --to-port 8081

    iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 2 \
        -j REDIRECT --to-port 8082

    iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 3 \
        -j REDIRECT --to-port 8083
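To pair with the rules above, the four instances would be pinned to
their CPUs so the cpu match and the application agree. A hypothetical
launcher sketch (the ./server binary and its --port flag are
assumptions, shown here as a dry run that only prints the commands):

```shell
# Dry run: print one taskset-pinned launch command per CPU (0-3),
# matching the REDIRECT ports 8080-8083 from the iptables rules.
for cpu in 0 1 2 3; do
  echo taskset -c "$cpu" ./server --port "$((8080 + cpu))"
done
```

Dropping the echo would actually start the instances; the point is
that CPU N's softirq, NAT redirect, and server process all stay on the
same cpu, keeping the socket and stack data hot in that cpu's cache.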