From mboxrd@z Thu Jan 1 00:00:00 1970
From: Stephen Hemminger
Subject: Re: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel
Date: Fri, 8 Apr 2011 07:49:02 -0700
Message-ID: <20110408074902.2bd10e6b@nehalam>
References: <1302153412.2701.64.camel@edumazet-laptop> <1302157012.2701.73.camel@edumazet-laptop> <1302163650.3357.8.camel@edumazet-laptop> <1302167168.3357.12.camel@edumazet-laptop> <1302176811.3357.15.camel@edumazet-laptop> <4D9DDF43.9080302@intel.com> <1302192218.3357.47.camel@edumazet-laptop> <4D9DE465.1080008@intel.com> <1302253651.4409.2.camel@edumazet-laptop> <1302267400.4409.22.camel@edumazet-laptop>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Eric Dumazet, Alexander Duyck, netdev, "Kirsher, Jeffrey T"
To: Wei Gu
Return-path:
Received: from mail.vyatta.com ([76.74.103.46]:32919 "EHLO mail.vyatta.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757161Ab1DHOtG convert rfc822-to-8bit (ORCPT ); Fri, 8 Apr 2011 10:49:06 -0400
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

On Fri, 8 Apr 2011 22:10:50 +0800
Wei Gu wrote:

> Hi,
> I get what you mean.
> But as I described before, I start eth10 with 8 rx queues and 8 tx queues, and then I bind these 8 tx & rx queues each to CPU cores 24-32 (NUMA3), which I think gives the best performance in my case (it's true on Linux 2.6.32):
> single queue -> single CPU
> Then I can describe a little bit about the packet generator: I configure the IXIA to continuously increase the destination IP address towards the test server, so the packets are evenly distributed to each receiving queue of eth10.
> And according to the IXIA tools the transmit shape was really good, no big peaks.
>
> What I observed on Linux 2.6.38 during the test: no softirqd was stressed (< 3% SI on each core (24-31)) while the packet loss happens, so we are not really stressing the CPU :). It looks like we are limited by some memory bandwidth (DMA) on this release.
>
> And with the same test case on 2.6.32, no such problem at all. It runs pretty stable at > 2Mpps without rx_missed_errors. There is no HW limitation on this DL580.
>
>
> BTW, what are these "swapper" entries?
> +   0.80%  swapper  [ixgbe]  [k] ixgbe_poll
> +   0.79%  perf     [ixgbe]  [k] ixgbe_poll
> Why is ixgbe_poll accounted to swapper/perf?
>
> Thanks
> WeiGu
>
> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Friday, April 08, 2011 8:57 PM
> To: Wei Gu
> Cc: Alexander Duyck; netdev; Kirsher, Jeffrey T
> Subject: RE: Low performance Intel 10GE NIC (3.2.10) on 2.6.38 Kernel
>
> On Friday, 08 April 2011 at 20:19 +0800, Wei Gu wrote:
> > Hi again,
> > I tried more testing by disabling CONFIG_DMAR, with the shipped
> > 2.6.38 ixgbe and the Intel-released 3.2.10/3.1.15.
> > In all these tests it looks like we can get >1Mpps with 400-byte packets,
> > but not stable at all; there will be a huge number of missed errors with
> > 100% CPU idle:
> > ethtool -S eth10 | grep rx_missed_errors
> >
> > rx_missed_errors: 76832040
> >
> > SUM: 1102212 ETH8: 0 ETH10: 1102212 ETH6: 0 ETH4: 0
> > SUM:  521841 ETH8: 0 ETH10:  521841 ETH6: 0 ETH4: 0
> > SUM:  426776 ETH8: 0 ETH10:  426776 ETH6: 0 ETH4: 0
> > SUM:  927520 ETH8: 0 ETH10:  927520 ETH6: 0 ETH4: 0
> > SUM: 1171995 ETH8: 0 ETH10: 1171995 ETH6: 0 ETH4: 0
> > SUM:  855980 ETH8: 0 ETH10:  855980 ETH6: 0 ETH4: 0
> >
> >
> > Do you know if there are other options in the kernel that can cause a
> > high rate of rx_missed_errors with low CPU usage?
> > (No problem on 2.6.32 with the same test case.)
> >
> > perf record:
> > +  69.74%  swapper  [kernel.kallsyms]  [k] poll_idle
> > +  11.62%  swapper  [kernel.kallsyms]  [k] intel_idle
> > +   0.80%  swapper  [ixgbe]            [k] ixgbe_poll
> > +   0.79%  perf     [ixgbe]            [k] ixgbe_poll
> > +   0.77%  perf     [kernel.kallsyms]  [k] skb_copy_bits
> > +   0.64%  swapper  [kernel.kallsyms]  [k] skb_copy_bits
> > +   0.48%  perf     [kernel.kallsyms]  [k] __kmalloc_node_track_caller
> > +   0.44%  swapper  [kernel.kallsyms]  [k] __kmalloc_node_track_caller
> > +   0.36%  swapper  [kernel.kallsyms]  [k] kmem_cache_alloc_node
> > +   0.35%  swapper  [kernel.kallsyms]  [k] kfree
> > +   0.35%  perf     [kernel.kallsyms]  [k] kmem_cache_alloc_node
> >
>
>
> Make sure enough cpus serve interrupts, _before_ even starting your stress test.
>
> Then, make sure traffic is distributed to many different queues.
> If a single flow is used, it probably uses a single queue -> single CPU.
>
> Say you have irq affinities set to fffffffffffff (all cpus able to serve IRQ X, Y, Z, T, ...)
>
> Then you have a network burst (because you start your packet generator at full rate), spread over many queues.
>
> CPU0 takes the hard interrupt for queue 0, eth8, and queues NAPI mode.
> CPU0 takes the hard interrupt for queue 0, eth10, and queues NAPI mode.
> CPU0 takes the hard interrupt for queue 1, eth8, and queues NAPI mode.
> CPU0 takes the hard interrupt for queue 1, eth10, and queues NAPI mode.
> CPU0 takes the hard interrupt for queue 2, eth8, and queues NAPI mode.
> CPU0 takes the hard interrupt for queue 2, eth10, and queues NAPI mode.
> ...
> CPU0 takes the hard interrupt for queue X, eth8, and queues NAPI mode.
> ...
>
> Then softirq processing can start, and only CPU0 is able to handle NAPI for all the queued devices. You are stuck, with CPU0 never leaving ksoftirqd.
>
> NAPI handling is always performed on the CPU that received the hardware interrupt, until we exit NAPI (and re-arm interrupt delivery).
> It cannot migrate to an "idle cpu".

For performance, you need to assign each network interrupt to a single CPU. There is no load-balancing effect in the IRQ controller.

If you have a multi-socket system, it is a good idea to put the IRQs for the NICs on the same socket as the bus interface. Multi-socket systems are really NUMA, and putting an IRQ on a non-local CPU has a measurable impact.

--
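
The one-queue-to-one-CPU pinning discussed above can be sketched as a small script. This is a hypothetical illustration, not something from the thread: it assumes the ixgbe-style "ethX-TxRx-N" vector names in /proc/interrupts, and reuses the device name (eth10), queue count (8), and starting core (24) that Wei Gu mentions; adjust all three for your system, and run as root.

```shell
#!/bin/sh
# Sketch: pin each rx/tx queue IRQ of a NIC to its own CPU core.
# Assumes ixgbe-style vector naming ("eth10-TxRx-0", ...) in
# /proc/interrupts; DEV, FIRST_CPU, and QUEUES are assumptions
# taken from the thread, not a definitive recipe.

DEV=${1:-eth10}
FIRST_CPU=24
QUEUES=8

q=0
while [ "$q" -lt "$QUEUES" ]; do
    cpu=$((FIRST_CPU + q))
    # /proc/irq/N/smp_affinity takes a hex CPU bitmask; set only bit $cpu.
    mask=$(printf '%x' $((1 << cpu)))
    # Look up the IRQ number assigned to this queue vector.
    irq=$(awk -v n="$DEV-TxRx-$q" '$0 ~ n { sub(":", "", $1); print $1; exit }' /proc/interrupts)
    if [ -n "$irq" ]; then
        echo "$mask" > "/proc/irq/$irq/smp_affinity"
        echo "IRQ $irq ($DEV-TxRx-$q) -> CPU $cpu (mask $mask)"
    fi
    q=$((q + 1))
done
```

Note that irqbalance, if running, may rewrite these affinities later, so it is usually stopped (or told to ban these IRQs) before pinning by hand.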