* e1000 performance issue in 4 simultaneous links
@ 2008-01-10 16:17 Breno Leitao
From: Breno Leitao @ 2008-01-10 16:17 UTC (permalink / raw)
To: netdev
Hello,
I've noticed that there is a performance issue when running netperf
against 4 e1000 links connected end-to-end to another machine with 4
e1000 interfaces.
I have 2 4-port cards on my machine, but the test is only using 2
ports on each card.
When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
of transfer rate. If I run 4 netperf against 4 different interfaces, I
get around 720 * 10^6 bits/sec.
If I run the same test against 2 interfaces I get a 940 * 10^6 bits/sec
transfer rate also, and if I run it against 3 interfaces I get around
850 * 10^6 bits/sec performance.
I got these results using the upstream netdev-2.6 branch kernel plus
David Miller's set of 7 NAPI patches[1]. On kernel 2.6.23.12 the result
is a bit worse, and the transfer rate was around 600 * 10^6
bits/sec.
[1] http://marc.info/?l=linux-netdev&m=119977075917488&w=2
PS: I am not using a switch between the interfaces (they are connected
end-to-end) and the connections are independent.
--
Breno Leitao <leitao@linux.vnet.ibm.com>
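[For illustration only: a minimal sketch of how a 4-link test like this can be
driven. The peer hostnames (peer1..peer4) and the 60-second run length are
assumptions, not details from this thread.]

    # one TCP_STREAM test per link, all started concurrently
    for peer in peer1 peer2 peer3 peer4; do      # assumed hostnames, one per e1000 link
        netperf -H $peer -t TCP_STREAM -l 60 > result.$peer &
    done
    wait                                         # per-link throughput ends up in result.*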
* Re: e1000 performance issue in 4 simultaneous links
From: Ben Hutchings @ 2008-01-10 16:36 UTC (permalink / raw)
To: Breno Leitao; +Cc: netdev

Breno Leitao wrote:
> Hello,
>
> I've noticed that there is a performance issue when running netperf
> against 4 e1000 links connected end-to-end to another machine with 4
> e1000 interfaces.
>
> I have 2 4-port cards on my machine, but the test is only using 2
> ports on each card.
>
> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> of transfer rate. If I run 4 netperf against 4 different interfaces, I
> get around 720 * 10^6 bits/sec.
<snip>

I take it that's the average for individual interfaces, not the
aggregate?  RX processing for multi-gigabits per second can be quite
expensive.  This can be mitigated by interrupt moderation and NAPI
polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
I don't think e1000 hardware does LRO, but the driver could presumably
be changed to use Linux's software LRO.

Even with these optimisations, if all RX processing is done on a
single CPU this can become a bottleneck.  Does the test system have
multiple CPUs?  Are IRQs for the multiple NICs balanced across
multiple CPUs?

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
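[For illustration: a sketch of what enabling jumbo frames would look like. The
interface name eth6 is an assumption, and — as pointed out later in the thread —
the MTU has to be raised on both machines for it to take effect.]

    # on both the netperf and netserver machines, for each link under test
    ip link set dev eth6 mtu 9000        # eth6 is an assumed interface name
    ip link show dev eth6 | grep mtu     # confirm the new MTU is in effect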
* Re: e1000 performance issue in 4 simultaneous links
From: Jeba Anandhan @ 2008-01-10 16:51 UTC (permalink / raw)
To: Ben Hutchings; +Cc: Breno Leitao, netdev

Ben,

I am facing a performance issue when we try to bond multiple interfaces
into a virtual interface. It could be related to this thread. My
questions are:

*) When we use multiple NICs, will the overall performance of the
   system be the summation of the individual links' XX bits/sec?
*) What factors improve the performance if we have multiple
   interfaces? [ kind of tuning the parameters in proc ]

Breno, I hope this thread will be helpful for the performance issue
which I have with the bonding driver.

Jeba

On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
> Breno Leitao wrote:
> > Hello,
> >
> > I've noticed that there is a performance issue when running netperf
> > against 4 e1000 links connected end-to-end to another machine with 4
> > e1000 interfaces.
> >
> > I have 2 4-port cards on my machine, but the test is only using 2
> > ports on each card.
> >
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
> <snip>
>
> I take it that's the average for individual interfaces, not the
> aggregate?  RX processing for multi-gigabits per second can be quite
> expensive.  This can be mitigated by interrupt moderation and NAPI
> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
> I don't think e1000 hardware does LRO, but the driver could presumably
> be changed to use Linux's software LRO.
>
> Even with these optimisations, if all RX processing is done on a
> single CPU this can become a bottleneck.  Does the test system have
> multiple CPUs?  Are IRQs for the multiple NICs balanced across
> multiple CPUs?
>
> Ben.
* Re: e1000 performance issue in 4 simultaneous links
From: Breno Leitao @ 2008-01-10 17:31 UTC (permalink / raw)
To: bhutchings

On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
> <snip>
>
> I take it that's the average for individual interfaces, not the
> aggregate?
Right, each of these results is for an individual interface. Otherwise,
we'd have a huge problem. :-)

> This can be mitigated by interrupt moderation and NAPI
> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
> I don't think e1000 hardware does LRO, but the driver could presumably
> be changed to use Linux's software LRO.
Without using these "features" and keeping the MTU at 1500, do you think
we could get better performance than this?

I also tried to increase my interface MTU to 9000, but I am afraid that
netperf only transmits packets with less than 1500. Still investigating.

> single CPU this can become a bottleneck.  Does the test system have
> multiple CPUs?  Are IRQs for the multiple NICs balanced across
> multiple CPUs?
Yes, this machine has 8 ppc 1.9GHz CPUs. And the IRQs are balanced
across the CPUs, as I see in /proc/interrupts:

# cat /proc/interrupts
           CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
 16:        940     760    1047     904     993     777     975     813  XICS  Level  IPI
 18:          4       3       4       1       3       6       8       3  XICS  Level  hvc_console
 19:          0       0       0       0       0       0       0       0  XICS  Level  RAS_EPOW
273:      10728   10850   10937   10833   10884   10788   10868   10776  XICS  Level  eth4
275:          0       0       0       0       0       0       0       0  XICS  Level  ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
277:     234933  230275  229770  234048  235906  229858  229975  233859  XICS  Level  eth6
278:     266225  267606  262844  265985  268789  266869  263110  267422  XICS  Level  eth7
279:        893     919     857     909     867     917     894     881  XICS  Level  eth0
305:     439246  439117  438495  436072  438053  440111  438973  438951  XICS  Level  eth0 Neterion Xframe II 10GbE network adapter
321:       3268    3088    3143    3113    3305    2982    3326    3084  XICS  Level  ipr
323:     268030  273207  269710  271338  270306  273258  270872  273281  XICS  Level  eth16
324:     215012  221102  219494  216732  216531  220460  219718  218654  XICS  Level  eth17
325:       7103    3580    7246    3475    7132    3394    7258    3435  XICS  Level  pata_pdc2027x
BAD:       4216

Thanks,
--
Breno Leitao <leitao@linux.vnet.ibm.com>
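[For illustration: one quick way to check whether a 9000-byte MTU is actually
usable end-to-end is an unfragmentable large ping. The payload size and peer
name are assumptions.]

    # 8972 = 9000-byte MTU minus 28 bytes of IP+ICMP headers; -M do forbids fragmentation
    ping -M do -s 8972 -c 3 <peer>    # succeeds only if both ends (and the path) accept jumbo frames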
* Re: e1000 performance issue in 4 simultaneous links
From: Kok, Auke @ 2008-01-10 18:18 UTC (permalink / raw)
To: Breno Leitao; +Cc: bhutchings, NetDev

Breno Leitao wrote:
> On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
>>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
>>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
>>> get around 720 * 10^6 bits/sec.
>> <snip>
>>
>> I take it that's the average for individual interfaces, not the
>> aggregate?
> Right, each of these results is for an individual interface. Otherwise,
> we'd have a huge problem. :-)
>
>> This can be mitigated by interrupt moderation and NAPI
>> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
>> I don't think e1000 hardware does LRO, but the driver could presumably
>> be changed to use Linux's software LRO.
> Without using these "features" and keeping the MTU at 1500, do you think
> we could get better performance than this?
>
> I also tried to increase my interface MTU to 9000, but I am afraid that
> netperf only transmits packets with less than 1500. Still investigating.
>
>> single CPU this can become a bottleneck.  Does the test system have
>> multiple CPUs?  Are IRQs for the multiple NICs balanced across
>> multiple CPUs?
> Yes, this machine has 8 ppc 1.9GHz CPUs. And the IRQs are balanced
> across the CPUs, as I see in /proc/interrupts:

which is wrong and hurts performance. you want your ethernet irq's to stick
to a CPU for long times to prevent cache thrash. please disable the in-kernel
irq balancing code and use the userspace `irqbalance` daemon.

Gee I should put that in my signature, I already wrote that twice today :)

Auke

> # cat /proc/interrupts
>            CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
>  16:        940     760    1047     904     993     777     975     813  XICS  Level  IPI
>  18:          4       3       4       1       3       6       8       3  XICS  Level  hvc_console
>  19:          0       0       0       0       0       0       0       0  XICS  Level  RAS_EPOW
> 273:      10728   10850   10937   10833   10884   10788   10868   10776  XICS  Level  eth4
> 275:          0       0       0       0       0       0       0       0  XICS  Level  ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
> 277:     234933  230275  229770  234048  235906  229858  229975  233859  XICS  Level  eth6
> 278:     266225  267606  262844  265985  268789  266869  263110  267422  XICS  Level  eth7
> 279:        893     919     857     909     867     917     894     881  XICS  Level  eth0
> 305:     439246  439117  438495  436072  438053  440111  438973  438951  XICS  Level  eth0 Neterion Xframe II 10GbE network adapter
> 321:       3268    3088    3143    3113    3305    2982    3326    3084  XICS  Level  ipr
> 323:     268030  273207  269710  271338  270306  273258  270872  273281  XICS  Level  eth16
> 324:     215012  221102  219494  216732  216531  220460  219718  218654  XICS  Level  eth17
> 325:       7103    3580    7246    3475    7132    3394    7258    3435  XICS  Level  pata_pdc2027x
> BAD:       4216
>
> Thanks,
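[For illustration: a sketch of switching from in-kernel balancing to the
userspace daemon. The exact init-script or package name varies by
distribution; these commands are an assumption, not taken from the thread.]

    # stop any balancer that is already running
    killall irqbalance 2>/dev/null
    # then either start the userspace daemon (it backgrounds itself)...
    irqbalance
    # ...or leave every balancer off and pin the IRQs by hand via /proc/irq/*/smp_affinity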
* Re: e1000 performance issue in 4 simultaneous links
From: Rick Jones @ 2008-01-10 18:37 UTC (permalink / raw)
To: Breno Leitao; +Cc: bhutchings, Linux Network Development list

> I also tried to increase my interface MTU to 9000, but I am afraid that
> netperf only transmits packets with less than 1500. Still investigating.

It may seem like picking a tiny nit, but netperf never transmits
packets.  It only provides buffers of specified size to the stack.  It
is then the stack which transmits and determines the size of the
packets on the network.

Drifting a bit more... While there are settings, conditions and known
stack behaviours where one can be confident of the packet size on the
network based on the options passed to netperf, generally speaking one
should not ass-u-me a direct relationship between the options one
passes to netperf and the size of the packets on the network.

And for JumboFrames to be effective it must be set on both ends,
otherwise the TCP MSS exchange will result in the smaller of the two
MTUs "winning" as it were.

>> single CPU this can become a bottleneck.  Does the test system have
>> multiple CPUs?  Are IRQs for the multiple NICs balanced across
>> multiple CPUs?
>
> Yes, this machine has 8 ppc 1.9GHz CPUs. And the IRQs are balanced
> across the CPUs, as I see in /proc/interrupts:

That suggests to me anyway that the dreaded irqbalanced is running,
shuffling the interrupts as you go.  Not often a happy place for
running netperf when one wants consistent results.

> # cat /proc/interrupts
>            CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
>  16:        940     760    1047     904     993     777     975     813  XICS  Level  IPI
>  18:          4       3       4       1       3       6       8       3  XICS  Level  hvc_console
>  19:          0       0       0       0       0       0       0       0  XICS  Level  RAS_EPOW
> 273:      10728   10850   10937   10833   10884   10788   10868   10776  XICS  Level  eth4
> 275:          0       0       0       0       0       0       0       0  XICS  Level  ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
> 277:     234933  230275  229770  234048  235906  229858  229975  233859  XICS  Level  eth6
> 278:     266225  267606  262844  265985  268789  266869  263110  267422  XICS  Level  eth7
> 279:        893     919     857     909     867     917     894     881  XICS  Level  eth0
> 305:     439246  439117  438495  436072  438053  440111  438973  438951  XICS  Level  eth0 Neterion Xframe II 10GbE network adapter
> 321:       3268    3088    3143    3113    3305    2982    3326    3084  XICS  Level  ipr
> 323:     268030  273207  269710  271338  270306  273258  270872  273281  XICS  Level  eth16
> 324:     215012  221102  219494  216732  216531  220460  219718  218654  XICS  Level  eth17
> 325:       7103    3580    7246    3475    7132    3394    7258    3435  XICS  Level  pata_pdc2027x
> BAD:       4216

IMO, what you want (in the absence of multi-queue NICs) is one CPU
taking the interrupts of one port/interface, and each port/interface's
interrupts going to a separate CPU.  So, something that looks roughly
like this concocted example:

        CPU0   CPU1   CPU2   CPU3
  1:    1234      0      0      0   eth0
  2:       0   1234      0      0   eth1
  3:       0      0   1234      0   eth2
  4:       0      0      0   1234   eth3

which you should be able to achieve via the method I think someone else
has already mentioned, of echoing values into
/proc/irq/<irq>/smp_affinity - after you have slain the dreaded
irqbalance daemon.

rick jones
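[For illustration: a sketch of the smp_affinity pinning Rick describes, using
the IRQ numbers from the /proc/interrupts output earlier in the thread. The
choice of CPUs 1-4 is an assumption; smp_affinity takes a hexadecimal CPU
bitmask.]

    # 02=CPU1, 04=CPU2, 08=CPU3, 10=CPU4 (hex bitmasks), run after stopping any IRQ balancer
    echo 02 > /proc/irq/277/smp_affinity   # eth6
    echo 04 > /proc/irq/278/smp_affinity   # eth7
    echo 08 > /proc/irq/323/smp_affinity   # eth16
    echo 10 > /proc/irq/324/smp_affinity   # eth17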
* Re: e1000 performance issue in 4 simultaneous links
From: Rick Jones @ 2008-01-10 18:26 UTC (permalink / raw)
To: Breno Leitao; +Cc: netdev

Many many things to check when running netperf :)

*) Are the cards on the same or separate PCImumble buses, and what sort
   of bus is it?

*) Is the two-interface result from two interfaces on the same
   four-port card, or from an interface on each of the two four-port
   cards?

*) Is there a dreaded (IMO) irqbalance daemon running?  One of the very
   first things I do when running netperf is terminate the irqbalance
   daemon with as extreme a prejudice as I can.

*) What is the distribution of interrupts from the interfaces to the
   CPUs?  If you've tried to set that manually, the dreaded irqbalance
   daemon will come along shortly thereafter and ruin everything.

*) What does netperf say about the overall CPU utilization of the
   system(s) when the tests are running?

*) What does top say about the utilization of any single CPU in the
   system(s) when the tests are running?

*) Are you using the global -T option to spread the netperf/netserver
   processes across the CPUs, or leaving that all up to the
   stack/scheduler/etc?

I suspect there could be more, but that is what comes to mind thus far
as things I often check when running netperf.

rick jones
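[For illustration: a sketch of a netperf invocation that answers two of the
points above — CPU-utilization reporting and explicit CPU placement. The peer
name, duration and CPU numbers are assumptions.]

    # -c/-C report local/remote CPU utilization, -T pins netperf,netserver to the given CPUs
    netperf -H <peer> -t TCP_STREAM -l 60 -c -C -T 1,1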
* RE: e1000 performance issue in 4 simultaneous links
From: Brandeburg, Jesse @ 2008-01-10 20:52 UTC (permalink / raw)
To: Breno Leitao; +Cc: netdev, Brandeburg, Jesse

Breno Leitao wrote:
> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> of transfer rate. If I run 4 netperf against 4 different interfaces, I
> get around 720 * 10^6 bits/sec.

This is actually a known issue that we have worked on with your company
before.  It comes down to your system's default behavior of round
robining interrupts (see cat /proc/interrupts while running the test)
combined with e1000's way of exiting / rescheduling NAPI.

The default round robin behavior of the interrupts on your system is
the root cause of this issue, and here is what happens:

4 interfaces start generating interrupts, and if you're lucky the round
robin balancer has them all on different cpus.  As the e1000 driver
goes into and out of polling mode, the round robin balancer keeps
moving the interrupt to the next cpu.  Eventually 2 or more driver
instances end up on the same CPU, which causes both driver instances to
stay in NAPI polling mode, due to the amount of work being done and the
fact that there are always more than "netdev->weight" packets to do for
each instance.  This keeps *hardware* interrupts for each interface
*disabled*.

Staying in NAPI polling mode causes higher cpu utilization on that one
processor, which guarantees that when the hardware round robin balancer
moves any other network interrupt onto that CPU, it too will join the
NAPI polling mode chain.

So no matter how many processors you have, this round robin style of
hardware interrupts guarantees that if there is a lot of work to do
(more than weight) at each softirq, then all network interfaces will
eventually end up on the same cpu (the busiest one).  Your performance
becomes the same as if you had booted with maxcpus=1.

I hope this explanation makes sense, but what it comes down to is that
combining hardware round robin balancing with NAPI is a BAD IDEA.  In
general the behavior of hardware round robin balancing is bad and I'm
sure it is causing all sorts of other performance issues that you may
not even be aware of.

I'm sure your problem will go away if you run e1000 in interrupt mode.
(use make CFLAGS_EXTRA=-DE1000_NO_NAPI)

> If I run the same test against 2 interfaces I get a 940 * 10^6
> bits/sec transfer rate also, and if I run it against 3 interfaces I
> get around 850 * 10^6 bits/sec performance.
>
> I got these results using the upstream netdev-2.6 branch kernel plus
> David Miller's set of 7 NAPI patches[1]. On kernel 2.6.23.12 the
> result is a bit worse, and the transfer rate was around 600 * 10^6
> bits/sec.

Thank you for testing the latest kernel.org kernel.  Hope this helps.
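[For illustration: the CFLAGS_EXTRA=-DE1000_NO_NAPI flag quoted above applies
to the standalone e1000 driver build; the tarball name, path and install step
below are assumptions about that out-of-tree package, not something stated in
the thread.]

    # inside the standalone e1000 driver source (version/path assumed)
    cd e1000-7.x.y/src
    make CFLAGS_EXTRA=-DE1000_NO_NAPI
    make install                 # then reload: rmmod e1000 && modprobe e1000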
* Re: e1000 performance issue in 4 simultaneous links
From: David Miller @ 2008-01-11  1:28 UTC (permalink / raw)
To: jesse.brandeburg; +Cc: leitao, netdev

From: "Brandeburg, Jesse" <jesse.brandeburg@intel.com>
Date: Thu, 10 Jan 2008 12:52:15 -0800

> I hope this explanation makes sense, but what it comes down to is that
> combining hardware round robin balancing with NAPI is a BAD IDEA.

Absolutely agreed on all counts.

No IRQ balancing should be done at all for networking device
interrupts, with zero exceptions.  It destroys performance.
* Re: e1000 performance issue in 4 simultaneous links
From: Benny Amorsen @ 2008-01-11 11:09 UTC (permalink / raw)
To: netdev

David Miller <davem@davemloft.net> writes:

> No IRQ balancing should be done at all for networking device
> interrupts, with zero exceptions.  It destroys performance.

Does irqbalanced need to be taught about this?  And how about the
initial balancing, so that each network card gets assigned to one CPU?

/Benny
* Re: e1000 performance issue in 4 simultaneous links
From: David Miller @ 2008-01-12  1:41 UTC (permalink / raw)
To: benny+usenet; +Cc: netdev

From: Benny Amorsen <benny+usenet@amorsen.dk>
Date: Fri, 11 Jan 2008 12:09:32 +0100

> David Miller <davem@davemloft.net> writes:
>
> > No IRQ balancing should be done at all for networking device
> > interrupts, with zero exceptions.  It destroys performance.
>
> Does irqbalanced need to be taught about this?

The userland one already does.

It's only the in-kernel IRQ load balancing for these (presumably
powerpc) platforms that is broken.
* Re: e1000 performance issue in 4 simultaneous links
From: Denys Fedoryshchenko @ 2008-01-12  5:13 UTC (permalink / raw)
To: David Miller, benny+usenet; +Cc: netdev

Sorry that I interfere in this subject.

Do you recommend CONFIG_IRQBALANCE to be enabled?

If it is enabled, irq's do not jump nonstop over the processors;
softirqd changes this behavior.  If it is disabled, irq's are
distributed over each processor, and on loaded systems that seems
harmful.  I worked a little yesterday with a server with
CONFIG_IRQBALANCE=no under a 160kpps load.  It was losing packets until
I set smp_affinity.

Maybe it would be useful to put more info in Kconfig, since this option
is very important for performance.

On Fri, 11 Jan 2008 17:41:09 -0800 (PST), David Miller wrote
> From: Benny Amorsen <benny+usenet@amorsen.dk>
> Date: Fri, 11 Jan 2008 12:09:32 +0100
>
> > David Miller <davem@davemloft.net> writes:
> >
> > > No IRQ balancing should be done at all for networking device
> > > interrupts, with zero exceptions.  It destroys performance.
> >
> > Does irqbalanced need to be taught about this?
>
> The userland one already does.
>
> It's only the in-kernel IRQ load balancing for these (presumably
> powerpc) platforms that is broken.

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
* Re: e1000 performance issue in 4 simultaneous links
From: Kok, Auke @ 2008-01-30 16:57 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: David Miller, benny+usenet, netdev

Denys Fedoryshchenko wrote:
> Sorry that I interfere in this subject.
>
> Do you recommend CONFIG_IRQBALANCE to be enabled?

I certainly do not.

Manual tweaking and pinning the irq's to the correct CPU will give the
best performance (for specific loads).

The userspace irqbalance daemon tries very hard to approximate this
behaviour and is what I recommend for most situations; it usually does
the right thing, and does so without making your head spin (just start
it).

The in-kernel one usually does the wrong thing for network loads.

Cheers,

Auke
* RE: e1000 performance issue in 4 simultaneous links
From: Breno Leitao @ 2008-01-11 16:20 UTC (permalink / raw)
To: Brandeburg, Jesse, rick.jones2; +Cc: netdev

On Thu, 2008-01-10 at 12:52 -0800, Brandeburg, Jesse wrote:
> Breno Leitao wrote:
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
>
> I hope this explanation makes sense, but what it comes down to is that
> combining hardware round robin balancing with NAPI is a BAD IDEA.  In
> general the behavior of hardware round robin balancing is bad and I'm
> sure it is causing all sorts of other performance issues that you may
> not even be aware of.

I've made another test removing the ppc IRQ round robin scheme, bound
each interface (eth6, eth7, eth16 and eth17) to a different CPU (CPU1,
CPU2, CPU3 and CPU4), and I still get around 720 * 10^6 bits/s on
average.

Take a look at the interrupt table this time:

io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
277:    15  1362450       13       14       13       14       15       18  XICS  Level  eth6
278:    12       13  1348681       19       13       15       10       11  XICS  Level  eth7
323:    11       18       17  1348426       18       11       11       13  XICS  Level  eth16
324:    12       16       11       19  1402709       13       14       11  XICS  Level  eth17

I also tried to bind all 4 interface IRQs to a single CPU (CPU0) using
the noirqdistrib boot parameter, and the performance was a little
worse.

Rick,

The 2-interface test that I showed in my first email was run on two
different NICs.  Also, I am running netperf with the following command
"netperf -H <hostname> -T 0,8" while netserver is running without any
argument at all.  Also, running vmstat in parallel shows that there is
no bottleneck in the CPU.  Take a look:

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
 2  0      0 6714732  16168 227440    0    0     8     2   203    21  0  1 98  0  0
 0  0      0 6715120  16176 227440    0    0     0    28 16234   505  0 16 83  0  1
 0  0      0 6715516  16176 227440    0    0     0     0 16251   518  0 16 83  0  1
 1  0      0 6715252  16176 227440    0    0     0     1 16316   497  0 15 84  0  1
 0  0      0 6716092  16176 227440    0    0     0     0 16300   520  0 16 83  0  1
 0  0      0 6716320  16180 227440    0    0     0     1 16354   486  0 15 84  0  1

Thanks!
--
Breno Leitao <leitao@linux.vnet.ibm.com>
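[For illustration: a quick way to read back the bindings described above and
confirm they stuck, using the IRQ numbers shown in the grep output. The loop
itself is a sketch, not a command from the thread.]

    # print the current CPU affinity mask for each NIC IRQ
    for irq in 277 278 323 324; do
        printf "%s: " $irq; cat /proc/irq/$irq/smp_affinity
    done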
* Re: e1000 performance issue in 4 simultaneous links
From: Eric Dumazet @ 2008-01-11 16:48 UTC (permalink / raw)
To: Breno Leitao; +Cc: Brandeburg, Jesse, rick.jones2, netdev

Breno Leitao a écrit :
> On Thu, 2008-01-10 at 12:52 -0800, Brandeburg, Jesse wrote:
>
>> Breno Leitao wrote:
>>
>>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
>>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
>>> get around 720 * 10^6 bits/sec.
>>>
>> I hope this explanation makes sense, but what it comes down to is that
>> combining hardware round robin balancing with NAPI is a BAD IDEA.  In
>> general the behavior of hardware round robin balancing is bad and I'm
>> sure it is causing all sorts of other performance issues that you may
>> not even be aware of.
>>
> I've made another test removing the ppc IRQ round robin scheme, bound
> each interface (eth6, eth7, eth16 and eth17) to a different CPU (CPU1,
> CPU2, CPU3 and CPU4), and I still get around 720 * 10^6 bits/s on
> average.
>
> Take a look at the interrupt table this time:
>
> io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
> 277:    15  1362450       13       14       13       14       15       18  XICS  Level  eth6
> 278:    12       13  1348681       19       13       15       10       11  XICS  Level  eth7
> 323:    11       18       17  1348426       18       11       11       13  XICS  Level  eth16
> 324:    12       16       11       19  1402709       13       14       11  XICS  Level  eth17
>
> I also tried to bind all 4 interface IRQs to a single CPU (CPU0) using
> the noirqdistrib boot parameter, and the performance was a little
> worse.
>
> Rick,
> The 2-interface test that I showed in my first email was run on two
> different NICs.  Also, I am running netperf with the following command
> "netperf -H <hostname> -T 0,8" while netserver is running without any
> argument at all.  Also, running vmstat in parallel shows that there is
> no bottleneck in the CPU.  Take a look:
>
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
>  r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
>  2  0      0 6714732  16168 227440    0    0     8     2   203    21  0  1 98  0  0
>  0  0      0 6715120  16176 227440    0    0     0    28 16234   505  0 16 83  0  1
>  0  0      0 6715516  16176 227440    0    0     0     0 16251   518  0 16 83  0  1
>  1  0      0 6715252  16176 227440    0    0     0     1 16316   497  0 15 84  0  1
>  0  0      0 6716092  16176 227440    0    0     0     0 16300   520  0 16 83  0  1
>  0  0      0 6716320  16180 227440    0    0     0     1 16354   486  0 15 84  0  1

If your machine has 8 cpus, then your vmstat output shows a
bottleneck :)

(100/8 = 12.5), so I guess one of your CPUs is full
* Re: e1000 performance issue in 4 simultaneous links
From: Denys Fedoryshchenko @ 2008-01-11 17:36 UTC (permalink / raw)
To: netdev

Maybe it is a good idea to use sysstat?

http://perso.wanadoo.fr/sebastien.godard/

For example:

visp-1 ~ # mpstat -P ALL 1
Linux 2.6.24-rc7-devel (visp-1)   01/11/08

19:27:57  CPU  %user  %nice   %sys %iowait   %irq  %soft %steal  %idle   intr/s
19:27:58  all   0.00   0.00   0.00    0.00   0.00   2.51   0.00  97.49  7707.00
19:27:58    0   0.00   0.00   0.00    0.00   0.00   4.00   0.00  96.00  1926.00
19:27:58    1   0.00   0.00   0.00    0.00   0.00   1.01   0.00  98.99  1926.00
19:27:58    2   0.00   0.00   0.00    0.00   0.00   5.00   0.00  95.00  1927.00
19:27:58    3   0.00   0.00   0.00    0.00   0.00   0.99   0.00  99.01  1927.00
19:27:58    4   0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00     0.00

> >>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> >>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
> >>> get around 720 * 10^6 bits/sec.
> >>>
> >> I hope this explanation makes sense, but what it comes down to is that
> >> combining hardware round robin balancing with NAPI is a BAD IDEA.  In
> >> general the behavior of hardware round robin balancing is bad and I'm
> >> sure it is causing all sorts of other performance issues that you may
> >> not even be aware of.
> >>
> > I've made another test removing the ppc IRQ round robin scheme, bound
> > each interface (eth6, eth7, eth16 and eth17) to a different CPU (CPU1,
> > CPU2, CPU3 and CPU4), and I still get around 720 * 10^6 bits/s on
> > average.
> >
> > Take a look at the interrupt table this time:
> >
> > io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
> > 277:    15  1362450       13       14       13       14       15       18  XICS  Level  eth6
> > 278:    12       13  1348681       19       13       15       10       11  XICS  Level  eth7
> > 323:    11       18       17  1348426       18       11       11       13  XICS  Level  eth16
> > 324:    12       16       11       19  1402709       13       14       11  XICS  Level  eth17
> >
> > I also tried to bind all 4 interface IRQs to a single CPU (CPU0) using
> > the noirqdistrib boot parameter, and the performance was a little
> > worse.
> >
> > Rick,
> > The 2-interface test that I showed in my first email was run on two
> > different NICs.  Also, I am running netperf with the following command
> > "netperf -H <hostname> -T 0,8" while netserver is running without any
> > argument at all.  Also, running vmstat in parallel shows that there is
> > no bottleneck in the CPU.  Take a look:
> >
> > procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
> >  r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
> >  2  0      0 6714732  16168 227440    0    0     8     2   203    21  0  1 98  0  0
> >  0  0      0 6715120  16176 227440    0    0     0    28 16234   505  0 16 83  0  1
> >  0  0      0 6715516  16176 227440    0    0     0     0 16251   518  0 16 83  0  1
> >  1  0      0 6715252  16176 227440    0    0     0     1 16316   497  0 15 84  0  1
> >  0  0      0 6716092  16176 227440    0    0     0     0 16300   520  0 16 83  0  1
> >  0  0      0 6716320  16180 227440    0    0     0     1 16354   486  0 15 84  0  1
>
> If your machine has 8 cpus, then your vmstat output shows a
> bottleneck :)
>
> (100/8 = 12.5), so I guess one of your CPUs is full

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
* Re: e1000 performance issue in 4 simultaneous links
From: Breno Leitao @ 2008-01-11 18:45 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev

Hello Denys,

I've installed sysstat (good tools!) and the result is very similar to
what top shows.  Take a look:

13:34:23  CPU  %user  %nice   %sys %iowait   %irq  %soft %steal  %idle    intr/s
13:34:24  all   0.00   0.00   2.72    0.00   0.25  12.13   0.99  83.91  16267.33
13:34:24    0   0.00   0.00  21.78    0.00   0.00   0.00   7.92  70.30     40.59
13:34:24    1   0.00   0.00   0.00    0.00   0.99  24.75   0.00  74.26   4025.74
13:34:24    2   0.00   0.00   0.00    0.00   0.99  24.75   0.00  74.26   4036.63
13:34:24    3   0.00   0.00   0.00    0.00   0.99  21.78   0.00  77.23   4032.67
13:34:24    4   0.00   0.00   0.00    0.00   0.98  24.51   0.00  74.51   4034.65
13:34:24    5   0.00   0.00   0.00    0.00   0.00   0.00   0.00 100.00     30.69
13:34:24    6   0.00   0.00   0.00    0.00   0.00   0.00   0.00 100.00     33.66
13:34:24    7   0.00   0.00   0.00    0.00   0.00   0.00   0.00 100.00     32.67

So, we can confirm that the IRQs are not being moved around between
CPUs, and that there isn't any processor overload.

Thanks!

On Fri, 2008-01-11 at 19:36 +0200, Denys Fedoryshchenko wrote:
> Maybe it is a good idea to use sysstat?
>
> http://perso.wanadoo.fr/sebastien.godard/
* Re: e1000 performance issue in 4 simultaneous links
From: Breno Leitao @ 2008-01-11 18:19 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Brandeburg, Jesse, rick.jones2, netdev

On Fri, 2008-01-11 at 17:48 +0100, Eric Dumazet wrote:
> Breno Leitao a écrit :
> > Take a look at the interrupt table this time:
> >
> > io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
> > 277:    15  1362450       13       14       13       14       15       18  XICS  Level  eth6
> > 278:    12       13  1348681       19       13       15       10       11  XICS  Level  eth7
> > 323:    11       18       17  1348426       18       11       11       13  XICS  Level  eth16
> > 324:    12       16       11       19  1402709       13       14       11  XICS  Level  eth17
> >
> If your machine has 8 cpus, then your vmstat output shows a
> bottleneck :)
>
> (100/8 = 12.5), so I guess one of your CPUs is full

Well, if I run top while running the test, I see the load distributed
among the CPUs, mainly on those that have a NIC IRQ bound to them.
Take a look:

Tasks: 133 total,   2 running, 130 sleeping,   0 stopped,   1 zombie
Cpu0  :  0.3%us, 19.5%sy,  0.0%ni, 73.5%id,  0.0%wa,  0.0%hi,  0.0%si,  6.6%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 75.1%id,  0.0%wa,  0.7%hi, 24.3%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 73.1%id,  0.0%wa,  0.7%hi, 26.2%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni, 76.1%id,  0.0%wa,  0.7%hi, 23.3%si,  0.0%st
Cpu4  :  0.0%us,  0.3%sy,  0.0%ni, 70.4%id,  0.7%wa,  0.3%hi, 28.2%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Note that this average scenario doesn't change during the entire
benchmarking test.

Thanks!
--
Breno Leitao <leitao@linux.vnet.ibm.com>
* Re: e1000 performance issue in 4 simultaneous links
From: Rick Jones @ 2008-01-11 18:48 UTC (permalink / raw)
To: Breno Leitao; +Cc: Eric Dumazet, Brandeburg, Jesse, netdev

Breno Leitao wrote:
> On Fri, 2008-01-11 at 17:48 +0100, Eric Dumazet wrote:
>> If your machine has 8 cpus, then your vmstat output shows a
>> bottleneck :)
>>
>> (100/8 = 12.5), so I guess one of your CPUs is full
>
> Well, if I run top while running the test, I see the load distributed
> among the CPUs, mainly on those that have a NIC IRQ bound to them.
> Take a look:
>
> Tasks: 133 total,   2 running, 130 sleeping,   0 stopped,   1 zombie
> Cpu0  :  0.3%us, 19.5%sy,  0.0%ni, 73.5%id,  0.0%wa,  0.0%hi,  0.0%si,  6.6%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 75.1%id,  0.0%wa,  0.7%hi, 24.3%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 73.1%id,  0.0%wa,  0.7%hi, 26.2%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni, 76.1%id,  0.0%wa,  0.7%hi, 23.3%si,  0.0%st
> Cpu4  :  0.0%us,  0.3%sy,  0.0%ni, 70.4%id,  0.7%wa,  0.3%hi, 28.2%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

If you have IRQs bound to CPUs 1-4, and have four netperfs running,
given that the stack ostensibly tries to have applications run on the
same CPUs, what is running on CPU0?  Is it related to:

> The 2-interface test that I showed in my first email was run on two
> different NICs.  Also, I am running netperf with the following command
> "netperf -H <hostname> -T 0,8" while netserver is running without any
> argument at all.  Also, running vmstat in parallel shows that there is
> no bottleneck in the CPU.  Take a look:

Unless you have a morbid curiosity :) there isn't much point in binding
all the netperfs to CPU 0 when the interrupts for the NICs servicing
their connections are on CPUs 1-4.  I also assume then that the
system(s) on which netserver is running have > 8 CPUs in them?  (There
are multiple destination systems, yes?)

Does anything change if you explicitly bind each netperf to the CPU on
which the interrupts for its connection are processed?  Or, for that
matter, if you remove the -T option entirely?

Does UDP_STREAM show different performance than TCP_STREAM?  (I'm
ass-u-me-ing, based on the above, that we are looking at the netperf
side of a TCP_STREAM test; please correct if otherwise.)

Are the CPUs above single-core CPUs or multi-core CPUs, and if
multi-core, are caches shared?  How are CPUs numbered if multi-core on
that system?  Is there any hardware threading involved?  I'm wondering
if there may be some wrinkles in the system that might lead to reported
CPU utilization being low even if a chip is otherwise saturated.  Might
need some HW counters to check that...

Can you describe the I/O subsystem more completely?  I understand that
you are using at most two ports of a pair of quad-port cards at any one
time, but am still curious to know if those two cards are on separate
buses, or if they share any bus/link on the way to memory.

rick jones
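[For illustration: a sketch of the two variations Rick suggests — binding each
netperf to the CPU that services its link's interrupts, and comparing with
UDP_STREAM. The hostnames, the remote CPU numbers and the CPU1-4 placement
follow the eth6/eth7/eth16/eth17 binding described earlier, but are otherwise
assumptions.]

    # one netperf per link, each pinned to the CPU that takes that link's IRQ
    netperf -H peer1 -T 1,1 -t TCP_STREAM -l 60 &
    netperf -H peer2 -T 2,2 -t TCP_STREAM -l 60 &
    netperf -H peer3 -T 3,3 -t TCP_STREAM -l 60 &
    netperf -H peer4 -T 4,4 -t TCP_STREAM -l 60 &
    wait
    # for comparison, repeat with -t UDP_STREAM instead of -t TCP_STREAM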
end of thread, other threads:[~2008-01-30 16:58 UTC | newest]

Thread overview: 19+ messages
2008-01-10 16:17 e1000 performance issue in 4 simultaneous links Breno Leitao
2008-01-10 16:36 ` Ben Hutchings
2008-01-10 16:51   ` Jeba Anandhan
2008-01-10 17:31   ` Breno Leitao
2008-01-10 18:18     ` Kok, Auke
2008-01-10 18:37     ` Rick Jones
2008-01-10 18:26 ` Rick Jones
2008-01-10 20:52 ` Brandeburg, Jesse
2008-01-11  1:28   ` David Miller
2008-01-11 11:09     ` Benny Amorsen
2008-01-12  1:41       ` David Miller
2008-01-12  5:13         ` Denys Fedoryshchenko
2008-01-30 16:57           ` Kok, Auke
2008-01-11 16:20   ` Breno Leitao
2008-01-11 16:48     ` Eric Dumazet
2008-01-11 17:36       ` Denys Fedoryshchenko
2008-01-11 18:45         ` Breno Leitao
2008-01-11 18:19       ` Breno Leitao
2008-01-11 18:48         ` Rick Jones