* e1000 performance issue in 4 simultaneous links
@ 2008-01-10 16:17 Breno Leitao
2008-01-10 16:36 ` Ben Hutchings
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Breno Leitao @ 2008-01-10 16:17 UTC (permalink / raw)
To: netdev
Hello,
I've noticed a performance issue when running netperf
against 4 e1000 links connected end-to-end to another machine with 4
e1000 interfaces.
I have two 4-port cards on my machine, but the test only uses 2 ports
on each card.
When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
of transfer rate. If I run 4 netperf against 4 different interfaces, I
get around 720 * 10^6 bits/sec.
If I run the same test against 2 interfaces I get a 940 * 10^6 bits/sec
transfer rate also, and if I run it against 3 interfaces I get around
850 * 10^6 bits/sec performance.
I got these results using the upstream netdev-2.6 branch kernel plus
David Miller's set of 7 NAPI patches[1]. With kernel 2.6.23.12 the result
is a bit worse: the transfer rate was around 600 * 10^6
bits/sec.
[1] http://marc.info/?l=linux-netdev&m=119977075917488&w=2
PS: I am not using a switch between the interfaces (they are connected
end-to-end) and the connections are independent.
--
Breno Leitao <leitao@linux.vnet.ibm.com>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-10 16:17 e1000 performance issue in 4 simultaneous links Breno Leitao
@ 2008-01-10 16:36 ` Ben Hutchings
2008-01-10 16:51 ` Jeba Anandhan
2008-01-10 17:31 ` Breno Leitao
2008-01-10 18:26 ` Rick Jones
2008-01-10 20:52 ` Brandeburg, Jesse
2 siblings, 2 replies; 19+ messages in thread
From: Ben Hutchings @ 2008-01-10 16:36 UTC (permalink / raw)
To: Breno Leitao; +Cc: netdev
Breno Leitao wrote:
> Hello,
>
> I've perceived that there is a performance issue when running netperf
> against 4 e1000 links connected end-to-end to another machine with 4
> e1000 interfaces.
>
> I have 2 4-port interfaces on my machine, but the test is just
> considering 2 port for each interfaces card.
>
> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> of transfer rate. If I run 4 netperf against 4 different interfaces, I
> get around 720 * 10^6 bits/sec.
<snip>
I take it that's the average for individual interfaces, not the
aggregate? RX processing for multi-gigabits per second can be quite
expensive. This can be mitigated by interrupt moderation and NAPI
polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
I don't think e1000 hardware does LRO, but the driver could presumably
be changed to use Linux's software LRO.
Even with these optimisations, if all RX processing is done on a
single CPU this can become a bottleneck. Does the test system have
multiple CPUs? Are IRQs for the multiple NICs balanced across
multiple CPUs?
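Roughly the sort of thing I mean, if it helps (the interface name and
throttle values below are only placeholders for your setup):
    # e1000 interrupt moderation is set via a module parameter, one value per port
    modprobe e1000 InterruptThrottleRate=3000,3000,3000,3000
    # jumbo frames -- the MTU has to be raised on both ends of each link
    ip link set dev eth6 mtu 9000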
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-10 16:36 ` Ben Hutchings
@ 2008-01-10 16:51 ` Jeba Anandhan
2008-01-10 17:31 ` Breno Leitao
1 sibling, 0 replies; 19+ messages in thread
From: Jeba Anandhan @ 2008-01-10 16:51 UTC (permalink / raw)
To: Ben Hutchings; +Cc: Breno Leitao, netdev
Ben,
I am facing a performance issue when we try to bond multiple
interfaces into a virtual interface. It could be related to this thread.
My questions are:
*) When we use multiple NICs, will the overall system throughput be
the sum of the individual links' XX bits/sec?
*) What factors improve performance when we have multiple
interfaces? [ e.g. tuning parameters in /proc ]
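For reference, the kind of setup I mean is roughly the following (the mode,
address and interface names are just an example, not my exact config):
    modprobe bonding mode=balance-rr miimon=100
    ifconfig bond0 192.168.0.10 netmask 255.255.255.0 up
    ifenslave bond0 eth0 eth1 eth2 eth3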
Breno,
I hope this thread will be helpful for the performance issue I am seeing
with the bonding driver.
Jeba
On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
> Breno Leitao wrote:
> > Hello,
> >
> > I've perceived that there is a performance issue when running netperf
> > against 4 e1000 links connected end-to-end to another machine with 4
> > e1000 interfaces.
> >
> > I have 2 4-port interfaces on my machine, but the test is just
> > considering 2 port for each interfaces card.
> >
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
> <snip>
>
> I take it that's the average for individual interfaces, not the
> aggregate? RX processing for multi-gigabits per second can be quite
> expensive. This can be mitigated by interrupt moderation and NAPI
> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
> I don't think e1000 hardware does LRO, but the driver could presumably
> be changed use Linux's software LRO.
>
> Even with these optimisations, if all RX processing is done on a
> single CPU this can become a bottleneck. Does the test system have
> multiple CPUs? Are IRQs for the multiple NICs balanced across
> multiple CPUs?
>
> Ben.
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-10 16:36 ` Ben Hutchings
2008-01-10 16:51 ` Jeba Anandhan
@ 2008-01-10 17:31 ` Breno Leitao
2008-01-10 18:18 ` Kok, Auke
2008-01-10 18:37 ` Rick Jones
1 sibling, 2 replies; 19+ messages in thread
From: Breno Leitao @ 2008-01-10 17:31 UTC (permalink / raw)
To: bhutchings
On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
> <snip>
>
> I take it that's the average for individual interfaces, not the
> aggregate?
Right, each of these results is for an individual interface. Otherwise,
we'd have a huge problem. :-)
> This can be mitigated by interrupt moderation and NAPI
> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
> I don't think e1000 hardware does LRO, but the driver could presumably
> be changed use Linux's software LRO.
Without using these "features" and keeping the MTU at 1500, do you think
we could get better performance than this?
I also tried to increase my interface MTU to 9000, but I am afraid that
netperf only transmits packets of less than 1500 bytes. Still investigating.
> single CPU this can become a bottleneck. Does the test system have
> multiple CPUs? Are IRQs for the multiple NICs balanced across
> multiple CPUs?
Yes, this machine has 8 ppc 1.9GHz CPUs, and the IRQs are balanced
across the CPUs, as I can see in /proc/interrupts:
# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
16: 940 760 1047 904 993 777 975 813 XICS Level IPI
18: 4 3 4 1 3 6 8 3 XICS Level hvc_console
19: 0 0 0 0 0 0 0 0 XICS Level RAS_EPOW
273: 10728 10850 10937 10833 10884 10788 10868 10776 XICS Level eth4
275: 0 0 0 0 0 0 0 0 XICS Level ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
277: 234933 230275 229770 234048 235906 229858 229975 233859 XICS Level eth6
278: 266225 267606 262844 265985 268789 266869 263110 267422 XICS Level eth7
279: 893 919 857 909 867 917 894 881 XICS Level eth0
305: 439246 439117 438495 436072 438053 440111 438973 438951 XICS Level eth0 Neterion Xframe II 10GbE network adapter
321: 3268 3088 3143 3113 3305 2982 3326 3084 XICS Level ipr
323: 268030 273207 269710 271338 270306 273258 270872 273281 XICS Level eth16
324: 215012 221102 219494 216732 216531 220460 219718 218654 XICS Level eth17
325: 7103 3580 7246 3475 7132 3394 7258 3435 XICS Level pata_pdc2027x
BAD: 4216
Thanks,
--
Breno Leitao <leitao@linux.vnet.ibm.com>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-10 17:31 ` Breno Leitao
@ 2008-01-10 18:18 ` Kok, Auke
2008-01-10 18:37 ` Rick Jones
1 sibling, 0 replies; 19+ messages in thread
From: Kok, Auke @ 2008-01-10 18:18 UTC (permalink / raw)
To: Breno Leitao; +Cc: bhutchings, NetDev
Breno Leitao wrote:
> On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
>>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
>>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
>>> get around 720 * 10^6 bits/sec.
>> <snip>
>>
>> I take it that's the average for individual interfaces, not the
>> aggregate?
> Right, each of these results are for individual interfaces. Otherwise,
> we'd have a huge problem. :-)
>
>> This can be mitigated by interrupt moderation and NAPI
>> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
>> I don't think e1000 hardware does LRO, but the driver could presumably
>> be changed use Linux's software LRO.
> Without using these "features" and keeping the MTU as 1500, do you think
> we could get a better performance than this one?
>
> I also tried to increase my interface MTU to 9000, but I am afraid that
> netperf only transmits packets with less than 1500. Still investigating.
>
>> single CPU this can become a bottleneck. Does the test system have
>> multiple CPUs? Are IRQs for the multiple NICs balanced across
>> multiple CPUs?
> Yes, this machine has 8 ppc 1.9Ghz CPUs. And the IRQs are balanced
> across the CPUs, as I see in /proc/interrupts:
which is wrong and hurts performance. You want your ethernet IRQs to stick to a
CPU for long periods to prevent cache thrashing.
Please disable the in-kernel IRQ balancing code and use the userspace `irqbalance`
daemon.
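A quick way to see whether something is still rotating the interrupts is to
watch the per-CPU counts while the test runs, e.g.:
    watch -n 1 'grep eth /proc/interrupts'
If the counters keep growing on every CPU for a single interface (as in your
output above), the balancing is still active.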
Gee I should put that in my signature, I already wrote that twice today :)
Auke
>
> # cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
> 16: 940 760 1047 904 993 777 975 813 XICS Level IPI
> 18: 4 3 4 1 3 6 8 3 XICS Level hvc_console
> 19: 0 0 0 0 0 0 0 0 XICS Level RAS_EPOW
> 273: 10728 10850 10937 10833 10884 10788 10868 10776 XICS Level eth4
> 275: 0 0 0 0 0 0 0 0 XICS Level ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
> 277: 234933 230275 229770 234048 235906 229858 229975 233859 XICS Level eth6
> 278: 266225 267606 262844 265985 268789 266869 263110 267422 XICS Level eth7
> 279: 893 919 857 909 867 917 894 881 XICS Level eth0
> 305: 439246 439117 438495 436072 438053 440111 438973 438951 XICS Level eth0 Neterion Xframe II 10GbE network adapter
> 321: 3268 3088 3143 3113 3305 2982 3326 3084 XICS Level ipr
> 323: 268030 273207 269710 271338 270306 273258 270872 273281 XICS Level eth16
> 324: 215012 221102 219494 216732 216531 220460 219718 218654 XICS Level eth17
> 325: 7103 3580 7246 3475 7132 3394 7258 3435 XICS Level pata_pdc2027x
> BAD: 4216
>
> Thanks,
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-10 16:17 e1000 performance issue in 4 simultaneous links Breno Leitao
2008-01-10 16:36 ` Ben Hutchings
@ 2008-01-10 18:26 ` Rick Jones
2008-01-10 20:52 ` Brandeburg, Jesse
2 siblings, 0 replies; 19+ messages in thread
From: Rick Jones @ 2008-01-10 18:26 UTC (permalink / raw)
To: Breno Leitao; +Cc: netdev
Many many things to check when running netperf :)
*) Are the cards on the same or separate PCImumble bus, and what sort
of bus is it?
*) is the two-interface performance from two interfaces on the same four-port
card, or an interface from each of the two four-port cards?
*) is there a dreaded (IMO) irqbalance daemon running? one of the very
first things I do when running netperf is terminate the irqbalance
daemon with as extreme a prejudice as I can.
*) what is the distribution of interrupts from the interfaces to the
CPUs? if you've tried to set that manually, the dreaded irqbalance
daemon will come along shortly thereafter and ruin everything.
*) what does netperf say about the overall CPU utilization of the
system(s) when the tests are running?
*) what does top say about the utilization of any single CPU in the
system(s) when the tests are running?
*) are you using the global -T option to spread the netperf/netserver
processes across the CPUs, or leaving that all up to the
stack/scheduler/etc?
I suspect there could be more, but that is what comes to mind thus far
as far as things I often check when running netperf.
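FWIW, the sort of invocation I tend to use when chasing this kind of thing
looks roughly like the following (host and CPU numbers are placeholders):
    netperf -H 192.168.10.2 -t TCP_STREAM -l 60 -c -C -T 1,1
where -c/-C report local/remote CPU utilization and -T binds netperf and
netserver to the named CPUs.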
rick jones
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-10 17:31 ` Breno Leitao
2008-01-10 18:18 ` Kok, Auke
@ 2008-01-10 18:37 ` Rick Jones
1 sibling, 0 replies; 19+ messages in thread
From: Rick Jones @ 2008-01-10 18:37 UTC (permalink / raw)
To: Breno Leitao; +Cc: bhutchings, Linux Network Development list
> I also tried to increase my interface MTU to 9000, but I am afraid that
> netperf only transmits packets with less than 1500. Still investigating.
It may seem like picking a tiny nit, but netperf never transmits
packets. It only provides buffers of specified size to the stack. It is
then the stack which transmits and determines the size of the packets on
the network.
Drifting a bit more...
While there are settings, conditions and known stack behaviours where
one can be confident of the packet size on the network based on the
options passed to netperf, generally speaking one should not ass-u-me a
direct relationship between the options one passes to netperf and the
size of the packets on the network.
And for JumboFrames to be effective it must be set on both ends,
otherwise the TCP MSS exchange will result in the smaller of the two
MTUs "winning", as it were.
>>single CPU this can become a bottleneck. Does the test system have
>>multiple CPUs? Are IRQs for the multiple NICs balanced across
>>multiple CPUs?
>
> Yes, this machine has 8 ppc 1.9Ghz CPUs. And the IRQs are balanced
> across the CPUs, as I see in /proc/interrupts:
That suggests to me, anyway, that the dreaded irqbalanced is running,
shuffling the interrupts as you go. Not often a happy place for running
netperf when one wants consistent results.
>
> # cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
> 16: 940 760 1047 904 993 777 975 813 XICS Level IPI
> 18: 4 3 4 1 3 6 8 3 XICS Level hvc_console
> 19: 0 0 0 0 0 0 0 0 XICS Level RAS_EPOW
> 273: 10728 10850 10937 10833 10884 10788 10868 10776 XICS Level eth4
> 275: 0 0 0 0 0 0 0 0 XICS Level ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
> 277: 234933 230275 229770 234048 235906 229858 229975 233859 XICS Level eth6
> 278: 266225 267606 262844 265985 268789 266869 263110 267422 XICS Level eth7
> 279: 893 919 857 909 867 917 894 881 XICS Level eth0
> 305: 439246 439117 438495 436072 438053 440111 438973 438951 XICS Level eth0 Neterion Xframe II 10GbE network adapter
> 321: 3268 3088 3143 3113 3305 2982 3326 3084 XICS Level ipr
> 323: 268030 273207 269710 271338 270306 273258 270872 273281 XICS Level eth16
> 324: 215012 221102 219494 216732 216531 220460 219718 218654 XICS Level eth17
> 325: 7103 3580 7246 3475 7132 3394 7258 3435 XICS Level pata_pdc2027x
> BAD: 4216
IMO, what you want (in the absence of multi-queue NICs) is one CPU
taking the interrupts of one port/interface, and each port/interface's
interrupts going to a separate CPU. So, something that looks roughly
like this concocted example:
CPU0 CPU1 CPU2 CPU3
1: 1234 0 0 0 eth0
2: 0 1234 0 0 eth1
3: 0 0 1234 0 eth2
4: 0 0 0 1234 eth3
which you should be able to achieve via the method I think someone else
has already mentioned: echoing values into
/proc/irq/<irq>/smp_affinity - after you have slain the dreaded
irqbalance daemon.
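That is, something along these lines (the IRQ numbers here match the
concocted table above, not your real ones; the value written is a hex
bitmask of CPUs):
    echo 1 > /proc/irq/1/smp_affinity   # eth0 -> CPU0
    echo 2 > /proc/irq/2/smp_affinity   # eth1 -> CPU1
    echo 4 > /proc/irq/3/smp_affinity   # eth2 -> CPU2
    echo 8 > /proc/irq/4/smp_affinity   # eth3 -> CPU3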
rick jones
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: e1000 performance issue in 4 simultaneous links
2008-01-10 16:17 e1000 performance issue in 4 simultaneous links Breno Leitao
2008-01-10 16:36 ` Ben Hutchings
2008-01-10 18:26 ` Rick Jones
@ 2008-01-10 20:52 ` Brandeburg, Jesse
2008-01-11 1:28 ` David Miller
2008-01-11 16:20 ` Breno Leitao
2 siblings, 2 replies; 19+ messages in thread
From: Brandeburg, Jesse @ 2008-01-10 20:52 UTC (permalink / raw)
To: Breno Leitao; +Cc: netdev, Brandeburg, Jesse
Breno Leitao wrote:
> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> of transfer rate. If I run 4 netperf against 4 different interfaces, I
> get around 720 * 10^6 bits/sec.
This is actually a known issue that we have worked on with your company
before. It comes down to your system's default behavior of round-robining
interrupts (see cat /proc/interrupts while running the test)
combined with e1000's way of exiting / rescheduling NAPI.
The default round robin behavior of the interrupts on your system is the
root cause of this issue, and here is what happens:
The 4 interfaces start generating interrupts; if you're lucky, the round
robin balancer has them all on different CPUs.
As the e1000 driver goes into and out of polling mode, the round robin
balancer keeps moving the interrupt to the next CPU.
Eventually 2 or more driver instances end up on the same CPU, which
causes both driver instances to stay in NAPI polling mode, due to the
amount of work being done and the fact that there are always more than
"netdev->weight" packets to do for each instance. This keeps *hardware*
interrupts for each interface *disabled*.
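In pseudo-code, the poll loop being described works roughly like this (the
helper names are illustrative, not the actual e1000 functions; this uses the
pre-2.6.24 NAPI poll prototype):
    /* rough sketch of a NAPI poll handler, not the real driver code */
    int poll(struct net_device *dev, int *budget)
    {
            int work = clean_rx_ring(dev, min(*budget, dev->weight));
            *budget -= work;
            if (work < dev->weight) {
                    netif_rx_complete(dev); /* leave polling mode...       */
                    enable_hw_irqs(dev);    /* ...and unmask the interrupt */
                    return 0;
            }
            return 1; /* more to do: stay polled, hardware IRQs stay masked */
    }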
Staying in NAPI polling mode causes higher cpu utilization on that one
processor, which guarantees that when the hardware round robin balancer
moves any other network interrupt onto that CPU, it too will join the
NAPI polling mode chain.
So no matter how many processors you have, with this round robin style
of hardware interrupts, if there is a lot of work to do (more than
weight) at each softirq, then all network interfaces will eventually end
up on the same CPU (the busiest one).
Your performance becomes the same as if you had booted with maxcpus=1.
I hope this explanation makes sense, but what it comes down to is that
combining hardware round robin balancing with NAPI is a BAD IDEA. In
general the behavior of hardware round robin balancing is bad and I'm
sure it is causing all sorts of other performance issues that you may
not even be aware of.
I'm sure your problem will go away if you run e1000 in interrupt mode.
(use make CFLAGS_EXTRA=-DE1000_NO_NAPI)
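With the out-of-tree e1000 package that is roughly (the path is just
wherever your driver tarball was unpacked):
    cd e1000-<version>/src
    make CFLAGS_EXTRA=-DE1000_NO_NAPI
    rmmod e1000 && insmod e1000.ko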
> If I run the same test against 2 interfaces I get a 940 * 10^6
> bits/sec transfer rate also, and if I run it against 3 interfaces I
> get around 850 * 10^6 bits/sec performance.
>
> I got this results using the upstream netdev-2.6 branch kernel plus
> David Miller's 7 NAPI patches set[1]. In the kernel 2.6.23.12 the
> result is a bit worse, and the the transfer rate was around 600 * 10^6
> bits/sec.
Thank you for testing the latest kernel.org kernel.
Hope this helps.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-10 20:52 ` Brandeburg, Jesse
@ 2008-01-11 1:28 ` David Miller
2008-01-11 11:09 ` Benny Amorsen
2008-01-11 16:20 ` Breno Leitao
1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2008-01-11 1:28 UTC (permalink / raw)
To: jesse.brandeburg; +Cc: leitao, netdev
From: "Brandeburg, Jesse" <jesse.brandeburg@intel.com>
Date: Thu, 10 Jan 2008 12:52:15 -0800
> I hope this explanation makes sense, but what it comes down to is that
> combining hardware round robin balancing with NAPI is a BAD IDEA.
Absolutely agreed on all counts.
No IRQ balancing should be done at all for networking device
interrupts, with zero exceptions. It destroys performance.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-11 1:28 ` David Miller
@ 2008-01-11 11:09 ` Benny Amorsen
2008-01-12 1:41 ` David Miller
0 siblings, 1 reply; 19+ messages in thread
From: Benny Amorsen @ 2008-01-11 11:09 UTC (permalink / raw)
To: netdev
David Miller <davem@davemloft.net> writes:
> No IRQ balancing should be done at all for networking device
> interrupts, with zero exceptions. It destroys performance.
Does irqbalanced need to be taught about this? And how about the
initial balancing, so that each network card gets assigned to one CPU?
/Benny
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: e1000 performance issue in 4 simultaneous links
2008-01-10 20:52 ` Brandeburg, Jesse
2008-01-11 1:28 ` David Miller
@ 2008-01-11 16:20 ` Breno Leitao
2008-01-11 16:48 ` Eric Dumazet
1 sibling, 1 reply; 19+ messages in thread
From: Breno Leitao @ 2008-01-11 16:20 UTC (permalink / raw)
To: Brandeburg, Jesse, rick.jones2; +Cc: netdev
On Thu, 2008-01-10 at 12:52 -0800, Brandeburg, Jesse wrote:
> Breno Leitao wrote:
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
>
> I hope this explanation makes sense, but what it comes down to is that
> combining hardware round robin balancing with NAPI is a BAD IDEA. In
> general the behavior of hardware round robin balancing is bad and I'm
> sure it is causing all sorts of other performance issues that you may
> not even be aware of.
I've run another test, removing the ppc IRQ round-robin scheme and binding
each interface (eth6, eth7, eth16 and eth17) to a different CPU (CPU1,
CPU2, CPU3 and CPU4), and I still get around 720 * 10^6 bits/s on
average.
Take a look at the interrupt table this time:
io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
277: 15 1362450 13 14 13 14 15 18 XICS Level eth6
278: 12 13 1348681 19 13 15 10 11 XICS Level eth7
323: 11 18 17 1348426 18 11 11 13 XICS Level eth16
324: 12 16 11 19 1402709 13 14 11 XICS Level eth17
I also tried binding all 4 interface IRQs to a single CPU (CPU0)
using the noirqdistrib boot parameter, and the performance was a little
worse.
Rick,
The 2-interface test that I showed in my first email was run on two
different NICs. Also, I am running netperf with the following command:
"netperf -H <hostname> -T 0,8", while netserver is running without any
arguments at all. Also, running vmstat in parallel shows that there is no
bottleneck in the CPU. Take a look:
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 6714732 16168 227440 0 0 8 2 203 21 0 1 98 0 0
0 0 0 6715120 16176 227440 0 0 0 28 16234 505 0 16 83 0 1
0 0 0 6715516 16176 227440 0 0 0 0 16251 518 0 16 83 0 1
1 0 0 6715252 16176 227440 0 0 0 1 16316 497 0 15 84 0 1
0 0 0 6716092 16176 227440 0 0 0 0 16300 520 0 16 83 0 1
0 0 0 6716320 16180 227440 0 0 0 1 16354 486 0 15 84 0 1
Thanks!
--
Breno Leitao <leitao@linux.vnet.ibm.com>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-11 16:20 ` Breno Leitao
@ 2008-01-11 16:48 ` Eric Dumazet
2008-01-11 17:36 ` Denys Fedoryshchenko
2008-01-11 18:19 ` Breno Leitao
0 siblings, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2008-01-11 16:48 UTC (permalink / raw)
To: Breno Leitao; +Cc: Brandeburg, Jesse, rick.jones2, netdev
Breno Leitao a écrit :
> On Thu, 2008-01-10 at 12:52 -0800, Brandeburg, Jesse wrote:
>
>> Breno Leitao wrote:
>>
>>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
>>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
>>> get around 720 * 10^6 bits/sec.
>>>
>> I hope this explanation makes sense, but what it comes down to is that
>> combining hardware round robin balancing with NAPI is a BAD IDEA. In
>> general the behavior of hardware round robin balancing is bad and I'm
>> sure it is causing all sorts of other performance issues that you may
>> not even be aware of.
>>
> I've made another test removing the ppc IRQ Round Robin scheme, bonded
> each interface (eth6, eth7, eth16 and eth17) to different CPUs (CPU1,
> CPU2, CPU3 and CPU4) and I also get around around 720 * 10^6 bits/s in
> average.
>
> Take a look at the interrupt table this time:
>
> io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
> 277: 15 1362450 13 14 13 14 15 18 XICS Level eth6
> 278: 12 13 1348681 19 13 15 10 11 XICS Level eth7
> 323: 11 18 17 1348426 18 11 11 13 XICS Level eth16
> 324: 12 16 11 19 1402709 13 14 11 XICS Level eth17
>
>
> I also tried to bound all the 4 interface IRQ to a single CPU (CPU0)
> using the noirqdistrib boot paramenter, and the performance was a little
> worse.
>
> Rick,
> The 2 interface test that I showed in my first email, was run in two
> different NIC. Also, I am running netperf with the following command
> "netperf -H <hostname> -T 0,8" while netserver is running without any
> argument at all. Also, running vmstat in parallel shows that there is no
> bottleneck in the CPU. Take a look:
>
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 2 0 0 6714732 16168 227440 0 0 8 2 203 21 0 1 98 0 0
> 0 0 0 6715120 16176 227440 0 0 0 28 16234 505 0 16 83 0 1
> 0 0 0 6715516 16176 227440 0 0 0 0 16251 518 0 16 83 0 1
> 1 0 0 6715252 16176 227440 0 0 0 1 16316 497 0 15 84 0 1
> 0 0 0 6716092 16176 227440 0 0 0 0 16300 520 0 16 83 0 1
> 0 0 0 6716320 16180 227440 0 0 0 1 16354 486 0 15 84 0 1
>
>
>
If your machine has 8 CPUs, then your vmstat output does show a bottleneck :)
(100/8 = 12.5, and you are seeing ~16% system time), so I guess one of your CPUs is fully loaded.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-11 16:48 ` Eric Dumazet
@ 2008-01-11 17:36 ` Denys Fedoryshchenko
2008-01-11 18:45 ` Breno Leitao
2008-01-11 18:19 ` Breno Leitao
1 sibling, 1 reply; 19+ messages in thread
From: Denys Fedoryshchenko @ 2008-01-11 17:36 UTC (permalink / raw)
To: netdev
Maybe it is a good idea to use sysstat?
http://perso.wanadoo.fr/sebastien.godard/
For example:
visp-1 ~ # mpstat -P ALL 1
Linux 2.6.24-rc7-devel (visp-1) 01/11/08
19:27:57     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
19:27:58     all    0.00    0.00    0.00    0.00    0.00    2.51    0.00   97.49   7707.00
19:27:58       0    0.00    0.00    0.00    0.00    0.00    4.00    0.00   96.00   1926.00
19:27:58       1    0.00    0.00    0.00    0.00    0.00    1.01    0.00   98.99   1926.00
19:27:58       2    0.00    0.00    0.00    0.00    0.00    5.00    0.00   95.00   1927.00
19:27:58       3    0.00    0.00    0.00    0.00    0.00    0.99    0.00   99.01   1927.00
19:27:58       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> >>
> >>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> >>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
> >>> get around 720 * 10^6 bits/sec.
> >>>
> >> I hope this explanation makes sense, but what it comes down to is that
> >> combining hardware round robin balancing with NAPI is a BAD IDEA. In
> >> general the behavior of hardware round robin balancing is bad and I'm
> >> sure it is causing all sorts of other performance issues that you may
> >> not even be aware of.
> >>
> > I've made another test removing the ppc IRQ Round Robin scheme, bonded
> > each interface (eth6, eth7, eth16 and eth17) to different CPUs (CPU1,
> > CPU2, CPU3 and CPU4) and I also get around around 720 * 10^6 bits/s in
> > average.
> >
> > Take a look at the interrupt table this time:
> >
> > io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
> > 277: 15 1362450 13 14 13 14 15 18 XICS Level eth6
> > 278: 12 13 1348681 19 13 15 10 11 XICS Level eth7
> > 323: 11 18 17 1348426 18 11 11 13 XICS Level eth16
> > 324: 12 16 11 19 1402709 13 14 11 XICS Level eth17
> >
> >
> > I also tried to bound all the 4 interface IRQ to a single CPU (CPU0)
> > using the noirqdistrib boot paramenter, and the performance was a little
> > worse.
> >
> > Rick,
> > The 2 interface test that I showed in my first email, was run in two
> > different NIC. Also, I am running netperf with the following command
> > "netperf -H <hostname> -T 0,8" while netserver is running without any
> > argument at all. Also, running vmstat in parallel shows that there is no
> > bottleneck in the CPU. Take a look:
> >
> > procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 2 0 0 6714732 16168 227440 0 0 8 2 203 21 0 1 98 0 0
> > 0 0 0 6715120 16176 227440 0 0 0 28 16234 505 0 16 83 0 1
> > 0 0 0 6715516 16176 227440 0 0 0 0 16251 518 0 16 83 0 1
> > 1 0 0 6715252 16176 227440 0 0 0 1 16316 497 0 15 84 0 1
> > 0 0 0 6716092 16176 227440 0 0 0 0 16300 520 0 16 83 0 1
> > 0 0 0 6716320 16180 227440 0 0 0 1 16354 486 0 15 84 0 1
> >
> >
> >
> If your machine has 8 cpus, then your vmstat output shows a
> bottleneck :)
>
> (100/8 = 12.5), so I guess one of your CPU is full
>
--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-11 16:48 ` Eric Dumazet
2008-01-11 17:36 ` Denys Fedoryshchenko
@ 2008-01-11 18:19 ` Breno Leitao
2008-01-11 18:48 ` Rick Jones
1 sibling, 1 reply; 19+ messages in thread
From: Breno Leitao @ 2008-01-11 18:19 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Brandeburg, Jesse, rick.jones2, netdev
On Fri, 2008-01-11 at 17:48 +0100, Eric Dumazet wrote:
> Breno Leitao a écrit :
> > Take a look at the interrupt table this time:
> >
> > io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
> > 277: 15 1362450 13 14 13 14 15 18 XICS Level eth6
> > 278: 12 13 1348681 19 13 15 10 11 XICS Level eth7
> > 323: 11 18 17 1348426 18 11 11 13 XICS Level eth16
> > 324: 12 16 11 19 1402709 13 14 11 XICS Level eth17
> >
> >
> >
> If your machine has 8 cpus, then your vmstat output shows a bottleneck :)
>
> (100/8 = 12.5), so I guess one of your CPU is full
Well, if I run top while running the test, I see this load distributed
among the CPUs, mainly those that have a NIC IRQ bound. Take a look:
Tasks: 133 total, 2 running, 130 sleeping, 0 stopped, 1 zombie
Cpu0 : 0.3%us, 19.5%sy, 0.0%ni, 73.5%id, 0.0%wa, 0.0%hi, 0.0%si, 6.6%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 75.1%id, 0.0%wa, 0.7%hi, 24.3%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 73.1%id, 0.0%wa, 0.7%hi, 26.2%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni, 76.1%id, 0.0%wa, 0.7%hi, 23.3%si, 0.0%st
Cpu4 : 0.0%us, 0.3%sy, 0.0%ni, 70.4%id, 0.7%wa, 0.3%hi, 28.2%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Note that this overall picture doesn't change during the entire
benchmark run.
Thanks!
--
Breno Leitao <leitao@linux.vnet.ibm.com>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-11 17:36 ` Denys Fedoryshchenko
@ 2008-01-11 18:45 ` Breno Leitao
0 siblings, 0 replies; 19+ messages in thread
From: Breno Leitao @ 2008-01-11 18:45 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: netdev
Hello Denys,
I've installed sysstat (good tools!) and the result is very similar
to what top shows. Take a look:
13:34:23 CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
13:34:24 all 0.00 0.00 2.72 0.00 0.25 12.13 0.99 83.91 16267.33
13:34:24 0 0.00 0.00 21.78 0.00 0.00 0.00 7.92 70.30 40.59
13:34:24 1 0.00 0.00 0.00 0.00 0.99 24.75 0.00 74.26 4025.74
13:34:24 2 0.00 0.00 0.00 0.00 0.99 24.75 0.00 74.26 4036.63
13:34:24 3 0.00 0.00 0.00 0.00 0.99 21.78 0.00 77.23 4032.67
13:34:24 4 0.00 0.00 0.00 0.00 0.98 24.51 0.00 74.51 4034.65
13:34:24 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 30.69
13:34:24 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 33.66
13:34:24 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 32.67
So, we can be sure that the IRQs are not being balanced, and that there
isn't any processor overload.
Thanks!
On Fri, 2008-01-11 at 19:36 +0200, Denys Fedoryshchenko wrote:
> Maybe good idea to use sysstat ?
>
> http://perso.wanadoo.fr/sebastien.godard/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-11 18:19 ` Breno Leitao
@ 2008-01-11 18:48 ` Rick Jones
0 siblings, 0 replies; 19+ messages in thread
From: Rick Jones @ 2008-01-11 18:48 UTC (permalink / raw)
To: Breno Leitao; +Cc: Eric Dumazet, Brandeburg, Jesse, netdev
Breno Leitao wrote:
> On Fri, 2008-01-11 at 17:48 +0100, Eric Dumazet wrote:
>
>>Breno Leitao a écrit :
>>
>>>Take a look at the interrupt table this time:
>>>
>>>io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
>>>277: 15 1362450 13 14 13 14 15 18 XICS Level eth6
>>>278: 12 13 1348681 19 13 15 10 11 XICS Level eth7
>>>323: 11 18 17 1348426 18 11 11 13 XICS Level eth16
>>>324: 12 16 11 19 1402709 13 14 11 XICS Level eth17
>>>
>>>
>>>
>>
>>If your machine has 8 cpus, then your vmstat output shows a bottleneck :)
>>
>>(100/8 = 12.5), so I guess one of your CPU is full
>
>
> Well, if I run top while running the test, I see this load distributed
> among the CPUs, mainly those that had a NIC IRC bonded. Take a look:
>
> Tasks: 133 total, 2 running, 130 sleeping, 0 stopped, 1 zombie
> Cpu0 : 0.3%us, 19.5%sy, 0.0%ni, 73.5%id, 0.0%wa, 0.0%hi, 0.0%si, 6.6%st
> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 75.1%id, 0.0%wa, 0.7%hi, 24.3%si, 0.0%st
> Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 73.1%id, 0.0%wa, 0.7%hi, 26.2%si, 0.0%st
> Cpu3 : 0.0%us, 0.0%sy, 0.0%ni, 76.1%id, 0.0%wa, 0.7%hi, 23.3%si, 0.0%st
> Cpu4 : 0.0%us, 0.3%sy, 0.0%ni, 70.4%id, 0.7%wa, 0.3%hi, 28.2%si, 0.0%st
> Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
> Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
If you have IRQs bound to CPUs 1-4, and have four netperfs running,
given that the stack ostensibly tries to have applications run on the
same CPUs, what is running on CPU0?
Is it related to:
> The 2 interface test that I showed in my first email, was run in two
> different NIC. Also, I am running netperf with the following command
> "netperf -H <hostname> -T 0,8" while netserver is running without any
> argument at all. Also, running vmstat in parallel shows that there is no
> bottleneck in the CPU. Take a look:
Unless you have a morbid curiosity :) there isn't much point in binding
all the netperfs to CPU 0 when the interrupts for the NICs servicing
their connections are on CPUs 1-4. I also assume then that the
system(s) on which netserver is running have > 8 CPUs in them? (There
are multiple destination systems, yes?)
Does anything change if you explicitly bind each netperf to the CPU on
which the interrupts for its connection are processed? Or, for that
matter, if you remove the -T option entirely?
Does UDP_STREAM show different performance than TCP_STREAM (I'm
ass-u-me-ing based on the above we are looking at the netperf side of a
TCP_STREAM test above, please correct if otherwise).
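A quick UDP_STREAM sanity check could look like this (the host is a
placeholder; -m 1472 keeps each send within a 1500-byte MTU):
    netperf -H 192.168.10.2 -t UDP_STREAM -l 60 -- -m 1472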
Are the CPUs above single-core CPUs or multi-core CPUs, and if
multi-core are caches shared? How are CPUs numbered if multi-core on
that system? Is there any hardware threading involved? I'm wondering
if there may be some wrinkles in the system that might lead to reported
CPU utilization being low even if a chip is otherwise saturated. Might
need some HW counters to check that...
Can you describe the I/O subsystem more completely? I understand that
you are using at most two ports of a pair of quad-port cards at any one
time, but am still curious to know if those two cards are on separate
busses, or if they share any bus/link on the way to memory.
rick jones
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-11 11:09 ` Benny Amorsen
@ 2008-01-12 1:41 ` David Miller
2008-01-12 5:13 ` Denys Fedoryshchenko
0 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2008-01-12 1:41 UTC (permalink / raw)
To: benny+usenet; +Cc: netdev
From: Benny Amorsen <benny+usenet@amorsen.dk>
Date: Fri, 11 Jan 2008 12:09:32 +0100
> David Miller <davem@davemloft.net> writes:
>
> > No IRQ balancing should be done at all for networking device
> > interrupts, with zero exceptions. It destroys performance.
>
> Does irqbalanced need to be taught about this?
The userland one already does.
It's only the in-kernel IRQ load balancing for these (presumably
powerpc) platforms that is broken.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-12 1:41 ` David Miller
@ 2008-01-12 5:13 ` Denys Fedoryshchenko
2008-01-30 16:57 ` Kok, Auke
0 siblings, 1 reply; 19+ messages in thread
From: Denys Fedoryshchenko @ 2008-01-12 5:13 UTC (permalink / raw)
To: David Miller, benny+usenet; +Cc: netdev
Sorry to jump into this subject.
Do you recommend CONFIG_IRQBALANCE to be enabled?
If it is enabled, IRQs are not jumping nonstop over processors; softirqd
changes this behavior.
If it is disabled, IRQs are distributed over each processor, and on loaded
systems that seems harmful.
I worked a little yesterday with a server with CONFIG_IRQBALANCE=n under a 160 kpps load.
It was losing packets until I set smp_affinity.
Maybe it would be useful to put more info in Kconfig, since it is a very important
option for performance.
On Fri, 11 Jan 2008 17:41:09 -0800 (PST), David Miller wrote
> From: Benny Amorsen <benny+usenet@amorsen.dk>
> Date: Fri, 11 Jan 2008 12:09:32 +0100
>
> > David Miller <davem@davemloft.net> writes:
> >
> > > No IRQ balancing should be done at all for networking device
> > > interrupts, with zero exceptions. It destroys performance.
> >
> > Does irqbalanced need to be taught about this?
>
> The userland one already does.
>
> It's only the in-kernel IRQ load balancing for these (presumably
> powerpc) platforms that is broken.
--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: e1000 performance issue in 4 simultaneous links
2008-01-12 5:13 ` Denys Fedoryshchenko
@ 2008-01-30 16:57 ` Kok, Auke
0 siblings, 0 replies; 19+ messages in thread
From: Kok, Auke @ 2008-01-30 16:57 UTC (permalink / raw)
To: Denys Fedoryshchenko; +Cc: David Miller, benny+usenet, netdev
Denys Fedoryshchenko wrote:
> Sorry. that i interfere in this subject.
>
> Do you recommend CONFIG_IRQBALANCE to be enabled?
I certainly do not. Manual tweaking and pinning the IRQs to the correct CPU will
give the best performance (for specific loads).
The userspace irqbalance daemon tries very hard to approximate this behaviour and
is what I recommend for most situations; it usually does the right thing and does
so without making your head spin (just start it).
The in-kernel one usually does the wrong thing for network loads.
Cheers,
Auke
^ permalink raw reply [flat|nested] 19+ messages in thread
Thread overview: 19+ messages
-- links below jump to the message on this page --
2008-01-10 16:17 e1000 performance issue in 4 simultaneous links Breno Leitao
2008-01-10 16:36 ` Ben Hutchings
2008-01-10 16:51 ` Jeba Anandhan
2008-01-10 17:31 ` Breno Leitao
2008-01-10 18:18 ` Kok, Auke
2008-01-10 18:37 ` Rick Jones
2008-01-10 18:26 ` Rick Jones
2008-01-10 20:52 ` Brandeburg, Jesse
2008-01-11 1:28 ` David Miller
2008-01-11 11:09 ` Benny Amorsen
2008-01-12 1:41 ` David Miller
2008-01-12 5:13 ` Denys Fedoryshchenko
2008-01-30 16:57 ` Kok, Auke
2008-01-11 16:20 ` Breno Leitao
2008-01-11 16:48 ` Eric Dumazet
2008-01-11 17:36 ` Denys Fedoryshchenko
2008-01-11 18:45 ` Breno Leitao
2008-01-11 18:19 ` Breno Leitao
2008-01-11 18:48 ` Rick Jones