netdev.vger.kernel.org archive mirror
* e1000 performance issue in 4 simultaneous links
@ 2008-01-10 16:17 Breno Leitao
  2008-01-10 16:36 ` Ben Hutchings
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Breno Leitao @ 2008-01-10 16:17 UTC (permalink / raw)
  To: netdev

Hello, 

I've perceived that there is a performance issue when running netperf
against 4 e1000 links connected end-to-end to another machine with 4
e1000 interfaces. 

I have two 4-port cards on my machine, but the test only uses 2 ports
on each card.

When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
of transfer rate. If I run 4 netperf against 4 different interfaces, I
get around 720 * 10^6 bits/sec.  

If I run the same test against 2 interfaces I get a 940 * 10^6 bits/sec
transfer rate also, and if I run it against 3 interfaces I get around
850 * 10^6 bits/sec performance. 

I got these results using the upstream netdev-2.6 branch kernel plus
David Miller's set of 7 NAPI patches[1]. With kernel 2.6.23.12 the
result is a bit worse, and the transfer rate was around 600 * 10^6
bits/sec.

[1] http://marc.info/?l=linux-netdev&m=119977075917488&w=2

PS: I am not using a switch between the interfaces (they are connected
end-to-end) and the connections are independent.

-- 
Breno Leitao <leitao@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-10 16:17 e1000 performance issue in 4 simultaneous links Breno Leitao
@ 2008-01-10 16:36 ` Ben Hutchings
  2008-01-10 16:51   ` Jeba Anandhan
  2008-01-10 17:31   ` Breno Leitao
  2008-01-10 18:26 ` Rick Jones
  2008-01-10 20:52 ` Brandeburg, Jesse
  2 siblings, 2 replies; 19+ messages in thread
From: Ben Hutchings @ 2008-01-10 16:36 UTC (permalink / raw)
  To: Breno Leitao; +Cc: netdev

Breno Leitao wrote:
> Hello, 
> 
> I've perceived that there is a performance issue when running netperf
> against 4 e1000 links connected end-to-end to another machine with 4
> e1000 interfaces. 
> 
> I have two 4-port cards on my machine, but the test only uses 2 ports
> on each card.
> 
> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> of transfer rate. If I run 4 netperf against 4 different interfaces, I
> get around 720 * 10^6 bits/sec.
<snip>

I take it that's the average for individual interfaces, not the
aggregate?  RX processing for multi-gigabits per second can be quite
expensive.  This can be mitigated by interrupt moderation and NAPI
polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
I don't think e1000 hardware does LRO, but the driver could presumably
be changed to use Linux's software LRO.

Even with these optimisations, if all RX processing is done on a
single CPU this can become a bottleneck.  Does the test system have
multiple CPUs?  Are IRQs for the multiple NICs balanced across
multiple CPUs?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-10 16:36 ` Ben Hutchings
@ 2008-01-10 16:51   ` Jeba Anandhan
  2008-01-10 17:31   ` Breno Leitao
  1 sibling, 0 replies; 19+ messages in thread
From: Jeba Anandhan @ 2008-01-10 16:51 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Breno Leitao, netdev

Ben,
I am facing a performance issue when we try to bond multiple
interfaces into a virtual interface. It could be related to this thread. 
My questions are:
*) When we use multiple NICs, will the overall system throughput be the
sum of the individual links' XX bits/sec?
*) What factors improve the performance if we have multiple
interfaces? [e.g. tuning parameters in /proc]

Breno, 
I hope this thread will be helpful for performance issue which i have
with bonding driver.

Jeba
On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
> Breno Leitao wrote:
> > Hello, 
> > 
> > I've perceived that there is a performance issue when running netperf
> > against 4 e1000 links connected end-to-end to another machine with 4
> > e1000 interfaces. 
> > 
> > I have two 4-port cards on my machine, but the test only uses 2 ports
> > on each card.
> > 
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
> <snip>
> 
> I take it that's the average for individual interfaces, not the
> aggregate?  RX processing for multi-gigabits per second can be quite
> expensive.  This can be mitigated by interrupt moderation and NAPI
> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
> I don't think e1000 hardware does LRO, but the driver could presumably
> be changed to use Linux's software LRO.
> 
> Even with these optimisations, if all RX processing is done on a
> single CPU this can become a bottleneck.  Does the test system have
> multiple CPUs?  Are IRQs for the multiple NICs balanced across
> multiple CPUs?
> 
> Ben.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-10 16:36 ` Ben Hutchings
  2008-01-10 16:51   ` Jeba Anandhan
@ 2008-01-10 17:31   ` Breno Leitao
  2008-01-10 18:18     ` Kok, Auke
  2008-01-10 18:37     ` Rick Jones
  1 sibling, 2 replies; 19+ messages in thread
From: Breno Leitao @ 2008-01-10 17:31 UTC (permalink / raw)
  To: bhutchings

On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
> <snip>
> 
> I take it that's the average for individual interfaces, not the
> aggregate?
Right, each of these results is for an individual interface. Otherwise,
we'd have a huge problem. :-)

> This can be mitigated by interrupt moderation and NAPI
> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
> I don't think e1000 hardware does LRO, but the driver could presumably
> be changed to use Linux's software LRO.
Without using these "features" and keeping the MTU at 1500, do you think
we could get better performance than this?

I also tried to increase my interface MTU to 9000, but I am afraid that
netperf only transmits packets smaller than 1500 bytes. Still investigating.

> single CPU this can become a bottleneck.  Does the test system have
> multiple CPUs?  Are IRQs for the multiple NICs balanced across
> multiple CPUs?
Yes, this machine has 8 ppc 1.9Ghz CPUs. And the IRQs are balanced
across the CPUs, as I see in /proc/interrupts: 

# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
 16:        940        760       1047        904        993        777        975        813   XICS      Level     IPI
 18:          4          3          4          1          3          6          8          3   XICS      Level     hvc_console
 19:          0          0          0          0          0          0          0          0   XICS      Level     RAS_EPOW
273:      10728      10850      10937      10833      10884      10788      10868      10776   XICS      Level     eth4
275:          0          0          0          0          0          0          0          0   XICS      Level     ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
277:     234933     230275     229770     234048     235906     229858     229975     233859   XICS      Level     eth6
278:     266225     267606     262844     265985     268789     266869     263110     267422   XICS      Level     eth7
279:        893        919        857        909        867        917        894        881   XICS      Level     eth0
305:     439246     439117     438495     436072     438053     440111     438973     438951   XICS      Level     eth0 Neterion Xframe II 10GbE network adapter
321:       3268       3088       3143       3113       3305       2982       3326       3084   XICS      Level     ipr
323:     268030     273207     269710     271338     270306     273258     270872     273281   XICS      Level     eth16
324:     215012     221102     219494     216732     216531     220460     219718     218654   XICS      Level     eth17
325:       7103       3580       7246       3475       7132       3394       7258       3435   XICS      Level     pata_pdc2027x
BAD:       4216

Thanks,

-- 
Breno Leitao <leitao@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-10 17:31   ` Breno Leitao
@ 2008-01-10 18:18     ` Kok, Auke
  2008-01-10 18:37     ` Rick Jones
  1 sibling, 0 replies; 19+ messages in thread
From: Kok, Auke @ 2008-01-10 18:18 UTC (permalink / raw)
  To: Breno Leitao; +Cc: bhutchings, NetDev

Breno Leitao wrote:
> On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
>>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
>>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
>>> get around 720 * 10^6 bits/sec.
>> <snip>
>>
>> I take it that's the average for individual interfaces, not the
>> aggregate?
> Right, each of these results are for individual interfaces. Otherwise,
> we'd have a huge problem. :-)
> 
>> This can be mitigated by interrupt moderation and NAPI
>> polling, jumbo frames (MTU >1500) and/or Large Receive Offload (LRO).
>> I don't think e1000 hardware does LRO, but the driver could presumably
>> be changed to use Linux's software LRO.
> Without using these "features" and keeping the MTU as 1500, do you think
> we could get a better performance than this one?
> 
> I also tried to increase my interface MTU to 9000, but I am afraid that
> netperf only transmits packets with less than 1500. Still investigating.
> 
>> single CPU this can become a bottleneck.  Does the test system have
>> multiple CPUs?  Are IRQs for the multiple NICs balanced across
>> multiple CPUs?
> Yes, this machine has 8 ppc 1.9Ghz CPUs. And the IRQs are balanced
> across the CPUs, as I see in /proc/interrupts: 


That is wrong and hurts performance. You want your ethernet IRQs to stick to a
CPU for a long time to prevent cache thrashing.

Please disable the in-kernel IRQ balancing code and use the userspace `irqbalance`
daemon.
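For instance (package and init-script names vary by distro, so take this as
a sketch):

  # check whether the userspace daemon is already running
  ps ax | grep '[i]rqbalance'
  # if not, just start it as root
  irqbalance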

Gee I should put that in my signature, I already wrote that twice today :)

Auke

> 
> # cat /proc/interrupts 
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
>  16:        940        760       1047        904        993        777        975        813   XICS      Level     IPI
>  18:          4          3          4          1          3          6          8          3   XICS      Level     hvc_console
>  19:          0          0          0          0          0          0          0          0   XICS      Level     RAS_EPOW
> 273:      10728      10850      10937      10833      10884      10788      10868      10776   XICS      Level     eth4
> 275:          0          0          0          0          0          0          0          0   XICS      Level     ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
> 277:     234933     230275     229770     234048     235906     229858     229975     233859   XICS      Level     eth6
> 278:     266225     267606     262844     265985     268789     266869     263110     267422   XICS      Level     eth7
> 279:        893        919        857        909        867        917        894        881   XICS      Level     eth0
> 305:     439246     439117     438495     436072     438053     440111     438973     438951   XICS      Level     eth0 Neterion Xframe II 10GbE network adapter
> 321:       3268       3088       3143       3113       3305       2982       3326       3084   XICS      Level     ipr
> 323:     268030     273207     269710     271338     270306     273258     270872     273281   XICS      Level     eth16
> 324:     215012     221102     219494     216732     216531     220460     219718     218654   XICS      Level     eth17
> 325:       7103       3580       7246       3475       7132       3394       7258       3435   XICS      Level     pata_pdc2027x
> BAD:       4216
> 
> Thanks,
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-10 16:17 e1000 performance issue in 4 simultaneous links Breno Leitao
  2008-01-10 16:36 ` Ben Hutchings
@ 2008-01-10 18:26 ` Rick Jones
  2008-01-10 20:52 ` Brandeburg, Jesse
  2 siblings, 0 replies; 19+ messages in thread
From: Rick Jones @ 2008-01-10 18:26 UTC (permalink / raw)
  To: Breno Leitao; +Cc: netdev

Many many things to check when running netperf :)

*) Are the cards on the same or separate PCImumble bus, and what sort of bus is it?

*) is the two-interface performance from two interfaces on the same four-port 
card, or from an interface on each of the two four-port cards?

*) is there a dreaded (IMO) irqbalance daemon running?  one of the very 
first things I do when running netperf is terminate the irqbalance 
daemon with as extreme a prejudice as I can.

*) what is the distribution of interrupts from the interfaces to the 
CPUs?  if you've tried to set that manually, the dreaded irqbalance 
daemon will come along shortly thereafter and ruin everything.

*) what does netperf say about the overall CPU utilization of the 
system(s) when the tests are running?

*) what does top say about the utilization of any single CPU in the 
system(s) when the tests are running?

*) are you using the global -T option to spread the netperf/netserver 
processes across the CPUs, or leaving that all up to the 
stack/scheduler/etc?

I suspect there could be more, but that is what comes to mind thus far as 
far as things I often check when running netperf.
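
For the CPU utilization and -T items, something along these lines works
(the host name is a placeholder; -c/-C report local/remote CPU utilization,
and -T pins netperf/netserver to specific CPUs):

  netperf -H <remote_host> -t TCP_STREAM -l 60 -c -C -T 0,0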

rick jones


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-10 17:31   ` Breno Leitao
  2008-01-10 18:18     ` Kok, Auke
@ 2008-01-10 18:37     ` Rick Jones
  1 sibling, 0 replies; 19+ messages in thread
From: Rick Jones @ 2008-01-10 18:37 UTC (permalink / raw)
  To: Breno Leitao; +Cc: bhutchings, Linux Network Development list

> I also tried to increase my interface MTU to 9000, but I am afraid that
> netperf only transmits packets smaller than 1500 bytes. Still investigating.

It may seem like picking a tiny nit, but netperf never transmits 
packets.  It only provides buffers of specified size to the stack. It is 
then the stack which transmits and determines the size of the packets on 
the network.

Drifting a bit more...

While there are settings, conditions and known stack behaviours where 
one can be confident of the packet size on the network based on the 
options passed to netperf, generally speaking one should not ass-u-me a 
direct relationship between the options one passes to netperf and the 
size of the packets on the network.

And for JumboFrames to be effective, it must be set on both ends; 
otherwise the TCP MSS exchange will result in the smaller of the two 
MTUs "winning", as it were.
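
For example (interface names are illustrative, and it has to be done on
both ends):

  ifconfig eth6 mtu 9000    # on this side
  ifconfig eth2 mtu 9000    # and on the corresponding remote interface

and then verify the MSS actually negotiated, e.g. by watching the
connection setup with tcpdump.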

>>single CPU this can become a bottleneck.  Does the test system have
>>multiple CPUs?  Are IRQs for the multiple NICs balanced across
>>multiple CPUs?
> 
> Yes, this machine has 8 ppc 1.9Ghz CPUs. And the IRQs are balanced
> across the CPUs, as I see in /proc/interrupts: 

That suggests to me anyway that the dreaded irqbalanced is running, 
shuffling the interrupts as you go.  Not often a happy place for running 
netperf when one wants consistent results.

> 
> # cat /proc/interrupts 
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
>  16:        940        760       1047        904        993        777        975        813   XICS      Level     IPI
>  18:          4          3          4          1          3          6          8          3   XICS      Level     hvc_console
>  19:          0          0          0          0          0          0          0          0   XICS      Level     RAS_EPOW
> 273:      10728      10850      10937      10833      10884      10788      10868      10776   XICS      Level     eth4
> 275:          0          0          0          0          0          0          0          0   XICS      Level     ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
> 277:     234933     230275     229770     234048     235906     229858     229975     233859   XICS      Level     eth6
> 278:     266225     267606     262844     265985     268789     266869     263110     267422   XICS      Level     eth7
> 279:        893        919        857        909        867        917        894        881   XICS      Level     eth0
> 305:     439246     439117     438495     436072     438053     440111     438973     438951   XICS      Level     eth0 Neterion Xframe II 10GbE network adapter
> 321:       3268       3088       3143       3113       3305       2982       3326       3084   XICS      Level     ipr
> 323:     268030     273207     269710     271338     270306     273258     270872     273281   XICS      Level     eth16
> 324:     215012     221102     219494     216732     216531     220460     219718     218654   XICS      Level     eth17
> 325:       7103       3580       7246       3475       7132       3394       7258       3435   XICS      Level     pata_pdc2027x
> BAD:       4216

IMO, what you want (in the absence of multi-queue NICs) is one CPU 
taking the interrupts of one port/interface, and each port/interface's 
interrupts going to a separate CPU.  So, something that looks roughly 
like this concocted example:

            CPU0     CPU1      CPU2     CPU3
   1:       1234        0         0        0   eth0
   2:          0     1234         0        0   eth1
   3:          0        0      1234        0   eth2
   4:          0        0         0     1234   eth3

which you should be able to achieve via the method I think someone else 
has already mentioned about echoing values into 
/proc/irq/<irq>/smp_affinity  - after you have slain the dreaded 
irqbalance daemon.
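
For example, once the irqbalance daemon is stopped, and using the IRQ
numbers from your /proc/interrupts output (the values are hex CPU
bitmasks):

  echo 1 > /proc/irq/277/smp_affinity    # eth6  -> CPU0
  echo 2 > /proc/irq/278/smp_affinity    # eth7  -> CPU1
  echo 4 > /proc/irq/323/smp_affinity    # eth16 -> CPU2
  echo 8 > /proc/irq/324/smp_affinity    # eth17 -> CPU3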

rick jones

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: e1000 performance issue in 4 simultaneous links
  2008-01-10 16:17 e1000 performance issue in 4 simultaneous links Breno Leitao
  2008-01-10 16:36 ` Ben Hutchings
  2008-01-10 18:26 ` Rick Jones
@ 2008-01-10 20:52 ` Brandeburg, Jesse
  2008-01-11  1:28   ` David Miller
  2008-01-11 16:20   ` Breno Leitao
  2 siblings, 2 replies; 19+ messages in thread
From: Brandeburg, Jesse @ 2008-01-10 20:52 UTC (permalink / raw)
  To: Breno Leitao; +Cc: netdev, Brandeburg, Jesse

Breno Leitao wrote:
> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> of transfer rate. If I run 4 netperf against 4 different interfaces, I
> get around 720 * 10^6 bits/sec.

This is actually a known issue that we have worked on with your company
before.  It comes down to your system's default behavior of round
robining interrupts (see cat /proc/interrupts while running the test)
combined with e1000's way of exiting / rescheduling NAPI.

The default round robin behavior of the interrupts on your system is the
root cause of this issue, and here is what happens:

4 interfaces start generating interrupts; if you're lucky, the round
robin balancer has them all on different CPUs.
As the e1000 driver goes into and out of polling mode, the round robin
balancer keeps moving the interrupt to the next CPU.
Eventually 2 or more driver instances end up on the same CPU, which
causes both driver instances to stay in NAPI polling mode, due to the
amount of work being done and the fact that there are always more than
"netdev->weight" packets to process for each instance.  This keeps
*hardware* interrupts for each interface *disabled*.
Staying in NAPI polling mode causes higher CPU utilization on that one
processor, which guarantees that when the hardware round robin balancer
moves any other network interrupt onto that CPU, it too will join the
NAPI polling mode chain.
So no matter how many processors you have, with this round robin style
of hardware interrupts, if there is a lot of work to do (more than
weight) at each softirq, then all network interfaces will eventually
end up on the same CPU (the busiest one).
Your performance becomes the same as if you had booted with maxcpus=1.

I hope this explanation makes sense, but what it comes down to is that
combining hardware round robin balancing with NAPI is a BAD IDEA.  In
general the behavior of hardware round robin balancing is bad and I'm
sure it is causing all sorts of other performance issues that you may
not even be aware of.

I'm sure your problem will go away if you run e1000 in interrupt mode.
(use make CFLAGS_EXTRA=-DE1000_NO_NAPI)
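
Roughly, assuming the standalone e1000 driver source package (paths are
illustrative):

  cd e1000-<version>/src
  make CFLAGS_EXTRA=-DE1000_NO_NAPI
  rmmod e1000 && insmod e1000.ko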
 
> If I run the same test against 2 interfaces I get a 940 * 10^6
> bits/sec transfer rate also, and if I run it against 3 interfaces I
> get around 850 * 10^6 bits/sec performance.
> 
> I got these results using the upstream netdev-2.6 branch kernel plus
> David Miller's set of 7 NAPI patches[1]. With kernel 2.6.23.12 the
> result is a bit worse, and the transfer rate was around 600 * 10^6
> bits/sec.

Thank you for testing the latest kernel.org kernel.

Hope this helps.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-10 20:52 ` Brandeburg, Jesse
@ 2008-01-11  1:28   ` David Miller
  2008-01-11 11:09     ` Benny Amorsen
  2008-01-11 16:20   ` Breno Leitao
  1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2008-01-11  1:28 UTC (permalink / raw)
  To: jesse.brandeburg; +Cc: leitao, netdev

From: "Brandeburg, Jesse" <jesse.brandeburg@intel.com>
Date: Thu, 10 Jan 2008 12:52:15 -0800

> I hope this explanation makes sense, but what it comes down to is that
> combining hardware round robin balancing with NAPI is a BAD IDEA.

Absolutely agreed on all counts.

No IRQ balancing should be done at all for networking device
interrupts, with zero exceptions.  It destroys performance.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-11  1:28   ` David Miller
@ 2008-01-11 11:09     ` Benny Amorsen
  2008-01-12  1:41       ` David Miller
  0 siblings, 1 reply; 19+ messages in thread
From: Benny Amorsen @ 2008-01-11 11:09 UTC (permalink / raw)
  To: netdev

David Miller <davem@davemloft.net> writes:

> No IRQ balancing should be done at all for networking device
> interrupts, with zero exceptions.  It destroys performance.

Does irqbalanced need to be taught about this? And how about the
initial balancing, so that each network card gets assigned to one CPU?


/Benny



^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: e1000 performance issue in 4 simultaneous links
  2008-01-10 20:52 ` Brandeburg, Jesse
  2008-01-11  1:28   ` David Miller
@ 2008-01-11 16:20   ` Breno Leitao
  2008-01-11 16:48     ` Eric Dumazet
  1 sibling, 1 reply; 19+ messages in thread
From: Breno Leitao @ 2008-01-11 16:20 UTC (permalink / raw)
  To: Brandeburg, Jesse, rick.jones2; +Cc: netdev

On Thu, 2008-01-10 at 12:52 -0800, Brandeburg, Jesse wrote:
> Breno Leitao wrote:
> > When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> > of transfer rate. If I run 4 netperf against 4 different interfaces, I
> > get around 720 * 10^6 bits/sec.
> 
> I hope this explanation makes sense, but what it comes down to is that
> combining hardware round robin balancing with NAPI is a BAD IDEA.  In
> general the behavior of hardware round robin balancing is bad and I'm
> sure it is causing all sorts of other performance issues that you may
> not even be aware of.
I've made another test removing the ppc IRQ round robin scheme, binding
each interface's IRQ (eth6, eth7, eth16 and eth17) to a different CPU (CPU1,
CPU2, CPU3 and CPU4), and I still get around 720 * 10^6 bits/s on
average.

Take a look at the interrupt table this time: 

io-dolphins:~/leitao # cat /proc/interrupts  | grep eth[1]*[67]
277:         15    1362450         13         14         13         14         15         18   XICS      Level     eth6
278:         12         13    1348681         19         13         15         10         11   XICS      Level     eth7
323:         11         18         17    1348426         18         11         11         13   XICS      Level     eth16
324:         12         16         11         19    1402709         13         14         11   XICS      Level     eth17


I also tried to bind all 4 interface IRQs to a single CPU (CPU0)
using the noirqdistrib boot parameter, and the performance was a little
worse.

Rick, 
  The 2-interface test that I showed in my first email was run on two
different NICs. Also, I am running netperf with the following command:
"netperf -H <hostname> -T 0,8", while netserver is running without any
arguments at all. Also, running vmstat in parallel shows that there is no
bottleneck in the CPU. Take a look: 

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 6714732  16168 227440    0    0     8     2  203   21  0  1 98  0  0
 0  0      0 6715120  16176 227440    0    0     0    28 16234  505  0 16 83  0  1
 0  0      0 6715516  16176 227440    0    0     0     0 16251  518  0 16 83  0  1
 1  0      0 6715252  16176 227440    0    0     0     1 16316  497  0 15 84  0  1
 0  0      0 6716092  16176 227440    0    0     0     0 16300  520  0 16 83  0  1
 0  0      0 6716320  16180 227440    0    0     0     1 16354  486  0 15 84  0  1
 

Thanks!

-- 
Breno Leitao <leitao@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-11 16:20   ` Breno Leitao
@ 2008-01-11 16:48     ` Eric Dumazet
  2008-01-11 17:36       ` Denys Fedoryshchenko
  2008-01-11 18:19       ` Breno Leitao
  0 siblings, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2008-01-11 16:48 UTC (permalink / raw)
  To: Breno Leitao; +Cc: Brandeburg, Jesse, rick.jones2, netdev

Breno Leitao wrote:
> On Thu, 2008-01-10 at 12:52 -0800, Brandeburg, Jesse wrote:
>   
>> Breno Leitao wrote:
>>     
>>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
>>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
>>> get around 720 * 10^6 bits/sec.
>>>       
>> I hope this explanation makes sense, but what it comes down to is that
>> combining hardware round robin balancing with NAPI is a BAD IDEA.  In
>> general the behavior of hardware round robin balancing is bad and I'm
>> sure it is causing all sorts of other performance issues that you may
>> not even be aware of.
>>     
> I've made another test removing the ppc IRQ Round Robin scheme, bonded
> each interface (eth6, eth7, eth16 and eth17) to different CPUs (CPU1,
> CPU2, CPU3 and CPU4) and I also get around around 720 * 10^6 bits/s in
> average.
>
> Take a look at the interrupt table this time: 
>
> io-dolphins:~/leitao # cat /proc/interrupts  | grep eth[1]*[67]
> 277:         15    1362450         13         14         13         14         15         18   XICS      Level     eth6
> 278:         12         13    1348681         19         13         15         10         11   XICS      Level     eth7
> 323:         11         18         17    1348426         18         11         11         13   XICS      Level     eth16
> 324:         12         16         11         19    1402709         13         14         11   XICS      Level     eth17
>
>
> I also tried to bound all the 4 interface IRQ to a single CPU (CPU0)
> using the noirqdistrib boot paramenter, and the performance was a little
> worse.
>
> Rick, 
>   The 2 interface test that I showed in my first email, was run in two
> different NIC. Also, I am running netperf with the following command
> "netperf -H <hostname> -T 0,8" while netserver is running without any
> argument at all. Also, running vmstat in parallel shows that there is no
> bottleneck in the CPU. Take a look: 
>
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  2  0      0 6714732  16168 227440    0    0     8     2  203   21  0  1 98  0  0
>  0  0      0 6715120  16176 227440    0    0     0    28 16234  505  0 16 83  0  1
>  0  0      0 6715516  16176 227440    0    0     0     0 16251  518  0 16 83  0  1
>  1  0      0 6715252  16176 227440    0    0     0     1 16316  497  0 15 84  0  1
>  0  0      0 6716092  16176 227440    0    0     0     0 16300  520  0 16 83  0  1
>  0  0      0 6716320  16180 227440    0    0     0     1 16354  486  0 15 84  0  1
>  
>
>   
If your machine has 8 CPUs, then your vmstat output shows a bottleneck :)

(100/8 = 12.5), so I guess one of your CPUs is full






^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-11 16:48     ` Eric Dumazet
@ 2008-01-11 17:36       ` Denys Fedoryshchenko
  2008-01-11 18:45         ` Breno Leitao
  2008-01-11 18:19       ` Breno Leitao
  1 sibling, 1 reply; 19+ messages in thread
From: Denys Fedoryshchenko @ 2008-01-11 17:36 UTC (permalink / raw)
  To: netdev

Maybe it's a good idea to use sysstat?

http://perso.wanadoo.fr/sebastien.godard/

For example:

visp-1 ~ # mpstat -P ALL 1
Linux 2.6.24-rc7-devel (visp-1)         01/11/08

19:27:57     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
19:27:58     all    0.00    0.00    0.00    0.00    0.00    2.51    0.00   97.49   7707.00
19:27:58       0    0.00    0.00    0.00    0.00    0.00    4.00    0.00   96.00   1926.00
19:27:58       1    0.00    0.00    0.00    0.00    0.00    1.01    0.00   98.99   1926.00
19:27:58       2    0.00    0.00    0.00    0.00    0.00    5.00    0.00   95.00   1927.00
19:27:58       3    0.00    0.00    0.00    0.00    0.00    0.99    0.00   99.01   1927.00
19:27:58       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00



> >>     
> >>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec
> >>> of transfer rate. If I run 4 netperf against 4 different interfaces, I
> >>> get around 720 * 10^6 bits/sec.
> >>>       
> >> I hope this explanation makes sense, but what it comes down to is that
> >> combining hardware round robin balancing with NAPI is a BAD IDEA.  In
> >> general the behavior of hardware round robin balancing is bad and I'm
> >> sure it is causing all sorts of other performance issues that you may
> >> not even be aware of.
> >>     
> > I've made another test removing the ppc IRQ Round Robin scheme, bonded
> > each interface (eth6, eth7, eth16 and eth17) to different CPUs (CPU1,
> > CPU2, CPU3 and CPU4) and I also get around around 720 * 10^6 bits/s in
> > average.
> >
> > Take a look at the interrupt table this time: 
> >
> > io-dolphins:~/leitao # cat /proc/interrupts  | grep eth[1]*[67]
> > 277:         15    1362450         13         14         13         
14         15         18   XICS      Level     eth6
> > 278:         12         13    1348681         19         13         
15         10         11   XICS      Level     eth7
> > 323:         11         18         17    1348426         18         
11         11         13   XICS      Level     eth16
> > 324:         12         16         11         19    1402709         
13         14         11   XICS      Level     eth17
> >
> >
> > I also tried to bound all the 4 interface IRQ to a single CPU (CPU0)
> > using the noirqdistrib boot paramenter, and the performance was a little
> > worse.
> >
> > Rick, 
> >   The 2 interface test that I showed in my first email, was run in two
> > different NIC. Also, I am running netperf with the following command
> > "netperf -H <hostname> -T 0,8" while netserver is running without any
> > argument at all. Also, running vmstat in parallel shows that there is no
> > bottleneck in the CPU. Take a look: 
> >
> > procs -----------memory---------- ---swap-- -----io---- -system-- -----
cpu------
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy 
id wa st
> >  2  0      0 6714732  16168 227440    0    0     8     2  203   21  0  1 
98  0  0
> >  0  0      0 6715120  16176 227440    0    0     0    28 16234  505  0 16 
83  0  1
> >  0  0      0 6715516  16176 227440    0    0     0     0 16251  518  0 16 
83  0  1
> >  1  0      0 6715252  16176 227440    0    0     0     1 16316  497  0 15 
84  0  1
> >  0  0      0 6716092  16176 227440    0    0     0     0 16300  520  0 16 
83  0  1
> >  0  0      0 6716320  16180 227440    0    0     0     1 16354  486  0 15 
84  0  1
> >  
> >
> >   
> If your machine has 8 cpus, then your vmstat output shows a 
> bottleneck :)
> 
> (100/8 = 12.5), so I guess one of your CPU is full
> 


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-11 16:48     ` Eric Dumazet
  2008-01-11 17:36       ` Denys Fedoryshchenko
@ 2008-01-11 18:19       ` Breno Leitao
  2008-01-11 18:48         ` Rick Jones
  1 sibling, 1 reply; 19+ messages in thread
From: Breno Leitao @ 2008-01-11 18:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Brandeburg, Jesse, rick.jones2, netdev

On Fri, 2008-01-11 at 17:48 +0100, Eric Dumazet wrote:
> Breno Leitao wrote:
> > Take a look at the interrupt table this time: 
> >
> > io-dolphins:~/leitao # cat /proc/interrupts  | grep eth[1]*[67]
> > 277:         15    1362450         13         14         13         14         15         18   XICS      Level     eth6
> > 278:         12         13    1348681         19         13         15         10         11   XICS      Level     eth7
> > 323:         11         18         17    1348426         18         11         11         13   XICS      Level     eth16
> > 324:         12         16         11         19    1402709         13         14         11   XICS      Level     eth17
> >
> >
> >   
> If your machine has 8 cpus, then your vmstat output shows a bottleneck :)
> 
> (100/8 = 12.5), so I guess one of your CPU is full

Well, if I run top while running the test, I see the load distributed
among the CPUs, mainly those that have a NIC IRQ bound to them. Take a look:

Tasks: 133 total,   2 running, 130 sleeping,   0 stopped,   1 zombie
Cpu0  :  0.3%us, 19.5%sy,  0.0%ni, 73.5%id,  0.0%wa,  0.0%hi,  0.0%si,  6.6%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 75.1%id,  0.0%wa,  0.7%hi, 24.3%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 73.1%id,  0.0%wa,  0.7%hi, 26.2%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni, 76.1%id,  0.0%wa,  0.7%hi, 23.3%si,  0.0%st
Cpu4  :  0.0%us,  0.3%sy,  0.0%ni, 70.4%id,  0.7%wa,  0.3%hi, 28.2%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Note that this load pattern doesn't change during the entire
benchmark run.

Thanks!

-- 
Breno Leitao <leitao@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-11 17:36       ` Denys Fedoryshchenko
@ 2008-01-11 18:45         ` Breno Leitao
  0 siblings, 0 replies; 19+ messages in thread
From: Breno Leitao @ 2008-01-11 18:45 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev

Hello Denys, 
   I've installed sysstat (good tools!) and the result is very similar
to what top shows; take a look:
   13:34:23     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
13:34:24     all    0.00    0.00    2.72    0.00    0.25   12.13    0.99   83.91  16267.33
13:34:24       0    0.00    0.00   21.78    0.00    0.00    0.00    7.92   70.30     40.59
13:34:24       1    0.00    0.00    0.00    0.00    0.99   24.75    0.00   74.26   4025.74
13:34:24       2    0.00    0.00    0.00    0.00    0.99   24.75    0.00   74.26   4036.63
13:34:24       3    0.00    0.00    0.00    0.00    0.99   21.78    0.00   77.23   4032.67
13:34:24       4    0.00    0.00    0.00    0.00    0.98   24.51    0.00   74.51   4034.65
13:34:24       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     30.69
13:34:24       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     33.66
13:34:24       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00     32.67

So we can confirm that the IRQs are not being bounced around, and that
no processor is overloaded.

Thanks!


On Fri, 2008-01-11 at 19:36 +0200, Denys Fedoryshchenko wrote:
> Maybe good idea to use sysstat ?
> 
> http://perso.wanadoo.fr/sebastien.godard/


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-11 18:19       ` Breno Leitao
@ 2008-01-11 18:48         ` Rick Jones
  0 siblings, 0 replies; 19+ messages in thread
From: Rick Jones @ 2008-01-11 18:48 UTC (permalink / raw)
  To: Breno Leitao; +Cc: Eric Dumazet, Brandeburg, Jesse, netdev

Breno Leitao wrote:
> On Fri, 2008-01-11 at 17:48 +0100, Eric Dumazet wrote:
> 
>>Breno Leitao wrote:
>>
>>>Take a look at the interrupt table this time: 
>>>
>>>io-dolphins:~/leitao # cat /proc/interrupts  | grep eth[1]*[67]
>>>277:         15    1362450         13         14         13         14         15         18   XICS      Level     eth6
>>>278:         12         13    1348681         19         13         15         10         11   XICS      Level     eth7
>>>323:         11         18         17    1348426         18         11         11         13   XICS      Level     eth16
>>>324:         12         16         11         19    1402709         13         14         11   XICS      Level     eth17
>>>
>>>
>>>  
>>
>>If your machine has 8 cpus, then your vmstat output shows a bottleneck :)
>>
>>(100/8 = 12.5), so I guess one of your CPU is full
> 
> 
> Well, if I run top while running the test, I see the load distributed
> among the CPUs, mainly those that have a NIC IRQ bound to them. Take a look:
> 
> Tasks: 133 total,   2 running, 130 sleeping,   0 stopped,   1 zombie
> Cpu0  :  0.3%us, 19.5%sy,  0.0%ni, 73.5%id,  0.0%wa,  0.0%hi,  0.0%si,  6.6%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 75.1%id,  0.0%wa,  0.7%hi, 24.3%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 73.1%id,  0.0%wa,  0.7%hi, 26.2%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni, 76.1%id,  0.0%wa,  0.7%hi, 23.3%si,  0.0%st
> Cpu4  :  0.0%us,  0.3%sy,  0.0%ni, 70.4%id,  0.7%wa,  0.3%hi, 28.2%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

If you have IRQs bound to CPUs 1-4, and have four netperfs running, 
given that the stack ostensibly tries to have applications run on the 
same CPUs, what is running on CPU0?

Is it related to:

>   The 2-interface test that I showed in my first email was run on two
> different NICs. Also, I am running netperf with the following command:
> "netperf -H <hostname> -T 0,8", while netserver is running without any
> arguments at all. Also, running vmstat in parallel shows that there is no
> bottleneck in the CPU. Take a look: 

Unless you have a morbid curiosity :) there isn't much point in binding 
all the netperfs to CPU 0 when the interrupts for the NICs servicing 
their connections are on CPUs 1-4.  I also assume then that the 
system(s) on which netserver is running have > 8 CPUs in them? (There 
are multiple destination systems yes?)

Does anything change if you explicitly bind each netperf to the CPU on 
which the interrupts for its connection are processed?  Or, for that 
matter, if you remove the -T option entirely?

Does UDP_STREAM show different performance than TCP_STREAM?  (I'm 
ass-u-me-ing, based on the above, that we are looking at the netperf side 
of a TCP_STREAM test; please correct if otherwise.)
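
Something like this would do for a quick comparison (the 1472-byte send
size keeps each UDP datagram within a 1500-byte MTU):

  netperf -H <remote_host> -t TCP_STREAM -l 60
  netperf -H <remote_host> -t UDP_STREAM -l 60 -- -m 1472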

Are the CPUs above single-core CPUs or multi-core CPUs, and if 
multi-core, are caches shared?  How are CPUs numbered if multi-core on 
that system?  Is there any hardware threading involved?  I'm wondering 
if there may be some wrinkles in the system that might lead to reported 
CPU utilization being low even if a chip is otherwise saturated.  Might 
need some HW counters to check that...

Can you describe the I/O subsystem more completely?  I understand that 
you are using at most two ports of a pair of quad-port cards at any one 
time, but am still curious to know if those two cards are on separate 
busses, or if they share any bus/link on the way to memory.

rick jones

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-11 11:09     ` Benny Amorsen
@ 2008-01-12  1:41       ` David Miller
  2008-01-12  5:13         ` Denys Fedoryshchenko
  0 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2008-01-12  1:41 UTC (permalink / raw)
  To: benny+usenet; +Cc: netdev

From: Benny Amorsen <benny+usenet@amorsen.dk>
Date: Fri, 11 Jan 2008 12:09:32 +0100

> David Miller <davem@davemloft.net> writes:
> 
> > No IRQ balancing should be done at all for networking device
> > interrupts, with zero exceptions.  It destroys performance.
> 
> Does irqbalanced need to be taught about this?

The userland one already does.

It's only the in-kernel IRQ load balancing for these (presumably
powerpc) platforms that is broken.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-12  1:41       ` David Miller
@ 2008-01-12  5:13         ` Denys Fedoryshchenko
  2008-01-30 16:57           ` Kok, Auke
  0 siblings, 1 reply; 19+ messages in thread
From: Denys Fedoryshchenko @ 2008-01-12  5:13 UTC (permalink / raw)
  To: David Miller, benny+usenet; +Cc: netdev

Sorry that I interfere in this subject.

Do you recommend that CONFIG_IRQBALANCE be enabled?

If it is enabled, IRQs don't jump nonstop between processors; softirqd 
changes this behavior.
If it is disabled, IRQs are distributed over each processor, and on loaded 
systems that seems harmful. 
I worked a little yesterday with a server with CONFIG_IRQBALANCE=no under a
160kpps load. It was losing packets until I set smp_affinity.

Maybe it would be useful to put more info in Kconfig, since it is a very 
important option for performance.

On Fri, 11 Jan 2008 17:41:09 -0800 (PST), David Miller wrote
> From: Benny Amorsen <benny+usenet@amorsen.dk>
> Date: Fri, 11 Jan 2008 12:09:32 +0100
> 
> > David Miller <davem@davemloft.net> writes:
> > 
> > > No IRQ balancing should be done at all for networking device
> > > interrupts, with zero exceptions.  It destroys performance.
> > 
> > Does irqbalanced need to be taught about this?
> 
> The userland one already does.
> 
> It's only the in-kernel IRQ load balancing for these (presumably
> powerpc) platforms that is broken.


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: e1000 performance issue in 4 simultaneous links
  2008-01-12  5:13         ` Denys Fedoryshchenko
@ 2008-01-30 16:57           ` Kok, Auke
  0 siblings, 0 replies; 19+ messages in thread
From: Kok, Auke @ 2008-01-30 16:57 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: David Miller, benny+usenet, netdev

Denys Fedoryshchenko wrote:
> Sorry. that i interfere in this subject.
> 
> Do you recommend CONFIG_IRQBALANCE to be enabled?

I certainly do not. Manual tweaking and pinning the IRQs to the correct CPU will
give the best performance (for specific loads).

The userspace irqbalance daemon tries very hard to approximate this behaviour and
is what I recommend for most situations; it usually does the right thing and does
so without making your head spin (just start it).

The in-kernel one usually does the wrong thing for network loads.

Cheers,

Auke

^ permalink raw reply	[flat|nested] 19+ messages in thread
