* NAPI, e100, and system performance problem
From: Brandeburg, Jesse @ 2005-04-18 6:11 UTC
To: netdev
_Summary_
As part owner of the e100 driver, I've had this issue nagging at me
for a while. NAPI seems to be able to swamp a system with interrupts
and context switches. In this case the system does not respond to
having zero CPU cycles available by polling more, as I figured it would.
This was mostly based on observation, but I included some quick data
gathering for this mail.
_Background_
I've noticed that when NAPI is enabled on a 10/100 (e100.c) adapter,
if I push a lot of small packets through it, the system begins to show
interrupt lag. In this case it is a 64-bit Power processor machine.
The processors on this machine generally execute code very quickly,
while the bus is a little slow.
With the current e100 driver in NAPI mode the adapter can generate
upwards of 36000 interrupts a second from receiving only 77000
packets/second of pktgen traffic - roughly two packets per interrupt.
_test 1_
I really couldn't think of a decent way to show the lag except to say
that the network and keyboard visibly lag when a 10/100 adapter is
pounding away.
*** Here is the localhost ping data, first with an idle system:
--- 127.0.0.1 ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 4332ms
rtt min/avg/max/mdev = 0.008/0.009/0.077/0.005 ms, pipe 2, ipg/ewma 0.043/0.010 ms
*** now while receiving ~77000 packets per second of discardable traffic:
--- 127.0.0.1 ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 26370ms
rtt min/avg/max/mdev = 0.008/0.010/9.410/0.034 ms, pipe 2, ipg/ewma 0.263/0.010 ms
*** ping took 4.3 seconds unloaded and 26.4 seconds when receiving traffic.
_test 2_
Run netperf with 10 threads to localhost in order to stress the CPUs:
netserver; netperf -l 60 -H localhost
This shows 100% CPU utilization.
Start pktgen at the other end; these are the results from vmstat 2:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa
13  0      0 3711936  95192  92932    0    0     0     0     2 41175 10 90  0  0
14  0      0 3711872  95192  92932    0    0     0     0    16 41886 11 89  0  0
13  0      0 3711824  95192  92932    0    0     0    30     3 44299 11 89  0  0
13  0      0 3711952  95192  92932    0    0     0     0     5 77821  9 91  0  0
--->>> pktgen starts
13  0      0 3711888  95192  92932    0    0     0    16 27057 74014  7 93  0  0
12  0      0 3712016  95192  92932    0    0     0     0 34001 36483  8 92  0  0
13  1      0 3711952  95192  92932    0    0     0    28 31673 35748  8 92  0  0
14  0      0 3711952  95192  92932    0    0     0     2 32145  1411  9 91  0  0
14  0      0 3711952  95192  92932    0    0     0     0 32166  1231  9 91  0  0
14  0      0 3712088  95192  92932    0    0     0     0 32157  2597  9 91  0  0
13  0      0 3711960  95192  92932    0    0     0     0 32130  1210  9 91  0  0
13  1      0 3711832  95192  92932    0    0     0    28 32095  1182  9 91  0  0
13  0      0 3712088  95192  92932    0    0     0    68 33023 11742  9 91  0  0
14  0      0 3711888  95192  92932    0    0     0     0 32460  1279  9 91  0  0
14  0      0 3712080  95192  92932    0    0     0    30 32126  1908  9 91  0  0
14  0      0 3711952  95192  92932    0    0     0     0 32736 22616  8 92  0  0
*** I expected a more significant drop in interrupts/sec
_conclusion_
Is there any way we can configure how soon or how often to be polled
(called)? For 100Mb speeds, at worst we can get about one 64-byte packet
every 9us (if I did my math right), and since the processors don't take
that long to schedule NAPI, process a packet, and handle an interrupt, we
just overload the system with interrupts going into and out of NAPI
mode. In this case I only have one adapter getting scheduled.
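For reference, the driver contract in question looks roughly like this - a
minimal sketch of the 2.6 NAPI pattern, not the actual e100 code, and the
mydrv_* helpers are hypothetical stand-ins for device-specific register
accesses:

static irqreturn_t mydrv_intr(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = dev_id;

        if (netif_rx_schedule_prep(dev)) {
                mydrv_irq_disable(dev);      /* PIO: mask chip interrupts */
                __netif_rx_schedule(dev);    /* queue dev for NET_RX_SOFTIRQ */
        }
        return IRQ_HANDLED;
}

static int mydrv_poll(struct net_device *dev, int *budget)
{
        int quota = min(*budget, dev->quota);
        int work = mydrv_clean_rx_ring(dev, quota);  /* up to quota packets */

        *budget -= work;
        dev->quota -= work;

        if (work < quota) {                  /* ring drained: leave poll mode */
                netif_rx_complete(dev);
                mydrv_irq_enable(dev);       /* PIO: unmask chip interrupts */
                return 0;
        }
        return 1;                            /* more work: stay on poll list */
}

At ~2 packets per interrupt the whole mask/schedule/poll/unmask cycle runs
for nearly every interrupt, which is where the extra PIOs and context
switches come from.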
Suggestions? Help? Want me to do more tests?
Thanks for sticking with me...
Jesse
--
Jesse Brandeburg
* Re: NAPI, e100, and system performance problem
From: jamal @ 2005-04-18 12:14 UTC
To: Brandeburg, Jesse; +Cc: netdev
Jesse,
I doubt your problem has to do with interrupt rates.
If you say the CPU is "fast", then the problem is elsewhere.
A fast machine that can't handle 77K interrupts/sec would be a pathetic
one.
It could be the PCI IO rate, which such a fast system shouldn't have
issues handling.
You say it can process up to two packets/interrupt, so that's
pretty fast. Actually it could be an e100 issue - try replacing the NIC
with an e1000, repeat your tests, and see if you observe the same
issues.
If you are doing the netperf tests then collect netstat -s output;
/proc/net/softnet_stat is another useful stat to look at.
cheers,
jamal
On Sun, 2005-17-04 at 23:11 -0700, Brandeburg, Jesse wrote:
> _Summary_
> As part owner of the e100 driver, I've had this issue nagging at me
> for a while. NAPI seems to be able to swamp a system with interrupts
> and context switches. In this case the system does not respond to
> having zero CPU cycles available by polling more, as I figured it would.
> This was mostly based on observation, but I included some quick data
> gathering for this mail.
>
* NAPI, e100, and system performance problem
From: Robert Olsson @ 2005-04-18 15:36 UTC
To: Brandeburg, Jesse; +Cc: netdev
Brandeburg, Jesse writes:
> _Summary_
> As part owner of the e100 driver, I've had this issue nagging at me
> for a while. NAPI seems to be able to swamp a system with interrupts
> and context switches. In this case the system does not respond to
> having zero CPU cycles available by polling more, as I figured it would.
> This was mostly based on observation, but I included some quick data
> gathering for this mail.
Hello!
Well, to sort out some things: NAPI addresses just the balance between
irqs and softirqs, nothing else. In the balance between userland and
softirqs we pray that softirqs behave well; in some cases
they are deferred to ksoftirqd to balance softirq/user time.
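(For reference, the deferral works roughly like this - paraphrased from
2.6 kernel/softirq.c, with the per-softirq handler loop elided:)

#define MAX_SOFTIRQ_RESTART 10

asmlinkage void __do_softirq(void)
{
        int max_restart = MAX_SOFTIRQ_RESTART;
        __u32 pending = local_softirq_pending();

        do {
                /* ... run the handler for each pending softirq bit,
                 *     including NET_RX_SOFTIRQ -> net_rx_action() ... */
        } while ((pending = local_softirq_pending()) && --max_restart);

        if (pending)
                wakeup_softirqd();      /* punt the rest to ksoftirqd, which
                                         * competes fairly with userland */
}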
> With the current e100 driver in NAPI mode the adapter can generate
> upwards of 36000 interrupts a second from only receiving 77000
> packets/second (pktgen traffic)
You have a very fast CPU and a slow device. The interrupt handler takes
two packets, etc. Same for the NAPI/non-NAPI path. Not bad really. Probably
you had interrupt delay with non-NAPI, so the interrupt rate was a bit
lower; this is OK with NAPI too if you like it.
> Is there any way we can configure how soon or how often to be polled
> (called)? For 100Mb speeds, at worst we can get about one 64-byte packet
> every 9us (if I did my math right), and since the processors don't take
> that long to schedule NAPI, process a packet, and handle an interrupt, we
> just overload the system with interrupts going into and out of NAPI
> mode. In this case I only have one adapter getting scheduled.
Interrupt delay is the most straightforward... You can play with a
driver timer: balance the timer and the RX ring size so the ring does not
overflow, and register for poll when the timer function has received
packets. You would see no interrupts at all... Could be fun, but it feels
like sub-optimizing. No, I think at these levels interrupts are our friends.
Cheers.
--ro
* Re: NAPI, e100, and system performance problem
From: Arthur Kepner @ 2005-04-18 16:55 UTC
To: Brandeburg, Jesse; +Cc: netdev
On Sun, 17 Apr 2005, Brandeburg, Jesse wrote:
> ......
> _conclusion_
>
> Is there any way we can configure how soon or how often to be polled
> (called)? For 100Mb speeds, at worst we can get about one 64-byte packet
> every 9us (if I did my math right), and since the processors don't take
> that long to schedule NAPI, process a packet, and handle an interrupt, we
> just overload the system with interrupts going into and out of NAPI
> mode. In this case I only have one adapter getting scheduled.
>
I'll just chime in to say that I've seen similar behavior
(but with a very different system).
The problem with NAPI is (quoting a co-worker) that it
relies on an "accident of timing".
If the CPU speed, time to complete a PIO, and the inter-
packet arrival time are "just so", then a system (or at
least one of its CPUs) can be kept very busy even when
it's not receiving data from the network at a particularly
high rate.
The simple feedback mechanism used by NAPI is good at
balancing latency and throughput, but it doesn't have
any way of recognizing when system resources are being
poorly utilized. It would be nice if interrupt coalescence
could be used by NAPI (or maybe by NAPI_2?) in this sort
of situation.
--
Arthur
* Re: NAPI, e100, and system performance problem
From: Robert Olsson @ 2005-04-18 19:34 UTC
To: Arthur Kepner; +Cc: Brandeburg, Jesse, netdev
Arthur Kepner writes:
>
> On Sun, 17 Apr 2005, Brandeburg, Jesse wrote:
>
> > ......
> > _conclusion_
> >
> > Is there any way we can configure how soon or how often to be polled
> > (called)? For 100Mb speeds, at worst we can get about one 64-byte packet
> > every 9us (if I did my math right), and since the processors don't take
> > that long to schedule NAPI, process a packet, and handle an interrupt, we
> > just overload the system with interrupts going into and out of NAPI
> > mode. In this case I only have one adapter getting scheduled.
> >
>
> I'll just chime in to say that I've seen similar behavior,
> (but with a very different system.)
>
> The problem with NAPI is (quoting a co-worker) that it
> relies on an "accident of timing".
>
> If the CPU speed, time to complete a PIO, and the inter-
> packet arrival time are "just so", then a system (or at
> least one of its CPUs) can be kept very busy even when
> it's not receiving data from the network at a particularly
> high rate.
>
> The simple feedback mechanism used by NAPI is good at
> balancing latency and throughput, but it doesn't have
> any way of recognizing when system resources are being
> poorly utilized. It would be nice if interrupt coalescence
> could be used by NAPI (or maybe by NAPI_2?) in this sort
> of situation.
Well, there is nothing that prohibits using interrupt coalescing with
NAPI as is. As a matter of fact there are many drivers doing this, e.g. e1000
and tulip. You trade latency for more packets per interrupt.
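From userland the knobs are the standard ethtool coalescing parameters; a
minimal sketch of setting them through the SIOCETHTOOL ioctl (whether a
given driver honors them depends on its ethtool support, and the interface
name and values here are only illustrative):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
        struct ethtool_coalesce ec;
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&ec;

        memset(&ec, 0, sizeof(ec));
        ec.cmd = ETHTOOL_GCOALESCE;             /* read current settings */
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
                perror("ETHTOOL_GCOALESCE");

        ec.cmd = ETHTOOL_SCOALESCE;
        ec.rx_coalesce_usecs = 100;             /* wait up to 100us ...     */
        ec.rx_max_coalesced_frames = 16;        /* ... or 16 frames per irq */
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
                perror("ETHTOOL_SCOALESCE");

        close(fd);
        return 0;
}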
Cheers.
--ro
* Re: NAPI, e100, and system performance problem
From: jamal @ 2005-04-18 20:26 UTC
To: Arthur Kepner; +Cc: Brandeburg, Jesse, netdev
On Mon, 2005-18-04 at 09:55 -0700, Arthur Kepner wrote:
> On Sun, 17 Apr 2005, Brandeburg, Jesse wrote:
>
> > ......
> > _conclusion_
> >
> > Is there any way we can configure how soon or how often to be polled
> > (called)? For 100Mb speeds, at worst we can get about one 64-byte packet
> > every 9us (if I did my math right), and since the processors don't take
> > that long to schedule NAPI, process a packet, and handle an interrupt, we
> > just overload the system with interrupts going into and out of NAPI
> > mode. In this case I only have one adapter getting scheduled.
> >
>
> I'll just chime in to say that I've seen similar behavior,
> (but with a very different system.)
>
It would _help_ a great deal if people collected data and posted it.
What was this other system? Was it running e100 as well? etc etc.
> The problem with NAPI is (quoting a co-worker) that it
> relies on an "accident of timing".
>
geez, that almost sounds like an insult. spank your coworker for me with
something sharp (i hope s/he doesn't enjoy it)
> If the CPU speed, time to complete a PIO, and the inter-
> packet arrival time are "just so", then a system (or at
> least one of its CPUs) can be kept very busy even when
> it's not receiving data from the network at a particularly
> high rate.
>
again, it would help if you actually provided data - this sounds like
hand-waving to me.
The only known thing we are aware of is that if you have slow PIO at low
packet rates you will chew more CPU - the reason is simple: if you do 1
packet per interrupt you will have one more IO per packet than if you
didn't use NAPI; at about 2.5 packets or so you start benefiting from the
amortization of IO. This issue has been discussed many, many times on
netdev. There has been no good reason thus far to complicate things by
"fixing" it. Go to MSI-capable NICs if you are really concerned.
> The simple feedback mechanism used by NAPI is good at
> balancing latency and throughput, but it doesn't have
> any way of recognizing when system resources are being
> poorly utilized.
How do you recognize when system resources are being poorly utilized?
If you know the answer to that, then you need to fix things at the
softirq scheduling level - not in NAPI (that would be the wrong spot).
> It would be nice if interrupt coalescence
> could be used by NAPI (or maybe by NAPI_2?) in this sort
> of situation.
>
You can add coalescing, as is doable with e1000 for example - but that
shouldn't solve the problem you allude to - that of "system resources
being poorly utilized".
cheers,
jamal
PS:- if you want to help please post data.
* Re: NAPI, e100, and system performance problem
From: Greg Banks @ 2005-04-19 5:55 UTC
To: jamal; +Cc: Arthur Kepner, Brandeburg, Jesse, netdev
On Mon, Apr 18, 2005 at 04:26:07PM -0400, jamal wrote:
> On Mon, 2005-18-04 at 09:55 -0700, Arthur Kepner wrote:
> > I'll just chime in to say that I've seen similar behavior,
> > (but with a very different system.)
>
> It would _help_ a great deal if people collect data and post.
> What was this other system? Was it running e100 as well? etc etc
http://marc.theaimsgroup.com/?l=linux-netdev&m=107183822710263&w=2
> > The problem with NAPI is (quoting a co-worker) that it
> > relies on an "accident of timing".
> >
>
> geez, that almost sounds like an insult. spank your coworker for me with
> something sharp (i hope s/he doesnt enjoy it)
No, thank you. Maybe next time.
> How do you recognize when system resources are being poorly utilized?
An inordinate amount of CPU is being spent running around polling the
device instead of dealing with the packets in IP, TCP and NFS land.
By inordinate, we mean twice as much or more cpu% than a MIPS/Irix
box with slower CPUs.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
* Re: NAPI, e100, and system performance problem
From: David S. Miller @ 2005-04-19 18:36 UTC
To: Greg Banks; +Cc: hadi, akepner, jesse.brandeburg, netdev
On Tue, 19 Apr 2005 15:55:35 +1000
Greg Banks <gnb@sgi.com> wrote:
> > How do you recognize when system resources are being poorly utilized?
>
> An inordinate amount of CPU is being spent running around polling the
> device instead of dealing with the packets in IP, TCP and NFS land.
> By inordinate, we mean twice as much or more cpu% than a MIPS/Irix
> box with slower CPUs.
You haven't answered the "how" yet.
What tools did you run, what did those tools attempt to measure, and
what results did those tools output for you so that you could determine
your conclusions with such certainty?
* NAPI and CPU utilization [was: NAPI, e100, and system performance problem]
From: Arthur Kepner @ 2005-04-19 20:38 UTC
To: David S. Miller; +Cc: Greg Banks, hadi, jesse.brandeburg, netdev
[Modified the subject line in order to not distract from
Jesse's original thread.]
On Tue, 19 Apr 2005, David S. Miller wrote:
> On Tue, 19 Apr 2005 15:55:35 +1000
> Greg Banks <gnb@sgi.com> wrote:
>
> > > How do you recognize when system resources are being poorly utilized?
> >
> > An inordinate amount of CPU is being spent running around polling the
> > device instead of dealing with the packets in IP, TCP and NFS land.
> > By inordinate, we mean twice as much or more cpu% than a MIPS/Irix
> > box with slower CPUs.
>
> You haven't answered the "how" yet.
>
> What tools did you run, what did those tools attempt to measure, and
> what results did those tools output for you so that you could determine
> your conclusions with such certainty?
>
I'm answering for myself, not Greg.
Much of the data is essentially from "/proc/". (We use a nice tool
called PCP to gather the data, but that's where PCP gets it, for the
most part.) But I've used several other tools to gather corroborating
data, including the "kernprof" patch, "q-tools", and an ad-hoc patch
which used "get_cycles()" to time things which were happening while
interrupts were disabled.
The data acquired with all these show that NAPI causes relatively
few packets to be processed per interrupt, so that expensive PIOs
are relatively poorly amortized over a given amount of input from
the network. (When I use "relative(ly)" above, I mean relative to
what we see when using Greg's interrupt coalescence patch from
http://marc.theaimsgroup.com/?l=linux-netdev&m=107183822710263&w=2)
For example, following is a comparison of the number of packets
processed per interrupt and CPU utilization using NAPI and Greg's
interrupt coalescence patch.
This data is pretty old by now; it was gathered using 2.6.5 on an
Altix with 1GHz CPUs using the tg3 driver doing bulk data transfer
with nttcp. (I'm eyeballing the data from a set of graphs, so
it's rough...)
Link util [%]    Packets/Interrupt        CPU utilization [%]
                 NAPI     Intr Coal.      NAPI     Intr Coal.
---------------------------------------------------------------
     25            2         3.5            45        17
     40            4         6              52        30
     50            4         6              60        36
     60            4         7              75        41
     70            6        10              80        36
     80            6        16              90        40
     85            7        16             100        45
    100            -        17               -        50
I know more recent kernels have somewhat better performance,
(http://marc.theaimsgroup.com/?l=linux-netdev&m=109848080827969&w=2
helped, for one thing.)
The reason that CPU utilization is so high with NAPI is that
we're spinning, waiting for PIOs to flush (this can take an
impressively long time when the PCI bus/bridge is busy).
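The flush in question is the usual posted-write read-back; a generic
sketch (the register name and mask are hypothetical, not tg3's):

/* The readl() forces the writel() out of the host bridge's posting
 * FIFOs, and the CPU stalls until the read completes - that stall is
 * the "spinning" described above. */
static void mydrv_irq_enable(void __iomem *ioaddr, u32 mask)
{
        writel(mask, ioaddr + MYDRV_INTR_MASK);
        (void)readl(ioaddr + MYDRV_INTR_MASK);  /* flush; CPU blocks here */
}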
I guess that some of us (at SGI) have seen this so often as
a bottleneck that we're surprised that it's not more generally
recognized as a problem, er, uh, "opportunity for improvement".
--
Arthur
* Re: NAPI and CPU utilization [was: NAPI, e100, and system performance problem]
From: Rick Jones @ 2005-04-19 20:52 UTC
To: Arthur Kepner; +Cc: David S. Miller, Greg Banks, hadi, jesse.brandeburg, netdev
One issue/caveat/whatever on those coalescing parms and such: watch out for
their effect on request/response performance on stuff like netperf TCP_RR.
rick jones
* Re: NAPI and CPU utilization [was: NAPI, e100, and system performance problem]
From: David S. Miller @ 2005-04-19 21:09 UTC
To: Arthur Kepner; +Cc: gnb, hadi, jesse.brandeburg, netdev
Please don't expect us to take the table of numbers seriously when it's
against a more-than-year-old kernel and without a patch which you know,
and even explicitly mention, makes this case perform significantly
better.
I understand your basic premise, but you're asking us to evaluate
known stale performance data.
Can you please get numbers against current 2.6.x kernels?
It is likely that people don't care so much about this because
on most systems the PIO overhead is not so pronounced as it is
on your enormous Altix systems. On PCI-E and the onboard Intel
e1000 stuff, PIOs are incredibly cheap (Robert Olsson posted some
nice numbers not too long ago).
So what I'm saying is that the numbers probably don't fall much
this way on non-Altix systems. But that doesn't mean we shouldn't
start using hw coalescing a little bit even with NAPI, I think
we should.
* Re: NAPI, e100, and system performance problem
From: jamal @ 2005-04-20 15:15 UTC
To: Greg Banks; +Cc: David S. Miller, akepner, jesse.brandeburg, netdev
On Thu, 2005-21-04 at 00:56 +1000, Greg Banks wrote:
> We have a stats package called PCP (see oss.sgi.com) which samples
> all kinds of stuff out of /proc at a configurable polling frequency,
> default 2 sec, and provides scrolling graphs. We've also done some
> profiling work using the SGI kernprof patch in 2.4 kernels and
> oprofile in 2.6 kernels.
>
this may not be sufficient to debug; that PCP sounds like a hog in its
own right, polling /proc.
Actually, let's start by saying this:
If your problem is PIO being too expensive on your machines, then the
solution may be for you to set coalescing parameters appropriately. This
is a known issue - "fixing NAPI" requires complicating things for the
majority who don't have the same problem as you.
The issue pointed out by Rick Jones, that you sacrifice latency, is still
valid. Additionally, with many NICs in place, coalescing is not gonna cut
it.
Having said that - here are some items that will be useful to collect
before and after a run:
- netstat -s output
- /proc/net/softnet_stat
- ifconfig output
- tc -s qdisc on the interfaces
- oprofile
- any other thing you could come up with, like some of the stuff you
posted recently on how many packets/interrupt are processed with and
without NAPI.
- preferably run UDP tests so we don't have to think hard about stats
like retransmits etc.
- And as pointed out by Dave, pick the latest kernel.
cheers,
jamal
* Re: NAPI, e100, and system performance problem
From: Andi Kleen @ 2005-04-22 11:36 UTC
To: Greg Banks; +Cc: Arthur Kepner, Brandeburg, Jesse, netdev, davem
Greg Banks <gnb@sgi.com> writes:
>
> An inordinate amount of CPU is being spent running around polling the
> device instead of dealing with the packets in IP, TCP and NFS land.
> By inordinate, we mean twice as much or more cpu% than a MIPS/Irix
> box with slower CPUs.
We have seen similar behaviour. With NAPI some benchmarks run
a lot slower than with a driver on the same hardware/NIC without NAPI.
This can even be observed with simple tests like netperf single stream
between two boxes.
There also seem to be some problems with bidirectional traffic, although
I have not fully tracked them down to NAPI yet.
There is definitely some problem in NAPI land ...
-Andi
* Re: NAPI, e100, and system performance problem
From: jamal @ 2005-04-22 12:33 UTC
To: Andi Kleen; +Cc: Greg Banks, Arthur Kepner, Brandeburg, Jesse, netdev, davem
On Fri, 2005-22-04 at 13:36 +0200, Andi Kleen wrote:
> Greg Banks <gnb@sgi.com> writes:
> >
> > An inordinate amount of CPU is being spent running around polling the
> > device instead of dealing with the packets in IP, TCP and NFS land.
> > By inordinate, we mean twice as much or more cpu% than a MIPS/Irix
> > box with slower CPUs.
>
> We have seen similar behaviour. With NAPI some benchmarks run
> a lot slower than on a driver on the same hardware/NIC without NAPI.
They should not run slower - but they may consume more CPU.
> This can be even observed with simple tests like netperf single stream
> between two boxes.
>
Yes, slow traffic coming into the system will chew more CPU if you have
a fast CPU ;-> You should know this Andi, but let me explain the reason
for about the 100th time:
If your CPU is fast enough that the interrupt only causes 1-2
packets to be processed before re-enabling, then you will have anywhere
between 1-2 extra IO calls (recall disabling/re-enabling interrupts) -
this will be noticed in the CPU % chewed. It shouldn't really be a lot of CPU.
If you use non-NAPI then you don't have the extra IO happening and
therefore you will see less CPU used at those rates.
OTOH, try more than one netperf and it becomes beneficial to have NAPI
around.
Yes, for low rates this will make some of those benchmarks look bad.
But for such cases you can use mitigation in addition to NAPI - except
your latency in the benchmark will now look bad ;->
> There seems to be also some problems with bidirectional traffic, although
> I have not fully tracked them down to NAPI yet.
>
I doubt it has to do with NAPI.
> There is definitely some problem in NAPI land ...
>
this is a design choice - a solution could be created to "fix" this but
that hasn't happened, because there has not been a good reason to complicate
things. The people who are bitching about this are benchmarkers who want
to win at both high and low rates, and they are not happy because while
they can win at high rates, they can't at low rates.
cheers,
jamal
* Re: NAPI, e100, and system performance problem
From: Robert Olsson @ 2005-04-22 14:52 UTC
To: Andi Kleen; +Cc: Greg Banks, Arthur Kepner, Brandeburg, Jesse, netdev, davem
Andi Kleen writes:
> We have seen similar behaviour. With NAPI some benchmarks run
> a lot slower than on a driver on the same hardware/NIC without NAPI.
> This can be even observed with simple tests like netperf single stream
> between two boxes.
>
> There seems to be also some problems with bidirectional traffic, although
> I have not fully tracked them down to NAPI yet.
>
> There is definitely some problem in NAPI land ...
Well, NAPI enforces very little policy; it leaves a lot of freedom for
driver writers. Driver design, i.e. doing work in the interrupt or in the
softirq, use of interrupt mitigation, etc. It's a minimal approach to
solving some very severe problems we had with the networking stack, at a
time when the Linux OS was not an option at all. I know a bit about the
background, as we experimented quite a bit with possible options. Alexey
did the final kernel design; it got very well integrated into the kernel
and the Linux softirq model, and Dave understood it directly and included
the framework.
So help us sort out the problems. And of course there are some differences
or "issues"; as we know, every design has its trade-offs, and as Jamal said
you can't optimize at both ends. Or help us replace it with something that
solves the same problems even better.
Cheers.
--ro
BTW, we talked with the Intel folks about leaving the irq disabled when
reading the ISR and some status bits. This can save some PCI accesses; I
don't know if any experiments have been done. MSI is also interesting in
this aspect...
* Re: NAPI, e100, and system performance problem
From: jamal @ 2005-04-22 15:37 UTC
To: Robert Olsson
Cc: Andi Kleen, Greg Banks, Arthur Kepner, Brandeburg, Jesse, netdev,
davem
On Fri, 2005-22-04 at 16:52 +0200, Robert Olsson wrote:
>
> BTW, we talked with the Intel folks about leaving the irq disabled when
> reading the ISR and some status bits.
You caught me just before I said this ;->
Now - if those Intel folks listened they would be making a killing with
their NICs on Linux. They have a paper at OLS about how they get
influenced by the Linux people on how to improve their hardware - so
maybe they took the advice ;-> Jesse?
Note, this would entirely solve what Andi and the SGI people are talking
about.
cheers,
jamal
* Re: NAPI, e100, and system performance problem
From: Andi Kleen @ 2005-04-22 17:21 UTC
To: jamal; +Cc: Greg Banks, Arthur Kepner, Brandeburg, Jesse, netdev, davem
On Fri, Apr 22, 2005 at 08:33:15AM -0400, jamal wrote:
> On Fri, 2005-22-04 at 13:36 +0200, Andi Kleen wrote:
> > Greg Banks <gnb@sgi.com> writes:
> > >
> > > An inordinate amount of CPU is being spent running around polling the
> > > device instead of dealing with the packets in IP, TCP and NFS land.
> > > By inordinate, we mean twice as much or more cpu% than a MIPS/Irix
> > > box with slower CPUs.
> >
> > We have seen similar behaviour. With NAPI some benchmarks run
> > a lot slower than on a driver on the same hardware/NIC without NAPI.
>
> They should not run slower - but they may consume more CPU.
They actually run slower.
Now before David complains: this was with old 2.6 kernels and I don't have
time right now to rerun the benchmarks, but at least I don't think
there was ever any patch addressing these issues.
>
> > This can be even observed with simple tests like netperf single stream
> > between two boxes.
> >
>
> Yes, slow traffic coming into the system would chew more CPU if you have
> a fast CPU ;-> You should know this Andi, but let me explain the reason
> for about the 100th time:
No, the performance of the data transfer was actually slower. CPU time
was not the problem; Opterons have enough of that ...
> > this is a design choice - a solution could be created to "fix" this but
> > that hasn't happened, because there has not been a good reason to complicate
> > things. The people who are bitching about this are benchmarkers who want
> > to win at both high and low rates, and they are not happy because while
> > they can win at high rates, they can't at low rates.
My impression is that NAPI seems to be optimized for a rather
obscure workload (routing), while it does not seem to be that
great on the far more common server/client type workloads.
If that was a design choice then it was a bad design.
-Andi
* Re: NAPI, e100, and system performance problem
From: Andi Kleen @ 2005-04-22 17:22 UTC
To: jamal
Cc: Robert Olsson, Greg Banks, Arthur Kepner, Brandeburg, Jesse,
netdev, davem
> Note, this would entirely solve what Andi and the SGI people are talking
> about.
Perhaps, but Linux has to perform well on old hardware too.
New silicon is not a solution.
-Andi
* Re: NAPI, e100, and system performance problem
From: jamal @ 2005-04-22 18:18 UTC
To: Andi Kleen; +Cc: Greg Banks, Arthur Kepner, Brandeburg, Jesse, netdev, davem
On Fri, 2005-22-04 at 19:21 +0200, Andi Kleen wrote:
> On Fri, Apr 22, 2005 at 08:33:15AM -0400, jamal wrote:
[..]
> > They should not run slower - but they may consume more CPU.
>
> They actually run slower.
>
Why do they run slower? There could be 1000 other variables involved?
What is it that makes you so sure it is NAPI?
I know you are capable of proving it is NAPI - please do so.
> Now before David complains: this was with old 2.6 kernels and I don't have
> time right now to rerun the benchmarks, but at least I don't think
> there was ever any patch addressing these issues.
>
It would be helpful if you use new kernels of course - that reduces the
number of variables to look at.
> > this is a design choice - a solution could be created to "fix" this but
> > that hasn't happened, because there has not been a good reason to complicate
> > things. The people who are bitching about this are benchmarkers who want
> > to win at both high and low rates, and they are not happy because while
> > they can win at high rates, they can't at low rates.
>
> My impression is that NAPI seems to be more optimized for a rather
> obscure work load (routing), while it does not seem to be that
> great on the far more common server/client type workloads.
> If that was a design choice then it was a bad design.
>
There is only one complaint I have ever heard about NAPI, and it is about
low rates: it consumes more CPU at very low rates. "Very low rates"
depends on how fast your CPU can process at any given time. Refer to my
earlier email. Are you saying low rates are a common load?
The choices are: a) at high rates you die, or b) at _very low_ rates
you consume more CPU (3-6% more depending on your system).
Logic says let's choose a). You could overcome b) by turning on
mitigation at the expense of latency. We could "fix" it at the cost of
making the whole state machine complex - which would be defeating
"optimize for the common".
You are the first person I have heard say that NAPI would be slower
in terms of throughput or latency at low rates. My experience is there
is no difference between the two at low input rates. It would be
interesting to see the data.
>> Note, this would entirely solve what Andi and the SGI people are
>> talking about.
>
> Perhaps, but Linux has to perform well on old hardware too.
> New silicon is not a solution.
>
I don't see any reason to "fix" anything (note my use of quotes) on old
hardware. You have a workaround.
OTOH, provide data to prove otherwise - we all want Linux to be the best.
cheers,
jamal
* Re: NAPI, e100, and system performance problem
From: Andi Kleen @ 2005-04-22 18:30 UTC
To: jamal; +Cc: Greg Banks, Arthur Kepner, Brandeburg, Jesse, netdev, davem
On Fri, Apr 22, 2005 at 02:18:22PM -0400, jamal wrote:
> On Fri, 2005-22-04 at 19:21 +0200, Andi Kleen wrote:
> > On Fri, Apr 22, 2005 at 08:33:15AM -0400, jamal wrote:
> [..]
> > > They should not run slower - but they may consume more CPU.
> >
> > They actually run slower.
> >
>
> Why do they run slower? There could be 1000 other variables involved?
> What is it that makes you so sure it is NAPI?
> I know you are capable of proving it is NAPI - please do so.
We tested back then by downgrading to an older non-NAPI tg3 driver,
and it made the problem go away :) The Broadcom bcm57xx driver, which
did not support NAPI at the time, was also much faster.
> > Now before David complains: this was with old 2.6 kernels and I don't have
> > time right now to rerun the benchmarks, but at least I don't think
> > there was ever any patch addressing these issues.
> >
>
> It would be helpful if you use new kernels of course - that reduces the
> number of variables to look at.
These were customers, who use certified SLES kernels.
> There is only one complaint I have ever heard about NAPI and it is about
> low rates: It consumes more CPU at very low rates. Very low rates
It was not only more CPU usage, but actually slower network performance
on systems with plenty of CPU power.
Also, I doubt the workloads Jesse and Greg/Arthur/SGI saw had issues
with CPU power (can you guys confirm?)
> You are the first person i have heard that says NAPI would be slower
> in terms of throughput or latency at low rates. My experiences is there
> is no difference between the two at low input rate. It would be
> interesting to see the data.
Well, did you ever test a non-routing workload?
-Andi
* Re: NAPI, e100, and system performance problem
From: Arthur Kepner @ 2005-04-22 18:37 UTC
To: Andi Kleen; +Cc: jamal, Greg Banks, Brandeburg, Jesse, netdev, davem
On Fri, 22 Apr 2005, Andi Kleen wrote:
> ....
> Also I doubt the workload Jesse and Greg/Arthur/SGI saw also had issues
> with CPU power (can you guys confirm?)
>
The problem that we've had is that flushing PIOs can take a long
time. So the CPU can spend more time spinning, waiting for the PIOs
to flush, than doing useful work.
I'm going to try to get some new (2.6.12-rc2) data this weekend.
--
Arthur
* Re: NAPI, e100, and system performance problem
From: David S. Miller @ 2005-04-22 18:52 UTC
To: Arthur Kepner; +Cc: ak, hadi, gnb, jesse.brandeburg, netdev, davem
On Fri, 22 Apr 2005 11:37:17 -0700 (PDT)
Arthur Kepner <akepner@sgi.com> wrote:
> I'm going to try to get some new (2.6.12-rc2) data this weekend.
That sounds great. See the other email I posted (you were CC:'d)
where I show a patch to tg3 to minimally enable some hw interrupt
coalescing which might help. It would be great to see what that
kind of change does for your tests.
Thanks.
* Re: NAPI, e100, and system performance problem
From: jamal @ 2005-04-22 19:01 UTC
To: Andi Kleen; +Cc: Greg Banks, Arthur Kepner, Brandeburg, Jesse, netdev, davem
On Fri, 2005-22-04 at 20:30 +0200, Andi Kleen wrote:
> On Fri, Apr 22, 2005 at 02:18:22PM -0400, jamal wrote:
> > On Fri, 2005-22-04 at 19:21 +0200, Andi Kleen wrote:
> > Why do they run slower? There could be 1000 other variables involved?
> > What is it that makes you so sure it is NAPI?
> > I know you are capable of proving it is NAPI - please do so.
>
> We tested back then by downgrading to an older non NAPI tg3 driver
> and it made the problem go away :) The broadcom bcm57xx driver which
> did not support NAPI at this time was also much faster.
>
Don't mean to make this into a meaningless debate - but have you thought
of the fact that maybe it could be a driver bug in the NAPI case?
The e1000 NAPI path had a serious bug since day one that was only recently
fixed (I think Robert provided the fix - but the Intel folks made the
release).
> > It would be helpful if you use new kernels of course - that reduces the
> > number of variables to look at.
>
> It was customers who use certified SLES kernels.
That makes it hard.
You understand that there could be other issues - that's why it's safer to
just ask for the latest kernel.
> > There is only one complaint I have ever heard about NAPI and it is about
> > low rates: It consumes more CPU at very low rates. Very low rates
>
> It was not only more CPU usage, but actually slower network performance
> on systems with plenty of CPU power.
>
This is definitely a new thing nobody has mentioned before.
Whatever difference there is would not be that much.
> Also I doubt the workload Jesse and Greg/Arthur/SGI saw also had issues
> with CPU power (can you guys confirm?)
>
The SGI folks seem to be on their way to collect some serious data.
So that should help.
> > You are the first person i have heard that says NAPI would be slower
> > in terms of throughput or latency at low rates. My experiences is there
> > is no difference between the two at low input rate. It would be
> > interesting to see the data.
>
> Well, did you ever test a non routing workload?
>
I did extensive tests with UDP because it was easier to analyze as well
as to pump data at. I did some TCP tests with many connections, but the
heuristics of retransmits, congestion control etc. made it harder to
analyze.
Even a bulk type of workload on a TCP server at gigabit rate
isn't that interesting - those tend to go at MTU packet size, which
is typically less than 90Kpps worst case.
With a simple UDP sink server that just swallowed packets it was easier.
I could send it 1Mpps and exercise that path. So this is where
I did most of the testing as far as non-routing paths - Robert is the
other person you could ask this question.
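(A UDP sink of that sort is tiny - something like this, the port being
arbitrary:)

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
        char buf[2048];
        struct sockaddr_in sa;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_ANY);
        sa.sin_port = htons(9999);
        if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
                perror("bind");
                return 1;
        }
        for (;;)
                recv(fd, buf, sizeof(buf), 0);  /* swallow and discard */
}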
Very interesting observation to note in the case of UDP: at some point
on my slow machine, at 100Kpps that NAPI was able to keep up with, the
socket queue got overloaded and packets started dropping at the socket
queue.
I did provide patches to have feedback that goes all the way to the
driver level; however, interpreting these feedback codes is hard. Look at
the comments in dev.c from Alexey to understand why this is hard ;->
The comments read as follows:
------
/* Jamal, now you will not able to escape explaining
* me how you were going to use this. :-)
*/
-------
That comment serves as a reminder to revisit this. About every time I see
it, I go back and look at my notes. So the challenge is still on the table.
The old non-NAPI code was never able to stress the socket code the way
NAPI does because the system simply died - so this was never seen.
Things like the classical lazy receiver processing would have been
useful to implement - but very hard to do in Linux.
Back to my comments: we need to analyze why something is happening.
Simply saying "it's NAPI" is wrong.
cheers,
jamal
* Re: NAPI, e100, and system performance problem
From: David S. Miller @ 2005-04-22 19:07 UTC
To: hadi; +Cc: ak, gnb, akepner, jesse.brandeburg, netdev, davem
On Fri, 22 Apr 2005 15:01:27 -0400
jamal <hadi@cyberus.ca> wrote:
> Don't mean to make this into a meaningless debate - but have you thought
> of the fact that maybe it could be a driver bug in the NAPI case?
> The e1000 NAPI path had a serious bug since day one that was only recently
> fixed (I think Robert provided the fix - but the Intel folks made the
> release).
True, but really Jamal I think a lot of this has to do with
not doing a small amount of hw coalescing even when doing NAPI.
Let's get people testing changes like that to see if it undoes
the bad cases. I want to do something proactive with these
reports instead of just asking for more performance data like
a bunch of crazed lunatics :-)
* Re: NAPI, e100, and system performance problem
From: jamal @ 2005-04-22 19:21 UTC
To: David S. Miller; +Cc: ak, gnb, akepner, jesse.brandeburg, netdev, davem
On Fri, 2005-22-04 at 12:07 -0700, David S. Miller wrote:
> Let's get people testing changes like that to see if it undoes
> the bad cases. I want to do something proactive with these
> reports instead of just asking for more performance data like
> a bunch of crazed lunatics :-)
But we are crazed lunatics ;->
I have a feeling that the patch will solve the problem and we won't have
the fun of seeing interesting data.
cheers,
jamal
* Re: NAPI, e100, and system performance problem
From: Greg Banks @ 2005-04-22 23:28 UTC
To: jamal
Cc: Andi Kleen, Greg Banks, Arthur Kepner, Brandeburg, Jesse, netdev,
davem
On Fri, Apr 22, 2005 at 02:18:22PM -0400, jamal wrote:
> On Fri, 2005-22-04 at 19:21 +0200, Andi Kleen wrote:
> > On Fri, Apr 22, 2005 at 08:33:15AM -0400, jamal wrote:
> [..]
> > > They should not run slower - but they may consume more CPU.
> >
> > They actually run slower.
> >
IIRC I saw a similar but very small effect on Altix hardware about 18
months ago, but I'm unable to get at my old logbooks right now. I
do remember the effect was very small compared to the CPU usage effect
and I didn't bother investigating or mentioning it.
> Why do they run slower? There could be 1000 other variables involved?
> What is it that makes you so sure it is NAPI?
At the time I was running 2 kernels identical except that one had
NAPI disabled in tg3.c.
> There is only one complaint I have ever heard about NAPI and it is about
> low rates: It consumes more CPU at very low rates. Very low rates
> depends on how fast your CPU can process at any given time. Refer to my
> earlier email. Are you saying low rates are a common load?
>
> The choices are: a) at high rates you die or b) at _very low_ rates
> you consume more CPU (3-6% more depending on your system).
This is a false dichotomy. The mechanism could instead dynamically
adjust to the actual network load. For example, dev->weight could
be dynamically adjusted according to a 1-second average packet
arrival rate on that device (a rough sketch follows below). As a further
example, the driver could use that value as a guide to control interrupt
coalescing parameters.
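Something like this, purely hypothetical - not existing kernel code, with
names and thresholds invented for illustration:

/* Scale dev->weight from a 1-second average RX packet rate; the same
 * value could also steer the driver's hw coalescing parameters. */
static void mydrv_adjust_weight(struct net_device *dev,
                                unsigned long pkts_per_sec)
{
        if (pkts_per_sec > 50000)
                dev->weight = 64;   /* heavy load: big polls amortize PIOs */
        else if (pkts_per_sec > 5000)
                dev->weight = 16;
        else
                dev->weight = 4;    /* light load: favor low latency */
}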
In SGI's fileserving group we commonly see two very different traffic
patterns, both of which must work efficiently without manual tuning.
1. high-bandwidth, CPU-sensitive: NFS and CIFS data and metadata
traffic.
2. low bandwidth, latency-sensitive: metadata traffic on SGI's
proprietary clustered filesystem.
The solution on Irix was a dynamic feedback mechanism in the driver
to control the interrupt coalescing parameters, so the driver
adjusts to the predominant traffic.
I think this is a generic problem that other people face too, possibly
without being aware of it. Given that NAPI seeks to be a generic
solution to device interrupt control, and given that it spreads
responsibility between the driver and the device layer, I think
there is room to improve NAPI to cater for various workloads without
implementing enormously complicated control mechanisms in each driver.
> Logic says lets choose a). You could overcome b) by turning on
> mitigation at the expense of latency. We could "fix" at a cost of
> making the whole state machine complex - which would be defeating
> the " optimize for the common".
Sure, NAPI is simple. Current experience on Altix is that
NAPI is the solution that is clear, simple, and wrong.
> >> Note, this would entirely solve what Andi and the SGI people are
> >> talking about.
> >
> > Perhaps, but Linux has to perform well on old hardware too.
> > New silicon is not a solution.
Agreed.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
* Re: NAPI, e100, and system performance problem
From: Stephen Hemminger @ 2005-04-22 23:40 UTC
To: Greg Banks
Cc: jamal, Andi Kleen, Greg Banks, Arthur Kepner, Brandeburg, Jesse,
netdev, davem
On Sat, 23 Apr 2005 09:28:31 +1000
Greg Banks <gnb@sgi.com> wrote:
> On Fri, Apr 22, 2005 at 02:18:22PM -0400, jamal wrote:
> > On Fri, 2005-22-04 at 19:21 +0200, Andi Kleen wrote:
> > > On Fri, Apr 22, 2005 at 08:33:15AM -0400, jamal wrote:
> > [..]
> > > > They should not run slower - but they may consume more CPU.
> > >
> > > They actually run slower.
> > >
>
> IIRC I saw a similar but very small effect on Altix hardware about 18
> months ago, but I'm unable to get at my old logbooks right now. I
> do remember the effect was very small compared to the CPU usage effect
> and I didn't bother investigating or mentioning it.
>
> > Why do they run slower? There could be 1000 other variables involved?
> > What is it that makes you so sure it is NAPI?
>
> At the time I was running 2 kernels identical except that one had
> NAPI disabled in tg3.c.
>
> > There is only one complaint I have ever heard about NAPI and it is about
> > low rates: It consumes more CPU at very low rates. Very low rates
> > depends on how fast your CPU can process at any given time. Refer to my
> > earlier email. Are you saying low rates are a common load?
> >
> > The choices are: a) at high rates you die or b) at _very low_ rates
> > you consume more CPU (3-6% more depending on your system).
>
> This is a false dichotomy. The mechanism could instead dynamically
> adjust to the actual network load. For example dev->weight could
> be dynamically adjusted according to a 1-second average packet
> arrival rate on that device. As a further example the driver could
> use that value as a guide to control interrupt coalescing parameters.
>
> In SGI's fileserving group we commonly see two very different traffic
> patterns, both of which must work efficiently without manual tuning.
>
> 1. high-bandwidth, CPU-sensitive: NFS and CIFS data and metadata
> traffic.
>
> 2. low bandwidth, latency-sensitive: metadata traffic on SGI's
> proprietary clustered filesystem.
>
> The solution on Irix was a dynamic feedback mechanism in the driver
> to control the interrupt coalescing parameters, so the driver
> adjusts to the predominant traffic.
>
> I think this is a generic problem that other people face too, possibly
> without being aware of it. Given that NAPI seeks to be a generic
> solution to device interrupt control, and given that it spreads
> responsibility between the driver and the device layer, I think
> there is room to improve NAPI to cater for various workloads without
> implementing enormously complicated control mechanisms in each driver.
>
> > Logic says lets choose a). You could overcome b) by turning on
> > mitigation at the expense of latency. We could "fix" at a cost of
> > making the whole state machine complex - which would be defeating
> > the " optimize for the common".
>
> Sure, NAPI is simple. Current experience on Altix is that
> NAPI is the solution that is clear, simple, and wrong.
>
> > >> Note, this would entirely solve what Andi and the SGI people are
> > >> talking about.
> > >
> > > Perhaps, but Linux has to perform well on old hardware too.
> > > New silicon is not a solution.
>
> Agreed.
>
> Greg.
My experience is that NAPI adds latency, and that can cause worse performance.
I haven't seen a good analysis of the problem and/or simple tests to reproduce
the problem.
* Re: NAPI, e100, and system performance problem
From: David S. Miller @ 2005-04-22 23:43 UTC
To: Stephen Hemminger; +Cc: gnb, hadi, ak, akepner, jesse.brandeburg, netdev, davem
On Sat, 23 Apr 2005 09:40:38 +1000
Stephen Hemminger <shemminger@osdl.org> wrote:
> My experience is that NAPI adds latency and that can cause worse performance.
> I haven't seen a good analysis of the problem and/or simple tests to reproduce
> the problem
Right, and it's CPU- and bus-speed dependent as to when you hit
this bad case. If your packet rate is perfectly such that
only 1 or 2 packets get processed per interrupt, then NAPI loses
badly due to the extra PIO overhead entailed by enabling and
disabling interrupts.
This is essential and well understood, and I personally don't need
to see "numbers" to acknowledge this flaw.
I hope that minimal mitigation settings alleviate this problem for
the most part.
When I moved tg3 over to NAPI, the happiest part of that was deleting
the dynamic hw mitigation setting code the tg3 driver had. If ever
that kind of thing goes back into the drivers, it should be based
upon a common feedback variable (something based upon dev->weight
perhaps), not reimplemented N times, once in every driver.
With the dynamic schemes comes a new issue: how quickly to respond
to changes in traffic patterns.
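One common answer to the responsiveness question is an exponentially
weighted moving average of the packet rate, updated from a periodic
sample; a sketch with illustrative constants, not existing kernel code:

static long ewma_pps;                   /* smoothed packets/sec */

static void mydrv_update_rate(unsigned long pkts_this_tick)
{
        long sample = pkts_this_tick * HZ;      /* instantaneous pkts/sec */

        /* gain 1/8: reacts within a few ticks, smooths one-tick bursts */
        ewma_pps += (sample - ewma_pps) >> 3;
}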
* Re: NAPI, e100, and system performance problem
From: Stephen Hemminger @ 2005-04-23 2:51 UTC
To: David S. Miller; +Cc: gnb, hadi, ak, akepner, jesse.brandeburg, netdev, davem
On Fri, 22 Apr 2005 16:43:01 -0700
"David S. Miller" <davem@davemloft.net> wrote:
> On Sat, 23 Apr 2005 09:40:38 +1000
> Stephen Hemminger <shemminger@osdl.org> wrote:
>
> > My experience is that NAPI adds latency and that can cause worse performance.
> > I haven't seen a good analysis of the problem and/or simple tests to reproduce
> > the problem
>
> Right, and it's cpu and bus speed dependant as to when you hit
> this bad case. If your packet rate is perfectly such that
> only 1 or 2 packets get processed per interrupt then NAPI loses
> badly due to the extra PIO overhead entailed from enabling and
> disabling interrupts.
>
> This is essential and well understood, and I personally don't need
> to see "numbers" to acknowledge this flaw.
>
> I hope that minimal mitigation settings alleviate this problem for
> the most part.
>
> When I moved tg3 over to NAPI, the happiest part of that was deleting
> the dynamic hw mitigation setting code the tg3 driver had. If ever
> that kind of thing goes back into the drivers, it should be based
> upon a common feedback variable (something based upon dev->weight
> perhaps), not reimplemented N times, once in every driver.
>
> With the dynamic schemes comes a new issue, how quickly to respond
> to changes in traffic patterns.
One of the problems with the dynamic schemes is that there is no generic
infrastructure to support them. There are some drivers that have tried
dynamic schemes, but each seems to reinvent some ad-hoc scheme. Seems like
a good area for future research.
Another related problem is that some drivers use NAPI for both TX and RX,
and this can cause asymmetric behavior and scheduling issues. The whole
dev->weight limit stuff assumes RX, not TX, usage of NAPI.
* Re: NAPI, e100, and system performance problem
From: jamal @ 2005-04-23 3:04 UTC
To: David S. Miller
Cc: Stephen Hemminger, gnb, ak, akepner, jesse.brandeburg, netdev,
davem
On Fri, 2005-22-04 at 16:43 -0700, David S. Miller wrote:
> With the dynamic schemes comes a new issue, how quickly to respond
> to changes in traffic patterns.
Unfortunately, dynamic adjustment of mitigation parameters - in my
experiments at least (pre-NAPI) - shows that stability is hard to
achieve. In fact the early tulip driver had about 8 levels of
mitigation (which I put in and Robert later took out due to the
instability).
The one thing that has been tossed around is to modify the state
machine such that the netdev is not taken out of the poll list for at
least one more poll round or a timeout period. I did try this a while
back, and the extra poll did consume extra CPU - though probably not
as much as the extra PIO(s) cost.
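A sketch of that tweak (names and state machine invented): on an empty
poll, linger one extra round before paying the PIO to re-enable
interrupts and leaving the poll list.

    /* Illustrative only -- invented names and state machine. */
    struct toy_dev {
            int rx_pending;
            int empty_polls;
    };

    static void reenable_irq(struct toy_dev *dev) { (void)dev; /* the PIO */ }
    static void unlist_from_poll(struct toy_dev *dev) { (void)dev; }

    static int lingering_poll(struct toy_dev *dev, int budget)
    {
            int done = 0;

            while (dev->rx_pending > 0 && done < budget) {
                    dev->rx_pending--;
                    done++;
            }
            if (done > 0) {
                    dev->empty_polls = 0;   /* real work: keep polling */
            } else if (dev->empty_polls++ >= 1) {
                    /* second empty round: only now pay the deferred PIO */
                    dev->empty_polls = 0;
                    reenable_irq(dev);
                    unlist_from_poll(dev);
            }
            return done;
    }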
I still think static mitigation + NAPI should do it.
cheers,
jamal
* Re: NAPI, e100, and system performance problem
2005-04-22 18:30 ` Andi Kleen
2005-04-22 18:37 ` Arthur Kepner
2005-04-22 19:01 ` jamal
@ 2005-04-23 16:56 ` Robert Olsson
2 siblings, 0 replies; 38+ messages in thread
From: Robert Olsson @ 2005-04-23 16:56 UTC (permalink / raw)
To: Andi Kleen
Cc: jamal, Greg Banks, Arthur Kepner, Brandeburg, Jesse, netdev,
davem
Andi Kleen writes:
> Well, did you ever test a non-routing workload?
Well, Linux 2.6.11.7 SMP kernel using one CPU, e1000 driver with NAPI
and without NAPI. Opteron 1.6 GHz, e1000 w. 82546GB.
Recv buf   Send buf   Msg size   Time    NAPI      no-NAPI
(bytes)    (bytes)    (bytes)    (sec)   (Mbit/s)  (Mbit/s)
------------------------------------------------------------
131072     131072         4      60.00     24.80     25.11
131072     131072       512      60.00    941.57    941.53
131072     131072      1024      60.00    941.60    941.61
131072     131072      2048      60.00    941.60    941.61
131072     131072      4096      60.00    941.60    941.57
131072     131072      8192      60.00    941.60    941.61
131072     131072     16384      60.00    941.28    941.60
131072     131072     32768      60.00    941.61    941.27
About the same TCP performance, I would say.
Now another TCP test... the ability to deliver TCP even under severe
network conditions. Say a big TCP server with many NICs, and one NIC
gets DoS'ed. In this case a DoS attack at 820 kpps hits a different
NIC, not the one running the netperf test (the box is actually
forwarding it at 330 kpps), while at the same time serving TCP
(netserver in our case).
Recv buf   Send buf   Msg size   Time    NAPI      no-NAPI
(bytes)    (bytes)    (bytes)    (sec)   (Mbit/s)  (Mbit/s)
------------------------------------------------------------
131072     131072         4      60.00     25.59      N/A
131072     131072       512      60.00    836.79      N/A
131072     131072      1024      60.00    709.65     0.02
131072     131072      2048      60.00    734.34      N/A
131072     131072      4096      60.00    753.99      N/A
131072     131072      8192      60.00    695.57      N/A
131072     131072     16384      60.00    815.50      N/A
131072     131072     32768      60.00    690.33      N/A
So even when we are under hard attack we can serve TCP at very decent
rates with the NAPI driver.
Without NAPI we are not able to deliver any TCP service at all.
Cheers.
--ro
* Re: NAPI, e100, and system performance problem
2005-04-22 23:43 ` David S. Miller
2005-04-23 2:51 ` Stephen Hemminger
2005-04-23 3:04 ` jamal
@ 2005-04-23 17:14 ` Robert Olsson
2 siblings, 0 replies; 38+ messages in thread
From: Robert Olsson @ 2005-04-23 17:14 UTC (permalink / raw)
To: David S. Miller
Cc: Stephen Hemminger, gnb, hadi, ak, akepner, jesse.brandeburg,
netdev, davem
David S. Miller writes:
> With the dynamic schemes comes a new issue, how quickly to respond
> to changes in traffic patterns.
So true... I spoiled months on this, trying to find acceptable
mitigation settings for all rates and for all packet sizes.
Yes, minimal mitigation settings with NAPI are probably the best way
to go.
Cheers.
--ro
* Re: NAPI, e100, and system performance problem
2005-04-23 2:51 ` Stephen Hemminger
@ 2005-04-23 17:54 ` Robert Olsson
0 siblings, 0 replies; 38+ messages in thread
From: Robert Olsson @ 2005-04-23 17:54 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David S. Miller, gnb, hadi, ak, akepner, jesse.brandeburg, netdev,
davem
Stephen Hemminger writes:
> One of the problems with the dynamic schemes is that there is no
> generic infrastructure to support them. Some drivers have tried
> dynamic schemes, but each seems to reinvent some ad hoc scheme. This
> seems like a good area for future research.
Well, how do you decorrelate bursty network traffic - and, more
interesting, at what time scale? I said this before... the only useful
measure for mitigation setting was the number of packets on the RX ring.
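To make that concrete (thresholds invented, purely a sketch): derive
the mitigation level from the ring occupancy seen at service time
instead of from a rate estimated over an interval.

    /* Illustrative only -- thresholds are invented. */
    static int mitigation_level(int pkts_on_rx_ring)
    {
            if (pkts_on_rx_ring < 4)
                    return 0;   /* near idle: favor latency */
            if (pkts_on_rx_ring < 32)
                    return 1;   /* moderate load: some coalescing */
            return 2;           /* busy: coalesce hard for throughput */
    }

The appeal is that ring occupancy needs no timers: it is sampled for
free each time the ring is serviced.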
> Another related problem is that some drivers use NAPI for both TX and
> RX, and this can cause asymmetric behavior and scheduling issues. The
> whole dev->weight limit assumes RX, not TX, usage of NAPI.
Well, in the very beginning it was thought of as RX-only, yes. But it
is also possible to have the TX buffers cleaned etc. in the ->poll
routine to simplify interrupt acking. As implemented, only RX packets
update the budget.
From what I see in real-life use this works well. I guess you noticed
the tests with e1000 and TCP just posted.
Anyway, MSI will give us some new options in this area..
Cheers.
--ro
* Re: NAPI, e100, and system performance problem
2005-04-22 19:21 ` jamal
@ 2005-04-23 20:50 ` Robert Olsson
0 siblings, 0 replies; 38+ messages in thread
From: Robert Olsson @ 2005-04-23 20:50 UTC (permalink / raw)
To: David S. Miller
Cc: hadi, ak, gnb, akepner, jesse.brandeburg, netdev, davem,
Robert.Olsson
Well, remember I did some CPU measurements with different variants of
the tulip driver a long time ago while discussing this with Manfred.
Yellow: NAPI, no intr. mitigation
Red: old HW-flow w. intr. mitigation
Blue: NAPI w. intr. mitigation
Purple: a tulip variant doing all work in ->poll w. intr. mitigation
Your coalescing patch for tg3 should roughly be like moving from the
yellow line to the purple one. I guess we will soon have the answer...
Maybe I have to upgrade the lab with some recent Broadcom cards too.
[-- Attachment: NAPI-overhead.gif, image/gif, 179965 bytes --]
Cheers.
--ro
* Re: NAPI, e100, and system performance problem
[not found] ` <Pine.LNX.4.61.0504241845070.2934@linux.site>
@ 2005-04-25 11:25 ` jamal
2005-04-25 18:51 ` David S. Miller
2005-04-25 11:41 ` jamal
1 sibling, 1 reply; 38+ messages in thread
From: jamal @ 2005-04-25 11:25 UTC (permalink / raw)
To: Arthur Kepner
Cc: Robert Olsson, David S. Miller, ak, gnb, jesse.brandeburg, netdev,
davem
Dang, I just noticed you used TCP. Could you use UDP please?
While the suspect may be the PIO(s) - let's be scientific and prove it.
So please collect all the data I requested with UDP, and if you can,
please also retry with the e1000 and the latest driver they have (this
should help rule out - hopefully - any tg3 driver-related issues).
cheers,
jamal
* Re: NAPI, e100, and system performance problem
[not found] ` <Pine.LNX.4.61.0504241845070.2934@linux.site>
2005-04-25 11:25 ` jamal
@ 2005-04-25 11:41 ` jamal
2005-04-25 12:16 ` Jamal Hadi Salim
1 sibling, 1 reply; 38+ messages in thread
From: jamal @ 2005-04-25 11:41 UTC (permalink / raw)
To: Arthur Kepner; +Cc: David S. Miller, ak, gnb, jesse.brandeburg, netdev, davem
Actually, it may be sufficient to collect data for 20, 40, and 80% link
utilization for napi, napi+dmiller_mchan and nonapi
Same for the e1000.
cheers,
jamal
* Re: NAPI, e100, and system performance problem
2005-04-25 11:41 ` jamal
@ 2005-04-25 12:16 ` Jamal Hadi Salim
0 siblings, 0 replies; 38+ messages in thread
From: Jamal Hadi Salim @ 2005-04-25 12:16 UTC (permalink / raw)
To: Arthur Kepner; +Cc: David S. Miller, ak, gnb, jesse.brandeburg, netdev, davem
After looking closely at the data:
To reduce the work, 20 and 40% should be enough.
But let's rule out the tg3 and also test with e1000 and UDP.
cheers,
jamal
On Mon, 2005-25-04 at 07:41 -0400, jamal wrote:
> Actually, it may be sufficient to collect data for 20, 40, and 80% link
> utilization for napi, napi+dmiller_mchan and nonapi
>
> Same for the e1000.
>
> cheers,
> jamal
>
* Re: NAPI, e100, and system performance problem
2005-04-25 11:25 ` jamal
@ 2005-04-25 18:51 ` David S. Miller
0 siblings, 0 replies; 38+ messages in thread
From: David S. Miller @ 2005-04-25 18:51 UTC (permalink / raw)
To: hadi; +Cc: akepner, Robert.Olsson, ak, gnb, jesse.brandeburg, netdev, davem
On Mon, 25 Apr 2005 07:25:55 -0400
jamal <hadi@cyberus.ca> wrote:
> Dang, I just noticed you used TCP. Could you use UDP please?
>
> While the suspect may be PIO/s - lets be scientific and prove it.
Jamal, enough already... stop being such a control freak.
You're exceeding me as a control freak, and that's saying
something.
The guy got the numbers for us, in response to their claim that
on their systems CPU utilization is poor due to NAPI. Are you
not convinced of this yet?
Will a hundred new sets of numbers, using the precise driver and
protocol of your choosing, convince you any further? I think
you're making unreasonable requests of the problem submitter at
this point; we should be sending them patches to test, not
a new set of tests to run.
Thread overview: 38+ messages
2005-04-18 6:11 NAPI, e100, and system performance problem Brandeburg, Jesse
2005-04-18 12:14 ` jamal
2005-04-18 15:36 ` Robert Olsson
2005-04-18 16:55 ` Arthur Kepner
2005-04-18 19:34 ` Robert Olsson
2005-04-18 20:26 ` jamal
2005-04-19 5:55 ` Greg Banks
2005-04-19 18:36 ` David S. Miller
2005-04-19 20:38 ` NAPI and CPU utilization [was: NAPI, e100, and system performance problem] Arthur Kepner
2005-04-19 20:52 ` Rick Jones
2005-04-19 21:09 ` David S. Miller
[not found] ` <20050420145629.GH19415@sgi.com>
2005-04-20 15:15 ` NAPI, e100, and system performance problem jamal
2005-04-22 11:36 ` Andi Kleen
2005-04-22 12:33 ` jamal
2005-04-22 17:21 ` Andi Kleen
2005-04-22 18:18 ` jamal
2005-04-22 18:30 ` Andi Kleen
2005-04-22 18:37 ` Arthur Kepner
2005-04-22 18:52 ` David S. Miller
[not found] ` <Pine.LNX.4.61.0504241845070.2934@linux.site>
2005-04-25 11:25 ` jamal
2005-04-25 18:51 ` David S. Miller
2005-04-25 11:41 ` jamal
2005-04-25 12:16 ` Jamal Hadi Salim
2005-04-22 19:01 ` jamal
2005-04-22 19:07 ` David S. Miller
2005-04-22 19:21 ` jamal
2005-04-23 20:50 ` Robert Olsson
2005-04-23 16:56 ` Robert Olsson
2005-04-22 23:28 ` Greg Banks
2005-04-22 23:40 ` Stephen Hemminger
2005-04-22 23:43 ` David S. Miller
2005-04-23 2:51 ` Stephen Hemminger
2005-04-23 17:54 ` Robert Olsson
2005-04-23 3:04 ` jamal
2005-04-23 17:14 ` Robert Olsson
2005-04-22 14:52 ` Robert Olsson
2005-04-22 15:37 ` jamal
2005-04-22 17:22 ` Andi Kleen