RE: RFC: NAPI packet weighting patch

All of lore.kernel.org
 help / color / mirror / Atom feed

* RE: RFC: NAPI packet weighting patch
@ 2005-06-06 20:29 Ronciak, John
  2005-06-06 23:55 ` Mitch Williams
  0 siblings, 1 reply; 52+ messages in thread
From: Ronciak, John @ 2005-06-06 20:29 UTC (permalink / raw)
  To: David S. Miller
  Cc: mchan, hadi, buytenh, Williams, Mitch A, jdmason, shemminger,
	netdev, Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse

> 	If you force the e1000 driver to do RX replenishment every N
> 	packets it should reduce the packet drops the same (in the
> 	single NIC case) as if you reduced the dev->weight to that
> 	same value N.

But this isn't what we are seeing.  Even if we just reduce the weight
value to 32 from 64, all of the drops go away.  So there seems to be
other things affecting this.

We are just talking about single NIC testing at this point.  I agree
that single and multi-NIC results different issues and we will need to
test this as well with whatever we come up with out of this.

I also like your idea about the weight value being adjusted based on
real work done using some measurable metric.  This seems like a good
path to explore as well.

Cheers,
John

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: RFC: NAPI packet weighting patch
  2005-06-06 20:29 RFC: NAPI packet weighting patch Ronciak, John
@ 2005-06-06 23:55 ` Mitch Williams
  2005-06-07  0:08   ` Ben Greear
  2005-06-07  4:53   ` Stephen Hemminger
  0 siblings, 2 replies; 52+ messages in thread
From: Mitch Williams @ 2005-06-06 23:55 UTC (permalink / raw)
  To: Ronciak, John
  Cc: David S. Miller, mchan, hadi, buytenh, Williams, Mitch A, jdmason,
	shemminger, netdev, Robert.Olsson, Venkatesan, Ganesh,
	Brandeburg, Jesse

On Mon, 6 Jun 2005, Ronciak, John wrote:

> > 	If you force the e1000 driver to do RX replenishment every N
> > 	packets it should reduce the packet drops the same (in the
> > 	single NIC case) as if you reduced the dev->weight to that
> > 	same value N.
>
> But this isn't what we are seeing.  Even if we just reduce the weight
> value to 32 from 64, all of the drops go away.  So there seems to be
> other things affecting this.

Some quickie results for everybody -- I've been working on other stuff this
morning and haven't had much time in the lab.

Increasing the RX ring to 512 descriptors eliminates dropped packets, but
performance goes down.  When I mentioned this, John and Jesse both nodded
and said, "Yep.  That's what happens when the descriptor ring grows past
one page."

Reducing the weight to 32 got rid of almost all of the dropped packets
(down to < 1 per second); reducing it to 20 eliminated all of them.  In
both cases performance rose as compared to the default weight of 64.

Tests were run on 2.6.12rc5 on a dual Xeon 2.8GHz PCI-X system.  We run
Chariot for performance testing, using TCP/IP large file transfers with 10
Windows 2000 clients.

We're still looking at some methods of returning RX resources to the
hardware more often, but we don't have results on that yet.

> I also like your idea about the weight value being adjusted based on
> real work done using some measurable metric.  This seems like a good
> path to explore as well.

Agreed.  I think NAPI can be a lot smarter than it is today.

-Mitch

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-06 23:55 ` Mitch Williams
@ 2005-06-07  0:08   ` Ben Greear
  2005-06-08  1:50     ` Jesse Brandeburg
  2005-06-07  4:53   ` Stephen Hemminger
  1 sibling, 1 reply; 52+ messages in thread
From: Ben Greear @ 2005-06-07  0:08 UTC (permalink / raw)
  To: Mitch Williams
  Cc: Ronciak, John, David S. Miller, mchan, hadi, buytenh, jdmason,
	shemminger, netdev, Robert.Olsson, Venkatesan, Ganesh,
	Brandeburg, Jesse

Mitch Williams wrote:
> 
> On Mon, 6 Jun 2005, Ronciak, John wrote:
> 
> 
>>>	If you force the e1000 driver to do RX replenishment every N
>>>	packets it should reduce the packet drops the same (in the
>>>	single NIC case) as if you reduced the dev->weight to that
>>>	same value N.
>>
>>But this isn't what we are seeing.  Even if we just reduce the weight
>>value to 32 from 64, all of the drops go away.  So there seems to be
>>other things affecting this.
> 
> 
> Some quickie results for everybody -- I've been working on other stuff this
> morning and haven't had much time in the lab.
> 
> Increasing the RX ring to 512 descriptors eliminates dropped packets, but
> performance goes down.  When I mentioned this, John and Jesse both nodded
> and said, "Yep.  That's what happens when the descriptor ring grows past
> one page."
> 
> Reducing the weight to 32 got rid of almost all of the dropped packets
> (down to < 1 per second); reducing it to 20 eliminated all of them.  In
> both cases performance rose as compared to the default weight of 64.
> 
> Tests were run on 2.6.12rc5 on a dual Xeon 2.8GHz PCI-X system.  We run
> Chariot for performance testing, using TCP/IP large file transfers with 10
> Windows 2000 clients.

So is the Linux server reading/writing these large files to/from the disk?

Can you tell us how much performance went down when you increased the
descriptors to 512?

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-06 23:55 ` Mitch Williams
  2005-06-07  0:08   ` Ben Greear
@ 2005-06-07  4:53   ` Stephen Hemminger
  2005-06-07 12:38     ` jamal
  2005-06-21 20:20     ` David S. Miller
  1 sibling, 2 replies; 52+ messages in thread
From: Stephen Hemminger @ 2005-06-07  4:53 UTC (permalink / raw)
  To: Mitch Williams
  Cc: Ronciak, John, David S. Miller, mchan, hadi, buytenh, jdmason,
	netdev, Robert.Olsson, Venkatesan, Ganesh, Brandeburg, Jesse

I noticed that the tg3 driver copies packets less than a certain
threshold to a new buffer, but e1000 always passes the big buffer up
the stack. Could this be having an impact?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-07 12:38     ` jamal
@ 2005-06-07 12:06       ` Martin Josefsson
  2005-06-07 13:29         ` jamal
  2005-06-21 20:37         ` David S. Miller
  0 siblings, 2 replies; 52+ messages in thread
From: Martin Josefsson @ 2005-06-07 12:06 UTC (permalink / raw)
  To: jamal
  Cc: Stephen Hemminger, Mitch Williams, Ronciak, John, David S. Miller,
	mchan, buytenh, jdmason, netdev, Robert.Olsson,
	Venkatesan, Ganesh, Brandeburg, Jesse

On Tue, 7 Jun 2005, jamal wrote:

> It is possible. Remember also the cost of IO these days is worse than a
> cache miss in cycles as well as absolute time. So the e1000 maybe doing
> more IO than the tg3.
>
> I think there is something fishy about the e1000 in general; From what i
> just heard mentioned reading the emails is there's improvement if the rx
> ring is replenished on a per packet basis instead of a batch at the end.
> This somehow is not an issue with tg3. I think doing replenishing in
> smaller batches like 5 packets at a time would also help.
> That the tg3 doesnt need to have its rx ring sizes adjusted but the
> e1000 gets better the lower the rx ring size is strange.
>
> To the intel folks: shouldnt someone be investigating why this is so?
>
> Fixing the effect with "lets lower the weight" or "wait, lets adjust it
> at runtime" because we know it fixes our problem - sounds like a serious
> bandaid to me. Lets find the cause and fix that instead.
> Why is this issue happening with e1000? Thats what needs to be resolved.
> So far some evidence seems to be suggesting that the tg3 uses less CPU.

One thing that jumps to mind is that e1000 starts at lastrxdescriptor+1
and loops and checks the status of each descriptor and stops when it finds
a descriptor that isn't finished. Another way to do it is to read out the
current position of the ring and loop from lastrxdescriptor+1 up to the
current position. Scott Feldman implemented this for TX and there it
increased performance somewhat (discussed here on netdev some months ago).
I wonder if it could also decrease RX latency, I mean, we have to get the
cache miss sometime anyway.

I havn't checked how tg3 does it.

/Martin

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-07 13:29         ` jamal
@ 2005-06-07 12:36           ` Martin Josefsson
  2005-06-07 16:34             ` Robert Olsson
  0 siblings, 1 reply; 52+ messages in thread
From: Martin Josefsson @ 2005-06-07 12:36 UTC (permalink / raw)
  To: jamal
  Cc: Stephen Hemminger, Mitch Williams, Ronciak, John, David S. Miller,
	mchan, buytenh, jdmason, netdev, Robert.Olsson,
	Venkatesan, Ganesh, Brandeburg, Jesse

On Tue, 7 Jun 2005, jamal wrote:

> On Tue, 2005-07-06 at 14:06 +0200, Martin Josefsson wrote:
>
> > One thing that jumps to mind is that e1000 starts at lastrxdescriptor+1
> > and loops and checks the status of each descriptor and stops when it finds
> > a descriptor that isn't finished. Another way to do it is to read out the
> > current position of the ring and loop from lastrxdescriptor+1 up to the
> > current position. Scott Feldman implemented this for TX and there it
> > increased performance somewhat (discussed here on netdev some months ago).
> > I wonder if it could also decrease RX latency, I mean, we have to get the
> > cache miss sometime anyway.
> >
>
> The effect of Scotts patch was to reduce IO by amortizing it on the TX
> side. Are we talking about the same thing ? This was in the case of TX
> descriptor prunning?

Yes, that was for TX pruning.

> So it is possible that the e1000 is doing more than necessary share of
> IO on the receive side as well.

Yes, that's what I mean. Same thing but for RX but the question is how
much we would gain from it, we still need to touch the rx-descriptor
sooner or later. Would be worth a test.
My testsetup isn't in a working condition right now, Robert?

/Martin

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-07  4:53   ` Stephen Hemminger
@ 2005-06-07 12:38     ` jamal
  2005-06-07 12:06       ` Martin Josefsson
  2005-06-21 20:20     ` David S. Miller
  1 sibling, 1 reply; 52+ messages in thread
From: jamal @ 2005-06-07 12:38 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Mitch Williams, Ronciak, John, David S. Miller, mchan, buytenh,
	jdmason, netdev, Robert.Olsson, Venkatesan, Ganesh,
	Brandeburg, Jesse

On Mon, 2005-06-06 at 21:53 -0700, Stephen Hemminger wrote:
> I noticed that the tg3 driver copies packets less than a certain
> threshold to a new buffer, but e1000 always passes the big buffer up
> the stack. Could this be having an impact?

It is possible. Remember also the cost of IO these days is worse than a
cache miss in cycles as well as absolute time. So the e1000 maybe doing
more IO than the tg3. 

I think there is something fishy about the e1000 in general; From what i
just heard mentioned reading the emails is there's improvement if the rx
ring is replenished on a per packet basis instead of a batch at the end.
This somehow is not an issue with tg3. I think doing replenishing in
smaller batches like 5 packets at a time would also help.
That the tg3 doesnt need to have its rx ring sizes adjusted but the
e1000 gets better the lower the rx ring size is strange.

To the intel folks: shouldnt someone be investigating why this is so?

Fixing the effect with "lets lower the weight" or "wait, lets adjust it
at runtime" because we know it fixes our problem - sounds like a serious
bandaid to me. Lets find the cause and fix that instead.
Why is this issue happening with e1000? Thats what needs to be resolved.
So far some evidence seems to be suggesting that the tg3 uses less CPU.

cheers,
jamal

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-07 12:06       ` Martin Josefsson
@ 2005-06-07 13:29         ` jamal
  2005-06-07 12:36           ` Martin Josefsson
  2005-06-21 20:37         ` David S. Miller
  1 sibling, 1 reply; 52+ messages in thread
From: jamal @ 2005-06-07 13:29 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Stephen Hemminger, Mitch Williams, Ronciak, John, David S. Miller,
	mchan, buytenh, jdmason, netdev, Robert.Olsson,
	Venkatesan, Ganesh, Brandeburg, Jesse

On Tue, 2005-07-06 at 14:06 +0200, Martin Josefsson wrote:

> One thing that jumps to mind is that e1000 starts at lastrxdescriptor+1
> and loops and checks the status of each descriptor and stops when it finds
> a descriptor that isn't finished. Another way to do it is to read out the
> current position of the ring and loop from lastrxdescriptor+1 up to the
> current position. Scott Feldman implemented this for TX and there it
> increased performance somewhat (discussed here on netdev some months ago).
> I wonder if it could also decrease RX latency, I mean, we have to get the
> cache miss sometime anyway.
> 

The effect of Scotts patch was to reduce IO by amortizing it on the TX
side. Are we talking about the same thing ? This was in the case of TX
descriptor prunning? 
So it is possible that the e1000 is doing more than necessary share of
IO on the receive side as well.

cheers,
jamal

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-07 12:36           ` Martin Josefsson
@ 2005-06-07 16:34             ` Robert Olsson
  2005-06-07 23:19               ` Rick Jones
  0 siblings, 1 reply; 52+ messages in thread
From: Robert Olsson @ 2005-06-07 16:34 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: jamal, Stephen Hemminger, Mitch Williams, Ronciak, John,
	David S. Miller, mchan, buytenh, jdmason, netdev, Robert.Olsson,
	Venkatesan, Ganesh, Brandeburg, Jesse

Martin Josefsson writes:

 > > So it is possible that the e1000 is doing more than necessary share of
 > > IO on the receive side as well.
 > 
 > Yes, that's what I mean. Same thing but for RX but the question is how
 > much we would gain from it, we still need to touch the rx-descriptor
 > sooner or later. Would be worth a test.
 > My testsetup isn't in a working condition right now, Robert?

 Next week possibly... but really now idea what's to test or whats going on.

 We have dual TCP server with one NIC. How is setup now? I don't know even
 how it should be setup for maximum TCP performance? 

 How is irq affinity setup?  Is irq's jumping between the CPU:s etc?
 Does ksoftirq(s) use CPU?   If so it can be adjusted tuned too.
 How is packets processes by CPU's. /proc/net/softnet_stat. Do we see 
 drops w. one CPU too etc
 It might be intricate question of balance between softirq and userland.

 Cheers.
						--ro

 BTW, Can netperf be used for tests like this? (Rick?)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-07 16:34             ` Robert Olsson
@ 2005-06-07 23:19               ` Rick Jones
  0 siblings, 0 replies; 52+ messages in thread
From: Rick Jones @ 2005-06-07 23:19 UTC (permalink / raw)
  To: Robert Olsson, netdev
  Cc: Martin Josefsson, jamal, Stephen Hemminger, Mitch Williams,
	Ronciak, John, David S. Miller, mchan, buytenh, jdmason,
	Venkatesan, Ganesh, Brandeburg, Jesse

> 
>  BTW, Can netperf be used for tests like this? (Rick?)

Assuming I'm translating "test like this" to the right sort of stuff :)

If one wants to see the effect of different buffer replenishment strategies, I 
suppose that some netperf tests could indeed be used.  It would be desirable to 
look at service demand moreso than throughput (assuming the throughput is 
link-bound).  TCP_STREAM and/or TCP_MAERTS.  I'm not sure the extent to which it 
would be visible to a TCP_RR test.

Differences in service demand could also be used to measure effects of irq 
migration, pinning IRQs and/or processes to specific CPUs and the like.  The 
linux processor affinity stuff in netperf could use a little help though - it is 
easily confused as to when to use a two argument vs three argument 
sched_setaffinity call.  I suspect one may also see differences in TCP_RR 
transaction rates.

I suspect some high number of confidence interval iterations might be required.

rick jones

i'd trim individual names from the dist list, but am not 100% sure who is on 
netdev...

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-07  0:08   ` Ben Greear
@ 2005-06-08  1:50     ` Jesse Brandeburg
  0 siblings, 0 replies; 52+ messages in thread
From: Jesse Brandeburg @ 2005-06-08  1:50 UTC (permalink / raw)
  To: Ben Greear
  Cc: Williams, Mitch A, Ronciak, John, David S. Miller, mchan, hadi,
	buytenh, jdmason, shemminger, netdev, Robert.Olsson,
	Venkatesan, Ganesh, Brandeburg, Jesse

On Mon, 6 Jun 2005, Ben Greear wrote:

> So is the Linux server reading/writing these large files to/from the
> disk?

no, the test runs completely from memory, and the clients are 
reading/writing from/to the server

> Can you tell us how much performance went down when you increased the
> descriptors to 512?

sorry don't know the answer to that one.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-07  4:53   ` Stephen Hemminger
  2005-06-07 12:38     ` jamal
@ 2005-06-21 20:20     ` David S. Miller
  2005-06-21 20:38       ` Rick Jones
  1 sibling, 1 reply; 52+ messages in thread
From: David S. Miller @ 2005-06-21 20:20 UTC (permalink / raw)
  To: shemminger
  Cc: mitch.a.williams, john.ronciak, mchan, hadi, buytenh, jdmason,
	netdev, Robert.Olsson, ganesh.venkatesan, jesse.brandeburg

From: Stephen Hemminger <shemminger@osdl.org>
Date: Mon, 06 Jun 2005 21:53:32 -0700

> I noticed that the tg3 driver copies packets less than a certain
> threshold to a new buffer, but e1000 always passes the big buffer up
> the stack. Could this be having an impact?

I bet it does, this makes ACK processing a lot more expensive.  And it
is so much cheaper to just recycle the big buffer back to the chip
if you copy to a small buffer, and it warms up the caches for the
packet headers as a side effect as well.

Actually, it has a _HUGE_ _HUGE_ impact.  If you pass the big buffer
up, the receiving socket gets charged for the size of the huge buffer,
not for just the size of the packet contained within.  This makes
sockets get overcharged for data reception, and it can cause all kinds
of performance problems.

I highly recommend that this gets fixed.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-07 12:06       ` Martin Josefsson
  2005-06-07 13:29         ` jamal
@ 2005-06-21 20:37         ` David S. Miller
  2005-06-22  7:27           ` Eric Dumazet
  2005-06-22  8:42           ` P
  1 sibling, 2 replies; 52+ messages in thread
From: David S. Miller @ 2005-06-21 20:37 UTC (permalink / raw)
  To: gandalf
  Cc: hadi, shemminger, mitch.a.williams, john.ronciak, mchan, buytenh,
	jdmason, netdev, Robert.Olsson, ganesh.venkatesan,
	jesse.brandeburg

From: Martin Josefsson <gandalf@wlug.westbo.se>
Date: Tue, 7 Jun 2005 14:06:18 +0200 (CEST)

> One thing that jumps to mind is that e1000 starts at lastrxdescriptor+1
> and loops and checks the status of each descriptor and stops when it finds
> a descriptor that isn't finished. Another way to do it is to read out the
> current position of the ring and loop from lastrxdescriptor+1 up to the
> current position. Scott Feldman implemented this for TX and there it
> increased performance somewhat (discussed here on netdev some months ago).
> I wonder if it could also decrease RX latency, I mean, we have to get the
> cache miss sometime anyway.
> 
> I havn't checked how tg3 does it.

I don't think this matters all that much.  tg3 does loop on RX
producer index, so doesn't touch descriptors unless the RX producer
index states there is a ready packet there.

One thing I noticed with Super TSO testing is that e1000 has very
expensive TSO transmit processing.  The big problem is the context
descriptor.  This is 4 extra 32-bit words eaten up in the transmit
ring for every TSO packet.  Whereas tg3 stores all the TSO offload
information directly in the normal TX descriptor (which is the
same size, 16 bytes, as the e1000 normal TX descriptor).

It accounts for a non-trivial amount of overhead.  On my SunBlade1500
with Super TSO, e1000 transmitter eats %40 of CPU to fill a gigabit
pipe whereas tg3 takes %30.  All of the extra time, based upon quick
scans of oprofile dumps, shows it in the e1000 driver.

Also, e1000 sends full MTU sized SKBs down into the stack even if the
packet is very small.  This also hurts performance a lot.  As
discussed elsewhere, it should use a "small packet" cut-off just like
other drivers do.  If the RX frame is less than this cut-off value, a
new smaller sized SKB is allocated and the RX data copied into it.
The RX ring SKB is left in-place and given back to the chip.

My only guess is that the e1000 driver implemented things this way
to simplify the RX recycling logic.  Well, it is an area ripe for
improvement in this driver :)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-21 20:20     ` David S. Miller
@ 2005-06-21 20:38       ` Rick Jones
  2005-06-21 20:55         ` David S. Miller
  2005-06-21 21:47         ` Andi Kleen
  0 siblings, 2 replies; 52+ messages in thread
From: Rick Jones @ 2005-06-21 20:38 UTC (permalink / raw)
  To: David S. Miller
  Cc: shemminger, mitch.a.williams, john.ronciak, mchan, hadi, buytenh,
	jdmason, netdev, Robert.Olsson, ganesh.venkatesan,
	jesse.brandeburg

David S. Miller wrote:
> From: Stephen Hemminger <shemminger@osdl.org>
> Date: Mon, 06 Jun 2005 21:53:32 -0700
> 
> 
>>I noticed that the tg3 driver copies packets less than a certain
>>threshold to a new buffer, but e1000 always passes the big buffer up
>>the stack. Could this be having an impact?
> 
> 
> I bet it does, this makes ACK processing a lot more expensive. 

Why would ACK processing care about the size of the buffer containing the ACK 
segment?

> And it
> is so much cheaper to just recycle the big buffer back to the chip
> if you copy to a small buffer, and it warms up the caches for the
> packet headers as a side effect as well.

I would think that the cache business would be a wash either way.  With 64 byte 
cache lines (128 in some cases) just accessing the link-level header has brought 
the IP header into the cache, and probably the TCP header as well.

Isn't the decision point between the sum of allocating a small buffer and doing 
the copy, versus allocating a new large buffer and (re)mapping it for DMA?  I 
guess that would come down to copy versus mapping overhead.

> Actually, it has a _HUGE_ _HUGE_ impact.  If you pass the big buffer
> up, the receiving socket gets charged for the size of the huge buffer,
> not for just the size of the packet contained within.  This makes
> sockets get overcharged for data reception, and it can cause all kinds
> of performance problems.

Then copy when the socket is about to fill with overhead bytes?

> I highly recommend that this gets fixed.

What is the cut-off point for the copy?

rick jones

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-21 20:38       ` Rick Jones
@ 2005-06-21 20:55         ` David S. Miller
  2005-06-21 21:47         ` Andi Kleen
  1 sibling, 0 replies; 52+ messages in thread
From: David S. Miller @ 2005-06-21 20:55 UTC (permalink / raw)
  To: rick.jones2
  Cc: shemminger, mitch.a.williams, john.ronciak, mchan, hadi, buytenh,
	jdmason, netdev, Robert.Olsson, ganesh.venkatesan,
	jesse.brandeburg

From: Rick Jones <rick.jones2@hp.com>
Date: Tue, 21 Jun 2005 13:38:39 -0700

> > I highly recommend that this gets fixed.
> 
> What is the cut-off point for the copy?

256 has been found to be a well functioning value to use.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-21 20:38       ` Rick Jones
  2005-06-21 20:55         ` David S. Miller
@ 2005-06-21 21:47         ` Andi Kleen
  2005-06-21 22:22           ` Donald Becker
  1 sibling, 1 reply; 52+ messages in thread
From: Andi Kleen @ 2005-06-21 21:47 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev, davem

Rick Jones <rick.jones2@hp.com> writes:

> > Actually, it has a _HUGE_ _HUGE_ impact.  If you pass the big buffer
> > up, the receiving socket gets charged for the size of the huge buffer,
> > not for just the size of the packet contained within.  This makes
> > sockets get overcharged for data reception, and it can cause all kinds
> > of performance problems.
> 
> Then copy when the socket is about to fill with overhead bytes?

The stack has supported that since 2.4.

Mostly because it is the only sane way to handle devices with very big
MTU. But it turns off all kinds of fast paths before it happens, I
guess that is what David was refering too.

However I suspect the cut-off points with rx-copybreak in common driver 
have been  often tuned before that code was introduced and it might
be worth to do some retesting.

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-21 21:47         ` Andi Kleen
@ 2005-06-21 22:22           ` Donald Becker
  2005-06-21 22:34             ` Andi Kleen
  0 siblings, 1 reply; 52+ messages in thread
From: Donald Becker @ 2005-06-21 22:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Rick Jones, netdev, davem

On 21 Jun 2005, Andi Kleen wrote:

> Rick Jones <rick.jones2@hp.com> writes:
> 
> > > Actually, it has a _HUGE_ _HUGE_ impact.  If you pass the big buffer
> > > up, the receiving socket gets charged for the size of the huge buffer,
> > > not for just the size of the packet contained within.  This makes
> > > sockets get overcharged for data reception, and it can cause all kinds
> > > of performance problems.
> > 
> > Then copy when the socket is about to fill with overhead bytes?

Or better, predict when the frame you are currently stuffing into the 
queue will be there when the queue fills up.  And then use the same 
crystal ball to...
 
> Mostly because it is the only sane way to handle devices with very big
> MTU. But it turns off all kinds of fast paths before it happens, I
> guess that is what David was refering too.
> 
> However I suspect the cut-off points with rx-copybreak in common driver 
> have been  often tuned before that code was introduced and it might
> be worth to do some retesting.

Most of that analysis and tuning was done in the 1996-99 timeframe.  

While much has changed since then, the same basic parameters remain
   - cache line size
   - frame header size (MAC+IP+ProtocolHeader)
   - hot cache lines from copying or type classification
   - cold memory lines from PCI writes

I suspect you'll find that a good rx_copybreak is pretty much the same as 
it was when I did the original evaluation.

If you are looking for an area that has changed: the hidden cost of 
maintaining consistent cache lines on SMP systems is far higher than it 
was back in the days of the Pentium Pro.

Donald Becker				becker@scyld.com
Scyld Software				A Penguin Computing company
914 Bay Ridge Road, Suite 220		www.scyld.com
Annapolis MD 21403			410-990-9993

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-21 22:22           ` Donald Becker
@ 2005-06-21 22:34             ` Andi Kleen
  2005-06-22  0:08               ` Donald Becker
  0 siblings, 1 reply; 52+ messages in thread
From: Andi Kleen @ 2005-06-21 22:34 UTC (permalink / raw)
  To: Donald Becker; +Cc: Andi Kleen, Rick Jones, netdev, davem

> While much has changed since then, the same basic parameters remain
>    - cache line size

In 96 we had 32 byte cache lines. These days 64-128 are common,
with some 256 byte cache line systems around.

>    - frame header size (MAC+IP+ProtocolHeader)

In 96 people tended to not use time stamps.

>    - hot cache lines from copying or type classification

Not sure what you mean with that.

>    - cold memory lines from PCI writes

I suspect in '96 chipsets also didn't do as aggressive prefetching
as they do today.

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-21 22:34             ` Andi Kleen
@ 2005-06-22  0:08               ` Donald Becker
  2005-06-22  4:44                 ` Chris Friesen
  2005-06-22 16:23                 ` Leonid Grossman
  0 siblings, 2 replies; 52+ messages in thread
From: Donald Becker @ 2005-06-22  0:08 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Rick Jones, netdev, davem

On Wed, 22 Jun 2005, Andi Kleen wrote:

> > While much has changed since then, the same basic parameters remain
> >    - cache line size
> 
> In 96 we had 32 byte cache lines. These days 64-128 are common,
> with some 256 byte cache line systems around.

Good point.
I believe that the most common line size is 64 bytes for L1 cache.

Most L2 caches that have larger line sizes still fill only 64 byte 
blocks unless prefetching is triggered.  (Feel free to correct me with 
non-obscure CPUs and relevant cases.  For instance, I know that on the 
Itanium the 128 byte line L2 cache is used as L1, but only for FPU 
fetches.  That doesn't count.)

The implication here is that as soon as we look at the first byte of the 
MAC address, we have read in 64 bytes.  That's a whole minimum-size 
EThernet frame.

> >    - frame header size (MAC+IP+ProtocolHeader)
> 
> In 96 people tended to not use time stamps.

Ehh, not a big difference.

> >    - hot cache lines from copying or type classification
> Not sure what you mean with that.

See the comment above. We decide if a packet is multicast vs. unicast, IP 
vs. other at approximately interrupt/"rx_copybreak" time.  Very few NIC 
provide this info in status bits, so we end up looking at the packet 
header.  That read moves the previously known-uncached data (after all, it 
was just came in from a bus write) into the L1 cache for the CPU handling 
the device.  Once it's there, the copy is almost free.

[[ Background: Yes, the allocating the new skbuff is very expensive.  But 
we can either allocate a new, correctly-sized skbuff to copy into, or 
allocate a new full-sized skbuff to replace the one we will send to the Rx 
queue.  ]] 

> >    - cold memory lines from PCI writes
> 
> I suspect in '96 chipsets also didn't do as aggressive prefetching
> as they do today.

Prefetching helps linear read bandwidth, but we shouldn't be triggering 
it.  And I claim that cache line prefetching only restores the relative
balance between L1/L2 caches, otherwise the long L2 cache lines would be 
very expensive with bump-read-bump-read with linear scans through memory.

-- 
Donald Becker				becker@scyld.com
Scyld Software	 			Scyld Beowulf cluster systems
914 Bay Ridge Road, Suite 220		www.scyld.com
Annapolis MD 21403			410-990-9993

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22  0:08               ` Donald Becker
@ 2005-06-22  4:44                 ` Chris Friesen
  2005-06-22 11:31                   ` Andi Kleen
  2005-06-22 16:23                 ` Leonid Grossman
  1 sibling, 1 reply; 52+ messages in thread
From: Chris Friesen @ 2005-06-22  4:44 UTC (permalink / raw)
  To: Donald Becker; +Cc: Andi Kleen, Rick Jones, netdev, davem

Donald Becker wrote:
> On Wed, 22 Jun 2005, Andi Kleen wrote:
> 
> 
>>>While much has changed since then, the same basic parameters remain
>>>   - cache line size
>>
>>In 96 we had 32 byte cache lines. These days 64-128 are common,
>>with some 256 byte cache line systems around.
> 
> 
> Good point.
> I believe that the most common line size is 64 bytes for L1 cache.

If I recall, G4 chips are 32 bytes, and G5s are 128 bytes.  Most current 
x86 chips are 64 bytes though.

Chris

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-21 20:37         ` David S. Miller
@ 2005-06-22  7:27           ` Eric Dumazet
  2005-06-22  8:42           ` P
  1 sibling, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2005-06-22  7:27 UTC (permalink / raw)
  To: David S. Miller
  Cc: gandalf, hadi, shemminger, mitch.a.williams, john.ronciak, mchan,
	buytenh, jdmason, netdev, Robert.Olsson, ganesh.venkatesan,
	jesse.brandeburg

David S. Miller a écrit :
> 
> Also, e1000 sends full MTU sized SKBs down into the stack even if the
> packet is very small.  This also hurts performance a lot.  As
> discussed elsewhere, it should use a "small packet" cut-off just like
> other drivers do.  If the RX frame is less than this cut-off value, a
> new smaller sized SKB is allocated and the RX data copied into it.
> The RX ring SKB is left in-place and given back to the chip.
> 
> My only guess is that the e1000 driver implemented things this way
> to simplify the RX recycling logic.  Well, it is an area ripe for
> improvement in this driver :)
> 
> 

Here is a copy of a mail from Scott Feldman (19/11/2003) when I asked him to add this copybreak feature into e1000 driver :
It did improve performance on my workload. It also reduce the memory requirement a *lot* (It was using 300.000 active TCP sockets, mostly 
receiving short frames)

Eric Dumazet

---------------------------------------------------
Try this (untested) patch.  It's against 5.2.26 (which you don't have),
so hand patch it.  (Sorry).

Do you have any way to measure performance?  CPU utilization?  The copy
isn't free.

Oh, also, this patch doesn't try to recycle the 4K skb that was copied
from.  Instead, it's freed and re-allocated.  Shouldn't be a big deal
because your totally system memory allocation should remain constant
(except for outstanding copybreak skb's).

Let us know how it goes.

-scott

----------------

diff -Naurp e1000-5.2.26/src/e1000_main.c
e1000-5.2.26-cb/src/e1000_main.c
--- e1000-5.2.26/src/e1000_main.c	2003-11-17 19:23:38.000000000
-0800
+++ e1000-5.2.26-cb/src/e1000_main.c	2003-11-18 18:18:07.000000000
-0800
@@ -2343,6 +2343,20 @@ e1000_clean_rx_irq(struct e1000_adapter
  			}
  		}

+		/* RONCH 11/18/03 - code added for copybreak test */
+#define E1000_CB_LENGTH 128
+		if(length < E1000_CB_LENGTH ) {
+			struct sk_buff *new_skb = dev_alloc_skb(length
+2);
+			if(new_skb) {
+				skb_reserve(new_skb, 2);
+				new_skb->dev = netdev;
+				memcpy(new_skb->data, skb->data,
length);
+				dev_kfree_skb(skb);
+				skb = new_skb;
+			}
+		}
+		/* end copybreak code */
+
  		/* Good Receive */
  		skb_put(skb, length - ETHERNET_FCS_SIZE);

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-21 20:37         ` David S. Miller
  2005-06-22  7:27           ` Eric Dumazet
@ 2005-06-22  8:42           ` P
  2005-06-22 19:37             ` jamal
  1 sibling, 1 reply; 52+ messages in thread
From: P @ 2005-06-22  8:42 UTC (permalink / raw)
  To: David S. Miller
  Cc: gandalf, hadi, shemminger, mitch.a.williams, john.ronciak, mchan,
	buytenh, jdmason, netdev, Robert.Olsson, ganesh.venkatesan,
	jesse.brandeburg

David S. Miller wrote:
> Also, e1000 sends full MTU sized SKBs down into the stack even if the
> packet is very small.  This also hurts performance a lot.  As
> discussed elsewhere, it should use a "small packet" cut-off just like
> other drivers do.  If the RX frame is less than this cut-off value, a
> new smaller sized SKB is allocated and the RX data copied into it.
> The RX ring SKB is left in-place and given back to the chip.

Yes the copy is essentially free here as the data is already cached.

As a data point, I went the whole hog and used buffer recycling
in my essentially packet sniffing application. I.E. there are no
allocs per packet at all, and this make a HUGE difference. On a
2x3.4GHz 2xe1000 system I can receive 620Kpps per port sustained
into my userspace app which does a LOT of processing per packet.
Without the buffer recycling is was around 250Kpps.
Note I don't reuse an skb until the packet is copied into a
PACKET_MMAP buffer.

-- 
Pádraig Brady - http://www.pixelbeat.org
--

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22  4:44                 ` Chris Friesen
@ 2005-06-22 11:31                   ` Andi Kleen
  0 siblings, 0 replies; 52+ messages in thread
From: Andi Kleen @ 2005-06-22 11:31 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Donald Becker, Andi Kleen, Rick Jones, netdev, davem

> If I recall, G4 chips are 32 bytes, and G5s are 128 bytes.  Most current 
> x86 chips are 64 bytes though.

P4s are effectively 128 byte. And that is the most common x86 right now.

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: RFC: NAPI packet weighting patch
  2005-06-22  0:08               ` Donald Becker
  2005-06-22  4:44                 ` Chris Friesen
@ 2005-06-22 16:23                 ` Leonid Grossman
  2005-06-22 16:37                   ` jamal
  2005-06-22 17:05                   ` RFC: NAPI packet weighting patch Andi Kleen
  1 sibling, 2 replies; 52+ messages in thread
From: Leonid Grossman @ 2005-06-22 16:23 UTC (permalink / raw)
  To: 'Donald Becker', 'Andi Kleen'
  Cc: 'Rick Jones', netdev, davem


> 
> See the comment above. We decide if a packet is multicast vs. 
> unicast, IP vs. other at approximately 
> interrupt/"rx_copybreak" time.  Very few NIC provide this 
> info in status bits, so we end up looking at the packet 
> header.  That read moves the previously known-uncached data 
> (after all, it was just came in from a bus write) into the L1 
> cache for the CPU handling the device.  Once it's there, the 
> copy is almost free.

What status bits a NIC has to provide, in order for the stack to avoid
touching headers?
In our case, the headers are separated by the hardware so ideally we would
like to avoid any header processing altogether,
and reduce the number of cache misses.

> 
> [[ Background: Yes, the allocating the new skbuff is very 
> expensive.  But we can either allocate a new, correctly-sized 
> skbuff to copy into, or allocate a new full-sized skbuff to 
> replace the one we will send to the Rx queue.  ]] 
>  
> > >    - cold memory lines from PCI writes
> > 
> > I suspect in '96 chipsets also didn't do as aggressive 
> prefetching as 
> > they do today.
> 
> Prefetching helps linear read bandwidth, but we shouldn't be 
> triggering it.  And I claim that cache line prefetching only 
> restores the relative balance between L1/L2 caches, otherwise 
> the long L2 cache lines would be very expensive with 
> bump-read-bump-read with linear scans through memory.
> 
> -- 
> Donald Becker				becker@scyld.com
> Scyld Software	 			Scyld Beowulf 
> cluster systems
> 914 Bay Ridge Road, Suite 220		www.scyld.com
> Annapolis MD 21403			410-990-9993
> 
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: RFC: NAPI packet weighting patch
  2005-06-22 16:23                 ` Leonid Grossman
@ 2005-06-22 16:37                   ` jamal
  2005-06-22 18:00                     ` Leonid Grossman
  2005-06-22 17:05                   ` RFC: NAPI packet weighting patch Andi Kleen
  1 sibling, 1 reply; 52+ messages in thread
From: jamal @ 2005-06-22 16:37 UTC (permalink / raw)
  To: Leonid Grossman
  Cc: 'Donald Becker', 'Andi Kleen',
	'Rick Jones', netdev, davem

On Wed, 2005-22-06 at 09:23 -0700, Leonid Grossman wrote:
> > 
> > See the comment above. We decide if a packet is multicast vs. 
> > unicast, IP vs. other at approximately 
> > interrupt/"rx_copybreak" time.  Very few NIC provide this 
> > info in status bits, so we end up looking at the packet 
> > header.  That read moves the previously known-uncached data 
> > (after all, it was just came in from a bus write) into the L1 
> > cache for the CPU handling the device.  Once it's there, the 
> > copy is almost free.
> 
> What status bits a NIC has to provide, in order for the stack to avoid
> touching headers?
> In our case, the headers are separated by the hardware so ideally we would
> like to avoid any header processing altogether,
> and reduce the number of cache misses.
> 

Provide metadata that can be used to totaly replace eth_type_trans()
i.e answer the questions: is it multi/uni/broadcast, is the packet for
us (you would need to be programmed with what for us means), Is it IP,
ARP etc. I am sure any standard NIC these days can do a subset of these
 You want to go one step further then allow the user to download a
number of filters and tell you what tag you should put on the descriptor
when sending the packet to user space on a match or mismatch. 
If say you allowed 1024 such filters (not very different from the
current multicast filters), you could cut down a lot of CPU time.

cheers,
jamal

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 16:23                 ` Leonid Grossman
  2005-06-22 16:37                   ` jamal
@ 2005-06-22 17:05                   ` Andi Kleen
  1 sibling, 0 replies; 52+ messages in thread
From: Andi Kleen @ 2005-06-22 17:05 UTC (permalink / raw)
  To: Leonid Grossman
  Cc: 'Donald Becker', 'Andi Kleen',
	'Rick Jones', netdev, davem

On Wed, Jun 22, 2005 at 09:23:41AM -0700, Leonid Grossman wrote:
> 
> > 
> > See the comment above. We decide if a packet is multicast vs. 
> > unicast, IP vs. other at approximately 
> > interrupt/"rx_copybreak" time.  Very few NIC provide this 
> > info in status bits, so we end up looking at the packet 
> > header.  That read moves the previously known-uncached data 
> > (after all, it was just came in from a bus write) into the L1 
> > cache for the CPU handling the device.  Once it's there, the 
> > copy is almost free.
> 
> What status bits a NIC has to provide, in order for the stack to avoid
> touching headers?

To avoid it completely is pretty hard - you would need to supply
nearly everything in the header.

But when you supply L2 protocol/ and unicast/broadcast/multicast information
and if the packet is destined to the localhost or not then the 
headers can be gotten with a prefetch early and then when 
the header is later processed then it might be with some luck
already in cache.

BTW quite a few modern NICs provide this information actually contrary
to what Donald stated (sometimes with restrictions like only
works without multicast), but it hasn't been widely used yet.

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: RFC: NAPI packet weighting patch
  2005-06-22 16:37                   ` jamal
@ 2005-06-22 18:00                     ` Leonid Grossman
  2005-06-22 18:06                       ` Andi Kleen
  0 siblings, 1 reply; 52+ messages in thread
From: Leonid Grossman @ 2005-06-22 18:00 UTC (permalink / raw)
  To: hadi
  Cc: 'Donald Becker', 'Andi Kleen',
	'Rick Jones', netdev, davem

 

> -----Original Message-----
> From: jamal [mailto:hadi@cyberus.ca] 
> Sent: Wednesday, June 22, 2005 9:37 AM
> To: Leonid Grossman
> Cc: 'Donald Becker'; 'Andi Kleen'; 'Rick Jones'; 
> netdev@oss.sgi.com; davem@redhat.com
> Subject: RE: RFC: NAPI packet weighting patch
> 
> On Wed, 2005-22-06 at 09:23 -0700, Leonid Grossman wrote:
> > > 
> > > See the comment above. We decide if a packet is multicast vs. 
> > > unicast, IP vs. other at approximately interrupt/"rx_copybreak" 
> > > time.  Very few NIC provide this info in status bits, so 
> we end up 
> > > looking at the packet header.  That read moves the previously 
> > > known-uncached data (after all, it was just came in from a bus 
> > > write) into the L1 cache for the CPU handling the device. 
>  Once it's 
> > > there, the copy is almost free.
> > 
> > What status bits a NIC has to provide, in order for the 
> stack to avoid 
> > touching headers?
> > In our case, the headers are separated by the hardware so 
> ideally we 
> > would like to avoid any header processing altogether, and 
> reduce the 
> > number of cache misses.
> > 
> 
> Provide metadata that can be used to totaly replace 
> eth_type_trans() i.e answer the questions: is it 
> multi/uni/broadcast, is the packet for us (you would need to 
> be programmed with what for us means), Is it IP, ARP etc. I 
> am sure any standard NIC these days can do a subset of these  
> You want to go one step further then allow the user to 
> download a number of filters and tell you what tag you should 
> put on the descriptor when sending the packet to user space 
> on a match or mismatch. 
> If say you allowed 1024 such filters (not very different from 
> the current multicast filters), you could cut down a lot of CPU time.


Well, this is all supported in the hardware. 
The number of filters is only 256 (not 1024) for direct match, but it is
unlimited for a hash match.
Of course, the upper layer still needs to be able to take advantage of the
filters...

Outside of the filters capability, from the (granted, pretty limited)
testing we see some noticeable improvement from providing status bits but it
is not as big as I would expect,
It looks like the headers are still being touched somewhere... We will look
at this some more.

Thanks, Leonid

> 
> cheers,
> jamal
> 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 18:00                     ` Leonid Grossman
@ 2005-06-22 18:06                       ` Andi Kleen
  2005-06-22 20:22                         ` David S. Miller
  0 siblings, 1 reply; 52+ messages in thread
From: Andi Kleen @ 2005-06-22 18:06 UTC (permalink / raw)
  To: Leonid Grossman
  Cc: hadi, 'Donald Becker', 'Andi Kleen',
	'Rick Jones', netdev, davem

> Outside of the filters capability, from the (granted, pretty limited)
> testing we see some noticeable improvement from providing status bits but it
> is not as big as I would expect,
> It looks like the headers are still being touched somewhere... We will look
> at this some more.

The headers are read of course in the main stack. No way around that.

It basically helps you only when you can space the prefetch for the header
out long enough that the data is in cache when you need it. However
it is tricky because CPUs have only a limited load queue entries and doing
too many prefetches will just overflow that.

This can be done by batching L2 packet processing, but doing so 
is not good for your latency.

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22  8:42           ` P
@ 2005-06-22 19:37             ` jamal
  2005-06-23  8:56               ` P
  0 siblings, 1 reply; 52+ messages in thread
From: jamal @ 2005-06-22 19:37 UTC (permalink / raw)
  To: P
  Cc: David S. Miller, gandalf, shemminger, mitch.a.williams,
	john.ronciak, mchan, buytenh, jdmason, netdev, Robert.Olsson,
	ganesh.venkatesan, jesse.brandeburg

On Wed, 2005-22-06 at 09:42 +0100, P@draigBrady.com wrote:

> 
> Yes the copy is essentially free here as the data is already cached.
> 
> As a data point, I went the whole hog and used buffer recycling
> in my essentially packet sniffing application. I.E. there are no
> allocs per packet at all, and this make a HUGE difference. On a
> 2x3.4GHz 2xe1000 system I can receive 620Kpps per port sustained
> into my userspace app which does a LOT of processing per packet.
> Without the buffer recycling is was around 250Kpps.
> Note I don't reuse an skb until the packet is copied into a
> PACKET_MMAP buffer.

Was this machine SMP? NAPI involved? I take it nothing interfering in
the middle with the headers?

cheers,
jamal

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 18:06                       ` Andi Kleen
@ 2005-06-22 20:22                         ` David S. Miller
  2005-06-22 20:35                           ` Rick Jones
                                             ` (3 more replies)
  0 siblings, 4 replies; 52+ messages in thread
From: David S. Miller @ 2005-06-22 20:22 UTC (permalink / raw)
  To: ak; +Cc: leonid.grossman, hadi, becker, rick.jones2, netdev, davem

From: Andi Kleen <ak@suse.de>
Date: Wed, 22 Jun 2005 20:06:55 +0200

> However it is tricky because CPUs have only a limited load queue
> entries and doing too many prefetches will just overflow that.

Several processors can queue about 8 prefetch requests, and
these slots are independant of those consumed by a load.

Yes, if you queue too many prefetches, the queue overflows.

I think the optimal scheme would be:

1) eth_type_trans() info in RX descriptor
2) prefetch(skb->data) done as early as possible in driver
   RX handling

Actually, I believe to most optimal scheme is:

foo_driver_rx()
{
	for_each_rx_descriptor() {
		...
		skb = driver_priv->rx_skbs[index];
		prefetch(skb->data);

		skb = realloc_or_recycle_rx_descriptor(skb, index);
		if (skb == NULL)
			goto next_rxd;

		skb->prot = eth_type_trans(skb, driver_priv->dev);
		netif_receive_skb(skb);
		...
	next_rxd:
		...
	}
}

The idea is that first the prefetch goes into flight, then you do the
recycle or reallocation of the RX descriptor SKB, then you try to
touch the data.

This makes it very likely the prefetch will be in the cpu in time.

Everyone seems to have this absolute fetish about batching the RX
descriptor refilling work.  It's wrong, it should be done when you
pull a receive packet off the ring, for many reasons.  Off the top of
my head:

1) Descriptors are refilled as soon as possible, decreasing
   the chance of the device hitting the end of the RX ring
   and thus unable to receive a packet.

2) As shown above, it gives you compute time which can be used to
   schedule the prefetch.  This nearly makes RX replenishment free.
   Instead of having the CPU spin on a cache miss when we run
   eth_type_trans() during those cycles, we do useful work.

I'm going to play around with these ideas in the tg3 driver.
Obvious patch below.

--- 1/drivers/net/tg3.c.~1~	2005-06-22 12:33:07.000000000 -0700
+++ 2/drivers/net/tg3.c	2005-06-22 13:19:13.000000000 -0700
@@ -2772,6 +2772,13 @@
 			goto next_pkt_nopost;
 		}

+		/* Prefetch now.  The recycle/realloc of the RX
+		 * entry is moderately expensive, so by the time
+		 * that is complete the data should have reached
+		 * the cpu.
+		 */
+		prefetch(skb->data);
+
 		work_mask |= opaque_key;

 		if ((desc->err_vlan & RXD_ERR_MASK) != 0 &&

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 20:22                         ` David S. Miller
@ 2005-06-22 20:35                           ` Rick Jones
  2005-06-22 20:43                             ` David S. Miller
  2005-06-22 21:10                           ` Andi Kleen
                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 52+ messages in thread
From: Rick Jones @ 2005-06-22 20:35 UTC (permalink / raw)
  To: netdev; +Cc: hadi, becker


> Everyone seems to have this absolute fetish about batching the RX
> descriptor refilling work.  It's wrong, it should be done when you
> pull a receive packet off the ring, for many reasons.  Off the top of
> my head:
> 
> 1) Descriptors are refilled as soon as possible, decreasing
>    the chance of the device hitting the end of the RX ring
>    and thus unable to receive a packet.

IFF one pokes the NIC for each buffer right?

rick jones

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 20:35                           ` Rick Jones
@ 2005-06-22 20:43                             ` David S. Miller
  0 siblings, 0 replies; 52+ messages in thread
From: David S. Miller @ 2005-06-22 20:43 UTC (permalink / raw)
  To: rick.jones2; +Cc: netdev, hadi, becker

From: Rick Jones <rick.jones2@hp.com>
Date: Wed, 22 Jun 2005 13:35:46 -0700

> > Everyone seems to have this absolute fetish about batching the RX
> > descriptor refilling work.  It's wrong, it should be done when you
> > pull a receive packet off the ring, for many reasons.  Off the top of
> > my head:
> > 
> > 1) Descriptors are refilled as soon as possible, decreasing
> >    the chance of the device hitting the end of the RX ring
> >    and thus unable to receive a packet.
> 
> IFF one pokes the NIC for each buffer right?

Or "every 5" or something like that.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 20:22                         ` David S. Miller
  2005-06-22 20:35                           ` Rick Jones
@ 2005-06-22 21:10                           ` Andi Kleen
  2005-06-22 21:16                             ` David S. Miller
  2005-06-22 21:53                             ` Chris Friesen
  2005-06-22 21:38                           ` Eric Dumazet
  2005-06-22 22:42                           ` Leonid Grossman
  3 siblings, 2 replies; 52+ messages in thread
From: Andi Kleen @ 2005-06-22 21:10 UTC (permalink / raw)
  To: David S. Miller
  Cc: ak, leonid.grossman, hadi, becker, rick.jones2, netdev, davem

On Wed, Jun 22, 2005 at 01:22:41PM -0700, David S. Miller wrote:
> From: Andi Kleen <ak@suse.de>
> Date: Wed, 22 Jun 2005 20:06:55 +0200
> 
> > However it is tricky because CPUs have only a limited load queue
> > entries and doing too many prefetches will just overflow that.
> 
> Several processors can queue about 8 prefetch requests, and
> these slots are independant of those consumed by a load.

8 entries? That sounds very small. Is that an old Sparc or something? :)

An Opteron has 44 entries, effectively 32 for L2.   Netburst
or POWER4 derived CPUs have more than that.

> Yes, if you queue too many prefetches, the queue overflows.
> 
> I think the optimal scheme would be:
> 
> 1) eth_type_trans() info in RX descriptor
> 2) prefetch(skb->data) done as early as possible in driver
>    RX handling
> 
> Actually, I believe to most optimal scheme is:

Looks reasonable. Not sure about the "most optimal" though, some benchmarking
of different schemes would be probably a good idea.

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 21:10                           ` Andi Kleen
@ 2005-06-22 21:16                             ` David S. Miller
  2005-06-22 21:53                             ` Chris Friesen
  1 sibling, 0 replies; 52+ messages in thread
From: David S. Miller @ 2005-06-22 21:16 UTC (permalink / raw)
  To: ak; +Cc: leonid.grossman, hadi, becker, rick.jones2, netdev, davem

From: Andi Kleen <ak@suse.de>
Date: Wed, 22 Jun 2005 23:10:58 +0200

> 8 entries? That sounds very small. Is that an old Sparc or something? :)

Hey, Sparc does suck, this isn't news for anyone :-)

> Looks reasonable. Not sure about the "most optimal" though, some benchmarking
> of different schemes would be probably a good idea.

Absolutely.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 20:22                         ` David S. Miller
  2005-06-22 20:35                           ` Rick Jones
  2005-06-22 21:10                           ` Andi Kleen
@ 2005-06-22 21:38                           ` Eric Dumazet
  2005-06-22 22:13                             ` Eric Dumazet
  2005-06-22 22:23                             ` David S. Miller
  2005-06-22 22:42                           ` Leonid Grossman
  3 siblings, 2 replies; 52+ messages in thread
From: Eric Dumazet @ 2005-06-22 21:38 UTC (permalink / raw)
  To: David S. Miller
  Cc: ak, leonid.grossman, hadi, becker, rick.jones2, netdev, davem

David S. Miller a écrit :

> 
> 2) As shown above, it gives you compute time which can be used to
>    schedule the prefetch.  This nearly makes RX replenishment free.
>    Instead of having the CPU spin on a cache miss when we run
>    eth_type_trans() during those cycles, we do useful work.
> 
> I'm going to play around with these ideas in the tg3 driver.
> Obvious patch below.


Then maybe we could also play with prefetchw() in the case the incoming frame
is small enough to be copied to a new skb.

drivers/net/tg3.c

	copy_skb = dev_alloc_skb(len + 2);
	if (copy_skb == NULL)
		goto drop_it_no_recycle;
+	prefetchw(copy_skb->data);

	copy_skb->dev = tp->dev;
	skb_reserve(copy_skb, 2);
	skb_put(copy_skb, len);

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 21:10                           ` Andi Kleen
  2005-06-22 21:16                             ` David S. Miller
@ 2005-06-22 21:53                             ` Chris Friesen
  2005-06-22 22:11                               ` Andi Kleen
  1 sibling, 1 reply; 52+ messages in thread
From: Chris Friesen @ 2005-06-22 21:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David S. Miller, leonid.grossman, hadi, becker, rick.jones2,
	netdev

Andi Kleen wrote:

> 8 entries? That sounds very small. Is that an old Sparc or something? :)

The G5 has 8 prefetch streams.  Not an ancient cpu.


Chris

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 21:53                             ` Chris Friesen
@ 2005-06-22 22:11                               ` Andi Kleen
  0 siblings, 0 replies; 52+ messages in thread
From: Andi Kleen @ 2005-06-22 22:11 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Andi Kleen, David S. Miller, leonid.grossman, hadi, becker,
	rick.jones2, netdev

On Wed, Jun 22, 2005 at 03:53:30PM -0600, Chris Friesen wrote:
> Andi Kleen wrote:
> 
> >8 entries? That sounds very small. Is that an old Sparc or something? :)
> 
> The G5 has 8 prefetch streams.  Not an ancient cpu.

prefetch stream means a context of the auto prefetcher. 

It different from a load queue entry which is just a load of a cache line
which can be triggered by user instructions or the auto prefetcher. 
Each prefetch stream would consume a lot of them, so just for your 8 streams 
above you probably need a large two digit number or more.

I don't have exact numbers for the PPC970, but afaik its LS unit
has a very long queue. On POWER4 (which is a very similar CPU) we see 
a lot of races that don't happen on other platforms. That seems to be 
because it reorders writes every aggressively. I suppose this is true for
reads as well.

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 21:38                           ` Eric Dumazet
@ 2005-06-22 22:13                             ` Eric Dumazet
  2005-06-22 22:30                               ` David S. Miller
  2005-06-22 22:23                             ` David S. Miller
  1 sibling, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2005-06-22 22:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, ak, leonid.grossman, hadi, becker, rick.jones2,
	netdev, davem

Eric Dumazet a écrit :

> 
> Then maybe we could also play with prefetchw() in the case the incoming 
> frame
> is small enough to be copied to a new skb.
> 
> drivers/net/tg3.c
> 
>     copy_skb = dev_alloc_skb(len + 2);
>     if (copy_skb == NULL)
>         goto drop_it_no_recycle;
> +    prefetchw(copy_skb->data);
> 
>     copy_skb->dev = tp->dev;
>     skb_reserve(copy_skb, 2);
>     skb_put(copy_skb, len);
> 
> 
> 

I also found that the memcpy() done to copy the data to the new skb suffers from misalignment.

This is because of skb_reserve(skbs, 2) that was done on both skb, and memcpy() (at least on x86_64) doing long word copies without checking 
alignment of source or destination.

Maybe we could :

1) make sure both skbs had the same skb_reserve() of 2 (thats not clear because tg3.c mixes the '2' and tp->rx_offset,
and according to a comment :
	rx_offset != 2 iff this is a 5701 card running 
                                          	in PCI-X mode

2) and do :

-	memcpy(copy_skb->data, skb->data, len);
+	memcpy(copy_skb->data-2, skb->data-2, len+2);

(That is copy 2 more bytes, but gain aligned copy to speedup memcpy())

Eric Dumazet

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 21:38                           ` Eric Dumazet
  2005-06-22 22:13                             ` Eric Dumazet
@ 2005-06-22 22:23                             ` David S. Miller
  2005-06-23 12:14                               ` jamal
  1 sibling, 1 reply; 52+ messages in thread
From: David S. Miller @ 2005-06-22 22:23 UTC (permalink / raw)
  To: dada1; +Cc: ak, leonid.grossman, hadi, becker, rick.jones2, netdev, davem

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 22 Jun 2005 23:38:21 +0200

> Then maybe we could also play with prefetchw() in the case the
> incoming frame is small enough to be copied to a new skb.

That's a good idea too.  In fact, this would deal with platforms
that use non-temporal stores in their memcpy() implementation.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 22:13                             ` Eric Dumazet
@ 2005-06-22 22:30                               ` David S. Miller
  0 siblings, 0 replies; 52+ messages in thread
From: David S. Miller @ 2005-06-22 22:30 UTC (permalink / raw)
  To: dada1; +Cc: ak, leonid.grossman, hadi, becker, rick.jones2, netdev, davem

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 23 Jun 2005 00:13:21 +0200

> I also found that the memcpy() done to copy the data to the new skb suffers from misalignment.
> 
> This is because of skb_reserve(skbs, 2) that was done on both skb, and memcpy() (at least on x86_64) doing long word copies without checking 
> alignment of source or destination.
> 
> Maybe we could :
>
> 1) make sure both skbs had the same skb_reserve() of 2 (thats not clear because tg3.c mixes the '2' and tp->rx_offset,
> and according to a comment :
> 	rx_offset != 2 iff this is a 5701 card running 
>                                           	in PCI-X mode
>
> 2) and do :
> 
> -	memcpy(copy_skb->data, skb->data, len);
> +	memcpy(copy_skb->data-2, skb->data-2, len+2);
> 
> (That is copy 2 more bytes, but gain aligned copy to speedup memcpy())

Yep, good idea.  Actually, the driver should be using
NET_IP_ALIGN for rx_offset unless it's the 5701 card running
in PCI-X mode case.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: RFC: NAPI packet weighting patch
  2005-06-22 20:22                         ` David S. Miller
                                             ` (2 preceding siblings ...)
  2005-06-22 21:38                           ` Eric Dumazet
@ 2005-06-22 22:42                           ` Leonid Grossman
  2005-06-22 23:13                             ` Andi Kleen
  3 siblings, 1 reply; 52+ messages in thread
From: Leonid Grossman @ 2005-06-22 22:42 UTC (permalink / raw)
  To: 'David S. Miller', ak; +Cc: hadi, becker, rick.jones2, netdev, davem

 

> -----Original Message-----
> From: David S. Miller [mailto:davem@davemloft.net] 
> Sent: Wednesday, June 22, 2005 1:23 PM
> To: ak@suse.de
> Cc: leonid.grossman@neterion.com; hadi@cyberus.ca; 
> becker@scyld.com; rick.jones2@hp.com; netdev@oss.sgi.com; 
> davem@redhat.com
> Subject: Re: RFC: NAPI packet weighting patch
> 
> From: Andi Kleen <ak@suse.de>
> Date: Wed, 22 Jun 2005 20:06:55 +0200
> 
> > However it is tricky because CPUs have only a limited load queue 
> > entries and doing too many prefetches will just overflow that.
> 
> Several processors can queue about 8 prefetch requests, and 
> these slots are independant of those consumed by a load.
> 
> Yes, if you queue too many prefetches, the queue overflows.
> 
> I think the optimal scheme would be:
> 
> 1) eth_type_trans() info in RX descriptor
> 2) prefetch(skb->data) done as early as possible in driver
>    RX handling
> 
> Actually, I believe to most optimal scheme is:
> 
> foo_driver_rx()
> {
> 	for_each_rx_descriptor() {
> 		...
> 		skb = driver_priv->rx_skbs[index];
> 		prefetch(skb->data);
> 
> 		skb = realloc_or_recycle_rx_descriptor(skb, index);
> 		if (skb == NULL)
> 			goto next_rxd;
> 
> 		skb->prot = eth_type_trans(skb, driver_priv->dev);
> 		netif_receive_skb(skb);
> 		...
> 	next_rxd:
> 		...
> 	}
> }
> 
> The idea is that first the prefetch goes into flight, then 
> you do the recycle or reallocation of the RX descriptor SKB, 
> then you try to touch the data.
> 
> This makes it very likely the prefetch will be in the cpu in time.
> 
> Everyone seems to have this absolute fetish about batching 
> the RX descriptor refilling work.  It's wrong, it should be 
> done when you pull a receive packet off the ring, for many 
> reasons.  Off the top of my head:

This is very hw-dependent, since there are NICs that read descriptors in
batches anyways - but the second argument below is compelling.

> 
> 1) Descriptors are refilled as soon as possible, decreasing
>    the chance of the device hitting the end of the RX ring
>    and thus unable to receive a packet.
> 
> 2) As shown above, it gives you compute time which can be used to
>    schedule the prefetch.  This nearly makes RX replenishment free.
>    Instead of having the CPU spin on a cache miss when we run
>    eth_type_trans() during those cycles, we do useful work.
> 
> I'm going to play around with these ideas in the tg3 driver.
> Obvious patch below.

We will play around with the s2io driver as well, there seem to be several
interesting ideas to try - thanks a lot for the input!
Cheers, Leonid

> 
> --- 1/drivers/net/tg3.c.~1~	2005-06-22 12:33:07.000000000 -0700
> +++ 2/drivers/net/tg3.c	2005-06-22 13:19:13.000000000 -0700
> @@ -2772,6 +2772,13 @@
>  			goto next_pkt_nopost;
>  		}
>  
> +		/* Prefetch now.  The recycle/realloc of the RX
> +		 * entry is moderately expensive, so by the time
> +		 * that is complete the data should have reached
> +		 * the cpu.
> +		 */
> +		prefetch(skb->data);
> +
>  		work_mask |= opaque_key;
>  
>  		if ((desc->err_vlan & RXD_ERR_MASK) != 0 &&
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 22:42                           ` Leonid Grossman
@ 2005-06-22 23:13                             ` Andi Kleen
  2005-06-22 23:19                               ` David S. Miller
  2005-06-23  9:32                               ` [PATCH] x86_64 prefetchw() function can take into account CONFIG_MK8 / CONFIG_MPSC Eric Dumazet
  0 siblings, 2 replies; 52+ messages in thread
From: Andi Kleen @ 2005-06-22 23:13 UTC (permalink / raw)
  To: Leonid Grossman
  Cc: 'David S. Miller', ak, hadi, becker, rick.jones2, netdev,
	davem

> This is very hw-dependent, since there are NICs that read descriptors in
> batches anyways - but the second argument below is compelling.

The computing time must be quite long to be really a win.
You need to waste a few hundred cycles at least on a modern fast CPU.

-Andi
> > 
> > 2) As shown above, it gives you compute time which can be used to
> >    schedule the prefetch.  This nearly makes RX replenishment free.
> >    Instead of having the CPU spin on a cache miss when we run
> >    eth_type_trans() during those cycles, we do useful work.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 23:13                             ` Andi Kleen
@ 2005-06-22 23:19                               ` David S. Miller
  2005-06-22 23:23                                 ` Andi Kleen
  2005-06-23  9:32                               ` [PATCH] x86_64 prefetchw() function can take into account CONFIG_MK8 / CONFIG_MPSC Eric Dumazet
  1 sibling, 1 reply; 52+ messages in thread
From: David S. Miller @ 2005-06-22 23:19 UTC (permalink / raw)
  To: ak; +Cc: leonid.grossman, davem, hadi, becker, rick.jones2, netdev

From: Andi Kleen <ak@suse.de>
Date: Thu, 23 Jun 2005 01:13:00 +0200

> The computing time must be quite long to be really a win.
> You need to waste a few hundred cycles at least on a modern fast CPU.

SKB allocation more than fits this requirement, and that is exactly
what the RX descriptor replenishment will do.

Even if SKB allocation was only half the necessary number of cycles
for the prefetch to hit the cpu, it'd still be a win.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 23:19                               ` David S. Miller
@ 2005-06-22 23:23                                 ` Andi Kleen
  0 siblings, 0 replies; 52+ messages in thread
From: Andi Kleen @ 2005-06-22 23:23 UTC (permalink / raw)
  To: David S. Miller
  Cc: ak, leonid.grossman, davem, hadi, becker, rick.jones2, netdev

On Wed, Jun 22, 2005 at 07:19:56PM -0400, David S. Miller wrote:
> From: Andi Kleen <ak@suse.de>
> Date: Thu, 23 Jun 2005 01:13:00 +0200
> 
> > The computing time must be quite long to be really a win.
> > You need to waste a few hundred cycles at least on a modern fast CPU.
> 
> SKB allocation more than fits this requirement, and that is exactly
> what the RX descriptor replenishment will do.

It shouldn't in theory. Unless they did something bad to the slab
allocator again when I wasn't looking ;-) 

> 
> Even if SKB allocation was only half the necessary number of cycles
> for the prefetch to hit the cpu, it'd still be a win.

If it's too small then it might be left in the noise.

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 19:37             ` jamal
@ 2005-06-23  8:56               ` P
  0 siblings, 0 replies; 52+ messages in thread
From: P @ 2005-06-23  8:56 UTC (permalink / raw)
  To: hadi
  Cc: David S. Miller, gandalf, shemminger, mitch.a.williams,
	john.ronciak, mchan, buytenh, jdmason, netdev, Robert.Olsson,
	ganesh.venkatesan, jesse.brandeburg

jamal wrote:
> On Wed, 2005-22-06 at 09:42 +0100, P@draigBrady.com wrote:
> 
> 
>>Yes the copy is essentially free here as the data is already cached.
>>
>>As a data point, I went the whole hog and used buffer recycling
>>in my essentially packet sniffing application. I.E. there are no
>>allocs per packet at all, and this make a HUGE difference. On a
>>2x3.4GHz 2xe1000 system I can receive 620Kpps per port sustained
>>into my userspace app which does a LOT of processing per packet.
>>Without the buffer recycling is was around 250Kpps.
>>Note I don't reuse an skb until the packet is copied into a
>>PACKET_MMAP buffer.
> 
> 
> Was this machine SMP?

Yes. 2 x 3.4GHz P4s
1 logical CPU per port (irq affinity)
1 thread (NB on same logical CPU as irq (sched_affinity))
to do user space per packet processing.

> NAPI involved?

Yep.

> I take it nothing interfering in
> the middle with the headers?

It uses the standard path to PACKET_MMAP buffer
e1000_clean_rx_irq -> netif_receive_skb -> tpacket_rcv

Pádraig.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] x86_64 prefetchw() function can take into account CONFIG_MK8 / CONFIG_MPSC
  2005-06-22 23:13                             ` Andi Kleen
  2005-06-22 23:19                               ` David S. Miller
@ 2005-06-23  9:32                               ` Eric Dumazet
  2005-06-23  9:37                                 ` Eric Dumazet
  2005-06-23 11:31                                 ` Andi Kleen
  1 sibling, 2 replies; 52+ messages in thread
From: Eric Dumazet @ 2005-06-23  9:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 155 bytes --]

If we build a x86_64 kernel for an AMD64 or for an Intel EMT64, no need to use alternative_input.
Reserve alternative_input only for a generic kernel.





[-- Attachment #2: patch.2 --]
[-- Type: text/plain, Size: 1075 bytes --]

diff -Nru linux-2.6.12/include/asm-x86_64/processor.h linux-2.6.12-orig/include/asm-x86_64/processor.h
--- linux-2.6.12-orig/include/asm-x86_64/processor.h 2005-06-17 21:48:29.000000000 +0200
+++ linux-2.6.12/include/asm-x86_64/processor.h       2005-06-23 11:20:08.000000000 +0200
@@ -389,10 +389,21 @@
 #define ARCH_HAS_PREFETCHW 1
 static inline void prefetchw(void *x)
 {
+#if defined(CONFIG_MK8)
+       /* AMD64 / MK8 has 3DNOW, we can emit a true prefetchw, using a "m" in the asm input */
+       asm volatile("prefetchw %0" :: "m" (*(unsigned long *)x));
+#elif defined(CONFIG_MPSC)
+       /* Intel EMT64 does not have 3DNOW, no prefetchw instruction */
+#else
+       /* If we build a generic X86_64 kernel,
+        * we must use alternative_input() and a "r" asm constraint to make sure
+        * the size of the instruction will be <= 5
+        */
        alternative_input(ASM_NOP5,
                          "prefetchw (%1)",
                          X86_FEATURE_3DNOW,
                          "r" (x));
+#endif
 }

 #define ARCH_HAS_SPINLOCK_PREFETCH 1

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] x86_64 prefetchw() function can take into account CONFIG_MK8 / CONFIG_MPSC
  2005-06-23  9:32                               ` [PATCH] x86_64 prefetchw() function can take into account CONFIG_MK8 / CONFIG_MPSC Eric Dumazet
@ 2005-06-23  9:37                                 ` Eric Dumazet
  2005-06-23 11:31                                 ` Andi Kleen
  1 sibling, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2005-06-23  9:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andi Kleen, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 228 bytes --]

If we build a x86_64 kernel for an AMD64 or for an Intel EMT64, no need to use alternative_input.
Reserve alternative_input only for a generic kernel.

Thank you

Eric Dumazet

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


[-- Attachment #2: patch.2 --]
[-- Type: text/plain, Size: 1075 bytes --]

diff -Nru linux-2.6.12/include/asm-x86_64/processor.h linux-2.6.12-orig/include/asm-x86_64/processor.h
--- linux-2.6.12-orig/include/asm-x86_64/processor.h 2005-06-17 21:48:29.000000000 +0200
+++ linux-2.6.12/include/asm-x86_64/processor.h       2005-06-23 11:20:08.000000000 +0200
@@ -389,10 +389,21 @@
 #define ARCH_HAS_PREFETCHW 1
 static inline void prefetchw(void *x)
 {
+#if defined(CONFIG_MK8)
+       /* AMD64 / MK8 has 3DNOW, we can emit a true prefetchw, using a "m" in the asm input */
+       asm volatile("prefetchw %0" :: "m" (*(unsigned long *)x));
+#elif defined(CONFIG_MPSC)
+       /* Intel EMT64 does not have 3DNOW, no prefetchw instruction */
+#else
+       /* If we build a generic X86_64 kernel,
+        * we must use alternative_input() and a "r" asm constraint to make sure
+        * the size of the instruction will be <= 5
+        */
        alternative_input(ASM_NOP5,
                          "prefetchw (%1)",
                          X86_FEATURE_3DNOW,
                          "r" (x));
+#endif
 }

 #define ARCH_HAS_SPINLOCK_PREFETCH 1

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] x86_64 prefetchw() function can take into account CONFIG_MK8 / CONFIG_MPSC
  2005-06-23  9:32                               ` [PATCH] x86_64 prefetchw() function can take into account CONFIG_MK8 / CONFIG_MPSC Eric Dumazet
  2005-06-23  9:37                                 ` Eric Dumazet
@ 2005-06-23 11:31                                 ` Andi Kleen
  2005-06-23 12:53                                   ` Eric Dumazet
  1 sibling, 1 reply; 52+ messages in thread
From: Andi Kleen @ 2005-06-23 11:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andi Kleen, linux-kernel

On Thu, Jun 23, 2005 at 11:32:34AM +0200, Eric Dumazet wrote:
> If we build a x86_64 kernel for an AMD64 or for an Intel EMT64, no need to 
> use alternative_input.
> Reserve alternative_input only for a generic kernel.

An EM64T kernel should still boot on AMD64 and vice versa. Rejected.

-Andi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-22 22:23                             ` David S. Miller
@ 2005-06-23 12:14                               ` jamal
  2005-06-23 17:36                                 ` David Mosberger
  0 siblings, 1 reply; 52+ messages in thread
From: jamal @ 2005-06-23 12:14 UTC (permalink / raw)
  To: David S. Miller
  Cc: Lennert Buytenhek, davidm, netdev, dada1, ak, leonid.grossman,
	becker, rick.jones2, davem

On Wed, 2005-22-06 at 15:23 -0700, David S. Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Wed, 22 Jun 2005 23:38:21 +0200
> 
> > Then maybe we could also play with prefetchw() in the case the
> > incoming frame is small enough to be copied to a new skb.
> 
> That's a good idea too.  In fact, this would deal with platforms
> that use non-temporal stores in their memcpy() implementation.

For the fans of the e1000 (or even the tg3 deprived people), heres a 
patch which originated from David Mosberger that i played around (about
9 months back) - it will need some hand patching for the latest driver.
Similar approach: prefetch skb->data,twiddle twiddle not little star,
touch header.

I found the aggressive mode effective on a xeon but i belive David is
using this on x86_64. So Lennert, I lied to you saying it was never
effective on x86. You just have to do the right juju such as factoring
in the memory load-latency and how much cache you have on your specific
CPU.
CCing davidm (in addition To: davem of course ;->) so he may provide
more insight on his tests.

Interesting of course is if you miss the "twiddle here" (as i saw in my
experiments) and do the obvious (such as defining AGGRESSIVE to 0), you
infact end up paying a penalty in performance.

cheers,
jamal

===== drivers/net/e1000/e1000_main.c 1.134 vs edited =====
--- 1.134/drivers/net/e1000/e1000_main.c        2004-09-12 16:52:48
-07:00
+++ edited/drivers/net/e1000/e1000_main.c       2004-09-30 06:05:11
-07:00
@@ -2278,12 +2278,30 @@
        uint8_t last_byte;
        unsigned int i;
        boolean_t cleaned = FALSE;
+#define AGGRESSIVE 1
 
        i = rx_ring->next_to_clean;
+#if AGGRESSIVE
+       prefetch(rx_ring->buffer_info[i].skb->data);
+#endif
        rx_desc = E1000_RX_DESC(*rx_ring, i);
 
        while(rx_desc->status & E1000_RXD_STAT_DD) {
                buffer_info = &rx_ring->buffer_info[i];
+# if AGGRESSIVE
+               {
+                       struct e1000_rx_desc *next_rx;
+                       unsigned int j = i + 1;
+
+                       if (j == rx_ring->count)
+                               j = 0;
+                       next_rx = E1000_RX_DESC(*rx_ring, j);
+                       if (next_rx->status & E1000_RXD_STAT_DD)
+                               prefetch(rx_ring->buffer_info[j].skb->data);
+               }
+# else
+               prefetch(buffer_info->skb->data);
+# endif
 #ifdef CONFIG_E1000_NAPI
                if(*work_done >= work_to_do)
                        break;

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] x86_64 prefetchw() function can take into account CONFIG_MK8 / CONFIG_MPSC
  2005-06-23 11:31                                 ` Andi Kleen
@ 2005-06-23 12:53                                   ` Eric Dumazet
  2005-06-23 13:10                                     ` Andi Kleen
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2005-06-23 12:53 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

Andi Kleen a écrit :
> On Thu, Jun 23, 2005 at 11:32:34AM +0200, Eric Dumazet wrote:
> 
>>If we build a x86_64 kernel for an AMD64 or for an Intel EMT64, no need to 
>>use alternative_input.
>>Reserve alternative_input only for a generic kernel.
> 
> 
> An EM64T kernel should still boot on AMD64 and vice versa. Rejected.
> 
> -Andi
> 
> 

OK, I wrongly assumed the 'MK8' or 'MPSC' choices were like x86 choices :

A kernel compiled for a Pentium-4 will not run on a i486.

But then what is the meaning of the choice "Generic-x86-64" in the "Processor family" menu ?
The Help message is : CONFIG_GENERIC_CPU: Generic x86-64 CPU.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] x86_64 prefetchw() function can take into account CONFIG_MK8 / CONFIG_MPSC
  2005-06-23 12:53                                   ` Eric Dumazet
@ 2005-06-23 13:10                                     ` Andi Kleen
  0 siblings, 0 replies; 52+ messages in thread
From: Andi Kleen @ 2005-06-23 13:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andi Kleen, linux-kernel, jh

On Thu, Jun 23, 2005 at 02:53:14PM +0200, Eric Dumazet wrote:
> Andi Kleen a ?crit :
> >On Thu, Jun 23, 2005 at 11:32:34AM +0200, Eric Dumazet wrote:
> >
> >>If we build a x86_64 kernel for an AMD64 or for an Intel EMT64, no need 
> >>to use alternative_input.
> >>Reserve alternative_input only for a generic kernel.
> >
> >
> >An EM64T kernel should still boot on AMD64 and vice versa. Rejected.
> >
> >-Andi
> >
> >
> 
> OK, I wrongly assumed the 'MK8' or 'MPSC' choices were like x86 choices :
> 
> A kernel compiled for a Pentium-4 will not run on a i486.
> 
> But then what is the meaning of the choice "Generic-x86-64" in the 
> "Processor family" menu ?
> The Help message is : CONFIG_GENERIC_CPU: Generic x86-64 CPU.

Optimize for both. It's a catch all setting for future changes
and does not do too much right now.

It also won't pass any -mcpu=... arguments to gcc. This currently
doesn't make any difference (gcc x86-64 default is to optimize
for K8), but I assume it might at some point when gcc gets
a "combined" mode that optimizes for both Intel and AMD CPUs.

-Andi


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: RFC: NAPI packet weighting patch
  2005-06-23 12:14                               ` jamal
@ 2005-06-23 17:36                                 ` David Mosberger
  0 siblings, 0 replies; 52+ messages in thread
From: David Mosberger @ 2005-06-23 17:36 UTC (permalink / raw)
  To: hadi
  Cc: David S. Miller, Lennert Buytenhek, davidm, netdev, dada1, ak,
	leonid.grossman, becker, rick.jones2, davem

>>>>> On Thu, 23 Jun 2005 08:14:11 -0400, jamal <hadi@cyberus.ca> said:

  Jamal> For the fans of the e1000 (or even the tg3 deprived people),
  Jamal> heres a patch which originated from David Mosberger that i
  Jamal> played around (about 9 months back) - it will need some hand
  Jamal> patching for the latest driver.  Similar approach: prefetch
  Jamal> skb->data,twiddle twiddle not little star, touch header.

  Jamal> I found the aggressive mode effective on a xeon but i belive
  Jamal> David is using this on x86_64. So Lennert, I lied to you
  Jamal> saying it was never effective on x86. You just have to do the
  Jamal> right juju such as factoring in the memory load-latency and
  Jamal> how much cache you have on your specific CPU.  CCing davidm
  Jamal> (in addition To: davem of course ;->) so he may provide more
  Jamal> insight on his tests.

I didn't remember what experiments I did, but I found the original
mail, with all the data.  The experiments were done on ia64 (naturally
;-).

Enjoy,

	--david

---
From: David Mosberger <davidm@linux.hpl.hp.com>
To: hadi@cyberus.ca
Cc: Alexey <kuznet@ms2.inr.ac.ru>, "David S. Miller" <davem@davemloft.net>,
        Robert Olsson <Robert.Olsson@data.slu.se>,
        Lennert Buytenhek <buytenh@wantstofly.org>, davidm@hpl.hp.com,
        eranian@linux.hpl.hp.com, grundler@parisc-linux.org
Subject: Re: prefetch
Date: Thu, 30 Sep 2004 06:51:29 -0700
Reply-To: davidm@hpl.hp.com
X-URL: http://www.hpl.hp.com/personal/David_Mosberger/

>>>>> On 27 Sep 2004 11:08:00 -0400, jamal <hadi@cyberus.ca> said:

  Jamal> one of the top abusers of cpu cycles in the netcode is
  Jamal> eth_type_trans() on x86 type hardware. This is where the
  Jamal> first time the skb->data is touched (hence a cache miss).
  Jamal> Clearly a good place to prefecth is in eth_type_trans itself
  Jamal> maybe right at the top you could prefetch skb->data or after
  Jamal> skb_pull() you could prefetch skb->mac.ethernet.

  Jamal> oprofile shows me the cycles being abused
  Jamal> (GLOBAL_POWER_EVENTS on xeon box) went down when i do either;
  Jamal> i cut down more cycles on doing skb->mac.ethernet that
  Jamal> skb->data - but thats a different topic.

  Jamal> My test is purely forwarding: packets come in through eth0,
  Jamal> get exercised by routing code and come out eth1. So the
  Jamal> important parameters for my test case are primarly throughput
  Jamal> and secondary is latency.  Adding the prefetch above while
  Jamal> showing lower CPU cycles, results in decreeased throughput
  Jamal> numbers and higher latency numbers.  What gives?

  Jamal> I am CCing the HP folks since they have some interesting
  Jamal> tools i heard David talk about at SUCON.

I don't have a good setup to measure packet forwarding performance.
However, prefetching skb->data certainly does reduce CPU utilization
on ia64 as the measurements below show.

I tried three versions:

 - original 2.6.9-rc3 (ORIGINAL)
 - 2.6.9-rc3 with a prefetch in e1000_clean_rx_irq (OBVIOUS)
 - 2.6.9-rc3 which prefetches the _next_ rx buffer (AGGRESSIVE)

All 3 cases use an e1000 board with NAPI enabled.

netperf results for 3 runs of ORIGINAL and AGGRESSIVE:

ORIGINAL:

$ netperf -l30 -c -C -H 192.168.10.15 -- -m1 -D
TCP STREAM TEST to 192.168.10.15 : nodelay
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time   Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.  10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384      1    30.00       1.59   99.93    10.94    5155.593  2257.461
 87380  16384      1    30.00       1.62   99.87    11.19    5045.549  2260.294
 87380  16384      1    30.00       1.62   99.89    11.29    5045.269  2281.327

AGGRESSIVE:

$ netperf -l30 -c -C -H 192.168.10.15 -- -m1 -D
TCP STREAM TEST to 192.168.10.15 : nodelay
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time   Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.  10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384      1    30.00       1.62   99.98    10.51    5062.204  2128.695
 87380  16384      1    30.00       1.62   99.99    10.51    5064.528  2128.940
 87380  16384      1    30.00       1.62   99.98    10.67    5053.365  2156.333

As you can see, not much of a throughput difference (I'd not expect
that, given the test...), but service demand on the receiver is down
significantly.  This is also confirmed with the following three
profiles (collected with q-syscollect):

ORIGINAL:

% time      self     cumul     calls self/call  tot/call name
 53.73     32.05     32.05      471k     68.1u     68.1u default_idle
  4.59      2.74     34.79     12.0M      228n      259n eth_type_trans

OBVIOUS:

% time      self     cumul     calls self/call  tot/call name
 55.72     33.25     33.25      469k     70.8u     70.8u default_idle
  4.49      2.68     35.93     12.0M      222n      278n tcp_v4_rcv
  2.84      1.70     37.63      473k     3.59u     32.6u e1000_clean
  2.81      1.68     39.30     12.2M      137n      525n tcp_rcv_established
  2.71      1.62     40.92     12.1M      134n      711n netif_receive_skb
  2.39      1.43     42.34     12.0M      119n      148n eth_type_trans

AGGRESSIVE:

% time      self     cumul     calls self/call  tot/call name
 57.51     34.34     34.34      395k     86.9u     86.9u default_idle
  4.40      2.62     36.96     12.3M      214n      265n tcp_v4_rcv
  3.12      1.86     38.82      455k     4.09u     31.3u e1000_clean
  3.09      1.84     40.66     12.0M      154n      584n tcp_rcv_established
  2.89      1.72     42.39     12.0M      144n      723n netif_receive_skb
  1.94      1.16     43.55      918k     1.26u     1.26u _spin_unlock_irq
  1.90      1.13     44.68     12.3M     92.4n      115n ip_route_input
  1.87      1.11     45.79     12.6M     88.4n     89.6n kfree
  1.87      1.11     46.91     12.1M     91.8n      572n ip_rcv
  1.68      1.00     47.91     12.1M     82.4n      351n ip_local_deliver
  1.21      0.72     48.63     12.6M     57.7n     58.9n __kmalloc
  1.01      0.60     49.23     12.3M     48.8n     53.7n sba_unmap_single
  1.00      0.59     49.83     12.0M     49.4n     81.0n eth_type_trans

Comparing ORIGINAL and AGGRESSIVE, we see that the latter spends an
additional 2.29 seconds in the idle-loop (default_idle), which
corresponds closely to the 2.19 seconds savings we're seeing in
eth_type_trans(), so the saving the prefetch achieves is real and not
offset by extra costs in other places.

The above also shows that the OBVIOUS prefetch is unable to cover the
entire load-latency.  Thus, I suspect it would really be best to use
the AGGRESSIVE prefetching policy.  If we were to do this, then the
code at label next_desc could be simplified, since we already
precomputed the next value of i/rx_desc as part of the prefetch.

It would be interesting to know how (modern) x86 CPUs behave.  If
somebody wants to try this, I attached a patch below (setting
AGGRESSIVE to 1 gives you the AGGRESSIVE version, seting it to 0 gives
you the OBVIOUS version).

Cheers,

	--david

===== drivers/net/e1000/e1000_main.c 1.134 vs edited =====
--- 1.134/drivers/net/e1000/e1000_main.c	2004-09-12 16:52:48 -07:00
+++ edited/drivers/net/e1000/e1000_main.c	2004-09-30 06:05:11 -07:00
@@ -2278,12 +2278,30 @@
 	uint8_t last_byte;
 	unsigned int i;
 	boolean_t cleaned = FALSE;
+#define AGGRESSIVE 1

 	i = rx_ring->next_to_clean;
+#if AGGRESSIVE
+	prefetch(rx_ring->buffer_info[i].skb->data);
+#endif
 	rx_desc = E1000_RX_DESC(*rx_ring, i);

 	while(rx_desc->status & E1000_RXD_STAT_DD) {
 		buffer_info = &rx_ring->buffer_info[i];
+# if AGGRESSIVE
+		{
+			struct e1000_rx_desc *next_rx;
+			unsigned int j = i + 1;
+
+			if (j == rx_ring->count)
+				j = 0;
+			next_rx = E1000_RX_DESC(*rx_ring, j);
+			if (next_rx->status & E1000_RXD_STAT_DD)
+				prefetch(rx_ring->buffer_info[j].skb->data);
+		}
+# else
+		prefetch(buffer_info->skb->data);
+# endif
 #ifdef CONFIG_E1000_NAPI
 		if(*work_done >= work_to_do)
 			break;

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2005-06-23 17:36 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-06-06 20:29 RFC: NAPI packet weighting patch Ronciak, John
2005-06-06 23:55 ` Mitch Williams
2005-06-07  0:08   ` Ben Greear
2005-06-08  1:50     ` Jesse Brandeburg
2005-06-07  4:53   ` Stephen Hemminger
2005-06-07 12:38     ` jamal
2005-06-07 12:06       ` Martin Josefsson
2005-06-07 13:29         ` jamal
2005-06-07 12:36           ` Martin Josefsson
2005-06-07 16:34             ` Robert Olsson
2005-06-07 23:19               ` Rick Jones
2005-06-21 20:37         ` David S. Miller
2005-06-22  7:27           ` Eric Dumazet
2005-06-22  8:42           ` P
2005-06-22 19:37             ` jamal
2005-06-23  8:56               ` P
2005-06-21 20:20     ` David S. Miller
2005-06-21 20:38       ` Rick Jones
2005-06-21 20:55         ` David S. Miller
2005-06-21 21:47         ` Andi Kleen
2005-06-21 22:22           ` Donald Becker
2005-06-21 22:34             ` Andi Kleen
2005-06-22  0:08               ` Donald Becker
2005-06-22  4:44                 ` Chris Friesen
2005-06-22 11:31                   ` Andi Kleen
2005-06-22 16:23                 ` Leonid Grossman
2005-06-22 16:37                   ` jamal
2005-06-22 18:00                     ` Leonid Grossman
2005-06-22 18:06                       ` Andi Kleen
2005-06-22 20:22                         ` David S. Miller
2005-06-22 20:35                           ` Rick Jones
2005-06-22 20:43                             ` David S. Miller
2005-06-22 21:10                           ` Andi Kleen
2005-06-22 21:16                             ` David S. Miller
2005-06-22 21:53                             ` Chris Friesen
2005-06-22 22:11                               ` Andi Kleen
2005-06-22 21:38                           ` Eric Dumazet
2005-06-22 22:13                             ` Eric Dumazet
2005-06-22 22:30                               ` David S. Miller
2005-06-22 22:23                             ` David S. Miller
2005-06-23 12:14                               ` jamal
2005-06-23 17:36                                 ` David Mosberger
2005-06-22 22:42                           ` Leonid Grossman
2005-06-22 23:13                             ` Andi Kleen
2005-06-22 23:19                               ` David S. Miller
2005-06-22 23:23                                 ` Andi Kleen
2005-06-23  9:32                               ` [PATCH] x86_64 prefetchw() function can take into account CONFIG_MK8 / CONFIG_MPSC Eric Dumazet
2005-06-23  9:37                                 ` Eric Dumazet
2005-06-23 11:31                                 ` Andi Kleen
2005-06-23 12:53                                   ` Eric Dumazet
2005-06-23 13:10                                     ` Andi Kleen
2005-06-22 17:05                   ` RFC: NAPI packet weighting patch Andi Kleen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.