* RE: e1000 performance hack for ppc64 (Power4)
[not found] <OF0078342A.E131D4B1-ON85256D44.0051F7C0@pok.ibm.com>
@ 2003-06-13 16:21 ` Dave Hansen
2003-06-13 22:38 ` Anton Blanchard
0 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2003-06-13 16:21 UTC (permalink / raw)
To: Herman Dierks
Cc: Feldman, Scott, David Gibson, Linux Kernel Mailing List,
Anton Blanchard, Nancy J Milliner, Ricardo C Gonzalez,
Brian Twichell, netdev
Too long to quote:
http://marc.theaimsgroup.com/?t=105538879600001&r=1&w=2
Wouldn't you get most of the benefit from copying that stuff around in
the driver if you allocated the skb->data aligned in the first place?
There's already code to align them on CPU cache boundaries:
#define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
~(SMP_CACHE_BYTES - 1))
So, do something like this:
#ifdef ARCH_ALIGN_SKB_BYTES
#define SKB_ALIGN_BYTES ARCH_ALIGN_SKB_BYTES
#else
#define SKB_ALIGN_BYTES SMP_CACHE_BYTES
#endif
#define SKB_DATA_ALIGN(X) (((X) + (SKB_ALIGN_BYTES - 1)) & \
~(SKB_ALIGN_BYTES - 1))
You could easily make this adaptive so as not to align on the arch size
when the request is bigger than that, just like in the e1000 patch you posted.
--
Dave Hansen
haveblue@us.ibm.com
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-13 17:03 Herman Dierks
0 siblings, 0 replies; 27+ messages in thread
From: Herman Dierks @ 2003-06-13 17:03 UTC (permalink / raw)
To: haveblue
Cc: Feldman, Scott, David Gibson, Linux Kernel Mailing List,
Anton Blanchard, Nancy J Milliner, Ricardo C Gonzalez,
Brian Twichell, netdev
I will let Anton respond to this. I think he may have tried this some
time back in his early prototypes to fix this.
I think the problem was not where the buffer started but where the packet
ended up within the buffer.
Due to varying sizes of TCP and IP headers the packet ended up at some
non-cache aligned address.
What we need for the DMA to work well is to have the final packet (with
datalink headers) starting on a cache line, as it's the final packet that
must be DMA'd. In fact it may need to be aligned to a higher level than
that (not sure).
haveblue@us.ltcfwd.linux.ibm.com on 06/13/2003 11:21:03 AM
To: Herman Dierks/Austin/IBM@IBMUS
cc: "Feldman, Scott" <scott.feldman@intel.com>, David Gibson
<dwg@au1.ibm.com>, Linux Kernel Mailing List
<linux-kernel@vger.kernel.org>, Anton Blanchard <anton@samba.org>,
Nancy J Milliner/Austin/IBM@IBMUS, Ricardo C
Gonzalez/Austin/IBM@ibmus, Brian Twichell/Austin/IBM@IBMUS,
netdev@oss.sgi.com
Subject: RE: e1000 performance hack for ppc64 (Power4)
Too long to quote:
http://marc.theaimsgroup.com/?t=105538879600001&r=1&w=2
Wouldn't you get most of the benefit from copying that stuff around in
the driver if you allocated the skb->data aligned in the first place?
There's already code to align them on CPU cache boundaries:
#define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
~(SMP_CACHE_BYTES - 1))
So, do something like this:
#ifdef ARCH_ALIGN_SKB_BYTES
#define SKB_ALIGN_BYTES ARCH_ALIGN_SKB_BYTES
#else
#define SKB_ALIGN_BYTES SMP_CACHE_BYTES
#endif
#define SKB_DATA_ALIGN(X) (((X) + (SKB_ALIGN_BYTES - 1)) & \
~(SKB_ALIGN_BYTES - 1))
You could easily make this adaptive so as not to align on the arch size
when the request is bigger than that, just like in the e1000 patch you posted.
--
Dave Hansen
haveblue@us.ibm.com
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-13 16:21 ` Dave Hansen
@ 2003-06-13 22:38 ` Anton Blanchard
2003-06-13 22:46 ` David S. Miller
0 siblings, 1 reply; 27+ messages in thread
From: Anton Blanchard @ 2003-06-13 22:38 UTC (permalink / raw)
To: Dave Hansen
Cc: Herman Dierks, Feldman, Scott, David Gibson,
Linux Kernel Mailing List, Nancy J Milliner, Ricardo C Gonzalez,
Brian Twichell, netdev
> Wouldn't you get most of the benefit from copying that stuff around in
> the driver if you allocated the skb->data aligned in the first place?
Nice try, but my understanding is that on the transmit path we reserve
the maximum-sized TCP header, copy the data in, then form our TCP header
backwards from that point. Since the TCP header size changes with
various options, it's not an easy task.
One thing I thought of doing was to cache the current TCP header size
and align the next packet based on it, with an extra cacheline at the
start for it to spill into if the TCP header grew.
This is only worth it if most packets will have the same sized header.
Networking guys: is this a valid assumption?
Anton
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-13 22:38 ` Anton Blanchard
@ 2003-06-13 22:46 ` David S. Miller
2003-06-13 23:18 ` Anton Blanchard
2003-06-14 5:16 ` Nivedita Singhvi
0 siblings, 2 replies; 27+ messages in thread
From: David S. Miller @ 2003-06-13 22:46 UTC (permalink / raw)
To: anton
Cc: haveblue, hdierks, scott.feldman, dwg, linux-kernel, milliner,
ricardoz, twichell, netdev
From: Anton Blanchard <anton@samba.org>
Date: Sat, 14 Jun 2003 08:38:41 +1000
This is only worth it if most packets will have the same sized header.
Networking guys: is this a valid assumption?
Not really... one retransmit and the TCP header size grows
due to the SACK options.
I find it truly bletcherous what you're trying to do here.
Why not instead find out if it's possible to have the e1000
fetch the entire cache line where the first byte of the packet
resides? Even ancient designs like SunHME do that.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-13 22:46 ` David S. Miller
@ 2003-06-13 23:18 ` Anton Blanchard
2003-06-14 1:52 ` Lincoln Dale
2003-06-14 5:16 ` Nivedita Singhvi
1 sibling, 1 reply; 27+ messages in thread
From: Anton Blanchard @ 2003-06-13 23:18 UTC (permalink / raw)
To: David S. Miller
Cc: haveblue, hdierks, scott.feldman, dwg, linux-kernel, milliner,
ricardoz, twichell, netdev
> Not really... one retransmit and the TCP header size grows
> due to the SACK options.
OK scratch that idea.
> I find it truly bletcherous what you're trying to do here.
I think so too, but it's hard to ignore ~100Mbit/sec in performance.
> Why not instead find out if it's possible to have the e1000
> fetch the entire cache line where the first byte of the packet
> resides? Even ancient designes like SunHME do that.
Rusty and I were wondering why the e1000 didn't do that exact thing.
Scott: is it possible to enable such a thing?
Anton
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-13 23:52 Feldman, Scott
2003-06-13 23:52 ` David S. Miller
2003-06-14 0:03 ` Anton Blanchard
0 siblings, 2 replies; 27+ messages in thread
From: Feldman, Scott @ 2003-06-13 23:52 UTC (permalink / raw)
To: Anton Blanchard, David S. Miller
Cc: haveblue, hdierks, dwg, linux-kernel, milliner, ricardoz,
twichell, netdev
> > Why not instead find out if it's possible to have the e1000
> > fetch the entire cache line where the first byte of the
> > packet resides? Even ancient designs like SunHME do that.
>
> Rusty and I were wondering why the e1000 didnt do that exact thing.
>
> Scott: is it possible to enable such a thing?
I thought the answer was no, so I double checked with a couple of
hardware guys, and the answer is still no.
-scott
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-13 23:52 Feldman, Scott
@ 2003-06-13 23:52 ` David S. Miller
2003-06-14 0:55 ` Anton Blanchard
2003-06-14 0:03 ` Anton Blanchard
1 sibling, 1 reply; 27+ messages in thread
From: David S. Miller @ 2003-06-13 23:52 UTC (permalink / raw)
To: scott.feldman
Cc: anton, haveblue, hdierks, dwg, linux-kernel, milliner, ricardoz,
twichell, netdev
From: "Feldman, Scott" <scott.feldman@intel.com>
Date: Fri, 13 Jun 2003 16:52:18 -0700
> > Why not instead find out if it's possible to have the e1000
> > fetch the entire cache line where the first byte of the
> > packet resides? Even ancient designs like SunHME do that.
>
> Rusty and I were wondering why the e1000 didnt do that exact thing.
>
> Scott: is it possible to enable such a thing?
I thought the answer was no, so I double checked with a couple of
hardware guys, and the answer is still no.
Sigh...
So Anton, when the PCI controller gets a set of sub-cacheline word
reads from the device, it reads the value from memory once for every
one of those words? ROFL, if so... I can't believe they wouldn't
put caches on the PCI controller for this, at least a one-behind that
snoops the bus :(
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-13 23:52 Feldman, Scott
2003-06-13 23:52 ` David S. Miller
@ 2003-06-14 0:03 ` Anton Blanchard
1 sibling, 0 replies; 27+ messages in thread
From: Anton Blanchard @ 2003-06-14 0:03 UTC (permalink / raw)
To: Feldman, Scott
Cc: David S. Miller, haveblue, hdierks, dwg, linux-kernel, milliner,
ricardoz, twichell, netdev
> I thought the answer was no, so I double checked with a couple of
> hardware guys, and the answer is still no.
Hi Scott,
That's a pity; the e100 docs on sourceforge show it can do what we want.
It would be nice if e1000 had this feature too :)
4.2.2 Read Align
The Read Align feature is aimed to enhance performance in cache line
oriented systems. Starting a PCI transaction in these systems on a
non-cache line aligned address may result in low performance. To solve
this performance problem, the controller can be configured to terminate
Transmit DMA cycles on a cache line boundary, and start the next
transaction on a cache line aligned address. This feature is enabled
when the Read Align Enable bit is set in device Configure command
(Section 6.4.2.3, "Configure (010b)").
If this bit is set, the device operates as follows:
* When the device is close to running out of resources on the Transmit
  DMA (in other words, the Transmit FIFO is almost full), it attempts to
  terminate the read transaction on the nearest cache line boundary when
  possible.
* When the arbitration counters feature is enabled (maximum Transmit DMA
  byte count value is set in configuration space), the device switches
  to other pending DMAs on cache line boundary only.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-13 23:52 ` David S. Miller
@ 2003-06-14 0:55 ` Anton Blanchard
2003-06-14 1:34 ` David S. Miller
0 siblings, 1 reply; 27+ messages in thread
From: Anton Blanchard @ 2003-06-14 0:55 UTC (permalink / raw)
To: David S. Miller
Cc: scott.feldman, haveblue, hdierks, dwg, linux-kernel, milliner,
ricardoz, twichell, netdev
> So Anton, when the PCI controller gets a set of sub-cacheline word
> reads from the device, it reads the value from memory once for every
> one of those words? ROFL, if so... I can't believe they wouldn't
> put caches on the PCI controller for this, at least a one-behind that
> snoops the bus :(
There is a cache in the host bridge and the PCI-PCI bridge. I don't
think we go back to memory for sub-cacheline reads.
What I think is happening is that we aren't tripping the prefetch logic.
We should take a latency hit for only the first cacheline, at which point
the host bridge decides to start prefetching for us. If not, then we
take the latency hit on each transaction.
Anton
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-14 0:55 ` Anton Blanchard
@ 2003-06-14 1:34 ` David S. Miller
0 siblings, 0 replies; 27+ messages in thread
From: David S. Miller @ 2003-06-14 1:34 UTC (permalink / raw)
To: anton
Cc: scott.feldman, haveblue, hdierks, dwg, linux-kernel, milliner,
ricardoz, twichell, netdev
From: Anton Blanchard <anton@samba.org>
Date: Sat, 14 Jun 2003 10:55:34 +1000
What I think is happening is that we aren't tripping the prefetch
logic. We should take a latency hit for only the first cacheline,
at which point the host bridge decides to start prefetching for
us. If not, then we take the latency hit on each transaction.
It sounds like what happens is that the sub-cacheline word reads
don't trigger the prefetch, but the first PCI read multiple
transaction does.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-13 23:18 ` Anton Blanchard
@ 2003-06-14 1:52 ` Lincoln Dale
2003-06-14 5:41 ` David S. Miller
0 siblings, 1 reply; 27+ messages in thread
From: Lincoln Dale @ 2003-06-14 1:52 UTC (permalink / raw)
To: Anton Blanchard
Cc: David S. Miller, haveblue, hdierks, scott.feldman, dwg,
linux-kernel, milliner, ricardoz, twichell, netdev
At 09:18 AM 14/06/2003 +1000, Anton Blanchard wrote:
> > Not really... one retransmit and the TCP header size grows
> > due to the SACK options.
>
>OK scratch that idea.
why not have a performance option that trades off optimum payload size
against efficiency?
unless i misunderstand the problem, you can certainly pad the TCP options
with NOPs ...
> > I find it truly bletcherous what you're trying to do here.
>
>I think so too, but its hard to ignore ~100Mbit/sec in performance.
another option for the write() path is for instant-send TCP sockets
to delay the copy_from_user() until the IP+TCP header size is known.
i wouldn't expect the net folks to like that, however ..
cheers,
lincoln.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-13 22:46 ` David S. Miller
2003-06-13 23:18 ` Anton Blanchard
@ 2003-06-14 5:16 ` Nivedita Singhvi
2003-06-14 5:36 ` David S. Miller
1 sibling, 1 reply; 27+ messages in thread
From: Nivedita Singhvi @ 2003-06-14 5:16 UTC (permalink / raw)
To: David S. Miller
Cc: anton, haveblue, hdierks, scott.feldman, dwg, linux-kernel,
milliner, ricardoz, twichell, netdev
David S. Miller wrote:
> From: Anton Blanchard <anton@samba.org>
> Date: Sat, 14 Jun 2003 08:38:41 +1000
>
> This is only worth it if most packets will have the same sized header.
> Networking guys: is this a valid assumption?
>
> Not really... one retransmit and the TCP header size grows
> due to the SACK options.
Yep, but it really doesn't have too many options (sic pun ;))..
i.e. The max the options can add is 40 bytes, speaking
strictly TCP, not IP. This really should fit into one extra
cacheline for most architectures, at most, right?
[The TCP options have to end and the data start on a 32
bit boundary. For established connections, we're
principally talking SACK options and v. likely timestamp.
(Ignoring those egregious benchmark guys who turn everything
useful off ;)). SYNs won't have data in any case.
So it's going to grow by (SACK = 8*n + 2) + (TS = 10) bytes,
with n = number of SACK blocks, with a max of n = 3
if timestamps are enabled. Adding that to the standard
length of 20 bytes, the total len of a TCP header is thus
very likely one of:
20 + [0 | 20 | 32 | 36] bytes = 20 | 40 | 52 | 56 bytes.
If cachelines were 64 bytes, we wouldn't be wasting a
whole lot of space if we aligned data start or some
other scheme as was suggested. Even given the larger
cachelines, it might be worth it, or is this totally
not an option (cough, sic ;))?]
> I find it truly bletcherous what you're trying to do here.
yep
thanks,
Nivedita
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-14 5:16 ` Nivedita Singhvi
@ 2003-06-14 5:36 ` David S. Miller
0 siblings, 0 replies; 27+ messages in thread
From: David S. Miller @ 2003-06-14 5:36 UTC (permalink / raw)
To: niv
Cc: anton, haveblue, hdierks, scott.feldman, dwg, linux-kernel,
milliner, ricardoz, twichell, netdev
From: Nivedita Singhvi <niv@us.ibm.com>
Date: Fri, 13 Jun 2003 22:16:22 -0700
Yep, but it really doesn't have too many options (sic pun ;))..
i.e. The max the options can add are 40 bytes, speaking
strictly TCP, not IP. This really should fit into one extra
cacheline for most architectures, at most, right?
It's what the bottom of the header is aligned to, but
we build the packet top to bottom not the other way around.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-14 1:52 ` Lincoln Dale
@ 2003-06-14 5:41 ` David S. Miller
2003-06-14 5:52 ` Lincoln Dale
0 siblings, 1 reply; 27+ messages in thread
From: David S. Miller @ 2003-06-14 5:41 UTC (permalink / raw)
To: ltd
Cc: anton, haveblue, hdierks, scott.feldman, dwg, linux-kernel,
milliner, ricardoz, twichell, netdev
From: Lincoln Dale <ltd@cisco.com>
Date: Sat, 14 Jun 2003 11:52:53 +1000
unless i misunderstand the problem, you can certainly pad the TCP
options with NOPs ...
You may not mangle a packet if it is not yours alone.
And every TCP packet is shared with the TCP retransmit
queue and therefore would need to be copied before
being mangled.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-14 5:41 ` David S. Miller
@ 2003-06-14 5:52 ` Lincoln Dale
2003-06-14 6:08 ` David S. Miller
0 siblings, 1 reply; 27+ messages in thread
From: Lincoln Dale @ 2003-06-14 5:52 UTC (permalink / raw)
To: David S. Miller
Cc: anton, haveblue, hdierks, scott.feldman, dwg, linux-kernel,
milliner, ricardoz, twichell, netdev
At 10:41 PM 13/06/2003 -0700, David S. Miller wrote:
> From: Lincoln Dale <ltd@cisco.com>
> Date: Sat, 14 Jun 2003 11:52:53 +1000
>
> unless i misunderstand the problem, you can certainly pad the TCP
> options with NOPs ...
>
>You may not mangle a packet if it is not yours alone.
>
>And every TCP packet is shared with TCP retransmit
>queue and therefore would need to be copied before
>being mangled.
ok, so let's take this a step further.
can we have the TCP retransmit side take a performance hit if it needs to
realign buffers?
once again, for a "high performance app" requiring gigabit-type speeds, it's
probably fair to say that this is mostly in the realm of applications on a
LAN rather than across a WAN or the internet.
on a switched LAN, i'd expect TCP retransmissions to be far fewer ...
cheers,
lincoln.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-14 5:52 ` Lincoln Dale
@ 2003-06-14 6:08 ` David S. Miller
2003-06-14 6:14 ` David S. Miller
0 siblings, 1 reply; 27+ messages in thread
From: David S. Miller @ 2003-06-14 6:08 UTC (permalink / raw)
To: ltd
Cc: anton, haveblue, hdierks, scott.feldman, dwg, linux-kernel,
milliner, ricardoz, twichell, netdev
From: Lincoln Dale <ltd@cisco.com>
Date: Sat, 14 Jun 2003 15:52:35 +1000
can we have the TCP retransmit side take a performance hit if it needs to
realign buffers?
You don't understand, the person who mangles the packet
must make the copy, not the person not doing the packet
modifications.
for a "high performance app" requiring gigabit-type speeds,
...we probably won't be using ppc64 and e1000 cards, yes, I agree
:-)
Anton, go to the local computer store and pick up some tg3
cards or a bunch of Taiwan specials :-)
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-14 6:08 ` David S. Miller
@ 2003-06-14 6:14 ` David S. Miller
2003-06-14 6:27 ` William Lee Irwin III
2003-06-14 17:08 ` Greg KH
0 siblings, 2 replies; 27+ messages in thread
From: David S. Miller @ 2003-06-14 6:14 UTC (permalink / raw)
To: ltd
Cc: anton, haveblue, hdierks, scott.feldman, dwg, linux-kernel,
milliner, ricardoz, twichell, netdev
Folks, can we remove whatever member of this CC: list creates
bounces that say:
Your message to Linux_news awaits moderator approval
Ok? I can't guess which one it is because these all look
like normal people's email addresses (except possibly the
haveblue@us.ibm.com thing, maybe that's the one)
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-14 6:14 ` David S. Miller
@ 2003-06-14 6:27 ` William Lee Irwin III
2003-06-14 17:08 ` Greg KH
1 sibling, 0 replies; 27+ messages in thread
From: William Lee Irwin III @ 2003-06-14 6:27 UTC (permalink / raw)
To: David S. Miller
Cc: ltd, anton, haveblue, hdierks, scott.feldman, dwg, linux-kernel,
milliner, ricardoz, twichell, netdev
On Fri, Jun 13, 2003 at 11:14:18PM -0700, David S. Miller wrote:
> Folks, can we remove whatever member of this CC: list creates
> bounces that say:
> Your message to Linux_news awaits moderator approval
> Ok? I can't guess which one it is because these all look
> like normal people's email addresses (except possibly the
> haveblue@us.ibm.com thing, maybe that's the one)
That's legitimate, but I'm still trying to convince the BKL brigade
that he should have chosen something more eponymous e.g. dhansen. =)
-- wli
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-14 6:14 ` David S. Miller
2003-06-14 6:27 ` William Lee Irwin III
@ 2003-06-14 17:08 ` Greg KH
2003-06-15 3:01 ` David S. Miller
1 sibling, 1 reply; 27+ messages in thread
From: Greg KH @ 2003-06-14 17:08 UTC (permalink / raw)
To: David S. Miller
Cc: ltd, anton, haveblue, hdierks, scott.feldman, dwg, linux-kernel,
milliner, ricardoz, twichell, netdev
On Fri, Jun 13, 2003 at 11:14:18PM -0700, David S. Miller wrote:
>
> Folks, can we remove whatever member of this CC: list creates
> bounces that say:
>
> Your message to Linux_news awaits moderator approval
It's someone subscribed to linux-kernel@vger.kernel.org that causes
this.
In the past I've complained to the admin of that mail-news gateway that
is barfing on too many CC: members, but it doesn't seem like they are
listening...
greg k-h
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-14 17:08 ` Greg KH
@ 2003-06-15 3:01 ` David S. Miller
0 siblings, 0 replies; 27+ messages in thread
From: David S. Miller @ 2003-06-15 3:01 UTC (permalink / raw)
To: greg
Cc: ltd, anton, haveblue, hdierks, scott.feldman, dwg, linux-kernel,
milliner, ricardoz, twichell, netdev
From: Greg KH <greg@kroah.com>
Date: Sat, 14 Jun 2003 10:08:49 -0700
It's someone subscribed to linux-kernel@vger.kernel.org that causes
this.
Thanks, I've nuked linux_news@nextphere.com; if they resubscribe
without contacting me, I'll block future subscription attempts from
them.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
@ 2003-06-15 14:32 Herman Dierks
0 siblings, 0 replies; 27+ messages in thread
From: Herman Dierks @ 2003-06-15 14:32 UTC (permalink / raw)
To: Anton Blanchard
Cc: Feldman, Scott, David S. Miller, haveblue, dwg, linux-kernel,
Nancy J Milliner, Ricardo C Gonzalez, Brian Twichell, netdev
Anton, I think the option described below is intended to cause the adapter
to "get off on a cache line boundary" so when it restarts the DMA it will
be aligned. This is for cases when the adapter has to get off, for example
due to the FIFO being full, etc.
Some adapters would get off on any boundary and that then causes perf
issues when the DMA is restarted.
This is a good option, but I don't think it addresses what we need here,
as the host needs to ensure a DMA starts on a cache line.
Different adapter anyway, but I am just pointing out that even if the e1000
had this feature it would not be the solution.
Anton Blanchard <anton@samba.org> on 06/13/2003 07:03:42 PM
To: "Feldman, Scott" <scott.feldman@intel.com>
cc: "David S. Miller" <davem@redhat.com>,
haveblue@us.ltcfwd.linux.ibm.com, Herman Dierks/Austin/IBM@IBMUS,
dwg@au1.ibm.com, linux-kernel@vger.kernel.org, Nancy J
Milliner/Austin/IBM@IBMUS, Ricardo C Gonzalez/Austin/IBM@ibmus,
Brian Twichell/Austin/IBM@IBMUS, netdev@oss.sgi.com
Subject: Re: e1000 performance hack for ppc64 (Power4)
> I thought the answer was no, so I double checked with a couple of
> hardware guys, and the answer is still no.
Hi Scott,
That's a pity; the e100 docs on sourceforge show it can do what we want.
It would be nice if e1000 had this feature too :)
4.2.2 Read Align
The Read Align feature is aimed to enhance performance in cache line
oriented systems. Starting a PCI transaction in these systems on a
non-cache line aligned address may result in low performance. To solve
this performance problem, the controller can be configured to terminate
Transmit DMA cycles on a cache line boundary, and start the next
transaction on a cache line aligned address. This feature is enabled
when the Read Align Enable bit is set in device Configure command
(Section 6.4.2.3, "Configure (010b)").
If this bit is set, the device operates as follows:
* When the device is close to running out of resources on the Transmit
  DMA (in other words, the Transmit FIFO is almost full), it attempts to
  terminate the read transaction on the nearest cache line boundary when
  possible.
* When the arbitration counters feature is enabled (maximum Transmit DMA
  byte count value is set in configuration space), the device switches
  to other pending DMAs on cache line boundary only.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
@ 2003-06-15 14:40 Herman Dierks
2003-06-15 14:44 ` David S. Miller
2003-06-16 16:17 ` Nivedita Singhvi
0 siblings, 2 replies; 27+ messages in thread
From: Herman Dierks @ 2003-06-15 14:40 UTC (permalink / raw)
To: David S. Miller
Cc: ltd, anton, haveblue, scott.feldman, dwg, linux-kernel,
Nancy J Milliner, Ricardo C Gonzalez, Brian Twichell, netdev
Look folks, we run 40 to 48 GigE adapters in a p690 32-way on AIX and
they basically all run at full speed, so let me see you try that on most of
these other boxes you are talking about. Same adapter, same hardware
logic.
I have also seen what many of these other boxes you talk about do when data
or structures are not aligned on 64-bit boundaries.
The PPC HW does not have those 64-bit alignment issues. So, each machine
has some warts. I have yet to see a perfect one.
If you want a lot of PCI adapters on a box, it takes a number of bridge
chips and other IO links to do that.
Memory controllers like to deal with cache lines.
For larger packets, like jumbo frames or large send (TSO), the few added
DMAs are not an issue: the packets are so large that the DMAs soon get
aligned. With TSO being the default, the small packet case
becomes less important anyway. It's more an issue on 2.4, where TSO is not
provided. We also want this to run well if someone does not want to use
TSO.
It's only the MTU 1500 case with non-TSO that we are discussing here, so
copying a few bytes is really not a big deal, as the data is already in
cache from copying into the kernel. If it lets the adapter run at speed,
that's what customers want and what we need.
Granted, if the HW could deal with this we would not have to, but that's not
the case today, so I want to spend a few CPU cycles to get best performance.
Again, if this is not done on other platforms, I don't understand why you
care.
If we have to do this for the PPC port, fine. I have not seen any of you
suggest a better solution that works and will not be a worse hack to TCP or
other code. Anton tried various other ideas before we fell back to doing
it the same way we did this in AIX. This code is very localized and is
only used by platforms that need it. Thus I don't see the big issue here.
Herman
"David S. Miller" <davem@redhat.com> on 06/14/2003 01:08:50 AM
To: ltd@cisco.com
cc: anton@samba.org, haveblue@us.ltcfwd.linux.ibm.com, Herman
Dierks/Austin/IBM@IBMUS, scott.feldman@intel.com, dwg@au1.ibm.com,
linux-kernel@vger.kernel.org, Nancy J Milliner/Austin/IBM@IBMUS,
Ricardo C Gonzalez/Austin/IBM@ibmus, Brian
Twichell/Austin/IBM@IBMUS, netdev@oss.sgi.com
Subject: Re: e1000 performance hack for ppc64 (Power4)
From: Lincoln Dale <ltd@cisco.com>
Date: Sat, 14 Jun 2003 15:52:35 +1000
can we have the TCP retransmit side take a performance hit if it needs
to
realign buffers?
You don't understand, the person who mangles the packet
must make the copy, not the person not doing the packet
modifications.
for a "high performance app" requiring gigabit-type speeds,
...we probably won't be using ppc64 and e1000 cards, yes, I agree
:-)
Anton, go to the local computer store and pick up some tg3
cards or a bunch of Taiwan specials :-)
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-15 14:40 e1000 performance hack for ppc64 (Power4) Herman Dierks
@ 2003-06-15 14:44 ` David S. Miller
2003-06-16 16:17 ` Nivedita Singhvi
1 sibling, 0 replies; 27+ messages in thread
From: David S. Miller @ 2003-06-15 14:44 UTC (permalink / raw)
To: hdierks
Cc: ltd, anton, haveblue, scott.feldman, dwg, linux-kernel, milliner,
ricardoz, twichell, netdev
From: "Herman Dierks" <hdierks@us.ibm.com>
Date: Sun, 15 Jun 2003 09:40:41 -0500
With TSO being the default, the small packet case
becomes less important anyway.
This is a very narrow and unrealistic view of the situation.
Every third packet your system will process for any connection will be
an ACK, a small packet. Most web and database
transactions happen using small packets for the transaction request.
Look, if you're gonna sit here and just rant justifying this bogus
behavior of your hardware, it is likely to go in one ear and out the
other. Nobody wants to hear excuses. :)
The fact is, this system handles sub-cacheline reads inefficiently
even if a sequence of transactions is consecutive, to the same
cache line, and no coherency transactions occur to that cache line.
That is dumb, and there is no arguing around this. You would be
sensible to realize this, and accept it whilst others try to help you
find a solution for your problem.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: e1000 performance hack for ppc64 (Power4)
2003-06-15 14:40 e1000 performance hack for ppc64 (Power4) Herman Dierks
2003-06-15 14:44 ` David S. Miller
@ 2003-06-16 16:17 ` Nivedita Singhvi
1 sibling, 0 replies; 27+ messages in thread
From: Nivedita Singhvi @ 2003-06-16 16:17 UTC (permalink / raw)
To: Herman Dierks
Cc: David S. Miller, ltd, anton, haveblue, scott.feldman, dwg,
linux-kernel, Nancy J Milliner, Ricardo C Gonzalez,
Brian Twichell, netdev
Herman Dierks wrote:
> Look folks, we run 40 to 48 GigE adapters in a p690 32 way on AIX and
> they basically all run at full speed so let me se you try that on most of
> these other boxes you are talking about. Same adapter, same hardware
> logic.
FWIW, I think that's pretty cool. Not easy to do. :)
> For larger packets, like jumbo frames or large send (TSO), the few added
> DMA's is not an issue as the packets are so large the DMA soon get aligned
> and are not an issue. With TSO being the default, the small packet case
> becomes less important anyway. Its more an issue on 2.4 where TSO is not
> provided. We also want this to run well if someone does not want to use
> TSO.
Slightly off-topic, but TSO being enabled and TSO being used are
two different things, right? Ditto jumbo frames.. How often is this
the actual environment in real-world situations? I'm concerned that
this is more typical of development/performance testing
environments.
> Its only the MTU 1500 case with non-TSO that we are discussing here so
Which still is the pretty important case, I think..
> copying a few bytes is really not a big deal as the data is already in
> cache from copying into kernel. If it lets the adapter run at speed, thats
> what customers want and what we need.
Yep.
> Granted, if the HW could deal with this we would not have to, but thats not
> the case today so I want to spend a few CPU cycles to get best performance.
> Again, if this is not done on other platforms, I don't understand why you
> care.
Still, it would be nice to put in the best solution possible, i.e.
address it for the broadest set of affected pieces while minimizing
the impact..
> If we have to do this for PPC port, fine. I have not seen any of you
Hope it doesn't have to come to that. It would be nice to
see it in the mainline kernel, regardless of platform, distro, etc.
These users are still people who are taking the time and effort to
adopt Linux, sometimes in environments and situations that are
pretty critical. Change and innovation are difficult activities
to engage in in some places, and anything we can do to make this
a no-brainer solution for them, and make their decisions shine,
that's gotta be worth going the extra mile for :)
thanks,
Nivedita
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-16 18:21 Feldman, Scott
2003-06-16 18:30 ` Dave Hansen
0 siblings, 1 reply; 27+ messages in thread
From: Feldman, Scott @ 2003-06-16 18:21 UTC (permalink / raw)
To: Herman Dierks, David S. Miller
Cc: ltd, anton, haveblue, dwg, linux-kernel, Nancy J Milliner,
Ricardo C Gonzalez, Brian Twichell, netdev
Herman wrote:
> It's only the MTU 1500 case with non-TSO that we are
> discussing here, so copying a few bytes is really not a big
> deal, as the data is already in cache from copying into the
> kernel. If it lets the adapter run at speed, that's what
> customers want and what we need. Granted, if the HW could
> deal with this we would not have to, but that's not the case
> today, so I want to spend a few CPU cycles to get the best
> performance. Again, if this is not done on other platforms, I
> don't understand why you care.
I care because adding the arch-specific hack creates a maintenance issue
for me.
-scott
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: e1000 performance hack for ppc64 (Power4)
2003-06-16 18:21 Feldman, Scott
@ 2003-06-16 18:30 ` Dave Hansen
0 siblings, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2003-06-16 18:30 UTC (permalink / raw)
To: Feldman, Scott
Cc: Herman Dierks, David S. Miller, ltd, Anton Blanchard, dwg,
Linux Kernel Mailing List, Nancy J Milliner, Ricardo C Gonzalez,
Brian Twichell, netdev
On Mon, 2003-06-16 at 11:21, Feldman, Scott wrote:
> Herman wrote:
> > It's only the MTU 1500 case with non-TSO that we are
> > discussing here, so copying a few bytes is really not a big
> > deal, as the data is already in cache from copying into the
> > kernel. If it lets the adapter run at speed, that's what
> > customers want and what we need. Granted, if the HW could
> > deal with this we would not have to, but that's not the case
> > today, so I want to spend a few CPU cycles to get the best
> > performance. Again, if this is not done on other platforms, I
> > don't understand why you care.
>
> I care because adding the arch-specific hack creates a maintenance issue
> for me.
Scott, would you be pleased if something were implemented outside the
driver, in generic net code? Something that all the drivers could use,
even if nothing but e1000 used it for now.
--
Dave Hansen
haveblue@us.ibm.com
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-16 18:56 Feldman, Scott
0 siblings, 0 replies; 27+ messages in thread
From: Feldman, Scott @ 2003-06-16 18:56 UTC (permalink / raw)
To: Dave Hansen
Cc: Herman Dierks, David S. Miller, ltd, Anton Blanchard, dwg,
Linux Kernel Mailing List, Nancy J Milliner, Ricardo C Gonzalez,
Brian Twichell, netdev
> Scott, would you be pleased if something was implemented out
> of the driver, in generic net code? Something that all the
> drivers could use, even if nothing but e1000 used it for now.
I suppose the driver could unconditionally call something like
skb_realign_for_broken_hw, which would be a no-op on non-broken archs, but
would it make more sense to not have the driver mess with the skb at all?
-scott
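
[Editor's note: a minimal user-space sketch of what such a helper might
do, under stated assumptions. skb_realign_for_broken_hw is only a name
floated above; the function below, its name realign_payload, the 128-byte
Power4 cache-line size, and the flat buffer layout are all illustrative
assumptions, not kernel API. The idea is the one discussed in the thread:
when the packet start inside the buffer is not cache-line aligned, slide
the payload back within its headroom so the DMA source becomes aligned,
at the cost of a small copy of data that is already cache-hot.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 128  /* assumed L1 line size; Power4-ish, illustrative */

/*
 * Hypothetical realignment helper, sketched in user space.
 *
 * buf  - start of the allocated buffer (analogous to skb->head)
 * data - start of the packet payload within buf (analogous to skb->data)
 * len  - payload length
 *
 * If data is already cache-line aligned, or there is not enough
 * headroom to slide it back, the payload is left where it is.
 * Otherwise the payload is moved down so it starts on a cache-line
 * boundary, and the new payload pointer is returned.  On an arch
 * that does not need this, the whole thing would compile to a no-op.
 */
static unsigned char *realign_payload(unsigned char *buf,
                                      unsigned char *data, size_t len)
{
    uintptr_t misalign = (uintptr_t)data & (CACHE_LINE - 1);

    if (misalign == 0)
        return data;                 /* already aligned, nothing to do */
    if ((size_t)(data - buf) < misalign)
        return data;                 /* not enough headroom to slide back */

    /* regions may overlap, so memmove rather than memcpy */
    memmove(data - misalign, data, len);
    return data - misalign;
}
```

The trade-off Scott raises still applies: whether this copy lives in the
driver, in a generic helper like the above, or is avoided entirely by
allocating the skb so the post-header payload lands aligned in the first
place.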
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2003-06-16 18:56 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-06-15 14:40 e1000 performance hack for ppc64 (Power4) Herman Dierks
2003-06-15 14:44 ` David S. Miller
2003-06-16 16:17 ` Nivedita Singhvi
-- strict thread matches above, loose matches on Subject: below --
2003-06-16 18:56 Feldman, Scott
2003-06-16 18:21 Feldman, Scott
2003-06-16 18:30 ` Dave Hansen
2003-06-15 14:32 Herman Dierks
2003-06-13 23:52 Feldman, Scott
2003-06-13 23:52 ` David S. Miller
2003-06-14 0:55 ` Anton Blanchard
2003-06-14 1:34 ` David S. Miller
2003-06-14 0:03 ` Anton Blanchard
2003-06-13 17:03 Herman Dierks
[not found] <OF0078342A.E131D4B1-ON85256D44.0051F7C0@pok.ibm.com>
2003-06-13 16:21 ` Dave Hansen
2003-06-13 22:38 ` Anton Blanchard
2003-06-13 22:46 ` David S. Miller
2003-06-13 23:18 ` Anton Blanchard
2003-06-14 1:52 ` Lincoln Dale
2003-06-14 5:41 ` David S. Miller
2003-06-14 5:52 ` Lincoln Dale
2003-06-14 6:08 ` David S. Miller
2003-06-14 6:14 ` David S. Miller
2003-06-14 6:27 ` William Lee Irwin III
2003-06-14 17:08 ` Greg KH
2003-06-15 3:01 ` David S. Miller
2003-06-14 5:16 ` Nivedita Singhvi
2003-06-14 5:36 ` David S. Miller