netdev.vger.kernel.org archive mirror
* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-13 23:52 Feldman, Scott
  2003-06-13 23:52 ` David S. Miller
  2003-06-14  0:03 ` Anton Blanchard
  0 siblings, 2 replies; 27+ messages in thread
From: Feldman, Scott @ 2003-06-13 23:52 UTC (permalink / raw)
  To: Anton Blanchard, David S. Miller
  Cc: haveblue, hdierks, dwg, linux-kernel, milliner, ricardoz,
	twichell, netdev

> > Why not instead find out if it's possible to have the e1000
> > fetch the entire cache line where the first byte of the
> > packet resides?  Even ancient designs like SunHME do that.
> 
> Rusty and I were wondering why the e1000 didn't do that exact thing.
> 
> Scott: is it possible to enable such a thing?

I thought the answer was no, so I double checked with a couple of
hardware guys, and the answer is still no.

-scott

* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-16 18:56 Feldman, Scott
  0 siblings, 0 replies; 27+ messages in thread
From: Feldman, Scott @ 2003-06-16 18:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Herman Dierks, David S. Miller, ltd, Anton Blanchard, dwg,
	Linux Kernel Mailing List, Nancy J Milliner, Ricardo C Gonzalez,
	Brian Twichell, netdev

> Scott, would you be pleased if something was implemented out 
> of the driver, in generic net code?  Something that all the 
> drivers could use, even if nothing but e1000 used it for now.

I suppose the driver could unconditionally call something like
skb_realign_for_broken_hw, which is a nop on non-broken archs, but would
it make more sense to not have the driver mess with the skb at all?
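
A minimal user-space sketch of the "nop on non-broken archs" idea, for
illustration only.  The switch ARCH_NEEDS_DMA_REALIGN, the helpers
copy_to_aligned() and realign_for_broken_hw(), and the 128-byte DMA_ALIGN
value are all hypothetical names and numbers, not existing kernel symbols,
and a plain byte buffer stands in for the skb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DMA_ALIGN 128u  /* assumed cache line size, illustrative only */

/* Copy pkt into bounce at the first DMA_ALIGN-aligned address.
 * Returns pkt itself if it is already aligned, or NULL if the
 * bounce buffer is too small to hold the aligned copy. */
static const uint8_t *copy_to_aligned(const uint8_t *pkt, size_t len,
                                      uint8_t *bounce, size_t bounce_len)
{
    assert(pkt != NULL && bounce != NULL);

    if (((uintptr_t)pkt & (DMA_ALIGN - 1)) == 0)
        return pkt;                       /* nothing to fix */

    uintptr_t start = ((uintptr_t)bounce + DMA_ALIGN - 1) &
                      ~(uintptr_t)(DMA_ALIGN - 1);
    size_t skip = (size_t)(start - (uintptr_t)bounce);

    if (skip + len > bounce_len)
        return NULL;

    memcpy(bounce + skip, pkt, len);
    return bounce + skip;
}

#ifdef ARCH_NEEDS_DMA_REALIGN             /* hypothetical arch switch */
static const uint8_t *realign_for_broken_hw(const uint8_t *pkt, size_t len,
                                            uint8_t *bounce, size_t bounce_len)
{
    return copy_to_aligned(pkt, len, bounce, bounce_len);
}
#else
/* On arches that do not need the workaround the call costs nothing. */
static const uint8_t *realign_for_broken_hw(const uint8_t *pkt, size_t len,
                                            uint8_t *bounce, size_t bounce_len)
{
    (void)len; (void)bounce; (void)bounce_len;
    return pkt;
}
#endif
```

With the switch undefined, a driver calling realign_for_broken_hw()
unconditionally gets its original pointer back, which is the behaviour
described above.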

-scott

* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-16 18:21 Feldman, Scott
  2003-06-16 18:30 ` Dave Hansen
  0 siblings, 1 reply; 27+ messages in thread
From: Feldman, Scott @ 2003-06-16 18:21 UTC (permalink / raw)
  To: Herman Dierks, David S. Miller
  Cc: ltd, anton, haveblue, dwg, linux-kernel, Nancy J Milliner,
	Ricardo C Gonzalez, Brian Twichell, netdev

Herman wrote:
> It's only the MTU 1500 case with non-TSO that we are
> discussing here, so copying a few bytes is really not a big
> deal as the data is already in cache from copying into the
> kernel.  If it lets the adapter run at speed, that's what
> customers want and what we need. Granted, if the HW could
> deal with this we would not have to, but that's not the case
> today, so I want to spend a few CPU cycles to get the best
> performance. Again, if this is not done on other platforms, I
> don't understand why you care.

I care because adding the arch-specific hack creates a maintenance issue
for me.

-scott

* Re: e1000 performance hack for ppc64 (Power4)
@ 2003-06-15 14:40 Herman Dierks
  2003-06-15 14:44 ` David S. Miller
  2003-06-16 16:17 ` Nivedita Singhvi
  0 siblings, 2 replies; 27+ messages in thread
From: Herman Dierks @ 2003-06-15 14:40 UTC (permalink / raw)
  To: David S. Miller
  Cc: ltd, anton, haveblue, scott.feldman, dwg, linux-kernel,
	Nancy J Milliner, Ricardo C Gonzalez, Brian Twichell, netdev


Look folks, we run 40 to 48 GigE adapters in a 32-way p690 on AIX and
they basically all run at full speed, so let me see you try that on most of
these other boxes you are talking about.  Same adapter, same hardware
logic.
I have also seen what many of these other boxes you talk about do when data
or structures are not aligned on 64-bit boundaries.
The PPC HW does not have those 64-bit alignment issues.  So, each machine
has some warts.  I have yet to see a perfect one.

If you want a lot of PCI adapters on a box, it takes a number of bridge
chips and other IO links to do that.
Memory controllers like to deal with cache lines.
For larger packets, like jumbo frames or large send (TSO), the few added
DMAs are not an issue, because the packets are so large that the DMA soon
becomes aligned.  With TSO being the default, the small packet case
becomes less important anyway.  It's more of an issue on 2.4, where TSO is
not provided.  We also want this to run well if someone does not want to use
TSO.
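
A rough way to see why frame size matters is to count the cache lines a
frame spans.  The sketch below is only a model under assumptions: the
function name lines_touched() is made up for illustration, and the line
size is passed in rather than taken from any real platform.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines spanned by len bytes that begin offset bytes
 * into a cache line of the given size. */
static size_t lines_touched(size_t offset, size_t len, size_t line)
{
    assert(line != 0);
    if (len == 0)
        return 0;
    return ((offset % line) + len + line - 1) / line;
}
```

A misaligned start adds at most one extra line to the total, so the
relative cost of misalignment shrinks as the frame grows from an MTU 1500
frame to a jumbo frame or a TSO send.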

It's only the MTU 1500 case with non-TSO that we are discussing here, so
copying a few bytes is really not a big deal as the data is already in
cache from copying into the kernel.  If it lets the adapter run at speed,
that's what customers want and what we need.
Granted, if the HW could deal with this we would not have to, but that's not
the case today, so I want to spend a few CPU cycles to get the best
performance.
Again, if this is not done on other platforms, I don't understand why you
care.

If we have to do this for the PPC port, fine.  I have not seen any of you
suggest a better solution that works and will not be a worse hack to TCP or
other code.  Anton tried various other ideas before we fell back to doing
it the same way we did it in AIX.  This code is very localized and is
only used by platforms that need it.  Thus I don't see the big issue here.

Herman


"David S. Miller" <davem@redhat.com> on 06/14/2003 01:08:50 AM

To:    ltd@cisco.com
cc:    anton@samba.org, haveblue@us.ltcfwd.linux.ibm.com, Herman
       Dierks/Austin/IBM@IBMUS, scott.feldman@intel.com, dwg@au1.ibm.com,
       linux-kernel@vger.kernel.org, Nancy J Milliner/Austin/IBM@IBMUS,
       Ricardo C Gonzalez/Austin/IBM@ibmus, Brian
       Twichell/Austin/IBM@IBMUS, netdev@oss.sgi.com
Subject:    Re: e1000 performance hack for ppc64 (Power4)



   From: Lincoln Dale <ltd@cisco.com>
   Date: Sat, 14 Jun 2003 15:52:35 +1000

   can we have the TCP retransmit side take a performance hit if it needs
   to realign buffers?

You don't understand, the person who mangles the packet
must make the copy, not the person not doing the packet
modifications.

   for a "high performance app" requiring gigabit-type speeds,

...we probably won't be using ppc64 and e1000 cards, yes, I agree
:-)

Anton, go to the local computer store and pick up some tg3
 cards or a bunch of Taiwan specials :-)

* Re: e1000 performance hack for ppc64 (Power4)
@ 2003-06-15 14:32 Herman Dierks
  0 siblings, 0 replies; 27+ messages in thread
From: Herman Dierks @ 2003-06-15 14:32 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Feldman, Scott, David S. Miller, haveblue, dwg, linux-kernel,
	Nancy J Milliner, Ricardo C Gonzalez, Brian Twichell, netdev


Anton, I think the option described below is intended to cause the adapter
to "get off on a cache line boundary" so when it restarts the DMA it will
be aligned.  This is for cases when the adapter has to get off, for example
due to FIFO full, etc.
Some adapters would get off on any boundary, and that then causes perf
issues when the DMA is restarted.
This is a good option, but I don't think it addresses what we need here, as
the host needs to ensure a DMA starts on a cache line.
Different adapter anyway, but I am just pointing out that even if e1000
had this it would not be the solution.


Anton Blanchard <anton@samba.org> on 06/13/2003 07:03:42 PM

To:    "Feldman, Scott" <scott.feldman@intel.com>
cc:    "David S. Miller" <davem@redhat.com>,
       haveblue@us.ltcfwd.linux.ibm.com, Herman Dierks/Austin/IBM@IBMUS,
       dwg@au1.ibm.com, linux-kernel@vger.kernel.org, Nancy J
       Milliner/Austin/IBM@IBMUS, Ricardo C Gonzalez/Austin/IBM@ibmus,
       Brian Twichell/Austin/IBM@IBMUS, netdev@oss.sgi.com
Subject:    Re: e1000 performance hack for ppc64 (Power4)




> I thought the answer was no, so I double checked with a couple of
> hardware guys, and the answer is still no.

Hi Scott,

That's a pity, the e100 docs on sourceforge show it can do what we want,
it would be nice if e1000 had this feature too :)

4.2.2 Read Align

The Read Align feature is aimed to enhance performance in cache line
oriented systems. Starting a PCI transaction in these systems on a
non-cache line aligned address may result in low performance. To solve
this performance problem, the controller can be configured to terminate
Transmit DMA cycles on a cache line boundary, and start the next
transaction on a cache line aligned address. This feature is enabled
when the Read Align Enable bit is set in device Configure command
(Section 6.4.2.3, "Configure (010b)").

If this bit is set, the device operates as follows:

* When the device is close to running out of resources on the Transmit
  DMA (in other words, the Transmit FIFO is almost full), it attempts to
  terminate the read transaction on the nearest cache line boundary when
  possible.

* When the arbitration counters feature is enabled (maximum Transmit DMA
  byte count value is set in configuration space), the device switches
  to other pending DMAs on cache line boundary only.
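
The behaviour quoted above can be modelled roughly as follows.  This is a
sketch under assumptions, not the e100's actual logic: next_burst_len()
and its parameters are invented for illustration, with max_burst standing
in for whatever resource limit forces the device to stop early.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of the next read burst starting at addr.  If the transfer must
 * stop before everything is read, end it on a cache line boundary when
 * one is reachable, so the following burst starts aligned. */
static size_t next_burst_len(uintptr_t addr, size_t remaining,
                             size_t max_burst, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);  /* power of two */

    if (remaining <= max_burst)
        return remaining;                 /* whole transfer fits */

    uintptr_t end = addr + max_burst;
    uintptr_t aligned_end = end & ~(uintptr_t)(line - 1);

    if (aligned_end > addr)
        return (size_t)(aligned_end - addr);
    return max_burst;                     /* no boundary within reach */
}
```

Note that the model only aligns where a burst restarts.  The first burst
still begins wherever the host placed the buffer.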

* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-13 17:03 Herman Dierks
  0 siblings, 0 replies; 27+ messages in thread
From: Herman Dierks @ 2003-06-13 17:03 UTC (permalink / raw)
  To: haveblue
  Cc: Feldman, Scott, David Gibson, Linux Kernel Mailing List,
	Anton Blanchard, Nancy J Milliner, Ricardo C Gonzalez,
	Brian Twichell, netdev


I will let Anton respond to this.  I think he may have tried this some
time back in his early prototypes to fix this.
I think the problem was not where the buffer started but where the packet
ended up within the buffer.
Due to the varying sizes of the TCP and IP headers, the packet ended up at
some non-cache-aligned address.
What we need for the DMA to work well is to have the final packet (with
datalink headers) starting on a cache line, as it's the final packet that
must be DMA'd.  In fact it may need to be aligned to a higher level than
that (not sure).
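
To illustrate the point about header sizes: the frame handed to the
adapter starts at the payload address minus the combined header length,
so an aligned buffer does not imply an aligned frame.  The sketch below
assumes the headers sit directly in front of the payload.  The function
name frame_start_misalignment() is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of the frame start within its cache line, given where the
 * payload sits and how many header bytes precede it. */
static size_t frame_start_misalignment(uintptr_t payload_addr,
                                       size_t tcp_hlen, size_t ip_hlen,
                                       size_t eth_hlen, size_t line)
{
    assert(line != 0);
    uintptr_t frame_start = payload_addr - (tcp_hlen + ip_hlen + eth_hlen);
    return (size_t)(frame_start % line);
}
```

With the same payload address, a 20-byte TCP header and a 32-byte TCP
header (for example with the timestamp option) give different frame start
offsets, which is why fixing only the buffer start is not enough.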


haveblue@us.ltcfwd.linux.ibm.com on 06/13/2003 11:21:03 AM

To:    Herman Dierks/Austin/IBM@IBMUS
cc:    "Feldman, Scott" <scott.feldman@intel.com>, David Gibson
       <dwg@au1.ibm.com>, Linux Kernel Mailing List
       <linux-kernel@vger.kernel.org>, Anton Blanchard <anton@samba.org>,
       Nancy J Milliner/Austin/IBM@IBMUS, Ricardo C
       Gonzalez/Austin/IBM@ibmus, Brian Twichell/Austin/IBM@IBMUS,
       netdev@oss.sgi.com
Subject:    RE: e1000 performance hack for ppc64 (Power4)



Too long to quote:
http://marc.theaimsgroup.com/?t=105538879600001&r=1&w=2

Wouldn't you get most of the benefit from copying that stuff around in
the driver if you allocated the skb->data aligned in the first place?

There's already code to align them on CPU cache boundaries:
#define SKB_DATA_ALIGN(X)       (((X) + (SMP_CACHE_BYTES - 1)) & \
                                 ~(SMP_CACHE_BYTES - 1))

So, do something like this:
#ifdef ARCH_ALIGN_SKB_BYTES
#define SKB_ALIGN_BYTES ARCH_ALIGN_SKB_BYTES
#else
#define SKB_ALIGN_BYTES SMP_CACHE_BYTES
#endif
#define SKB_DATA_ALIGN(X)       (((X) + (SKB_ALIGN_BYTES - 1)) & \
                                 ~(SKB_ALIGN_BYTES - 1))

You could easily make this adaptive to not align on the arch size when the
request is bigger than that, just like in the e1000 patch you posted.
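
A user-space sketch of the rounding and of one possible reading of the
"adaptive" variant.  The values 128 and 512 are illustrative assumptions,
ARCH_ALIGN_SKB_BYTES is the proposed (not existing) arch override, and
the helper names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

#define SMP_CACHE_BYTES      128u   /* illustrative value */
#define ARCH_ALIGN_SKB_BYTES 512u   /* hypothetical arch override */

#ifdef ARCH_ALIGN_SKB_BYTES
#define SKB_ALIGN_BYTES ARCH_ALIGN_SKB_BYTES
#else
#define SKB_ALIGN_BYTES SMP_CACHE_BYTES
#endif

/* Round x up to a multiple of a, where a is a power of two. */
static size_t align_up(size_t x, size_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + (a - 1)) & ~(a - 1);
}

/* Same arithmetic as the SKB_DATA_ALIGN macro above. */
static size_t skb_data_align(size_t x)
{
    return align_up(x, SKB_ALIGN_BYTES);
}

/* Adaptive variant: requests larger than the arch size fall back to
 * plain cache line rounding. */
static size_t skb_data_align_adaptive(size_t x)
{
    size_t a = (x > SKB_ALIGN_BYTES) ? SMP_CACHE_BYTES : SKB_ALIGN_BYTES;
    return align_up(x, a);
}
```

Note that this rounds a size, not an address, so on its own it says
nothing about where the packet ends up inside the buffer.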
--
Dave Hansen
haveblue@us.ibm.com

[parent not found: <OF0078342A.E131D4B1-ON85256D44.0051F7C0@pok.ibm.com>]

Thread overview: 27+ messages
2003-06-13 23:52 e1000 performance hack for ppc64 (Power4) Feldman, Scott
2003-06-13 23:52 ` David S. Miller
2003-06-14  0:55   ` Anton Blanchard
2003-06-14  1:34     ` David S. Miller
2003-06-14  0:03 ` Anton Blanchard
  -- strict thread matches above, loose matches on Subject: below --
2003-06-16 18:56 Feldman, Scott
2003-06-16 18:21 Feldman, Scott
2003-06-16 18:30 ` Dave Hansen
2003-06-15 14:40 Herman Dierks
2003-06-15 14:44 ` David S. Miller
2003-06-16 16:17 ` Nivedita Singhvi
2003-06-15 14:32 Herman Dierks
2003-06-13 17:03 Herman Dierks
     [not found] <OF0078342A.E131D4B1-ON85256D44.0051F7C0@pok.ibm.com>
2003-06-13 16:21 ` Dave Hansen
2003-06-13 22:38   ` Anton Blanchard
2003-06-13 22:46     ` David S. Miller
2003-06-13 23:18       ` Anton Blanchard
2003-06-14  1:52         ` Lincoln Dale
2003-06-14  5:41           ` David S. Miller
2003-06-14  5:52             ` Lincoln Dale
2003-06-14  6:08               ` David S. Miller
2003-06-14  6:14                 ` David S. Miller
2003-06-14  6:27                   ` William Lee Irwin III
2003-06-14 17:08                   ` Greg KH
2003-06-15  3:01                     ` David S. Miller
2003-06-14  5:16       ` Nivedita Singhvi
2003-06-14  5:36         ` David S. Miller
