linuxppc-dev.lists.ozlabs.org archive mirror
* 2.5 or 2.4 kernel profiling
       [not found] <Pine.GSO.4.21.0012071148420.515-100000@eos>
@ 2000-12-07 18:11 ` Brian Ford
  2000-12-08 17:41   ` diekema_jon
  2000-12-11  0:45   ` Graham Stoney
  0 siblings, 2 replies; 26+ messages in thread
From: Brian Ford @ 2000-12-07 18:11 UTC (permalink / raw)
  To: linuxppc-embedded


I am trying to do some kernel profiling on my EST8260 to determine the
bottleneck in TCP and UDP throughput, but I can't seem to get any profile
information.

I have put "profile=2" on the kernel boot line, and readprofile -i
confirms this, but no function info shows up.  Is this known to be broken,
or have I just not figured out the magic combination yet?

Thanks.

--
Brian Ford
Software Engineer
Vital Visual Simulation Systems
FlightSafety International
Phone: 314-551-8460
Fax:   314-551-8444



* Re: 2.5 or 2.4 kernel profiling
  2000-12-07 18:11 ` 2.5 or 2.4 kernel profiling Brian Ford
@ 2000-12-08 17:41   ` diekema_jon
  2000-12-08 18:24     ` Brian Ford
  2000-12-11  0:45   ` Graham Stoney
  1 sibling, 1 reply; 26+ messages in thread
From: diekema_jon @ 2000-12-08 17:41 UTC (permalink / raw)
  To: Brian Ford; +Cc: linuxppc-embedded


> I am trying to do some kernel profiling on my EST8260 to determine the
> bottleneck in TCP and UDP throughput, but I can't seem to get any profile
> information.

What version of Linux are you using?  I have tried LTT, the Linux
Trace Toolkit, under 2.4.0-test11.  LTT tracks a large number of
events, so it may give you a better handle on what is happening
when.

http://www.opersys.com/LTT

+Kernel events tracing support
+CONFIG_TRACE
+  It is possible for the kernel to log important events to a tracing
+  driver.  Doing so enables the generated traces to be used to
+  reconstruct the dynamic behavior of the kernel, and hence the
+  whole system.
+
+  The tracing process consists of 4 parts:
+      1) The logging of events by key parts of the kernel.
+      2) The trace driver that keeps the events in a data buffer.
+      3) A trace daemon that opens the trace driver and is notified
+         every time there is a certain quantity of data to read
+         from the trace driver (using SIGIO).
+      4) A trace event data decoder that reads the accumulated data
+         and formats it in a human-readable form.
+
+  If you say Y or M here, the first part of the tracing process will
+  always take place.  That is, critical parts of the kernel will call
+  the kernel tracing function.  The data generated doesn't go any
+  further until a trace driver registers itself as such with the
+  kernel.  Therefore, if you answer Y, the driver will be part of the
+  kernel and events will always proceed to the driver; if you say M,
+  events will only proceed to the driver when its module is loaded.
+  Note that events aren't logged in the driver until the profiling
+  daemon opens the device, configures it and issues the "start"
+  command through ioctl().
+
+  The overhead of a fully functional system (kernel event logging +
+  driver event copying + active trace daemon) is about 2.5% for core
+  events.  This means that a task that took 100 seconds on a normal
+  system will take 102.5 seconds on a traced system.  This is very low
+  compared to other profiling or tracing methods.
+
+  For more information on kernel tracing, the trace daemon or the event
+  decoder, please check the following address:
+       http://www.opersys.com/LTT
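
To make part 3 of that list concrete, here is a minimal user-space sketch of
such a trace daemon.  The device name (/dev/tracer) and the ioctl command
(TRACER_START) are placeholders, not the real LTT interface; the actual names
and record format come from the LTT sources at the URL above.

    /* Hedged sketch of an LTT-style trace daemon: open the trace driver, ask
     * for SIGIO when data is ready, issue the "start" ioctl, then drain the
     * buffer from the signal handler.  Device name and ioctl are placeholders. */
    #include <fcntl.h>
    #include <signal.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define TRACER_START _IO('t', 1)                /* placeholder ioctl command */

    static int trace_fd;

    static void drain(int sig)                      /* SIGIO: data is ready to read */
    {
        char buf[8192];
        ssize_t n = read(trace_fd, buf, sizeof(buf));

        if (n > 0)
            write(STDOUT_FILENO, buf, n);           /* hand raw events to the decoder */
    }

    int main(void)
    {
        trace_fd = open("/dev/tracer", O_RDONLY);   /* placeholder device name */
        if (trace_fd < 0)
            return 1;

        signal(SIGIO, drain);
        fcntl(trace_fd, F_SETOWN, getpid());        /* deliver SIGIO to this process */
        fcntl(trace_fd, F_SETFL, fcntl(trace_fd, F_GETFL) | O_ASYNC);

        ioctl(trace_fd, TRACER_START);              /* the "start" command */
        for (;;)
            pause();                                /* wait for the next SIGIO */
    }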


* Re: 2.5 or 2.4 kernel profiling
  2000-12-08 17:41   ` diekema_jon
@ 2000-12-08 18:24     ` Brian Ford
  0 siblings, 0 replies; 26+ messages in thread
From: Brian Ford @ 2000-12-08 18:24 UTC (permalink / raw)
  To: diekema_jon; +Cc: linuxppc-embedded


On Fri, 8 Dec 2000, diekema_jon wrote:

> > I am trying to do some kernel profiling on my EST8260 to determine the
> > bottleneck in TCP and UDP throughput, but I can't seem to get any profile
> > information.
>
I figured it out.  It was a "corrupt" System.map file.  My Solaris cross
environment does not have all the extensions that Linux's Makefiles
expect.  grep was the culprit this time.

I am just a rebel who wants to create the cross environment myself
instead of using Monta Vista's.  Every so often I find another problem.

> What version of Linux are you using?  I have tried LTT, the Linux
> Trace Toolkit, under 2.4.0-test11.  LTT tracks a large number of
> events, so it may give you a better handle on what is happening
> when.
>
Thanks.  I have it, but I haven't figured it all out yet.

--
Brian Ford
Software Engineer
Vital Visual Simulation Systems
FlightSafety International
Phone: 314-551-8460
Fax:   314-551-8444



* Re: 2.5 or 2.4 kernel profiling
  2000-12-07 18:11 ` 2.5 or 2.4 kernel profiling Brian Ford
  2000-12-08 17:41   ` diekema_jon
@ 2000-12-11  0:45   ` Graham Stoney
  2000-12-11 15:27     ` Brian Ford
  1 sibling, 1 reply; 26+ messages in thread
From: Graham Stoney @ 2000-12-11  0:45 UTC (permalink / raw)
  To: Brian Ford; +Cc: linuxppc-embedded


Hi Brian,

On Thu, Dec 07, 2000 at 12:11:07PM -0600, Brian Ford wrote:
> I am trying to do some kernel profiling on my EST8260 to determine the
> bottleneck in TCP and UDP throughput, but I can't seem to get any profile
> information.

When I first attempted a similar thing with the 2.2 kernel on our 855T based
board, I found that the trivial do_profile routine needed to collect data for
/proc/profile kernel profiling wasn't implemented for the ppc architecture.
As far as I can see, it still isn't implemented on linux-2.4.0-test11, but it
is in the linuxppc_2_3 tree at http://www.fsmlabs.com/linuxppcbk.html .
I really wish the separate architecture maintainers had got together to
eliminate all the duplicated do_profile functions like I did in my 2.2 patch
at:
    http://members.nbci.com/greyhams/linux/patches/2.2/profile.patch

Unfortunately I guess it was easier for the PPC guys to just copy the
do_profile function (yet again!) like everyone else did.  Oh well, maybe in
2.5...
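
For reference, the hook in question is tiny.  A rough sketch of the per-tick
profiling code that each architecture carried (simplified from the 2.2/2.4-era
sources; the exact guards vary from tree to tree) looks like this:

    /* Rough sketch of the classic do_profile tick hook: bump a counter in
     * prof_buffer for the bucket containing the interrupted kernel PC.
     * readprofile later correlates /proc/profile with System.map.
     * Simplified; the exact guards differ between architectures. */
    static inline void do_profile(unsigned long pc)
    {
        extern char _stext;                 /* start of kernel text */

        if (!prof_buffer)                   /* no "profile=N" on the boot line */
            return;

        pc -= (unsigned long)&_stext;
        pc >>= prof_shift;                  /* N from profile=N */
        if (pc >= prof_len)
            pc = prof_len - 1;              /* out-of-range samples land in the last bucket */
        atomic_inc((atomic_t *)&prof_buffer[pc]);
    }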

Back to TCP, I found I could improve raw TCP throughput by 15-20% on the 855T
by DMAing received data directly into the kernel socket buffers.  The
improvement in performance from eliminating the extra copy between the ring
buffer and socket buffer isn't staggering, since the CPU still needs to do
a pass through the data to calculate the IP checksum, which unfortunately the
855T's FEC can't do for me.  Nevertheless, it does make things a little faster
and I would imagine a similar technique would work on the 8260; you can get a
feel for what is involved from my 2.2 FEC speedup patch at:
    http://members.nbci.com/greyhams/linux/patches/2.2/fecdmaskb.patch
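
The heart of that change is to point each receive buffer descriptor straight
at an skbuff's data area instead of at a fixed driver buffer.  A hedged sketch
of the ring-fill step follows; the cbd_t and BD_ENET_RX_* names come from the
8xx CPM headers, while RX_RING_SIZE, PKT_MAXBUF and rx_skbuff[] are purely
illustrative rather than taken from the actual patch.

    /* Hedged sketch: fill the Rx ring so the FEC/FCC DMAs straight into skbuffs.
     * cbd_t and the BD_ENET_RX_* flags are from the 8xx CPM headers; the ring
     * size, buffer size and rx_skbuff[] array are illustrative names only. */
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <asm/io.h>                             /* virt_to_bus */

    static void rx_ring_fill(struct net_device *dev, volatile cbd_t *bdp,
                             struct sk_buff *rx_skbuff[])
    {
        int i;

        for (i = 0; i < RX_RING_SIZE; i++, bdp++) {
            struct sk_buff *skb = dev_alloc_skb(PKT_MAXBUF);

            if (skb == NULL)
                break;                              /* leave the remaining slots empty */
            skb->dev = dev;
            rx_skbuff[i] = skb;
            bdp->cbd_bufaddr = virt_to_bus(skb->data);  /* DMA lands in the skbuff */
            bdp->cbd_sc = BD_ENET_RX_EMPTY |
                          ((i == RX_RING_SIZE - 1) ? BD_ENET_RX_WRAP : 0);
        }
    }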

Good luck!
Graham
--
Graham Stoney
Assistant Technology Manager
Canon Information Systems Research Australia
Ph: +61 2 9805 2909  Fax: +61 2 9805 2929


* Re: 2.5 or 2.4 kernel profiling
  2000-12-11  0:45   ` Graham Stoney
@ 2000-12-11 15:27     ` Brian Ford
  2000-12-12  2:36       ` Graham Stoney
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Ford @ 2000-12-11 15:27 UTC (permalink / raw)
  To: Graham Stoney; +Cc: linuxppc-embedded


On Mon, 11 Dec 2000, Graham Stoney wrote:

> Hi Brian,
>
> On Thu, Dec 07, 2000 at 12:11:07PM -0600, Brian Ford wrote:
> > I am trying to do some kernel profiling on my EST8260 to determine the
> > bottleneck in TCP and UDP throughput, but I can't seem to get any profile
> > information.
>
> When I first attempted a similar thing with the 2.2 kernel on our 855T based
> board, I found that the trivial do_profile routine needed to collect data for
> /proc/profile kernel profiling wasn't implemented for the ppc architecture.
> As far as I can see, it still isn't implemented on linux-2.4.0-test11, but it
> is in the linuxppc_2_3 tree at http://www.fsmlabs.com/linuxppcbk.html .
> I really wish the separate architecture maintainers had got together to
> eliminate all the duplicated do_profile functions like I did in my 2.2 patch
> at:
>     http://members.nbci.com/greyhams/linux/patches/2.2/profile.patch
>
> Unfortunately I guess it was easier for the PPC guys to just copy the
> do_profile function (yet again!) like everyone else did.  Oh well, maybe in
> 2.5...
>
Thanks.  I did get it to work with the bitkeeper sources.  My problem was
that the grep command that produced the System.map file didn't get along
with Solaris grep.

I agree with you about the profiling stuff.  Did you post this idea to the
main kernel mailing list?  Maybe that would be the place to tackle this
issue.

> Back to TCP, I found I could improve raw TCP throughput by 15-20% on the 855T
> by DMAing received data directly into the kernel socket buffers.  The
> improvement in performance from eliminating the extra copy between the ring
> buffer and socket buffer isn't staggering, since the CPU still needs to do
> a pass through the data to calculate the IP checksum, which unfortunately the
> 855T's FEC can't do for me.  Nevertheless, it does make things a little faster
> and I would imagine a similar technique would work on the 8260; you can get a
> feel for what is involved from my 2.2 FEC speedup patch at:
>     http://members.nbci.com/greyhams/linux/patches/2.2/fecdmaskb.patch
>
Thanks.  I had already hacked something like this together.  It would be
great to finalize these and get them into the real sources.

I also turned checksumming off for testing purposes.  It helped some, but
I think my bottleneck is that I can't get the bus to run faster than 33
MHz reliably.  If I could get the bus clocked at its rated speed, I might
be better off.

I don't think the 8260 CPM can do the checksums either.  Pity.  It seems
like there is plenty of power in there.

--
Brian Ford
Software Engineer
Vital Visual Simulation Systems
FlightSafety International
Phone: 314-551-8460
Fax:   314-551-8444



* Re: 2.5 or 2.4 kernel profiling
  2000-12-11 15:27     ` Brian Ford
@ 2000-12-12  2:36       ` Graham Stoney
  2000-12-12  3:26         ` Dan Malek
  2000-12-12 15:26         ` Brian Ford
  0 siblings, 2 replies; 26+ messages in thread
From: Graham Stoney @ 2000-12-12  2:36 UTC (permalink / raw)
  To: Brian Ford; +Cc: Graham Stoney, linuxppc-embedded


On Mon, Dec 11, 2000 at 09:27:18AM -0600, Brian Ford wrote:
> I agree with you about the profiling stuff.  Did you post this idea to the
> main kernel mailing list?

Sure; they were all too busy though.  Profiling already worked for most of
them, and a cross-architecture change either requires the co-operation of all
separate architecture maintainers, or a dictatorial initiative from above.

> Thanks.  I had already hacked something like this together.  It would be
> great to finalize these and get them into the real sources.

Yes, that would be excellent.

> I also turned checksumming off for testing purposes.  It helped some, but
> I think my bottle neck is that I can't get the bus to run faster than 33
> Mhz reliably.  If I could get the bus clocked at what it is rated, I might
> be better off.

Absolutely; the bus is the bottleneck.  You'll find the network throughput
scales almost linearly with bus speed, so getting it clocked faster will give
a higher payback than more driver tweaking.  Also, doesn't the 8260 have
separate memory subsystems to help get around this?

Regards,
Graham
--
Graham Stoney
Assistant Technology Manager
Canon Information Systems Research Australia
Ph: +61 2 9805 2909  Fax: +61 2 9805 2929


* Re: 2.5 or 2.4 kernel profiling
  2000-12-12  2:36       ` Graham Stoney
@ 2000-12-12  3:26         ` Dan Malek
  2000-12-12  7:28           ` Graham Stoney
  2000-12-12 15:26         ` Brian Ford
  1 sibling, 1 reply; 26+ messages in thread
From: Dan Malek @ 2000-12-12  3:26 UTC (permalink / raw)
  To: Graham Stoney; +Cc: Brian Ford, linuxppc-embedded


Graham Stoney wrote:

> Absolutely; the bus is the bottleneck.  You'll find the network throughput
> scales almost linearly with bus speed,

I've never seen that.  My 860P with 80/40 MHz is faster than the
same processor at 50/50 MHz.  I also haven't seen the big speed
improvement using the DMA changes either.  I am experimenting with
a couple of other things, such as aligning the IP data on the
incoming side (i.e. misaligning the Ethernet frame).  Just using
bigger TCP window sizes will help more than anything else.

What tests were you using?  I have a variety of little things I
have written, but mostly use a source/sink TCP application.

> .......  Also, doesn't the 8260 have
> separate memory subsystems to help get around this?

Yes, among other things.  The 8260 runs very well and I am currently
doing lots of performance testing on some custom boards.  I haven't
seen anything really bad in the driver yet, but there will likely be
some performance enhancements coming.


	-- Dan


* Re: 2.5 or 2.4 kernel profiling
  2000-12-12  3:26         ` Dan Malek
@ 2000-12-12  7:28           ` Graham Stoney
  2000-12-12 16:32             ` Brian Ford
  0 siblings, 1 reply; 26+ messages in thread
From: Graham Stoney @ 2000-12-12  7:28 UTC (permalink / raw)
  To: Dan Malek; +Cc: Graham Stoney, Brian Ford, linuxppc-embedded


> Graham Stoney wrote:
> > Absolutely; the bus is the bottleneck.  You'll find the network throughput
> > scales almost linearly with bus speed,

On Mon, Dec 11, 2000 at 10:26:46PM -0500, Dan Malek replied:
> I've never seen that.  My 860P with 80/40 MHz is faster than the
> same processor at 50/50 MHz.  I also haven't seen the big speed
> improvement using the DMA changes either.  I am experimenting with
> a couple of other things, such as aligning the IP data on the
> incoming side (i.e. misaligning the Ethernet frame).  Just using
> bigger TCP window sizes will help more than anything else.

OK, my comment was a bit simplistic; I should have said "bus and core speed".
It won't scale linearly if you reduce one but increase the other :-).
I found that our 855T at 80/40 MHz was just slightly slower than 50/50 when
measuring raw TCP throughput, my reasoning being that TCP performance over
the FEC is bus limited, so the reduction in bus speed more than offset the
gain in CPU core speed.  Once I added in a bit of application processing and
IDMA, it tipped the balance back towards 80/40 being marginally faster.  Our
slightly better optimised SDRAM UPM settings made a difference too (I ran the
same test on a CLLF with slightly different results), so other people's
kilometreage may vary.

I did see measurable gains by leaving the receive buffers cached (and adding
explicit calls to invalidate_dcache_range), and some more by DMA'ing large
packets directly to the skbuff to avoid the extra copy; this is what all the
other high performance Ethernet drivers do nowadays -- it would be nice to get
this improvement into the standard FEC/FCC drivers, even if it only gives an
extra 15-20%.  It should never be slower, and I found it actually simplified
some things like the ring buffer allocation slightly.  The only tricky bit
was what to do if I couldn't allocate a new rx skbuf to replace the one just
filled: the easiest solution was to just drop the current incoming packet and
reuse its skbuf next time.  I never saw this actually happen of course.
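
That drop-and-reuse fallback is only a few lines in the receive completion
path.  A hedged sketch follows; the cbd_t/BD_ENET_* names again come from the
CPM headers, and the helper and constant names are illustrative, not taken
from the actual driver.

    /* Hedged sketch of the Rx completion path with the drop-and-reuse fallback:
     * if no replacement skbuff can be allocated, the freshly filled one stays
     * in the ring and the packet is dropped. */
    #include <linux/etherdevice.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <asm/io.h>                             /* virt_to_bus */

    static void rx_complete(struct net_device *dev, volatile cbd_t *bdp,
                            struct sk_buff **slot)
    {
        struct sk_buff *skb = *slot;
        struct sk_buff *newskb = dev_alloc_skb(PKT_MAXBUF);

        if (newskb != NULL) {
            skb_put(skb, bdp->cbd_datlen - 4);      /* frame length minus CRC */
            skb->protocol = eth_type_trans(skb, dev);
            netif_rx(skb);                          /* filled skbuff goes upstream */

            newskb->dev = dev;
            *slot = newskb;                         /* replacement becomes the DMA target */
            bdp->cbd_bufaddr = virt_to_bus(newskb->data);
        }
        /* else: out of memory -- drop this packet, reuse its skbuff next time */

        bdp->cbd_sc |= BD_ENET_RX_EMPTY;            /* hand the descriptor back */
    }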

I looked at aligning the IP data too, but the FEC requires all Rx buffer
pointers in the descriptor to be 16 byte aligned, and since the Ethernet
header is 14 bytes, I couldn't see any way to do it.  Using a bigger TCP
window helps at the start; once the window is full though it makes no
difference from then on and it burns RAM, so it doesn't help average
throughput much in cases like ours where the total volume of data we're
trying to transfer to the 855T is significantly larger than its available RAM.

> What tests were you using?  I have a variety of little things I
> have written, but mostly use a source/sink TCP application.

Yes, I wrote a simple source/sink TCP app.  I discovered ttcp shortly after
writing my own.

> Yes, among other things.  The 8260 runs very well and I am currently
> doing lots of performance testing on some custom boards.  I haven't
> seen anything really bad in the driver yet, but there will likely be
> some performance enhancements coming.

Sounds good; I suspect Brian's throughput problems are mainly bus limited and
will go away when he gets the bus speed up.  Provided the CPU core speed
doesn't drop in the process of course :-).

Regards,
Graham
--
Graham Stoney
Assistant Technology Manager
Canon Information Systems Research Australia
Ph: +61 2 9805 2909  Fax: +61 2 9805 2929


* Re: 2.5 or 2.4 kernel profiling
  2000-12-12  2:36       ` Graham Stoney
  2000-12-12  3:26         ` Dan Malek
@ 2000-12-12 15:26         ` Brian Ford
  2000-12-12 17:12           ` Jerry Van Baren
  1 sibling, 1 reply; 26+ messages in thread
From: Brian Ford @ 2000-12-12 15:26 UTC (permalink / raw)
  To: Graham Stoney; +Cc: linuxppc-embedded


On Tue, 12 Dec 2000, Graham Stoney wrote:

> Also, doesn't the 8260 have separate memory subsystems to help get
> around this?
>
I assume you are referring to the local bus?  Well yes, but there are large
tradeoffs.

If you use the local bus for the receive buffers then you can have
simultaneous CPM to local bus and CPU to 60x bus transactions.  The catch
is that the local bus can not be cached.  So, you trade off bus contention
for caching/bursting.  The CPU must go across the 60x to local bus bridge
for those transactions.  The DMA engine can burst between the 60x and
local busses.

If the data has to end up in user space, it ends up being about a
wash, given the checksum and user space copy.  I need more testing to
confirm this, though.  If the user space copy was not needed, like for
routing, then it might help some.

--
Brian Ford
Software Engineer
Vital Visual Simulation Systems
FlightSafety International
Phone: 314-551-8460
Fax:   314-551-8444



* Re: 2.5 or 2.4 kernel profiling
  2000-12-12  7:28           ` Graham Stoney
@ 2000-12-12 16:32             ` Brian Ford
  2000-12-12 16:58               ` Dan Malek
  2000-12-13  1:15               ` Graham Stoney
  0 siblings, 2 replies; 26+ messages in thread
From: Brian Ford @ 2000-12-12 16:32 UTC (permalink / raw)
  To: Graham Stoney; +Cc: Dan Malek, linuxppc-embedded


On Tue, 12 Dec 2000, Graham Stoney wrote:

> On Mon, Dec 11, 2000 at 10:26:46PM -0500, Dan Malek replied:
> > I also haven't seen the big speed improvement using the DMA changes
> > either.  I am experimenting with a couple of other things, such as
> > aligning the IP data on the incoming side (i.e. misaligning the
> > Ethernet frame).  Just using bigger TCP window sizes will help more
> > than anything else.
>
> I did see measurable gains by leaving the receive buffers cached (and adding
> explicit calls to invalidate_dcache_range), and some more by DMA'ing large
> packets directly to the skbuff to avoid the extra copy; this is what all the
> other high performance Ethernet drivers do nowadays -- it would be nice to get
> this improvement into the standard FEC/FCC drivers, even if it only gives an
> extra 15-20%.  It should never be slower, and I found it actually simplified
> some things like the ring buffer allocation slightly.  The only tricky bit
> was what to do if I couldn't allocate a new rx skbuf to replace the one just
> filled: the easiest solution was to just drop the current incoming packet and
> reuse its skbuf next time.  I never saw this actually happen of course.
>
Are the explicit driver-level calls to invalidate_dcache_range necessary
on the receive side, or does the stack do them after netif_rx?  If they
are necessary, would bus snooping be more or less efficient?  As far as I
can tell, Dan's FCC driver has all the buffers in cached memory, but I
don't see any invalidate calls.  He does have the snooping bit set in the
FCMR.  I assume this is for CPM snooping, i.e. on the transmit side?

I also see measurable performance gains with direct DMA into the skbuf.  I
didn't seem to see much difference with aligning the IP header.  I will
measure both again and post the results.

I agree that almost all the high performance drivers do direct DMA to the
skbufs, and it would be nice for ours to follow suit.  Hopefully my
performance measurements will justify this.

The best solution to the "couldn't allocate an skbuf" problem seems to be
the one taken by Donald Becker in the tulip driver, called the "buffer
deficit scheme."  I am studying this to see if I can replicate it.

I would also like to implement hardware flow control for full duplex
connections.  My ultimate goal is to use UDP communications on a private
network and get the 8260 to not drop any packets when a Solaris box bursts
to full rate.

> > What tests were you using?  I have a variety of little things I
> > have written, but mostly use a source/sink TCP application.
>
> Yes, I wrote a simple source/sink TCP app.  I discovered ttcp shortly after
> writing my own.
>
I use ttcp most of the time, too.

> > Yes, among other things.  The 8260 runs very well and I am currently
> > doing lots of performance testing on some custom boards.  I haven't
> > seen anything really bad in the driver yet, but there will likely be
> > some performance enhancements coming.
>
> Sounds good; I suspect Brian's throughput problems are mainly bus limited and
> will go away when he gets the bus speed up.  Provided the CPU core speed
> doesn't drop in the process of course :-).
>
I hope so.  Maybe some good performance enhancements will come out of this
discussion.

--
Brian Ford
Software Engineer
Vital Visual Simulation Systems
FlightSafety International
Phone: 314-551-8460
Fax:   314-551-8444



* Re: 2.5 or 2.4 kernel profiling
  2000-12-12 16:32             ` Brian Ford
@ 2000-12-12 16:58               ` Dan Malek
  2000-12-12 17:17                 ` Brian Ford
  2000-12-13  1:15               ` Graham Stoney
  1 sibling, 1 reply; 26+ messages in thread
From: Dan Malek @ 2000-12-12 16:58 UTC (permalink / raw)
  To: Brian Ford; +Cc: Graham Stoney, linuxppc-embedded


Brian Ford wrote:

> Are the explicit driver level calls to invalidate_dcache_range necessary
> on the receive side,

None of the cache management calls are necessary on the 8260 since the
cache is snooped.  We are mixing processor types in this discussion,
so be careful what you assume is necessary or works for a particular
driver.  They are different.


	-- Dan


* Re: 2.5 or 2.4 kernel profiling
  2000-12-12 15:26         ` Brian Ford
@ 2000-12-12 17:12           ` Jerry Van Baren
  0 siblings, 0 replies; 26+ messages in thread
From: Jerry Van Baren @ 2000-12-12 17:12 UTC (permalink / raw)
  To: linuxppc-embedded


At 09:26 AM 12/12/00 -0600, Brian Ford wrote:

>On Tue, 12 Dec 2000, Graham Stoney wrote:
>
> > Also, doesn't the 8260 have separate memory subsystems to help get
> > around this?
> >
>I assume you are referring to the local bus?  Well yes, but there are
>large
>tradeoffs.
>
>If you use the local bus for the receive buffers then you can have
>simultaneous CPM to local bus and CPU to 60x bus transactions.  The catch
>is that the local bus can not be cached.  So, you trade off bus contention
>for caching/bursting.  The CPU must go across the 60x to local bus bridge
>for those transactions.  The DMA engine can burst between the 60x and
>local busses.
>
>If the data has to end up in user space, it ends up being about a
>wash, given the checksum and user space copy.  I need more testing to
>confirm this, though.  If the user space copy was not needed, like for
>routing, then it might help some.
>

I've been known to be wrong in the past, and I could be missing an
assumption, but local bus memory is cacheable, it just isn't
snoopable.  If you need snooping as a prerequisite for enabling caching,
that would make the local bus effectively uncacheable.  It is also 32
bits wide (max) rather than 64, which will affect your bus bandwidth.

gvb



* Re: 2.5 or 2.4 kernel profiling
  2000-12-12 16:58               ` Dan Malek
@ 2000-12-12 17:17                 ` Brian Ford
  2000-12-12 21:03                   ` Dan Malek
  0 siblings, 1 reply; 26+ messages in thread
From: Brian Ford @ 2000-12-12 17:17 UTC (permalink / raw)
  To: Dan Malek; +Cc: Graham Stoney, linuxppc-embedded


On Tue, 12 Dec 2000, Dan Malek wrote:

> Brian Ford wrote:
>
> > Are the explicit driver level calls to invalidate_dcache_range necessary
> > on the receive side,
>
> None of the cache management calls are necessary on the 8260 since the
> cache is snooped.  We are mixing processor types in this discussion,
> so be careful what you assume is necessary or works for a particular
> driver.  They are different.
>
Yes, I know.

Does your statement mean that 60x bus memory is mapped
_PAGE_COHERENT?  What is the exact meaning of the GBL bit in the FCRx
register?

I am confused about how they relate.

--
Brian Ford
Software Engineer
Vital Visual Simulation Systems
FlightSafety International
Phone: 314-551-8460
Fax:   314-551-8444



* Re: 2.5 or 2.4 kernel profiling
  2000-12-12 17:17                 ` Brian Ford
@ 2000-12-12 21:03                   ` Dan Malek
  0 siblings, 0 replies; 26+ messages in thread
From: Dan Malek @ 2000-12-12 21:03 UTC (permalink / raw)
  To: Brian Ford; +Cc: Graham Stoney, linuxppc-embedded


Brian Ford wrote:

> Does your statement mean that 60x bus memory is mapped
> _PAGE_COHERENT?  What is the exact meaning of the GBL bit in the FCRx
> register?

The GBL flag in the FCRx is orthogonal to the PAGE_COHERENT (M in WIMG)
of the processor.  Since the CPM is a bus master, the GBL flag is used
to indicate whether it should announce memory updates in the cache
protocol.  The CPM shared memory is mapped uncached to the processor,
and I don't see any reason to do that differently.  For some reason,
I believe the GBL flag also ensures the CPM DMA will snoop the processor
cache, and setting PAGE_COHERENT isn't necessary.  I don't know why
I think this, except it appears to work that way :-).  Logically, we
should be required to set PAGE_COHERENT......It's a note on my board,
I'll keep looking at it.

Thanks.


	-- Dan


* Re: 2.5 or 2.4 kernel profiling
  2000-12-12 16:32             ` Brian Ford
  2000-12-12 16:58               ` Dan Malek
@ 2000-12-13  1:15               ` Graham Stoney
  2000-12-13 16:14                 ` Dan Malek
  1 sibling, 1 reply; 26+ messages in thread
From: Graham Stoney @ 2000-12-13  1:15 UTC (permalink / raw)
  To: Brian Ford; +Cc: Graham Stoney, Dan Malek, linuxppc-embedded


[ Dan's already answered the caching question wrt the 8260, so... ]

On Tue, Dec 12, 2000 at 10:32:53AM -0600, Brian Ford wrote:
> The best solution to the "couldn't allocate an skbuf" seems to be the one
> taken by Donald Becker in the tulip driver called the "buffer deficit
> scheme."  I am studying this to see if I can replicate it.

Right; I found a reference where he describes it at:
    http://www.tux.org/hypermail/linux-net/1998-Oct/0256.html

This does indeed sound better; the only sticky part I can think of is setting
the FCC/FEC to keep giving you Rx interrupts even when there are no buffer
descriptors to put the incoming packets in, because the ring hasn't been
refilled after running out of memory.  Maybe that Just Works though, I dunno;
I think you'd want to test the code in a low-memory environment to be sure it
can actually recover from running completely out of Rx skbufs.
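
For what it's worth, the deficit bookkeeping itself is simple: count the ring
slots that could not be refilled and retry the allocations later (from the
next interrupt, or a timer) instead of looping in the Rx handler.  A hedged
sketch, with illustrative names rather than anything from the tulip or
FEC/FCC sources:

    /* Hedged sketch of Becker-style "buffer deficit" bookkeeping.  Slots that
     * could not be refilled are counted in 'deficit' and retried later. */
    struct rx_ring {
        struct sk_buff  *skbuff[RX_RING_SIZE];      /* NULL means "no buffer yet" */
        volatile cbd_t  *bd[RX_RING_SIZE];
        int              deficit;                   /* slots currently without an skbuff */
    };

    static void rx_refill(struct net_device *dev, struct rx_ring *ring)
    {
        int i;

        for (i = 0; ring->deficit > 0 && i < RX_RING_SIZE; i++) {
            struct sk_buff *skb;

            if (ring->skbuff[i] != NULL)
                continue;                           /* slot already has a buffer */
            skb = dev_alloc_skb(PKT_MAXBUF);
            if (skb == NULL)
                break;                              /* still out of memory; retry later */
            skb->dev = dev;
            ring->skbuff[i] = skb;
            ring->bd[i]->cbd_bufaddr = virt_to_bus(skb->data);
            ring->bd[i]->cbd_sc |= BD_ENET_RX_EMPTY;
            ring->deficit--;
        }
    }

The Rx handler then just bumps ring->deficit when dev_alloc_skb() fails, and
rx_refill() is called from the next interrupt or a timer until the deficit
drains.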

It's cool to see some others looking at this stuff in addition to Dan's
most excellent work so far.  Shoulders of giants and all that.

Regards,
Graham
--
Graham Stoney
Assistant Technology Manager
Canon Information Systems Research Australia
Ph: +61 2 9805 2909  Fax: +61 2 9805 2929


* Re: 2.5 or 2.4 kernel profiling
  2000-12-13  1:15               ` Graham Stoney
@ 2000-12-13 16:14                 ` Dan Malek
  2000-12-13 17:23                   ` Arto Vuori
                                     ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Dan Malek @ 2000-12-13 16:14 UTC (permalink / raw)
  To: Graham Stoney; +Cc: Brian Ford, linuxppc-embedded


Graham Stoney wrote:

> This does indeed sound better; the only sticky part I can think of is setting
> the FCC/FEC to keep giving you Rx interrupts even when there are no buffer
> descriptors to put the incoming packets in,


Although I have not yet proven this, I am leaning toward the following.
Allocate a small fixed set of receive buffers (like we used to do)
in the driver and mark them copy-back cached.  The received BDs will
always point to these buffers.  Then, copy-and-sum these into IP
aligned skbuffs.  The advantage of Graham's DMA into skbufs isn't that
the driver doesn't copy/sum, it is that later when the IP stack does it
we get burst transfers into cache.  So, we get this advantage plus
the IP packet aligned properly for the remainder of the stack.  Of
course, the downside of this is the receive buffers are one-time
cached.  We blow the cache away to make the TCP benchmark look good,
and the remaining applications suffer.  I still have a problem with
this.....making single focused benchmarks look good isn't necessarily
the best for the overall system application.
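
A hedged sketch of that receive path: the BD keeps pointing at a fixed, cached
driver buffer, and each packet is copied into a freshly allocated skbuff with
two bytes reserved so the IP header ends up 4-byte aligned.  The checksum
could be folded into the copy (e.g. with a csum_partial_copy variant); a plain
memcpy is shown to keep the sketch short, and the names are illustrative.

    /* Hedged sketch of the copy-into-aligned-skbuff path described above.
     * rxbuf is the fixed, cached driver buffer the BD points at; the checksum
     * could be computed during this copy instead of later in the stack. */
    #include <linux/etherdevice.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <linux/string.h>

    static void rx_copy_aligned(struct net_device *dev, unsigned char *rxbuf,
                                int len)
    {
        struct sk_buff *skb = dev_alloc_skb(len + 2);

        if (skb == NULL)
            return;                             /* drop; the fixed buffer is reused anyway */
        skb->dev = dev;
        skb_reserve(skb, 2);                    /* 2 + 14-byte Ethernet header = aligned IP header */
        memcpy(skb_put(skb, len), rxbuf, len);  /* copy out of the cached driver buffer */
        skb->protocol = eth_type_trans(skb, dev);
        netif_rx(skb);
    }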


	-- Dan


* Re: 2.5 or 2.4 kernel profiling
  2000-12-13 16:14                 ` Dan Malek
@ 2000-12-13 17:23                   ` Arto Vuori
  2000-12-13 17:33                     ` Dan Malek
  2000-12-13 22:08                   ` Brian Ford
  2000-12-14  7:21                   ` 2.5 or 2.4 kernel profiling Graham Stoney
  2 siblings, 1 reply; 26+ messages in thread
From: Arto Vuori @ 2000-12-13 17:23 UTC (permalink / raw)
  To: Dan Malek; +Cc: Graham Stoney, Brian Ford, linuxppc-embedded


Dan Malek wrote:
> Although I have not yet proven this, I am leaning toward the following.
> Allocate a small fixed set of receive buffers (like we used to do)
> in the driver and mark them copy-back cached.  The received BDs will
> always point to these buffers.  Then, copy-and-sum these into IP
> aligned skbuffs.  The advantage of Graham's DMA into skbufs isn't that
> the driver doesn't copy/sum, it is that later when the IP stack does it
> we get burst transfers into cache.  So, we get this advantage plus
> the IP packet aligned properly for the remainder of the stack.  Of
> course, the downside of this is the receive buffers are one-time
> cached.  We blow the cache away to make the TCP benchmark look good,
> and the remaining applications suffer.  I still have a problem with
> this.....making single focused benchmarks look good isn't necessarily
> the best for the overall system application.

That might help TCP performance with a single benchmark application,
but I can't see much practical use for it.  I have done some tweaking
on the 8260 Ethernet drivers, and we implemented that DMA-into-skbufs
optimization.  It gives a significant performance boost when we are routing
packets.  Actually, we have gone so far in favoring routing performance that
I disabled per-packet RX & TX interrupts and poll the FCC interface using
timer interrupts.  That adds some constant overhead, but also improves
routing performance if you are routing a lot of small packets.  I think
it reduces interrupt load, and the IP routing code runs in nice small
cache-friendly loops moving data from one queue to another.
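
One way that polling arrangement could look with the 2.4 timer API is sketched
below; fcc_rx_poll(), fcc_tx_reclaim() and POLL_JIFFIES are illustrative
names, not functions from the actual driver.

    /* Hedged sketch: poll the FCC rings from a periodic kernel timer instead
     * of taking per-packet interrupts.  Poll helpers and interval are
     * illustrative names only. */
    #include <linux/netdevice.h>
    #include <linux/sched.h>                        /* jiffies */
    #include <linux/timer.h>

    static struct timer_list fcc_poll_timer;

    static void fcc_poll(unsigned long data)
    {
        struct net_device *dev = (struct net_device *)data;

        fcc_rx_poll(dev);                           /* drain whatever the Rx ring collected */
        fcc_tx_reclaim(dev);                        /* free buffers the FCC has transmitted */

        mod_timer(&fcc_poll_timer, jiffies + POLL_JIFFIES);     /* re-arm */
    }

    static void fcc_poll_start(struct net_device *dev)
    {
        init_timer(&fcc_poll_timer);
        fcc_poll_timer.function = fcc_poll;
        fcc_poll_timer.data     = (unsigned long)dev;
        mod_timer(&fcc_poll_timer, jiffies + POLL_JIFFIES);
    }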

We also found and fixed some problems, including:
* The FCC driver leaked memory when there was too much incoming data.
* Some error conditions caused the FCC to stop receiving, and the restart
didn't work.
* The PHY and the FCC must have the same full/half-duplex mode setting, i.e.
the autonegotiation result must be read using the MII management interface.

When I have some time to clean up the code and make it work with a
standard board, I could send patches if anybody is interested.

	-Arto


* Re: 2.5 or 2.4 kernel profiling
  2000-12-13 17:23                   ` Arto Vuori
@ 2000-12-13 17:33                     ` Dan Malek
  2000-12-13 17:55                       ` Arto Vuori
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Malek @ 2000-12-13 17:33 UTC (permalink / raw)
  To: Arto Vuori; +Cc: Graham Stoney, Brian Ford, linuxppc-embedded


Arto Vuori wrote:

> We also found and fixed some problems including:

I have been pushing driver updates into the BK linuxppc_2_5 tree,
but not into the linuxppc_2_3 tree, as we are trying to stabilize
that for 2.4.  All of the issues you mentioned, and others, have been
addressed.  The MII interface is still broken for some PHYs.

Once the linuxppc_2_5 driver is better, I will push it into the 2.4
baseline.

For all processors, the BK linuxppc_2_5 tree has many more updates
than the more visible 2_3 tree.

Send me any patches you have as you create them.  It is lots easier
to add a few lines here and there than to get a bunch of changes
from an old baseline.


	-- Dan

--

	I like MMUs because I don't have a real life.


* Re: 2.5 or 2.4 kernel profiling
  2000-12-13 17:33                     ` Dan Malek
@ 2000-12-13 17:55                       ` Arto Vuori
  0 siblings, 0 replies; 26+ messages in thread
From: Arto Vuori @ 2000-12-13 17:55 UTC (permalink / raw)
  To: Dan Malek; +Cc: linuxppc-embedded


Dan Malek wrote:
> I have been pushing driver updates into the BK linuxppc_2_5 tree,
> but not into the linuxppc_2_3 tree, as we are trying to stabilize
> that for 2.4.  All of the issues you mentioned, and others, have been
> addressed.  The MII interface is still broken for some PHYs.

OK

> For all processors, the BK linuxppc_2_5 tree has many more updates
> than the more visible 2_3 tree.

Is there any way to get the BK linuxppc_2_5 tree without using the BitKeeper
tools?  I found only older snapshots on the FTP site, and rsync is unable
to find a linuxppc_2_5 snapshot.

	-Arto

--
Arto Vuori
email: avuori@ssh.com
mobile:	+358 40 754 5223


* Re: 2.5 or 2.4 kernel profiling
  2000-12-13 16:14                 ` Dan Malek
  2000-12-13 17:23                   ` Arto Vuori
@ 2000-12-13 22:08                   ` Brian Ford
  2000-12-13 22:45                     ` Jerry Van Baren
  2000-12-13 22:53                     ` Dan Malek
  2000-12-14  7:21                   ` 2.5 or 2.4 kernel profiling Graham Stoney
  2 siblings, 2 replies; 26+ messages in thread
From: Brian Ford @ 2000-12-13 22:08 UTC (permalink / raw)
  To: Dan Malek; +Cc: Graham Stoney, linuxppc-embedded


On Wed, 13 Dec 2000, Dan Malek wrote:

> Graham Stoney wrote:
>
> > This does indeed sound better; the only sticky part I can think of is setting
> > the FCC/FEC to keep giving you Rx interrupts even when there are no buffer
> > descriptors to put the incoming packets in,
>
Won't we get a "Busy error" Rx interrupt in this case?

> Although I have not yet proven this, I am leaning toward the following.
> Allocate a small fixed set of receive buffers (like we used to do)
> in the driver and mark them copy-back cached.  The received BDs will
> always point to these buffers.  Then, copy-and-sum these into IP
> aligned skbuffs.  The advantage of Graham's DMA into skbufs isn't that
> the driver doesn't copy/sum, it is that later when the IP stack does it
> we get burst transfers into cache.  So, we get this advantage plus
> the IP packet aligned properly for the remainder of the stack.
>
I speak only for the 8260, but...

With it, you can DMA directly into IP aligned skbuffs, eliminating the
copy.  I've done it and it seems to work.  I'll have to benchmark it, but
the copy overhead should be significant.  This just makes IP do the
checksum later.

Also, to avoid bus contention, shouldn't the Rx buffers be on the local
bus?  Probably the BD's too.  Unless we can figure out how to
put these in DPRAM, but it doesn't look possible for the FCC's.  I don't
know if it is possible to allocate skbuffs in other than 60x bus SDRAM,
though.

> Of course, the downside of this is the receive buffers are one-time
> cached.  We blow the cache away to make the TCP benchmark look good,
> and the remaining applications suffer.  I still have a problem with
> this.....making single focused benchmarks look good isn't necessarily
> the best for the overall system application.
>
I see your point.  Caching may not be the way to go.  Incidentally, this
is, for all practical purposes, my application.

The ideas I put forth above may still improve performance if they are
feasible.  Eliminating one copy can only help, since we can still have the
alignment.  Unless the local bus proves to be a larger gain and we can't
do that there.

Opinions welcome.

--
Brian Ford
Software Engineer
Vital Visual Simulation Systems
FlightSafety International
Phone: 314-551-8460
Fax:   314-551-8444



* Re: 2.5 or 2.4 kernel profiling
  2000-12-13 22:08                   ` Brian Ford
@ 2000-12-13 22:45                     ` Jerry Van Baren
  2000-12-13 22:53                     ` Dan Malek
  1 sibling, 0 replies; 26+ messages in thread
From: Jerry Van Baren @ 2000-12-13 22:45 UTC (permalink / raw)
  To: Brian Ford, Dan Malek; +Cc: Graham Stoney, linuxppc-embedded


At 04:08 PM 12/13/00 -0600, Brian Ford wrote:

>On Wed, 13 Dec 2000, Dan Malek wrote:
>
> > Graham Stoney wrote:
> >

[snip]

>I speak only for the 8260, but...
>
>With it, you can DMA directly into IP aligned skbuffs, eliminating the
>copy.  I've done it and it seems to work.  I'll have to benchmark it, but
>the copy overhead should be significant.  This just makes IP do the
>checksum later.
>
>Also, to avoid bus contention, shouldn't the Rx buffers be on the local
>bus?  Probably the BD's too.  Unless we can figure out how to
>put these in DPRAM, but it doesn't look possible for the FCC's.  I don't
>know if it is possible to allocate skbuffs in other than 60x bus SDRAM,
>though.

I asked the Mot help line about this.  Their response follows.  The
bottom line, according to them, is that you can put the BDs in DPRAM,
but it will still cause 60x bus cycles, defeating the purpose of
putting them in DPRAM.  I have not personally verified this.

----------------------------------------------------------------------

Date: Wed, 3 May 2000 07:36:33 +0200
Subject: Motorola DigitalDNA Help, Service Request # 1-PUU6 Reply
To: vanbaren_gerald@si.com
From: DigitalDNA Help <DigitalDNA.Help@motorola.com>

Dear K Van Baren,

in reply to your Service Request SR 1-PUU6 [ref. TORLP] (see details
below):


CPM busy-polling adds about 1% overhead to bus activity.  Can you ignore
it?
You may put FCC BDs in Dual-Port RAM, but the CPM will continue to access
them through the 60x bus.
The only way to disable this polling is to issue the STOP TX command when
you have no data to send.

------- Details of your request: -------

Date Opened :  04 Apr 2000 06:46:09
Product :       MPC8260
Category:       Technical Request

---------- Subject ----------
8260 FCC buffer description

---------- Description ----------
I want to minimize bus activity due to the CPM busy-polling buffer
descriptors for the transmit "ready" flag.  I realize I can turn off the
polling and do the transmit-on-demand trick, but polling is convenient.
To minimize bus activity, I wish to put the FCC buffer descriptors in
Dual-Port RAM.

The discussion of Dual-Port RAM in section 13.5 talks about the buffer
descriptors residing in Dual-Port RAM or in main memory.  The SCC
discussion on buffer descriptors in section 19.2 says the buffer
descriptors for the SCCs, SMCs, SPI, and I2C _must_ reside in Dual-Port
RAM, and the RBASE and TBASE definitions enforce that, since they are
16 bits only.

The FCCs are not mentioned in the above list.  The FCC parameter RAM
has a full 32-bit address for RBASE and TBASE, and there is a flag that
selects whether the buffer descriptors are in local bus memory or 60x
bus memory.  Can they be put in Dual-Port RAM?

------- End of request details -------





Regards,
Motorola Semiconductors Customer Support

----------------------------------------------------------------------


[snip]

>Opinions welcome.

I have plenty of them :-)



gvb



* Re: 2.5 or 2.4 kernel profiling
  2000-12-13 22:08                   ` Brian Ford
  2000-12-13 22:45                     ` Jerry Van Baren
@ 2000-12-13 22:53                     ` Dan Malek
  2000-12-14 17:29                       ` FEC/FCC driver issues Brian Ford
  1 sibling, 1 reply; 26+ messages in thread
From: Dan Malek @ 2000-12-13 22:53 UTC (permalink / raw)
  To: Brian Ford; +Cc: Graham Stoney, linuxppc-embedded


Brian Ford wrote:

> I speak only for the 8260, but...
>
> With it, you can DMA directly into IP aligned skbuffs,

No, you can't.  The receiver buffers must be 32-byte aligned.  The
Ethernet header is 14 bytes, so the IP frame starts on this 16-bit
aligned boundary.  Then the IP stack promptly does a 32-bit load from
this misaligned address.

> ... I've done it and it seems to work.

I don't think so......It is typical of all of the CPM devices to
require strict alignment of incoming buffers, and very relaxed alignment
of outgoing buffers (which is precisely what you need for most
protocol processing).

> .....  I'll have to benchmark it, but
> the copy overhead should be significant.  This just makes IP do the
> checksum later.

Right, you have to read this data at some point, so the copy-sum does
both in one operation.  It does the checksum while it is moving the
buffer and aligning it on the IP frame boundary.  This also gives the
advantage of the IP frame in the cache, so when you push it upstream
you are likely to get some cache hits.

> Also, to avoid bus contention, shouldn't the Rx buffers be on the local
> bus?  Probably the BD's too.

Yeah, that's why they designed the part this way.  Not many boards
use it, though.

> .....  Unless we can figure out how to
> put these in DPRAM, but it doesn't look possible for the FCC's.

I have tried, and for some reason I couldn't make it work correctly.
It should, I just haven't been back to debug it.

> ....  I don't
> know if it is possible to allocate skbuffs in other than 60x bus SDRAM,
> though.

Probably not worth it.  The 66 MHz buses with burst mode and fast
SDRAM are pretty sweet.  The skbuff pool can get very large, certainly
more than would fit into DPRAM, but it is sufficiently modular that you
could put it into the local DRAM.


> .....  Unless the local bus proves to be a larger gain and we can't
> do that there.

The local bus can be an advantage for some applications.  For data
or information the CPM accesses frequently (BDs, hash tables, channel
tables, scheduler tables, etc.) this is a great benefit.  If you are
also just doing packet/frame routing, this is a useful place for the data.
It seems for other applications, where the PowerPC core is going to
use the data frequently, the best place is the 60x memory.  It is
one of those features that you shouldn't try to use just because it
is there.


	-- Dan


* Re: 2.5 or 2.4 kernel profiling
  2000-12-13 16:14                 ` Dan Malek
  2000-12-13 17:23                   ` Arto Vuori
  2000-12-13 22:08                   ` Brian Ford
@ 2000-12-14  7:21                   ` Graham Stoney
  2000-12-14 16:58                     ` Dan Malek
  2 siblings, 1 reply; 26+ messages in thread
From: Graham Stoney @ 2000-12-14  7:21 UTC (permalink / raw)
  To: Dan Malek; +Cc: Graham Stoney, Brian Ford, linuxppc-embedded


On Wed, Dec 13, 2000 at 11:14:00AM -0500, Dan Malek wrote:
> Although I have not yet proven this, I am leaning toward the following.
> Allocate a small fixed set of receive buffers (like we used to do)
> in the driver and mark them copy-back cached.  The received BDs will
> always point to these buffers.

I dunno; now that I've heard about the "buffer deficit scheme", I think it
gives better memory utilisation, since the receive buffers aren't permanently
tied up in the network driver.  It's the way the other drivers now do it.

> Then, copy-and-sum these into IP aligned skbuffs.

I think it depends whether the gain of having IP headers aligned outweighs the
cost of the extra copy in place of just a checksum.  The copy will tend to
throw more stuff out of the cache since it's dealing with the data twice.
Conventional logic when optimising network stacks is to eliminate copies, and
the only thing I see here that contradicts that is that we end up with not-
nicely-aligned IP headers.  In my application, eliminating the copy more than
offset the loss due to unaligned headers.

> The advantage of Graham's DMA into skbufs isn't that the driver doesn't
> copy/sum, it is that later when the IP stack does it we get burst transfers
> into cache.

But we get burst transfers into cache in either case, whether it's during
the checksum in csum_partial or the copy/sum in csum_partial_copy_generic.
The difference is that in the copy case, the data that gets loaded is only
used once, to write to another address in another cache line.  This extra
write will carry both a caching and bus penalty, and the impact is likely to
be worse in real applications than in simplistic throughput tests.  Hence I
believe the conventional wisdom about eliminating the copy applies.

I know everything has second order effects and profiling is a perilous
minefield, but just occasionally things really do end up the way first order
thinking suggests :-).

Regards,
Graham
--
Graham Stoney
Assistant Technology Manager
Canon Information Systems Research Australia
Ph: +61 2 9805 2909  Fax: +61 2 9805 2929


* Re: 2.5 or 2.4 kernel profiling
  2000-12-14  7:21                   ` 2.5 or 2.4 kernel profiling Graham Stoney
@ 2000-12-14 16:58                     ` Dan Malek
  2000-12-15  0:18                       ` Graham Stoney
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Malek @ 2000-12-14 16:58 UTC (permalink / raw)
  To: Graham Stoney; +Cc: Brian Ford, linuxppc-embedded


Graham Stoney wrote:

> > The advantage of Graham's DMA into skbufs isn't that the driver doesn't
> > copy/sum, it is that later when the IP stack does it we get burst transfers
> > into cache.
>
> But we get burst transfers into cache in either case,

No, because I originally allocated the receive buffers uncached.
Only cached/copyback pages will burst on the bus.  The CPM DMA will
always burst.


	-- Dan


* FEC/FCC driver issues
  2000-12-13 22:53                     ` Dan Malek
@ 2000-12-14 17:29                       ` Brian Ford
  0 siblings, 0 replies; 26+ messages in thread
From: Brian Ford @ 2000-12-14 17:29 UTC (permalink / raw)
  To: Dan Malek; +Cc: Graham Stoney, linuxppc-embedded


I changed the Subject line to better reflect the current discussion.

On Wed, 13 Dec 2000, Dan Malek wrote:

> Brian Ford wrote:
>
> > I speak only for the 8260, but...
> >
> > With it, you can DMA directly into IP aligned skbuffs,
>
> No, you can't.  The receiver buffers must be 32-byte aligned.  The
> Ethernet header is 14 bytes, so the IP frame starts on this 16-bit
> aligned boundary.  Then the IP stack promptly does a 32-bit load from
> this misaligned address.
>
> > ... I've done it and it seems to work.
>
> I don't think so......It is typical of all of the CPM devices to
> require strict alignment of incoming buffers, and very relaxed alignment
> of outgoing buffers (which is precisely what you need for most
> protocol processing).
>
That does not match my understanding or my actual experience.  The Rx buffer
size must be a multiple of 32, but it need not be aligned that way.  The CPM
uses the DMA engine to move the data from the DPRAM to its destination,
and the DMA engine supports unaligned destinations by transferring the
unaligned part, bursting the aligned part, and then transferring the
remaining unaligned part.
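
If the SDMA really does cope with an unaligned destination, the alignment
trick is the usual one: reserve two bytes at the head of the skbuff before
handing it to the BD, so the 14-byte Ethernet header pushes the IP header onto
a 4-byte boundary.  A hedged sketch, with illustrative names:

    /* Hedged sketch of the DMA-direct-with-alignment setup discussed here:
     * offset the skbuff data pointer by 2 before giving it to the BD, so the
     * IP header behind the 14-byte Ethernet header comes out 4-byte aligned.
     * This relies on the SDMA accepting an unaligned destination address. */
    static struct sk_buff *rx_alloc_aligned(struct net_device *dev,
                                            volatile cbd_t *bdp)
    {
        struct sk_buff *skb = dev_alloc_skb(PKT_MAXBUF + 2);

        if (skb == NULL)
            return NULL;
        skb->dev = dev;
        skb_reserve(skb, 2);                        /* 2 + 14 = 16 */
        bdp->cbd_bufaddr = virt_to_bus(skb->data);  /* deliberately unaligned DMA target */
        return skb;
    }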

> > .....  I'll have to benchmark it, but
> > the copy overhead should be significant.  This just makes IP do the
> > checksum later.
>
> Right, you have to read this data at some point, so the copy-sum does
> both in one operation.  It does the checksum while it is moving the
> buffer and aligning it on the IP frame boundary.  This also gives the
> advantage of the IP frame in the cache, so when you push it upstream
> you are likely to get some cache hits.
>
But you don't have to copy it until you move it to user space.

Are dev_alloc_skb() created skbuffs cached?  If so, your previous argument
about destroying the application's use of the cache is moot.  If not,
nothing will be cached after the copy.

> > Also, to avoid bus contention, shouldn't the Rx buffers be on the local
> > bus?  Probably the BD's too.
>
> Yeah, that's why they designed the part this way.  Not many boards
> use it, though.
>
Sounds like a good place to put a config ifdef or something then.

> > .....  Unless we can figure out how to
> > put these in DPRAM, but it doesn't look possible for the FCC's.
>
> I have tried, and for some reason I couldn't make it work correctly.
> It should, I just haven't been back to debug it.
>
Yes, I saw.  I didn't have much luck with my weak attempt either.

I am having a really hard time determining from the conflicting diagrams
exactly where the two ports of the DPRAM reside, i.e. on what buses.  We
need to know this to set the proper flag in the FCRx if we put the BDs or
buffers in DPRAM (although I don't think the buffers should ever go
there).

Jerry Van Baren's mail would make it appear that you have to use the 60x
bus to access DPRAM through the FCC configuration.  That is why the
question above is so interesting to me.

> > ....  I don't
> > know if it is possible to allocate skbuffs in other than 60x bus SDRAM,
> > though.
>
> Probably not worth it.  The 66 MHz buses with burst mode and fast
> SDRAM are pretty sweet.  The skbuffs pool can get very large, certainly
> more than would fit into DPRAM, but it is sufficiently modular that you
> could put it into the local DRAM.
>
I still wish I could get 66 MHz working.

I guess we would need a whole allocation/free engine for local bus
RAM.  Are there any hooks into the current allocation engines for this
type of purpose?

> > .....  Unless the local bus proves to be a larger gain and we can't
> > do that there.
>
> The local bus can be an advantage for some applications.  For data
> or information the CPM accesses frequently (BDs, hash tables, channel
> tables, scheduler tables, etc.) this is a great benefit.  If you are
> also just packet/frame routing, this is a useful place for the data.
> It seems for other applications, where the PowerPC core is going to
> use the data frequently, the best place is the 60x memory.  It is
> one of those features that you shouldn't try to use just because it
> is there.
>
If it weren't for the checksum, the core wouldn't access it frequently until
it is copied to user space.


Thank you for your valuable advice and knowledge.  I just wanted to make
sure you know that I appreciate your help and input.

--
Brian Ford
Software Engineer
Vital Visual Simulation Systems
FlightSafety International
Phone: 314-551-8460
Fax:   314-551-8444



* Re: 2.5 or 2.4 kernel profiling
  2000-12-14 16:58                     ` Dan Malek
@ 2000-12-15  0:18                       ` Graham Stoney
  0 siblings, 0 replies; 26+ messages in thread
From: Graham Stoney @ 2000-12-15  0:18 UTC (permalink / raw)
  To: Dan Malek; +Cc: Graham Stoney, Brian Ford, linuxppc-embedded


Comparing DMA direct to skbufs with packet copy in the FEC driver, I said:
> But we get burst transfers into cache in either case,

Dan writes:
> No, because I originally allocated the receive buffers uncached.

That's comparing apples with oranges though.  I wasn't talking about what the
current driver does; I was comparing your suggestion of cached receive buffers
(which I also implemented in the FEC and benchmarked) with DMA direct to the
cached skbufs.  In both cases the CPU bursts the data into the cache when it
first goes to access it, so that doesn't explain why I found that DMA direct
to the skbuf was faster overall than just making the Rx buffer cached and
retaining the copy.  Both gave a measurable speed improvement over the
original driver.

Note that even when doing DMA direct to the skbuf, it's normal to have a size
threshold below which packets are copied into a newly allocated skbuf of the
exact size.  This avoids wasting skbuf space on tiny packets, and gives the
opportunity to nicely align the IP header.  As a result, small packets (where
IP stack processing dominates and header alignment is most important) are
processed exactly the way you describe, while large ones (where avoiding
copying the payload is most important) avoid being copied.  Hence you end up
with the best of both worlds.
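
The threshold itself ("copybreak") is only a handful of lines in the receive
path; a hedged sketch with an illustrative cutoff value follows.

    /* Hedged sketch of the copybreak pattern: small packets are copied into an
     * exact-size, aligned skbuff and the DMA skbuff stays in the ring; large
     * packets are passed up as-is and the caller must refill the ring slot.
     * RX_COPYBREAK is an illustrative value. */
    #define RX_COPYBREAK 256

    static struct sk_buff *rx_copybreak(struct net_device *dev,
                                        struct sk_buff *dma_skb, int len)
    {
        struct sk_buff *skb;

        if (len >= RX_COPYBREAK)
            return dma_skb;                     /* large packet: avoid the copy */

        skb = dev_alloc_skb(len + 2);
        if (skb == NULL)
            return dma_skb;                     /* no memory for a copy: pass the original up */
        skb->dev = dev;
        skb_reserve(skb, 2);                    /* nicely align the IP header */
        memcpy(skb_put(skb, len), dma_skb->data, len);
        return skb;                             /* dma_skb is left in the ring for reuse */
    }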

Regards,
Graham


End of thread; newest message: 2000-12-15  0:18 UTC.

Thread overview: 26 messages
     [not found] <Pine.GSO.4.21.0012071148420.515-100000@eos>
2000-12-07 18:11 ` 2.5 or 2.4 kernel profiling Brian Ford
2000-12-08 17:41   ` diekema_jon
2000-12-08 18:24     ` Brian Ford
2000-12-11  0:45   ` Graham Stoney
2000-12-11 15:27     ` Brian Ford
2000-12-12  2:36       ` Graham Stoney
2000-12-12  3:26         ` Dan Malek
2000-12-12  7:28           ` Graham Stoney
2000-12-12 16:32             ` Brian Ford
2000-12-12 16:58               ` Dan Malek
2000-12-12 17:17                 ` Brian Ford
2000-12-12 21:03                   ` Dan Malek
2000-12-13  1:15               ` Graham Stoney
2000-12-13 16:14                 ` Dan Malek
2000-12-13 17:23                   ` Arto Vuori
2000-12-13 17:33                     ` Dan Malek
2000-12-13 17:55                       ` Arto Vuori
2000-12-13 22:08                   ` Brian Ford
2000-12-13 22:45                     ` Jerry Van Baren
2000-12-13 22:53                     ` Dan Malek
2000-12-14 17:29                       ` FEC/FCC driver issues Brian Ford
2000-12-14  7:21                   ` 2.5 or 2.4 kernel profiling Graham Stoney
2000-12-14 16:58                     ` Dan Malek
2000-12-15  0:18                       ` Graham Stoney
2000-12-12 15:26         ` Brian Ford
2000-12-12 17:12           ` Jerry Van Baren
