From: David Jander <david.jander@protonic.nl>
To: linuxppc-dev@ozlabs.org
Cc: munroesj@us.ibm.com, John Rigby <jrigby@freescale.com>
Subject: Re: Efficient memcpy()/memmove() for G2/G3 cores...
Date: Tue, 2 Sep 2008 15:12:09 +0200
Message-ID: <200809021512.10132.david.jander@protonic.nl>
In-Reply-To: <1220261775.5234.217.camel@gentoo-jocke.transmode.se>
On Monday 01 September 2008 11:36:15 Joakim Tjernlund wrote:
>[...]
> > Then I started my test program with LD_PRELOAD=...
> >
> > My test program only copies big chunks of aligned memory, so it will only
> > test for maximum throughput (such as copying video frames). I will make a
> > better one, to measure throughput on different sized blocks of aligned
> > and unaligned memory, but first I want to find out why I can't seem to
> > get even close to the expected RAM bandwidth (bursts occur at 1.6
> > Gbyte/s, sustained transfers might be able to reach 400 Mbyte/s in
> > theory, taking into account the video controller eating almost half of
> > it, I'd like to get somewhere close to 200).
> >
> > The result is quite a bit better than that of glibc-2.7 (13.2 Mbyte/s -->
> > 22 Mbyte/s), but still far from the 71.5 Mbyte/s achieved when using
> > bigger strides of 16 registers load/store at a time.
> > Note, that this is copy performance, one-way througput should be double
> > these figures.
>
> Yeah, the code is trying to do a reasonable job without knowing what
> micro arch it is running on. These could probably go to glibc
> as new general purpose memxxx() routines. You will probably see
> a big increase once dcbz is added to the copy/memset functions.
>
> Fire away :)
OK, here I go:
I have made some astonishing discoveries. In the meantime, I'd like to post
the source code I used somewhere; any suggestions? To this list?
There seem to be some substantial differences between the e300 core used in
the MPC5200B and the one in the MPC5121e (besides the MPC5121e having double
the amount of cache). In terms of memcpy() performance, these differences
amount to the following. The tests were done with vanilla glibc (versions
2.6.1 and 2.7, without any powerpc-specific memcpy() optimizations), Gunnar
von Boehn's memcpy_e300, and my tweaked version, memcpy_e300_dj, which
basically uses 16-register strides instead of the 4-register strides in
Gunnar's example.
memcpy() peak performance (RAM memory throughput):
  Board     memcpy() variant               Mbyte/s
  MPC5200B  glibc-2.6, no optimizations        136
  MPC5121e  glibc-2.7, no optimizations         30
  MPC5200B  memcpy_e300                        225
  MPC5121e  memcpy_e300                        130
  MPC5200B  memcpy_e300_dj                     200
  MPC5121e  memcpy_e300_dj                     202
For the MPC5121e, 16-register strides seem to be optimal, whereas for the
MPC5200B, 4-register strides give the best performance. Also, plain C memcpy()
performance on the MPC5121e is terribly poor! Does anyone know why? I don't
really understand these results.
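For readers who want to reproduce figures of this kind, here is a minimal
sketch of how a copy rate in Mbyte/s could be measured. It is not the test
program used for the numbers above. The function names (mbytes_per_sec,
bus_traffic_mbytes_per_sec, measure_memcpy) are hypothetical, 1 Mbyte is
assumed to mean 1,000,000 bytes, and clock() is used for timing, which
reports CPU time rather than wall-clock time.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumption: 1 Mbyte = 1,000,000 bytes. */
double mbytes_per_sec(double bytes, double seconds)
{
    return bytes / seconds / 1e6;
}

/* Every copied byte is read once and written once, so the memory
 * traffic is twice the copy rate. */
double bus_traffic_mbytes_per_sec(double copy_rate)
{
    return 2.0 * copy_rate;
}

/* Written after every copy so the compiler cannot drop the memcpy(). */
static volatile char sink;

/* Copy `size` bytes `reps` times and return the copy rate in Mbyte/s.
 * Returns -1.0 on bad arguments, allocation failure, or if the run
 * was too short for clock() to resolve. */
double measure_memcpy(size_t size, int reps)
{
    char *src = malloc(size);
    char *dst = malloc(size);
    clock_t t0, t1;
    int i;

    if (src == NULL || dst == NULL || size == 0 || reps <= 0) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0x5a, size);  /* touch all pages before timing */
    memset(dst, 0, size);

    t0 = clock();
    for (i = 0; i < reps; i++) {
        src[0] = (char)i;     /* make each iteration distinct */
        memcpy(dst, src, size);
        sink = dst[size - 1];
    }
    t1 = clock();

    free(src);
    free(dst);
    if (t1 <= t0)
        return -1.0;
    return mbytes_per_sec((double)size * reps,
                          (double)(t1 - t0) / CLOCKS_PER_SEC);
}
```

To measure RAM throughput rather than cache throughput, the buffer size
should be much larger than the data cache, for example
measure_memcpy(16u << 20, 32).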
Some information on the test hardware:
The MPC5200B-based board has 64 Mbyte DDR-SDRAM, 32 bits wide (two x16 chips),
running Ubuntu 7.10 with kernel 2.6.19.2.
The MPC5121e-based board has 256 Mbyte DDR2-SDRAM, 32 bits wide (two x16
chips), running Ubuntu 8.04.1 with kernel 2.6.24.5 from Freescale LTIB with
the DIU turned OFF. When the DIU is turned on, maximum throughput drops from
202 to 196 Mbyte/s.
The memcpy_e300 variants basically use 4- or 16-register load/store strides,
cache-line alignment and the dcbz/dcbt cache-manipulation instructions to
tweak performance. I have not tried interleaving integer and FPU instructions.
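To illustrate what a 16-register stride means, here is a rough sketch in
portable C. It is not memcpy_e300 or memcpy_e300_dj: the name copy_stride16
is hypothetical, both pointers are assumed to be 4-byte aligned and
non-overlapping, and GCC's __builtin_prefetch() merely stands in for a
dcbt-style hint. The dcbz step on the destination is left out because it
cannot be expressed portably.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of 32-bit words moved per loop iteration (the "stride"). */
#define STRIDE_WORDS 16

/* Hypothetical illustration: copy n bytes from src to dst.
 * Assumes 4-byte aligned, non-overlapping buffers. */
void *copy_stride16(void *dst, const void *src, size_t n)
{
    uint32_t *d = dst;
    const uint32_t *s = src;
    uint32_t r[STRIDE_WORDS];  /* stands in for 16 registers */
    size_t i;

    assert(((uintptr_t)dst & 3) == 0 && ((uintptr_t)src & 3) == 0);

    while (n >= sizeof r) {
#if defined(__GNUC__)
        /* Hint that the next source block will be needed soon. */
        __builtin_prefetch(s + STRIDE_WORDS);
#endif
        /* Issue all 16 loads first, then all 16 stores. */
        for (i = 0; i < STRIDE_WORDS; i++)
            r[i] = s[i];
        for (i = 0; i < STRIDE_WORDS; i++)
            d[i] = r[i];
        s += STRIDE_WORDS;
        d += STRIDE_WORDS;
        n -= sizeof r;
    }
    memcpy(d, s, n);  /* remaining tail of fewer than 64 bytes */
    return dst;
}
```

Changing STRIDE_WORDS to 4 gives the shape of the 4-register variant.
Whether the compiler really keeps r[] in registers depends on the
optimization level and the target, which is one reason such routines are
usually written in assembly.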
Does anybody have any suggestions about where to start looking for an
explanation of these results? I have the impression that there is something
wrong with my setup, or with the e300c4 core, or both, but what?
Greetings,
--
David Jander