* copy_4K_page() doesn't use dcbtst?
From: Hollis Blanchard @ 2006-08-28 17:35 UTC
To: Paul Mackerras; +Cc: linuxppc-dev, xen-ppc-devel
Hi Paul, some Xen people were just noticing that copy_4K_page
(arch/powerpc/lib/copypage_64.S) doesn't use the dcbtst instruction. Why
doesn't it help there?
--
Hollis Blanchard
IBM Linux Technology Center
* Re: copy_4K_page() doesn't use dcbtst?
From: Paul Mackerras @ 2006-08-29 0:16 UTC
To: Hollis Blanchard; +Cc: linuxppc-dev, xen-ppc-devel
Hollis Blanchard writes:
> Hi Paul, some Xen people were just noticing that copy_4K_page
> (arch/powerpc/lib/copypage_64.S) doesn't use the dcbtst instruction. Why
> doesn't it help there?
Why would we want to read the cache lines for the destination from
memory when we're only going to overwrite them completely anyway?
A stronger argument would be for using dcbz, but IIRC it actually made
things slower (on POWER4 at least). I suspect the hardware is
gathering the stores for the whole of each cache line automatically,
so using dcbz doesn't provide any benefit.
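To make the trade-off above concrete, here is a toy model (a sketch, not a measurement and not part of the original message) that counts cache-line transfers for one 4K page copy starting with a cold cache. The 128-byte line size, the write-allocate behaviour of plain stores, and the function name bus_traffic are all assumptions made for illustration; real POWER4 or 970 behaviour may differ.

```python
# Toy traffic model for one page copy with a cold cache.
# Every constant and name here is an illustrative assumption.
PAGE_SIZE = 4096   # bytes copied by copy_4K_page
LINE_SIZE = 128    # assumed cache line size in bytes


def bus_traffic(strategy, store_gathering=False):
    """Return (lines_read, lines_written) for copying one page.

    strategy is one of:
      "dcbtst" - touch each destination line for store before writing it
      "dcbz"   - zero each destination line in cache before writing it
      "plain"  - ordinary stores with no cache hints
    store_gathering only matters for "plain": it models hardware that
    merges the stores covering a whole line and so skips the line fetch.
    """
    lines = PAGE_SIZE // LINE_SIZE
    src_reads = lines   # every source line must be read once
    writes = lines      # every destination line is eventually written back

    if strategy == "dcbtst":
        dst_reads = lines          # the touch fetches data that is then overwritten
    elif strategy == "dcbz":
        dst_reads = 0              # line is created as zeros, no fetch from memory
    elif strategy == "plain":
        dst_reads = 0 if store_gathering else lines
    else:
        raise ValueError("unknown strategy: %r" % (strategy,))

    return src_reads + dst_reads, writes


if __name__ == "__main__":
    for name, gather in [("dcbtst", False), ("dcbz", False),
                         ("plain", False), ("plain", True)]:
        reads, writes = bus_traffic(name, store_gathering=gather)
        print(name, "gathering" if gather else "no gathering", reads, writes)
```

Under these assumptions dcbtst doubles the read traffic, while dcbz and hardware store gathering both avoid the destination reads. That fits the suspicion that dcbz adds nothing where gathering already happens. The model ignores latency and instruction overhead, so it cannot explain dcbz being slower.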
I did a lot of measurements of memory copy speed on POWER4 (using
different copy loops, copy sizes, alignments, cache hot/cold cases)
and the copy_4K_page loop is the fastest I could come up with for
POWER4. If anyone can come up with a routine that is measurably
faster on current machines, I'm happy to look at it, of course.
Paul.
* Re: copy_4K_page() doesn't use dcbtst?
From: Hollis Blanchard @ 2006-08-29 2:11 UTC
To: Paul Mackerras; +Cc: linuxppc-dev, xen-ppc-devel
On Tue, 2006-08-29 at 10:16 +1000, Paul Mackerras wrote:
> Hollis Blanchard writes:
>
> > Hi Paul, some Xen people were just noticing that copy_4K_page
> > (arch/powerpc/lib/copypage_64.S) doesn't use the dcbtst instruction. Why
> > doesn't it help there?
>
> Why would we want to read the cache lines for the destination from
> memory when we're only going to overwrite them completely anyway?
>
> A stronger argument would be for using dcbz, but IIRC it actually made
> things slower (on POWER4 at least). I suspect the hardware is
> gathering the stores for the whole of each cache line automatically,
> so using dcbz doesn't provide any benefit.
Yes, dcbz makes more sense.
> I did a lot of measurements of memory copy speed on POWER4 (using
> different copy loops, copy sizes, alignments, cache hot/cold cases)
> and the copy_4K_page loop is the fastest I could come up with for
> POWER4. If anyone can come up with a routine that is measurably
> faster on current machines, I'm happy to look at it, of course.
I figured you had done measurements; we were just curious about the
unexpected results. Thanks!
--
Hollis Blanchard
IBM Linux Technology Center
* Re: copy_4K_page() doesn't use dcbtst?
From: Segher Boessenkool @ 2006-08-29 6:57 UTC
To: Paul Mackerras; +Cc: linuxppc-dev, Hollis Blanchard, xen-ppc-devel
> A stronger argument would be for using dcbz, but IIRC it actually made
> things slower (on POWER4 at least). I suspect the hardware is
> gathering the stores for the whole of each cache line automatically,
> so using dcbz doesn't provide any benefit.
On the 970, at least, dcbz still seems to be a nice win. Do you have any
good benchmarks I could run?
> I did a lot of measurements of memory copy speed on POWER4 (using
> different copy loops, copy sizes, alignments, cache hot/cold cases)
> and the copy_4K_page loop is the fastest I could come up with for
> POWER4.
Yeah, POWER4 is quite a different beast (its memory subsystem,
anyway). I'm surprised dcbz hurt though; did you schedule it
early enough before the actual data copy?
Segher