memcpy prefetch

Linux MIPS Architecture development

* memcpy prefetch
@ 2005-04-06 12:30 Greg Weeks
  2005-04-06 20:08 ` Ralf Baechle
  0 siblings, 1 reply; 8+ messages in thread
From: Greg Weeks @ 2005-04-06 12:30 UTC
  To: linux-mips

In trying to understand the prefetch code in memcpy it looks like it's 
prefetching too far out in front of the loop. In the main aligned loop 
the loop copies 32 or 64 bytes of data and the prefetch is trying to 
prefetch 256 bytes ahead of the current copy. The prefetches should also 
pay attention to cache line size and they currently don't. If the line 
size is less than the copy size we are skipping prefetches that should 
be done. For the 4kc the line size is only 16 bytes. We should be doing 
a prefetch for each line. The src_unaligned_dst_aligned loop is even 
worse as it prefetches 288 bytes ahead of the copy and only copies 16 or 
32 bytes at a time.

Have I totally misunderstood the code?

Greg Weeks
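
[Editor's sketch, not the kernel code.  The routine discussed above is
hand-written MIPS assembly; the C below only illustrates the idea of
issuing one prefetch per cache line at a modest distance instead of a
fixed 256 bytes ahead.  The names copy_with_prefetch, line_size and
lines_ahead are hypothetical, and __builtin_prefetch is the GCC/Clang
builtin, which is only a hint and may compile to nothing.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy n bytes one cache line at a time, issuing one
 * prefetch hint per source line, lines_ahead lines in front of the line
 * currently being copied.  line_size is assumed to be the D-cache line
 * size in bytes (for example 16 on a core with 16-byte lines).
 */
void *copy_with_prefetch(void *dst, const void *src, size_t n,
                         size_t line_size, size_t lines_ahead)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off = 0;

    assert(line_size > 0);

    while (off < n) {
        size_t chunk = (n - off < line_size) ? n - off : line_size;
        size_t ahead = off + lines_ahead * line_size;

        /* Only hint at addresses that lie inside the source buffer. */
        if (ahead < n)
            __builtin_prefetch(s + ahead, 0 /* read */, 0 /* low reuse */);

        memcpy(d + off, s + off, chunk);
        off += chunk;
    }
    return dst;
}
```

Calling copy_with_prefetch(dst, src, len, 16, 2) would hint two 16-byte
lines ahead of each line copied; suitable values for both parameters
would have to be found by timing on the target CPU.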

* Re: memcpy prefetch
  2005-04-06 12:30 memcpy prefetch Greg Weeks
@ 2005-04-06 20:08 ` Ralf Baechle
  2005-04-07 12:14   ` Greg Weeks
  0 siblings, 1 reply; 8+ messages in thread
From: Ralf Baechle @ 2005-04-06 20:08 UTC
  To: Greg Weeks; +Cc: linux-mips

On Wed, Apr 06, 2005 at 08:30:52AM -0400, Greg Weeks wrote:

> In trying to understand the prefetch code in memcpy it looks like it's 
> prefetching too far out in front of the loop. In the main aligned loop 
> the loop copies 32 or 64 bytes of data and the prefetch is trying to 
> prefetch 256 bytes ahead of the current copy. The prefetches should also 
> pay attention to cache line size and they currently don't. If the line 
> size is less than the copy size we are skipping prefetches that should 
> be done. For the 4kc the line size is only 16 bytes. We should be doing 
> a prefetch for each line. The src_unaligned_dst_aligned loop is even 
> worse as it prefetches 288 bytes ahead of the copy and only copies 16 or 
> 32 bytes at a time.
> 
> Have I totally misunderstood the code?

Nope, you've understood that perfectly right.  The messy thing is that on
a whole bunch of systems we don't know the cacheline size before runtime,
so we have two choices: a) work under worst-case assumptions, which would
be 16 bytes, or b) do the same thing we're already doing for a bunch of
other performance-sensitive functions and generate them at runtime.  Choose
your poison ;-)

  Ralf
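
[Editor's sketch of the arithmetic behind choice a).  The function names
are hypothetical.  It counts how many prefetches one loop iteration
issues when a line size is assumed, and how many of those are redundant
when the real line turns out to be larger.]

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines covered by one aligned loop iteration. */
size_t lines_per_iteration(size_t copy_bytes, size_t line_size)
{
    assert(line_size > 0);
    return (copy_bytes + line_size - 1) / line_size;
}

/*
 * Prefetches issued under an assumed (smaller) line size that the actual
 * line size would not have needed.
 */
size_t redundant_prefetches(size_t copy_bytes, size_t assumed_line,
                            size_t actual_line)
{
    size_t issued = lines_per_iteration(copy_bytes, assumed_line);
    size_t needed = lines_per_iteration(copy_bytes, actual_line);

    return issued > needed ? issued - needed : 0;
}
```

For a 64-byte iteration, assuming 16-byte lines means 4 prefetches; on a
CPU with 32-byte lines only 2 are needed, so 2 per iteration are wasted
instructions.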

* Re: memcpy prefetch
  2005-04-06 20:08 ` Ralf Baechle
@ 2005-04-07 12:14   ` Greg Weeks
  2005-04-07 12:25     ` Ralf Baechle
                       ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Greg Weeks @ 2005-04-07 12:14 UTC
  To: Ralf Baechle; +Cc: linux-mips

Ralf Baechle wrote:

> On Wed, Apr 06, 2005 at 08:30:52AM -0400, Greg Weeks wrote:
>
> > In trying to understand the prefetch code in memcpy it looks like it's 
> > prefetching too far out in front of the loop. In the main aligned loop 
> > the loop copies 32 or 64 bytes of data and the prefetch is trying to 
> > prefetch 256 bytes ahead of the current copy. The prefetches should also 
> > pay attention to cache line size and they currently don't. If the line 
> > size is less than the copy size we are skipping prefetches that should 
> > be done. For the 4kc the line size is only 16 bytes. We should be doing 
> > a prefetch for each line. The src_unaligned_dst_aligned loop is even 
> > worse as it prefetches 288 bytes ahead of the copy and only copies 16 or 
> > 32 bytes at a time.
> >
> > Have I totally misunderstood the code?
>
> Nope, you've understood that perfectly right.  The messy thing is that on
> a whole bunch of systems we don't know the cacheline size before runtime,
> so we have two choices: a) work under worst-case assumptions, which would
> be 16 bytes, or b) do the same thing we're already doing for a bunch of
> other performance-sensitive functions and generate them at runtime.  Choose
> your poison ;-)

What's the performance hit for doing a pref on a cache line that is 
already pref'd? Does it turn into a nop, or do we get some horrible 
degenerate case? Are 64 bit processors always at least 32 byte cache 
line size? I don't really expect anyone to know the answers right now. I 
expect I'll need to time code to tell. This makes generating them at run 
time look better and better.

Greg Weeks

* Re: memcpy prefetch
  2005-04-07 12:14   ` Greg Weeks
@ 2005-04-07 12:25     ` Ralf Baechle
  2005-04-07 12:25     ` Michael Uhler
  2005-04-07 12:28     ` Dominic Sweetman
  2 siblings, 0 replies; 8+ messages in thread
From: Ralf Baechle @ 2005-04-07 12:25 UTC
  To: Greg Weeks; +Cc: linux-mips

On Thu, Apr 07, 2005 at 08:14:06AM -0400, Greg Weeks wrote:

> What's the performance hit for doing a pref on a cache line that is 
> already pref'd?

A wasted instruction.

(More complicated on certain multi-issue in-order processors such as the
SB1 CPU core.  Mentioning this for completeness; we shouldn't worry about
it here.)

>  Does it turn into a nop, or do we get some horrible 
> degenerate case? Are 64 bit processors always at least 32 byte cache 
> line size?

The smallest D-cache line I know of is 16 bytes.

> I don't really expect anyone to know the answers right now. I 
> expect I'll need to time code to tell. This makes generating them at run 
> time look better and better.

Indeed.  Initially when we started doing such things some people felt it
might be really bad to debug and everything, but in practice it's been a
relatively minor problem, so I guess the resistance against yet another
run-time generated group of functions is getting less.

One interesting issue to solve - memcpy, memmove and copy_user are combined
into a single big function, so the fixups for userspace accesses need to
be handled at runtime as well.

  Ralf

* RE: memcpy prefetch
  2005-04-07 12:14   ` Greg Weeks
  2005-04-07 12:25     ` Ralf Baechle
@ 2005-04-07 12:25     ` Michael Uhler
  2005-04-07 12:25       ` Michael Uhler
  2005-04-07 12:28     ` Dominic Sweetman
  2 siblings, 1 reply; 8+ messages in thread
From: Michael Uhler @ 2005-04-07 12:25 UTC
  To: 'Greg Weeks', 'Ralf Baechle'; +Cc: linux-mips

> What's the performance hit for doing a pref on a cache line that is 
> already pref'd? Does it turn into a nop, or do we get some horrible 
> degenerate case? Are 64 bit processors always at least 32 byte cache 
> line size? I don't really expect anyone to know the answers right now. I 
> expect I'll need to time code to tell. This makes generating them at run 
> time look better and better.

As very general rules of thumb:

- A pref to a line which is already in the cache takes a cycle in the
load/store unit and does not go back out to the memory subsystem.  There are
some possible ships-passing-in-the-night scenarios, but most processors do
what you'd expect.

- Most 64-bit processors are built for high-end applications, and most
high-end processors most likely have at least 32-byte lines.  One usually has
smaller line sizes when the processor is intended for lower-end
applications, or where the memory subsystem isn't all that good.

/gmu

---
Michael Uhler, Chief Technology Officer
MIPS Technologies, Inc.   Email: uhler at mips.com
1225 Charleston Road
Mountain View, CA 94043

* Re: memcpy prefetch
  2005-04-07 12:14   ` Greg Weeks
  2005-04-07 12:25     ` Ralf Baechle
  2005-04-07 12:25     ` Michael Uhler
@ 2005-04-07 12:28     ` Dominic Sweetman
  2005-04-07 14:05       ` Ralf Baechle
  2 siblings, 1 reply; 8+ messages in thread
From: Dominic Sweetman @ 2005-04-07 12:28 UTC
  To: Greg Weeks; +Cc: Ralf Baechle, linux-mips

Greg Weeks (greg.weeks@timesys.com) writes:

> What's the performance hit for doing a pref on a cache line that is 
> already pref'd? Does it turn into a nop, or do we get some horrible 
> degenerate case?

The specification for the prefetch instruction is fairly wide, to
permit different implementations to act differently.  It's perfectly
legal for it to be a no-op.  However, implementors are told that they
should not do anything which would make performance *worse* than if it
was a no-op.

> Are 64 bit processors always at least 32 byte cache line size?

There's no reliable correlation.  If you were to go round the
"autogenerated at kernel-startup-time" route, then you can figure out
the line size from the "Config" registers (MIPS32- or MIPS64-compliant
CPUs) or from a table of CPU IDs or otherwise (earlier CPUs)...

--
Dominic Sweetman
MIPS Technologies
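
[Editor's sketch of the Config-register route for MIPS32/MIPS64-compliant
CPUs.  In the architecture's Config1 register the DL field (bits 12..10)
encodes the D-cache line size as 2 << DL bytes, with 0 meaning no D-cache.
The function takes the register value as a parameter because reading it
needs a privileged mfc0; the function and macro names are hypothetical,
and returning 0 for the reserved encoding 7 is a choice of this sketch.]

```c
#include <assert.h>
#include <stdint.h>

/* Config1.DL: bits 12..10 encode the D-cache line size. */
#define CONFIG1_DL_SHIFT 10
#define CONFIG1_DL_MASK  0x7u

/*
 * Return the D-cache line size in bytes for a given Config1 value, or 0
 * if no D-cache is reported (DL == 0) or the encoding is reserved (DL == 7).
 */
unsigned int dcache_line_size_from_config1(uint32_t config1)
{
    unsigned int dl = (config1 >> CONFIG1_DL_SHIFT) & CONFIG1_DL_MASK;

    if (dl == 0 || dl == 7)
        return 0;
    return 2u << dl;
}

/* Usage example: DL=3 encodes 16-byte lines, DL=4 encodes 32-byte lines. */
void config1_decode_examples(void)
{
    assert(dcache_line_size_from_config1(3u << CONFIG1_DL_SHIFT) == 16);
    assert(dcache_line_size_from_config1(4u << CONFIG1_DL_SHIFT) == 32);
    assert(dcache_line_size_from_config1(0) == 0);
}
```

Earlier CPUs without these registers would need the table of CPU IDs
mentioned above instead.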

* Re: memcpy prefetch
  2005-04-07 12:28     ` Dominic Sweetman
@ 2005-04-07 14:05       ` Ralf Baechle
  0 siblings, 0 replies; 8+ messages in thread
From: Ralf Baechle @ 2005-04-07 14:05 UTC
  To: Dominic Sweetman; +Cc: Greg Weeks, linux-mips

On Thu, Apr 07, 2005 at 01:28:33PM +0100, Dominic Sweetman wrote:

> Greg Weeks (greg.weeks@timesys.com) writes:
> 
> > What's the performance hit for doing a pref on a cache line that is 
> > already pref'd? Does it turn into a nop, or do we get some horrible 
> > degenerate case?
> 
> The specification for the prefetch instruction is fairly wide, to
> permit different implementations to act differently.  It's perfectly
> legal for it to be a no-op.

The R5000 series actually treats it as a nop - which is why Linux treats
the R5000 as a CPU that does not have prefetch.

>  However, implementors are told that they
> should not do anything which would make performance *worse* than if it
> was a no-op.

Okay, but that should be obvious, I'd hope :-)

> > Are 64 bit processors always at least 32 byte cache line size?
> 
> There's no reliable correlation.  If you were to go round the
> "autogenerated at kernel-startup-time" route, then you can figure out
> the line size from the "Config" registers (MIPS32- or MIPS64-compliant
> CPUs) or from a table of CPU IDs or otherwise (earlier CPUs)...

Linux already contains a huge chunk of code to detect these and many
more cache properties for mips32, mips64 and more.  We also try to make
sure the compiler can do constant folding etc. as far as possible for
a particular platform.  All it takes is a suitable cpu-feature-overrides.h.
In the absence of that, Linux will build cache code that pretty much supports
everything in the universe - at the cost of some code bloat.

  Ralf
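
[Editor's sketch of the constant-folding idea.  This is not the contents
of any real cpu-feature-overrides.h; PLATFORM_DCACHE_LINE_SIZE,
probed_dcache_line_size, dcache_line_size() and prefetches_for() are
hypothetical names.  It shows how a per-platform constant lets the
compiler fold the line-size arithmetic, while the fallback reads a value
probed at boot.]

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a value detected at boot (hypothetical variable). */
unsigned int probed_dcache_line_size = 16;

/*
 * A platform header may pin the line size to a constant.  Remove this
 * definition to fall back to the probed value.
 */
#define PLATFORM_DCACHE_LINE_SIZE 32

#ifdef PLATFORM_DCACHE_LINE_SIZE
#define dcache_line_size() ((unsigned int)PLATFORM_DCACHE_LINE_SIZE)
#else
#define dcache_line_size() (probed_dcache_line_size)
#endif

/*
 * With a constant line size the compiler can reduce the division below
 * to a shift; with the probed value it stays a runtime computation.
 */
size_t prefetches_for(size_t copy_bytes)
{
    assert(dcache_line_size() > 0);
    return (copy_bytes + dcache_line_size() - 1) / dcache_line_size();
}
```

With the override at 32, prefetches_for(64) evaluates to 2; a generic
build would carry the probed path instead.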
