* memcpy and prefetch
@ 2009-01-27 23:00 Michael Sundius
2009-01-27 23:07 ` David Daney
0 siblings, 1 reply; 14+ messages in thread
From: Michael Sundius @ 2009-01-27 23:00 UTC (permalink / raw)
To: linux-mips, VomLehn, David, msundius
I know this topic has been written about, so excuse me if I am
redundant.
I saw lots of talk in the archives, but I don't know whether a solution
was ever arrived at. So:
What is the current state of the use of prefetch in memcpy()? It seems that
it is #undef-ed if CONFIG_DMA_COHERENT is not turned on.
Is this still because memcpy does not check to prevent a prefetch of
addresses beyond the end of the buffer?
If so, why was a solution abandoned?
Also, has anyone out there written a memcpy that does use prefetch
intelligently (for mips32, that is)?
thanks
Mike
- - - - - Cisco - - - - -
This e-mail and any attachments may contain information which is confidential,
proprietary, privileged or otherwise protected by law. The information is solely
intended for the named addressee (or a person responsible for delivering it to
the addressee). If you are not the intended recipient of this message, you are
not authorized to read, print, retain, copy or disseminate this message or any
part of it. If you have received this e-mail in error, please notify the sender
immediately by return e-mail and delete it from your computer.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: memcpy and prefetch
2009-01-27 23:00 memcpy and prefetch Michael Sundius
@ 2009-01-27 23:07 ` David Daney
2009-01-28 10:37 ` Ralf Baechle
2009-01-28 19:28 ` Michael Sundius
0 siblings, 2 replies; 14+ messages in thread
From: David Daney @ 2009-01-27 23:07 UTC (permalink / raw)
To: Michael Sundius; +Cc: linux-mips, VomLehn, David, msundius
Michael Sundius wrote:
> I know this topic has been written about but so excuse me if I am
> redundant.
> I saw lots of talk in the archives but I don't know if a solution was
> ever arrived
> at. so:
>
> what is the current state of the use of prefetch in memcpy()? it seems that
> it is #undef-ed if CONFIG_DMA_COHERENT is not turned on.
>
> is this still because the memcpy does not check to prevent a prefetch of
> addresses beyond the end of the buffer?
>
> If so, what was the reason a solution was abandoned....
>
> also has anyone out there written a memcopy that does use prefetch
> intelligently (for mips32 that is)?
>
The Cavium OCTEON port overrides the default memcpy and does use
prefetch. It was recently merged (2.6.29-rc2). Look at octeon-memcpy.S
I have thought that memcpy could be generated by mm/page.c as copy_page
and clear_page are.
David Daney
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: memcpy and prefetch
2009-01-27 23:07 ` David Daney
@ 2009-01-28 10:37 ` Ralf Baechle
2009-01-28 15:28 ` Atsushi Nemoto
2009-01-28 19:28 ` Michael Sundius
1 sibling, 1 reply; 14+ messages in thread
From: Ralf Baechle @ 2009-01-28 10:37 UTC (permalink / raw)
To: David Daney; +Cc: Michael Sundius, linux-mips, VomLehn, David, msundius
On Tue, Jan 27, 2009 at 03:07:45PM -0800, David Daney wrote:
> The Cavium OCTEON port overrides the default memcpy and does use
> prefetch. It was recently merged (2.6.29-rc2). Look at octeon-memcpy.S
>
> I have thought that memcpy could be generated by mm/page.c as copy_page
> and clear_page are.
No, these two only generate copy_page and clear_page. Thiemo and I were
considering extending this to a full memcpy, however.
Ralf
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: memcpy and prefetch
2009-01-28 10:37 ` Ralf Baechle
@ 2009-01-28 15:28 ` Atsushi Nemoto
2009-01-28 18:30 ` Ralf Baechle
0 siblings, 1 reply; 14+ messages in thread
From: Atsushi Nemoto @ 2009-01-28 15:28 UTC (permalink / raw)
To: ralf; +Cc: ddaney, msundius, linux-mips, dvomlehn, msundius
On Wed, 28 Jan 2009 10:37:53 +0000, Ralf Baechle <ralf@linux-mips.org> wrote:
> > The Cavium OCTEON port overrides the default memcpy and does use
> > prefetch. It was recently merged (2.6.29-rc2). Look at octeon-memcpy.S
> >
> > I have thought that memcpy could be generated by mm/page.c as copy_page
> > and clear_page are.
>
> No, these two only generate copy_page and clear_page. I and Thiemo were
> considering to extend this to a full memcopy however.
BTW, this code in memcpy.S (and memcpy-inatomic.S) looks weird.
#if !defined(CONFIG_DMA_COHERENT) || !defined(CONFIG_DMA_IP27)
#undef CONFIG_CPU_HAS_PREFETCH
#endif
#ifdef CONFIG_MIPS_MALTA
#undef CONFIG_CPU_HAS_PREFETCH
#endif
Is there any configuration which does not undef CONFIG_CPU_HAS_PREFETCH ? ;-)
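To spell out why that condition catches nearly every configuration, here is a small stand-alone C sketch (not kernel code). It models the two Kconfig symbols as booleans. The function names are made up for illustration, and the "intended" predicate is only a guess at what the expression was meant to say.

```c
#include <stdbool.h>

/*
 * The condition as written in memcpy.S:
 *   #if !defined(CONFIG_DMA_COHERENT) || !defined(CONFIG_DMA_IP27)
 * Returns true when CONFIG_CPU_HAS_PREFETCH would be #undef-ed.
 */
bool old_disables_prefetch(bool dma_coherent, bool dma_ip27)
{
	return !dma_coherent || !dma_ip27;
}

/*
 * Assumed intent (hypothetical): disable prefetch only when NEITHER
 * coherency symbol is defined, i.e. on non-coherent systems.
 */
bool intended_disables_prefetch(bool dma_coherent, bool dma_ip27)
{
	return !dma_coherent && !dma_ip27;
}
```

With the old expression, prefetch survives only when both symbols are defined at once. A platform that defines just one of them (for example only CONFIG_DMA_COHERENT) still has prefetch turned off.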
---
Atsushi Nemoto
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: memcpy and prefetch
2009-01-28 15:28 ` Atsushi Nemoto
@ 2009-01-28 18:30 ` Ralf Baechle
2009-01-29 12:36 ` Atsushi Nemoto
0 siblings, 1 reply; 14+ messages in thread
From: Ralf Baechle @ 2009-01-28 18:30 UTC (permalink / raw)
To: Atsushi Nemoto; +Cc: ddaney, msundius, linux-mips, dvomlehn, msundius
On Thu, Jan 29, 2009 at 12:28:50AM +0900, Atsushi Nemoto wrote:
> #if !defined(CONFIG_DMA_COHERENT) || !defined(CONFIG_DMA_IP27)
> #undef CONFIG_CPU_HAS_PREFETCH
> #endif
> #ifdef CONFIG_MIPS_MALTA
> #undef CONFIG_CPU_HAS_PREFETCH
> #endif
>
> Are there any configuration which do not undef CONFIG_CPU_HAS_PREFETCH ? ;-)
Yes, that's been a long standing one. Fixed also.
Ralf
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 52c80c2..71e8ebd 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -351,7 +351,7 @@ config SGI_IP27
select ARC64
select BOOT_ELF64
select DEFAULT_SGI_PARTITION
- select DMA_IP27
+ select DMA_COHERENT
select SYS_HAS_EARLY_PRINTK
select HW_HAS_PCI
select NR_CPUS_DEFAULT_64
@@ -761,9 +761,6 @@ config CFE
config DMA_COHERENT
bool
-config DMA_IP27
- bool
-
config DMA_NONCOHERENT
bool
select DMA_NEED_PCI_MAP_STATE
diff --git a/arch/mips/configs/ip27_defconfig b/arch/mips/configs/ip27_defconfig
index 34ea319..f2baea3 100644
--- a/arch/mips/configs/ip27_defconfig
+++ b/arch/mips/configs/ip27_defconfig
@@ -53,7 +53,7 @@ CONFIG_GENERIC_TIME=y
CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_ARC=y
-CONFIG_DMA_IP27=y
+CONFIG_DMA_COHERENT=y
CONFIG_EARLY_PRINTK=y
CONFIG_SYS_HAS_EARLY_PRINTK=y
# CONFIG_NO_IOPORT is not set
diff --git a/arch/mips/lib/memcpy-inatomic.S b/arch/mips/lib/memcpy-inatomic.S
index 736d0fb..68853a0 100644
--- a/arch/mips/lib/memcpy-inatomic.S
+++ b/arch/mips/lib/memcpy-inatomic.S
@@ -21,7 +21,7 @@
* end of memory on some systems. It's also a seriously bad idea on non
* dma-coherent systems.
*/
-#if !defined(CONFIG_DMA_COHERENT) || !defined(CONFIG_DMA_IP27)
+#ifdef CONFIG_DMA_NONCOHERENT
#undef CONFIG_CPU_HAS_PREFETCH
#endif
#ifdef CONFIG_MIPS_MALTA
diff --git a/arch/mips/lib/memcpy.S b/arch/mips/lib/memcpy.S
index c06cccf..56a1f85 100644
--- a/arch/mips/lib/memcpy.S
+++ b/arch/mips/lib/memcpy.S
@@ -21,7 +21,7 @@
* end of memory on some systems. It's also a seriously bad idea on non
* dma-coherent systems.
*/
-#if !defined(CONFIG_DMA_COHERENT) || !defined(CONFIG_DMA_IP27)
+#ifdef CONFIG_DMA_NONCOHERENT
#undef CONFIG_CPU_HAS_PREFETCH
#endif
#ifdef CONFIG_MIPS_MALTA
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: memcpy and prefetch
2009-01-27 23:07 ` David Daney
2009-01-28 10:37 ` Ralf Baechle
@ 2009-01-28 19:28 ` Michael Sundius
2009-01-28 19:54 ` David Daney
1 sibling, 1 reply; 14+ messages in thread
From: Michael Sundius @ 2009-01-28 19:28 UTC (permalink / raw)
To: David Daney; +Cc: linux-mips, VomLehn, David, msundius
David Daney wrote:
> Michael Sundius wrote:
>> I know this topic has been written about but so excuse me if I am
>> redundant.
>> I saw lots of talk in the archives but I don't know if a solution was
>> ever arrived
>> at. so:
>>
>> what is the current state of the use of prefetch in memcpy()? it
>> seems that
>> it is #undef-ed if CONFIG_DMA_COHERENT is not turned on.
>>
>> is this still because the memcpy does not check to prevent a prefetch of
>> addresses beyond the end of the buffer?
>>
>> If so, what was the reason a solution was abandoned....
>>
>> also has anyone out there written a memcopy that does use prefetch
>> intelligently (for mips32 that is)?
>>
>
> The Cavium OCTEON port overrides the default memcpy and does use
> prefetch. It was recently merged (2.6.29-rc2). Look at octeon-memcpy.S
>
> I have thought that memcpy could be generated by mm/page.c as
> copy_page and clear_page are.
>
> David Daney
David,
Thanks! That's really useful. I have a few questions, though:
1) So you made this function explicitly for the Octeon, and that is
because you know the cache line is 128 bytes long
on the Octeon? Is that right?
2) It seems as though you always prefetch the first cache line. What
happens if the memcpy is less than one cache line long?
Wouldn't you risk prefetching beyond the end of the buffer?
3) Why do you only do the "pref 0 offset(src)" and not a prefetch for
the destination?
4) On line 244 you check to see if len is less than 128, while in the
other checks you check for (offset)+1.
Why would you not do the prefetch if len was exactly 256 bytes (or 128
in the case of line 196)?
thanks.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: memcpy and prefetch
2009-01-28 19:28 ` Michael Sundius
@ 2009-01-28 19:54 ` David Daney
2009-01-28 21:52 ` Chad Reese
0 siblings, 1 reply; 14+ messages in thread
From: David Daney @ 2009-01-28 19:54 UTC (permalink / raw)
To: Michael Sundius; +Cc: linux-mips, VomLehn, David, msundius
Michael Sundius wrote:
> David Daney wrote:
[...]
>> The Cavium OCTEON port overrides the default memcpy and does use
>> prefetch. It was recently merged (2.6.29-rc2). Look at octeon-memcpy.S
[...]
>
> thanks!!! that's really useful. I have a few questions tho:
>
> 1) So you made this function explicitly for the Octeon.
I didn't personally write the code. But yes, as its name implies, it was
created specifically for OCTEON.
> and that is
> because you know the cache-line is 128 bytes long
> on the octeon? is that right?
That and we know exactly how prefetching works on this CPU *and* we know
we can do unaligned accesses quickly.
>
> 2) It seems as though you always prefectch the first cache line.. what
> happens if the memcopy is less than 1 cache line long?
> wouldn't you risk prefetching beyond the end of the buffer?
It is a risk we were willing to take. Cache lines are loaded with
unneeded data all the time.
>
> 3) why do you only do the "pref 0 offset(src)" and not a prefetch for
> the destination?
I don't know. But the interaction between the writeback buffers, the
cache and RAM is somewhat complicated. It may not be enough of a win
to overcome the cost of the code that would determine when to do it.
>
> 4) on line 244 you check to see if len is less than 128. while on the
> other checks you check for (offset)+1
> why would you not do the prefetch if len was exactly 256 bytes? (or 128
> in the case of line 196)?
We are always prefetching 256 bytes ahead of the current position. If
we prefetch beyond the end of the buffer, it is truly wasting memory
bandwidth. Also, if we prefetch to memory addresses where there is no
physical memory, bad things happen.
David Daney
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: memcpy and prefetch
2009-01-28 19:54 ` David Daney
@ 2009-01-28 21:52 ` Chad Reese
0 siblings, 0 replies; 14+ messages in thread
From: Chad Reese @ 2009-01-28 21:52 UTC (permalink / raw)
To: David Daney; +Cc: Michael Sundius, linux-mips, VomLehn, David, msundius
> Michael Sundius wrote:
>> David Daney wrote:
>> 2) It seems as though you always prefectch the first cache line.. what
>> happens if the memcopy is less than 1 cache line long?
>> wouldn't you risk prefetching beyond the end of the buffer?
>
> It is a risk we were willing to take. Cache lines are loaded with
> unneeded data all the time.
If you assume that the memcpy is going to copy at least one byte, then
it is always safe to prefetch the first source address.
>> 3) why do you only do the "pref 0 offset(src)" and not a prefetch for
>> the destination?
>
> I don't know. But the interaction between the writeback buffers, the
> cache and RAM are somewhat complicated. It may not be enough of a win
> to overcome the cost of the code that would determine when to do it.
Octeon's write buffer merges all writes to single store transactions.
Since this store contains a full cache line, the L2 controller
automatically optimizes for it. With Octeon, the prepare to store
operations normally slow things down by creating needless bus traffic.
There are a few times where it is useful, but a generic memcpy isn't one
of them.
>> 4) on line 244 you check to see if len is less than 128. while on the
>> other checks you check for (offset)+1
>> why would you not do the prefetch if len was exactly 256 bytes? (or 128
>> in the case of line 196)?
>
> We are always prefetching 256 bytes ahead of the current position. If
> we prefetch beyound the end of the buffer it is truly wasting memory
> bandwidth, also if we prefetch to memory addresses where there is no
> physical memory, bad things happen.
We prefetch 256 bytes ahead on every 128 bytes copied except for the
last two. Since we are fetching two lines ahead, the last two iterations
don't need prefetches. I think the code stops prefetching at the correct
time, but there is always the possibility that I messed up...
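The scheme described above can be sketched in portable C roughly as follows. This is not the code from octeon-memcpy.S: the function name and the counter are hypothetical, __builtin_prefetch is the GCC/Clang builtin standing in for the MIPS "pref" instruction, and the 128-byte line and 256-byte look-ahead are the figures quoted in this thread.

```c
#include <stddef.h>
#include <string.h>

#define LINE_BYTES  128u               /* cache line size quoted for OCTEON */
#define AHEAD_BYTES (2u * LINE_BYTES)  /* prefetch distance: two lines ahead */

/* Hypothetical counter, present only so the behaviour can be observed. */
unsigned long prefetches_issued;

void *copy_bounded_prefetch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* If at least one byte is copied, the first source line is needed. */
	if (len > 0) {
		__builtin_prefetch(s, 0);
		prefetches_issued++;
	}

	while (len >= LINE_BYTES) {
		/*
		 * s + AHEAD_BYTES lies inside the source buffer only if more
		 * than AHEAD_BYTES remain, hence the "(offset)+1" style check.
		 */
		if (len > AHEAD_BYTES) {
			__builtin_prefetch(s + AHEAD_BYTES, 0);
			prefetches_issued++;
		}
		memcpy(d, s, LINE_BYTES);
		d += LINE_BYTES;
		s += LINE_BYTES;
		len -= LINE_BYTES;
	}

	memcpy(d, s, len);	/* tail shorter than one line */
	return dst;
}
```

For a 512-byte copy this issues the initial prefetch plus two look-ahead prefetches, and the last two 128-byte iterations issue none, which is why a length of exactly 256 triggers no look-ahead prefetch.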
Chad
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: memcpy and prefetch
2009-01-28 18:30 ` Ralf Baechle
@ 2009-01-29 12:36 ` Atsushi Nemoto
2009-01-29 15:58 ` Ralf Baechle
0 siblings, 1 reply; 14+ messages in thread
From: Atsushi Nemoto @ 2009-01-29 12:36 UTC (permalink / raw)
To: ralf; +Cc: ddaney, msundius, linux-mips, dvomlehn, msundius
On Wed, 28 Jan 2009 18:30:47 +0000, Ralf Baechle <ralf@linux-mips.org> wrote:
> --- a/arch/mips/lib/memcpy.S
> +++ b/arch/mips/lib/memcpy.S
> @@ -21,7 +21,7 @@
> * end of memory on some systems. It's also a seriously bad idea on non
> * dma-coherent systems.
> */
> -#if !defined(CONFIG_DMA_COHERENT) || !defined(CONFIG_DMA_IP27)
> +#ifdef CONFIG_DMA_NONCOHERENT
> #undef CONFIG_CPU_HAS_PREFETCH
> #endif
> #ifdef CONFIG_MIPS_MALTA
This makes IP27 (and all other coherent platforms) use prefetch. Is
prefetch OK for all of them?
I suppose memcpy_fromio() should not use PREFETCH, at least.
---
Atsushi Nemoto
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: memcpy and prefetch
2009-01-29 12:36 ` Atsushi Nemoto
@ 2009-01-29 15:58 ` Ralf Baechle
2009-01-30 3:39 ` David VomLehn (dvomlehn)
0 siblings, 1 reply; 14+ messages in thread
From: Ralf Baechle @ 2009-01-29 15:58 UTC (permalink / raw)
To: Atsushi Nemoto; +Cc: ddaney, msundius, linux-mips, dvomlehn, msundius
On Thu, Jan 29, 2009 at 09:36:13PM +0900, Atsushi Nemoto wrote:
> On Wed, 28 Jan 2009 18:30:47 +0000, Ralf Baechle <ralf@linux-mips.org> wrote:
> > --- a/arch/mips/lib/memcpy.S
> > +++ b/arch/mips/lib/memcpy.S
> > @@ -21,7 +21,7 @@
> > * end of memory on some systems. It's also a seriously bad idea on non
> > * dma-coherent systems.
> > */
> > -#if !defined(CONFIG_DMA_COHERENT) || !defined(CONFIG_DMA_IP27)
> > +#ifdef CONFIG_DMA_NONCOHERENT
> > #undef CONFIG_CPU_HAS_PREFETCH
> > #endif
> > #ifdef CONFIG_MIPS_MALTA
>
> This makes IP27 (and all other coherent platforms) use prefetch. Is
> prefetch OK for all of them?
>
> I suppose memcpy_fromio() should not use PREFETCH, at least.
The idea here is that we have two issues with prefetching (the first and
third points below); the second point is history:
o Prefetching beyond the end of the source or destination range on a
non-coherent system might bring back stale values from a DMA I/O
buffer, resulting in data corruption. Hardware DMA coherency will
avoid this issue.
o IP27 has full-blown hardware coherency. Historically CONFIG_DMA_COHERENT
was not able to cope with something of the complexity of IP27, so
there was a separate CONFIG_DMA_IP27, and the broken logic expression
was meant to treat CONFIG_DMA_COHERENT and CONFIG_DMA_IP27 the same
for prefetching.
o Prefetching beyond the end of physical memory can cause exceptions on
some systems. The Malta has this problem.
Thus no prefetching on Malta or non-coherent systems.
Ralf
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: memcpy and prefetch
@ 2009-01-30 3:39 ` David VomLehn (dvomlehn)
0 siblings, 0 replies; 14+ messages in thread
From: David VomLehn (dvomlehn) @ 2009-01-30 3:39 UTC (permalink / raw)
To: Ralf Baechle, Atsushi Nemoto
Cc: ddaney, Michael Sundius -X (msundius - Yoh Services LLC at Cisco),
linux-mips, msundius
> The idea here is that we have two issues with prefetching:
>
> o Prefetching beyond the end of the source or destination range on a
> in-coherent range might bring back stale values from a DMA I/O
> buffer resulting in data corruption. Hardware DMA coherency will
> avoid this issue.
>
> o IP27 has full blown hardware coherency. Historically
> CONFIG_DMA_COHERENT
> was not able to cope with something of the complexity of IP27, so
> there was a separate CONFIG_DMA_IP27 and the broken logic
> expression
> was meant to treat CONFIG_DMA_COHERENT and CONFIG_DMA_IP27 the same
> as for prefetching.
>
> o Prefetching beyond the end of physical memory can cause
> exceptions on
> some systems. The Malta has this problem.
>
> Thus no prefetching on Malta or non-coherent systems.
>
> Ralf
It seems to me as though we could avoid the first and third problems
with a memcpy that doesn't prefetch past the end of the buffer, the
thought being that if we are reading or writing a memory region, we
really shouldn't be doing DMA to or from that location. This would
probably be slightly suboptimal, performance-wise, for those systems
that do have DMA coherence. It seems as though we could have two
mutually exclusive versions, selectable via the CONFIG_DMA_COHERENT
flag. For those of us without DMA coherence, it would probably give our
memcpy performance a bit of a kick in the pants over using no prefetch
at all.
If this makes sense, we might be able to sign up to do the work. Anyone
have a good, caching-aware memcpy test?
--
David VomLehn, dvomlehn@cisco.com
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: memcpy and prefetch
2009-01-30 3:39 ` David VomLehn (dvomlehn)
(?)
@ 2009-02-04 21:27 ` Ralf Baechle
2009-02-05 15:31 ` Atsushi Nemoto
-1 siblings, 1 reply; 14+ messages in thread
From: Ralf Baechle @ 2009-02-04 21:27 UTC (permalink / raw)
To: David VomLehn (dvomlehn)
Cc: Atsushi Nemoto, ddaney,
Michael Sundius -X (msundius - Yoh Services LLC at Cisco),
linux-mips, msundius
On Thu, Jan 29, 2009 at 10:39:37PM -0500, David VomLehn (dvomlehn) wrote:
> > The idea here is that we have two issues with prefetching:
> >
> > o Prefetching beyond the end of the source or destination range on a
> > in-coherent range might bring back stale values from a DMA I/O
> > buffer resulting in data corruption. Hardware DMA coherency will
> > avoid this issue.
> >
> > o IP27 has full blown hardware coherency. Historically
> > CONFIG_DMA_COHERENT
> > was not able to cope with something of the complexity of IP27, so
> > there was a separate CONFIG_DMA_IP27 and the broken logic
> > expression
> > was meant to treat CONFIG_DMA_COHERENT and CONFIG_DMA_IP27 the same
> > as for prefetching.
> >
> > o Prefetching beyond the end of physical memory can cause
> > exceptions on
> > some systems. The Malta has this problem.
> >
> > Thus no prefetching on Malta or non-coherent systems.
> It seems to me as though we could avoid the first and third problems
> with a memcpy that doesn't prefetch past the end of the buffer, the
> thought being that if we are reading or writing a memory region, we
> really shouldn't be doing DMA to or from that location. This would
> probably be slightly suboptimal, performance-wise, for those systems
> that do have DMA coherence. It seems as though we could have two
> mutually exclusive versions, selectable via the CONFIG_DMA_COHERENT
> flag. For those of us without DMA coherence, it would probably give our
> memcpy performance a bit of a kick in the pants over using no prefetch
> at all.
Unnecessary prefetching can come at a high cost due to memory latencies
and cache pollution. So you want to avoid unnecessary prefetches rather
than hoping for hardware cache coherency to sort out the mess software
left behind.
The general expectation is that prefetching will help - but depending on
the pipeline structure prefetching can be hard to exploit optimally. For
example there are MIPS cores where the optimal sequence is something like
load store load store load store load store
But on others it's
load load load load store store store store
Placing prefetching instructions into loops built from such blocks can
produce very surprising results.
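The two orderings can be sketched in C as below. This is only an illustration with made-up function names: a C compiler is free to reschedule these loads and stores, so a real implementation pins the order down in assembly, as memcpy.S does.

```c
#include <stddef.h>
#include <stdint.h>

/* "load store load store ...": each word is stored right after its load. */
void copy_interleaved(uint32_t *dst, const uint32_t *src, size_t nwords)
{
	size_t i;

	for (i = 0; i + 4 <= nwords; i += 4) {
		uint32_t t;

		t = src[i];     dst[i]     = t;
		t = src[i + 1]; dst[i + 1] = t;
		t = src[i + 2]; dst[i + 2] = t;
		t = src[i + 3]; dst[i + 3] = t;
	}
	for (; i < nwords; i++)		/* leftover words */
		dst[i] = src[i];
}

/* "load load load load store store store store": loads grouped first. */
void copy_grouped(uint32_t *dst, const uint32_t *src, size_t nwords)
{
	size_t i;

	for (i = 0; i + 4 <= nwords; i += 4) {
		uint32_t a = src[i];
		uint32_t b = src[i + 1];
		uint32_t c = src[i + 2];
		uint32_t e = src[i + 3];

		dst[i]     = a;
		dst[i + 1] = b;
		dst[i + 2] = c;
		dst[i + 3] = e;
	}
	for (; i < nwords; i++)		/* leftover words */
		dst[i] = src[i];
}
```

Both functions copy the same data. Which block shape runs faster, and where a prefetch fits into it, depends on the core and has to be measured.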
> If this makes sense, we might be able to sign up to do the work. Anyone
> have a good, caching-aware memcpy test?
Testing memcpy is an interesting little project. Correctness is one
thing, but a good implementation needs to make a few performance tradeoffs
which are best measured with real-world, not synthetic, workloads.
Ralf
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: memcpy and prefetch
2009-02-04 21:27 ` Ralf Baechle
@ 2009-02-05 15:31 ` Atsushi Nemoto
0 siblings, 0 replies; 14+ messages in thread
From: Atsushi Nemoto @ 2009-02-05 15:31 UTC (permalink / raw)
To: ralf; +Cc: dvomlehn, ddaney, msundius, linux-mips, msundius
[-- Attachment #1: Type: Text/Plain, Size: 1491 bytes --]
On Wed, 4 Feb 2009 21:27:46 +0000, Ralf Baechle <ralf@linux-mips.org> wrote:
> > If this makes sense, we might be able to sign up to do the work. Anyone
> > have a good, caching-aware memcpy test?
>
> Testing memcpy is an interesting little project. Correctness is one
> thing but a good implementation needs to do a few performance tradeoffs
> which are best meassure with real world, not synthetic workloads.
For a correctness test, drivers/dma/dmatest.c might be a good template.
For a speed test, test_cipher_speed in crypto/tcrypt.c can be used as a
template. Attached is a test module I wrote based on it when I
implemented an asm version of csum_partial_copy_nocheck, etc. It will
show something like this:
# insmod /tmp/testspeed.ko mode=1
testing speed of csum_partial_copy_nocheck
test 0 (32 byte): 2051560 operations in 1 seconds (65649920 bytes)
test 1 (96 byte): 823512 operations in 1 seconds (79057152 bytes)
test 2 (256 byte): 329124 operations in 1 seconds (84255744 bytes)
test 3 (512 byte): 167739 operations in 1 seconds (85882368 bytes)
...
testing speed of gen_csum_partial_copy_nocheck
test 0 (32 byte): 1555953 operations in 1 seconds (49790496 bytes)
test 1 (96 byte): 700025 operations in 1 seconds (67202400 bytes)
test 2 (256 byte): 293716 operations in 1 seconds (75191296 bytes)
test 3 (512 byte): 151770 operations in 1 seconds (77706240 bytes)
...
insmod: error inserting '/tmp/testspeed.ko': -1 Resource temporarily unavailable
Feel free to hack it ;)
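For trying the same measurement idea outside the kernel, a userspace analogue of the timed loop might look like the sketch below. It is an assumption-laden stand-in, not part of the attached module: count_copies and its parameters are hypothetical names, it times plain memcpy instead of the checksum-and-copy routines, and it uses the standard clock() (CPU time) in place of jiffies.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Count how many 'size'-byte copies complete in roughly 'seconds' of CPU
 * time. With 'cachemiss' set, the offset walks through the buffers so that
 * successive copies touch different cache lines, as mode=1 does in the
 * module.
 */
unsigned long count_copies(void *dst, const void *src, size_t bufsize,
			   size_t size, double seconds, int cachemiss)
{
	const clock_t end = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);
	unsigned long count = 0;
	size_t ofs = 0;

	if (size == 0 || size > bufsize)
		return 0;

	while (clock() < end) {
		memcpy((char *)dst + ofs, (const char *)src + ofs, size);
		count++;
		if (cachemiss) {
			ofs += size;
			if (ofs + size > bufsize)
				ofs = 0;	/* wrap before running off the end */
		}
	}
	return count;
}
```

Calling clock() on every iteration adds overhead, so the counts are only useful for comparing runs against each other, not as absolute throughput.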
[-- Attachment #2: testspeed.c --]
[-- Type: Text/Plain, Size: 4338 bytes --]
/*
 * Quick & dirty speed testing module.  (Based on tcrypt).
 *
 * This file is subject to the terms and conditions of the GNU General Public
 * License.  See the file "COPYING" in the main directory of this archive
 * for more details.
 */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/moduleparam.h>
#include <linux/jiffies.h>
#include <net/checksum.h>

static unsigned int sec = 1;
static int mode;

/* non-optimized version of csum_partial_copy_nocheck */
static unsigned int gen_csum_partial_copy_nocheck(const void *src,
	void *dst, int len, unsigned int sum)
{
	sum = csum_partial(src, len, sum);
	memcpy(dst, src, len);
	return sum;
}

/* non-optimized version of csum_partial_copy_from_user */
static unsigned int gen_csum_partial_copy_from_user(const void __user *src,
	void *dst, int len, unsigned int sum, int *err_ptr)
{
	might_sleep();
	if (__copy_from_user(dst, src, len))
		*err_ptr = -EFAULT;
	return csum_partial(dst, len, sum);
}

#define loop_while_sec(start, end, sec, count) \
	for (start = jiffies, end = start + sec * HZ, count = 0; \
	     time_before(jiffies, end); count++)

static int test_csum_partial_copy_speed(int cachemiss)
{
	unsigned long start, end;
	unsigned int i;
	void *src, *dst;
	size_t sizes[] = {
		0x20, 0x60, 0x100, 0x200, 0x400,
		1460, /* ETH_DATA_LEN - 20(ip header) - 20(tcp header) */
		0x800, 0x1000,
	};
	size_t maxsize = sizes[ARRAY_SIZE(sizes) - 1];
	int ofs;
	int count;
	int err = 0;
	int bufsize = 0x10000;

	src = kmalloc(bufsize, GFP_KERNEL);
	if (!src)
		return -ENOMEM;
	dst = kmalloc(bufsize, GFP_KERNEL);
	if (!dst) {
		kfree(src);
		return -ENOMEM;
	}
	memset(src, 0xff, maxsize);

	/* sizes[] is size_t, so it is printed with %zu; sec is unsigned (%u) */
	printk("\ntesting speed of csum_partial_copy_nocheck\n");
	for (i = 0; i < ARRAY_SIZE(sizes); i++) {
		printk("test %u (%zu byte): ", i, sizes[i]);
		ofs = 0;
		loop_while_sec(start, end, sec, count) {
			csum_partial_copy_nocheck(src + ofs, dst + ofs,
						  sizes[i], 0);
			if (cachemiss) {
				ofs += sizes[i];
				if (ofs + sizes[i] > bufsize)
					ofs = 0;
			}
		}
		printk("%d operations in %u seconds (%zu bytes)\n",
		       count, sec, count * sizes[i]);
	}

	printk("\ntesting speed of csum_partial_copy_from_user\n");
	for (i = 0; i < ARRAY_SIZE(sizes); i++) {
		printk("test %u (%zu byte): ", i, sizes[i]);
		ofs = 0;
		loop_while_sec(start, end, sec, count) {
			csum_partial_copy_from_user((const void __force __user *)src + ofs,
						    dst + ofs,
						    sizes[i], 0, &err);
			if (cachemiss) {
				ofs += sizes[i];
				if (ofs + sizes[i] > bufsize)
					ofs = 0;
			}
		}
		printk("%d operations in %u seconds (%zu bytes)\n",
		       count, sec, count * sizes[i]);
	}

	printk("\ntesting speed of gen_csum_partial_copy_nocheck\n");
	for (i = 0; i < ARRAY_SIZE(sizes); i++) {
		printk("test %u (%zu byte): ", i, sizes[i]);
		ofs = 0;
		loop_while_sec(start, end, sec, count) {
			gen_csum_partial_copy_nocheck(src + ofs, dst + ofs,
						      sizes[i], 0);
			if (cachemiss) {
				ofs += sizes[i];
				if (ofs + sizes[i] > bufsize)
					ofs = 0;
			}
		}
		printk("%d operations in %u seconds (%zu bytes)\n",
		       count, sec, count * sizes[i]);
	}

	printk("\ntesting speed of gen_csum_partial_copy_from_user\n");
	for (i = 0; i < ARRAY_SIZE(sizes); i++) {
		printk("test %u (%zu byte): ", i, sizes[i]);
		ofs = 0;
		loop_while_sec(start, end, sec, count) {
			gen_csum_partial_copy_from_user((const void __force __user *)src + ofs,
							dst + ofs,
							sizes[i], 0, &err);
			if (cachemiss) {
				ofs += sizes[i];
				if (ofs + sizes[i] > bufsize)
					ofs = 0;
			}
		}
		printk("%d operations in %u seconds (%zu bytes)\n",
		       count, sec, count * sizes[i]);
	}

	kfree(src);
	kfree(dst);
	return 0;
}

static int __init init(void)
{
	int ret = 0;

	switch (mode) {
	case 0:
		ret = test_csum_partial_copy_speed(0);
		break;
	case 1:
		ret = test_csum_partial_copy_speed(1);
		break;
	}
	if (ret)
		return ret;
	/* We intentionally return -EAGAIN to prevent keeping the module. */
	return -EAGAIN;
}

static void __exit fini(void) {}

module_init(init);
module_exit(fini);

module_param(mode, int, 0);
module_param(sec, uint, 0);
MODULE_PARM_DESC(sec, "Length in seconds of speed tests (default 1)");
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Quick & dirty speed testing module");
^ permalink raw reply [flat|nested] 14+ messages in thread
Thread overview: 14+ messages
2009-01-27 23:00 memcpy and prefetch Michael Sundius
2009-01-27 23:07 ` David Daney
2009-01-28 10:37 ` Ralf Baechle
2009-01-28 15:28 ` Atsushi Nemoto
2009-01-28 18:30 ` Ralf Baechle
2009-01-29 12:36 ` Atsushi Nemoto
2009-01-29 15:58 ` Ralf Baechle
2009-01-30 3:39 ` David VomLehn (dvomlehn)
2009-01-30 3:39 ` David VomLehn (dvomlehn)
2009-02-04 21:27 ` Ralf Baechle
2009-02-05 15:31 ` Atsushi Nemoto
2009-01-28 19:28 ` Michael Sundius
2009-01-28 19:54 ` David Daney
2009-01-28 21:52 ` Chad Reese