* [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
@ 2014-01-21 16:18 Steven J. Hill
2014-01-21 17:37 ` Florian Fainelli
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Steven J. Hill @ 2014-01-21 16:18 UTC (permalink / raw)
To: linux-mips; +Cc: ralf
From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
Use the PREF instruction to optimize partial checksum operations.
Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com>
---
arch/mips/lib/csum_partial.S | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/mips/lib/csum_partial.S b/arch/mips/lib/csum_partial.S
index a6adffb..272820e 100644
--- a/arch/mips/lib/csum_partial.S
+++ b/arch/mips/lib/csum_partial.S
@@ -417,13 +417,19 @@ FEXPORT(csum_partial_copy_nocheck)
*
* If len < NBYTES use byte operations.
*/
+ PREF( 0, 0(src))
+ PREF( 1, 0(dst))
sltu t2, len, NBYTES
and t1, dst, ADDRMASK
bnez t2, .Lcopy_bytes_checklen
+ PREF( 0, 32(src))
+ PREF( 1, 32(dst))
and t0, src, ADDRMASK
andi odd, dst, 0x1 /* odd buffer? */
bnez t1, .Ldst_unaligned
nop
+ PREF( 0, 2*32(src))
+ PREF( 1, 2*32(dst))
bnez t0, .Lsrc_unaligned_dst_aligned
/*
* use delay slot for fall-through
@@ -434,6 +440,8 @@ FEXPORT(csum_partial_copy_nocheck)
beqz t0, .Lcleanup_both_aligned # len < 8*NBYTES
nop
SUB len, 8*NBYTES # subtract here for bgez loop
+ PREF( 0, 3*32(src))
+ PREF( 1, 3*32(dst))
.align 4
1:
EXC( LOAD t0, UNIT(0)(src), .Ll_exc)
@@ -464,6 +472,8 @@ EXC( STORE t7, UNIT(7)(dst), .Ls_exc)
ADDC(sum, t7)
.set reorder /* DADDI_WAR */
ADD dst, dst, 8*NBYTES
+ PREF( 0, 8*32(src))
+ PREF( 1, 8*32(dst))
bgez len, 1b
.set noreorder
ADD len, 8*NBYTES # revert len (see above)
@@ -569,8 +579,10 @@ EXC( STFIRST t3, FIRST(0)(dst), .Ls_exc)
.Lsrc_unaligned_dst_aligned:
SRL t0, len, LOG_NBYTES+2 # +2 for 4 units/iter
+ PREF( 0, 3*32(src))
beqz t0, .Lcleanup_src_unaligned
and rem, len, (4*NBYTES-1) # rem = len % 4*NBYTES
+ PREF( 1, 3*32(dst))
1:
/*
* Avoid consecutive LD*'s to the same register since some mips
--
1.8.3.2
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
2014-01-21 16:18 [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching Steven J. Hill
@ 2014-01-21 17:37 ` Florian Fainelli
2014-01-21 18:25 ` David Daney
2014-01-21 20:49 ` Ralf Baechle
2 siblings, 0 replies; 9+ messages in thread
From: Florian Fainelli @ 2014-01-21 17:37 UTC (permalink / raw)
To: Steven J. Hill; +Cc: Linux-MIPS, Ralf Baechle
2014/1/21 Steven J. Hill <Steven.Hill@imgtec.com>:
> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>
> Use the PREF instruction to optimize partial checksum operations.
This does look like a nice feature, do you have any performance
benchmark results?
>
> Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
> Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com>
> ---
> arch/mips/lib/csum_partial.S | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/arch/mips/lib/csum_partial.S b/arch/mips/lib/csum_partial.S
> index a6adffb..272820e 100644
> --- a/arch/mips/lib/csum_partial.S
> +++ b/arch/mips/lib/csum_partial.S
> @@ -417,13 +417,19 @@ FEXPORT(csum_partial_copy_nocheck)
> *
> * If len < NBYTES use byte operations.
> */
> + PREF( 0, 0(src))
> + PREF( 1, 0(dst))
> sltu t2, len, NBYTES
> and t1, dst, ADDRMASK
> bnez t2, .Lcopy_bytes_checklen
> + PREF( 0, 32(src))
> + PREF( 1, 32(dst))
> and t0, src, ADDRMASK
> andi odd, dst, 0x1 /* odd buffer? */
> bnez t1, .Ldst_unaligned
> nop
> + PREF( 0, 2*32(src))
> + PREF( 1, 2*32(dst))
> bnez t0, .Lsrc_unaligned_dst_aligned
> /*
> * use delay slot for fall-through
> @@ -434,6 +440,8 @@ FEXPORT(csum_partial_copy_nocheck)
> beqz t0, .Lcleanup_both_aligned # len < 8*NBYTES
> nop
> SUB len, 8*NBYTES # subtract here for bgez loop
> + PREF( 0, 3*32(src))
> + PREF( 1, 3*32(dst))
> .align 4
> 1:
> EXC( LOAD t0, UNIT(0)(src), .Ll_exc)
> @@ -464,6 +472,8 @@ EXC( STORE t7, UNIT(7)(dst), .Ls_exc)
> ADDC(sum, t7)
> .set reorder /* DADDI_WAR */
> ADD dst, dst, 8*NBYTES
> + PREF( 0, 8*32(src))
> + PREF( 1, 8*32(dst))
> bgez len, 1b
> .set noreorder
> ADD len, 8*NBYTES # revert len (see above)
> @@ -569,8 +579,10 @@ EXC( STFIRST t3, FIRST(0)(dst), .Ls_exc)
>
> .Lsrc_unaligned_dst_aligned:
> SRL t0, len, LOG_NBYTES+2 # +2 for 4 units/iter
> + PREF( 0, 3*32(src))
> beqz t0, .Lcleanup_src_unaligned
> and rem, len, (4*NBYTES-1) # rem = len % 4*NBYTES
> + PREF( 1, 3*32(dst))
> 1:
> /*
> * Avoid consecutive LD*'s to the same register since some mips
> --
> 1.8.3.2
>
>
--
Florian
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
2014-01-21 16:18 [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching Steven J. Hill
2014-01-21 17:37 ` Florian Fainelli
@ 2014-01-21 18:25 ` David Daney
2014-01-21 20:16 ` Steven J. Hill
2014-01-21 20:49 ` Ralf Baechle
2 siblings, 1 reply; 9+ messages in thread
From: David Daney @ 2014-01-21 18:25 UTC (permalink / raw)
To: Steven J. Hill; +Cc: linux-mips, ralf
On 01/21/2014 08:18 AM, Steven J. Hill wrote:
> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>
> Use the PREF instruction to optimize partial checksum operations.
>
> Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
> Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com>
NACK. The proper latench and cacheline stride vary by CPU, you cannot
just hard code them for 32-byte cacheline size with some random latency.
This will make some CPUs slower.
> ---
> arch/mips/lib/csum_partial.S | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/arch/mips/lib/csum_partial.S b/arch/mips/lib/csum_partial.S
> index a6adffb..272820e 100644
> --- a/arch/mips/lib/csum_partial.S
> +++ b/arch/mips/lib/csum_partial.S
> @@ -417,13 +417,19 @@ FEXPORT(csum_partial_copy_nocheck)
> *
> * If len < NBYTES use byte operations.
> */
> + PREF( 0, 0(src))
> + PREF( 1, 0(dst))
> sltu t2, len, NBYTES
> and t1, dst, ADDRMASK
> bnez t2, .Lcopy_bytes_checklen
> + PREF( 0, 32(src))
> + PREF( 1, 32(dst))
> and t0, src, ADDRMASK
> andi odd, dst, 0x1 /* odd buffer? */
> bnez t1, .Ldst_unaligned
> nop
> + PREF( 0, 2*32(src))
> + PREF( 1, 2*32(dst))
> bnez t0, .Lsrc_unaligned_dst_aligned
> /*
> * use delay slot for fall-through
> @@ -434,6 +440,8 @@ FEXPORT(csum_partial_copy_nocheck)
> beqz t0, .Lcleanup_both_aligned # len < 8*NBYTES
> nop
> SUB len, 8*NBYTES # subtract here for bgez loop
> + PREF( 0, 3*32(src))
> + PREF( 1, 3*32(dst))
> .align 4
> 1:
> EXC( LOAD t0, UNIT(0)(src), .Ll_exc)
> @@ -464,6 +472,8 @@ EXC( STORE t7, UNIT(7)(dst), .Ls_exc)
> ADDC(sum, t7)
> .set reorder /* DADDI_WAR */
> ADD dst, dst, 8*NBYTES
> + PREF( 0, 8*32(src))
> + PREF( 1, 8*32(dst))
> bgez len, 1b
> .set noreorder
> ADD len, 8*NBYTES # revert len (see above)
> @@ -569,8 +579,10 @@ EXC( STFIRST t3, FIRST(0)(dst), .Ls_exc)
>
> .Lsrc_unaligned_dst_aligned:
> SRL t0, len, LOG_NBYTES+2 # +2 for 4 units/iter
> + PREF( 0, 3*32(src))
> beqz t0, .Lcleanup_src_unaligned
> and rem, len, (4*NBYTES-1) # rem = len % 4*NBYTES
> + PREF( 1, 3*32(dst))
> 1:
> /*
> * Avoid consecutive LD*'s to the same register since some mips
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
2014-01-21 18:25 ` David Daney
@ 2014-01-21 20:16 ` Steven J. Hill
2014-01-21 20:25 ` Florian Fainelli
0 siblings, 1 reply; 9+ messages in thread
From: Steven J. Hill @ 2014-01-21 20:16 UTC (permalink / raw)
To: David Daney; +Cc: LMOL
On 01/21/2014 12:25 PM, David Daney wrote:
> On 01/21/2014 08:18 AM, Steven J. Hill wrote:
>> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>>
>> Use the PREF instruction to optimize partial checksum operations.
>>
>> Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>> Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com>
>
> NACK. The proper latench and cacheline stride vary by CPU, you cannot
> just hard code them for 32-byte cacheline size with some random latency.
>
> This will make some CPUs slower.
>
Note that memcpy.S already uses fixed cache lines (32 bytes) so this is
merely doing the same thing. I assume you have some empirical evidence
concerning other CPUs being slower?
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
2014-01-21 20:16 ` Steven J. Hill
@ 2014-01-21 20:25 ` Florian Fainelli
0 siblings, 0 replies; 9+ messages in thread
From: Florian Fainelli @ 2014-01-21 20:25 UTC (permalink / raw)
To: Steven J. Hill; +Cc: David Daney, LMOL
2014/1/21 Steven J. Hill <Steven.Hill@imgtec.com>:
> On 01/21/2014 12:25 PM, David Daney wrote:
>>
>> On 01/21/2014 08:18 AM, Steven J. Hill wrote:
>>>
>>> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>>>
>>> Use the PREF instruction to optimize partial checksum operations.
>>>
>>> Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>>> Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com>
>>
>>
>> NACK. The proper latench and cacheline stride vary by CPU, you cannot
>> just hard code them for 32-byte cacheline size with some random latency.
>>
>> This will make some CPUs slower.
>>
> Note that memcpy.S already uses fixed cache lines (32 bytes) so this is
> merely doing the same thing. I assume you have some empirical evidence
> concerning other CPUs being slower?
How about using cpu_dcache_line_size()/MIPS_L1_CACHE_SHIFT? These
should provide a good hint. Octeon has a 128bytes D$ line size, so
prefetching via slices of 32 bytes is most likely suboptimal.
--
Florian
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
2014-01-21 16:18 [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching Steven J. Hill
2014-01-21 17:37 ` Florian Fainelli
2014-01-21 18:25 ` David Daney
@ 2014-01-21 20:49 ` Ralf Baechle
2014-01-21 20:58 ` Steven J. Hill
2 siblings, 1 reply; 9+ messages in thread
From: Ralf Baechle @ 2014-01-21 20:49 UTC (permalink / raw)
To: Steven J. Hill; +Cc: linux-mips
On Tue, Jan 21, 2014 at 10:18:42AM -0600, Steven J. Hill wrote:
> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>
> Use the PREF instruction to optimize partial checksum operations.
Prefetch operations may cause obscure bus error exceptions on some systems
such as Malta, for example, when prefetching beyond the end of memory.
It may also mean memory regions that are just undergoing a DMA transfer
are being brought back into cache.
This pretty much means that pref is only safe to use on cache-coherent
systems.
Those are the very same reasons that are making pref headache for memcpy.
Performance tuning is another can of worms. On those platforms that I've
benchmarked code with and without pref on, it was very hard to predict
if pref was actually an advantage. If data that is not going to be
used is prefetch, pref wastes an issue slot, wastes instruction bandwith
and in the end makes things slower. If data is not prefetched early
enough, same kind of issue. And in the end PREF and MT were invented
to solve the same kind of fundamental problem: memory is slow and slower
on embedded. For both solutions the results are extremly dependent on
workload.
Cheers,
Ralf
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
@ 2014-01-21 20:58 ` Steven J. Hill
0 siblings, 0 replies; 9+ messages in thread
From: Steven J. Hill @ 2014-01-21 20:58 UTC (permalink / raw)
To: Ralf Baechle; +Cc: linux-mips
On 01/21/2014 02:49 PM, Ralf Baechle wrote:
> On Tue, Jan 21, 2014 at 10:18:42AM -0600, Steven J. Hill wrote:
>
>> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>>
>> Use the PREF instruction to optimize partial checksum operations.
>
> Prefetch operations may cause obscure bus error exceptions on some systems
> such as Malta, for example, when prefetching beyond the end of memory.
> It may also mean memory regions that are just undergoing a DMA transfer
> are being brought back into cache.
>
> This pretty much means that pref is only safe to use on cache-coherent
> systems.
>
So, could we have:
#ifdef CONFIG_DMA_NONCOHERENT
#undef CONFIG_CPU_HAS_PREFETCH
#endif
#define PREFSIZE (1 << MIPS_L1_CACHE_SHIFT)
and then use the PREFSIZE value instead of the hardcoded value of 32?
Steve
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
@ 2014-01-21 20:58 ` Steven J. Hill
0 siblings, 0 replies; 9+ messages in thread
From: Steven J. Hill @ 2014-01-21 20:58 UTC (permalink / raw)
To: Ralf Baechle; +Cc: linux-mips
On 01/21/2014 02:49 PM, Ralf Baechle wrote:
> On Tue, Jan 21, 2014 at 10:18:42AM -0600, Steven J. Hill wrote:
>
>> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>>
>> Use the PREF instruction to optimize partial checksum operations.
>
> Prefetch operations may cause obscure bus error exceptions on some systems
> such as Malta, for example, when prefetching beyond the end of memory.
> It may also mean memory regions that are just undergoing a DMA transfer
> are being brought back into cache.
>
> This pretty much means that pref is only safe to use on cache-coherent
> systems.
>
So, could we have:
#ifdef CONFIG_DMA_NONCOHERENT
#undef CONFIG_CPU_HAS_PREFETCH
#endif
#define PREFSIZE (1 << MIPS_L1_CACHE_SHIFT)
and then use the PREFSIZE value instead of the hardcoded value of 32?
Steve
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
2014-01-21 20:58 ` Steven J. Hill
(?)
@ 2014-01-21 21:03 ` David Daney
-1 siblings, 0 replies; 9+ messages in thread
From: David Daney @ 2014-01-21 21:03 UTC (permalink / raw)
To: Steven J. Hill; +Cc: Ralf Baechle, linux-mips
On 01/21/2014 12:58 PM, Steven J. Hill wrote:
> On 01/21/2014 02:49 PM, Ralf Baechle wrote:
>> On Tue, Jan 21, 2014 at 10:18:42AM -0600, Steven J. Hill wrote:
>>
>>> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>>>
>>> Use the PREF instruction to optimize partial checksum operations.
>>
>> Prefetch operations may cause obscure bus error exceptions on some
>> systems
>> such as Malta, for example, when prefetching beyond the end of memory.
>> It may also mean memory regions that are just undergoing a DMA transfer
>> are being brought back into cache.
>>
>> This pretty much means that pref is only safe to use on cache-coherent
>> systems.
>>
> So, could we have:
>
> #ifdef CONFIG_DMA_NONCOHERENT
> #undef CONFIG_CPU_HAS_PREFETCH
> #endif
> #define PREFSIZE (1 << MIPS_L1_CACHE_SHIFT)
>
> and then use the PREFSIZE value instead of the hardcoded value of 32?
See arch/mips/mm/page.c for code that tries to do something sensible
with streaming prefetches.
>
> Steve
>
>
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2014-01-21 21:03 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-01-21 16:18 [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching Steven J. Hill
2014-01-21 17:37 ` Florian Fainelli
2014-01-21 18:25 ` David Daney
2014-01-21 20:16 ` Steven J. Hill
2014-01-21 20:25 ` Florian Fainelli
2014-01-21 20:49 ` Ralf Baechle
2014-01-21 20:58 ` Steven J. Hill
2014-01-21 20:58 ` Steven J. Hill
2014-01-21 21:03 ` David Daney
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.