All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
@ 2014-01-21 16:18 Steven J. Hill
  2014-01-21 17:37 ` Florian Fainelli
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Steven J. Hill @ 2014-01-21 16:18 UTC (permalink / raw)
  To: linux-mips; +Cc: ralf

From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>

Use the PREF instruction to optimize partial checksum operations.

Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com>
---
 arch/mips/lib/csum_partial.S | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/mips/lib/csum_partial.S b/arch/mips/lib/csum_partial.S
index a6adffb..272820e 100644
--- a/arch/mips/lib/csum_partial.S
+++ b/arch/mips/lib/csum_partial.S
@@ -417,13 +417,19 @@ FEXPORT(csum_partial_copy_nocheck)
 	 *
 	 * If len < NBYTES use byte operations.
 	 */
+	PREF(	0, 0(src))
+	PREF(	1, 0(dst))
 	sltu	t2, len, NBYTES
 	and	t1, dst, ADDRMASK
 	bnez	t2, .Lcopy_bytes_checklen
+	PREF(	0, 32(src))
+	PREF(	1, 32(dst))
 	 and	t0, src, ADDRMASK
 	andi	odd, dst, 0x1			/* odd buffer? */
 	bnez	t1, .Ldst_unaligned
 	 nop
+	PREF(	0, 2*32(src))
+	PREF(	1, 2*32(dst))
 	bnez	t0, .Lsrc_unaligned_dst_aligned
 	/*
 	 * use delay slot for fall-through
@@ -434,6 +440,8 @@ FEXPORT(csum_partial_copy_nocheck)
 	beqz	t0, .Lcleanup_both_aligned # len < 8*NBYTES
 	 nop
 	SUB	len, 8*NBYTES		# subtract here for bgez loop
+	PREF(	0, 3*32(src))
+	PREF(	1, 3*32(dst))
 	.align	4
 1:
 EXC(	LOAD	t0, UNIT(0)(src),	.Ll_exc)
@@ -464,6 +472,8 @@ EXC(	STORE	t7, UNIT(7)(dst),	.Ls_exc)
 	ADDC(sum, t7)
 	.set	reorder				/* DADDI_WAR */
 	ADD	dst, dst, 8*NBYTES
+	PREF(	0, 8*32(src))
+	PREF(	1, 8*32(dst))
 	bgez	len, 1b
 	.set	noreorder
 	ADD	len, 8*NBYTES		# revert len (see above)
@@ -569,8 +579,10 @@ EXC(	STFIRST t3, FIRST(0)(dst),	.Ls_exc)
 
 .Lsrc_unaligned_dst_aligned:
 	SRL	t0, len, LOG_NBYTES+2	 # +2 for 4 units/iter
+	PREF(	0, 3*32(src))
 	beqz	t0, .Lcleanup_src_unaligned
 	 and	rem, len, (4*NBYTES-1)	 # rem = len % 4*NBYTES
+	PREF(	1, 3*32(dst))
 1:
 /*
  * Avoid consecutive LD*'s to the same register since some mips
-- 
1.8.3.2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
  2014-01-21 16:18 [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching Steven J. Hill
@ 2014-01-21 17:37 ` Florian Fainelli
  2014-01-21 18:25 ` David Daney
  2014-01-21 20:49 ` Ralf Baechle
  2 siblings, 0 replies; 9+ messages in thread
From: Florian Fainelli @ 2014-01-21 17:37 UTC (permalink / raw)
  To: Steven J. Hill; +Cc: Linux-MIPS, Ralf Baechle

2014/1/21 Steven J. Hill <Steven.Hill@imgtec.com>:
> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>
> Use the PREF instruction to optimize partial checksum operations.

This does look like a nice feature, do you have any performance
benchmark results?

>
> Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
> Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com>
> ---
>  arch/mips/lib/csum_partial.S | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/arch/mips/lib/csum_partial.S b/arch/mips/lib/csum_partial.S
> index a6adffb..272820e 100644
> --- a/arch/mips/lib/csum_partial.S
> +++ b/arch/mips/lib/csum_partial.S
> @@ -417,13 +417,19 @@ FEXPORT(csum_partial_copy_nocheck)
>          *
>          * If len < NBYTES use byte operations.
>          */
> +       PREF(   0, 0(src))
> +       PREF(   1, 0(dst))
>         sltu    t2, len, NBYTES
>         and     t1, dst, ADDRMASK
>         bnez    t2, .Lcopy_bytes_checklen
> +       PREF(   0, 32(src))
> +       PREF(   1, 32(dst))
>          and    t0, src, ADDRMASK
>         andi    odd, dst, 0x1                   /* odd buffer? */
>         bnez    t1, .Ldst_unaligned
>          nop
> +       PREF(   0, 2*32(src))
> +       PREF(   1, 2*32(dst))
>         bnez    t0, .Lsrc_unaligned_dst_aligned
>         /*
>          * use delay slot for fall-through
> @@ -434,6 +440,8 @@ FEXPORT(csum_partial_copy_nocheck)
>         beqz    t0, .Lcleanup_both_aligned # len < 8*NBYTES
>          nop
>         SUB     len, 8*NBYTES           # subtract here for bgez loop
> +       PREF(   0, 3*32(src))
> +       PREF(   1, 3*32(dst))
>         .align  4
>  1:
>  EXC(   LOAD    t0, UNIT(0)(src),       .Ll_exc)
> @@ -464,6 +472,8 @@ EXC(        STORE   t7, UNIT(7)(dst),       .Ls_exc)
>         ADDC(sum, t7)
>         .set    reorder                         /* DADDI_WAR */
>         ADD     dst, dst, 8*NBYTES
> +       PREF(   0, 8*32(src))
> +       PREF(   1, 8*32(dst))
>         bgez    len, 1b
>         .set    noreorder
>         ADD     len, 8*NBYTES           # revert len (see above)
> @@ -569,8 +579,10 @@ EXC(       STFIRST t3, FIRST(0)(dst),      .Ls_exc)
>
>  .Lsrc_unaligned_dst_aligned:
>         SRL     t0, len, LOG_NBYTES+2    # +2 for 4 units/iter
> +       PREF(   0, 3*32(src))
>         beqz    t0, .Lcleanup_src_unaligned
>          and    rem, len, (4*NBYTES-1)   # rem = len % 4*NBYTES
> +       PREF(   1, 3*32(dst))
>  1:
>  /*
>   * Avoid consecutive LD*'s to the same register since some mips
> --
> 1.8.3.2
>
>



-- 
Florian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
  2014-01-21 16:18 [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching Steven J. Hill
  2014-01-21 17:37 ` Florian Fainelli
@ 2014-01-21 18:25 ` David Daney
  2014-01-21 20:16   ` Steven J. Hill
  2014-01-21 20:49 ` Ralf Baechle
  2 siblings, 1 reply; 9+ messages in thread
From: David Daney @ 2014-01-21 18:25 UTC (permalink / raw)
  To: Steven J. Hill; +Cc: linux-mips, ralf

On 01/21/2014 08:18 AM, Steven J. Hill wrote:
> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>
> Use the PREF instruction to optimize partial checksum operations.
>
> Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
> Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com>

NACK.  The proper latench and cacheline stride vary by CPU, you cannot 
just hard code them for 32-byte cacheline size with some random latency.

This will make some CPUs slower.

> ---
>   arch/mips/lib/csum_partial.S | 12 ++++++++++++
>   1 file changed, 12 insertions(+)
>
> diff --git a/arch/mips/lib/csum_partial.S b/arch/mips/lib/csum_partial.S
> index a6adffb..272820e 100644
> --- a/arch/mips/lib/csum_partial.S
> +++ b/arch/mips/lib/csum_partial.S
> @@ -417,13 +417,19 @@ FEXPORT(csum_partial_copy_nocheck)
>   	 *
>   	 * If len < NBYTES use byte operations.
>   	 */
> +	PREF(	0, 0(src))
> +	PREF(	1, 0(dst))
>   	sltu	t2, len, NBYTES
>   	and	t1, dst, ADDRMASK
>   	bnez	t2, .Lcopy_bytes_checklen
> +	PREF(	0, 32(src))
> +	PREF(	1, 32(dst))
>   	 and	t0, src, ADDRMASK
>   	andi	odd, dst, 0x1			/* odd buffer? */
>   	bnez	t1, .Ldst_unaligned
>   	 nop
> +	PREF(	0, 2*32(src))
> +	PREF(	1, 2*32(dst))
>   	bnez	t0, .Lsrc_unaligned_dst_aligned
>   	/*
>   	 * use delay slot for fall-through
> @@ -434,6 +440,8 @@ FEXPORT(csum_partial_copy_nocheck)
>   	beqz	t0, .Lcleanup_both_aligned # len < 8*NBYTES
>   	 nop
>   	SUB	len, 8*NBYTES		# subtract here for bgez loop
> +	PREF(	0, 3*32(src))
> +	PREF(	1, 3*32(dst))
>   	.align	4
>   1:
>   EXC(	LOAD	t0, UNIT(0)(src),	.Ll_exc)
> @@ -464,6 +472,8 @@ EXC(	STORE	t7, UNIT(7)(dst),	.Ls_exc)
>   	ADDC(sum, t7)
>   	.set	reorder				/* DADDI_WAR */
>   	ADD	dst, dst, 8*NBYTES
> +	PREF(	0, 8*32(src))
> +	PREF(	1, 8*32(dst))
>   	bgez	len, 1b
>   	.set	noreorder
>   	ADD	len, 8*NBYTES		# revert len (see above)
> @@ -569,8 +579,10 @@ EXC(	STFIRST t3, FIRST(0)(dst),	.Ls_exc)
>
>   .Lsrc_unaligned_dst_aligned:
>   	SRL	t0, len, LOG_NBYTES+2	 # +2 for 4 units/iter
> +	PREF(	0, 3*32(src))
>   	beqz	t0, .Lcleanup_src_unaligned
>   	 and	rem, len, (4*NBYTES-1)	 # rem = len % 4*NBYTES
> +	PREF(	1, 3*32(dst))
>   1:
>   /*
>    * Avoid consecutive LD*'s to the same register since some mips
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
  2014-01-21 18:25 ` David Daney
@ 2014-01-21 20:16   ` Steven J. Hill
  2014-01-21 20:25     ` Florian Fainelli
  0 siblings, 1 reply; 9+ messages in thread
From: Steven J. Hill @ 2014-01-21 20:16 UTC (permalink / raw)
  To: David Daney; +Cc: LMOL

On 01/21/2014 12:25 PM, David Daney wrote:
> On 01/21/2014 08:18 AM, Steven J. Hill wrote:
>> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>>
>> Use the PREF instruction to optimize partial checksum operations.
>>
>> Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>> Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com>
>
> NACK.  The proper latench and cacheline stride vary by CPU, you cannot
> just hard code them for 32-byte cacheline size with some random latency.
>
> This will make some CPUs slower.
>
Note that memcpy.S already uses fixed cache lines (32 bytes) so this is 
merely doing the same thing. I assume you have some empirical evidence 
concerning other CPUs being slower?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
  2014-01-21 20:16   ` Steven J. Hill
@ 2014-01-21 20:25     ` Florian Fainelli
  0 siblings, 0 replies; 9+ messages in thread
From: Florian Fainelli @ 2014-01-21 20:25 UTC (permalink / raw)
  To: Steven J. Hill; +Cc: David Daney, LMOL

2014/1/21 Steven J. Hill <Steven.Hill@imgtec.com>:
> On 01/21/2014 12:25 PM, David Daney wrote:
>>
>> On 01/21/2014 08:18 AM, Steven J. Hill wrote:
>>>
>>> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>>>
>>> Use the PREF instruction to optimize partial checksum operations.
>>>
>>> Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>>> Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com>
>>
>>
>> NACK.  The proper latench and cacheline stride vary by CPU, you cannot
>> just hard code them for 32-byte cacheline size with some random latency.
>>
>> This will make some CPUs slower.
>>
> Note that memcpy.S already uses fixed cache lines (32 bytes) so this is
> merely doing the same thing. I assume you have some empirical evidence
> concerning other CPUs being slower?

How about using cpu_dcache_line_size()/MIPS_L1_CACHE_SHIFT? These
should provide a good hint. Octeon has a 128bytes D$ line size, so
prefetching via slices of 32 bytes is most likely suboptimal.
-- 
Florian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
  2014-01-21 16:18 [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching Steven J. Hill
  2014-01-21 17:37 ` Florian Fainelli
  2014-01-21 18:25 ` David Daney
@ 2014-01-21 20:49 ` Ralf Baechle
  2014-01-21 20:58     ` Steven J. Hill
  2 siblings, 1 reply; 9+ messages in thread
From: Ralf Baechle @ 2014-01-21 20:49 UTC (permalink / raw)
  To: Steven J. Hill; +Cc: linux-mips

On Tue, Jan 21, 2014 at 10:18:42AM -0600, Steven J. Hill wrote:

> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
> 
> Use the PREF instruction to optimize partial checksum operations.

Prefetch operations may cause obscure bus error exceptions on some systems
such as Malta, for example, when prefetching beyond the end of memory.
It may also mean memory regions that are just undergoing a DMA transfer
are being brought back into cache.

This pretty much means that pref is only safe to use on cache-coherent
systems.

Those are the very same reasons that are making pref headache for memcpy.

Performance tuning is another can of worms.  On those platforms that I've
benchmarked code with and without pref on, it was very hard to predict
if pref was actually an advantage.  If data that is not going to be
used is prefetch, pref wastes an issue slot, wastes instruction bandwith
and in the end makes things slower.  If data is not prefetched early
enough, same kind of issue.  And in the end PREF and MT were invented
to solve the same kind of fundamental problem: memory is slow and slower
on embedded.  For both solutions the results are extremly dependent on
workload.

Cheers,

  Ralf

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
@ 2014-01-21 20:58     ` Steven J. Hill
  0 siblings, 0 replies; 9+ messages in thread
From: Steven J. Hill @ 2014-01-21 20:58 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: linux-mips

On 01/21/2014 02:49 PM, Ralf Baechle wrote:
> On Tue, Jan 21, 2014 at 10:18:42AM -0600, Steven J. Hill wrote:
>
>> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>>
>> Use the PREF instruction to optimize partial checksum operations.
>
> Prefetch operations may cause obscure bus error exceptions on some systems
> such as Malta, for example, when prefetching beyond the end of memory.
> It may also mean memory regions that are just undergoing a DMA transfer
> are being brought back into cache.
>
> This pretty much means that pref is only safe to use on cache-coherent
> systems.
>
So, could we have:

    #ifdef CONFIG_DMA_NONCOHERENT
    #undef CONFIG_CPU_HAS_PREFETCH
    #endif
    #define PREFSIZE   (1 << MIPS_L1_CACHE_SHIFT)

and then use the PREFSIZE value instead of the hardcoded value of 32?

Steve

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
@ 2014-01-21 20:58     ` Steven J. Hill
  0 siblings, 0 replies; 9+ messages in thread
From: Steven J. Hill @ 2014-01-21 20:58 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: linux-mips

On 01/21/2014 02:49 PM, Ralf Baechle wrote:
> On Tue, Jan 21, 2014 at 10:18:42AM -0600, Steven J. Hill wrote:
>
>> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>>
>> Use the PREF instruction to optimize partial checksum operations.
>
> Prefetch operations may cause obscure bus error exceptions on some systems
> such as Malta, for example, when prefetching beyond the end of memory.
> It may also mean memory regions that are just undergoing a DMA transfer
> are being brought back into cache.
>
> This pretty much means that pref is only safe to use on cache-coherent
> systems.
>
So, could we have:

    #ifdef CONFIG_DMA_NONCOHERENT
    #undef CONFIG_CPU_HAS_PREFETCH
    #endif
    #define PREFSIZE   (1 << MIPS_L1_CACHE_SHIFT)

and then use the PREFSIZE value instead of the hardcoded value of 32?

Steve

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching.
  2014-01-21 20:58     ` Steven J. Hill
  (?)
@ 2014-01-21 21:03     ` David Daney
  -1 siblings, 0 replies; 9+ messages in thread
From: David Daney @ 2014-01-21 21:03 UTC (permalink / raw)
  To: Steven J. Hill; +Cc: Ralf Baechle, linux-mips

On 01/21/2014 12:58 PM, Steven J. Hill wrote:
> On 01/21/2014 02:49 PM, Ralf Baechle wrote:
>> On Tue, Jan 21, 2014 at 10:18:42AM -0600, Steven J. Hill wrote:
>>
>>> From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
>>>
>>> Use the PREF instruction to optimize partial checksum operations.
>>
>> Prefetch operations may cause obscure bus error exceptions on some
>> systems
>> such as Malta, for example, when prefetching beyond the end of memory.
>> It may also mean memory regions that are just undergoing a DMA transfer
>> are being brought back into cache.
>>
>> This pretty much means that pref is only safe to use on cache-coherent
>> systems.
>>
> So, could we have:
>
>     #ifdef CONFIG_DMA_NONCOHERENT
>     #undef CONFIG_CPU_HAS_PREFETCH
>     #endif
>     #define PREFSIZE   (1 << MIPS_L1_CACHE_SHIFT)
>
> and then use the PREFSIZE value instead of the hardcoded value of 32?

See arch/mips/mm/page.c for code that tries to do something sensible 
with streaming prefetches.


>
> Steve
>
>
>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2014-01-21 21:03 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-01-21 16:18 [PATCH] MIPS: lib: Optimize partial checksum ops using prefetching Steven J. Hill
2014-01-21 17:37 ` Florian Fainelli
2014-01-21 18:25 ` David Daney
2014-01-21 20:16   ` Steven J. Hill
2014-01-21 20:25     ` Florian Fainelli
2014-01-21 20:49 ` Ralf Baechle
2014-01-21 20:58   ` Steven J. Hill
2014-01-21 20:58     ` Steven J. Hill
2014-01-21 21:03     ` David Daney

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.