* [PATCH v1 0/3] powerpc: memcmp() optimization
@ 2017-09-19 10:03 wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: wei.guo.simon @ 2017-09-19 10:03 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
There is some room to optimize memcmp() on powerpc for the following 2 cases:
(1) Even if the src/dst addresses are not 8-byte aligned at the beginning,
memcmp() can align them and use the .Llong comparison mode instead of falling
back to the .Lshort comparison mode, which compares the buffers byte by byte.
(2) VMX instructions can be used to speed up comparisons of large sizes.
This patch set also updates the selftest case so that it compiles.
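For illustration, the strategy can be sketched in C roughly as below. This is only a
simplified model of the series (the real code is PPC64 assembly in
arch/powerpc/lib/memcmp_64.S); memcmp_sketch is a hypothetical name and the VMX
branch of case (2) is shown only as a comment:
------
#include <stddef.h>
#include <stdint.h>

/* Simplified C model of the dispatch strategy; illustration only. */
int memcmp_sketch(const void *p1, const void *p2, size_t n)
{
	const unsigned char *s1 = p1, *s2 = p2;

	/* Offsets differ within an 8-byte word: stay byte-by-byte (.Lshort). */
	if (((uintptr_t)s1 ^ (uintptr_t)s2) & 7)
		goto bytewise;

	/* Case (1): consume leading bytes until both are 8-byte aligned. */
	for (; n && ((uintptr_t)s1 & 7); n--, s1++, s2++)
		if (*s1 != *s2)
			return *s1 < *s2 ? -1 : 1;

	/* Case (2) would divert to a VMX loop here when n > 4096 (patch 2). */

	/* .Llong-style loop: 8 bytes at a time; the equality test is
	 * endian-neutral, and the first differing byte is found below. */
	for (; n >= 8; n -= 8, s1 += 8, s2 += 8)
		if (*(const uint64_t *)s1 != *(const uint64_t *)s2)
			break;
bytewise:
	for (; n; n--, s1++, s2++)
		if (*s1 != *s2)
			return *s1 < *s2 ? -1 : 1;
	return 0;
}
------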
Simon Guo (3):
powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
powerpc: enhance memcmp() with VMX instruction for long bytes
comparison
powerpc:selftest update memcmp selftest according to kernel change
arch/powerpc/include/asm/asm-prototypes.h | 2 +-
arch/powerpc/lib/copypage_power7.S | 2 +-
arch/powerpc/lib/memcmp_64.S | 165 ++++++++++++++++++++-
arch/powerpc/lib/memcpy_power7.S | 2 +-
arch/powerpc/lib/vmx-helper.c | 2 +-
.../selftests/powerpc/copyloops/asm/ppc_asm.h | 2 +-
.../selftests/powerpc/stringloops/asm/ppc_asm.h | 31 ++++
7 files changed, 197 insertions(+), 9 deletions(-)
--
1.8.3.1
* [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
@ 2017-09-19 10:03 ` wei.guo.simon
2017-09-19 10:12 ` David Laight
2017-09-19 12:20 ` Christophe LEROY
2017-09-19 10:03 ` [PATCH v1 2/3] powerpc: enhance memcmp() with VMX instruction for long bytes comparison wei.guo.simon
` (2 subsequent siblings)
3 siblings, 2 replies; 10+ messages in thread
From: wei.guo.simon @ 2017-09-19 10:03 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
Currently memcmp() on powerpc will fall back to .Lshort (compare per byte
mode) if either the src or dst address is not 8-byte aligned. It can be
optimized when both addresses have the same offset from an 8-byte boundary.
memcmp() can first align the src/dst addresses to an 8-byte boundary and then
compare in .Llong mode.
This patch optimizes memcmp() behavior in this situation.
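As an illustration of the new leading-alignment step, a C sketch of what the
.Lalignbytes_start block below does is given here. It is a hypothetical helper
(cmp_head is not part of the patch), it assumes ((s1 ^ s2) & 7) == 0 and n > 8
have already been checked, and it resolves a mismatch with the C library
memcmp() instead of the .LcmpAB_lightweight path:
------
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: consume up to 7 leading bytes so both pointers
 * reach an 8-byte boundary, mirroring the .Lalignbytes_start /
 * .Lalign2bytes / .Lalign4bytes steps in the hunk below. */
static int cmp_head(const unsigned char **s1, const unsigned char **s2,
		    size_t *n)
{
	/* Same idea as "neg r0,r3; andi. r0,r0,7": bytes to the boundary. */
	unsigned int pad = (unsigned int)(-(uintptr_t)*s1) & 7;
	unsigned int step;

	for (step = 1; step <= 4; step <<= 1) {
		if (pad & step) {
			uint64_t a = 0, b = 0;

			memcpy(&a, *s1, step);	/* lbz / LH / LW in the asm */
			memcpy(&b, *s2, step);
			if (a != b)		/* first difference decides */
				return memcmp(*s1, *s2, step);
			*s1 += step;
			*s2 += step;
			*n -= step;
		}
	}
	return 0;	/* equal so far; caller continues past the head */
}
------
A non-zero return is already the final memcmp() result; on zero the caller
proceeds with the aligned 8-byte (.Llong-style) loop.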
Test result:
(1) 256 bytes
Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
- without patch
50.715169506 seconds time elapsed ( +- 0.04% )
- with patch
28.906602373 seconds time elapsed ( +- 0.02% )
-> There is a ~75% improvement (50.72s / 28.91s is about 1.75x).
(2) 32 bytes
To observe performance impact on < 32 bytes, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
-------
#include <string.h>
#include "utils.h"
-#define SIZE 256
+#define SIZE 32
#define ITERATIONS 10000
int test_memcmp(const void *s1, const void *s2, size_t n);
--------
- Without patch
0.390677136 seconds time elapsed ( +- 0.03% )
- with patch
0.375685926 seconds time elapsed ( +- 0.05% )
-> There is a ~4% improvement.
(3) 0~8 bytes
To observe <8 bytes performance impact, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
-------
#include <string.h>
#include "utils.h"
-#define SIZE 256
-#define ITERATIONS 10000
+#define SIZE 8
+#define ITERATIONS 100000
int test_memcmp(const void *s1, const void *s2, size_t n);
-------
- Without patch
3.169203981 seconds time elapsed ( +- 0.23% )
- With patch
3.208257362 seconds time elapsed ( +- 0.13% )
-> There is a ~1% regression.
(I don't know why yet, since the code path for 0~8 byte memcmp() has the
same number of instructions with and without this patch. Any comments
would be appreciated.)
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
arch/powerpc/lib/memcmp_64.S | 86 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 82 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index d75d18b..6dbafdb 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -24,25 +24,95 @@
#define rH r31
#ifdef __LITTLE_ENDIAN__
+#define LH lhbrx
+#define LW lwbrx
#define LD ldbrx
#else
+#define LH lhzx
+#define LW lwzx
#define LD ldx
#endif
_GLOBAL(memcmp)
cmpdi cr1,r5,0
- /* Use the short loop if both strings are not 8B aligned */
- or r6,r3,r4
+ /* Use the short loop if the src/dst addresses are not
+ * with the same offset of 8 bytes align boundary.
+ */
+ xor r6,r3,r4
andi. r6,r6,7
- /* Use the short loop if length is less than 32B */
- cmpdi cr6,r5,31
+ /* fall back to short loop if compare at aligned addrs
+ * with no greater than 8 bytes.
+ */
+ cmpdi cr6,r5,8
beq cr1,.Lzero
bne .Lshort
+ ble cr6,.Lshort
+
+.Lalignbytes_start:
+ /* The bits 0/1/2 of src/dst addr are the same. */
+ neg r0,r3
+ andi. r0,r0,7
+ beq .Lalign8bytes
+
+ PPC_MTOCRF(1,r0)
+ bf 31,.Lalign2bytes
+ lbz rA,0(r3)
+ lbz rB,0(r4)
+ cmplw cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ addi r3,r3,1
+ addi r4,r4,1
+ subi r5,r5,1
+.Lalign2bytes:
+ bf 30,.Lalign4bytes
+ LH rA,0,r3
+ LH rB,0,r4
+ cmplw cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ bne .Lnon_zero
+ addi r3,r3,2
+ addi r4,r4,2
+ subi r5,r5,2
+.Lalign4bytes:
+ bf 29,.Lalign8bytes
+ LW rA,0,r3
+ LW rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ addi r3,r3,4
+ addi r4,r4,4
+ subi r5,r5,4
+.Lalign8bytes:
+ /* Now addrs are aligned with 8 bytes. Use the short loop if left
+ * bytes are less than 8B.
+ */
+ cmpdi cr6,r5,7
+ ble cr6,.Lshort
+
+ /* Use .Llong loop if left cmp bytes are equal or greater than 32B */
+ cmpdi cr6,r5,31
bgt cr6,.Llong
+.Lcmploop_8bytes_31bytes:
+ /* handle 8 ~ 31 bytes with 8 bytes aligned addrs */
+ srdi. r0,r5,3
+ clrldi r5,r5,61
+ mtctr r0
+831:
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ addi r3,r3,8
+ addi r4,r4,8
+ bdnz 831b
+
+ cmpwi r5,0
+ beq .Lzero
+
.Lshort:
mtctr r5
@@ -232,4 +302,12 @@ _GLOBAL(memcmp)
ld r28,-32(r1)
ld r27,-40(r1)
blr
+
+.LcmpAB_lightweight: /* skip NV GPRS restore */
+ li r3,1
+ bgt cr0,8f
+ li r3,-1
+8:
+ blr
+
EXPORT_SYMBOL(memcmp)
--
1.8.3.1
* [PATCH v1 2/3] powerpc: enhance memcmp() with VMX instruction for long bytes comparison
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
@ 2017-09-19 10:03 ` wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 3/3] powerpc:selftest update memcmp selftest according to kernel change wei.guo.simon
2017-09-19 12:21 ` [PATCH v1 0/3] powerpc: memcmp() optimization Christophe LEROY
3 siblings, 0 replies; 10+ messages in thread
From: wei.guo.simon @ 2017-09-19 10:03 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
This patch adds VMX primitives to memcmp() for the case where the compare
size exceeds 4K bytes.
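For illustration, a userspace-flavoured C sketch of the 32-bytes-per-iteration
VMX loop added below is shown here (assumed equivalent, not the kernel code;
memcmp_vmx_sketch is a hypothetical name, it needs a PowerPC compiler with
-maltivec, assumes both pointers were already brought to 16-byte alignment, and
omits the enter_vmx_ops()/exit handling the kernel needs to save vector state):
------
#include <altivec.h>
#include <stddef.h>
#include <string.h>

/* Sketch only: compare 32 bytes per iteration with VMX, then let a plain
 * byte compare handle the tail and pinpoint any mismatching byte. */
static int memcmp_vmx_sketch(const unsigned char *s1, const unsigned char *s2,
			     size_t n)
{
	while (n >= 32) {
		vector unsigned char a0 = vec_ld(0, s1), b0 = vec_ld(0, s2);
		vector unsigned char a1 = vec_ld(16, s1), b1 = vec_ld(16, s2);

		if (!vec_all_eq(a0, b0) || !vec_all_eq(a1, b1))
			break;			/* locate the byte below */
		s1 += 32;
		s2 += 32;
		n -= 32;
	}
	return n ? memcmp(s1, s2, n) : 0;	/* tail or mismatching chunk */
}
------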
Test result with following test program:
------
tools/testing/selftests/powerpc/stringloops# cat memcmp.c
int test_memcmp(const void *s1, const void *s2, size_t n);
static int testcase(void)
{
char *s1;
char *s2;
unsigned long i;
s1 = memalign(128, SIZE);
if (!s1) {
perror("memalign");
exit(1);
}
s2 = memalign(128, SIZE);
if (!s2) {
perror("memalign");
exit(1);
}
for (i = 0; i < SIZE; i++) {
s1[i] = i & 0xff;
s2[i] = i & 0xff;
}
for (i = 0; i < ITERATIONS; i++)
test_memcmp(s1, s2, SIZE);
return 0;
}
int main(void)
{
return test_harness(testcase, "memcmp");
}
------
Without VMX patch:
5.085776331 seconds time elapsed ( +- 0.28% )
With VMX patch:
4.584002052 seconds time elapsed ( +- 0.02% )
There is a ~10% improvement.
However, I am not yet aware of an in-kernel use case for memcmp() on such
large sizes.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
arch/powerpc/include/asm/asm-prototypes.h | 2 +-
arch/powerpc/lib/copypage_power7.S | 2 +-
arch/powerpc/lib/memcmp_64.S | 79 +++++++++++++++++++++++++++++++
arch/powerpc/lib/memcpy_power7.S | 2 +-
arch/powerpc/lib/vmx-helper.c | 2 +-
5 files changed, 83 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index 7330150..e6530d8 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -49,7 +49,7 @@ void __trace_hcall_exit(long opcode, unsigned long retval,
/* VMX copying */
int enter_vmx_usercopy(void);
int exit_vmx_usercopy(void);
-int enter_vmx_copy(void);
+int enter_vmx_ops(void);
void * exit_vmx_copy(void *dest);
/* Traps */
diff --git a/arch/powerpc/lib/copypage_power7.S b/arch/powerpc/lib/copypage_power7.S
index ca5fc8f..9e7729e 100644
--- a/arch/powerpc/lib/copypage_power7.S
+++ b/arch/powerpc/lib/copypage_power7.S
@@ -60,7 +60,7 @@ _GLOBAL(copypage_power7)
std r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
std r0,16(r1)
stdu r1,-STACKFRAMESIZE(r1)
- bl enter_vmx_copy
+ bl enter_vmx_ops
cmpwi r3,0
ld r0,STACKFRAMESIZE+16(r1)
ld r3,STK_REG(R31)(r1)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 6dbafdb..b86a1d3 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -153,6 +153,13 @@ _GLOBAL(memcmp)
blr
.Llong:
+#ifdef CONFIG_ALTIVEC
+ /* Try to use vmx loop if length is larger than 4K */
+ cmpldi cr6,r5,4096
+ bgt cr6,.Lvmx_cmp
+
+.Llong_novmx_cmp:
+#endif
li off8,8
li off16,16
li off24,24
@@ -310,4 +317,76 @@ _GLOBAL(memcmp)
8:
blr
+#ifdef CONFIG_ALTIVEC
+.Lvmx_cmp:
+ mflr r0
+ std r3,-STACKFRAMESIZE+STK_REG(R31)(r1)
+ std r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
+ std r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
+ std r0,16(r1)
+ stdu r1,-STACKFRAMESIZE(r1)
+ bl enter_vmx_ops
+ cmpwi cr1,r3,0
+ ld r0,STACKFRAMESIZE+16(r1)
+ ld r3,STK_REG(R31)(r1)
+ ld r4,STK_REG(R30)(r1)
+ ld r5,STK_REG(R29)(r1)
+ addi r1,r1,STACKFRAMESIZE
+ mtlr r0
+ beq cr1,.Llong_novmx_cmp
+
+3:
+ /* Enter with src/dst address 8 bytes aligned, and len is
+ * no less than 4KB. Need to align with 16 bytes further.
+ */
+ andi. rA,r3,8
+ beq 4f
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+
+ addi r3,r3,8
+ addi r4,r4,8
+
+4:
+ /* compare 32 bytes for each loop */
+ srdi r0,r5,5
+ mtctr r0
+ andi. r5,r5,31
+ li off16,16
+5:
+ lvx v0,0,r3
+ lvx v1,0,r4
+ vcmpequd. v0,v0,v1
+ bf 24,7f
+ lvx v0,off16,r3
+ lvx v1,off16,r4
+ vcmpequd. v0,v0,v1
+ bf 24,6f
+ addi r3,r3,32
+ addi r4,r4,32
+ bdnz 5b
+
+ cmpdi r5,0
+ beq .Lzero
+ b .Lshort
+
+6:
+ addi r3,r3,16
+ addi r4,r4,16
+
+7:
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+
+ li off8,8
+ LD rA,off8,r3
+ LD rB,off8,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ b .Lzero
+#endif
EXPORT_SYMBOL(memcmp)
diff --git a/arch/powerpc/lib/memcpy_power7.S b/arch/powerpc/lib/memcpy_power7.S
index 193909a..682e386 100644
--- a/arch/powerpc/lib/memcpy_power7.S
+++ b/arch/powerpc/lib/memcpy_power7.S
@@ -230,7 +230,7 @@ _GLOBAL(memcpy_power7)
std r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
std r0,16(r1)
stdu r1,-STACKFRAMESIZE(r1)
- bl enter_vmx_copy
+ bl enter_vmx_ops
cmpwi cr1,r3,0
ld r0,STACKFRAMESIZE+16(r1)
ld r3,STK_REG(R31)(r1)
diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
index bf925cd..923a9ab 100644
--- a/arch/powerpc/lib/vmx-helper.c
+++ b/arch/powerpc/lib/vmx-helper.c
@@ -53,7 +53,7 @@ int exit_vmx_usercopy(void)
return 0;
}
-int enter_vmx_copy(void)
+int enter_vmx_ops(void)
{
if (in_interrupt())
return 0;
--
1.8.3.1
* [PATCH v1 3/3] powerpc:selftest update memcmp selftest according to kernel change
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 2/3] powerpc: enhance memcmp() with VMX instruction for long bytes comparison wei.guo.simon
@ 2017-09-19 10:03 ` wei.guo.simon
2017-09-19 12:21 ` [PATCH v1 0/3] powerpc: memcmp() optimization Christophe LEROY
3 siblings, 0 replies; 10+ messages in thread
From: wei.guo.simon @ 2017-09-19 10:03 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
This patch adjusts the selftest files related to memcmp() so that the
memcmp selftest compiles successfully.
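For context, the userspace harness only needs a trivial stand-in for the
kernel's VMX entry helper; the asm stub added to the stringloops header below
amounts to roughly this C (an illustrative rendering, not part of the patch):
------
/* Always claim VMX is usable: there is no kernel context to save here. */
int enter_vmx_ops(void)
{
	return 1;
}
------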
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
.../selftests/powerpc/copyloops/asm/ppc_asm.h | 2 +-
.../selftests/powerpc/stringloops/asm/ppc_asm.h | 31 ++++++++++++++++++++++
2 files changed, 32 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
index 80d34a9..a9da02d 100644
--- a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
@@ -35,7 +35,7 @@
li r3,0
blr
-FUNC_START(enter_vmx_copy)
+FUNC_START(enter_vmx_ops)
li r3,1
blr
diff --git a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
index 11bece8..793ee54 100644
--- a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
@@ -1,3 +1,5 @@
+#ifndef _PPC_ASM_H
+#define _PPC_ASM_H
#include <ppc-asm.h>
#ifndef r1
@@ -5,3 +7,32 @@
#endif
#define _GLOBAL(A) FUNC_START(test_ ## A)
+
+#define CONFIG_ALTIVEC
+
+#define R14 r14
+#define R15 r15
+#define R16 r16
+#define R17 r17
+#define R18 r18
+#define R19 r19
+#define R20 r20
+#define R21 r21
+#define R22 r22
+#define R29 r29
+#define R30 r30
+#define R31 r31
+
+#define STACKFRAMESIZE 256
+#define STK_REG(i) (112 + ((i)-14)*8)
+
+#define _GLOBAL(A) FUNC_START(test_ ## A)
+#define _GLOBAL_TOC(A) _GLOBAL(A)
+
+#define PPC_MTOCRF(A, B) mtocrf A, B
+
+FUNC_START(enter_vmx_ops)
+ li r3, 1
+ blr
+
+#endif
--
1.8.3.1
* RE: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
@ 2017-09-19 10:12 ` David Laight
2017-09-20 9:56 ` Simon Guo
2017-09-19 12:20 ` Christophe LEROY
1 sibling, 1 reply; 10+ messages in thread
From: David Laight @ 2017-09-19 10:12 UTC (permalink / raw)
To: 'wei.guo.simon@gmail.com', linuxppc-dev@lists.ozlabs.org
Cc: Naveen N. Rao
From: wei.guo.simon@gmail.com
> Sent: 19 September 2017 11:04
> Currently memcmp() on powerpc will fall back to .Lshort (compare per byte
> mode) if either the src or dst address is not 8-byte aligned. It can be
> optimized when both addresses have the same offset from an 8-byte boundary.
>
> memcmp() can first align the src/dst addresses to an 8-byte boundary and then
> compare in .Llong mode.
Why not mask both addresses with ~7 and mask/shift the read value to ignore
the unwanted high (BE) or low (LE) bits.
The same can be done at the end of the compare with any final, partial word.
	David
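For what it's worth, a C sketch of that masked-read idea might look like the
following (illustration only, with a hypothetical helper name; it assumes both
pointers share the same offset within an 8-byte word, and reading the whole
aligned word containing the first byte is fine for the assembly version since
it stays within the same aligned granule, though it is not strictly portable C):
------
#include <stdint.h>

/* Hypothetical sketch: load the aligned 8-byte words that contain the
 * first bytes and mask off the leading bytes that precede the buffers.
 * Equality only; the ordering of a mismatch is resolved elsewhere. */
static int head_words_equal(const unsigned char *s1, const unsigned char *s2)
{
	unsigned int off = (uintptr_t)s1 & 7;	/* shared offset in the word */
	uint64_t a = *(const uint64_t *)((uintptr_t)s1 & ~(uintptr_t)7);
	uint64_t b = *(const uint64_t *)((uintptr_t)s2 & ~(uintptr_t)7);
#ifdef __LITTLE_ENDIAN__
	uint64_t mask = ~0ULL << (8 * off);	/* unwanted bytes are the low ones */
#else
	uint64_t mask = ~0ULL >> (8 * off);	/* unwanted bytes are the high ones */
#endif
	return (a & mask) == (b & mask);
}
------
The same masking, mirrored, would handle a final partial word.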
* Re: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
2017-09-19 10:12 ` David Laight
@ 2017-09-19 12:20 ` Christophe LEROY
1 sibling, 0 replies; 10+ messages in thread
From: Christophe LEROY @ 2017-09-19 12:20 UTC (permalink / raw)
To: wei.guo.simon, linuxppc-dev; +Cc: Naveen N. Rao
Hi
Could you write powerpc/64 instead of powerpc in the email/patch subject,
as it doesn't apply to powerpc/32?
Le 19/09/2017 à 12:03, wei.guo.simon@gmail.com a écrit :
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> Currently memcmp() on powerpc will fall back to .Lshort (compare per byte
Say powerpc/64 here too.
Christophe
> mode) if either the src or dst address is not 8-byte aligned. It can be
> optimized when both addresses have the same offset from an 8-byte boundary.
>
> memcmp() can first align the src/dst addresses to an 8-byte boundary and then
> compare in .Llong mode.
>
> This patch optimizes memcmp() behavior in this situation.
>
> Test result:
>
> (1) 256 bytes
> Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
> - without patch
> 50.715169506 seconds time elapsed ( +- 0.04% )
> - with patch
> 28.906602373 seconds time elapsed ( +- 0.02% )
> -> There is a ~75% improvement (50.72s / 28.91s is about 1.75x).
>
> (2) 32 bytes
> To observe performance impact on < 32 bytes, modify
> tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
> -------
> #include <string.h>
> #include "utils.h"
>
> -#define SIZE 256
> +#define SIZE 32
> #define ITERATIONS 10000
>
> int test_memcmp(const void *s1, const void *s2, size_t n);
> --------
>
> - Without patch
> 0.390677136 seconds time elapsed ( +- 0.03% )
> - with patch
> 0.375685926 seconds time elapsed ( +- 0.05% )
> -> There is a ~4% improvement.
>
> (3) 0~8 bytes
> To observe <8 bytes performance impact, modify
> tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
> -------
> #include <string.h>
> #include "utils.h"
>
> -#define SIZE 256
> -#define ITERATIONS 10000
> +#define SIZE 8
> +#define ITERATIONS 100000
>
> int test_memcmp(const void *s1, const void *s2, size_t n);
> -------
> - Without patch
> 3.169203981 seconds time elapsed ( +- 0.23% )
> - With patch
> 3.208257362 seconds time elapsed ( +- 0.13% )
> -> There is a ~1% regression.
> (I don't know why yet, since the code path for 0~8 byte memcmp() has the
> same number of instructions with and without this patch. Any comments
> would be appreciated.)
>
> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
> ---
> arch/powerpc/lib/memcmp_64.S | 86 +++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 82 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index d75d18b..6dbafdb 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -24,25 +24,95 @@
> #define rH r31
>
> #ifdef __LITTLE_ENDIAN__
> +#define LH lhbrx
> +#define LW lwbrx
> #define LD ldbrx
> #else
> +#define LH lhzx
> +#define LW lwzx
> #define LD ldx
> #endif
>
> _GLOBAL(memcmp)
> cmpdi cr1,r5,0
>
> - /* Use the short loop if both strings are not 8B aligned */
> - or r6,r3,r4
> + /* Use the short loop if the src/dst addresses are not
> + * with the same offset of 8 bytes align boundary.
> + */
> + xor r6,r3,r4
> andi. r6,r6,7
>
> - /* Use the short loop if length is less than 32B */
> - cmpdi cr6,r5,31
> + /* fall back to short loop if compare at aligned addrs
> + * with no greater than 8 bytes.
> + */
> + cmpdi cr6,r5,8
>
> beq cr1,.Lzero
> bne .Lshort
> + ble cr6,.Lshort
> +
> +.Lalignbytes_start:
> + /* The bits 0/1/2 of src/dst addr are the same. */
> + neg r0,r3
> + andi. r0,r0,7
> + beq .Lalign8bytes
> +
> + PPC_MTOCRF(1,r0)
> + bf 31,.Lalign2bytes
> + lbz rA,0(r3)
> + lbz rB,0(r4)
> + cmplw cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + addi r3,r3,1
> + addi r4,r4,1
> + subi r5,r5,1
> +.Lalign2bytes:
> + bf 30,.Lalign4bytes
> + LH rA,0,r3
> + LH rB,0,r4
> + cmplw cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + bne .Lnon_zero
> + addi r3,r3,2
> + addi r4,r4,2
> + subi r5,r5,2
> +.Lalign4bytes:
> + bf 29,.Lalign8bytes
> + LW rA,0,r3
> + LW rB,0,r4
> + cmpld cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + addi r3,r3,4
> + addi r4,r4,4
> + subi r5,r5,4
> +.Lalign8bytes:
> + /* Now addrs are aligned with 8 bytes. Use the short loop if left
> + * bytes are less than 8B.
> + */
> + cmpdi cr6,r5,7
> + ble cr6,.Lshort
> +
> + /* Use .Llong loop if left cmp bytes are equal or greater than 32B */
> + cmpdi cr6,r5,31
> bgt cr6,.Llong
>
> +.Lcmploop_8bytes_31bytes:
> + /* handle 8 ~ 31 bytes with 8 bytes aligned addrs */
> + srdi. r0,r5,3
> + clrldi r5,r5,61
> + mtctr r0
> +831:
> + LD rA,0,r3
> + LD rB,0,r4
> + cmpld cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + addi r3,r3,8
> + addi r4,r4,8
> + bdnz 831b
> +
> + cmpwi r5,0
> + beq .Lzero
> +
> .Lshort:
> mtctr r5
>
> @@ -232,4 +302,12 @@ _GLOBAL(memcmp)
> ld r28,-32(r1)
> ld r27,-40(r1)
> blr
> +
> +.LcmpAB_lightweight: /* skip NV GPRS restore */
> + li r3,1
> + bgt cr0,8f
> + li r3,-1
> +8:
> + blr
> +
> EXPORT_SYMBOL(memcmp)
>
* Re: [PATCH v1 0/3] powerpc: memcmp() optimization
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
` (2 preceding siblings ...)
2017-09-19 10:03 ` [PATCH v1 3/3] powerpc:selftest update memcmp selftest according to kernel change wei.guo.simon
@ 2017-09-19 12:21 ` Christophe LEROY
2017-09-20 9:57 ` Simon Guo
3 siblings, 1 reply; 10+ messages in thread
From: Christophe LEROY @ 2017-09-19 12:21 UTC (permalink / raw)
To: wei.guo.simon, linuxppc-dev; +Cc: Naveen N. Rao
Hi
Could you in the email/patch subject and in the commit texts write
powerpc/64 instead of powerpc as it doesn't apply to powerpc/32
Christophe
Le 19/09/2017 à 12:03, wei.guo.simon@gmail.com a écrit :
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> There is some room to optimize memcmp() on powerpc for the following 2 cases:
> (1) Even if the src/dst addresses are not 8-byte aligned at the beginning,
> memcmp() can align them and use the .Llong comparison mode instead of falling
> back to the .Lshort comparison mode, which compares the buffers byte by byte.
> (2) VMX instructions can be used to speed up comparisons of large sizes.
>
> This patch set also updates the selftest case so that it compiles.
>
>
> Simon Guo (3):
> powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
> powerpc: enhance memcmp() with VMX instruction for long bytes
> comparison
> powerpc:selftest update memcmp selftest according to kernel change
>
> arch/powerpc/include/asm/asm-prototypes.h | 2 +-
> arch/powerpc/lib/copypage_power7.S | 2 +-
> arch/powerpc/lib/memcmp_64.S | 165 ++++++++++++++++++++-
> arch/powerpc/lib/memcpy_power7.S | 2 +-
> arch/powerpc/lib/vmx-helper.c | 2 +-
> .../selftests/powerpc/copyloops/asm/ppc_asm.h | 2 +-
> .../selftests/powerpc/stringloops/asm/ppc_asm.h | 31 ++++
> 7 files changed, 197 insertions(+), 9 deletions(-)
>
* Re: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-19 10:12 ` David Laight
@ 2017-09-20 9:56 ` Simon Guo
2017-09-20 10:05 ` David Laight
0 siblings, 1 reply; 10+ messages in thread
From: Simon Guo @ 2017-09-20 9:56 UTC (permalink / raw)
To: David Laight; +Cc: linuxppc-dev@lists.ozlabs.org, Naveen N. Rao
On Tue, Sep 19, 2017 at 10:12:50AM +0000, David Laight wrote:
> From: wei.guo.simon@gmail.com
> > Sent: 19 September 2017 11:04
> > Currently memcmp() on powerpc will fall back to .Lshort (compare per byte
> > mode) if either the src or dst address is not 8-byte aligned. It can be
> > optimized when both addresses have the same offset from an 8-byte boundary.
> >
> > memcmp() can first align the src/dst addresses to an 8-byte boundary and then
> > compare in .Llong mode.
>
> Why not mask both addresses with ~7 and mask/shift the read value to ignore
> the unwanted high (BE) or low (LE) bits.
>
> The same can be done at the end of the compare with any final, partial word.
>
> David
>
Yes, that will be better. A prototype shows a ~5% improvement for 32-byte
comparisons over v1. I will rework it in v2.
Thanks for the suggestion.
BR,
- Simon
* Re: [PATCH v1 0/3] powerpc: memcmp() optimization
2017-09-19 12:21 ` [PATCH v1 0/3] powerpc: memcmp() optimization Christophe LEROY
@ 2017-09-20 9:57 ` Simon Guo
0 siblings, 0 replies; 10+ messages in thread
From: Simon Guo @ 2017-09-20 9:57 UTC (permalink / raw)
To: Christophe LEROY; +Cc: linuxppc-dev, Naveen N. Rao
Hi Chris,
On Tue, Sep 19, 2017 at 02:21:33PM +0200, Christophe LEROY wrote:
> Hi
>
> Could you in the email/patch subject and in the commit texts write
> powerpc/64 instead of powerpc as it doesn't apply to powerpc/32
>
> Christophe
>
Sure. I will update in v2.
BR,
- Simon
* RE: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-20 9:56 ` Simon Guo
@ 2017-09-20 10:05 ` David Laight
0 siblings, 0 replies; 10+ messages in thread
From: David Laight @ 2017-09-20 10:05 UTC (permalink / raw)
To: 'Simon Guo'; +Cc: linuxppc-dev@lists.ozlabs.org, Naveen N. Rao
From: Simon Guo
> Sent: 20 September 2017 10:57
> On Tue, Sep 19, 2017 at 10:12:50AM +0000, David Laight wrote:
> > From: wei.guo.simon@gmail.com
> > > Sent: 19 September 2017 11:04
> > > Currently memcmp() on powerpc will fall back to .Lshort (compare per byte
> > > mode) if either the src or dst address is not 8-byte aligned. It can be
> > > optimized when both addresses have the same offset from an 8-byte boundary.
> > >
> > > memcmp() can first align the src/dst addresses to an 8-byte boundary and then
> > > compare in .Llong mode.
> >
> > Why not mask both addresses with ~7 and mask/shift the read value to ignore
> > the unwanted high (BE) or low (LE) bits.
> >
> > The same can be done at the end of the compare with any final, partial word.
>
> Yes, that will be better. A prototype shows a ~5% improvement for 32-byte
> comparisons over v1. I will rework it in v2.
Clearly you have to be careful to return the correct +1/-1 on mismatch.
For systems that can do misaligned transfers you can compare the first
word, then compare aligned words and finally the last word.
Rather like a memcpy() function I wrote (for NetBSD) that copied
the last word first, then a whole number of words aligned at the start.
(Hope no one expected anything special for overlapping copies.)
David
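A possible C shape of that suggestion, for machines where unaligned loads are
cheap, is sketched below (illustration only, with a hypothetical helper name;
it assumes n >= 8, answers only equal/not-equal, and deliberately re-reads
overlapping head/tail words, the +1/-1 ordering being the extra care mentioned
above):
------
#include <stdint.h>
#include <string.h>

/* Sketch: unaligned first word, aligned middle words, then a possibly
 * overlapping unaligned last word; memcpy() expresses the unaligned reads. */
static int all_equal_unaligned(const unsigned char *s1,
			       const unsigned char *s2, size_t n)
{
	size_t i, head = 8 - ((uintptr_t)s1 & 7);
	uint64_t a, b;

	memcpy(&a, s1, 8);			/* first (maybe unaligned) word */
	memcpy(&b, s2, 8);
	if (a != b)
		return 0;

	for (i = head; i + 8 <= n; i += 8) {	/* s1 + i is 8-byte aligned */
		memcpy(&a, s1 + i, 8);
		memcpy(&b, s2 + i, 8);
		if (a != b)
			return 0;
	}

	memcpy(&a, s1 + n - 8, 8);		/* last word, may overlap */
	memcpy(&b, s2 + n - 8, 8);
	return a == b;
}
------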
End of thread.
Thread overview: 10+ messages
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
2017-09-19 10:12 ` David Laight
2017-09-20 9:56 ` Simon Guo
2017-09-20 10:05 ` David Laight
2017-09-19 12:20 ` Christophe LEROY
2017-09-19 10:03 ` [PATCH v1 2/3] powerpc: enhance memcmp() with VMX instruction for long bytes comparison wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 3/3] powerpc:selftest update memcmp selftest according to kernel change wei.guo.simon
2017-09-19 12:21 ` [PATCH v1 0/3] powerpc: memcmp() optimization Christophe LEROY
2017-09-20 9:57 ` Simon Guo