* [PATCH v1 0/3] powerpc: memcmp() optimization
@ 2017-09-19 10:03 wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: wei.guo.simon @ 2017-09-19 10:03 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
There is some room to optimize memcmp() on powerpc for the following 2 cases:
(1) Even if the src/dst addresses are not 8-byte aligned at the beginning,
memcmp() can align them and use the .Llong comparison mode instead of falling
back to the .Lshort comparison mode, which compares the buffers byte by byte.
(2) VMX instructions can be used to speed up comparisons of large sizes.
This patch set also updates the selftest case so that it compiles.
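For illustration, the strategy can be sketched in C roughly as below. This is only a
simplified model of the series (the real code is PPC64 assembly in
arch/powerpc/lib/memcmp_64.S); memcmp_sketch is a hypothetical name and the VMX
branch of case (2) is shown only as a comment:
------
#include <stddef.h>
#include <stdint.h>

/* Simplified C model of the dispatch strategy; illustration only. */
int memcmp_sketch(const void *p1, const void *p2, size_t n)
{
	const unsigned char *s1 = p1, *s2 = p2;

	/* Offsets differ within an 8-byte word: stay byte-by-byte (.Lshort). */
	if (((uintptr_t)s1 ^ (uintptr_t)s2) & 7)
		goto bytewise;

	/* Case (1): consume leading bytes until both are 8-byte aligned. */
	for (; n && ((uintptr_t)s1 & 7); n--, s1++, s2++)
		if (*s1 != *s2)
			return *s1 < *s2 ? -1 : 1;

	/* Case (2) would divert to a VMX loop here when n > 4096 (patch 2). */

	/* .Llong-style loop: 8 bytes at a time; the equality test is
	 * endian-neutral, and the first differing byte is found below. */
	for (; n >= 8; n -= 8, s1 += 8, s2 += 8)
		if (*(const uint64_t *)s1 != *(const uint64_t *)s2)
			break;
bytewise:
	for (; n; n--, s1++, s2++)
		if (*s1 != *s2)
			return *s1 < *s2 ? -1 : 1;
	return 0;
}
------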
Simon Guo (3):
powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
powerpc: enhance memcmp() with VMX instruction for long bytes
comparison
powerpc:selftest update memcmp selftest according to kernel change
arch/powerpc/include/asm/asm-prototypes.h | 2 +-
arch/powerpc/lib/copypage_power7.S | 2 +-
arch/powerpc/lib/memcmp_64.S | 165 ++++++++++++++++++++-
arch/powerpc/lib/memcpy_power7.S | 2 +-
arch/powerpc/lib/vmx-helper.c | 2 +-
.../selftests/powerpc/copyloops/asm/ppc_asm.h | 2 +-
.../selftests/powerpc/stringloops/asm/ppc_asm.h | 31 ++++
7 files changed, 197 insertions(+), 9 deletions(-)
--
1.8.3.1
* [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
@ 2017-09-19 10:03 ` wei.guo.simon
2017-09-19 10:12 ` David Laight
2017-09-19 12:20 ` Christophe LEROY
2017-09-19 10:03 ` [PATCH v1 2/3] powerpc: enhance memcmp() with VMX instruction for long bytes comparison wei.guo.simon
` (2 subsequent siblings)
3 siblings, 2 replies; 10+ messages in thread
From: wei.guo.simon @ 2017-09-19 10:03 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
Currently memcmp() on powerpc will fall back to .Lshort (compare per byte
mode) if either the src or dst address is not 8-byte aligned. It can be
optimized when both addresses have the same offset from an 8-byte boundary.
memcmp() can first align the src/dst addresses to an 8-byte boundary and then
compare in .Llong mode.
This patch optimizes memcmp() behavior in this situation.
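As an illustration of the new leading-alignment step, a C sketch of what the
.Lalignbytes_start block below does is given here. It is a hypothetical helper
(cmp_head is not part of the patch), it assumes ((s1 ^ s2) & 7) == 0 and n > 8
have already been checked, and it resolves a mismatch with the C library
memcmp() instead of the .LcmpAB_lightweight path:
------
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: consume up to 7 leading bytes so both pointers
 * reach an 8-byte boundary, mirroring the .Lalignbytes_start /
 * .Lalign2bytes / .Lalign4bytes steps in the hunk below. */
static int cmp_head(const unsigned char **s1, const unsigned char **s2,
		    size_t *n)
{
	/* Same idea as "neg r0,r3; andi. r0,r0,7": bytes to the boundary. */
	unsigned int pad = (unsigned int)(-(uintptr_t)*s1) & 7;
	unsigned int step;

	for (step = 1; step <= 4; step <<= 1) {
		if (pad & step) {
			uint64_t a = 0, b = 0;

			memcpy(&a, *s1, step);	/* lbz / LH / LW in the asm */
			memcpy(&b, *s2, step);
			if (a != b)		/* first difference decides */
				return memcmp(*s1, *s2, step);
			*s1 += step;
			*s2 += step;
			*n -= step;
		}
	}
	return 0;	/* equal so far; caller continues past the head */
}
------
A non-zero return is already the final memcmp() result; on zero the caller
proceeds with the aligned 8-byte (.Llong-style) loop.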
Test result:
(1) 256 bytes
Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
- without patch
50.715169506 seconds time elapsed ( +- 0.04% )
- with patch
28.906602373 seconds time elapsed ( +- 0.02% )
-> There is a ~75% improvement (50.72s / 28.91s is about 1.75x).
(2) 32 bytes
To observe performance impact on < 32 bytes, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
-------
#include <string.h>
#include "utils.h"
-#define SIZE 256
+#define SIZE 32
#define ITERATIONS 10000
int test_memcmp(const void *s1, const void *s2, size_t n);
--------
- Without patch
0.390677136 seconds time elapsed ( +- 0.03% )
- with patch
0.375685926 seconds time elapsed ( +- 0.05% )
-> There is a ~4% improvement.
(3) 0~8 bytes
To observe <8 bytes performance impact, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
-------
#include <string.h>
#include "utils.h"
-#define SIZE 256
-#define ITERATIONS 10000
+#define SIZE 8
+#define ITERATIONS 100000
int test_memcmp(const void *s1, const void *s2, size_t n);
-------
- Without patch
3.169203981 seconds time elapsed ( +- 0.23% )
- With patch
3.208257362 seconds time elapsed ( +- 0.13% )
-> There is a ~1% regression.
(I don't know why yet, since the code path for 0~8 byte memcmp() has the
same number of instructions with and without this patch. Any comments
would be appreciated.)
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
arch/powerpc/lib/memcmp_64.S | 86 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 82 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index d75d18b..6dbafdb 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -24,25 +24,95 @@
#define rH r31
#ifdef __LITTLE_ENDIAN__
+#define LH lhbrx
+#define LW lwbrx
#define LD ldbrx
#else
+#define LH lhzx
+#define LW lwzx
#define LD ldx
#endif
_GLOBAL(memcmp)
cmpdi cr1,r5,0
- /* Use the short loop if both strings are not 8B aligned */
- or r6,r3,r4
+ /* Use the short loop if the src/dst addresses are not
+ * with the same offset of 8 bytes align boundary.
+ */
+ xor r6,r3,r4
andi. r6,r6,7
- /* Use the short loop if length is less than 32B */
- cmpdi cr6,r5,31
+ /* fall back to short loop if compare at aligned addrs
+ * with no greater than 8 bytes.
+ */
+ cmpdi cr6,r5,8
beq cr1,.Lzero
bne .Lshort
+ ble cr6,.Lshort
+
+.Lalignbytes_start:
+ /* The bits 0/1/2 of src/dst addr are the same. */
+ neg r0,r3
+ andi. r0,r0,7
+ beq .Lalign8bytes
+
+ PPC_MTOCRF(1,r0)
+ bf 31,.Lalign2bytes
+ lbz rA,0(r3)
+ lbz rB,0(r4)
+ cmplw cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ addi r3,r3,1
+ addi r4,r4,1
+ subi r5,r5,1
+.Lalign2bytes:
+ bf 30,.Lalign4bytes
+ LH rA,0,r3
+ LH rB,0,r4
+ cmplw cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ bne .Lnon_zero
+ addi r3,r3,2
+ addi r4,r4,2
+ subi r5,r5,2
+.Lalign4bytes:
+ bf 29,.Lalign8bytes
+ LW rA,0,r3
+ LW rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ addi r3,r3,4
+ addi r4,r4,4
+ subi r5,r5,4
+.Lalign8bytes:
+ /* Now addrs are aligned with 8 bytes. Use the short loop if left
+ * bytes are less than 8B.
+ */
+ cmpdi cr6,r5,7
+ ble cr6,.Lshort
+
+ /* Use .Llong loop if left cmp bytes are equal or greater than 32B */
+ cmpdi cr6,r5,31
bgt cr6,.Llong
+.Lcmploop_8bytes_31bytes:
+ /* handle 8 ~ 31 bytes with 8 bytes aligned addrs */
+ srdi. r0,r5,3
+ clrldi r5,r5,61
+ mtctr r0
+831:
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ addi r3,r3,8
+ addi r4,r4,8
+ bdnz 831b
+
+ cmpwi r5,0
+ beq .Lzero
+
.Lshort:
mtctr r5
@@ -232,4 +302,12 @@ _GLOBAL(memcmp)
ld r28,-32(r1)
ld r27,-40(r1)
blr
+
+.LcmpAB_lightweight: /* skip NV GPRS restore */
+ li r3,1
+ bgt cr0,8f
+ li r3,-1
+8:
+ blr
+
EXPORT_SYMBOL(memcmp)
--
1.8.3.1
* [PATCH v1 2/3] powerpc: enhance memcmp() with VMX instruction for long bytes comparison
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
@ 2017-09-19 10:03 ` wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 3/3] powerpc:selftest update memcmp selftest according to kernel change wei.guo.simon
2017-09-19 12:21 ` [PATCH v1 0/3] powerpc: memcmp() optimization Christophe LEROY
3 siblings, 0 replies; 10+ messages in thread
From: wei.guo.simon @ 2017-09-19 10:03 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
This patch adds VMX primitives to memcmp() for the case where the compare
size exceeds 4K bytes.
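For illustration, a userspace-flavoured C sketch of the 32-bytes-per-iteration
VMX loop added below is shown here (assumed equivalent, not the kernel code;
memcmp_vmx_sketch is a hypothetical name, it needs a PowerPC compiler with
-maltivec, assumes both pointers were already brought to 16-byte alignment, and
omits the enter_vmx_ops()/exit handling the kernel needs to save vector state):
------
#include <altivec.h>
#include <stddef.h>
#include <string.h>

/* Sketch only: compare 32 bytes per iteration with VMX, then let a plain
 * byte compare handle the tail and pinpoint any mismatching byte. */
static int memcmp_vmx_sketch(const unsigned char *s1, const unsigned char *s2,
			     size_t n)
{
	while (n >= 32) {
		vector unsigned char a0 = vec_ld(0, s1), b0 = vec_ld(0, s2);
		vector unsigned char a1 = vec_ld(16, s1), b1 = vec_ld(16, s2);

		if (!vec_all_eq(a0, b0) || !vec_all_eq(a1, b1))
			break;			/* locate the byte below */
		s1 += 32;
		s2 += 32;
		n -= 32;
	}
	return n ? memcmp(s1, s2, n) : 0;	/* tail or mismatching chunk */
}
------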
Test result with following test program:
------
tools/testing/selftests/powerpc/stringloops# cat memcmp.c
int test_memcmp(const void *s1, const void *s2, size_t n);
static int testcase(void)
{
char *s1;
char *s2;
unsigned long i;
s1 = memalign(128, SIZE);
if (!s1) {
perror("memalign");
exit(1);
}
s2 = memalign(128, SIZE);
if (!s2) {
perror("memalign");
exit(1);
}
for (i = 0; i < SIZE; i++) {
s1[i] = i & 0xff;
s2[i] = i & 0xff;
}
for (i = 0; i < ITERATIONS; i++)
test_memcmp(s1, s2, SIZE);
return 0;
}
int main(void)
{
return test_harness(testcase, "memcmp");
}
------
Without VMX patch:
5.085776331 seconds time elapsed ( +- 0.28% )
With VMX patch:
4.584002052 seconds time elapsed ( +- 0.02% )
There is a ~10% improvement.
However, I am not yet aware of an in-kernel use case for memcmp() on such
large sizes.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
arch/powerpc/include/asm/asm-prototypes.h | 2 +-
arch/powerpc/lib/copypage_power7.S | 2 +-
arch/powerpc/lib/memcmp_64.S | 79 +++++++++++++++++++++++++++++++
arch/powerpc/lib/memcpy_power7.S | 2 +-
arch/powerpc/lib/vmx-helper.c | 2 +-
5 files changed, 83 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index 7330150..e6530d8 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -49,7 +49,7 @@ void __trace_hcall_exit(long opcode, unsigned long retval,
/* VMX copying */
int enter_vmx_usercopy(void);
int exit_vmx_usercopy(void);
-int enter_vmx_copy(void);
+int enter_vmx_ops(void);
void * exit_vmx_copy(void *dest);
/* Traps */
diff --git a/arch/powerpc/lib/copypage_power7.S b/arch/powerpc/lib/copypage_power7.S
index ca5fc8f..9e7729e 100644
--- a/arch/powerpc/lib/copypage_power7.S
+++ b/arch/powerpc/lib/copypage_power7.S
@@ -60,7 +60,7 @@ _GLOBAL(copypage_power7)
std r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
std r0,16(r1)
stdu r1,-STACKFRAMESIZE(r1)
- bl enter_vmx_copy
+ bl enter_vmx_ops
cmpwi r3,0
ld r0,STACKFRAMESIZE+16(r1)
ld r3,STK_REG(R31)(r1)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 6dbafdb..b86a1d3 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -153,6 +153,13 @@ _GLOBAL(memcmp)
blr
.Llong:
+#ifdef CONFIG_ALTIVEC
+ /* Try to use vmx loop if length is larger than 4K */
+ cmpldi cr6,r5,4096
+ bgt cr6,.Lvmx_cmp
+
+.Llong_novmx_cmp:
+#endif
li off8,8
li off16,16
li off24,24
@@ -310,4 +317,76 @@ _GLOBAL(memcmp)
8:
blr
+#ifdef CONFIG_ALTIVEC
+.Lvmx_cmp:
+ mflr r0
+ std r3,-STACKFRAMESIZE+STK_REG(R31)(r1)
+ std r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
+ std r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
+ std r0,16(r1)
+ stdu r1,-STACKFRAMESIZE(r1)
+ bl enter_vmx_ops
+ cmpwi cr1,r3,0
+ ld r0,STACKFRAMESIZE+16(r1)
+ ld r3,STK_REG(R31)(r1)
+ ld r4,STK_REG(R30)(r1)
+ ld r5,STK_REG(R29)(r1)
+ addi r1,r1,STACKFRAMESIZE
+ mtlr r0
+ beq cr1,.Llong_novmx_cmp
+
+3:
+ /* Enter with src/dst address 8 bytes aligned, and len is
+ * no less than 4KB. Need to align with 16 bytes further.
+ */
+ andi. rA,r3,8
+ beq 4f
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+
+ addi r3,r3,8
+ addi r4,r4,8
+
+4:
+ /* compare 32 bytes for each loop */
+ srdi r0,r5,5
+ mtctr r0
+ andi. r5,r5,31
+ li off16,16
+5:
+ lvx v0,0,r3
+ lvx v1,0,r4
+ vcmpequd. v0,v0,v1
+ bf 24,7f
+ lvx v0,off16,r3
+ lvx v1,off16,r4
+ vcmpequd. v0,v0,v1
+ bf 24,6f
+ addi r3,r3,32
+ addi r4,r4,32
+ bdnz 5b
+
+ cmpdi r5,0
+ beq .Lzero
+ b .Lshort
+
+6:
+ addi r3,r3,16
+ addi r4,r4,16
+
+7:
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+
+ li off8,8
+ LD rA,off8,r3
+ LD rB,off8,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ b .Lzero
+#endif
EXPORT_SYMBOL(memcmp)
diff --git a/arch/powerpc/lib/memcpy_power7.S b/arch/powerpc/lib/memcpy_power7.S
index 193909a..682e386 100644
--- a/arch/powerpc/lib/memcpy_power7.S
+++ b/arch/powerpc/lib/memcpy_power7.S
@@ -230,7 +230,7 @@ _GLOBAL(memcpy_power7)
std r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
std r0,16(r1)
stdu r1,-STACKFRAMESIZE(r1)
- bl enter_vmx_copy
+ bl enter_vmx_ops
cmpwi cr1,r3,0
ld r0,STACKFRAMESIZE+16(r1)
ld r3,STK_REG(R31)(r1)
diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
index bf925cd..923a9ab 100644
--- a/arch/powerpc/lib/vmx-helper.c
+++ b/arch/powerpc/lib/vmx-helper.c
@@ -53,7 +53,7 @@ int exit_vmx_usercopy(void)
return 0;
}
-int enter_vmx_copy(void)
+int enter_vmx_ops(void)
{
if (in_interrupt())
return 0;
--
1.8.3.1
* [PATCH v1 3/3] powerpc:selftest update memcmp selftest according to kernel change
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 2/3] powerpc: enhance memcmp() with VMX instruction for long bytes comparison wei.guo.simon
@ 2017-09-19 10:03 ` wei.guo.simon
2017-09-19 12:21 ` [PATCH v1 0/3] powerpc: memcmp() optimization Christophe LEROY
3 siblings, 0 replies; 10+ messages in thread
From: wei.guo.simon @ 2017-09-19 10:03 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
This patch adjusts the selftest files related to memcmp() so that the
memcmp selftest compiles successfully.
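For context, the userspace harness only needs a trivial stand-in for the
kernel's VMX entry helper; the asm stub added to the stringloops header below
amounts to roughly this C (an illustrative rendering, not part of the patch):
------
/* Always claim VMX is usable: there is no kernel context to save here. */
int enter_vmx_ops(void)
{
	return 1;
}
------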
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
.../selftests/powerpc/copyloops/asm/ppc_asm.h | 2 +-
.../selftests/powerpc/stringloops/asm/ppc_asm.h | 31 ++++++++++++++++++++++
2 files changed, 32 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
index 80d34a9..a9da02d 100644
--- a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
@@ -35,7 +35,7 @@
li r3,0
blr
-FUNC_START(enter_vmx_copy)
+FUNC_START(enter_vmx_ops)
li r3,1
blr
diff --git a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
index 11bece8..793ee54 100644
--- a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
@@ -1,3 +1,5 @@
+#ifndef _PPC_ASM_H
+#define _PPC_ASM_H
#include <ppc-asm.h>
#ifndef r1
@@ -5,3 +7,32 @@
#endif
#define _GLOBAL(A) FUNC_START(test_ ## A)
+
+#define CONFIG_ALTIVEC
+
+#define R14 r14
+#define R15 r15
+#define R16 r16
+#define R17 r17
+#define R18 r18
+#define R19 r19
+#define R20 r20
+#define R21 r21
+#define R22 r22
+#define R29 r29
+#define R30 r30
+#define R31 r31
+
+#define STACKFRAMESIZE 256
+#define STK_REG(i) (112 + ((i)-14)*8)
+
+#define _GLOBAL(A) FUNC_START(test_ ## A)
+#define _GLOBAL_TOC(A) _GLOBAL(A)
+
+#define PPC_MTOCRF(A, B) mtocrf A, B
+
+FUNC_START(enter_vmx_ops)
+ li r3, 1
+ blr
+
+#endif
--
1.8.3.1
* RE: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
@ 2017-09-19 10:12 ` David Laight
2017-09-20 9:56 ` Simon Guo
2017-09-19 12:20 ` Christophe LEROY
1 sibling, 1 reply; 10+ messages in thread
From: David Laight @ 2017-09-19 10:12 UTC (permalink / raw)
To: 'wei.guo.simon@gmail.com', linuxppc-dev@lists.ozlabs.org
Cc: Naveen N. Rao
From: wei.guo.simon@gmail.com
> Sent: 19 September 2017 11:04
> Currently memcmp() on powerpc will fall back to .Lshort (compare per byte
> mode) if either the src or dst address is not 8-byte aligned. It can be
> optimized when both addresses have the same offset from an 8-byte boundary.
>
> memcmp() can first align the src/dst addresses to an 8-byte boundary and then
> compare in .Llong mode.
Why not mask both addresses with ~7 and mask/shift the read value to ignore
the unwanted high (BE) or low (LE) bits.
The same can be done at the end of the compare with any final, partial word.
	David
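For what it's worth, a C sketch of that masked-read idea might look like the
following (illustration only, with a hypothetical helper name; it assumes both
pointers share the same offset within an 8-byte word, and reading the whole
aligned word containing the first byte is fine for the assembly version since
it stays within the same aligned granule, though it is not strictly portable C):
------
#include <stdint.h>

/* Hypothetical sketch: load the aligned 8-byte words that contain the
 * first bytes and mask off the leading bytes that precede the buffers.
 * Equality only; the ordering of a mismatch is resolved elsewhere. */
static int head_words_equal(const unsigned char *s1, const unsigned char *s2)
{
	unsigned int off = (uintptr_t)s1 & 7;	/* shared offset in the word */
	uint64_t a = *(const uint64_t *)((uintptr_t)s1 & ~(uintptr_t)7);
	uint64_t b = *(const uint64_t *)((uintptr_t)s2 & ~(uintptr_t)7);
#ifdef __LITTLE_ENDIAN__
	uint64_t mask = ~0ULL << (8 * off);	/* unwanted bytes are the low ones */
#else
	uint64_t mask = ~0ULL >> (8 * off);	/* unwanted bytes are the high ones */
#endif
	return (a & mask) == (b & mask);
}
------
The same masking, mirrored, would handle a final partial word.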
* Re: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
2017-09-19 10:12 ` David Laight
@ 2017-09-19 12:20 ` Christophe LEROY
1 sibling, 0 replies; 10+ messages in thread
From: Christophe LEROY @ 2017-09-19 12:20 UTC (permalink / raw)
To: wei.guo.simon, linuxppc-dev; +Cc: Naveen N. Rao
Hi
Could you write powerpc/64 instead of powerpc in the email/patch subject,
as it doesn't apply to powerpc/32?
Le 19/09/2017 à 12:03, wei.guo.simon@gmail.com a écrit :
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> Currently memcmp() on powerpc will fall back to .Lshort (compare per byte
Say powerpc/64 here too.
Christophe
> mode) if either the src or dst address is not 8-byte aligned. It can be
> optimized when both addresses have the same offset from an 8-byte boundary.
>
> memcmp() can first align the src/dst addresses to an 8-byte boundary and then
> compare in .Llong mode.
>
> This patch optimizes memcmp() behavior in this situation.
>
> Test result:
>
> (1) 256 bytes
> Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
> - without patch
> 50.715169506 seconds time elapsed ( +- 0.04% )
> - with patch
> 28.906602373 seconds time elapsed ( +- 0.02% )
> -> There is a ~75% improvement (50.72s / 28.91s is about 1.75x).
>
> (2) 32 bytes
> To observe performance impact on < 32 bytes, modify
> tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
> -------
> #include <string.h>
> #include "utils.h"
>
> -#define SIZE 256
> +#define SIZE 32
> #define ITERATIONS 10000
>
> int test_memcmp(const void *s1, const void *s2, size_t n);
> --------
>
> - Without patch
> 0.390677136 seconds time elapsed ( +- 0.03% )
> - with patch
> 0.375685926 seconds time elapsed ( +- 0.05% )
> -> There is a ~4% improvement.
>
> (3) 0~8 bytes
> To observe <8 bytes performance impact, modify
> tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
> -------
> #include <string.h>
> #include "utils.h"
>
> -#define SIZE 256
> -#define ITERATIONS 10000
> +#define SIZE 8
> +#define ITERATIONS 100000
>
> int test_memcmp(const void *s1, const void *s2, size_t n);
> -------
> - Without patch
> 3.169203981 seconds time elapsed ( +- 0.23% )
> - With patch
> 3.208257362 seconds time elapsed ( +- 0.13% )
> -> There is a ~1% regression.
> (I don't know why yet, since the code path for 0~8 byte memcmp() has the
> same number of instructions with and without this patch. Any comments
> would be appreciated.)
>
> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
> ---
> arch/powerpc/lib/memcmp_64.S | 86 +++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 82 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index d75d18b..6dbafdb 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -24,25 +24,95 @@
> #define rH r31
>
> #ifdef __LITTLE_ENDIAN__
> +#define LH lhbrx
> +#define LW lwbrx
> #define LD ldbrx
> #else
> +#define LH lhzx
> +#define LW lwzx
> #define LD ldx
> #endif
>
> _GLOBAL(memcmp)
> cmpdi cr1,r5,0
>
> - /* Use the short loop if both strings are not 8B aligned */
> - or r6,r3,r4
> + /* Use the short loop if the src/dst addresses are not
> + * with the same offset of 8 bytes align boundary.
> + */
> + xor r6,r3,r4
> andi. r6,r6,7
>
> - /* Use the short loop if length is less than 32B */
> - cmpdi cr6,r5,31
> + /* fall back to short loop if compare at aligned addrs
> + * with no greater than 8 bytes.
> + */
> + cmpdi cr6,r5,8
>
> beq cr1,.Lzero
> bne .Lshort
> + ble cr6,.Lshort
> +
> +.Lalignbytes_start:
> + /* The bits 0/1/2 of src/dst addr are the same. */
> + neg r0,r3
> + andi. r0,r0,7
> + beq .Lalign8bytes
> +
> + PPC_MTOCRF(1,r0)
> + bf 31,.Lalign2bytes
> + lbz rA,0(r3)
> + lbz rB,0(r4)
> + cmplw cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + addi r3,r3,1
> + addi r4,r4,1
> + subi r5,r5,1
> +.Lalign2bytes:
> + bf 30,.Lalign4bytes
> + LH rA,0,r3
> + LH rB,0,r4
> + cmplw cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + bne .Lnon_zero
> + addi r3,r3,2
> + addi r4,r4,2
> + subi r5,r5,2
> +.Lalign4bytes:
> + bf 29,.Lalign8bytes
> + LW rA,0,r3
> + LW rB,0,r4
> + cmpld cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + addi r3,r3,4
> + addi r4,r4,4
> + subi r5,r5,4
> +.Lalign8bytes:
> + /* Now addrs are aligned with 8 bytes. Use the short loop if left
> + * bytes are less than 8B.
> + */
> + cmpdi cr6,r5,7
> + ble cr6,.Lshort
> +
> + /* Use .Llong loop if left cmp bytes are equal or greater than 32B */
> + cmpdi cr6,r5,31
> bgt cr6,.Llong
>
> +.Lcmploop_8bytes_31bytes:
> + /* handle 8 ~ 31 bytes with 8 bytes aligned addrs */
> + srdi. r0,r5,3
> + clrldi r5,r5,61
> + mtctr r0
> +831:
> + LD rA,0,r3
> + LD rB,0,r4
> + cmpld cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + addi r3,r3,8
> + addi r4,r4,8
> + bdnz 831b
> +
> + cmpwi r5,0
> + beq .Lzero
> +
> .Lshort:
> mtctr r5
>
> @@ -232,4 +302,12 @@ _GLOBAL(memcmp)
> ld r28,-32(r1)
> ld r27,-40(r1)
> blr
> +
> +.LcmpAB_lightweight: /* skip NV GPRS restore */
> + li r3,1
> + bgt cr0,8f
> + li r3,-1
> +8:
> + blr
> +
> EXPORT_SYMBOL(memcmp)
>
* Re: [PATCH v1 0/3] powerpc: memcmp() optimization
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
` (2 preceding siblings ...)
2017-09-19 10:03 ` [PATCH v1 3/3] powerpc:selftest update memcmp selftest according to kernel change wei.guo.simon
@ 2017-09-19 12:21 ` Christophe LEROY
2017-09-20 9:57 ` Simon Guo
3 siblings, 1 reply; 10+ messages in thread
From: Christophe LEROY @ 2017-09-19 12:21 UTC (permalink / raw)
To: wei.guo.simon, linuxppc-dev; +Cc: Naveen N. Rao
Hi
Could you in the email/patch subject and in the commit texts write
powerpc/64 instead of powerpc as it doesn't apply to powerpc/32
Christophe
Le 19/09/2017 à 12:03, wei.guo.simon@gmail.com a écrit :
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> There is some room to optimize memcmp() on powerpc for the following 2 cases:
> (1) Even if the src/dst addresses are not 8-byte aligned at the beginning,
> memcmp() can align them and use the .Llong comparison mode instead of falling
> back to the .Lshort comparison mode, which compares the buffers byte by byte.
> (2) VMX instructions can be used to speed up comparisons of large sizes.
>
> This patch set also updates the selftest case so that it compiles.
>
>
> Simon Guo (3):
> powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
> powerpc: enhance memcmp() with VMX instruction for long bytes
> comparison
> powerpc:selftest update memcmp selftest according to kernel change
>
> arch/powerpc/include/asm/asm-prototypes.h | 2 +-
> arch/powerpc/lib/copypage_power7.S | 2 +-
> arch/powerpc/lib/memcmp_64.S | 165 ++++++++++++++++++++-
> arch/powerpc/lib/memcpy_power7.S | 2 +-
> arch/powerpc/lib/vmx-helper.c | 2 +-
> .../selftests/powerpc/copyloops/asm/ppc_asm.h | 2 +-
> .../selftests/powerpc/stringloops/asm/ppc_asm.h | 31 ++++
> 7 files changed, 197 insertions(+), 9 deletions(-)
>
* Re: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-19 10:12 ` David Laight
@ 2017-09-20 9:56 ` Simon Guo
2017-09-20 10:05 ` David Laight
0 siblings, 1 reply; 10+ messages in thread
From: Simon Guo @ 2017-09-20 9:56 UTC (permalink / raw)
To: David Laight; +Cc: linuxppc-dev@lists.ozlabs.org, Naveen N. Rao
On Tue, Sep 19, 2017 at 10:12:50AM +0000, David Laight wrote:
> From: wei.guo.simon@gmail.com
> > Sent: 19 September 2017 11:04
> > Currently memcmp() on powerpc will fall back to .Lshort (compare per byte
> > mode) if either the src or dst address is not 8-byte aligned. It can be
> > optimized when both addresses have the same offset from an 8-byte boundary.
> >
> > memcmp() can first align the src/dst addresses to an 8-byte boundary and then
> > compare in .Llong mode.
>
> Why not mask both addresses with ~7 and mask/shift the read value to ignore
> the unwanted high (BE) or low (LE) bits.
>
> The same can be done at the end of the compare with any final, partial word.
>
> David
>
Yes, that will be better. A prototype shows a ~5% improvement for 32-byte
comparisons over v1. I will rework it in v2.
Thanks for the suggestion.
BR,
- Simon
* Re: [PATCH v1 0/3] powerpc: memcmp() optimization
2017-09-19 12:21 ` [PATCH v1 0/3] powerpc: memcmp() optimization Christophe LEROY
@ 2017-09-20 9:57 ` Simon Guo
0 siblings, 0 replies; 10+ messages in thread
From: Simon Guo @ 2017-09-20 9:57 UTC (permalink / raw)
To: Christophe LEROY; +Cc: linuxppc-dev, Naveen N. Rao
Hi Chris,
On Tue, Sep 19, 2017 at 02:21:33PM +0200, Christophe LEROY wrote:
> Hi
>
> Could you in the email/patch subject and in the commit texts write
> powerpc/64 instead of powerpc as it doesn't apply to powerpc/32
>
> Christophe
>
Sure. I will update in v2.
BR,
- Simon
* RE: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-20 9:56 ` Simon Guo
@ 2017-09-20 10:05 ` David Laight
0 siblings, 0 replies; 10+ messages in thread
From: David Laight @ 2017-09-20 10:05 UTC (permalink / raw)
To: 'Simon Guo'; +Cc: linuxppc-dev@lists.ozlabs.org, Naveen N. Rao
From: Simon Guo
> Sent: 20 September 2017 10:57
> On Tue, Sep 19, 2017 at 10:12:50AM +0000, David Laight wrote:
> > From: wei.guo.simon@gmail.com
> > > Sent: 19 September 2017 11:04
> > > Currently memcmp() on powerpc will fall back to .Lshort (compare per byte
> > > mode) if either the src or dst address is not 8-byte aligned. It can be
> > > optimized when both addresses have the same offset from an 8-byte boundary.
> > >
> > > memcmp() can first align the src/dst addresses to an 8-byte boundary and then
> > > compare in .Llong mode.
> >
> > Why not mask both addresses with ~7 and mask/shift the read value to ignore
> > the unwanted high (BE) or low (LE) bits.
> >
> > The same can be done at the end of the compare with any final, partial word.
>
> Yes, that will be better. A prototype shows a ~5% improvement for 32-byte
> comparisons over v1. I will rework it in v2.
Clearly you have to be careful to return the correct +1/-1 on mismatch.
For systems that can do misaligned transfers you can compare the first
word, then compare aligned words and finally the last word.
Rather like a memcpy() function I wrote (for NetBSD) that copied
the last word first, then a whole number of words aligned at the start.
(Hope no one expected anything special for overlapping copies.)
David
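A possible C shape of that suggestion, for machines where unaligned loads are
cheap, is sketched below (illustration only, with a hypothetical helper name;
it assumes n >= 8, answers only equal/not-equal, and deliberately re-reads
overlapping head/tail words, the +1/-1 ordering being the extra care mentioned
above):
------
#include <stdint.h>
#include <string.h>

/* Sketch: unaligned first word, aligned middle words, then a possibly
 * overlapping unaligned last word; memcpy() expresses the unaligned reads. */
static int all_equal_unaligned(const unsigned char *s1,
			       const unsigned char *s2, size_t n)
{
	size_t i, head = 8 - ((uintptr_t)s1 & 7);
	uint64_t a, b;

	memcpy(&a, s1, 8);			/* first (maybe unaligned) word */
	memcpy(&b, s2, 8);
	if (a != b)
		return 0;

	for (i = head; i + 8 <= n; i += 8) {	/* s1 + i is 8-byte aligned */
		memcpy(&a, s1 + i, 8);
		memcpy(&b, s2 + i, 8);
		if (a != b)
			return 0;
	}

	memcpy(&a, s1 + n - 8, 8);		/* last word, may overlap */
	memcpy(&b, s2 + n - 8, 8);
	return a == b;
}
------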
End of thread.
Thread overview: 10+ messages
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
2017-09-19 10:12 ` David Laight
2017-09-20 9:56 ` Simon Guo
2017-09-20 10:05 ` David Laight
2017-09-19 12:20 ` Christophe LEROY
2017-09-19 10:03 ` [PATCH v1 2/3] powerpc: enhance memcmp() with VMX instruction for long bytes comparison wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 3/3] powerpc:selftest update memcmp selftest according to kernel change wei.guo.simon
2017-09-19 12:21 ` [PATCH v1 0/3] powerpc: memcmp() optimization Christophe LEROY
2017-09-20 9:57 ` Simon Guo