* [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
@ 2017-09-19 10:03 ` wei.guo.simon
2017-09-19 10:12 ` David Laight
2017-09-19 12:20 ` Christophe LEROY
2017-09-19 10:03 ` [PATCH v1 2/3] powerpc: enhance memcmp() with VMX instruction for long bytes comparison wei.guo.simon
` (2 subsequent siblings)
3 siblings, 2 replies; 10+ messages in thread
From: wei.guo.simon @ 2017-09-19 10:03 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
Currently memcmp() in powerpc falls back to .Lshort (compare-per-byte
mode) if either the src or dst address is not 8-byte aligned. It can be
optimized when both addresses have the same offset from an 8-byte boundary:
memcmp() can first align the src/dst addresses to 8 bytes and then
compare in .Llong mode.
This patch optimizes memcmp() for this situation.
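The intended control flow can be modelled roughly with the following C
sketch (illustration only: helper and variable names are made up here, and
the mismatch handling is simplified compared with the assembly below):
-------
#include <stddef.h>
#include <stdint.h>
#include <string.h>
/* C model of the new prologue: the 8-byte (.Llong-style) path is only
 * reachable when src and dst share the same offset within an 8-byte
 * word; otherwise fall back to a byte-by-byte compare, like .Lshort.
 */
static int cmp_bytes(const unsigned char *p1, const unsigned char *p2, size_t n)
{
	for (; n; n--, p1++, p2++)
		if (*p1 != *p2)
			return *p1 < *p2 ? -1 : 1;
	return 0;
}
int memcmp_model(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;
	if (((uintptr_t)p1 ^ (uintptr_t)p2) & 7)
		return cmp_bytes(p1, p2, n);	/* offsets differ: .Lshort */
	/* Step forward (the asm uses 1/2/4-byte steps) until both pointers
	 * are 8-byte aligned, then compare one doubleword at a time.
	 */
	while (n && ((uintptr_t)p1 & 7)) {
		if (*p1 != *p2)
			return *p1 < *p2 ? -1 : 1;
		p1++; p2++; n--;
	}
	for (; n >= 8; n -= 8, p1 += 8, p2 += 8) {
		uint64_t a, b;
		memcpy(&a, p1, 8);
		memcpy(&b, p2, 8);
		if (a != b)
			return cmp_bytes(p1, p2, 8); /* resolve ordering bytewise */
	}
	return cmp_bytes(p1, p2, n);
}
-------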
Test result:
(1) 256 bytes
Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
- without patch
50.715169506 seconds time elapsed ( +- 0.04% )
- with patch
28.906602373 seconds time elapsed ( +- 0.02% )
-> There is a ~75% improvement.
(2) 32 bytes
To observe performance impact on < 32 bytes, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
-------
#include <string.h>
#include "utils.h"
-#define SIZE 256
+#define SIZE 32
#define ITERATIONS 10000
int test_memcmp(const void *s1, const void *s2, size_t n);
--------
- Without patch
0.390677136 seconds time elapsed ( +- 0.03% )
- with patch
0.375685926 seconds time elapsed ( +- 0.05% )
-> There is a ~4% improvement.
(3) 0~8 bytes
To observe <8 bytes performance impact, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
-------
#include <string.h>
#include "utils.h"
-#define SIZE 256
-#define ITERATIONS 10000
+#define SIZE 8
+#define ITERATIONS 100000
int test_memcmp(const void *s1, const void *s2, size_t n);
-------
- Without patch
3.169203981 seconds time elapsed ( +- 0.23% )
- With patch
3.208257362 seconds time elapsed ( +- 0.13% )
-> There is a ~1% slowdown.
(I don't know why yet, since the code path for 0~8 byte memcmp() has the
same number of instructions with and without this patch. Any comments
will be appreciated.)
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
arch/powerpc/lib/memcmp_64.S | 86 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 82 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index d75d18b..6dbafdb 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -24,25 +24,95 @@
#define rH r31
#ifdef __LITTLE_ENDIAN__
+#define LH lhbrx
+#define LW lwbrx
#define LD ldbrx
#else
+#define LH lhzx
+#define LW lwzx
#define LD ldx
#endif
_GLOBAL(memcmp)
cmpdi cr1,r5,0
- /* Use the short loop if both strings are not 8B aligned */
- or r6,r3,r4
+ /* Use the short loop if the src/dst addresses are not
+ * with the same offset of 8 bytes align boundary.
+ */
+ xor r6,r3,r4
andi. r6,r6,7
- /* Use the short loop if length is less than 32B */
- cmpdi cr6,r5,31
+ /* fall back to short loop if compare at aligned addrs
+ * with no greater than 8 bytes.
+ */
+ cmpdi cr6,r5,8
beq cr1,.Lzero
bne .Lshort
+ ble cr6,.Lshort
+
+.Lalignbytes_start:
+ /* The bits 0/1/2 of src/dst addr are the same. */
+ neg r0,r3
+ andi. r0,r0,7
+ beq .Lalign8bytes
+
+ PPC_MTOCRF(1,r0)
+ bf 31,.Lalign2bytes
+ lbz rA,0(r3)
+ lbz rB,0(r4)
+ cmplw cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ addi r3,r3,1
+ addi r4,r4,1
+ subi r5,r5,1
+.Lalign2bytes:
+ bf 30,.Lalign4bytes
+ LH rA,0,r3
+ LH rB,0,r4
+ cmplw cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ bne .Lnon_zero
+ addi r3,r3,2
+ addi r4,r4,2
+ subi r5,r5,2
+.Lalign4bytes:
+ bf 29,.Lalign8bytes
+ LW rA,0,r3
+ LW rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ addi r3,r3,4
+ addi r4,r4,4
+ subi r5,r5,4
+.Lalign8bytes:
+ /* Now addrs are aligned with 8 bytes. Use the short loop if left
+ * bytes are less than 8B.
+ */
+ cmpdi cr6,r5,7
+ ble cr6,.Lshort
+
+ /* Use .Llong loop if left cmp bytes are equal or greater than 32B */
+ cmpdi cr6,r5,31
bgt cr6,.Llong
+.Lcmploop_8bytes_31bytes:
+ /* handle 8 ~ 31 bytes with 8 bytes aligned addrs */
+ srdi. r0,r5,3
+ clrldi r5,r5,61
+ mtctr r0
+831:
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ addi r3,r3,8
+ addi r4,r4,8
+ bdnz 831b
+
+ cmpwi r5,0
+ beq .Lzero
+
.Lshort:
mtctr r5
@@ -232,4 +302,12 @@ _GLOBAL(memcmp)
ld r28,-32(r1)
ld r27,-40(r1)
blr
+
+.LcmpAB_lightweight: /* skip NV GPRS restore */
+ li r3,1
+ bgt cr0,8f
+ li r3,-1
+8:
+ blr
+
EXPORT_SYMBOL(memcmp)
--
1.8.3.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* RE: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
@ 2017-09-19 10:12 ` David Laight
2017-09-20 9:56 ` Simon Guo
2017-09-19 12:20 ` Christophe LEROY
1 sibling, 1 reply; 10+ messages in thread
From: David Laight @ 2017-09-19 10:12 UTC (permalink / raw)
To: 'wei.guo.simon@gmail.com', linuxppc-dev@lists.ozlabs.org
Cc: Naveen N. Rao
From: wei.guo.simon@gmail.com
> Sent: 19 September 2017 11:04
> Currently memcmp() in powerpc will fall back to .Lshort (compare per byte
> mode) if either src or dst address is not 8 bytes aligned. It can be
> optimized if both addresses are with the same offset with 8 bytes boundary.
>
> memcmp() can align the src/dst address with 8 bytes firstly and then
> compare with .Llong mode.
Why not mask both addresses with ~7 and mask/shift the read value to ignore
the unwanted high (BE) or low (LE) bits.
The same can be done at the end of the compare with any final, partial word.
David
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-19 10:12 ` David Laight
@ 2017-09-20 9:56 ` Simon Guo
2017-09-20 10:05 ` David Laight
0 siblings, 1 reply; 10+ messages in thread
From: Simon Guo @ 2017-09-20 9:56 UTC (permalink / raw)
To: David Laight; +Cc: linuxppc-dev@lists.ozlabs.org, Naveen N. Rao
On Tue, Sep 19, 2017 at 10:12:50AM +0000, David Laight wrote:
> From: wei.guo.simon@gmail.com
> > Sent: 19 September 2017 11:04
> > Currently memcmp() in powerpc will fall back to .Lshort (compare per byte
> > mode) if either src or dst address is not 8 bytes aligned. It can be
> > optimized if both addresses are with the same offset with 8 bytes boundary.
> >
> > memcmp() can align the src/dst address with 8 bytes firstly and then
> > compare with .Llong mode.
>
> Why not mask both addresses with ~7 and mask/shift the read value to ignore
> the unwanted high (BE) or low (LE) bits.
>
> The same can be done at the end of the compare with any final, partial word.
>
> David
>
Yes, that will be better. A prototype shows a ~5% improvement over v1 for
32-byte comparisons. I will rework this for v2.
Thanks for the suggestion.
BR,
- Simon
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-20 9:56 ` Simon Guo
@ 2017-09-20 10:05 ` David Laight
0 siblings, 0 replies; 10+ messages in thread
From: David Laight @ 2017-09-20 10:05 UTC (permalink / raw)
To: 'Simon Guo'; +Cc: linuxppc-dev@lists.ozlabs.org, Naveen N. Rao
From: Simon Guo
> Sent: 20 September 2017 10:57
> On Tue, Sep 19, 2017 at 10:12:50AM +0000, David Laight wrote:
> > From: wei.guo.simon@gmail.com
> > > Sent: 19 September 2017 11:04
> > > Currently memcmp() in powerpc will fall back to .Lshort (compare per byte
> > > mode) if either src or dst address is not 8 bytes aligned. It can be
> > > optimized if both addresses are with the same offset with 8 bytes boundary.
> > >
> > > memcmp() can align the src/dst address with 8 bytes firstly and then
> > > compare with .Llong mode.
> >
> > Why not mask both addresses with ~7 and mask/shift the read value to ignore
> > the unwanted high (BE) or low (LE) bits.
> >
> > The same can be done at the end of the compare with any final, partial word.
>
> Yes. That will be better. A prototyping shows ~5% improvement on 32 bytes
> size comparison with v1. I will rework on v2.
Clearly you have to be careful to return the correct +1/-1 on mismatch.
For systems that can do misaligned transfers, you can compare the first
word, then compare aligned words, and finally the last word.
Rather like a memcpy() function I wrote (for NetBSD) that copied
the last word first, then a whole number of words aligned at the start.
(Hope no one expected anything special for overlapping copies.)
David
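For illustration, that first-word/aligned-middle/last-word scheme might look
like the following C sketch (hypothetical code, assuming n >= 8 and cheap
misaligned loads; ordering is resolved by re-running memcmp() on the
mismatching 8-byte window):
-------
#include <stdint.h>
#include <string.h>
static int words_differ(const void *a, const void *b)
{
	uint64_t x, y;
	memcpy(&x, a, 8);	/* possibly misaligned load */
	memcpy(&y, b, 8);
	return x != y;
}
int memcmp_ends(const void *s1, const void *s2, size_t n)
{
	const char *p1 = s1, *p2 = s2;
	size_t i = 8 - ((uintptr_t)p1 & 7);	/* bytes covered by the first word */
	if (words_differ(p1, p2))		/* misaligned first word */
		return memcmp(p1, p2, 8);
	for (; i + 8 <= n; i += 8)		/* words aligned w.r.t. p1 */
		if (words_differ(p1 + i, p2 + i))
			return memcmp(p1 + i, p2 + i, 8);
	if (i < n && words_differ(p1 + n - 8, p2 + n - 8))	/* final word */
		return memcmp(p1 + n - 8, p2 + n - 8, 8);
	return 0;
}
-------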
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
2017-09-19 10:12 ` David Laight
@ 2017-09-19 12:20 ` Christophe LEROY
1 sibling, 0 replies; 10+ messages in thread
From: Christophe LEROY @ 2017-09-19 12:20 UTC (permalink / raw)
To: wei.guo.simon, linuxppc-dev; +Cc: Naveen N. Rao
Hi
Could you, in the email/patch subject, write powerpc/64 instead of
powerpc, as it doesn't apply to powerpc/32?
On 19/09/2017 at 12:03, wei.guo.simon@gmail.com wrote:
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> Currently memcmp() in powerpc will fall back to .Lshort (compare per byte
Say powerpc/64 here too.
Christophe
> mode) if either src or dst address is not 8 bytes aligned. It can be
> optimized if both addresses are with the same offset with 8 bytes boundary.
>
> memcmp() can align the src/dst address with 8 bytes firstly and then
> compare with .Llong mode.
>
> This patch optimizes memcmp() behavior in this situation.
>
> Test result:
>
> (1) 256 bytes
> Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
> - without patch
> 50.715169506 seconds time elapsed ( +- 0.04% )
> - with patch
> 28.906602373 seconds time elapsed ( +- 0.02% )
> -> There is ~+75% percent improvement.
>
> (2) 32 bytes
> To observe performance impact on < 32 bytes, modify
> tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
> -------
> #include <string.h>
> #include "utils.h"
>
> -#define SIZE 256
> +#define SIZE 32
> #define ITERATIONS 10000
>
> int test_memcmp(const void *s1, const void *s2, size_t n);
> --------
>
> - Without patch
> 0.390677136 seconds time elapsed ( +- 0.03% )
> - with patch
> 0.375685926 seconds time elapsed ( +- 0.05% )
> -> There is ~+4% improvement
>
> (3) 0~8 bytes
> To observe <8 bytes performance impact, modify
> tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
> -------
> #include <string.h>
> #include "utils.h"
>
> -#define SIZE 256
> -#define ITERATIONS 10000
> +#define SIZE 8
> +#define ITERATIONS 100000
>
> int test_memcmp(const void *s1, const void *s2, size_t n);
> -------
> - Without patch
> 3.169203981 seconds time elapsed ( +- 0.23% )
> - With patch
> 3.208257362 seconds time elapsed ( +- 0.13% )
> -> There is ~ -1% decrease.
> (I don't know why yet, since there are the same number of instructions
> in the code path for 0~8 bytes memcmp() with/without this patch. Any
> comments will be appreciated).
>
> Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
> ---
> arch/powerpc/lib/memcmp_64.S | 86 +++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 82 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index d75d18b..6dbafdb 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -24,25 +24,95 @@
> #define rH r31
>
> #ifdef __LITTLE_ENDIAN__
> +#define LH lhbrx
> +#define LW lwbrx
> #define LD ldbrx
> #else
> +#define LH lhzx
> +#define LW lwzx
> #define LD ldx
> #endif
>
> _GLOBAL(memcmp)
> cmpdi cr1,r5,0
>
> - /* Use the short loop if both strings are not 8B aligned */
> - or r6,r3,r4
> + /* Use the short loop if the src/dst addresses are not
> + * with the same offset of 8 bytes align boundary.
> + */
> + xor r6,r3,r4
> andi. r6,r6,7
>
> - /* Use the short loop if length is less than 32B */
> - cmpdi cr6,r5,31
> + /* fall back to short loop if compare at aligned addrs
> + * with no greater than 8 bytes.
> + */
> + cmpdi cr6,r5,8
>
> beq cr1,.Lzero
> bne .Lshort
> + ble cr6,.Lshort
> +
> +.Lalignbytes_start:
> + /* The bits 0/1/2 of src/dst addr are the same. */
> + neg r0,r3
> + andi. r0,r0,7
> + beq .Lalign8bytes
> +
> + PPC_MTOCRF(1,r0)
> + bf 31,.Lalign2bytes
> + lbz rA,0(r3)
> + lbz rB,0(r4)
> + cmplw cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + addi r3,r3,1
> + addi r4,r4,1
> + subi r5,r5,1
> +.Lalign2bytes:
> + bf 30,.Lalign4bytes
> + LH rA,0,r3
> + LH rB,0,r4
> + cmplw cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + bne .Lnon_zero
> + addi r3,r3,2
> + addi r4,r4,2
> + subi r5,r5,2
> +.Lalign4bytes:
> + bf 29,.Lalign8bytes
> + LW rA,0,r3
> + LW rB,0,r4
> + cmpld cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + addi r3,r3,4
> + addi r4,r4,4
> + subi r5,r5,4
> +.Lalign8bytes:
> + /* Now addrs are aligned with 8 bytes. Use the short loop if left
> + * bytes are less than 8B.
> + */
> + cmpdi cr6,r5,7
> + ble cr6,.Lshort
> +
> + /* Use .Llong loop if left cmp bytes are equal or greater than 32B */
> + cmpdi cr6,r5,31
> bgt cr6,.Llong
>
> +.Lcmploop_8bytes_31bytes:
> + /* handle 8 ~ 31 bytes with 8 bytes aligned addrs */
> + srdi. r0,r5,3
> + clrldi r5,r5,61
> + mtctr r0
> +831:
> + LD rA,0,r3
> + LD rB,0,r4
> + cmpld cr0,rA,rB
> + bne cr0,.LcmpAB_lightweight
> + addi r3,r3,8
> + addi r4,r4,8
> + bdnz 831b
> +
> + cmpwi r5,0
> + beq .Lzero
> +
> .Lshort:
> mtctr r5
>
> @@ -232,4 +302,12 @@ _GLOBAL(memcmp)
> ld r28,-32(r1)
> ld r27,-40(r1)
> blr
> +
> +.LcmpAB_lightweight: /* skip NV GPRS restore */
> + li r3,1
> + bgt cr0,8f
> + li r3,-1
> +8:
> + blr
> +
> EXPORT_SYMBOL(memcmp)
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v1 2/3] powerpc: enhance memcmp() with VMX instruction for long bytes comparison
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
@ 2017-09-19 10:03 ` wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 3/3] powerpc:selftest update memcmp selftest according to kernel change wei.guo.simon
2017-09-19 12:21 ` [PATCH v1 0/3] powerpc: memcmp() optimization Christophe LEROY
3 siblings, 0 replies; 10+ messages in thread
From: wei.guo.simon @ 2017-09-19 10:03 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
This patch adds VMX primitives to memcmp() for cases where the compare size
exceeds 4K bytes.
Test result with the following test program:
------
tools/testing/selftests/powerpc/stringloops# cat memcmp.c
int test_memcmp(const void *s1, const void *s2, size_t n);
static int testcase(void)
{
char *s1;
char *s2;
unsigned long i;
s1 = memalign(128, SIZE);
if (!s1) {
perror("memalign");
exit(1);
}
s2 = memalign(128, SIZE);
if (!s2) {
perror("memalign");
exit(1);
}
for (i = 0; i < SIZE; i++) {
s1[i] = i & 0xff;
s2[i] = i & 0xff;
}
for (i = 0; i < ITERATIONS; i++)
test_memcmp(s1, s2, SIZE);
return 0;
}
int main(void)
{
return test_harness(testcase, "memcmp");
}
------
Without VMX patch:
5.085776331 seconds time elapsed ( +- 0.28% )
With VMX patch:
4.584002052 seconds time elapsed ( +- 0.02% )
There is a ~10% improvement.
However, I am not yet aware of an in-kernel use case for memcmp() on such
large sizes.
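The shape of the fast path, minus the enter_vmx_ops() plumbing and the
register save/restore, is roughly the following userspace C sketch
(illustration only; the real assembly below compares 32 bytes per
iteration and only runs with CONFIG_ALTIVEC):
------
#include <altivec.h>
#include <stddef.h>
#include <string.h>
/* Above a size threshold, compare 16 bytes per step with a vector
 * equality predicate; the scalar memcmp() resolves the remainder and
 * the sign of any mismatch.  Build with e.g. -mvsx.
 */
static int memcmp_vmx_sketch(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;
	if (n >= 4096) {			/* same threshold as the patch */
		while (n >= 16) {
			vector unsigned char a = vec_vsx_ld(0, p1);
			vector unsigned char b = vec_vsx_ld(0, p2);
			if (!vec_all_eq(a, b))
				break;		/* mismatch inside these 16 bytes */
			p1 += 16;
			p2 += 16;
			n -= 16;
		}
	}
	return memcmp(p1, p2, n);
}
------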
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
arch/powerpc/include/asm/asm-prototypes.h | 2 +-
arch/powerpc/lib/copypage_power7.S | 2 +-
arch/powerpc/lib/memcmp_64.S | 79 +++++++++++++++++++++++++++++++
arch/powerpc/lib/memcpy_power7.S | 2 +-
arch/powerpc/lib/vmx-helper.c | 2 +-
5 files changed, 83 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index 7330150..e6530d8 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -49,7 +49,7 @@ void __trace_hcall_exit(long opcode, unsigned long retval,
/* VMX copying */
int enter_vmx_usercopy(void);
int exit_vmx_usercopy(void);
-int enter_vmx_copy(void);
+int enter_vmx_ops(void);
void * exit_vmx_copy(void *dest);
/* Traps */
diff --git a/arch/powerpc/lib/copypage_power7.S b/arch/powerpc/lib/copypage_power7.S
index ca5fc8f..9e7729e 100644
--- a/arch/powerpc/lib/copypage_power7.S
+++ b/arch/powerpc/lib/copypage_power7.S
@@ -60,7 +60,7 @@ _GLOBAL(copypage_power7)
std r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
std r0,16(r1)
stdu r1,-STACKFRAMESIZE(r1)
- bl enter_vmx_copy
+ bl enter_vmx_ops
cmpwi r3,0
ld r0,STACKFRAMESIZE+16(r1)
ld r3,STK_REG(R31)(r1)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 6dbafdb..b86a1d3 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -153,6 +153,13 @@ _GLOBAL(memcmp)
blr
.Llong:
+#ifdef CONFIG_ALTIVEC
+ /* Try to use vmx loop if length is larger than 4K */
+ cmpldi cr6,r5,4096
+ bgt cr6,.Lvmx_cmp
+
+.Llong_novmx_cmp:
+#endif
li off8,8
li off16,16
li off24,24
@@ -310,4 +317,76 @@ _GLOBAL(memcmp)
8:
blr
+#ifdef CONFIG_ALTIVEC
+.Lvmx_cmp:
+ mflr r0
+ std r3,-STACKFRAMESIZE+STK_REG(R31)(r1)
+ std r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
+ std r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
+ std r0,16(r1)
+ stdu r1,-STACKFRAMESIZE(r1)
+ bl enter_vmx_ops
+ cmpwi cr1,r3,0
+ ld r0,STACKFRAMESIZE+16(r1)
+ ld r3,STK_REG(R31)(r1)
+ ld r4,STK_REG(R30)(r1)
+ ld r5,STK_REG(R29)(r1)
+ addi r1,r1,STACKFRAMESIZE
+ mtlr r0
+ beq cr1,.Llong_novmx_cmp
+
+3:
+ /* Enter with src/dst address 8 bytes aligned, and len is
+ * no less than 4KB. Need to align with 16 bytes further.
+ */
+ andi. rA,r3,8
+ beq 4f
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+
+ addi r3,r3,8
+ addi r4,r4,8
+
+4:
+ /* compare 32 bytes for each loop */
+ srdi r0,r5,5
+ mtctr r0
+ andi. r5,r5,31
+ li off16,16
+5:
+ lvx v0,0,r3
+ lvx v1,0,r4
+ vcmpequd. v0,v0,v1
+ bf 24,7f
+ lvx v0,off16,r3
+ lvx v1,off16,r4
+ vcmpequd. v0,v0,v1
+ bf 24,6f
+ addi r3,r3,32
+ addi r4,r4,32
+ bdnz 5b
+
+ cmpdi r5,0
+ beq .Lzero
+ b .Lshort
+
+6:
+ addi r3,r3,16
+ addi r4,r4,16
+
+7:
+ LD rA,0,r3
+ LD rB,0,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+
+ li off8,8
+ LD rA,off8,r3
+ LD rB,off8,r4
+ cmpld cr0,rA,rB
+ bne cr0,.LcmpAB_lightweight
+ b .Lzero
+#endif
EXPORT_SYMBOL(memcmp)
diff --git a/arch/powerpc/lib/memcpy_power7.S b/arch/powerpc/lib/memcpy_power7.S
index 193909a..682e386 100644
--- a/arch/powerpc/lib/memcpy_power7.S
+++ b/arch/powerpc/lib/memcpy_power7.S
@@ -230,7 +230,7 @@ _GLOBAL(memcpy_power7)
std r5,-STACKFRAMESIZE+STK_REG(R29)(r1)
std r0,16(r1)
stdu r1,-STACKFRAMESIZE(r1)
- bl enter_vmx_copy
+ bl enter_vmx_ops
cmpwi cr1,r3,0
ld r0,STACKFRAMESIZE+16(r1)
ld r3,STK_REG(R31)(r1)
diff --git a/arch/powerpc/lib/vmx-helper.c b/arch/powerpc/lib/vmx-helper.c
index bf925cd..923a9ab 100644
--- a/arch/powerpc/lib/vmx-helper.c
+++ b/arch/powerpc/lib/vmx-helper.c
@@ -53,7 +53,7 @@ int exit_vmx_usercopy(void)
return 0;
}
-int enter_vmx_copy(void)
+int enter_vmx_ops(void)
{
if (in_interrupt())
return 0;
--
1.8.3.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v1 3/3] powerpc:selftest update memcmp selftest according to kernel change
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp wei.guo.simon
2017-09-19 10:03 ` [PATCH v1 2/3] powerpc: enhance memcmp() with VMX instruction for long bytes comparison wei.guo.simon
@ 2017-09-19 10:03 ` wei.guo.simon
2017-09-19 12:21 ` [PATCH v1 0/3] powerpc: memcmp() optimization Christophe LEROY
3 siblings, 0 replies; 10+ messages in thread
From: wei.guo.simon @ 2017-09-19 10:03 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Paul Mackerras, Michael Ellerman, Naveen N. Rao, Simon Guo
From: Simon Guo <wei.guo.simon@gmail.com>
This patch adjusts the selftest files related to memcmp() so that the
memcmp selftest can be compiled successfully.
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
.../selftests/powerpc/copyloops/asm/ppc_asm.h | 2 +-
.../selftests/powerpc/stringloops/asm/ppc_asm.h | 31 ++++++++++++++++++++++
2 files changed, 32 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
index 80d34a9..a9da02d 100644
--- a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
@@ -35,7 +35,7 @@
li r3,0
blr
-FUNC_START(enter_vmx_copy)
+FUNC_START(enter_vmx_ops)
li r3,1
blr
diff --git a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
index 11bece8..793ee54 100644
--- a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
@@ -1,3 +1,5 @@
+#ifndef _PPC_ASM_H
+#define _PPC_ASM_H
#include <ppc-asm.h>
#ifndef r1
@@ -5,3 +7,32 @@
#endif
#define _GLOBAL(A) FUNC_START(test_ ## A)
+
+#define CONFIG_ALTIVEC
+
+#define R14 r14
+#define R15 r15
+#define R16 r16
+#define R17 r17
+#define R18 r18
+#define R19 r19
+#define R20 r20
+#define R21 r21
+#define R22 r22
+#define R29 r29
+#define R30 r30
+#define R31 r31
+
+#define STACKFRAMESIZE 256
+#define STK_REG(i) (112 + ((i)-14)*8)
+
+#define _GLOBAL(A) FUNC_START(test_ ## A)
+#define _GLOBAL_TOC(A) _GLOBAL(A)
+
+#define PPC_MTOCRF(A, B) mtocrf A, B
+
+FUNC_START(enter_vmx_ops)
+ li r3, 1
+ blr
+
+#endif
--
1.8.3.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v1 0/3] powerpc: memcmp() optimization
2017-09-19 10:03 [PATCH v1 0/3] powerpc: memcmp() optimization wei.guo.simon
` (2 preceding siblings ...)
2017-09-19 10:03 ` [PATCH v1 3/3] powerpc:selftest update memcmp selftest according to kernel change wei.guo.simon
@ 2017-09-19 12:21 ` Christophe LEROY
2017-09-20 9:57 ` Simon Guo
3 siblings, 1 reply; 10+ messages in thread
From: Christophe LEROY @ 2017-09-19 12:21 UTC (permalink / raw)
To: wei.guo.simon, linuxppc-dev; +Cc: Naveen N. Rao
Hi
Could you, in the email/patch subject and in the commit texts, write
powerpc/64 instead of powerpc, as it doesn't apply to powerpc/32?
Christophe
On 19/09/2017 at 12:03, wei.guo.simon@gmail.com wrote:
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> There is some room to optimize memcmp() in powerpc for the following 2 cases:
> (1) Even if the src/dst addresses are not 8-byte aligned at the beginning,
> memcmp() can align them and go with .Llong comparison mode without
> falling back to .Lshort comparison mode to compare the buffer byte by byte.
> (2) VMX instructions can be used to speed up large size comparisons.
>
> This patch set also updates the selftest case so that it compiles.
>
>
> Simon Guo (3):
> powerpc: Align bytes before fall back to .Lshort in powerpc memcmp
> powerpc: enhance memcmp() with VMX instruction for long bytes
> comparison
> powerpc:selftest update memcmp selftest according to kernel change
>
> arch/powerpc/include/asm/asm-prototypes.h | 2 +-
> arch/powerpc/lib/copypage_power7.S | 2 +-
> arch/powerpc/lib/memcmp_64.S | 165 ++++++++++++++++++++-
> arch/powerpc/lib/memcpy_power7.S | 2 +-
> arch/powerpc/lib/vmx-helper.c | 2 +-
> .../selftests/powerpc/copyloops/asm/ppc_asm.h | 2 +-
> .../selftests/powerpc/stringloops/asm/ppc_asm.h | 31 ++++
> 7 files changed, 197 insertions(+), 9 deletions(-)
>
^ permalink raw reply [flat|nested] 10+ messages in thread