linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors
@ 2013-12-11  6:24 zhichang.yuan at linaro.org
  2013-12-11  6:24 ` [PATCH 1/6] arm64: lib: Implement optimized memcpy routine zhichang.yuan at linaro.org
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11  6:24 UTC (permalink / raw)
  To: linux-arm-kernel

From: "zhichang.yuan" <zhichang.yuan@linaro.org>

In the current arm64 kernel, only a few string library routines are
implemented in arch/arm64/lib, such as memcpy, memset, memmove and strchr.
Most of the frequently used string routines come from the generic,
architecture-independent string library, and those generic routines are
not efficient.

This patch series focuses on improving the performance of the string
routines for ARMv8 processors. It contains eight optimized functions.
The work is based on the cortex-strings project from the Linaro toolchain.

The cortex-strings code can be found here:
	https://code.launchpad.net/cortex-strings
To obtain better performance, several techniques are used:
1) memory burst access
	For long memory operations, the ARMv8 load/store pair instructions,
	ldp/stp, are used to transfer data in bulk. Consecutive ldp/stp
	sequences are issued wherever possible to trigger burst accesses.
2) parallel processing
	The current string routines mostly work one byte at a time. This
	series processes the data in parallel; strlen, for example, examines
	eight string bytes per iteration (see the short C sketch after this
	list).
3) aligned memory access
	The processing is split into several cases according to the alignment
	of the input addresses. For unaligned addresses, a short leading
	chunk is handled first to bring the address to an alignment boundary,
	and the remaining data is then processed with aligned accesses.
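
As a short illustration of point 2, the byte-parallel zero/NUL detection
used by the compare and length routines boils down to the classic bit
trick below. This is only an explanatory C sketch, not part of the
patches themselves:

	#include <stdint.h>

	/*
	 * Non-zero iff some byte of x is zero; this is the
	 * (X - 1) & ~X & 0x80 check described in the assembly comments,
	 * applied to all eight bytes of a Dword at once.
	 */
	static inline uint64_t has_zero_byte(uint64_t x)
	{
		return (x - 0x0101010101010101ULL) &
		       ~x & 0x8080808080808080ULL;
	}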

After the optimization, these routines perform better than the current ones.
The test results are available at:
	https://wiki.linaro.org/WorkingGroups/Kernel/ARMv8/cortex-strings

--

zhichang.yuan (6):
  arm64: lib: Implement optimized memcpy routine
  arm64: lib: Implement optimized memmove routine
  arm64: lib: Implement optimized memset routine
  arm64: lib: Implement optimized memcmp routine
  arm64: lib: Implement optimized string compare routines
  arm64: lib: Implement optimized string length routines

 arch/arm64/include/asm/string.h |   15 ++
 arch/arm64/kernel/arm64ksyms.c  |    5 +
 arch/arm64/lib/Makefile         |    5 +-
 arch/arm64/lib/memcmp.S         |  258 +++++++++++++++++++++++++++++
 arch/arm64/lib/memcpy.S         |  182 ++++++++++++++++++---
 arch/arm64/lib/memmove.S        |  195 ++++++++++++++++++----
 arch/arm64/lib/memset.S         |  227 +++++++++++++++++++++++---
 arch/arm64/lib/strcmp.S         |  256 +++++++++++++++++++++++++++++
 arch/arm64/lib/strlen.S         |  131 +++++++++++++++
 arch/arm64/lib/strncmp.S        |  340 +++++++++++++++++++++++++++++++++++++++
 arch/arm64/lib/strnlen.S        |  179 +++++++++++++++++++++
 11 files changed, 1714 insertions(+), 79 deletions(-)
 create mode 100644 arch/arm64/lib/memcmp.S
 create mode 100644 arch/arm64/lib/strcmp.S
 create mode 100644 arch/arm64/lib/strlen.S
 create mode 100644 arch/arm64/lib/strncmp.S
 create mode 100644 arch/arm64/lib/strnlen.S

-- 
1.7.9.5


* [PATCH 1/6] arm64: lib: Implement optimized memcpy routine
  2013-12-11  6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
@ 2013-12-11  6:24 ` zhichang.yuan at linaro.org
  2013-12-16 16:08   ` Will Deacon
  2013-12-11  6:24 ` [PATCH 2/6] arm64: lib: Implement optimized memmove routine zhichang.yuan at linaro.org
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11  6:24 UTC (permalink / raw)
  To: linux-arm-kernel

From: "zhichang.yuan" <zhichang.yuan@linaro.org>

This patch, based on Linaro's Cortex Strings library, improves
the performance of the assembly optimized memcpy() function.

Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
 arch/arm64/lib/memcpy.S |  182 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 160 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S
index 27b5003..e3bab71 100644
--- a/arch/arm64/lib/memcpy.S
+++ b/arch/arm64/lib/memcpy.S
@@ -1,5 +1,13 @@
 /*
  * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
@@ -27,27 +35,157 @@
  * Returns:
  *	x0 - dest
  */
+#define dstin	x0
+#define src	x1
+#define count	x2
+#define tmp1	x3
+#define tmp1w	w3
+#define tmp2	x4
+#define tmp2w	w4
+#define tmp3	x5
+#define tmp3w	w5
+#define dst	x6
+
+#define A_l	x7
+#define A_h	x8
+#define B_l	x9
+#define B_h	x10
+#define C_l	x11
+#define C_h	x12
+#define D_l	x13
+#define D_h	x14
+
 ENTRY(memcpy)
-	mov	x4, x0
-	subs	x2, x2, #8
-	b.mi	2f
-1:	ldr	x3, [x1], #8
-	subs	x2, x2, #8
-	str	x3, [x4], #8
-	b.pl	1b
-2:	adds	x2, x2, #4
-	b.mi	3f
-	ldr	w3, [x1], #4
-	sub	x2, x2, #4
-	str	w3, [x4], #4
-3:	adds	x2, x2, #2
-	b.mi	4f
-	ldrh	w3, [x1], #2
-	sub	x2, x2, #2
-	strh	w3, [x4], #2
-4:	adds	x2, x2, #1
-	b.mi	5f
-	ldrb	w3, [x1]
-	strb	w3, [x4]
-5:	ret
+	mov	dst, dstin
+	cmp	count, #16
+	/* If the length is less than 16, ldp/stp cannot be used. */
+	b.lo	.Ltiny15
+.Lover16:
+	neg	tmp2, src
+	ands	tmp2, tmp2, #15/* Bytes to reach alignment. */
+	b.eq	.LSrcAligned
+	sub	count, count, tmp2
+	/*
+	* We could copy 16 bytes with ldp/stp and then step src back to the
+	* aligned address; that is more efficient, but it risks overwriting
+	* source data that still needs to be read when the buffers overlap.
+	* Here we prefer to access memory strictly forwards: it takes a few
+	* more instructions, but all subsequent accesses are aligned.
+	*/
+	tbz	tmp2, #0, 1f
+	ldrb	tmp1w, [src], #1
+	strb	tmp1w, [dst], #1
+1:
+	tbz	tmp2, #1, 1f
+	ldrh	tmp1w, [src], #2
+	strh	tmp1w, [dst], #2
+1:
+	tbz	tmp2, #2, 1f
+	ldr	tmp1w, [src], #4
+	str	tmp1w, [dst], #4
+1:
+	tbz	tmp2, #3, .LSrcAligned
+	ldr	tmp1, [src],#8
+	str	tmp1, [dst],#8
+
+.LSrcAligned:
+	cmp	count, #64
+	b.ge	.Lcpy_over64
+	/*
+	* Deal with small copies quickly by dropping straight into the
+	* exit block.
+	*/
+.Ltail63:
+	/*
+	* Copy up to 48 bytes of data. At this point we only need the
+	* bottom 6 bits of count to be accurate.
+	*/
+	ands	tmp1, count, #0x30
+	b.eq	.Ltiny15
+	cmp	tmp1w, #0x20
+	b.eq	1f
+	b.lt	2f
+	ldp	A_l, A_h, [src], #16
+	stp	A_l, A_h, [dst], #16
+1:
+	ldp	A_l, A_h, [src], #16
+	stp	A_l, A_h, [dst], #16
+2:
+	ldp	A_l, A_h, [src], #16
+	stp	A_l, A_h, [dst], #16
+.Ltiny15:
+	/*
+	* To keep memmove simple, do not step src backwards here: if dst is
+	* close to src, doing so could overwrite source data that is still
+	* needed for the next copy.
+	*/
+	tbz	count, #3, 1f
+	ldr	tmp1, [src], #8
+	str	tmp1, [dst], #8
+1:
+	tbz	count, #2, 1f
+	ldr	tmp1w, [src], #4
+	str	tmp1w, [dst], #4
+1:
+	tbz	count, #1, 1f
+	ldrh	tmp1w, [src], #2
+	strh	tmp1w, [dst], #2
+1:
+	tbz	count, #0, .Lexitfunc
+	ldrb	tmp1w, [src]
+	strb	tmp1w, [dst]
+
+.Lexitfunc:
+	ret
+
+.Lcpy_over64:
+	subs	count, count, #128
+	b.ge	.Lcpy_body_large
+	/*
+	* Less than 128 bytes to copy, so handle 64 here and then jump
+	* to the tail.
+	*/
+	ldp	A_l, A_h, [src],#16
+	stp	A_l, A_h, [dst],#16
+	ldp	B_l, B_h, [src],#16
+	ldp	C_l, C_h, [src],#16
+	stp	B_l, B_h, [dst],#16
+	stp	C_l, C_h, [dst],#16
+	ldp	D_l, D_h, [src],#16
+	stp	D_l, D_h, [dst],#16
+
+	tst	count, #0x3f
+	b.ne	.Ltail63
+	ret
+
+	/*
+	* Critical loop.  Start at a new cache line boundary.  Assuming
+	* 64 bytes per line this ensures the entire loop is in one line.
+	*/
+	.p2align	6
+.Lcpy_body_large:
+	/* There are at least 128 bytes to copy.  */
+	ldp	A_l, A_h, [src],#16
+	ldp	B_l, B_h, [src],#16
+	ldp	C_l, C_h, [src],#16
+	ldp	D_l, D_h, [src],#16
+1:
+	stp	A_l, A_h, [dst],#16
+	ldp	A_l, A_h, [src],#16
+	stp	B_l, B_h, [dst],#16
+	ldp	B_l, B_h, [src],#16
+	stp	C_l, C_h, [dst],#16
+	ldp	C_l, C_h, [src],#16
+	stp	D_l, D_h, [dst],#16
+	ldp	D_l, D_h, [src],#16
+	subs	count, count, #64
+	b.ge	1b
+	stp	A_l, A_h, [dst],#16
+	stp	B_l, B_h, [dst],#16
+	stp	C_l, C_h, [dst],#16
+	stp	D_l, D_h, [dst],#16
+
+	tst	count, #0x3f
+	b.ne	.Ltail63
+	ret
 ENDPROC(memcpy)
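
The head-alignment ladder above (the tbz/ldrb/ldrh/ldr pairs) copies 1, 2,
4 and/or 8 leading bytes so that src becomes 16-byte aligned before the
ldp/stp bulk loops run. A rough, illustrative C equivalent of that idea
(not code from this patch) looks like this:

	#include <stdint.h>
	#include <stddef.h>

	/*
	 * Copy just enough leading bytes that *src ends up 16-byte
	 * aligned; returns how many bytes were consumed. The assembly
	 * does the same with one tbz test per power-of-two size
	 * instead of a byte loop.
	 */
	static size_t copy_head_to_align(unsigned char **dst,
					 const unsigned char **src)
	{
		size_t head = -(uintptr_t)*src & 15;
		size_t i;

		for (i = 0; i < head; i++)
			*(*dst)++ = *(*src)++;
		return head;	/* caller subtracts this from count */
	}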
-- 
1.7.9.5


* [PATCH 2/6] arm64: lib: Implement optimized memmove routine
  2013-12-11  6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
  2013-12-11  6:24 ` [PATCH 1/6] arm64: lib: Implement optimized memcpy routine zhichang.yuan at linaro.org
@ 2013-12-11  6:24 ` zhichang.yuan at linaro.org
  2013-12-11  6:24 ` [PATCH 3/6] arm64: lib: Implement optimized memset routine zhichang.yuan at linaro.org
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11  6:24 UTC (permalink / raw)
  To: linux-arm-kernel

From: "zhichang.yuan" <zhichang.yuan@linaro.org>

This patch, based on Linaro's Cortex Strings library, improves
the performance of the assembly optimized memmove() function.

Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
 arch/arm64/lib/memmove.S |  195 +++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 166 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/lib/memmove.S b/arch/arm64/lib/memmove.S
index b79fdfa..61ee8d9 100644
--- a/arch/arm64/lib/memmove.S
+++ b/arch/arm64/lib/memmove.S
@@ -1,13 +1,21 @@
 /*
  * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
  *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  * GNU General Public License for more details.
  *
  * You should have received a copy of the GNU General Public License
@@ -18,7 +26,8 @@
 #include <asm/assembler.h>
 
 /*
- * Move a buffer from src to test (alignment handled by the hardware).
+ * Move a buffer from src to dest
+ * (alignment handled by the hardware in some cases).
  * If dest <= src, call memcpy, otherwise copy in reverse order.
  *
  * Parameters:
@@ -28,30 +37,158 @@
  * Returns:
  *	x0 - dest
  */
+#define dstin	x0
+#define src	x1
+#define count	x2
+#define tmp1	x3
+#define tmp1w	w3
+#define tmp2	x4
+#define tmp2w	w4
+#define tmp3	x5
+#define tmp3w	w5
+#define dst	x6
+
+#define A_l	x7
+#define A_h	x8
+#define B_l	x9
+#define B_h	x10
+#define C_l	x11
+#define C_h	x12
+#define D_l	x13
+#define D_h	x14
+
 ENTRY(memmove)
-	cmp	x0, x1
-	b.ls	memcpy
-	add	x4, x0, x2
-	add	x1, x1, x2
-	subs	x2, x2, #8
-	b.mi	2f
-1:	ldr	x3, [x1, #-8]!
-	subs	x2, x2, #8
-	str	x3, [x4, #-8]!
-	b.pl	1b
-2:	adds	x2, x2, #4
-	b.mi	3f
-	ldr	w3, [x1, #-4]!
-	sub	x2, x2, #4
-	str	w3, [x4, #-4]!
-3:	adds	x2, x2, #2
-	b.mi	4f
-	ldrh	w3, [x1, #-2]!
-	sub	x2, x2, #2
-	strh	w3, [x4, #-2]!
-4:	adds	x2, x2, #1
-	b.mi	5f
-	ldrb	w3, [x1, #-1]
-	strb	w3, [x4, #-1]
-5:	ret
+	cmp	dstin, src
+	/*b.eq	.Lexitfunc*/
+	b.lo	memcpy
+	add	tmp1, src, count
+	cmp	dstin, tmp1
+	b.hs	memcpy		/* No overlap.  */
+
+	add	dst, dstin, count
+	add	src, src, count
+	cmp	count, #16
+	b.lo	.Ltail15
+
+.Lover16:
+	ands	tmp2, src, #15     /* Bytes to reach alignment.  */
+	b.eq	.LSrcAligned
+	sub	count, count, tmp2
+	/*
+	* Process the unaligned leading bytes first so that src becomes
+	* aligned. The cost of these extra instructions is acceptable, and
+	* all subsequent accesses are then based on an aligned address.
+	*/
+	tbz	tmp2, #0, 1f
+	ldrb	tmp1w, [src, #-1]!
+	strb	tmp1w, [dst, #-1]!
+1:
+	tbz	tmp2, #1, 1f
+	ldrh	tmp1w, [src, #-2]!
+	strh	tmp1w, [dst, #-2]!
+1:
+	tbz	tmp2, #2, 1f
+	ldr	tmp1w, [src, #-4]!
+	str	tmp1w, [dst, #-4]!
+1:
+	tbz	tmp2, #3, .LSrcAligned
+	ldr	tmp1, [src, #-8]!
+	str	tmp1, [dst, #-8]!
+
+.LSrcAligned:
+	cmp	count, #64
+	b.ge	.Lcpy_over64
+
+	/*
+	* Deal with small copies quickly by dropping straight into the
+	* exit block.*/
+.Ltail63:
+	/*
+	* Copy up to 48 bytes of data. At this point we only need the
+	* bottom 6 bits of count to be accurate.
+	*/
+	ands	tmp1, count, #0x30
+	b.eq	.Ltail15
+	cmp	tmp1w, #0x20
+	b.eq	1f
+	b.lt	2f
+	ldp	A_l, A_h, [src, #-16]!
+	stp	A_l, A_h, [dst, #-16]!
+1:
+	ldp	A_l, A_h, [src, #-16]!
+	stp	A_l, A_h, [dst, #-16]!
+2:
+	ldp	A_l, A_h, [src, #-16]!
+	stp	A_l, A_h, [dst, #-16]!
+
+.Ltail15:
+	tbz	count, #3, 1f
+	ldr	tmp1, [src, #-8]!
+	str	tmp1, [dst, #-8]!
+1:
+	tbz	count, #2, 1f
+	ldr	tmp1w, [src, #-4]!
+	str	tmp1w, [dst, #-4]!
+1:
+	tbz	count, #1, 1f
+	ldrh	tmp1w, [src, #-2]!
+	strh	tmp1w, [dst, #-2]!
+1:
+	tbz	count, #0, .Lexitfunc
+	ldrb	tmp1w, [src, #-1]
+	strb	tmp1w, [dst, #-1]
+
+.Lexitfunc:
+	ret
+
+.Lcpy_over64:
+	subs	count, count, #128
+	b.ge	.Lcpy_body_large
+	/*
+	* Less than 128 bytes to copy, so handle 64 here and then jump
+	* to the tail.
+	*/
+	ldp	A_l, A_h, [src, #-16]
+	stp	A_l, A_h, [dst, #-16]
+	ldp	B_l, B_h, [src, #-32]
+	ldp	C_l, C_h, [src, #-48]
+	stp	B_l, B_h, [dst, #-32]
+	stp	C_l, C_h, [dst, #-48]
+	ldp	D_l, D_h, [src, #-64]!
+	stp	D_l, D_h, [dst, #-64]!
+
+	tst	count, #0x3f
+	b.ne	.Ltail63
+	ret
+
+	/*
+	* Critical loop. Start at a new cache line boundary. Assuming
+	* 64 bytes per line this ensures the entire loop is in one line.
+	*/
+	.p2align	6
+.Lcpy_body_large:
+	/* There are at least 128 bytes to copy.  */
+	ldp	A_l, A_h, [src, #-16]
+	ldp	B_l, B_h, [src, #-32]
+	ldp	C_l, C_h, [src, #-48]
+	ldp	D_l, D_h, [src, #-64]!
+1:
+	stp	A_l, A_h, [dst, #-16]
+	ldp	A_l, A_h, [src, #-16]
+	stp	B_l, B_h, [dst, #-32]
+	ldp	B_l, B_h, [src, #-32]
+	stp	C_l, C_h, [dst, #-48]
+	ldp	C_l, C_h, [src, #-48]
+	stp	D_l, D_h, [dst, #-64]!
+	ldp	D_l, D_h, [src, #-64]!
+	subs	count, count, #64
+	b.ge	1b
+	stp	A_l, A_h, [dst, #-16]
+	stp	B_l, B_h, [dst, #-32]
+	stp	C_l, C_h, [dst, #-48]
+	stp	D_l, D_h, [dst, #-64]!
+
+	tst	count, #0x3f
+	b.ne	.Ltail63
+	ret
 ENDPROC(memmove)
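
For reference, the overlap dispatch at the top of memmove corresponds to
the following C logic. This is a sketch for illustration only, using a
plain byte loop for the backward path that the assembly implements with
ldp/stp:

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	void *memmove_sketch(void *dest, const void *src, size_t n)
	{
		/* Forward copy is safe if dest is below src, or if the two
		 * areas do not overlap at all; both cases use memcpy. */
		if ((uintptr_t)dest < (uintptr_t)src ||
		    (uintptr_t)dest >= (uintptr_t)src + n)
			return memcpy(dest, src, n);

		/* Overlapping with dest above src: copy backwards. */
		{
			unsigned char *d = (unsigned char *)dest + n;
			const unsigned char *s = (const unsigned char *)src + n;

			while (n--)
				*--d = *--s;
		}
		return dest;
	}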
-- 
1.7.9.5


* [PATCH 3/6] arm64: lib: Implement optimized memset routine
  2013-12-11  6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
  2013-12-11  6:24 ` [PATCH 1/6] arm64: lib: Implement optimized memcpy routine zhichang.yuan at linaro.org
  2013-12-11  6:24 ` [PATCH 2/6] arm64: lib: Implement optimized memmove routine zhichang.yuan at linaro.org
@ 2013-12-11  6:24 ` zhichang.yuan at linaro.org
  2013-12-16 16:55   ` Will Deacon
  2013-12-11  6:24 ` [PATCH 4/6] arm64: lib: Implement optimized memcmp routine zhichang.yuan at linaro.org
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11  6:24 UTC (permalink / raw)
  To: linux-arm-kernel

From: "zhichang.yuan" <zhichang.yuan@linaro.org>

This patch, based on Linaro's Cortex Strings library, improves
the performance of the assembly optimized memset() function.

Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
 arch/arm64/lib/memset.S |  227 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 201 insertions(+), 26 deletions(-)

diff --git a/arch/arm64/lib/memset.S b/arch/arm64/lib/memset.S
index 87e4a68..90b973e 100644
--- a/arch/arm64/lib/memset.S
+++ b/arch/arm64/lib/memset.S
@@ -1,13 +1,21 @@
 /*
  * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
  * published by the Free Software Foundation.
  *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  * GNU General Public License for more details.
  *
  * You should have received a copy of the GNU General Public License
@@ -18,7 +26,7 @@
 #include <asm/assembler.h>
 
 /*
- * Fill in the buffer with character c (alignment handled by the hardware)
+ * Fill in the buffer with character c
  *
  * Parameters:
  *	x0 - buf
@@ -27,27 +35,194 @@
  * Returns:
  *	x0 - buf
  */
+
+/* By default we assume that the DC instruction can be used to zero
+*  data blocks more efficiently.  In some circumstances this might be
+*  unsafe, for example in an asymmetric multiprocessor environment with
+*  different DC clear lengths (neither the upper nor lower lengths are
+*  safe to use).  The feature can be disabled by defining DONT_USE_DC.
+*/
+
+#define dstin		x0
+#define val		w1
+#define count		x2
+#define tmp1		x3
+#define tmp1w		w3
+#define tmp2		x4
+#define tmp2w		w4
+#define zva_len_x	x5
+#define zva_len	w5
+#define zva_bits_x	x6
+
+#define A_l		x7
+#define A_lw		w7
+#define dst		x8
+#define tmp3w		w9
+#define tmp3		x9
+
 ENTRY(memset)
-	mov	x4, x0
-	and	w1, w1, #0xff
-	orr	w1, w1, w1, lsl #8
-	orr	w1, w1, w1, lsl #16
-	orr	x1, x1, x1, lsl #32
-	subs	x2, x2, #8
-	b.mi	2f
-1:	str	x1, [x4], #8
-	subs	x2, x2, #8
-	b.pl	1b
-2:	adds	x2, x2, #4
-	b.mi	3f
-	sub	x2, x2, #4
-	str	w1, [x4], #4
-3:	adds	x2, x2, #2
-	b.mi	4f
-	sub	x2, x2, #2
-	strh	w1, [x4], #2
-4:	adds	x2, x2, #1
-	b.mi	5f
-	strb	w1, [x4]
-5:	ret
+	mov	dst, dstin	/* Preserve return value.  */
+	and	A_lw, val, #255
+	orr	A_lw, A_lw, A_lw, lsl #8
+	orr	A_lw, A_lw, A_lw, lsl #16
+	orr	A_l, A_l, A_l, lsl #32
+
+	/* First align dst to a 16-byte boundary. */
+	neg	tmp2, dst
+	ands	tmp2, tmp2, #15
+	b.eq	.Laligned
+
+	cmp	count, #15
+	b.le	.Ltail15tiny
+	/*
+	* The count is at least 16, so stp can set 16 bytes at once. This is
+	* more efficient even though the access is unaligned.
+	*/
+	stp	A_l, A_l, [dst]
+	/*make the dst aligned..*/
+	sub	count, count, tmp2
+	add	dst, dst, tmp2
+
+	/*Here, dst is aligned 16 now...*/
+.Laligned:
+#ifndef DONT_USE_DC
+	cbz	A_l, .Lzero_mem
+#endif
+
+.Ltail_maybe_long:
+	cmp	count, #64
+	b.ge	.Lnot_short
+.Ltail63:
+	ands	tmp1, count, #0x30
+	b.eq	.Ltail15tiny
+	cmp	tmp1w, #0x20
+	b.eq	1f
+	b.lt	2f
+	stp	A_l, A_l, [dst], #16
+1:
+	stp	A_l, A_l, [dst], #16
+2:
+	stp	A_l, A_l, [dst], #16
+/*
+* The following store is unaligned, but it is more efficient than falling
+* through to .Ltail15tiny. This code could be removed, at a small
+* performance cost.
+*/
+	ands	count, count, #15
+	cbz	count, 1f
+	add	dst, dst, count
+	stp	A_l, A_l, [dst, #-16]	/* Repeat some/all of last store. */
+1:
+	ret
+
+.Ltail15tiny:
+	/* Set up to 15 bytes.  Does not assume earlier memory
+	being set.  */
+	tbz	count, #3, 1f
+	str	A_l, [dst], #8
+1:
+	tbz	count, #2, 1f
+	str	A_lw, [dst], #4
+1:
+	tbz	count, #1, 1f
+	strh	A_lw, [dst], #2
+1:
+	tbz	count, #0, 1f
+	strb	A_lw, [dst]
+1:
+	ret
+
+	/*
+	* Critical loop. Start at a new cache line boundary. Assuming
+	* 64 bytes per line, this ensures the entire loop is in one line.
+	*/
+	.p2align	6
+.Lnot_short: /* count is at least 64 here */
+	sub	dst, dst, #16/* Pre-bias.  */
+	sub	count, count, #64
+1:
+	stp	A_l, A_l, [dst, #16]
+	stp	A_l, A_l, [dst, #32]
+	stp	A_l, A_l, [dst, #48]
+	stp	A_l, A_l, [dst, #64]!
+	subs	count, count, #64
+	b.ge	1b
+	tst	count, #0x3f
+	add	dst, dst, #16
+	b.ne	.Ltail63
+.Lexitfunc:
+	ret
+
+#ifndef DONT_USE_DC
+	/*
+	* For zeroing memory, check to see if we can use the ZVA feature to
+	* zero entire 'cache' lines.
+	*/
+.Lzero_mem:
+	cmp	count, #63
+	b.le	.Ltail63
+	/*
+	* For zeroing small amounts of memory, it's not worth setting up
+	* the line-clear code.
+	*/
+	cmp	count, #128
+	b.lt	.Lnot_short	/* otherwise count is at least 128 bytes */
+
+	mrs	tmp1, dczid_el0
+	tbnz	tmp1, #4, .Lnot_short
+	mov	tmp3w, #4
+	and	zva_len, tmp1w, #15	/* Safety: other bits reserved.  */
+	lsl	zva_len, tmp3w, zva_len
+
+	ands	tmp3w, zva_len, #63
+	/*
+	* Ensure that zva_len is at least 64: using ZVA is not worthwhile
+	* if the block size is smaller than that.
+	*/
+	b.ne	.Lnot_short
+.Lzero_by_line:
+	/*
+	* Compute how far we need to go to become suitably aligned. We're
+	* already at quad-word alignment.
+	*/
+	cmp	count, zva_len_x
+	b.lt	.Lnot_short		/* Not enough to reach alignment.  */
+	sub	zva_bits_x, zva_len_x, #1
+	neg	tmp2, dst
+	ands	tmp2, tmp2, zva_bits_x
+	b.eq	1f			/* Already aligned.  */
+	/* Not aligned, check that there's enough to copy after alignment.*/
+	sub	tmp1, count, tmp2
+	/*
+	* Guarantee that the remaining length to be zeroed with ZVA is at
+	* least 64, so the loop at 2f cannot run past the end of the range.*/
+	cmp	tmp1, #64
+	ccmp	tmp1, zva_len_x, #8, ge	/* NZCV=0b1000 */
+	b.lt	.Lnot_short
+	/*
+	* We know that there's at least 64 bytes to zero and that it's safe
+	* to overrun by 64 bytes.
+	*/
+	mov	count, tmp1
+2:
+	stp	A_l, A_l, [dst]
+	stp	A_l, A_l, [dst, #16]
+	stp	A_l, A_l, [dst, #32]
+	subs	tmp2, tmp2, #64
+	stp	A_l, A_l, [dst, #48]
+	add	dst, dst, #64
+	b.ge	2b
+	/* We've overrun a bit, so adjust dst downwards.*/
+	add	dst, dst, tmp2
+1:
+	sub	count, count, zva_len_x
+3:
+	dc	zva, dst
+	add	dst, dst, zva_len_x
+	subs	count, count, zva_len_x
+	b.ge	3b
+	ands	count, count, zva_bits_x
+	b.ne	.Ltail_maybe_long
+	ret
+#endif /* DONT_USE_DC */
 ENDPROC(memset)
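
The DC ZVA set-up above is fairly dense: it reads DCZID_EL0, refuses to
use ZVA when bit 4 (the prohibit flag) is set, and derives the block size
as 4 << DCZID_EL0[3:0] bytes. An explanatory C sketch of that calculation
follows; the inline-asm wrapper is an assumption for illustration, not
code from this patch:

	#include <stdint.h>

	static inline uint64_t read_dczid_el0(void)
	{
		uint64_t v;

		asm volatile("mrs %0, dczid_el0" : "=r" (v));
		return v;
	}

	/*
	 * Returns 0 if DC ZVA must not be used, otherwise the ZVA block
	 * size in bytes; DCZID_EL0[3:0] is log2 of the size in 4-byte
	 * words, hence the 4 <<. memset only uses ZVA for blocks >= 64B.
	 */
	static unsigned int zva_block_bytes(void)
	{
		uint64_t dczid = read_dczid_el0();

		if (dczid & (1UL << 4))
			return 0;
		return 4U << (dczid & 0xf);
	}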
-- 
1.7.9.5


* [PATCH 4/6] arm64: lib: Implement optimized memcmp routine
  2013-12-11  6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
                   ` (2 preceding siblings ...)
  2013-12-11  6:24 ` [PATCH 3/6] arm64: lib: Implement optimized memset routine zhichang.yuan at linaro.org
@ 2013-12-11  6:24 ` zhichang.yuan at linaro.org
  2013-12-16 16:56   ` Will Deacon
  2013-12-11  6:24 ` [PATCH 5/6] arm64: lib: Implement optimized string compare routines zhichang.yuan at linaro.org
  2013-12-11  6:24 ` [PATCH 6/6] arm64: lib: Implement optimized string length routines zhichang.yuan at linaro.org
  5 siblings, 1 reply; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11  6:24 UTC (permalink / raw)
  To: linux-arm-kernel

From: "zhichang.yuan" <zhichang.yuan@linaro.org>

This patch, based on Linaro's Cortex Strings library, adds
an assembly optimized memcmp() function.

Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
 arch/arm64/include/asm/string.h |    3 +
 arch/arm64/kernel/arm64ksyms.c  |    1 +
 arch/arm64/lib/Makefile         |    2 +-
 arch/arm64/lib/memcmp.S         |  258 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 263 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/lib/memcmp.S

diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h
index 3ee8b30..3a43305 100644
--- a/arch/arm64/include/asm/string.h
+++ b/arch/arm64/include/asm/string.h
@@ -34,4 +34,7 @@ extern void *memchr(const void *, int, __kernel_size_t);
 #define __HAVE_ARCH_MEMSET
 extern void *memset(void *, int, __kernel_size_t);
 
+#define __HAVE_ARCH_MEMCMP
+extern int memcmp(const void *, const void *, size_t);
+
 #endif
diff --git a/arch/arm64/kernel/arm64ksyms.c b/arch/arm64/kernel/arm64ksyms.c
index e7ee770..af02a25 100644
--- a/arch/arm64/kernel/arm64ksyms.c
+++ b/arch/arm64/kernel/arm64ksyms.c
@@ -51,6 +51,7 @@ EXPORT_SYMBOL(memset);
 EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(memchr);
+EXPORT_SYMBOL(memcmp);
 
 	/* atomic bitops */
 EXPORT_SYMBOL(set_bit);
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 59acc0e..a6a8d3d 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -2,5 +2,5 @@ lib-y		:= bitops.o delay.o					\
 		   strncpy_from_user.o strnlen_user.o clear_user.o	\
 		   copy_from_user.o copy_to_user.o copy_in_user.o	\
 		   copy_page.o clear_page.o				\
-		   memchr.o memcpy.o memmove.o memset.o			\
+		   memchr.o memcpy.o memmove.o memset.o memcmp.o	\
 		   strchr.o strrchr.o
diff --git a/arch/arm64/lib/memcmp.S b/arch/arm64/lib/memcmp.S
new file mode 100644
index 0000000..97e8431
--- /dev/null
+++ b/arch/arm64/lib/memcmp.S
@@ -0,0 +1,258 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/*
+* compare memory areas (when the two areas have different alignment
+* offsets, unaligned accesses are handled by the hardware)
+*
+* Parameters:
+*  x0 - const memory area 1 pointer
+*  x1 - const memory area 2 pointer
+*  x2 - the maximal compare byte length
+* Returns:
+*  x0 - an integer less than, equal to, or greater than zero
+*/
+
+/* Parameters and result.  */
+#define src1		x0
+#define src2		x1
+#define limit		x2
+#define result		x0
+
+/* Internal variables.  */
+#define data1		x3
+#define data1w		w3
+#define data2		x4
+#define data2w		w4
+#define has_nul		x5
+#define diff		x6
+#define endloop		x7
+#define tmp1		x8
+#define tmp2		x9
+#define tmp3		x10
+#define pos		x11
+#define limit_wd	x12
+#define mask		x13
+
+ENTRY(memcmp)
+	cbz	limit, .Lret0
+	eor	tmp1, src1, src2
+	tst	tmp1, #7
+	b.ne	.Lmisaligned8
+	ands	tmp1, src1, #7
+	b.ne	.Lmutual_align
+	sub	limit_wd, limit, #1 /* limit != 0, so no underflow.  */
+	lsr	limit_wd, limit_wd, #3 /* Convert to Dwords.  */
+.Lloop_aligned:
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+.Lstart_realigned:
+	subs	limit_wd, limit_wd, #1
+	eor	diff, data1, data2	/* Non-zero if differences found.  */
+	csinv	endloop, diff, xzr, cs	/* Last Dword or differences.  */
+	cbz	endloop, .Lloop_aligned
+
+	/* Not reached the limit, must have found a diff.  */
+	tbz	limit_wd, #63, .Lnot_limit
+
+	/* Limit % 8 == 0 => all bytes significant.  */
+	ands	limit, limit, #7
+	b.eq	.Lnot_limit
+
+	lsl	limit, limit, #3	/* Bits -> bytes.  */
+	mov	mask, #~0
+#ifdef __ARM64EB__
+	lsr	mask, mask, limit
+#else
+	lsl	mask, mask, limit
+#endif
+	bic	data1, data1, mask
+	bic	data2, data2, mask
+
+	orr	diff, diff, mask
+	b	.Lnot_limit
+
+.Lmutual_align:
+	/*
+	* Sources are mutually aligned, but are not currently at an
+	* alignment boundary. Round down the addresses and then mask off
+	* the bytes that precede the start point.
+	*/
+	bic	src1, src1, #7
+	bic	src2, src2, #7
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	sub	limit_wd, limit, #1/* limit != 0, so no underflow.  */
+	and	tmp3, limit_wd, #7
+	lsr	limit_wd, limit_wd, #3
+	add	tmp3, tmp3, tmp1
+	add	limit_wd, limit_wd, tmp3, lsr #3
+	add	limit, limit, tmp1/* Adjust the limit for the extra.  */
+	lsl	tmp1, tmp1, #3/* Bytes beyond alignment -> bits.*/
+	neg	tmp1, tmp1/* Bits to alignment -64.  */
+	mov	tmp2, #~0
+#ifdef __ARM64EB__
+	/* Big-endian.  Early bytes are at MSB.  */
+	lsl	tmp2, tmp2, tmp1/* Shift (tmp1 & 63).  */
+#else
+	/* Little-endian.  Early bytes are at LSB.  */
+	lsr	tmp2, tmp2, tmp1/* Shift (tmp1 & 63).  */
+#endif
+	orr	data1, data1, tmp2
+	orr	data2, data2, tmp2
+	b	.Lstart_realigned
+
+.Lmisaligned8:
+	cmp	limit, #8
+	b.lo	.Ltiny8proc /*limit < 8... */
+	/*
+	* Work out how many leading bytes to compare one at a time so that
+	* one source address becomes aligned afterwards.
+	*/
+	and	tmp1, src1, #7
+	neg	tmp1, tmp1
+	add	tmp1, tmp1, #8
+	and	tmp2, src2, #7
+	neg	tmp2, tmp2
+	add	tmp2, tmp2, #8
+	subs	tmp3, tmp1, tmp2
+	csel	pos, tmp1, tmp2, hi /*Choose the maximum. */
+	/*
+	* Here, limit is at least 8, so .Ltinycmp can run directly
+	* without checking the limit.*/
+	sub	limit, limit, pos
+.Ltinycmp:
+	ldrb	data1w, [src1], #1
+	ldrb	data2w, [src2], #1
+	subs	pos, pos, #1
+	ccmp	data1w, data2w, #0, ne  /* NZCV = 0b0000.  */
+	b.eq	.Ltinycmp
+	cbnz	pos, 1f /*find the unequal...*/
+	cmp	data1w, data2w
+	b.eq	.Lstart_align /*the last bytes are equal....*/
+1:
+	sub	result, data1, data2
+	ret
+
+.Lstart_align:
+	lsr	limit_wd, limit, #3
+	cbz	limit_wd, .Lremain8
+	ands	xzr, src1, #7
+	/*
+	* eq means that tmp1 bytes finished the compare in .Ltinycmp;
+	* tmp3 is positive here.
+	*/
+	b.eq	.Lrecal_offset
+	add	src1, src1, tmp3
+	add	src2, src2, tmp3
+	sub	limit, limit, tmp3
+	lsr	limit_wd, limit, #3
+	cbz	limit_wd, .Lremain8
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+
+	subs	limit_wd, limit_wd, #1
+	eor	diff, data1, data2  /*Non-zero if differences found.*/
+	csinv	endloop, diff, xzr, ne
+	cbnz	endloop, .Lunequal_proc
+	and	tmp3, tmp3, #7  /*tmp3 = 8 + tmp3 ( old tmp3 is negative)*/
+	/*
+	* src1 is aligned and is to the right of src2.
+	* The remaining count is at least 8 here.
+	*/
+.Lrecal_offset:
+	neg	pos, tmp3
+.Lloopcmp_proc:
+	/*
+	* Fall back by pos bytes (pos is negative here) to get the leading
+	* segment of the next Dword of src1. We could also use:
+	* ldr	data1, [src1]
+	* ldr	data2, [src2, pos]
+	* so that both loads use aligned addresses and then do the compare.
+	* But with that approach some extra shift and mask instructions
+	* would be needed to build two comparable Dwords from the loaded
+	* values, so more instructions would be added and, most
+	* importantly, it would cost more time.
+	*/
+	ldr	data1, [src1,pos]
+	ldr	data2, [src2,pos]
+	eor	diff, data1, data2  /* Non-zero if differences found.*/
+	cbnz	diff, .Lnot_limit
+
+	/*The second part process*/
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	eor	diff, data1, data2  /* Non-zero if differences found.*/
+	subs	limit_wd, limit_wd, #1
+	csinv	endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
+	cbz	endloop, .Lloopcmp_proc
+.Lunequal_proc:
+	/*whether  unequal occurred?*/
+	cbz	diff, .Lremain8
+.Lnot_limit:
+#ifndef	__ARM64EB__
+	rev	diff, diff
+	rev	data1, data1
+	rev	data2, data2
+#endif
+	/*
+	* The MS-non-zero bit of DIFF marks either the first bit
+	* that is different, or the end of the significant data.
+	* Shifting left now will bring the critical information into the
+	* top bits.
+	*/
+	clz	pos, diff
+	lsl	data1, data1, pos
+	lsl	data2, data2, pos
+	/* But we need to zero-extend (char is unsigned) the value and then
+	perform a signed 32-bit subtraction.  */
+	lsr	data1, data1, #56
+	sub	result, data1, data2, lsr #56
+	ret
+
+	.p2align	6
+.Lremain8:
+	/* Limit % 8 == 0 => all bytes significant.  */
+	ands	limit, limit, #7
+	b.eq	.Lret0
+
+.Ltiny8proc:
+	/*Perhaps we can do better than this.*/
+	ldrb	data1w, [src1], #1
+	ldrb	data2w, [src2], #1
+	subs	limit, limit, #1
+	/*
+	* ne means the remaining limit is still > 0, so compare the data
+	* bytes; otherwise the flags are forced to 0 and the loop exits.*/
+	ccmp	data1w, data2w, #0, ne  /* NZCV = 0b0000. */
+	b.eq	.Ltiny8proc
+	sub	result, data1, data2
+	ret
+.Lret0:
+	mov	result, #0
+	ret
+ENDPROC(memcmp)
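
The final result computation (.Lnot_limit above, little-endian case) can
be read as the following C, shown purely as an illustration of the
rev/clz trick:

	#include <stdint.h>

	/* data1/data2 are the first differing Dwords; their XOR is known
	 * to be non-zero when this runs. */
	static int first_byte_diff_le(uint64_t data1, uint64_t data2)
	{
		/* byte-swap so the lowest-addressed byte is most significant */
		uint64_t diff = __builtin_bswap64(data1 ^ data2);
		int pos = __builtin_clzll(diff);	/* first differing bit */
		uint64_t a = __builtin_bswap64(data1) << pos;
		uint64_t b = __builtin_bswap64(data2) << pos;

		/* the top bytes now start at the first differing bit */
		return (int)((a >> 56) - (b >> 56));
	}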
-- 
1.7.9.5


* [PATCH 5/6] arm64: lib: Implement optimized string compare routines
  2013-12-11  6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
                   ` (3 preceding siblings ...)
  2013-12-11  6:24 ` [PATCH 4/6] arm64: lib: Implement optimized memcmp routine zhichang.yuan at linaro.org
@ 2013-12-11  6:24 ` zhichang.yuan at linaro.org
  2013-12-11  6:24 ` [PATCH 6/6] arm64: lib: Implement optimized string length routines zhichang.yuan at linaro.org
  5 siblings, 0 replies; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11  6:24 UTC (permalink / raw)
  To: linux-arm-kernel

From: "zhichang.yuan" <zhichang.yuan@linaro.org>

This patch, based on Linaro's Cortex Strings library, adds
assembly-optimized strcmp() and strncmp() functions.

Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
 arch/arm64/include/asm/string.h |    6 +
 arch/arm64/kernel/arm64ksyms.c  |    2 +
 arch/arm64/lib/Makefile         |    2 +-
 arch/arm64/lib/strcmp.S         |  256 +++++++++++++++++++++++++++++
 arch/arm64/lib/strncmp.S        |  340 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 605 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/lib/strcmp.S
 create mode 100644 arch/arm64/lib/strncmp.S

diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h
index 3a43305..6133f49 100644
--- a/arch/arm64/include/asm/string.h
+++ b/arch/arm64/include/asm/string.h
@@ -22,6 +22,12 @@ extern char *strrchr(const char *, int c);
 #define __HAVE_ARCH_STRCHR
 extern char *strchr(const char *, int c);
 
+#define __HAVE_ARCH_STRCMP
+extern int strcmp(const char *, const char *);
+
+#define __HAVE_ARCH_STRNCMP
+extern int strncmp(const char *, const char *, __kernel_size_t);
+
 #define __HAVE_ARCH_MEMCPY
 extern void *memcpy(void *, const void *, __kernel_size_t);
 
diff --git a/arch/arm64/kernel/arm64ksyms.c b/arch/arm64/kernel/arm64ksyms.c
index af02a25..d61fa39 100644
--- a/arch/arm64/kernel/arm64ksyms.c
+++ b/arch/arm64/kernel/arm64ksyms.c
@@ -47,6 +47,8 @@ EXPORT_SYMBOL(memstart_addr);
 	/* string / mem functions */
 EXPORT_SYMBOL(strchr);
 EXPORT_SYMBOL(strrchr);
+EXPORT_SYMBOL(strcmp);
+EXPORT_SYMBOL(strncmp);
 EXPORT_SYMBOL(memset);
 EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(memmove);
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index a6a8d3d..5ded21f 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -3,4 +3,4 @@ lib-y		:= bitops.o delay.o					\
 		   copy_from_user.o copy_to_user.o copy_in_user.o	\
 		   copy_page.o clear_page.o				\
 		   memchr.o memcpy.o memmove.o memset.o memcmp.o	\
-		   strchr.o strrchr.o
+		   strchr.o strrchr.o strcmp.o strncmp.o
diff --git a/arch/arm64/lib/strcmp.S b/arch/arm64/lib/strcmp.S
new file mode 100644
index 0000000..e07ea0d
--- /dev/null
+++ b/arch/arm64/lib/strcmp.S
@@ -0,0 +1,256 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/*
+ * compare two strings
+ *
+ * Parameters:
+ *	x0 - const string 1 pointer
+ *    x1 - const string 2 pointer
+ * Returns:
+ * x0 - an integer less than, equal to, or greater than zero
+ * if  s1  is  found, respectively, to be less than, to match,
+ * or be greater than s2.
+ */
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+/* Parameters and result.  */
+#define src1		x0
+#define src2		x1
+#define result		x0
+
+/* Internal variables.  */
+#define data1		x2
+#define data1w		w2
+#define data2		x3
+#define data2w		w3
+#define has_nul		x4
+#define diff		x5
+#define syndrome	x6
+#define tmp1		x7
+#define tmp2		x8
+#define tmp3		x9
+#define zeroones	x10
+#define pos		x11
+
+ENTRY(strcmp)
+	eor	tmp1, src1, src2
+	mov	zeroones, #REP8_01
+	tst	tmp1, #7
+	b.ne	.Lmisaligned8
+	ands	tmp1, src1, #7
+	b.ne	.Lmutual_align
+
+	/*
+	* NUL detection works on the principle that (X - 1) & (~X) & 0x80
+	* (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+	* can be done in parallel across the entire word.
+	*/
+.Lloop_aligned:
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+.Lstart_realigned:
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	eor	diff, data1, data2	/* Non-zero if differences found.  */
+	bic	has_nul, tmp1, tmp2	/* Non-zero if NUL terminator.  */
+	orr	syndrome, diff, has_nul
+	cbz	syndrome, .Lloop_aligned
+	b	.Lcal_cmpresult
+
+.Lmutual_align:
+	/*
+	* Sources are mutually aligned, but are not currently at an
+	* alignment boundary.  Round down the addresses and then mask off
+	* the bytes that precede the start point.
+	*/
+	bic	src1, src1, #7
+	bic	src2, src2, #7
+	lsl	tmp1, tmp1, #3		/* Bytes beyond alignment -> bits.  */
+	ldr	data1, [src1], #8
+	neg	tmp1, tmp1		/* Bits to alignment -64.  */
+	ldr	data2, [src2], #8
+	mov	tmp2, #~0
+#ifdef __ARM64EB__
+	/* Big-endian.  Early bytes are at MSB.  */
+	lsl	tmp2, tmp2, tmp1	/* Shift (tmp1 & 63).  */
+#else
+	/* Little-endian.  Early bytes are at LSB.  */
+	lsr	tmp2, tmp2, tmp1	/* Shift (tmp1 & 63).  */
+#endif
+	orr	data1, data1, tmp2
+	orr	data2, data2, tmp2
+	b	.Lstart_realigned
+
+.Lmisaligned8:
+	/*
+	* Work out how many leading bytes to compare one at a time so that
+	* one source address becomes aligned afterwards.
+	*/
+	and	tmp1, src1, #7
+	neg	tmp1, tmp1
+	add	tmp1, tmp1, #8
+	and	tmp2, src2, #7
+	neg	tmp2, tmp2
+	add	tmp2, tmp2, #8
+	subs	tmp3, tmp1, tmp2
+	csel	pos, tmp1, tmp2, hi /*Choose the maximum. */
+.Ltinycmp:
+	ldrb	data1w, [src1], #1
+	ldrb	data2w, [src2], #1
+	subs	pos, pos, #1
+	ccmp	data1w, #1, #0, ne  /* NZCV = 0b0000.  */
+	ccmp	data1w, data2w, #0, cs  /* NZCV = 0b0000.  */
+	b.eq	.Ltinycmp
+	cbnz	pos, 1f /*find the null or unequal...*/
+	cmp	data1w, #1
+	ccmp	data1w, data2w, #0, cs
+	b.eq	.Lstart_align /*the last bytes are equal....*/
+1:
+	sub	result, data1, data2
+	ret
+
+.Lstart_align:
+	ands	xzr, src1, #7
+	/*
+	* eq means src1 is aligned now;
+	* tmp3 is positive for this branch.
+	*/
+	b.eq	.Lrecal_offset
+	add	src1, src1, tmp3
+	add	src2, src2, tmp3
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	bic	has_nul, tmp1, tmp2
+	eor	diff, data1, data2 /* Non-zero if differences found.  */
+	orr	syndrome, diff, has_nul
+	cbnz	syndrome, .Lcal_cmpresult
+	and	tmp3, tmp3, #7  /*tmp3 = 8 + tmp3 ( old tmp3 is negative)*/
+	/*
+	* src1 is aligned and is to the right of src2.
+	* Start comparing the next 8 bytes of the two strings.
+	*/
+.Lrecal_offset:
+	neg	pos, tmp3
+.Lloopcmp_proc:
+	/*
+	* Fall back by pos bytes (pos is negative here) to get the leading
+	* segment of the next Dword of src1. We could also use:
+	* ldr	data1, [src1]
+	* ldr	data2, [src2, pos]
+	* so that both loads use aligned addresses and then do the compare.
+	* But with that approach some extra shift and mask instructions
+	* would be needed to build two comparable Dwords from the loaded
+	* values, so more instructions would be added and, most
+	* importantly, it would cost more time.
+	*/
+	ldr	data1, [src1,pos]
+	ldr	data2, [src2,pos]
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	bic	has_nul, tmp1, tmp2
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	orr	syndrome, diff, has_nul
+	cbnz	syndrome, .Lcal_cmpresult
+
+	/*The second part process*/
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	bic	has_nul, tmp1, tmp2
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	orr	syndrome, diff, has_nul
+	cbz	syndrome, .Lloopcmp_proc
+
+.Lcal_cmpresult:
+#ifndef	__ARM64EB__
+	rev	syndrome, syndrome
+	rev	data1, data1
+	/*
+	* The MS-non-zero bit of the syndrome marks either the first bit
+	* that is different, or the top bit of the first zero byte.
+	* Shifting left now will bring the critical information into the
+	* top bits.
+	*/
+	clz	pos, syndrome
+	rev	data2, data2
+	lsl	data1, data1, pos
+	lsl	data2, data2, pos
+	/*
+	* But we need to zero-extend (char is unsigned) the value and then
+	* perform a signed 32-bit subtraction.
+	*/
+	lsr	data1, data1, #56
+	sub	result, data1, data2, lsr #56
+	ret
+#else
+	/*
+	* For big-endian we cannot use the trick with the syndrome value
+	* as carry-propagation can corrupt the upper bits if the trailing
+	* bytes in the string contain 0x01.
+	* However, if there is no NUL byte in the dword, we can generate
+	* the result directly.  We can't just subtract the bytes as the
+	* MSB might be significant.
+	*/
+	cbnz	has_nul, 1f
+	cmp	data1, data2
+	cset	result, ne
+	cneg	result, result, lo
+	ret
+1:
+	/*Re-compute the NUL-byte detection, using a byte-reversed value. */
+	rev	tmp3, data1
+	sub	tmp1, tmp3, zeroones
+	orr	tmp2, tmp3, #REP8_7f
+	bic	has_nul, tmp1, tmp2
+	rev	has_nul, has_nul
+	orr	syndrome, diff, has_nul
+	clz	pos, syndrome
+	/*
+	* The MS-non-zero bit of the syndrome marks either the first bit
+	* that is different, or the top bit of the first zero byte.
+	* Shifting left now will bring the critical information into the
+	* top bits.
+	*/
+	lsl	data1, data1, pos
+	lsl	data2, data2, pos
+	/*
+	* But we need to zero-extend (char is unsigned) the value and then
+	* perform a signed 32-bit subtraction.
+	*/
+	lsr	data1, data1, #56
+	sub	result, data1, data2, lsr #56
+	ret
+#endif
+ENDPROC(strcmp)
diff --git a/arch/arm64/lib/strncmp.S b/arch/arm64/lib/strncmp.S
new file mode 100644
index 0000000..e883fd3
--- /dev/null
+++ b/arch/arm64/lib/strncmp.S
@@ -0,0 +1,340 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/*
+ * compare two strings
+ *
+ * Parameters:
+ *  x0 - const string 1 pointer
+ *  x1 - const string 2 pointer
+ *  x2 - the maximal length to be compared
+ * Returns:
+ *  x0 - an integer less than, equal to, or greater than zero if s1 is found,
+ *     respectively, to be less than, to match, or be greater than s2.
+ */
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+/* Parameters and result.  */
+#define src1		x0
+#define src2		x1
+#define limit		x2
+#define result		x0
+
+/* Internal variables.  */
+#define data1		x3
+#define data1w		w3
+#define data2		x4
+#define data2w		w4
+#define has_nul		x5
+#define diff		x6
+#define syndrome	x7
+#define tmp1		x8
+#define tmp2		x9
+#define tmp3		x10
+#define zeroones	x11
+#define pos		x12
+#define limit_wd	x13
+#define mask		x14
+#define endloop		x15
+
+ENTRY(strncmp)
+	cbz	limit, .Lret0
+	eor	tmp1, src1, src2
+	mov	zeroones, #REP8_01
+	tst	tmp1, #7
+	b.ne	.Lmisaligned8
+	ands	tmp1, src1, #7
+	b.ne	.Lmutual_align
+	/* Calculate the number of full and partial words -1.  */
+	/*
+	* When limit is a multiple of 8, the check on the last Dword would
+	* be wrong if we did not subtract 1 here.
+	*/
+	sub	limit_wd, limit, #1 /* limit != 0, so no underflow.  */
+	lsr	limit_wd, limit_wd, #3  /* Convert to Dwords.  */
+
+	/*
+	* NUL detection works on the principle that (X - 1) & (~X) & 0x80
+	* (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+	* can be done in parallel across the entire word.
+	*/
+.Lloop_aligned:
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+.Lstart_realigned:
+	subs	limit_wd, limit_wd, #1
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	csinv	endloop, diff, xzr, pl  /* Last Dword or differences.*/
+	bics	has_nul, tmp1, tmp2 /* Non-zero if NUL terminator.  */
+	ccmp	endloop, #0, #0, eq
+	b.eq	.Lloop_aligned
+
+	/*Not reached the limit, must have found the end or a diff.  */
+	tbz	limit_wd, #63, .Lnot_limit
+
+	/* Limit % 8 == 0 => all bytes significant.  */
+	ands	limit, limit, #7
+	b.eq	.Lnot_limit
+
+	lsl	limit, limit, #3    /* Bits -> bytes.  */
+	mov	mask, #~0
+#ifdef __ARM64EB__
+	lsr	mask, mask, limit
+#else
+	lsl	mask, mask, limit
+#endif
+	bic	data1, data1, mask
+	bic	data2, data2, mask
+
+	/* Make sure that the NUL byte is marked in the syndrome.  */
+	orr	has_nul, has_nul, mask
+
+.Lnot_limit:
+	orr	syndrome, diff, has_nul
+	b	.Lcal_cmpresult
+
+.Lmutual_align:
+	/*
+	* Sources are mutually aligned, but are not currently at an
+	* alignment boundary.  Round down the addresses and then mask off
+	* the bytes that precede the start point.
+	* We also need to adjust the limit calculations, but without
+	* overflowing if the limit is near ULONG_MAX.
+	*/
+	bic	src1, src1, #7
+	bic	src2, src2, #7
+	ldr	data1, [src1], #8
+	neg	tmp3, tmp1, lsl #3  /* 64 - bits(bytes beyond align). */
+	ldr	data2, [src2], #8
+	mov	tmp2, #~0
+	sub	limit_wd, limit, #1 /* limit != 0, so no underflow.  */
+#ifdef __ARM64EB__
+	/* Big-endian.  Early bytes are at MSB.  */
+	lsl	tmp2, tmp2, tmp3    /* Shift (tmp1 & 63).  */
+#else
+	/* Little-endian.  Early bytes are at LSB.  */
+	lsr	tmp2, tmp2, tmp3    /* Shift (tmp1 & 63).  */
+#endif
+	and	tmp3, limit_wd, #7
+	lsr	limit_wd, limit_wd, #3
+	/* Adjust the limit. Only low 3 bits used, so overflow irrelevant.*/
+	add	limit, limit, tmp1
+	add	tmp3, tmp3, tmp1
+	orr	data1, data1, tmp2
+	orr	data2, data2, tmp2
+	add	limit_wd, limit_wd, tmp3, lsr #3
+	b	.Lstart_realigned
+
+/*when src1 offset is not equal to src2 offset...*/
+.Lmisaligned8:
+	cmp	limit, #8
+	b.lo	.Ltiny8proc /*limit < 8... */
+	/*
+	* Work out how many leading bytes to compare one at a time so that
+	* one source address becomes aligned afterwards.*/
+	and	tmp1, src1, #7
+	neg	tmp1, tmp1
+	add	tmp1, tmp1, #8
+	and	tmp2, src2, #7
+	neg	tmp2, tmp2
+	add	tmp2, tmp2, #8
+	subs	tmp3, tmp1, tmp2
+	csel	pos, tmp1, tmp2, hi /*Choose the maximum. */
+	/*
+	* Here, limit is at least 8, so .Ltinycmp can run directly
+	* without checking the limit.*/
+	sub	limit, limit, pos
+.Ltinycmp:
+	ldrb	data1w, [src1], #1
+	ldrb	data2w, [src2], #1
+	subs	pos, pos, #1
+	ccmp	data1w, #1, #0, ne  /* NZCV = 0b0000.  */
+	ccmp	data1w, data2w, #0, cs  /* NZCV = 0b0000.  */
+	b.eq	.Ltinycmp
+	cbnz	pos, 1f /*find the null or unequal...*/
+	cmp	data1w, #1
+	ccmp	data1w, data2w, #0, cs
+	b.eq	.Lstart_align /*the last bytes are equal....*/
+1:
+	sub	result, data1, data2
+	ret
+
+.Lstart_align:
+	lsr	limit_wd, limit, #3
+	cbz	limit_wd, .Lremain8
+	ands	xzr, src1, #7
+	/* eq means src1 is aligned now; tmp3 is positive in this branch. */
+	b.eq	.Lrecal_offset
+	add	src1, src1, tmp3
+	add	src2, src2, tmp3
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	/*
+	* Since tmp3 is negative and limit is at least 8, limit_wd cannot
+	* be zero here, so no cbz check is needed.
+	*/
+	sub	limit, limit, tmp3
+	lsr	limit_wd, limit, #3
+	subs	limit_wd, limit_wd, #1
+
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	csinv	endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
+	bics	has_nul, tmp1, tmp2
+	ccmp	endloop, #0, #0, eq /*has_null is ZERO: no null byte*/
+	b.ne	.Lunequal_proc
+	and	tmp3, tmp3, #7  /*tmp3 = 8 + tmp3 ( old tmp3 is negative)*/
+	/*
+	* src1 is aligned and is to the right of src2.
+	* Start the next 8-byte compare.
+	*/
+.Lrecal_offset:
+	neg	pos, tmp3
+.Lloopcmp_proc:
+	/*
+	* Fall back by pos bytes (pos is negative here) to get the leading
+	* segment of the next Dword of src1. We could also use:
+	* ldr	data1, [src1]
+	* ldr	data2, [src2, pos]
+	* so that both loads use aligned addresses and then do the compare.
+	* But with that approach some extra shift and mask instructions
+	* would be needed to build two comparable Dwords from the loaded
+	* values, so more instructions would be added and, most
+	* importantly, it would cost more time.
+	*/
+	ldr	data1, [src1,pos]
+	ldr	data2, [src2,pos]
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	bics	has_nul, tmp1, tmp2 /* Non-zero if NUL terminator.  */
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	csinv	endloop, diff, xzr, eq
+	cbnz	endloop, .Lunequal_proc
+
+	/*The second part process*/
+	ldr	data1, [src1], #8
+	ldr	data2, [src2], #8
+	subs	limit_wd, limit_wd, #1
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	eor	diff, data1, data2  /* Non-zero if differences found.  */
+	csinv	endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
+	bics	has_nul, tmp1, tmp2
+	ccmp	endloop, #0, #0, eq /*has_null is ZERO: no null byte*/
+	b.eq	.Lloopcmp_proc
+
+.Lunequal_proc:
+	orr	syndrome, diff, has_nul
+	cbz	syndrome, .Lremain8
+.Lcal_cmpresult:
+#ifndef	__ARM64EB__
+	rev	syndrome, syndrome
+	rev	data1, data1
+	/*
+	* The MS-non-zero bit of the syndrome marks either the first bit
+	* that is different, or the top bit of the first zero byte.
+	* Shifting left now will bring the critical information into the
+	* top bits.
+	*/
+	clz	pos, syndrome
+	rev	data2, data2
+	lsl	data1, data1, pos
+	lsl	data2, data2, pos
+	/* But we need to zero-extend (char is unsigned) the value and then
+	perform a signed 32-bit subtraction.  */
+	lsr	data1, data1, #56
+	sub	result, data1, data2, lsr #56
+	ret
+#else
+	/*
+	* For big-endian we cannot use the trick with the syndrome value
+	* as carry-propagation can corrupt the upper bits if the trailing
+	* bytes in the string contain 0x01.
+	* However, if there is no NUL byte in the dword, we can generate
+	* the result directly.  We can't just subtract the bytes as the
+	* MSB might be significant.
+	*/
+	cbnz	has_nul, 1f
+	cmp	data1, data2
+	cset	result, ne
+	cneg	result, result, lo
+	ret
+1:
+	/* Re-compute the NUL-byte detection, using a byte-reversed value.*/
+	rev	tmp3, data1
+	sub	tmp1, tmp3, zeroones
+	orr	tmp2, tmp3, #REP8_7f
+	bic	has_nul, tmp1, tmp2
+	rev	has_nul, has_nul
+	orr	syndrome, diff, has_nul
+	clz	pos, syndrome
+	/*
+	* The MS-non-zero bit of the syndrome marks either the first bit
+	* that is different, or the top bit of the first zero byte.
+	* Shifting left now will bring the critical information into the
+	* top bits.
+	*/
+	lsl	data1, data1, pos
+	lsl	data2, data2, pos
+	/*
+	* But we need to zero-extend (char is unsigned) the value and then
+	* perform a signed 32-bit subtraction.
+	*/
+	lsr	data1, data1, #56
+	sub	result, data1, data2, lsr #56
+	ret
+#endif
+
+.Lremain8:
+	/* Limit % 8 == 0 => all bytes significant.  */
+	ands	limit, limit, #7
+	b.eq	.Lret0
+.Ltiny8proc:
+	/* Perhaps we can do better than this.  */
+	ldrb	data1w, [src1], #1
+	ldrb	data2w, [src2], #1
+	subs	limit, limit, #1
+	/*
+	* ne means the remaining limit is still > 0; otherwise Z=1 forces
+	* cs to 0 and the next ccmp sets constant flags, ending the loop.
+	*/
+	ccmp	data1w, #1, #0, ne  /* NZCV = 0b0000.  */
+	ccmp	data1w, data2w, #0, cs  /* NZCV = 0b0000.  */
+	b.eq	.Ltiny8proc
+	sub	result, data1, data2
+	ret
+
+.Lret0:
+	mov	result, #0
+	ret
+ENDPROC(strncmp)
-- 
1.7.9.5


* [PATCH 6/6] arm64: lib: Implement optimized string length routines
  2013-12-11  6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
                   ` (4 preceding siblings ...)
  2013-12-11  6:24 ` [PATCH 5/6] arm64: lib: Implement optimized string compare routines zhichang.yuan at linaro.org
@ 2013-12-11  6:24 ` zhichang.yuan at linaro.org
  5 siblings, 0 replies; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11  6:24 UTC (permalink / raw)
  To: linux-arm-kernel

From: "zhichang.yuan" <zhichang.yuan@linaro.org>

This patch, based on Linaro's Cortex Strings library, adds
assembly-optimized strlen() and strnlen() functions.

Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
 arch/arm64/include/asm/string.h |    6 ++
 arch/arm64/kernel/arm64ksyms.c  |    2 +
 arch/arm64/lib/Makefile         |    3 +-
 arch/arm64/lib/strlen.S         |  131 ++++++++++++++++++++++++++++
 arch/arm64/lib/strnlen.S        |  179 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 320 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/lib/strlen.S
 create mode 100644 arch/arm64/lib/strnlen.S

diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h
index 6133f49..64d2d48 100644
--- a/arch/arm64/include/asm/string.h
+++ b/arch/arm64/include/asm/string.h
@@ -28,6 +28,12 @@ extern int strcmp(const char *, const char *);
 #define __HAVE_ARCH_STRNCMP
 extern int strncmp(const char *, const char *, __kernel_size_t);
 
+#define __HAVE_ARCH_STRLEN
+extern __kernel_size_t strlen(const char *);
+
+#define __HAVE_ARCH_STRNLEN
+extern __kernel_size_t strnlen(const char *, __kernel_size_t);
+
 #define __HAVE_ARCH_MEMCPY
 extern void *memcpy(void *, const void *, __kernel_size_t);
 
diff --git a/arch/arm64/kernel/arm64ksyms.c b/arch/arm64/kernel/arm64ksyms.c
index d61fa39..c71b435 100644
--- a/arch/arm64/kernel/arm64ksyms.c
+++ b/arch/arm64/kernel/arm64ksyms.c
@@ -49,6 +49,8 @@ EXPORT_SYMBOL(strchr);
 EXPORT_SYMBOL(strrchr);
 EXPORT_SYMBOL(strcmp);
 EXPORT_SYMBOL(strncmp);
+EXPORT_SYMBOL(strlen);
+EXPORT_SYMBOL(strnlen);
 EXPORT_SYMBOL(memset);
 EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(memmove);
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 5ded21f..e64a94c 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -3,4 +3,5 @@ lib-y		:= bitops.o delay.o					\
 		   copy_from_user.o copy_to_user.o copy_in_user.o	\
 		   copy_page.o clear_page.o				\
 		   memchr.o memcpy.o memmove.o memset.o memcmp.o	\
-		   strchr.o strrchr.o strcmp.o strncmp.o
+		   strchr.o strrchr.o strcmp.o strncmp.o		\
+		   strlen.o strnlen.o
diff --git a/arch/arm64/lib/strlen.S b/arch/arm64/lib/strlen.S
new file mode 100644
index 0000000..302f6dd
--- /dev/null
+++ b/arch/arm64/lib/strlen.S
@@ -0,0 +1,131 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/*
+ * calculate the length of a string
+ *
+ * Parameters:
+ *	x0 - const string pointer
+ * Returns:
+ *	x0 - the length of the string
+ */
+
+/* Arguments and results.  */
+#define srcin		x0
+#define len		x0
+
+/* Locals and temporaries.  */
+#define src		x1
+#define data1		x2
+#define data2		x3
+#define data2a		x4
+#define has_nul1	x5
+#define has_nul2	x6
+#define tmp1		x7
+#define tmp2		x8
+#define tmp3		x9
+#define tmp4		x10
+#define zeroones	x11
+#define pos		x12
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+ENTRY(strlen)
+	mov	zeroones, #REP8_01
+	bic	src, srcin, #15
+	ands	tmp1, srcin, #15
+	b.ne	.Lmisaligned
+	/*
+	* NUL detection works on the principle that (X - 1) & (~X) & 0x80
+	* (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+	* can be done in parallel across the entire word.
+	*/
+	/*
+	* The inner loop deals with two Dwords at a time. This has a
+	* slightly higher start-up cost, but we should win quite quickly,
+	* especially on cores with a high number of issue slots per
+	* cycle, as we get much better parallelism out of the operations.
+	*/
+.Lloop:
+	ldp	data1, data2, [src], #16
+.Lrealigned:
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	sub	tmp3, data2, zeroones
+	orr	tmp4, data2, #REP8_7f
+	bic	has_nul1, tmp1, tmp2
+	bics	has_nul2, tmp3, tmp4
+	ccmp	has_nul1, #0, #0, eq	/* NZCV = 0000  */
+	b.eq	.Lloop
+
+	sub	len, src, srcin
+	cbz	has_nul1, .Lnul_in_data2
+#ifdef __ARM64EB__
+	mov	data2, data1
+#endif
+	sub	len, len, #8
+	mov	has_nul2, has_nul1
+.Lnul_in_data2:
+#ifdef __ARM64EB__
+	/*
+	* For big-endian, carry propagation (if the final byte in the
+	* string is 0x01) means we cannot use has_nul directly.  The
+	* easiest way to get the correct byte is to byte-swap the data
+	* and calculate the syndrome a second time.
+	*/
+	rev	data2, data2
+	sub	tmp1, data2, zeroones
+	orr	tmp2, data2, #REP8_7f
+	bic	has_nul2, tmp1, tmp2
+#endif
+	sub	len, len, #8
+	rev	has_nul2, has_nul2
+	clz	pos, has_nul2
+	add	len, len, pos, lsr #3		/* Bits to bytes.  */
+	ret
+
+.Lmisaligned:
+	cmp	tmp1, #8
+	neg	tmp1, tmp1
+	ldp	data1, data2, [src], #16
+	lsl	tmp1, tmp1, #3		/* Bytes beyond alignment -> bits.  */
+	mov	tmp2, #~0
+#ifdef __ARM64EB__
+	/* Big-endian.  Early bytes are at MSB.  */
+	lsl	tmp2, tmp2, tmp1	/* Shift (tmp1 & 63).  */
+#else
+	/* Little-endian.  Early bytes are at LSB.  */
+	lsr	tmp2, tmp2, tmp1	/* Shift (tmp1 & 63).  */
+#endif
+	orr	data1, data1, tmp2
+	orr	data2a, data2, tmp2
+	csinv	data1, data1, xzr, le
+	csel	data2, data2, data2a, le
+	b	.Lrealigned
+ENDPROC(strlen)
diff --git a/arch/arm64/lib/strnlen.S b/arch/arm64/lib/strnlen.S
new file mode 100644
index 0000000..2293f7b
--- /dev/null
+++ b/arch/arm64/lib/strnlen.S
@@ -0,0 +1,179 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/*
+ * determine the length of a fixed-size string
+ *
+ * Parameters:
+ *	x0 - const string pointer
+ *	x1 - maximal string length
+ * Returns:
+ *	x0 - the length of the string
+ */
+
+/* Arguments and results.  */
+#define srcin		x0
+#define len		x0
+#define limit		x1
+
+/* Locals and temporaries.  */
+#define src		x2
+#define data1		x3
+#define data2		x4
+#define data2a		x5
+#define has_nul1	x6
+#define has_nul2	x7
+#define tmp1		x8
+#define tmp2		x9
+#define tmp3		x10
+#define tmp4		x11
+#define zeroones	x12
+#define pos		x13
+#define limit_wd	x14
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+ENTRY(strnlen)
+	cbz	limit, .Lhit_limit
+	mov	zeroones, #REP8_01
+	bic	src, srcin, #15
+	ands	tmp1, srcin, #15
+	b.ne	.Lmisaligned
+	/* Calculate the number of full and partial words -1.  */
+	sub	limit_wd, limit, #1 /* Limit != 0, so no underflow.  */
+	lsr	limit_wd, limit_wd, #4  /* Convert to Qwords.  */
+
+	/*
+	* NUL detection works on the principle that (X - 1) & (~X) & 0x80
+	* (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+	* can be done in parallel across the entire word.
+	*/
+	/*
+	* The inner loop deals with two Dwords at a time.  This has a
+	* slightly higher start-up cost, but we should win quite quickly,
+	* especially on cores with a high number of issue slots per
+	* cycle, as we get much better parallelism out of the operations.
+	*/
+.Lloop:
+	ldp	data1, data2, [src], #16
+.Lrealigned:
+	sub	tmp1, data1, zeroones
+	orr	tmp2, data1, #REP8_7f
+	sub	tmp3, data2, zeroones
+	orr	tmp4, data2, #REP8_7f
+	bic	has_nul1, tmp1, tmp2
+	bic	has_nul2, tmp3, tmp4
+	subs	limit_wd, limit_wd, #1
+	orr	tmp1, has_nul1, has_nul2
+	ccmp	tmp1, #0, #0, pl    /* NZCV = 0000  */
+	b.eq	.Lloop
+
+	cbz	tmp1, .Lhit_limit   /* No null in final Qword.  */
+
+	/*
+	* We know there's a null in the final Qword. The easiest thing
+	* to do now is work out the length of the string and return
+	* MIN (len, limit).
+	*/
+	sub	len, src, srcin
+	cbz	has_nul1, .Lnul_in_data2
+#ifdef __ARM64EB__
+	mov	data2, data1
+#endif
+	sub	len, len, #8
+	mov	has_nul2, has_nul1
+.Lnul_in_data2:
+#ifdef __ARM64EB__
+	/*
+	* For big-endian, carry propagation (if the final byte in the
+	* string is 0x01) means we cannot use has_nul directly.  The
+	* easiest way to get the correct byte is to byte-swap the data
+	* and calculate the syndrome a second time.
+	*/
+	rev	data2, data2
+	sub	tmp1, data2, zeroones
+	orr	tmp2, data2, #REP8_7f
+	bic	has_nul2, tmp1, tmp2
+#endif
+	sub	len, len, #8
+	rev	has_nul2, has_nul2
+	clz	pos, has_nul2
+	add	len, len, pos, lsr #3       /* Bits to bytes.  */
+	cmp	len, limit
+	csel	len, len, limit, ls     /* Return the lower value.  */
+	ret
+
+.Lmisaligned:
+	/*
+	* Deal with a partial first word.
+	* We're doing two things in parallel here:
+	* 1) Calculate the number of words (but avoiding overflow if
+	* limit is near ULONG_MAX) - to do this we need to work out
+	* limit + tmp1 - 1 as a 65-bit value before shifting it;
+	* 2) Load and mask the initial data words - we force the bytes
+	* before the ones we are interested in to 0xff - this ensures
+	* early bytes will not hit any zero detection.
+	*/
+	/*
+	* The load below takes several cycles; the independent
+	* instructions that follow it can overlap with its latency.
+	*/
+	ldp	data1, data2, [src], #16
+
+	sub	limit_wd, limit, #1
+	and	tmp3, limit_wd, #15
+	lsr	limit_wd, limit_wd, #4
+
+	add	tmp3, tmp3, tmp1
+	add	limit_wd, limit_wd, tmp3, lsr #4
+
+	neg	tmp4, tmp1
+	lsl	tmp4, tmp4, #3  /* Bytes beyond alignment -> bits.  */
+
+	mov	tmp2, #~0
+#ifdef __ARM64EB__
+	/* Big-endian.  Early bytes are at MSB.  */
+	lsl	tmp2, tmp2, tmp4	/* Shift (tmp4 & 63).  */
+#else
+	/* Little-endian.  Early bytes are at LSB.  */
+	lsr	tmp2, tmp2, tmp4	/* Shift (tmp4 & 63).  */
+#endif
+	cmp	tmp1, #8
+
+	orr	data1, data1, tmp2
+	orr	data2a, data2, tmp2
+
+	csinv	data1, data1, xzr, le
+	csel	data2, data2, data2a, le
+	b	.Lrealigned
+
+.Lhit_limit:
+	mov	len, limit
+	ret
+ENDPROC(strnlen)
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 1/6] arm64: lib: Implement optimized memcpy routine
  2013-12-11  6:24 ` [PATCH 1/6] arm64: lib: Implement optimized memcpy routine zhichang.yuan at linaro.org
@ 2013-12-16 16:08   ` Will Deacon
  2013-12-18  1:54     ` zhichang.yuan
  0 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2013-12-16 16:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Dec 11, 2013 at 06:24:37AM +0000, zhichang.yuan at linaro.org wrote:
> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
> 
> This patch, based on Linaro's Cortex Strings library, improves
> the performance of the assembly optimized memcpy() function.
> 
> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
> ---
>  arch/arm64/lib/memcpy.S |  182 +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 160 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S
> index 27b5003..e3bab71 100644
> --- a/arch/arm64/lib/memcpy.S
> +++ b/arch/arm64/lib/memcpy.S
> @@ -1,5 +1,13 @@
>  /*
>   * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
> @@ -27,27 +35,157 @@
>   * Returns:
>   *	x0 - dest
>   */
> +#define dstin	x0
> +#define src	x1
> +#define count	x2
> +#define tmp1	x3
> +#define tmp1w	w3
> +#define tmp2	x4
> +#define tmp2w	w4
> +#define tmp3	x5
> +#define tmp3w	w5
> +#define dst	x6
> +
> +#define A_l	x7
> +#define A_h	x8
> +#define B_l	x9
> +#define B_h	x10
> +#define C_l	x11
> +#define C_h	x12
> +#define D_l	x13
> +#define D_h	x14

Use .req instead of #define?
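
For instance, a minimal sketch of the .req form (reusing the names from the
patch; the .unreq lines are only shown for completeness):

	dstin	.req	x0
	src	.req	x1
	count	.req	x2
	dst	.req	x6

	/* ... body of the routine, still using the symbolic names ... */

	.unreq	dstin
	.unreq	src
	.unreq	count
	.unreq	dst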

>  ENTRY(memcpy)
> -	mov	x4, x0
> -	subs	x2, x2, #8
> -	b.mi	2f
> -1:	ldr	x3, [x1], #8
> -	subs	x2, x2, #8
> -	str	x3, [x4], #8
> -	b.pl	1b
> -2:	adds	x2, x2, #4
> -	b.mi	3f
> -	ldr	w3, [x1], #4
> -	sub	x2, x2, #4
> -	str	w3, [x4], #4
> -3:	adds	x2, x2, #2
> -	b.mi	4f
> -	ldrh	w3, [x1], #2
> -	sub	x2, x2, #2
> -	strh	w3, [x4], #2
> -4:	adds	x2, x2, #1
> -	b.mi	5f
> -	ldrb	w3, [x1]
> -	strb	w3, [x4]
> -5:	ret
> +	mov	dst, dstin
> +	cmp	count, #16
> +	/*If memory length is less than 16, stp or ldp can not be used.*/
> +	b.lo	.Ltiny15
> +.Lover16:
> +	neg	tmp2, src
> +	ands	tmp2, tmp2, #15/* Bytes to reach alignment. */
> +	b.eq	.LSrcAligned
> +	sub	count, count, tmp2
> +	/*
> +	* Use ldp and sdp to copy 16 bytes,then backward the src to
> +	* aligned address.This way is more efficient.
> +	* But the risk overwriting the source area exists.Here,prefer to
> +	* access memory forward straight,no backward.It will need a bit
> +	* more instructions, but on the same time,the accesses are aligned.
> +	*/

This comment reads very badly:

  - sdp doesn't exist
  - `more efficient' than what? How is this measured?
  - `access memory forward straight,no backward' What?

Please re-write it in a clearer fashion, so that reviewers have some
understanding of your optimisations when potentially trying to change the
code later on.

> +	tbz	tmp2, #0, 1f
> +	ldrb	tmp1w, [src], #1
> +	strb	tmp1w, [dst], #1
> +1:
> +	tbz	tmp2, #1, 1f
> +	ldrh	tmp1w, [src], #2
> +	strh	tmp1w, [dst], #2
> +1:
> +	tbz	tmp2, #2, 1f
> +	ldr	tmp1w, [src], #4
> +	str	tmp1w, [dst], #4
> +1:

Three labels called '1:' ? Can you make them unique please (the old code
just incremented a counter).
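
For example (the label numbers below are only illustrative, following the
incrementing style of the old code):

	tbz	tmp2, #0, 1f
	ldrb	tmp1w, [src], #1
	strb	tmp1w, [dst], #1
1:
	tbz	tmp2, #1, 2f
	ldrh	tmp1w, [src], #2
	strh	tmp1w, [dst], #2
2:
	tbz	tmp2, #2, 3f
	ldr	tmp1w, [src], #4
	str	tmp1w, [dst], #4
3: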

> +	tbz	tmp2, #3, .LSrcAligned
> +	ldr	tmp1, [src],#8
> +	str	tmp1, [dst],#8
> +
> +.LSrcAligned:
> +	cmp	count, #64
> +	b.ge	.Lcpy_over64
> +	/*
> +	* Deal with small copies quickly by dropping straight into the
> +	* exit block.
> +	*/
> +.Ltail63:
> +	/*
> +	* Copy up to 48 bytes of data. At this point we only need the
> +	* bottom 6 bits of count to be accurate.
> +	*/
> +	ands	tmp1, count, #0x30
> +	b.eq	.Ltiny15
> +	cmp	tmp1w, #0x20
> +	b.eq	1f
> +	b.lt	2f
> +	ldp	A_l, A_h, [src], #16
> +	stp	A_l, A_h, [dst], #16
> +1:
> +	ldp	A_l, A_h, [src], #16
> +	stp	A_l, A_h, [dst], #16
> +2:
> +	ldp	A_l, A_h, [src], #16
> +	stp	A_l, A_h, [dst], #16
> +.Ltiny15:
> +	/*
> +	* To make memmove simpler, here don't make src backwards.
> +	* since backwards will probably overwrite the src area where src
> +	* data for nex copy located,if dst is not so far from src.
> +	*/

Another awful comment...

> +	tbz	count, #3, 1f
> +	ldr	tmp1, [src], #8
> +	str	tmp1, [dst], #8
> +1:
> +	tbz	count, #2, 1f
> +	ldr	tmp1w, [src], #4
> +	str	tmp1w, [dst], #4
> +1:
> +	tbz	count, #1, 1f
> +	ldrh	tmp1w, [src], #2
> +	strh	tmp1w, [dst], #2
> +1:

... and more of these labels.

> +	tbz	count, #0, .Lexitfunc
> +	ldrb	tmp1w, [src]
> +	strb	tmp1w, [dst]
> +
> +.Lexitfunc:
> +	ret
> +
> +.Lcpy_over64:
> +	subs	count, count, #128
> +	b.ge	.Lcpy_body_large
> +	/*
> +	* Less than 128 bytes to copy, so handle 64 here and then jump
> +	* to the tail.
> +	*/
> +	ldp	A_l, A_h, [src],#16
> +	stp	A_l, A_h, [dst],#16
> +	ldp	B_l, B_h, [src],#16
> +	ldp	C_l, C_h, [src],#16
> +	stp	B_l, B_h, [dst],#16
> +	stp	C_l, C_h, [dst],#16
> +	ldp	D_l, D_h, [src],#16
> +	stp	D_l, D_h, [dst],#16
> +
> +	tst	count, #0x3f
> +	b.ne	.Ltail63
> +	ret
> +
> +	/*
> +	* Critical loop.  Start at a new cache line boundary.  Assuming
> +	* 64 bytes per line this ensures the entire loop is in one line.
> +	*/
> +	.p2align	6

Can you parameterise this with L1_CACHE_SHIFT instead?
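
i.e. something along these lines, assuming L1_CACHE_SHIFT can be pulled in
on the assembly side via <asm/cache.h>:

	#include <asm/cache.h>

	/*
	 * Critical loop. Start at a new cache line boundary so the
	 * whole loop sits within a single line.
	 */
	.p2align	L1_CACHE_SHIFT
.Lcpy_body_large: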

Will

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 3/6] arm64: lib: Implement optimized memset routine
  2013-12-11  6:24 ` [PATCH 3/6] arm64: lib: Implement optimized memset routine zhichang.yuan at linaro.org
@ 2013-12-16 16:55   ` Will Deacon
  2013-12-18  2:37     ` zhichang.yuan
  0 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2013-12-16 16:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Dec 11, 2013 at 06:24:39AM +0000, zhichang.yuan at linaro.org wrote:
> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
> 
> This patch, based on Linaro's Cortex Strings library, improves
> the performance of the assembly optimized memset() function.
> 
> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
> ---
>  arch/arm64/lib/memset.S |  227 +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 201 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/arm64/lib/memset.S b/arch/arm64/lib/memset.S
> index 87e4a68..90b973e 100644
> --- a/arch/arm64/lib/memset.S
> +++ b/arch/arm64/lib/memset.S
> @@ -1,13 +1,21 @@
>  /*
>   * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
>   * published by the Free Software Foundation.
>   *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> + * kind, whether express or implied; without even the implied warranty
> + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

Why are you changing this?

>   * GNU General Public License for more details.
>   *
>   * You should have received a copy of the GNU General Public License
> @@ -18,7 +26,7 @@
>  #include <asm/assembler.h>
>  
>  /*
> - * Fill in the buffer with character c (alignment handled by the hardware)
> + * Fill in the buffer with character c
>   *
>   * Parameters:
>   *	x0 - buf
> @@ -27,27 +35,194 @@
>   * Returns:
>   *	x0 - buf
>   */
> +
> +/* By default we assume that the DC instruction can be used to zero
> +*  data blocks more efficiently.  In some circumstances this might be
> +*  unsafe, for example in an asymmetric multiprocessor environment with
> +*  different DC clear lengths (neither the upper nor lower lengths are
> +*  safe to use).  The feature can be disabled by defining DONT_USE_DC.
> +*/

We already use DC ZVA for clear_page, so I think we should start off using
it unconditionally. If we need to revisit this later, we can, but adding a
random #ifdef doesn't feel like something we need initially.

For the benefit of anybody else reviewing this; the DC ZVA instruction still
works for normal, non-cacheable memory.

The comments I made on the earlier patch wrt quality of comments and labels
seem to apply to all of the patches in this series.

Will

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 4/6] arm64: lib: Implement optimized memcmp routine
  2013-12-11  6:24 ` [PATCH 4/6] arm64: lib: Implement optimized memcmp routine zhichang.yuan at linaro.org
@ 2013-12-16 16:56   ` Will Deacon
  2013-12-19  8:18     ` Deepak Saxena
  0 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2013-12-16 16:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Dec 11, 2013 at 06:24:40AM +0000, zhichang.yuan at linaro.org wrote:
> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
> 
> This patch, based on Linaro's Cortex Strings library, adds
> an assembly optimized memcmp() function.

[...]

> +#ifndef	__ARM64EB__
> +	rev	diff, diff
> +	rev	data1, data1
> +	rev	data2, data2
> +#endif

Given that I can't see you defining __ARM64EB__, I'd take a guess at you
never having tested this code on a big-endian machine.

Please can you fix that?

Will

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/6] arm64: lib: Implement optimized memcpy routine
  2013-12-16 16:08   ` Will Deacon
@ 2013-12-18  1:54     ` zhichang.yuan
  0 siblings, 0 replies; 14+ messages in thread
From: zhichang.yuan @ 2013-12-18  1:54 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will

Thanks for your reply.
Your comments all make sense; I will rework the patches to address those points.
Once the fixes are ready, I will post v2 of the series.

Thanks again!
Zhichang




On 2013-12-17 00:08, Will Deacon wrote:
> On Wed, Dec 11, 2013 at 06:24:37AM +0000, zhichang.yuan at linaro.org wrote:
>> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
>>
>> This patch, based on Linaro's Cortex Strings library, improves
>> the performance of the assembly optimized memcpy() function.
>>
>> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>> ---
>>  arch/arm64/lib/memcpy.S |  182 +++++++++++++++++++++++++++++++++++++++++------
>>  1 file changed, 160 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S
>> index 27b5003..e3bab71 100644
>> --- a/arch/arm64/lib/memcpy.S
>> +++ b/arch/arm64/lib/memcpy.S
>> @@ -1,5 +1,13 @@
>>  /*
>>   * Copyright (C) 2013 ARM Ltd.
>> + * Copyright (C) 2013 Linaro.
>> + *
>> + * This code is based on glibc cortex strings work originally authored by Linaro
>> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
>> + * be found @
>> + *
>> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
>> + * files/head:/src/aarch64/
>>   *
>>   * This program is free software; you can redistribute it and/or modify
>>   * it under the terms of the GNU General Public License version 2 as
>> @@ -27,27 +35,157 @@
>>   * Returns:
>>   *	x0 - dest
>>   */
>> +#define dstin	x0
>> +#define src	x1
>> +#define count	x2
>> +#define tmp1	x3
>> +#define tmp1w	w3
>> +#define tmp2	x4
>> +#define tmp2w	w4
>> +#define tmp3	x5
>> +#define tmp3w	w5
>> +#define dst	x6
>> +
>> +#define A_l	x7
>> +#define A_h	x8
>> +#define B_l	x9
>> +#define B_h	x10
>> +#define C_l	x11
>> +#define C_h	x12
>> +#define D_l	x13
>> +#define D_h	x14
> 
> Use .req instead of #define?
> 
>>  ENTRY(memcpy)
>> -	mov	x4, x0
>> -	subs	x2, x2, #8
>> -	b.mi	2f
>> -1:	ldr	x3, [x1], #8
>> -	subs	x2, x2, #8
>> -	str	x3, [x4], #8
>> -	b.pl	1b
>> -2:	adds	x2, x2, #4
>> -	b.mi	3f
>> -	ldr	w3, [x1], #4
>> -	sub	x2, x2, #4
>> -	str	w3, [x4], #4
>> -3:	adds	x2, x2, #2
>> -	b.mi	4f
>> -	ldrh	w3, [x1], #2
>> -	sub	x2, x2, #2
>> -	strh	w3, [x4], #2
>> -4:	adds	x2, x2, #1
>> -	b.mi	5f
>> -	ldrb	w3, [x1]
>> -	strb	w3, [x4]
>> -5:	ret
>> +	mov	dst, dstin
>> +	cmp	count, #16
>> +	/*If memory length is less than 16, stp or ldp can not be used.*/
>> +	b.lo	.Ltiny15
>> +.Lover16:
>> +	neg	tmp2, src
>> +	ands	tmp2, tmp2, #15/* Bytes to reach alignment. */
>> +	b.eq	.LSrcAligned
>> +	sub	count, count, tmp2
>> +	/*
>> +	* Use ldp and sdp to copy 16 bytes,then backward the src to
>> +	* aligned address.This way is more efficient.
>> +	* But the risk overwriting the source area exists.Here,prefer to
>> +	* access memory forward straight,no backward.It will need a bit
>> +	* more instructions, but on the same time,the accesses are aligned.
>> +	*/
> 
> This comment reads very badly:
> 
>   - sdp doesn't exist
>   - `more efficient' than what? How is this measured?
>   - `access memory forward straight,no backward' What?
> 
> Please re-write it in a clearer fashion, so that reviewers have some
> understanding of your optimisations when potentially trying to change the
> code later on.
> 
>> +	tbz	tmp2, #0, 1f
>> +	ldrb	tmp1w, [src], #1
>> +	strb	tmp1w, [dst], #1
>> +1:
>> +	tbz	tmp2, #1, 1f
>> +	ldrh	tmp1w, [src], #2
>> +	strh	tmp1w, [dst], #2
>> +1:
>> +	tbz	tmp2, #2, 1f
>> +	ldr	tmp1w, [src], #4
>> +	str	tmp1w, [dst], #4
>> +1:
> 
> Three labels called '1:' ? Can you make them unique please (the old code
> just incremented a counter).
> 
>> +	tbz	tmp2, #3, .LSrcAligned
>> +	ldr	tmp1, [src],#8
>> +	str	tmp1, [dst],#8
>> +
>> +.LSrcAligned:
>> +	cmp	count, #64
>> +	b.ge	.Lcpy_over64
>> +	/*
>> +	* Deal with small copies quickly by dropping straight into the
>> +	* exit block.
>> +	*/
>> +.Ltail63:
>> +	/*
>> +	* Copy up to 48 bytes of data. At this point we only need the
>> +	* bottom 6 bits of count to be accurate.
>> +	*/
>> +	ands	tmp1, count, #0x30
>> +	b.eq	.Ltiny15
>> +	cmp	tmp1w, #0x20
>> +	b.eq	1f
>> +	b.lt	2f
>> +	ldp	A_l, A_h, [src], #16
>> +	stp	A_l, A_h, [dst], #16
>> +1:
>> +	ldp	A_l, A_h, [src], #16
>> +	stp	A_l, A_h, [dst], #16
>> +2:
>> +	ldp	A_l, A_h, [src], #16
>> +	stp	A_l, A_h, [dst], #16
>> +.Ltiny15:
>> +	/*
>> +	* To make memmove simpler, here don't make src backwards.
>> +	* since backwards will probably overwrite the src area where src
>> +	* data for nex copy located,if dst is not so far from src.
>> +	*/
> 
> Another awful comment...
> 
>> +	tbz	count, #3, 1f
>> +	ldr	tmp1, [src], #8
>> +	str	tmp1, [dst], #8
>> +1:
>> +	tbz	count, #2, 1f
>> +	ldr	tmp1w, [src], #4
>> +	str	tmp1w, [dst], #4
>> +1:
>> +	tbz	count, #1, 1f
>> +	ldrh	tmp1w, [src], #2
>> +	strh	tmp1w, [dst], #2
>> +1:
> 
> ... and more of these labels.
> 
>> +	tbz	count, #0, .Lexitfunc
>> +	ldrb	tmp1w, [src]
>> +	strb	tmp1w, [dst]
>> +
>> +.Lexitfunc:
>> +	ret
>> +
>> +.Lcpy_over64:
>> +	subs	count, count, #128
>> +	b.ge	.Lcpy_body_large
>> +	/*
>> +	* Less than 128 bytes to copy, so handle 64 here and then jump
>> +	* to the tail.
>> +	*/
>> +	ldp	A_l, A_h, [src],#16
>> +	stp	A_l, A_h, [dst],#16
>> +	ldp	B_l, B_h, [src],#16
>> +	ldp	C_l, C_h, [src],#16
>> +	stp	B_l, B_h, [dst],#16
>> +	stp	C_l, C_h, [dst],#16
>> +	ldp	D_l, D_h, [src],#16
>> +	stp	D_l, D_h, [dst],#16
>> +
>> +	tst	count, #0x3f
>> +	b.ne	.Ltail63
>> +	ret
>> +
>> +	/*
>> +	* Critical loop.  Start at a new cache line boundary.  Assuming
>> +	* 64 bytes per line this ensures the entire loop is in one line.
>> +	*/
>> +	.p2align	6
> 
> Can you parameterise this with L1_CACHE_SHIFT instead?
> 
> Will
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 3/6] arm64: lib: Implement optimized memset routine
  2013-12-16 16:55   ` Will Deacon
@ 2013-12-18  2:37     ` zhichang.yuan
  0 siblings, 0 replies; 14+ messages in thread
From: zhichang.yuan @ 2013-12-18  2:37 UTC (permalink / raw)
  To: linux-arm-kernel



On 2013-12-17 00:55, Will Deacon wrote:
> On Wed, Dec 11, 2013 at 06:24:39AM +0000, zhichang.yuan at linaro.org wrote:
>> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
>>
>> This patch, based on Linaro's Cortex Strings library, improves
>> the performance of the assembly optimized memset() function.
>>
>> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>> ---
>>  arch/arm64/lib/memset.S |  227 +++++++++++++++++++++++++++++++++++++++++------
>>  1 file changed, 201 insertions(+), 26 deletions(-)
>>
>> diff --git a/arch/arm64/lib/memset.S b/arch/arm64/lib/memset.S
>> index 87e4a68..90b973e 100644
>> --- a/arch/arm64/lib/memset.S
>> +++ b/arch/arm64/lib/memset.S
>> @@ -1,13 +1,21 @@
>>  /*
>>   * Copyright (C) 2013 ARM Ltd.
>> + * Copyright (C) 2013 Linaro.
>> + *
>> + * This code is based on glibc cortex strings work originally authored by Linaro
>> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
>> + * be found @
>> + *
>> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
>> + * files/head:/src/aarch64/
>>   *
>>   * This program is free software; you can redistribute it and/or modify
>>   * it under the terms of the GNU General Public License version 2 as
>>   * published by the Free Software Foundation.
>>   *
>> - * This program is distributed in the hope that it will be useful,
>> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
>> + * kind, whether express or implied; without even the implied warranty
>> + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> 
> Why are you changing this?
> 
>>   * GNU General Public License for more details.
>>   *
>>   * You should have received a copy of the GNU General Public License
>> @@ -18,7 +26,7 @@
>>  #include <asm/assembler.h>
>>  
>>  /*
>> - * Fill in the buffer with character c (alignment handled by the hardware)
>> + * Fill in the buffer with character c
>>   *
>>   * Parameters:
>>   *	x0 - buf
>> @@ -27,27 +35,194 @@
>>   * Returns:
>>   *	x0 - buf
>>   */
>> +
>> +/* By default we assume that the DC instruction can be used to zero
>> +*  data blocks more efficiently.  In some circumstances this might be
>> +*  unsafe, for example in an asymmetric multiprocessor environment with
>> +*  different DC clear lengths (neither the upper nor lower lengths are
>> +*  safe to use).  The feature can be disabled by defining DONT_USE_DC.
>> +*/
This comment is not quite right: even for an AMP system I do not think DONT_USE_DC
is necessary, since this memset reads dczid_el0 on every call and therefore always
gets the correct, current value from the system register.
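
For reference, the check at the start of the new memset does roughly the
following (the register numbers here are only illustrative, not the exact
ones used in the patch):

	mrs	x3, dczid_el0
	tbnz	x3, #4, 9f		/* DZP set: DC ZVA prohibited, use plain stores. */
	and	w4, w3, #0xf		/* BS field: log2(block size in words). */
	mov	w5, #4
	lsl	w5, w5, w4		/* ZVA block size in bytes (4 << BS). */
9:
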
> 
> We already use DC ZVA for clear_page, so I think we should start off using
> it unconditionally. If we need to revisit this later, we can, but adding a
> random #ifdef doesn't feel like something we need initially.

As for the DONT_USE_DC macro, I was not sure whether the default configuration of
some systems forbids DC ZVA. In that case the attempt to use DC ZVA simply ends and
the code falls back to the normal memory-setting path; this causes no error, it only
executes a few more instructions.
I initially thought DONT_USE_DC would be slightly more efficient, but even when the
system does not support DC ZVA, the cost of reading the dczid_el0 register is small,
so it is not worth introducing a kernel macro for this.
I will modify it.

Zhichang
> 
> For the benefit of anybody else reviewing this; the DC ZVA instruction still
> works for normal, non-cacheable memory.
> 
> The comments I made on the earlier patch wrt quality of comments and labels
> seem to apply to all of the patches in this series.
> 
> Will
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 4/6] arm64: lib: Implement optimized memcmp routine
  2013-12-16 16:56   ` Will Deacon
@ 2013-12-19  8:18     ` Deepak Saxena
  2013-12-19 16:14       ` Catalin Marinas
  0 siblings, 1 reply; 14+ messages in thread
From: Deepak Saxena @ 2013-12-19  8:18 UTC (permalink / raw)
  To: linux-arm-kernel

On 16 December 2013 08:56, Will Deacon <will.deacon@arm.com> wrote:
> On Wed, Dec 11, 2013 at 06:24:40AM +0000, zhichang.yuan at linaro.org wrote:
>> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
>>
>> This patch, based on Linaro's Cortex Strings library, adds
>> an assembly optimized memcmp() function.
>
> [...]
>
>> +#ifndef      __ARM64EB__
>> +     rev     diff, diff
>> +     rev     data1, data1
>> +     rev     data2, data2
>> +#endif
>
> Given that I can't see you defining __ARM64EB__, I'd take a guess at you
> never having tested this code on a big-endian machine.

We'll test it, but I believe __ARM64BE__ is a compiler constant that
is automagically present when building big endian, similar to
__ARMBE__

>
> Please can you fix that?
>
> Will



-- 
Deepak Saxena - Kernel Working Group Lead
Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 4/6] arm64: lib: Implement optimized memcmp routine
  2013-12-19  8:18     ` Deepak Saxena
@ 2013-12-19 16:14       ` Catalin Marinas
  0 siblings, 0 replies; 14+ messages in thread
From: Catalin Marinas @ 2013-12-19 16:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 19, 2013 at 08:18:59AM +0000, Deepak Saxena wrote:
> On 16 December 2013 08:56, Will Deacon <will.deacon@arm.com> wrote:
> > On Wed, Dec 11, 2013 at 06:24:40AM +0000, zhichang.yuan at linaro.org wrote:
> >> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
> >>
> >> This patch, based on Linaro's Cortex Strings library, adds
> >> an assembly optimized memcmp() function.
> >
> > [...]
> >
> >> +#ifndef      __ARM64EB__
> >> +     rev     diff, diff
> >> +     rev     data1, data1
> >> +     rev     data2, data2
> >> +#endif
> >
> > Given that I can't see you defining __ARM64EB__, I'd take a guess at you
> > never having tested this code on a big-endian machine.
> 
> We'll test it, but I believe __ARM64BE__ is a compiler constant that
> is automagically present when building big endian, similar to
> __ARMBE__

It's __AARCH64EB__ (that's how we spot people not testing on big-endian ;)).
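
i.e. the guard should read (sketch only, keeping the register names from the
patch):

	#ifndef __AARCH64EB__
	rev	diff, diff
	rev	data1, data1
	rev	data2, data2
	#endif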

-- 
Catalin

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2013-12-19 16:14 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-11  6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
2013-12-11  6:24 ` [PATCH 1/6] arm64: lib: Implement optimized memcpy routine zhichang.yuan at linaro.org
2013-12-16 16:08   ` Will Deacon
2013-12-18  1:54     ` zhichang.yuan
2013-12-11  6:24 ` [PATCH 2/6] arm64: lib: Implement optimized memmove routine zhichang.yuan at linaro.org
2013-12-11  6:24 ` [PATCH 3/6] arm64: lib: Implement optimized memset routine zhichang.yuan at linaro.org
2013-12-16 16:55   ` Will Deacon
2013-12-18  2:37     ` zhichang.yuan
2013-12-11  6:24 ` [PATCH 4/6] arm64: lib: Implement optimized memcmp routine zhichang.yuan at linaro.org
2013-12-16 16:56   ` Will Deacon
2013-12-19  8:18     ` Deepak Saxena
2013-12-19 16:14       ` Catalin Marinas
2013-12-11  6:24 ` [PATCH 5/6] arm64: lib: Implement optimized string compare routines zhichang.yuan at linaro.org
2013-12-11  6:24 ` [PATCH 6/6] arm64: lib: Implement optimized string length routines zhichang.yuan at linaro.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).