* [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors
@ 2013-12-11 6:24 zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 1/6] arm64: lib: Implement optimized memcpy routine zhichang.yuan at linaro.org
` (5 more replies)
0 siblings, 6 replies; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11 6:24 UTC (permalink / raw)
To: linux-arm-kernel
From: "zhichang.yuan" <zhichang.yuan@linaro.org>
In the current aarch64 kernel, only a few string library routines are
implemented in arm64/lib, such as memcpy, memset, memmove and strchr.
Most of the frequently used string routines are provided by the
architecture-independent string library, and those routines are not efficient.
This patch set focuses on improving the string routines' performance on ARMv8.
It contains eight optimized functions. The work is based on the cortex-strings
project in the Linaro toolchain.
The cortex-strings code can be found at:
https://code.launchpad.net/cortex-strings
To obtain better performance, several ideas were applied:
1) memory burst access
For long memory operations, the ARMv8 ldp/stp instruction pairs are used
to transfer data in bulk. Consecutive ldp/stp sequences are issued wherever
possible to trigger burst access.
2) parallel processing
The current string routines mostly process data byte by byte. This patch
set processes the data in parallel; strlen, for example, examines eight
string bytes at a time (see the C sketch after this list).
3) aligned memory access
The processing is split into several cases according to the alignment of
the input addresses. For unaligned addresses, the short leading chunk is
processed first to reach an aligned address, and the remainder is then
processed with aligned accesses.
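As an illustration of idea 2) (not part of the patches themselves), the
word-at-a-time zero-byte test that the string routines rely on can be
sketched in C as follows; the function names are made up for the example:

#include <stdint.h>
#include <string.h>

/*
 * Illustrative sketch only, not kernel code. A 64-bit word contains a zero
 * byte iff (x - 0x01..01) & ~x & 0x80..80 is non-zero, so a strlen-style
 * loop can test eight bytes per iteration.
 */
static int has_zero_byte(uint64_t x)
{
        return ((x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL) != 0;
}

static size_t strlen_sketch(const char *s)
{
        size_t n = 0;
        uint64_t w;

        /*
         * Assumes s is 8-byte aligned; the assembly handles the unaligned
         * head separately, and its aligned 8-byte loads cannot cross a page
         * boundary, which is what makes reading past the NUL safe there.
         */
        for (;;) {
                memcpy(&w, s + n, 8);   /* one 8-byte load */
                if (has_zero_byte(w))
                        break;
                n += 8;
        }
        while (s[n])                    /* pinpoint the NUL inside the word */
                n++;
        return n;
}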
After the optimization, these routines perform better than the current ones.
The test results are available at:
https://wiki.linaro.org/WorkingGroups/Kernel/ARMv8/cortex-strings
--
zhichang.yuan (6):
arm64: lib: Implement optimized memcpy routine
arm64: lib: Implement optimized memmove routine
arm64: lib: Implement optimized memset routine
arm64: lib: Implement optimized memcmp routine
arm64: lib: Implement optimized string compare routines
arm64: lib: Implement optimized string length routines
arch/arm64/include/asm/string.h | 15 ++
arch/arm64/kernel/arm64ksyms.c | 5 +
arch/arm64/lib/Makefile | 5 +-
arch/arm64/lib/memcmp.S | 258 +++++++++++++++++++++++++++++
arch/arm64/lib/memcpy.S | 182 ++++++++++++++++++---
arch/arm64/lib/memmove.S | 195 ++++++++++++++++++----
arch/arm64/lib/memset.S | 227 +++++++++++++++++++++++---
arch/arm64/lib/strcmp.S | 256 +++++++++++++++++++++++++++++
arch/arm64/lib/strlen.S | 131 +++++++++++++++
arch/arm64/lib/strncmp.S | 340 +++++++++++++++++++++++++++++++++++++++
arch/arm64/lib/strnlen.S | 179 +++++++++++++++++++++
11 files changed, 1714 insertions(+), 79 deletions(-)
create mode 100644 arch/arm64/lib/memcmp.S
create mode 100644 arch/arm64/lib/strcmp.S
create mode 100644 arch/arm64/lib/strlen.S
create mode 100644 arch/arm64/lib/strncmp.S
create mode 100644 arch/arm64/lib/strnlen.S
--
1.7.9.5
* [PATCH 1/6] arm64: lib: Implement optimized memcpy routine
2013-12-11 6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
@ 2013-12-11 6:24 ` zhichang.yuan at linaro.org
2013-12-16 16:08 ` Will Deacon
2013-12-11 6:24 ` [PATCH 2/6] arm64: lib: Implement optimized memmove routine zhichang.yuan at linaro.org
` (4 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11 6:24 UTC (permalink / raw)
To: linux-arm-kernel
From: "zhichang.yuan" <zhichang.yuan@linaro.org>
This patch, based on Linaro's Cortex Strings library, improves
the performance of the assembly optimized memcpy() function.
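For readers less familiar with AArch64 assembly, the overall strategy
(copy a short head so the source becomes 16-byte aligned, stream the bulk
in 64-byte blocks with ldp/stp pairs, then mop up the 0..63 byte tail)
corresponds roughly to the C sketch below. This is illustrative only, not
part of the patch; the name memcpy_sketch is made up for the example.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative outline only; the real work is done by the ldp/stp pairs
 * in the assembly below. */
void *memcpy_sketch(void *dest, const void *src, size_t n)
{
        unsigned char *d = dest;
        const unsigned char *s = src;

        if (n >= 16) {
                /* Copy up to 15 head bytes so the source becomes aligned. */
                size_t head = (16 - ((uintptr_t)s & 15)) & 15;

                n -= head;
                while (head--)
                        *d++ = *s++;
                /* Main loop: 64 bytes per step (four 16-byte pairs). */
                while (n >= 64) {
                        memcpy(d, s, 64);
                        d += 64;
                        s += 64;
                        n -= 64;
                }
        }
        /* Tail: the remaining 0..63 bytes. */
        while (n--)
                *d++ = *s++;
        return dest;
}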
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
arch/arm64/lib/memcpy.S | 182 +++++++++++++++++++++++++++++++++++++++++------
1 file changed, 160 insertions(+), 22 deletions(-)
diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S
index 27b5003..e3bab71 100644
--- a/arch/arm64/lib/memcpy.S
+++ b/arch/arm64/lib/memcpy.S
@@ -1,5 +1,13 @@
/*
* Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
@@ -27,27 +35,157 @@
* Returns:
* x0 - dest
*/
+#define dstin x0
+#define src x1
+#define count x2
+#define tmp1 x3
+#define tmp1w w3
+#define tmp2 x4
+#define tmp2w w4
+#define tmp3 x5
+#define tmp3w w5
+#define dst x6
+
+#define A_l x7
+#define A_h x8
+#define B_l x9
+#define B_h x10
+#define C_l x11
+#define C_h x12
+#define D_l x13
+#define D_h x14
+
ENTRY(memcpy)
- mov x4, x0
- subs x2, x2, #8
- b.mi 2f
-1: ldr x3, [x1], #8
- subs x2, x2, #8
- str x3, [x4], #8
- b.pl 1b
-2: adds x2, x2, #4
- b.mi 3f
- ldr w3, [x1], #4
- sub x2, x2, #4
- str w3, [x4], #4
-3: adds x2, x2, #2
- b.mi 4f
- ldrh w3, [x1], #2
- sub x2, x2, #2
- strh w3, [x4], #2
-4: adds x2, x2, #1
- b.mi 5f
- ldrb w3, [x1]
- strb w3, [x4]
-5: ret
+ mov dst, dstin
+ cmp count, #16
+ /*If memory length is less than 16, stp or ldp can not be used.*/
+ b.lo .Ltiny15
+.Lover16:
+ neg tmp2, src
+ ands tmp2, tmp2, #15/* Bytes to reach alignment. */
+ b.eq .LSrcAligned
+ sub count, count, tmp2
+ /*
+ * We could use ldp and stp to copy 16 bytes and then move src back
+ * to an aligned address; that would be more efficient, but it
+ * carries the risk of overwriting the source area. Here we prefer
+ * to access memory strictly forward, with no backward step: it
+ * needs a few more instructions, but at the same time the accesses
+ * are aligned.
+ */
+ tbz tmp2, #0, 1f
+ ldrb tmp1w, [src], #1
+ strb tmp1w, [dst], #1
+1:
+ tbz tmp2, #1, 1f
+ ldrh tmp1w, [src], #2
+ strh tmp1w, [dst], #2
+1:
+ tbz tmp2, #2, 1f
+ ldr tmp1w, [src], #4
+ str tmp1w, [dst], #4
+1:
+ tbz tmp2, #3, .LSrcAligned
+ ldr tmp1, [src],#8
+ str tmp1, [dst],#8
+
+.LSrcAligned:
+ cmp count, #64
+ b.ge .Lcpy_over64
+ /*
+ * Deal with small copies quickly by dropping straight into the
+ * exit block.
+ */
+.Ltail63:
+ /*
+ * Copy up to 48 bytes of data. At this point we only need the
+ * bottom 6 bits of count to be accurate.
+ */
+ ands tmp1, count, #0x30
+ b.eq .Ltiny15
+ cmp tmp1w, #0x20
+ b.eq 1f
+ b.lt 2f
+ ldp A_l, A_h, [src], #16
+ stp A_l, A_h, [dst], #16
+1:
+ ldp A_l, A_h, [src], #16
+ stp A_l, A_h, [dst], #16
+2:
+ ldp A_l, A_h, [src], #16
+ stp A_l, A_h, [dst], #16
+.Ltiny15:
+ /*
+ * To keep memmove simple, do not move src backwards here: going
+ * backwards could overwrite source data that the next copy still
+ * needs when dst is not far enough from src.
+ */
+ tbz count, #3, 1f
+ ldr tmp1, [src], #8
+ str tmp1, [dst], #8
+1:
+ tbz count, #2, 1f
+ ldr tmp1w, [src], #4
+ str tmp1w, [dst], #4
+1:
+ tbz count, #1, 1f
+ ldrh tmp1w, [src], #2
+ strh tmp1w, [dst], #2
+1:
+ tbz count, #0, .Lexitfunc
+ ldrb tmp1w, [src]
+ strb tmp1w, [dst]
+
+.Lexitfunc:
+ ret
+
+.Lcpy_over64:
+ subs count, count, #128
+ b.ge .Lcpy_body_large
+ /*
+ * Less than 128 bytes to copy, so handle 64 here and then jump
+ * to the tail.
+ */
+ ldp A_l, A_h, [src],#16
+ stp A_l, A_h, [dst],#16
+ ldp B_l, B_h, [src],#16
+ ldp C_l, C_h, [src],#16
+ stp B_l, B_h, [dst],#16
+ stp C_l, C_h, [dst],#16
+ ldp D_l, D_h, [src],#16
+ stp D_l, D_h, [dst],#16
+
+ tst count, #0x3f
+ b.ne .Ltail63
+ ret
+
+ /*
+ * Critical loop. Start at a new cache line boundary. Assuming
+ * 64 bytes per line this ensures the entire loop is in one line.
+ */
+ .p2align 6
+.Lcpy_body_large:
+ /* There are at least 128 bytes to copy. */
+ ldp A_l, A_h, [src],#16
+ ldp B_l, B_h, [src],#16
+ ldp C_l, C_h, [src],#16
+ ldp D_l, D_h, [src],#16
+1:
+ stp A_l, A_h, [dst],#16
+ ldp A_l, A_h, [src],#16
+ stp B_l, B_h, [dst],#16
+ ldp B_l, B_h, [src],#16
+ stp C_l, C_h, [dst],#16
+ ldp C_l, C_h, [src],#16
+ stp D_l, D_h, [dst],#16
+ ldp D_l, D_h, [src],#16
+ subs count, count, #64
+ b.ge 1b
+ stp A_l, A_h, [dst],#16
+ stp B_l, B_h, [dst],#16
+ stp C_l, C_h, [dst],#16
+ stp D_l, D_h, [dst],#16
+
+ tst count, #0x3f
+ b.ne .Ltail63
+ ret
ENDPROC(memcpy)
--
1.7.9.5
* [PATCH 2/6] arm64: lib: Implement optimized memmove routine
2013-12-11 6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 1/6] arm64: lib: Implement optimized memcpy routine zhichang.yuan at linaro.org
@ 2013-12-11 6:24 ` zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 3/6] arm64: lib: Implement optimized memset routine zhichang.yuan at linaro.org
` (3 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11 6:24 UTC (permalink / raw)
To: linux-arm-kernel
From: "zhichang.yuan" <zhichang.yuan@linaro.org>
This patch, based on Linaro's Cortex Strings library, improves
the performance of the assembly optimized memmove() function.
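For context, the dispatch at the top of the routine (forward memcpy when
the destination is below the source or the regions do not overlap,
otherwise a backward copy from the end) is roughly the following in C.
Illustrative only, not part of the patch; memmove_sketch is a made-up name.

#include <stddef.h>
#include <string.h>

/* Illustrative outline only. The pointer checks mirror the cmp/b.lo and
 * b.hs tests at the start of the assembly. */
void *memmove_sketch(void *dest, const void *src, size_t n)
{
        unsigned char *d = dest;
        const unsigned char *s = src;

        if (d < s || d >= s + n)
                return memcpy(dest, src, n);    /* no harmful overlap */

        while (n--)                             /* overlapping: copy backwards */
                d[n] = s[n];
        return dest;
}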
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
arch/arm64/lib/memmove.S | 195 +++++++++++++++++++++++++++++++++++++++-------
1 file changed, 166 insertions(+), 29 deletions(-)
diff --git a/arch/arm64/lib/memmove.S b/arch/arm64/lib/memmove.S
index b79fdfa..61ee8d9 100644
--- a/arch/arm64/lib/memmove.S
+++ b/arch/arm64/lib/memmove.S
@@ -1,13 +1,21 @@
/*
* Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
@@ -18,7 +26,8 @@
#include <asm/assembler.h>
/*
- * Move a buffer from src to test (alignment handled by the hardware).
+ * Move a buffer from src to dest
+ * (alignment handled by the hardware in some cases).
* If dest <= src, call memcpy, otherwise copy in reverse order.
*
* Parameters:
@@ -28,30 +37,158 @@
* Returns:
* x0 - dest
*/
+#define dstin x0
+#define src x1
+#define count x2
+#define tmp1 x3
+#define tmp1w w3
+#define tmp2 x4
+#define tmp2w w4
+#define tmp3 x5
+#define tmp3w w5
+#define dst x6
+
+#define A_l x7
+#define A_h x8
+#define B_l x9
+#define B_h x10
+#define C_l x11
+#define C_h x12
+#define D_l x13
+#define D_h x14
+
ENTRY(memmove)
- cmp x0, x1
- b.ls memcpy
- add x4, x0, x2
- add x1, x1, x2
- subs x2, x2, #8
- b.mi 2f
-1: ldr x3, [x1, #-8]!
- subs x2, x2, #8
- str x3, [x4, #-8]!
- b.pl 1b
-2: adds x2, x2, #4
- b.mi 3f
- ldr w3, [x1, #-4]!
- sub x2, x2, #4
- str w3, [x4, #-4]!
-3: adds x2, x2, #2
- b.mi 4f
- ldrh w3, [x1, #-2]!
- sub x2, x2, #2
- strh w3, [x4, #-2]!
-4: adds x2, x2, #1
- b.mi 5f
- ldrb w3, [x1, #-1]
- strb w3, [x4, #-1]
-5: ret
+ cmp dstin, src
+ /*b.eq .Lexitfunc*/
+ b.lo memcpy
+ add tmp1, src, count
+ cmp dstin, tmp1
+ b.hs memcpy /* No overlap. */
+
+ add dst, dstin, count
+ add src, src, count
+ cmp count, #16
+ b.lo .Ltail15
+
+.Lover16:
+ ands tmp2, src, #15 /* Bytes to reach alignment. */
+ b.eq .LSrcAligned
+ sub count, count, tmp2
+ /*
+ * Process the leading bytes first so that src becomes aligned. The
+ * cost of these extra instructions is acceptable, and it makes the
+ * subsequent accesses use aligned addresses.
+ */
+ tbz tmp2, #0, 1f
+ ldrb tmp1w, [src, #-1]!
+ strb tmp1w, [dst, #-1]!
+1:
+ tbz tmp2, #1, 1f
+ ldrh tmp1w, [src, #-2]!
+ strh tmp1w, [dst, #-2]!
+1:
+ tbz tmp2, #2, 1f
+ ldr tmp1w, [src, #-4]!
+ str tmp1w, [dst, #-4]!
+1:
+ tbz tmp2, #3, .LSrcAligned
+ ldr tmp1, [src, #-8]!
+ str tmp1, [dst, #-8]!
+
+.LSrcAligned:
+ cmp count, #64
+ b.ge .Lcpy_over64
+
+ /*
+ * Deal with small copies quickly by dropping straight into the
+ * exit block.
+ */
+.Ltail63:
+ /*
+ * Copy up to 48 bytes of data. At this point we only need the
+ * bottom 6 bits of count to be accurate.
+ */
+ ands tmp1, count, #0x30
+ b.eq .Ltail15
+ cmp tmp1w, #0x20
+ b.eq 1f
+ b.lt 2f
+ ldp A_l, A_h, [src, #-16]!
+ stp A_l, A_h, [dst, #-16]!
+1:
+ ldp A_l, A_h, [src, #-16]!
+ stp A_l, A_h, [dst, #-16]!
+2:
+ ldp A_l, A_h, [src, #-16]!
+ stp A_l, A_h, [dst, #-16]!
+
+.Ltail15:
+ tbz count, #3, 1f
+ ldr tmp1, [src, #-8]!
+ str tmp1, [dst, #-8]!
+1:
+ tbz count, #2, 1f
+ ldr tmp1w, [src, #-4]!
+ str tmp1w, [dst, #-4]!
+1:
+ tbz count, #1, 1f
+ ldrh tmp1w, [src, #-2]!
+ strh tmp1w, [dst, #-2]!
+1:
+ tbz count, #0, .Lexitfunc
+ ldrb tmp1w, [src, #-1]
+ strb tmp1w, [dst, #-1]
+
+.Lexitfunc:
+ ret
+
+.Lcpy_over64:
+ subs count, count, #128
+ b.ge .Lcpy_body_large
+ /*
+ * Less than 128 bytes to copy, so handle 64 here and then jump
+ * to the tail.
+ */
+ ldp A_l, A_h, [src, #-16]
+ stp A_l, A_h, [dst, #-16]
+ ldp B_l, B_h, [src, #-32]
+ ldp C_l, C_h, [src, #-48]
+ stp B_l, B_h, [dst, #-32]
+ stp C_l, C_h, [dst, #-48]
+ ldp D_l, D_h, [src, #-64]!
+ stp D_l, D_h, [dst, #-64]!
+
+ tst count, #0x3f
+ b.ne .Ltail63
+ ret
+
+ /*
+ * Critical loop. Start at a new cache line boundary. Assuming
+ * 64 bytes per line this ensures the entire loop is in one line.
+ */
+ .p2align 6
+.Lcpy_body_large:
+ /* There are at least 128 bytes to copy. */
+ ldp A_l, A_h, [src, #-16]
+ ldp B_l, B_h, [src, #-32]
+ ldp C_l, C_h, [src, #-48]
+ ldp D_l, D_h, [src, #-64]!
+1:
+ stp A_l, A_h, [dst, #-16]
+ ldp A_l, A_h, [src, #-16]
+ stp B_l, B_h, [dst, #-32]
+ ldp B_l, B_h, [src, #-32]
+ stp C_l, C_h, [dst, #-48]
+ ldp C_l, C_h, [src, #-48]
+ stp D_l, D_h, [dst, #-64]!
+ ldp D_l, D_h, [src, #-64]!
+ subs count, count, #64
+ b.ge 1b
+ stp A_l, A_h, [dst, #-16]
+ stp B_l, B_h, [dst, #-32]
+ stp C_l, C_h, [dst, #-48]
+ stp D_l, D_h, [dst, #-64]!
+
+ tst count, #0x3f
+ b.ne .Ltail63
+ ret
ENDPROC(memmove)
--
1.7.9.5
* [PATCH 3/6] arm64: lib: Implement optimized memset routine
2013-12-11 6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 1/6] arm64: lib: Implement optimized memcpy routine zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 2/6] arm64: lib: Implement optimized memmove routine zhichang.yuan at linaro.org
@ 2013-12-11 6:24 ` zhichang.yuan at linaro.org
2013-12-16 16:55 ` Will Deacon
2013-12-11 6:24 ` [PATCH 4/6] arm64: lib: Implement optimized memcmp routine zhichang.yuan at linaro.org
` (2 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11 6:24 UTC (permalink / raw)
To: linux-arm-kernel
From: "zhichang.yuan" <zhichang.yuan@linaro.org>
This patch, based on Linaro's Cortex Strings library, improves
the performance of the assembly optimized memset() function.
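One step worth spelling out is how the fill byte is replicated across a
64-bit register before the stp stores. In C that step is simply the
sketch below; illustrative only, not part of the patch, and splat_byte is
a made-up name. The dc zva fast path used for zeroing has no direct C
equivalent.

#include <stdint.h>

/* Illustrative only: mirrors the and/orr/orr/orr sequence at the top of
 * the assembly. */
static uint64_t splat_byte(unsigned char c)
{
        uint64_t v = c;

        v |= v << 8;    /* 0x..ab       -> 0x..abab */
        v |= v << 16;   /* 0x..abab     -> 0x..abababab */
        v |= v << 32;   /* 0xabababab   -> 0xabababababababab */
        return v;
}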
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
arch/arm64/lib/memset.S | 227 +++++++++++++++++++++++++++++++++++++++++------
1 file changed, 201 insertions(+), 26 deletions(-)
diff --git a/arch/arm64/lib/memset.S b/arch/arm64/lib/memset.S
index 87e4a68..90b973e 100644
--- a/arch/arm64/lib/memset.S
+++ b/arch/arm64/lib/memset.S
@@ -1,13 +1,21 @@
/*
* Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
@@ -18,7 +26,7 @@
#include <asm/assembler.h>
/*
- * Fill in the buffer with character c (alignment handled by the hardware)
+ * Fill in the buffer with character c
*
* Parameters:
* x0 - buf
@@ -27,27 +35,194 @@
* Returns:
* x0 - buf
*/
+
+/* By default we assume that the DC instruction can be used to zero
+* data blocks more efficiently. In some circumstances this might be
+* unsafe, for example in an asymmetric multiprocessor environment with
+* different DC clear lengths (neither the upper nor lower lengths are
+* safe to use). The feature can be disabled by defining DONT_USE_DC.
+*/
+
+#define dstin x0
+#define val w1
+#define count x2
+#define tmp1 x3
+#define tmp1w w3
+#define tmp2 x4
+#define tmp2w w4
+#define zva_len_x x5
+#define zva_len w5
+#define zva_bits_x x6
+
+#define A_l x7
+#define A_lw w7
+#define dst x8
+#define tmp3w w9
+#define tmp3 x9
+
ENTRY(memset)
- mov x4, x0
- and w1, w1, #0xff
- orr w1, w1, w1, lsl #8
- orr w1, w1, w1, lsl #16
- orr x1, x1, x1, lsl #32
- subs x2, x2, #8
- b.mi 2f
-1: str x1, [x4], #8
- subs x2, x2, #8
- b.pl 1b
-2: adds x2, x2, #4
- b.mi 3f
- sub x2, x2, #4
- str w1, [x4], #4
-3: adds x2, x2, #2
- b.mi 4f
- sub x2, x2, #2
- strh w1, [x4], #2
-4: adds x2, x2, #1
- b.mi 5f
- strb w1, [x4]
-5: ret
+ mov dst, dstin /* Preserve return value. */
+ and A_lw, val, #255
+ orr A_lw, A_lw, A_lw, lsl #8
+ orr A_lw, A_lw, A_lw, lsl #16
+ orr A_l, A_l, A_l, lsl #32
+
+ /* First align dst to 16 bytes. */
+ neg tmp2, dst
+ ands tmp2, tmp2, #15
+ b.eq .Laligned
+
+ cmp count, #15
+ b.le .Ltail15tiny
+ /*
+ * The count is at least 16, so we can use stp to set 16 bytes at
+ * once. This is more efficient even though the access is unaligned.
+ */
+ stp A_l, A_l, [dst]
+ /*make the dst aligned..*/
+ sub count, count, tmp2
+ add dst, dst, tmp2
+
+ /*Here, dst is aligned 16 now...*/
+.Laligned:
+#ifndef DONT_USE_DC
+ cbz A_l, .Lzero_mem
+#endif
+
+.Ltail_maybe_long:
+ cmp count, #64
+ b.ge .Lnot_short
+.Ltail63:
+ ands tmp1, count, #0x30
+ b.eq .Ltail15tiny
+ cmp tmp1w, #0x20
+ b.eq 1f
+ b.lt 2f
+ stp A_l, A_l, [dst], #16
+1:
+ stp A_l, A_l, [dst], #16
+2:
+ stp A_l, A_l, [dst], #16
+/*
+* The following stores are unaligned, but this is more efficient than
+* .Ltail15tiny. Of course, this code could be removed, at a small
+* performance cost.
+*/
+ ands count, count, #15
+ cbz count, 1f
+ add dst, dst, count
+ stp A_l, A_l, [dst, #-16] /* Repeat some/all of last store. */
+1:
+ ret
+
+.Ltail15tiny:
+ /* Set up to 15 bytes. Does not assume earlier memory
+ being set. */
+ tbz count, #3, 1f
+ str A_l, [dst], #8
+1:
+ tbz count, #2, 1f
+ str A_lw, [dst], #4
+1:
+ tbz count, #1, 1f
+ strh A_lw, [dst], #2
+1:
+ tbz count, #0, 1f
+ strb A_lw, [dst]
+1:
+ ret
+
+ /*
+ * Critical loop. Start at a new cache line boundary. Assuming
+ * 64 bytes per line, this ensures the entire loop is in one line.
+ */
+ .p2align 6
+.Lnot_short: /* count is at least 64 */
+ sub dst, dst, #16/* Pre-bias. */
+ sub count, count, #64
+1:
+ stp A_l, A_l, [dst, #16]
+ stp A_l, A_l, [dst, #32]
+ stp A_l, A_l, [dst, #48]
+ stp A_l, A_l, [dst, #64]!
+ subs count, count, #64
+ b.ge 1b
+ tst count, #0x3f
+ add dst, dst, #16
+ b.ne .Ltail63
+.Lexitfunc:
+ ret
+
+#ifndef DONT_USE_DC
+ /*
+ * For zeroing memory, check to see if we can use the ZVA feature to
+ * zero entire 'cache' lines.
+ */
+.Lzero_mem:
+ cmp count, #63
+ b.le .Ltail63
+ /*
+ * For zeroing small amounts of memory, it's not worth setting up
+ * the line-clear code.
+ */
+ cmp count, #128
+ b.lt .Lnot_short /* from here on, count is at least 128 bytes */
+
+ mrs tmp1, dczid_el0
+ tbnz tmp1, #4, .Lnot_short
+ mov tmp3w, #4
+ and zva_len, tmp1w, #15 /* Safety: other bits reserved. */
+ lsl zva_len, tmp3w, zva_len
+
+ ands tmp3w, zva_len, #63
+ /*
+ * Skip the ZVA path if zva_len is less than 64; it is not
+ * worthwhile to use ZVA for block sizes smaller than 64 bytes.
+ */
+ b.ne .Lnot_short
+.Lzero_by_line:
+ /*
+ * Compute how far we need to go to become suitably aligned. We're
+ * already at quad-word alignment.
+ */
+ cmp count, zva_len_x
+ b.lt .Lnot_short /* Not enough to reach alignment. */
+ sub zva_bits_x, zva_len_x, #1
+ neg tmp2, dst
+ ands tmp2, tmp2, zva_bits_x
+ b.eq 1f /* Already aligned. */
+ /* Not aligned, check that there's enough to copy after alignment.*/
+ sub tmp1, count, tmp2
+ /*
+ * Guarantee that the remaining length to be zeroed with ZVA is
+ * larger than 64, so the alignment loop at 2f cannot run past the
+ * end of the memory range.
+ */
+ cmp tmp1, #64
+ ccmp tmp1, zva_len_x, #8, ge /* NZCV=0b1000 */
+ b.lt .Lnot_short
+ /*
+ * We know that there's at least 64 bytes to zero and that it's safe
+ * to overrun by 64 bytes.
+ */
+ mov count, tmp1
+2:
+ stp A_l, A_l, [dst]
+ stp A_l, A_l, [dst, #16]
+ stp A_l, A_l, [dst, #32]
+ subs tmp2, tmp2, #64
+ stp A_l, A_l, [dst, #48]
+ add dst, dst, #64
+ b.ge 2b
+ /* We've overrun a bit, so adjust dst downwards.*/
+ add dst, dst, tmp2
+1:
+ sub count, count, zva_len_x
+3:
+ dc zva, dst
+ add dst, dst, zva_len_x
+ subs count, count, zva_len_x
+ b.ge 3b
+ ands count, count, zva_bits_x
+ b.ne .Ltail_maybe_long
+ ret
+#endif /* DONT_USE_DC */
ENDPROC(memset)
--
1.7.9.5
* [PATCH 4/6] arm64: lib: Implement optimized memcmp routine
2013-12-11 6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
` (2 preceding siblings ...)
2013-12-11 6:24 ` [PATCH 3/6] arm64: lib: Implement optimized memset routine zhichang.yuan at linaro.org
@ 2013-12-11 6:24 ` zhichang.yuan at linaro.org
2013-12-16 16:56 ` Will Deacon
2013-12-11 6:24 ` [PATCH 5/6] arm64: lib: Implement optimized string compare routines zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 6/6] arm64: lib: Implement optimized string length routines zhichang.yuan at linaro.org
5 siblings, 1 reply; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11 6:24 UTC (permalink / raw)
To: linux-arm-kernel
From: "zhichang.yuan" <zhichang.yuan@linaro.org>
This patch, based on Linaro's Cortex Strings library, adds
an assembly optimized memcmp() function.
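Once the 8-bytes-at-a-time loop finds a Dword where the inputs differ,
the return value comes from the first differing byte, compared as
unsigned characters. The assembly derives it with rev/clz and shifts; a
rough C equivalent of that final step is sketched below (illustrative
only, not part of the patch; first_diff_to_result is a made-up name, and
the sketch assumes little-endian byte order).

#include <stdint.h>

/* Illustrative only: on a little-endian system byte i of the word is the
 * i-th byte in memory, so scanning from bit 0 finds the first mismatch. */
static int first_diff_to_result(uint64_t data1, uint64_t data2)
{
        for (int i = 0; i < 8; i++) {
                unsigned char c1 = data1 >> (8 * i);
                unsigned char c2 = data2 >> (8 * i);

                if (c1 != c2)
                        return (int)c1 - (int)c2;
        }
        return 0;       /* the words were equal */
}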
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
arch/arm64/include/asm/string.h | 3 +
arch/arm64/kernel/arm64ksyms.c | 1 +
arch/arm64/lib/Makefile | 2 +-
arch/arm64/lib/memcmp.S | 258 +++++++++++++++++++++++++++++++++++++++
4 files changed, 263 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/lib/memcmp.S
diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h
index 3ee8b30..3a43305 100644
--- a/arch/arm64/include/asm/string.h
+++ b/arch/arm64/include/asm/string.h
@@ -34,4 +34,7 @@ extern void *memchr(const void *, int, __kernel_size_t);
#define __HAVE_ARCH_MEMSET
extern void *memset(void *, int, __kernel_size_t);
+#define __HAVE_ARCH_MEMCMP
+extern int memcmp(const void *, const void *, size_t);
+
#endif
diff --git a/arch/arm64/kernel/arm64ksyms.c b/arch/arm64/kernel/arm64ksyms.c
index e7ee770..af02a25 100644
--- a/arch/arm64/kernel/arm64ksyms.c
+++ b/arch/arm64/kernel/arm64ksyms.c
@@ -51,6 +51,7 @@ EXPORT_SYMBOL(memset);
EXPORT_SYMBOL(memcpy);
EXPORT_SYMBOL(memmove);
EXPORT_SYMBOL(memchr);
+EXPORT_SYMBOL(memcmp);
/* atomic bitops */
EXPORT_SYMBOL(set_bit);
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 59acc0e..a6a8d3d 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -2,5 +2,5 @@ lib-y := bitops.o delay.o \
strncpy_from_user.o strnlen_user.o clear_user.o \
copy_from_user.o copy_to_user.o copy_in_user.o \
copy_page.o clear_page.o \
- memchr.o memcpy.o memmove.o memset.o \
+ memchr.o memcpy.o memmove.o memset.o memcmp.o \
strchr.o strrchr.o
diff --git a/arch/arm64/lib/memcmp.S b/arch/arm64/lib/memcmp.S
new file mode 100644
index 0000000..97e8431
--- /dev/null
+++ b/arch/arm64/lib/memcmp.S
@@ -0,0 +1,258 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/*
+* compare memory areas (when the two areas' alignment offsets differ,
+* the unaligned accesses are handled by the hardware)
+*
+* Parameters:
+* x0 - const memory area 1 pointer
+* x1 - const memory area 2 pointer
+* x2 - the maximal compare byte length
+* Returns:
+* x0 - a compare result: less than, equal to, or greater than ZERO
+*/
+
+/* Parameters and result. */
+#define src1 x0
+#define src2 x1
+#define limit x2
+#define result x0
+
+/* Internal variables. */
+#define data1 x3
+#define data1w w3
+#define data2 x4
+#define data2w w4
+#define has_nul x5
+#define diff x6
+#define endloop x7
+#define tmp1 x8
+#define tmp2 x9
+#define tmp3 x10
+#define pos x11
+#define limit_wd x12
+#define mask x13
+
+ENTRY(memcmp)
+ cbz limit, .Lret0
+ eor tmp1, src1, src2
+ tst tmp1, #7
+ b.ne .Lmisaligned8
+ ands tmp1, src1, #7
+ b.ne .Lmutual_align
+ sub limit_wd, limit, #1 /* limit != 0, so no underflow. */
+ lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */
+.Lloop_aligned:
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+.Lstart_realigned:
+ subs limit_wd, limit_wd, #1
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ csinv endloop, diff, xzr, cs /* Last Dword or differences. */
+ cbz endloop, .Lloop_aligned
+
+ /* Not reached the limit, must have found a diff. */
+ tbz limit_wd, #63, .Lnot_limit
+
+ /* Limit % 8 == 0 => all bytes significant. */
+ ands limit, limit, #7
+ b.eq .Lnot_limit
+
+ lsl limit, limit, #3 /* Bits -> bytes. */
+ mov mask, #~0
+#ifdef __ARM64EB__
+ lsr mask, mask, limit
+#else
+ lsl mask, mask, limit
+#endif
+ bic data1, data1, mask
+ bic data2, data2, mask
+
+ orr diff, diff, mask
+ b .Lnot_limit
+
+.Lmutual_align:
+ /*
+ * Sources are mutually aligned, but are not currently at an
+ * alignment boundary. Round down the addresses and then mask off
+ * the bytes that precede the start point.
+ */
+ bic src1, src1, #7
+ bic src2, src2, #7
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+ sub limit_wd, limit, #1/* limit != 0, so no underflow. */
+ and tmp3, limit_wd, #7
+ lsr limit_wd, limit_wd, #3
+ add tmp3, tmp3, tmp1
+ add limit_wd, limit_wd, tmp3, lsr #3
+ add limit, limit, tmp1/* Adjust the limit for the extra. */
+ lsl tmp1, tmp1, #3/* Bytes beyond alignment -> bits.*/
+ neg tmp1, tmp1/* Bits to alignment -64. */
+ mov tmp2, #~0
+#ifdef __ARM64EB__
+ /* Big-endian. Early bytes are at MSB. */
+ lsl tmp2, tmp2, tmp1/* Shift (tmp1 & 63). */
+#else
+ /* Little-endian. Early bytes are at LSB. */
+ lsr tmp2, tmp2, tmp1/* Shift (tmp1 & 63). */
+#endif
+ orr data1, data1, tmp2
+ orr data2, data2, tmp2
+ b .Lstart_realigned
+
+.Lmisaligned8:
+ cmp limit, #8
+ b.lo .Ltiny8proc /*limit < 8... */
+ /*
+ * Compare the leading bytes one by one first, using the alignment
+ * offsets. After this, one source address will be aligned.
+ */
+ and tmp1, src1, #7
+ neg tmp1, tmp1
+ add tmp1, tmp1, #8
+ and tmp2, src2, #7
+ neg tmp2, tmp2
+ add tmp2, tmp2, #8
+ subs tmp3, tmp1, tmp2
+ csel pos, tmp1, tmp2, hi /*Choose the maximum. */
+ /*
+ * Here, limit is not less than 8,
+ * so directly run .Ltinycmp without checking the limit.*/
+ sub limit, limit, pos
+.Ltinycmp:
+ ldrb data1w, [src1], #1
+ ldrb data2w, [src2], #1
+ subs pos, pos, #1
+ ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */
+ b.eq .Ltinycmp
+ cbnz pos, 1f /*find the unequal...*/
+ cmp data1w, data2w
+ b.eq .Lstart_align /*the last bytes are equal....*/
+1:
+ sub result, data1, data2
+ ret
+
+.Lstart_align:
+ lsr limit_wd, limit, #3
+ cbz limit_wd, .Lremain8
+ ands xzr, src1, #7
+ /*
+ * eq means the tmp1 bytes were compared in .Ltinycmp,
+ * so src1 is aligned and tmp3 is not negative here.
+ */
+ b.eq .Lrecal_offset
+ add src1, src1, tmp3
+ add src2, src2, tmp3
+ sub limit, limit, tmp3
+ lsr limit_wd, limit, #3
+ cbz limit_wd, .Lremain8
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+
+ subs limit_wd, limit_wd, #1
+ eor diff, data1, data2 /*Non-zero if differences found.*/
+ csinv endloop, diff, xzr, ne
+ cbnz endloop, .Lunequal_proc
+ and tmp3, tmp3, #7 /*tmp3 = 8 + tmp3 ( old tmp3 is negative)*/
+ /*
+ * src1 is aligned and lies to the right of src2.
+ * The remaining count is at least 8 here.
+ */
+.Lrecal_offset:
+ neg pos, tmp3
+.Lloopcmp_proc:
+ /*
+ * Step back pos bytes (pos is negative here) to get the leading
+ * segment of one Dword of src1. We could instead use:
+ * ldr data1, [src1]
+ * ldr data2, [src2, pos]
+ * which reads from aligned addresses, but then extra shift and mask
+ * operations would be needed to build two comparable Dwords from
+ * the loaded values; that adds more instructions and, most
+ * importantly, costs more time.
+ */
+ ldr data1, [src1,pos]
+ ldr data2, [src2,pos]
+ eor diff, data1, data2 /* Non-zero if differences found.*/
+ cbnz diff, .Lnot_limit
+
+ /* The second part: process the next aligned Dword. */
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+ eor diff, data1, data2 /* Non-zero if differences found.*/
+ subs limit_wd, limit_wd, #1
+ csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
+ cbz endloop, .Lloopcmp_proc
+.Lunequal_proc:
+ /*whether unequal occurred?*/
+ cbz diff, .Lremain8
+.Lnot_limit:
+#ifndef __ARM64EB__
+ rev diff, diff
+ rev data1, data1
+ rev data2, data2
+#endif
+ /*
+ * The MS-non-zero bit of DIFF marks either the first bit
+ * that is different, or the end of the significant data.
+ * Shifting left now will bring the critical information into the
+ * top bits.
+ */
+ clz pos, diff
+ lsl data1, data1, pos
+ lsl data2, data2, pos
+ /* But we need to zero-extend (char is unsigned) the value and then
+ perform a signed 32-bit subtraction. */
+ lsr data1, data1, #56
+ sub result, data1, data2, lsr #56
+ ret
+
+ .p2align 6
+.Lremain8:
+ /* Limit % 8 == 0 => all bytes significant. */
+ ands limit, limit, #7
+ b.eq .Lret0
+
+.Ltiny8proc:
+ /*Perhaps we can do better than this.*/
+ ldrb data1w, [src1], #1
+ ldrb data2w, [src2], #1
+ subs limit, limit, #1
+ /*
+ * ne means the current limit is still > 0. When limit reaches zero
+ * (Z=1), the ccmp sets the flags from its immediate NZCV value (all
+ * zero), so the following b.eq is not taken and the loop ends.
+ */
+ ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */
+ b.eq .Ltiny8proc
+ sub result, data1, data2
+ ret
+.Lret0:
+ mov result, #0
+ ret
+ENDPROC(memcmp)
--
1.7.9.5
* [PATCH 5/6] arm64: lib: Implement optimized string compare routines
2013-12-11 6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
` (3 preceding siblings ...)
2013-12-11 6:24 ` [PATCH 4/6] arm64: lib: Implement optimized memcmp routine zhichang.yuan at linaro.org
@ 2013-12-11 6:24 ` zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 6/6] arm64: lib: Implement optimized string length routines zhichang.yuan at linaro.org
5 siblings, 0 replies; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11 6:24 UTC (permalink / raw)
To: linux-arm-kernel
From: "zhichang.yuan" <zhichang.yuan@linaro.org>
This patch, based on Linaro's Cortex Strings library, adds
assembly optimized strcmp() and strncmp() functions.
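Both routines are built around a per-Dword "syndrome" that becomes
non-zero as soon as the two words differ or the first string contains a
NUL terminator, so a single test ends the main loop for both conditions.
A rough C sketch of that computation follows (illustrative only, not part
of the patch; cmp_syndrome is a made-up name).

#include <stdint.h>

#define REP8_01 0x0101010101010101ULL
#define REP8_80 0x8080808080808080ULL

/* Illustrative only: mirrors the sub/orr/bic/eor/orr sequence in the
 * assembly (bic has_nul, tmp1, tmp2 computes tmp1 & ~tmp2). */
static uint64_t cmp_syndrome(uint64_t data1, uint64_t data2)
{
        uint64_t has_nul = (data1 - REP8_01) & ~data1 & REP8_80;
        uint64_t diff = data1 ^ data2;

        return diff | has_nul;  /* zero only if equal and no terminator */
}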
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
arch/arm64/include/asm/string.h | 6 +
arch/arm64/kernel/arm64ksyms.c | 2 +
arch/arm64/lib/Makefile | 2 +-
arch/arm64/lib/strcmp.S | 256 +++++++++++++++++++++++++++++
arch/arm64/lib/strncmp.S | 340 +++++++++++++++++++++++++++++++++++++++
5 files changed, 605 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/lib/strcmp.S
create mode 100644 arch/arm64/lib/strncmp.S
diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h
index 3a43305..6133f49 100644
--- a/arch/arm64/include/asm/string.h
+++ b/arch/arm64/include/asm/string.h
@@ -22,6 +22,12 @@ extern char *strrchr(const char *, int c);
#define __HAVE_ARCH_STRCHR
extern char *strchr(const char *, int c);
+#define __HAVE_ARCH_STRCMP
+extern int strcmp(const char *, const char *);
+
+#define __HAVE_ARCH_STRNCMP
+extern int strncmp(const char *, const char *, __kernel_size_t);
+
#define __HAVE_ARCH_MEMCPY
extern void *memcpy(void *, const void *, __kernel_size_t);
diff --git a/arch/arm64/kernel/arm64ksyms.c b/arch/arm64/kernel/arm64ksyms.c
index af02a25..d61fa39 100644
--- a/arch/arm64/kernel/arm64ksyms.c
+++ b/arch/arm64/kernel/arm64ksyms.c
@@ -47,6 +47,8 @@ EXPORT_SYMBOL(memstart_addr);
/* string / mem functions */
EXPORT_SYMBOL(strchr);
EXPORT_SYMBOL(strrchr);
+EXPORT_SYMBOL(strcmp);
+EXPORT_SYMBOL(strncmp);
EXPORT_SYMBOL(memset);
EXPORT_SYMBOL(memcpy);
EXPORT_SYMBOL(memmove);
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index a6a8d3d..5ded21f 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -3,4 +3,4 @@ lib-y := bitops.o delay.o \
copy_from_user.o copy_to_user.o copy_in_user.o \
copy_page.o clear_page.o \
memchr.o memcpy.o memmove.o memset.o memcmp.o \
- strchr.o strrchr.o
+ strchr.o strrchr.o strcmp.o strncmp.o
diff --git a/arch/arm64/lib/strcmp.S b/arch/arm64/lib/strcmp.S
new file mode 100644
index 0000000..e07ea0d
--- /dev/null
+++ b/arch/arm64/lib/strcmp.S
@@ -0,0 +1,256 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/*
+ * compare two strings
+ *
+ * Parameters:
+ * x0 - const string 1 pointer
+ * x1 - const string 2 pointer
+ * Returns:
+ * x0 - an integer less than, equal to, or greater than zero
+ * if s1 is found, respectively, to be less than, to match,
+ * or be greater than s2.
+ */
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+/* Parameters and result. */
+#define src1 x0
+#define src2 x1
+#define result x0
+
+/* Internal variables. */
+#define data1 x2
+#define data1w w2
+#define data2 x3
+#define data2w w3
+#define has_nul x4
+#define diff x5
+#define syndrome x6
+#define tmp1 x7
+#define tmp2 x8
+#define tmp3 x9
+#define zeroones x10
+#define pos x11
+
+ENTRY(strcmp)
+ eor tmp1, src1, src2
+ mov zeroones, #REP8_01
+ tst tmp1, #7
+ b.ne .Lmisaligned8
+ ands tmp1, src1, #7
+ b.ne .Lmutual_align
+
+ /*
+ * NUL detection works on the principle that (X - 1) & (~X) & 0x80
+ * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+ * can be done in parallel across the entire word.
+ */
+.Lloop_aligned:
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+.Lstart_realigned:
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ bic has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
+ orr syndrome, diff, has_nul
+ cbz syndrome, .Lloop_aligned
+ b .Lcal_cmpresult
+
+.Lmutual_align:
+ /*
+ * Sources are mutually aligned, but are not currently at an
+ * alignment boundary. Round down the addresses and then mask off
+ * the bytes that precede the start point.
+ */
+ bic src1, src1, #7
+ bic src2, src2, #7
+ lsl tmp1, tmp1, #3 /* Bytes beyond alignment -> bits. */
+ ldr data1, [src1], #8
+ neg tmp1, tmp1 /* Bits to alignment -64. */
+ ldr data2, [src2], #8
+ mov tmp2, #~0
+#ifdef __ARM64EB__
+ /* Big-endian. Early bytes are at MSB. */
+ lsl tmp2, tmp2, tmp1 /* Shift (tmp1 & 63). */
+#else
+ /* Little-endian. Early bytes are at LSB. */
+ lsr tmp2, tmp2, tmp1 /* Shift (tmp1 & 63). */
+#endif
+ orr data1, data1, tmp2
+ orr data2, data2, tmp2
+ b .Lstart_realigned
+
+.Lmisaligned8:
+ /*
+ * Compare the leading bytes one by one first, using the alignment
+ * offsets. After this, one source address will be aligned.
+ */
+ and tmp1, src1, #7
+ neg tmp1, tmp1
+ add tmp1, tmp1, #8
+ and tmp2, src2, #7
+ neg tmp2, tmp2
+ add tmp2, tmp2, #8
+ subs tmp3, tmp1, tmp2
+ csel pos, tmp1, tmp2, hi /*Choose the maximum. */
+.Ltinycmp:
+ ldrb data1w, [src1], #1
+ ldrb data2w, [src2], #1
+ subs pos, pos, #1
+ ccmp data1w, #1, #0, ne /* NZCV = 0b0000. */
+ ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
+ b.eq .Ltinycmp
+ cbnz pos, 1f /*find the null or unequal...*/
+ cmp data1w, #1
+ ccmp data1w, data2w, #0, cs
+ b.eq .Lstart_align /*the last bytes are equal....*/
+1:
+ sub result, data1, data2
+ ret
+
+.Lstart_align:
+ ands xzr, src1, #7
+ /*
+ * eq means src1 is aligned now;
+ * tmp3 is not negative on this branch.
+ */
+ b.eq .Lrecal_offset
+ add src1, src1, tmp3
+ add src2, src2, tmp3
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ bic has_nul, tmp1, tmp2
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ orr syndrome, diff, has_nul
+ cbnz syndrome, .Lcal_cmpresult
+ and tmp3, tmp3, #7 /*tmp3 = 8 + tmp3 ( old tmp3 is negative)*/
+ /*
+ * src1 is aligned and lies to the right of src2.
+ * Start comparing the next 8 bytes of the two strings.
+ */
+.Lrecal_offset:
+ neg pos, tmp3
+.Lloopcmp_proc:
+ /*
+ * Step back pos bytes (pos is negative here) to get the leading
+ * segment of one Dword of src1. We could instead use:
+ * ldr data1, [src1]
+ * ldr data2, [src2, pos]
+ * which reads from aligned addresses, but then extra shift and mask
+ * operations would be needed to build two comparable Dwords from
+ * the loaded values; that adds more instructions and, most
+ * importantly, costs more time.
+ */
+ ldr data1, [src1,pos]
+ ldr data2, [src2,pos]
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ bic has_nul, tmp1, tmp2
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ orr syndrome, diff, has_nul
+ cbnz syndrome, .Lcal_cmpresult
+
+ /* The second part: process the next aligned Dword. */
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ bic has_nul, tmp1, tmp2
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ orr syndrome, diff, has_nul
+ cbz syndrome, .Lloopcmp_proc
+
+.Lcal_cmpresult:
+#ifndef __ARM64EB__
+ rev syndrome, syndrome
+ rev data1, data1
+ /*
+ * The MS-non-zero bit of the syndrome marks either the first bit
+ * that is different, or the top bit of the first zero byte.
+ * Shifting left now will bring the critical information into the
+ * top bits.
+ */
+ clz pos, syndrome
+ rev data2, data2
+ lsl data1, data1, pos
+ lsl data2, data2, pos
+ /*
+ * But we need to zero-extend (char is unsigned) the value and then
+ * perform a signed 32-bit subtraction.
+ */
+ lsr data1, data1, #56
+ sub result, data1, data2, lsr #56
+ ret
+#else
+ /*
+ * For big-endian we cannot use the trick with the syndrome value
+ * as carry-propagation can corrupt the upper bits if the trailing
+ * bytes in the string contain 0x01.
+ * However, if there is no NUL byte in the dword, we can generate
+ * the result directly. We can't just subtract the bytes as the
+ * MSB might be significant.
+ */
+ cbnz has_nul, 1f
+ cmp data1, data2
+ cset result, ne
+ cneg result, result, lo
+ ret
+1:
+ /*Re-compute the NUL-byte detection, using a byte-reversed value. */
+ rev tmp3, data1
+ sub tmp1, tmp3, zeroones
+ orr tmp2, tmp3, #REP8_7f
+ bic has_nul, tmp1, tmp2
+ rev has_nul, has_nul
+ orr syndrome, diff, has_nul
+ clz pos, syndrome
+ /*
+ * The MS-non-zero bit of the syndrome marks either the first bit
+ * that is different, or the top bit of the first zero byte.
+ * Shifting left now will bring the critical information into the
+ * top bits.
+ */
+ lsl data1, data1, pos
+ lsl data2, data2, pos
+ /*
+ * But we need to zero-extend (char is unsigned) the value and then
+ * perform a signed 32-bit subtraction.
+ */
+ lsr data1, data1, #56
+ sub result, data1, data2, lsr #56
+ ret
+#endif
+ENDPROC(strcmp)
diff --git a/arch/arm64/lib/strncmp.S b/arch/arm64/lib/strncmp.S
new file mode 100644
index 0000000..e883fd3
--- /dev/null
+++ b/arch/arm64/lib/strncmp.S
@@ -0,0 +1,340 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/*
+ * compare two strings
+ *
+ * Parameters:
+ * x0 - const string 1 pointer
+ * x1 - const string 2 pointer
+ * x2 - the maximal length to be compared
+ * Returns:
+ * x0 - an integer less than, equal to, or greater than zero if s1 is found,
+ * respectively, to be less than, to match, or be greater than s2.
+ */
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+/* Parameters and result. */
+#define src1 x0
+#define src2 x1
+#define limit x2
+#define result x0
+
+/* Internal variables. */
+#define data1 x3
+#define data1w w3
+#define data2 x4
+#define data2w w4
+#define has_nul x5
+#define diff x6
+#define syndrome x7
+#define tmp1 x8
+#define tmp2 x9
+#define tmp3 x10
+#define zeroones x11
+#define pos x12
+#define limit_wd x13
+#define mask x14
+#define endloop x15
+
+ENTRY(strncmp)
+ cbz limit, .Lret0
+ eor tmp1, src1, src2
+ mov zeroones, #REP8_01
+ tst tmp1, #7
+ b.ne .Lmisaligned8
+ ands tmp1, src1, #7
+ b.ne .Lmutual_align
+ /* Calculate the number of full and partial words -1. */
+ /*
+ * When limit is a multiple of 8, the last-Dword check would be
+ * wrong without subtracting 1 here.
+ */
+ sub limit_wd, limit, #1 /* limit != 0, so no underflow. */
+ lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */
+
+ /*
+ * NUL detection works on the principle that (X - 1) & (~X) & 0x80
+ * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+ * can be done in parallel across the entire word.
+ */
+.Lloop_aligned:
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+.Lstart_realigned:
+ subs limit_wd, limit_wd, #1
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ csinv endloop, diff, xzr, pl /* Last Dword or differences.*/
+ bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
+ ccmp endloop, #0, #0, eq
+ b.eq .Lloop_aligned
+
+ /*Not reached the limit, must have found the end or a diff. */
+ tbz limit_wd, #63, .Lnot_limit
+
+ /* Limit % 8 == 0 => all bytes significant. */
+ ands limit, limit, #7
+ b.eq .Lnot_limit
+
+ lsl limit, limit, #3 /* Bits -> bytes. */
+ mov mask, #~0
+#ifdef __ARM64EB__
+ lsr mask, mask, limit
+#else
+ lsl mask, mask, limit
+#endif
+ bic data1, data1, mask
+ bic data2, data2, mask
+
+ /* Make sure that the NUL byte is marked in the syndrome. */
+ orr has_nul, has_nul, mask
+
+.Lnot_limit:
+ orr syndrome, diff, has_nul
+ b .Lcal_cmpresult
+
+.Lmutual_align:
+ /*
+ * Sources are mutually aligned, but are not currently at an
+ * alignment boundary. Round down the addresses and then mask off
+ * the bytes that precede the start point.
+ * We also need to adjust the limit calculations, but without
+ * overflowing if the limit is near ULONG_MAX.
+ */
+ bic src1, src1, #7
+ bic src2, src2, #7
+ ldr data1, [src1], #8
+ neg tmp3, tmp1, lsl #3 /* 64 - bits(bytes beyond align). */
+ ldr data2, [src2], #8
+ mov tmp2, #~0
+ sub limit_wd, limit, #1 /* limit != 0, so no underflow. */
+#ifdef __ARM64EB__
+ /* Big-endian. Early bytes are at MSB. */
+ lsl tmp2, tmp2, tmp3 /* Shift (tmp1 & 63). */
+#else
+ /* Little-endian. Early bytes are at LSB. */
+ lsr tmp2, tmp2, tmp3 /* Shift (tmp1 & 63). */
+#endif
+ and tmp3, limit_wd, #7
+ lsr limit_wd, limit_wd, #3
+ /* Adjust the limit. Only low 3 bits used, so overflow irrelevant.*/
+ add limit, limit, tmp1
+ add tmp3, tmp3, tmp1
+ orr data1, data1, tmp2
+ orr data2, data2, tmp2
+ add limit_wd, limit_wd, tmp3, lsr #3
+ b .Lstart_realigned
+
+/* When the src1 offset is not equal to the src2 offset... */
+.Lmisaligned8:
+ cmp limit, #8
+ b.lo .Ltiny8proc /*limit < 8... */
+ /*
+ * Compare the leading bytes one by one first, using the alignment
+ * offsets. After this, one source address will be aligned.
+ */
+ and tmp1, src1, #7
+ neg tmp1, tmp1
+ add tmp1, tmp1, #8
+ and tmp2, src2, #7
+ neg tmp2, tmp2
+ add tmp2, tmp2, #8
+ subs tmp3, tmp1, tmp2
+ csel pos, tmp1, tmp2, hi /*Choose the maximum. */
+ /*
+ * Here, limit is not less than 8, so directly run .Ltinycmp
+ * without checking the limit.*/
+ sub limit, limit, pos
+.Ltinycmp:
+ ldrb data1w, [src1], #1
+ ldrb data2w, [src2], #1
+ subs pos, pos, #1
+ ccmp data1w, #1, #0, ne /* NZCV = 0b0000. */
+ ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
+ b.eq .Ltinycmp
+ cbnz pos, 1f /*find the null or unequal...*/
+ cmp data1w, #1
+ ccmp data1w, data2w, #0, cs
+ b.eq .Lstart_align /*the last bytes are equal....*/
+1:
+ sub result, data1, data2
+ ret
+
+.Lstart_align:
+ lsr limit_wd, limit, #3
+ cbz limit_wd, .Lremain8
+ ands xzr, src1, #7
+ /*eq means src1 is aligned now,tmp3 is positive in this branch.*/
+ b.eq .Lrecal_offset
+ add src1, src1, tmp3
+ add src2, src2, tmp3
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+ /*
+ * Since tmp3 is negative and limit is at least 8, a cbz check is
+ * not needed here.
+ */
+ sub limit, limit, tmp3
+ lsr limit_wd, limit, #3
+ subs limit_wd, limit_wd, #1
+
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
+ bics has_nul, tmp1, tmp2
+ ccmp endloop, #0, #0, eq /*has_null is ZERO: no null byte*/
+ b.ne .Lunequal_proc
+ and tmp3, tmp3, #7 /*tmp3 = 8 + tmp3 ( old tmp3 is negative)*/
+ /*
+ * src1 is aligned and lies to the right of src2.
+ * Start the next 8-byte compare.
+ */
+.Lrecal_offset:
+ neg pos, tmp3
+.Lloopcmp_proc:
+ /*
+ * Step back pos bytes (pos is negative here) to get the leading
+ * segment of one Dword of src1. We could instead use:
+ * ldr data1, [src1]
+ * ldr data2, [src2, pos]
+ * which reads from aligned addresses, but then extra shift and mask
+ * operations would be needed to build two comparable Dwords from
+ * the loaded values; that adds more instructions and, most
+ * importantly, costs more time.
+ */
+ ldr data1, [src1,pos]
+ ldr data2, [src2,pos]
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ bics has_nul, tmp1, tmp2 /* Non-zero if NUL terminator. */
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ csinv endloop, diff, xzr, eq
+ cbnz endloop, .Lunequal_proc
+
+ /* The second part: process the next aligned Dword. */
+ ldr data1, [src1], #8
+ ldr data2, [src2], #8
+ subs limit_wd, limit_wd, #1
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ eor diff, data1, data2 /* Non-zero if differences found. */
+ csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
+ bics has_nul, tmp1, tmp2
+ ccmp endloop, #0, #0, eq /*has_null is ZERO: no null byte*/
+ b.eq .Lloopcmp_proc
+
+.Lunequal_proc:
+ orr syndrome, diff, has_nul
+ cbz syndrome, .Lremain8
+.Lcal_cmpresult:
+#ifndef __ARM64EB__
+ rev syndrome, syndrome
+ rev data1, data1
+ /*
+ * The MS-non-zero bit of the syndrome marks either the first bit
+ * that is different, or the top bit of the first zero byte.
+ * Shifting left now will bring the critical information into the
+ * top bits.
+ */
+ clz pos, syndrome
+ rev data2, data2
+ lsl data1, data1, pos
+ lsl data2, data2, pos
+ /* But we need to zero-extend (char is unsigned) the value and then
+ perform a signed 32-bit subtraction. */
+ lsr data1, data1, #56
+ sub result, data1, data2, lsr #56
+ ret
+#else
+ /*
+ * For big-endian we cannot use the trick with the syndrome value
+ * as carry-propagation can corrupt the upper bits if the trailing
+ * bytes in the string contain 0x01.
+ * However, if there is no NUL byte in the dword, we can generate
+ * the result directly. We can't just subtract the bytes as the
+ * MSB might be significant.
+ */
+ cbnz has_nul, 1f
+ cmp data1, data2
+ cset result, ne
+ cneg result, result, lo
+ ret
+1:
+ /* Re-compute the NUL-byte detection, using a byte-reversed value.*/
+ rev tmp3, data1
+ sub tmp1, tmp3, zeroones
+ orr tmp2, tmp3, #REP8_7f
+ bic has_nul, tmp1, tmp2
+ rev has_nul, has_nul
+ orr syndrome, diff, has_nul
+ clz pos, syndrome
+ /*
+ * The MS-non-zero bit of the syndrome marks either the first bit
+ * that is different, or the top bit of the first zero byte.
+ * Shifting left now will bring the critical information into the
+ * top bits.
+ */
+ lsl data1, data1, pos
+ lsl data2, data2, pos
+ /*
+ * But we need to zero-extend (char is unsigned) the value and then
+ * perform a signed 32-bit subtraction.
+ */
+ lsr data1, data1, #56
+ sub result, data1, data2, lsr #56
+ ret
+#endif
+
+.Lremain8:
+ /* Limit % 8 == 0 => all bytes significant. */
+ ands limit, limit, #7
+ b.eq .Lret0
+.Ltiny8proc:
+ /* Perhaps we can do better than this. */
+ ldrb data1w, [src1], #1
+ ldrb data2w, [src2], #1
+ subs limit, limit, #1
+ /*
+ * ne means the current limit is still > 0. When limit reaches zero
+ * (Z=1), the first ccmp sets cs=0, so the next ccmp sets the flags
+ * from its immediate (all zero) and the loop ends.
+ */
+ ccmp data1w, #1, #0, ne /* NZCV = 0b0000. */
+ ccmp data1w, data2w, #0, cs /* NZCV = 0b0000. */
+ b.eq .Ltiny8proc
+ sub result, data1, data2
+ ret
+
+.Lret0:
+ mov result, #0
+ ret
+ENDPROC(strncmp)
--
1.7.9.5
* [PATCH 6/6] arm64: lib: Implement optimized string length routines
2013-12-11 6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
` (4 preceding siblings ...)
2013-12-11 6:24 ` [PATCH 5/6] arm64: lib: Implement optimized string compare routines zhichang.yuan at linaro.org
@ 2013-12-11 6:24 ` zhichang.yuan at linaro.org
5 siblings, 0 replies; 14+ messages in thread
From: zhichang.yuan at linaro.org @ 2013-12-11 6:24 UTC (permalink / raw)
To: linux-arm-kernel
From: "zhichang.yuan" <zhichang.yuan@linaro.org>
This patch, based on Linaro's Cortex Strings library, adds
assembly optimized strlen() and strnlen() functions.
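Once a Dword is known to contain the NUL terminator, the length is the
number of bytes already passed plus the byte index of the terminator
within that word. The assembly derives the index with rev and clz; on
little-endian data this is equivalent to the C sketch below (illustrative
only, not part of the patch; nul_byte_index is a made-up name and
__builtin_ctzll is a GCC/Clang builtin used only for the example).

#include <stdint.h>

/*
 * Illustrative only: has_nul is the word-at-a-time zero-byte mask, which
 * has bit 7 of byte i set for the first zero byte at index i (little
 * endian). Callers must ensure has_nul is non-zero.
 */
static unsigned int nul_byte_index(uint64_t has_nul)
{
        return (unsigned int)(__builtin_ctzll(has_nul) / 8);
}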
Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
---
arch/arm64/include/asm/string.h | 6 ++
arch/arm64/kernel/arm64ksyms.c | 2 +
arch/arm64/lib/Makefile | 3 +-
arch/arm64/lib/strlen.S | 131 ++++++++++++++++++++++++++++
arch/arm64/lib/strnlen.S | 179 +++++++++++++++++++++++++++++++++++++++
5 files changed, 320 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/lib/strlen.S
create mode 100644 arch/arm64/lib/strnlen.S
diff --git a/arch/arm64/include/asm/string.h b/arch/arm64/include/asm/string.h
index 6133f49..64d2d48 100644
--- a/arch/arm64/include/asm/string.h
+++ b/arch/arm64/include/asm/string.h
@@ -28,6 +28,12 @@ extern int strcmp(const char *, const char *);
#define __HAVE_ARCH_STRNCMP
extern int strncmp(const char *, const char *, __kernel_size_t);
+#define __HAVE_ARCH_STRLEN
+extern __kernel_size_t strlen(const char *);
+
+#define __HAVE_ARCH_STRNLEN
+extern __kernel_size_t strnlen(const char *, __kernel_size_t);
+
#define __HAVE_ARCH_MEMCPY
extern void *memcpy(void *, const void *, __kernel_size_t);
diff --git a/arch/arm64/kernel/arm64ksyms.c b/arch/arm64/kernel/arm64ksyms.c
index d61fa39..c71b435 100644
--- a/arch/arm64/kernel/arm64ksyms.c
+++ b/arch/arm64/kernel/arm64ksyms.c
@@ -49,6 +49,8 @@ EXPORT_SYMBOL(strchr);
EXPORT_SYMBOL(strrchr);
EXPORT_SYMBOL(strcmp);
EXPORT_SYMBOL(strncmp);
+EXPORT_SYMBOL(strlen);
+EXPORT_SYMBOL(strnlen);
EXPORT_SYMBOL(memset);
EXPORT_SYMBOL(memcpy);
EXPORT_SYMBOL(memmove);
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 5ded21f..e64a94c 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -3,4 +3,5 @@ lib-y := bitops.o delay.o \
copy_from_user.o copy_to_user.o copy_in_user.o \
copy_page.o clear_page.o \
memchr.o memcpy.o memmove.o memset.o memcmp.o \
- strchr.o strrchr.o strcmp.o strncmp.o
+ strchr.o strrchr.o strcmp.o strncmp.o \
+ strlen.o strnlen.o
diff --git a/arch/arm64/lib/strlen.S b/arch/arm64/lib/strlen.S
new file mode 100644
index 0000000..302f6dd
--- /dev/null
+++ b/arch/arm64/lib/strlen.S
@@ -0,0 +1,131 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/*
+ * calculate the length of a string
+ *
+ * Parameters:
+ * x0 - const string pointer
+ * Returns:
+ * x0 - the length of the string
+ */
+
+/* Arguments and results. */
+#define srcin x0
+#define len x0
+
+/* Locals and temporaries. */
+#define src x1
+#define data1 x2
+#define data2 x3
+#define data2a x4
+#define has_nul1 x5
+#define has_nul2 x6
+#define tmp1 x7
+#define tmp2 x8
+#define tmp3 x9
+#define tmp4 x10
+#define zeroones x11
+#define pos x12
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+ENTRY(strlen)
+ mov zeroones, #REP8_01
+ bic src, srcin, #15
+ ands tmp1, srcin, #15
+ b.ne .Lmisaligned
+ /*
+ * NUL detection works on the principle that (X - 1) & (~X) & 0x80
+ * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+ * can be done in parallel across the entire word.
+ */
+ /*
+ * The inner loop deals with two Dwords at a time. This has a
+ * slightly higher start-up cost, but we should win quite quickly,
+ * especially on cores with a high number of issue slots per
+ * cycle, as we get much better parallelism out of the operations.
+ */
+.Lloop:
+ ldp data1, data2, [src], #16
+.Lrealigned:
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ sub tmp3, data2, zeroones
+ orr tmp4, data2, #REP8_7f
+ bic has_nul1, tmp1, tmp2
+ bics has_nul2, tmp3, tmp4
+ ccmp has_nul1, #0, #0, eq /* NZCV = 0000 */
+ b.eq .Lloop
+
+ sub len, src, srcin
+ cbz has_nul1, .Lnul_in_data2
+#ifdef __ARM64EB__
+ mov data2, data1
+#endif
+ sub len, len, #8
+ mov has_nul2, has_nul1
+.Lnul_in_data2:
+#ifdef __ARM64EB__
+ /*
+ * For big-endian, carry propagation (if the final byte in the
+ * string is 0x01) means we cannot use has_nul directly. The
+ * easiest way to get the correct byte is to byte-swap the data
+ * and calculate the syndrome a second time.
+ */
+ rev data2, data2
+ sub tmp1, data2, zeroones
+ orr tmp2, data2, #REP8_7f
+ bic has_nul2, tmp1, tmp2
+#endif
+ sub len, len, #8
+ rev has_nul2, has_nul2
+ clz pos, has_nul2
+ add len, len, pos, lsr #3 /* Bits to bytes. */
+ ret
+
+.Lmisaligned:
+ cmp tmp1, #8
+ neg tmp1, tmp1
+ ldp data1, data2, [src], #16
+ lsl tmp1, tmp1, #3 /* Bytes beyond alignment -> bits. */
+ mov tmp2, #~0
+#ifdef __ARM64EB__
+ /* Big-endian. Early bytes are at MSB. */
+ lsl tmp2, tmp2, tmp1 /* Shift (tmp1 & 63). */
+#else
+ /* Little-endian. Early bytes are at LSB. */
+ lsr tmp2, tmp2, tmp1 /* Shift (tmp1 & 63). */
+#endif
+ orr data1, data1, tmp2
+ orr data2a, data2, tmp2
+ csinv data1, data1, xzr, le
+ csel data2, data2, data2a, le
+ b .Lrealigned
+ENDPROC(strlen)
diff --git a/arch/arm64/lib/strnlen.S b/arch/arm64/lib/strnlen.S
new file mode 100644
index 0000000..2293f7b
--- /dev/null
+++ b/arch/arm64/lib/strnlen.S
@@ -0,0 +1,179 @@
+/*
+ * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/linkage.h>
+#include <asm/assembler.h>
+
+/*
+ * determine the length of a fixed-size string
+ *
+ * Parameters:
+ * x0 - const string pointer
+ * x1 - maximal string length
+ * Returns:
+ * x0 - the length of the string, or the limit if no NUL byte is found within it
+ */
+
+/* Arguments and results. */
+#define srcin x0
+#define len x0
+#define limit x1
+
+/* Locals and temporaries. */
+#define src x2
+#define data1 x3
+#define data2 x4
+#define data2a x5
+#define has_nul1 x6
+#define has_nul2 x7
+#define tmp1 x8
+#define tmp2 x9
+#define tmp3 x10
+#define tmp4 x11
+#define zeroones x12
+#define pos x13
+#define limit_wd x14
+
+#define REP8_01 0x0101010101010101
+#define REP8_7f 0x7f7f7f7f7f7f7f7f
+#define REP8_80 0x8080808080808080
+
+ENTRY(strnlen)
+ cbz limit, .Lhit_limit
+ mov zeroones, #REP8_01
+ bic src, srcin, #15
+ ands tmp1, srcin, #15
+ b.ne .Lmisaligned
+ /* Calculate the number of full and partial words -1. */
+ sub limit_wd, limit, #1 /* Limit != 0, so no underflow. */
+ lsr limit_wd, limit_wd, #4 /* Convert to Qwords. */
+
+ /*
+ * NUL detection works on the principle that (X - 1) & (~X) & 0x80
+ * (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
+ * can be done in parallel across the entire word.
+ */
+ /*
+ * The inner loop deals with two Dwords at a time. This has a
+ * slightly higher start-up cost, but we should win quite quickly,
+ * especially on cores with a high number of issue slots per
+ * cycle, as we get much better parallelism out of the operations.
+ */
+.Lloop:
+ ldp data1, data2, [src], #16
+.Lrealigned:
+ sub tmp1, data1, zeroones
+ orr tmp2, data1, #REP8_7f
+ sub tmp3, data2, zeroones
+ orr tmp4, data2, #REP8_7f
+ bic has_nul1, tmp1, tmp2
+ bic has_nul2, tmp3, tmp4
+ subs limit_wd, limit_wd, #1
+ orr tmp1, has_nul1, has_nul2
+ ccmp tmp1, #0, #0, pl /* NZCV = 0000 */
+ b.eq .Lloop
+
+ cbz tmp1, .Lhit_limit /* No null in final Qword. */
+
+ /*
+ * We know there's a null in the final Qword. The easiest thing
+ * to do now is work out the length of the string and return
+ * MIN (len, limit).
+ */
+ sub len, src, srcin
+ cbz has_nul1, .Lnul_in_data2
+#ifdef __ARM64EB__
+ mov data2, data1
+#endif
+ sub len, len, #8
+ mov has_nul2, has_nul1
+.Lnul_in_data2:
+#ifdef __ARM64EB__
+ /*
+ * For big-endian, carry propagation (if the final byte in the
+ * string is 0x01) means we cannot use has_nul directly. The
+ * easiest way to get the correct byte is to byte-swap the data
+ * and calculate the syndrome a second time.
+ */
+ rev data2, data2
+ sub tmp1, data2, zeroones
+ orr tmp2, data2, #REP8_7f
+ bic has_nul2, tmp1, tmp2
+#endif
+ sub len, len, #8
+ rev has_nul2, has_nul2
+ clz pos, has_nul2
+ add len, len, pos, lsr #3 /* Bits to bytes. */
+ cmp len, limit
+ csel len, len, limit, ls /* Return the lower value. */
+ ret
+
+.Lmisaligned:
+ /*
+ * Deal with a partial first word.
+ * We're doing two things in parallel here;
+ * 1) Calculate the number of words (but avoiding overflow if
+ * limit is near ULONG_MAX) - to do this we need to work out
+ * limit + tmp1 - 1 as a 65-bit value before shifting it;
+ * 2) Load and mask the initial data words - we force the bytes
+ * before the ones we are interested in to 0xff - this ensures
+ * early bytes will not hit any zero detection.
+ */
+ /*
+ * The load takes several cycles to complete; the independent
+ * instructions that follow can execute while it is in flight.
+ */
+ ldp data1, data2, [src], #16
+
+ sub limit_wd, limit, #1
+ and tmp3, limit_wd, #15
+ lsr limit_wd, limit_wd, #4
+
+ add tmp3, tmp3, tmp1
+ add limit_wd, limit_wd, tmp3, lsr #4
+
+ neg tmp4, tmp1
+ lsl tmp4, tmp4, #3 /* Bytes beyond alignment -> bits. */
+
+ mov tmp2, #~0
+#ifdef __ARM64EB__
+ /* Big-endian. Early bytes are at MSB. */
+ lsl tmp2, tmp2, tmp4 /* Shift (tmp1 & 63). */
+#else
+ /* Little-endian. Early bytes are at LSB. */
+ lsr tmp2, tmp2, tmp4 /* Shift (tmp1 & 63). */
+#endif
+ cmp tmp1, #8
+
+ orr data1, data1, tmp2
+ orr data2a, data2, tmp2
+
+ csinv data1, data1, xzr, le
+ csel data2, data2, data2a, le
+ b .Lrealigned
+
+.Lhit_limit:
+ mov len, limit
+ ret
+ENDPROC(strnlen)
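
For readers following the NUL-detection comments in strlen.S and strnlen.S above, the
following is a minimal C sketch (not part of the patch; all names are illustrative) of
the same byte-wise zero test and of turning the resulting syndrome into a byte index
for the little-endian case:

#include <stdint.h>
#include <stdio.h>

#define REP8_01 0x0101010101010101ULL
#define REP8_7f 0x7f7f7f7f7f7f7f7fULL

/* Non-zero iff at least one byte of x is zero:
 * (X - 1) & ~(X | 0x7f), applied to all eight bytes at once. */
static uint64_t zero_byte_syndrome(uint64_t x)
{
	return (x - REP8_01) & ~(x | REP8_7f);
}

int main(void)
{
	/* Little-endian view of the bytes "abc\0defg": byte 3 is NUL. */
	uint64_t word = 0x6766656400636261ULL;
	uint64_t syn = zero_byte_syndrome(word);

	if (syn)
		/* On little-endian, the lowest set bit of the syndrome marks
		 * the first zero byte; the assembly gets the same effect
		 * with rev + clz. */
		printf("first NUL at byte %d\n", __builtin_ctzll(syn) >> 3);
	return 0;
}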
--
1.7.9.5
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 1/6] arm64: lib: Implement optimized memcpy routine
2013-12-11 6:24 ` [PATCH 1/6] arm64: lib: Implement optimized memcpy routine zhichang.yuan at linaro.org
@ 2013-12-16 16:08 ` Will Deacon
2013-12-18 1:54 ` zhichang.yuan
0 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2013-12-16 16:08 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Dec 11, 2013 at 06:24:37AM +0000, zhichang.yuan at linaro.org wrote:
> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
>
> This patch, based on Linaro's Cortex Strings library, improves
> the performance of the assembly optimized memcpy() function.
>
> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
> ---
> arch/arm64/lib/memcpy.S | 182 +++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 160 insertions(+), 22 deletions(-)
>
> diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S
> index 27b5003..e3bab71 100644
> --- a/arch/arm64/lib/memcpy.S
> +++ b/arch/arm64/lib/memcpy.S
> @@ -1,5 +1,13 @@
> /*
> * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
> *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License version 2 as
> @@ -27,27 +35,157 @@
> * Returns:
> * x0 - dest
> */
> +#define dstin x0
> +#define src x1
> +#define count x2
> +#define tmp1 x3
> +#define tmp1w w3
> +#define tmp2 x4
> +#define tmp2w w4
> +#define tmp3 x5
> +#define tmp3w w5
> +#define dst x6
> +
> +#define A_l x7
> +#define A_h x8
> +#define B_l x9
> +#define B_h x10
> +#define C_l x11
> +#define C_h x12
> +#define D_l x13
> +#define D_h x14
Use .req instead of #define?
> ENTRY(memcpy)
> - mov x4, x0
> - subs x2, x2, #8
> - b.mi 2f
> -1: ldr x3, [x1], #8
> - subs x2, x2, #8
> - str x3, [x4], #8
> - b.pl 1b
> -2: adds x2, x2, #4
> - b.mi 3f
> - ldr w3, [x1], #4
> - sub x2, x2, #4
> - str w3, [x4], #4
> -3: adds x2, x2, #2
> - b.mi 4f
> - ldrh w3, [x1], #2
> - sub x2, x2, #2
> - strh w3, [x4], #2
> -4: adds x2, x2, #1
> - b.mi 5f
> - ldrb w3, [x1]
> - strb w3, [x4]
> -5: ret
> + mov dst, dstin
> + cmp count, #16
> + /*If memory length is less than 16, stp or ldp can not be used.*/
> + b.lo .Ltiny15
> +.Lover16:
> + neg tmp2, src
> + ands tmp2, tmp2, #15/* Bytes to reach alignment. */
> + b.eq .LSrcAligned
> + sub count, count, tmp2
> + /*
> + * Use ldp and sdp to copy 16 bytes,then backward the src to
> + * aligned address.This way is more efficient.
> + * But the risk overwriting the source area exists.Here,prefer to
> + * access memory forward straight,no backward.It will need a bit
> + * more instructions, but on the same time,the accesses are aligned.
> + */
This comment reads very badly:
- sdp doesn't exist
- `more efficient' than what? How is this measured?
- `access memory forward straight,no backward' What?
Please re-write it in a clearer fashion, so that reviewers have some
understanding of your optimisations when potentially trying to change the
code later on.
> + tbz tmp2, #0, 1f
> + ldrb tmp1w, [src], #1
> + strb tmp1w, [dst], #1
> +1:
> + tbz tmp2, #1, 1f
> + ldrh tmp1w, [src], #2
> + strh tmp1w, [dst], #2
> +1:
> + tbz tmp2, #2, 1f
> + ldr tmp1w, [src], #4
> + str tmp1w, [dst], #4
> +1:
Three labels called '1:' ? Can you make them unique please (the old code
just incremented a counter).
> + tbz tmp2, #3, .LSrcAligned
> + ldr tmp1, [src],#8
> + str tmp1, [dst],#8
> +
> +.LSrcAligned:
> + cmp count, #64
> + b.ge .Lcpy_over64
> + /*
> + * Deal with small copies quickly by dropping straight into the
> + * exit block.
> + */
> +.Ltail63:
> + /*
> + * Copy up to 48 bytes of data. At this point we only need the
> + * bottom 6 bits of count to be accurate.
> + */
> + ands tmp1, count, #0x30
> + b.eq .Ltiny15
> + cmp tmp1w, #0x20
> + b.eq 1f
> + b.lt 2f
> + ldp A_l, A_h, [src], #16
> + stp A_l, A_h, [dst], #16
> +1:
> + ldp A_l, A_h, [src], #16
> + stp A_l, A_h, [dst], #16
> +2:
> + ldp A_l, A_h, [src], #16
> + stp A_l, A_h, [dst], #16
> +.Ltiny15:
> + /*
> + * To make memmove simpler, here don't make src backwards.
> + * since backwards will probably overwrite the src area where src
> + * data for nex copy located,if dst is not so far from src.
> + */
Another awful comment...
> + tbz count, #3, 1f
> + ldr tmp1, [src], #8
> + str tmp1, [dst], #8
> +1:
> + tbz count, #2, 1f
> + ldr tmp1w, [src], #4
> + str tmp1w, [dst], #4
> +1:
> + tbz count, #1, 1f
> + ldrh tmp1w, [src], #2
> + strh tmp1w, [dst], #2
> +1:
... and more of these labels.
> + tbz count, #0, .Lexitfunc
> + ldrb tmp1w, [src]
> + strb tmp1w, [dst]
> +
> +.Lexitfunc:
> + ret
> +
> +.Lcpy_over64:
> + subs count, count, #128
> + b.ge .Lcpy_body_large
> + /*
> + * Less than 128 bytes to copy, so handle 64 here and then jump
> + * to the tail.
> + */
> + ldp A_l, A_h, [src],#16
> + stp A_l, A_h, [dst],#16
> + ldp B_l, B_h, [src],#16
> + ldp C_l, C_h, [src],#16
> + stp B_l, B_h, [dst],#16
> + stp C_l, C_h, [dst],#16
> + ldp D_l, D_h, [src],#16
> + stp D_l, D_h, [dst],#16
> +
> + tst count, #0x3f
> + b.ne .Ltail63
> + ret
> +
> + /*
> + * Critical loop. Start at a new cache line boundary. Assuming
> + * 64 bytes per line this ensures the entire loop is in one line.
> + */
> + .p2align 6
Can you parameterise this with L1_CACHE_SHIFT instead?
Will
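
For readers less familiar with the tbz-based tail handling quoted in the review above,
here is a brief C sketch of the same idea (the function name and structure are
illustrative, not the patch's code):

#include <stddef.h>
#include <string.h>

/* Copy the last (up to 15) residual bytes by testing individual bits of
 * count, mirroring the tbz/ldr/str chains in the assembly above. */
static void copy_tail15(unsigned char *dst, const unsigned char *src,
			size_t count)
{
	if (count & 8) { memcpy(dst, src, 8); dst += 8; src += 8; }
	if (count & 4) { memcpy(dst, src, 4); dst += 4; src += 4; }
	if (count & 2) { memcpy(dst, src, 2); dst += 2; src += 2; }
	if (count & 1) { *dst = *src; }
}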
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 3/6] arm64: lib: Implement optimized memset routine
2013-12-11 6:24 ` [PATCH 3/6] arm64: lib: Implement optimized memset routine zhichang.yuan at linaro.org
@ 2013-12-16 16:55 ` Will Deacon
2013-12-18 2:37 ` zhichang.yuan
0 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2013-12-16 16:55 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Dec 11, 2013 at 06:24:39AM +0000, zhichang.yuan at linaro.org wrote:
> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
>
> This patch, based on Linaro's Cortex Strings library, improves
> the performance of the assembly optimized memset() function.
>
> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
> ---
> arch/arm64/lib/memset.S | 227 +++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 201 insertions(+), 26 deletions(-)
>
> diff --git a/arch/arm64/lib/memset.S b/arch/arm64/lib/memset.S
> index 87e4a68..90b973e 100644
> --- a/arch/arm64/lib/memset.S
> +++ b/arch/arm64/lib/memset.S
> @@ -1,13 +1,21 @@
> /*
> * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
> *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License version 2 as
> * published by the Free Software Foundation.
> *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
> + * kind, whether express or implied; without even the implied warranty
> + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
Why are you changing this?
> * GNU General Public License for more details.
> *
> * You should have received a copy of the GNU General Public License
> @@ -18,7 +26,7 @@
> #include <asm/assembler.h>
>
> /*
> - * Fill in the buffer with character c (alignment handled by the hardware)
> + * Fill in the buffer with character c
> *
> * Parameters:
> * x0 - buf
> @@ -27,27 +35,194 @@
> * Returns:
> * x0 - buf
> */
> +
> +/* By default we assume that the DC instruction can be used to zero
> +* data blocks more efficiently. In some circumstances this might be
> +* unsafe, for example in an asymmetric multiprocessor environment with
> +* different DC clear lengths (neither the upper nor lower lengths are
> +* safe to use). The feature can be disabled by defining DONT_USE_DC.
> +*/
We already use DC ZVA for clear_page, so I think we should start off using
it unconditionally. If we need to revisit this later, we can, but adding a
random #ifdef doesn't feel like something we need initially.
For the benefit of anybody else reviewing this; the DC ZVA instruction still
works for normal, non-cacheable memory.
The comments I made on the earlier patch wrt quality of comments and labels
seem to apply to all of the patches in this series.
Will
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 4/6] arm64: lib: Implement optimized memcmp routine
2013-12-11 6:24 ` [PATCH 4/6] arm64: lib: Implement optimized memcmp routine zhichang.yuan at linaro.org
@ 2013-12-16 16:56 ` Will Deacon
2013-12-19 8:18 ` Deepak Saxena
0 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2013-12-16 16:56 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Dec 11, 2013 at 06:24:40AM +0000, zhichang.yuan at linaro.org wrote:
> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
>
> This patch, based on Linaro's Cortex Strings library, adds
> an assembly optimized memcmp() function.
[...]
> +#ifndef __ARM64EB__
> + rev diff, diff
> + rev data1, data1
> + rev data2, data2
> +#endif
Given that I can't see you defining __ARM64EB__, I'd take a guess at you
never having tested this code on a big-endian machine.
Please can you fix that?
Will
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/6] arm64: lib: Implement optimized memcpy routine
2013-12-16 16:08 ` Will Deacon
@ 2013-12-18 1:54 ` zhichang.yuan
0 siblings, 0 replies; 14+ messages in thread
From: zhichang.yuan @ 2013-12-18 1:54 UTC (permalink / raw)
To: linux-arm-kernel
Hi Will
Thanks for your reply.
I think your comments are fair; I will modify the patches to address those issues.
Once the fixes are ready, I will submit v2 of the series.
Thanks again!
Zhichang
On 2013/12/17 00:08, Will Deacon wrote:
> On Wed, Dec 11, 2013 at 06:24:37AM +0000, zhichang.yuan at linaro.org wrote:
>> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
>>
>> This patch, based on Linaro's Cortex Strings library, improves
>> the performance of the assembly optimized memcpy() function.
>>
>> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>> ---
>> arch/arm64/lib/memcpy.S | 182 +++++++++++++++++++++++++++++++++++++++++------
>> 1 file changed, 160 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S
>> index 27b5003..e3bab71 100644
>> --- a/arch/arm64/lib/memcpy.S
>> +++ b/arch/arm64/lib/memcpy.S
>> @@ -1,5 +1,13 @@
>> /*
>> * Copyright (C) 2013 ARM Ltd.
>> + * Copyright (C) 2013 Linaro.
>> + *
>> + * This code is based on glibc cortex strings work originally authored by Linaro
>> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
>> + * be found @
>> + *
>> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
>> + * files/head:/src/aarch64/
>> *
>> * This program is free software; you can redistribute it and/or modify
>> * it under the terms of the GNU General Public License version 2 as
>> @@ -27,27 +35,157 @@
>> * Returns:
>> * x0 - dest
>> */
>> +#define dstin x0
>> +#define src x1
>> +#define count x2
>> +#define tmp1 x3
>> +#define tmp1w w3
>> +#define tmp2 x4
>> +#define tmp2w w4
>> +#define tmp3 x5
>> +#define tmp3w w5
>> +#define dst x6
>> +
>> +#define A_l x7
>> +#define A_h x8
>> +#define B_l x9
>> +#define B_h x10
>> +#define C_l x11
>> +#define C_h x12
>> +#define D_l x13
>> +#define D_h x14
>
> Use .req instead of #define?
>
>> ENTRY(memcpy)
>> - mov x4, x0
>> - subs x2, x2, #8
>> - b.mi 2f
>> -1: ldr x3, [x1], #8
>> - subs x2, x2, #8
>> - str x3, [x4], #8
>> - b.pl 1b
>> -2: adds x2, x2, #4
>> - b.mi 3f
>> - ldr w3, [x1], #4
>> - sub x2, x2, #4
>> - str w3, [x4], #4
>> -3: adds x2, x2, #2
>> - b.mi 4f
>> - ldrh w3, [x1], #2
>> - sub x2, x2, #2
>> - strh w3, [x4], #2
>> -4: adds x2, x2, #1
>> - b.mi 5f
>> - ldrb w3, [x1]
>> - strb w3, [x4]
>> -5: ret
>> + mov dst, dstin
>> + cmp count, #16
>> + /*If memory length is less than 16, stp or ldp can not be used.*/
>> + b.lo .Ltiny15
>> +.Lover16:
>> + neg tmp2, src
>> + ands tmp2, tmp2, #15/* Bytes to reach alignment. */
>> + b.eq .LSrcAligned
>> + sub count, count, tmp2
>> + /*
>> + * Use ldp and sdp to copy 16 bytes,then backward the src to
>> + * aligned address.This way is more efficient.
>> + * But the risk overwriting the source area exists.Here,prefer to
>> + * access memory forward straight,no backward.It will need a bit
>> + * more instructions, but on the same time,the accesses are aligned.
>> + */
>
> This comment reads very badly:
>
> - sdp doesn't exist
> - `more efficient' than what? How is this measured?
> - `access memory forward straight,no backward' What?
>
> Please re-write it in a clearer fashion, so that reviewers have some
> understanding of your optimisations when potentially trying to change the
> code later on.
>
>> + tbz tmp2, #0, 1f
>> + ldrb tmp1w, [src], #1
>> + strb tmp1w, [dst], #1
>> +1:
>> + tbz tmp2, #1, 1f
>> + ldrh tmp1w, [src], #2
>> + strh tmp1w, [dst], #2
>> +1:
>> + tbz tmp2, #2, 1f
>> + ldr tmp1w, [src], #4
>> + str tmp1w, [dst], #4
>> +1:
>
> Three labels called '1:' ? Can you make them unique please (the old code
> just incremented a counter).
>
>> + tbz tmp2, #3, .LSrcAligned
>> + ldr tmp1, [src],#8
>> + str tmp1, [dst],#8
>> +
>> +.LSrcAligned:
>> + cmp count, #64
>> + b.ge .Lcpy_over64
>> + /*
>> + * Deal with small copies quickly by dropping straight into the
>> + * exit block.
>> + */
>> +.Ltail63:
>> + /*
>> + * Copy up to 48 bytes of data. At this point we only need the
>> + * bottom 6 bits of count to be accurate.
>> + */
>> + ands tmp1, count, #0x30
>> + b.eq .Ltiny15
>> + cmp tmp1w, #0x20
>> + b.eq 1f
>> + b.lt 2f
>> + ldp A_l, A_h, [src], #16
>> + stp A_l, A_h, [dst], #16
>> +1:
>> + ldp A_l, A_h, [src], #16
>> + stp A_l, A_h, [dst], #16
>> +2:
>> + ldp A_l, A_h, [src], #16
>> + stp A_l, A_h, [dst], #16
>> +.Ltiny15:
>> + /*
>> + * To make memmove simpler, here don't make src backwards.
>> + * since backwards will probably overwrite the src area where src
>> + * data for nex copy located,if dst is not so far from src.
>> + */
>
> Another awful comment...
>
>> + tbz count, #3, 1f
>> + ldr tmp1, [src], #8
>> + str tmp1, [dst], #8
>> +1:
>> + tbz count, #2, 1f
>> + ldr tmp1w, [src], #4
>> + str tmp1w, [dst], #4
>> +1:
>> + tbz count, #1, 1f
>> + ldrh tmp1w, [src], #2
>> + strh tmp1w, [dst], #2
>> +1:
>
> ... and more of these labels.
>
>> + tbz count, #0, .Lexitfunc
>> + ldrb tmp1w, [src]
>> + strb tmp1w, [dst]
>> +
>> +.Lexitfunc:
>> + ret
>> +
>> +.Lcpy_over64:
>> + subs count, count, #128
>> + b.ge .Lcpy_body_large
>> + /*
>> + * Less than 128 bytes to copy, so handle 64 here and then jump
>> + * to the tail.
>> + */
>> + ldp A_l, A_h, [src],#16
>> + stp A_l, A_h, [dst],#16
>> + ldp B_l, B_h, [src],#16
>> + ldp C_l, C_h, [src],#16
>> + stp B_l, B_h, [dst],#16
>> + stp C_l, C_h, [dst],#16
>> + ldp D_l, D_h, [src],#16
>> + stp D_l, D_h, [dst],#16
>> +
>> + tst count, #0x3f
>> + b.ne .Ltail63
>> + ret
>> +
>> + /*
>> + * Critical loop. Start at a new cache line boundary. Assuming
>> + * 64 bytes per line this ensures the entire loop is in one line.
>> + */
>> + .p2align 6
>
> Can you parameterise this with L1_CACHE_SHIFT instead?
>
> Will
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 3/6] arm64: lib: Implement optimized memset routine
2013-12-16 16:55 ` Will Deacon
@ 2013-12-18 2:37 ` zhichang.yuan
0 siblings, 0 replies; 14+ messages in thread
From: zhichang.yuan @ 2013-12-18 2:37 UTC (permalink / raw)
To: linux-arm-kernel
On 2013/12/17 00:55, Will Deacon wrote:
> On Wed, Dec 11, 2013 at 06:24:39AM +0000, zhichang.yuan at linaro.org wrote:
>> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
>>
>> This patch, based on Linaro's Cortex Strings library, improves
>> the performance of the assembly optimized memset() function.
>>
>> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>> ---
>> arch/arm64/lib/memset.S | 227 +++++++++++++++++++++++++++++++++++++++++------
>> 1 file changed, 201 insertions(+), 26 deletions(-)
>>
>> diff --git a/arch/arm64/lib/memset.S b/arch/arm64/lib/memset.S
>> index 87e4a68..90b973e 100644
>> --- a/arch/arm64/lib/memset.S
>> +++ b/arch/arm64/lib/memset.S
>> @@ -1,13 +1,21 @@
>> /*
>> * Copyright (C) 2013 ARM Ltd.
>> + * Copyright (C) 2013 Linaro.
>> + *
>> + * This code is based on glibc cortex strings work originally authored by Linaro
>> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
>> + * be found @
>> + *
>> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
>> + * files/head:/src/aarch64/
>> *
>> * This program is free software; you can redistribute it and/or modify
>> * it under the terms of the GNU General Public License version 2 as
>> * published by the Free Software Foundation.
>> *
>> - * This program is distributed in the hope that it will be useful,
>> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> + * This program is distributed "as is" WITHOUT ANY WARRANTY of any
>> + * kind, whether express or implied; without even the implied warranty
>> + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>
> Why are you changing this?
>
>> * GNU General Public License for more details.
>> *
>> * You should have received a copy of the GNU General Public License
>> @@ -18,7 +26,7 @@
>> #include <asm/assembler.h>
>>
>> /*
>> - * Fill in the buffer with character c (alignment handled by the hardware)
>> + * Fill in the buffer with character c
>> *
>> * Parameters:
>> * x0 - buf
>> @@ -27,27 +35,194 @@
>> * Returns:
>> * x0 - buf
>> */
>> +
>> +/* By default we assume that the DC instruction can be used to zero
>> +* data blocks more efficiently. In some circumstances this might be
>> +* unsafe, for example in an asymmetric multiprocessor environment with
>> +* different DC clear lengths (neither the upper nor lower lengths are
>> +* safe to use). The feature can be disabled by defining DONT_USE_DC.
>> +*/
This comment is not quite correct; for the AMP case, I think DONT_USE_DC is also unnecessary.
Since this memset reads dczid_el0 on each call, it always gets the current, correct
value from the system register.
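
As an illustration of what that system register provides, here is a minimal user-space
C sketch of reading DCZID_EL0 (the helper name is illustrative; the memset patch issues
the equivalent mrs directly in assembly):

#include <stdio.h>

/* DCZID_EL0: bits [3:0] (BS) give log2 of the DC ZVA block size in 4-byte
 * words, and bit 4 (DZP) is set when use of DC ZVA is prohibited. */
static inline unsigned long read_dczid_el0(void)
{
	unsigned long val;

	asm volatile("mrs %0, dczid_el0" : "=r" (val));
	return val;
}

int main(void)
{
	unsigned long dczid = read_dczid_el0();

	if (dczid & (1UL << 4))
		puts("DC ZVA is prohibited");
	else
		printf("DC ZVA block size: %lu bytes\n", 4UL << (dczid & 0xf));
	return 0;
}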
>
> We already use DC ZVA for clear_page, so I think we should start off using
> it unconditionally. If we need to revisit this later, we can, but adding a
> random #ifdef doesn't feel like something we need initially.
As for the DONT_USE_DC macro, I am not sure whether the default configuration of some
systems prohibits DC ZVA. In that case the attempt to use DC ZVA simply bails out and
falls back to the normal memset path; it causes no error, only a few extra instructions.
I initially thought guarding this with DONT_USE_DC would be slightly more efficient.
In practice, even if the system does not support DC ZVA, the cost of reading the
dczid_el0 register is small, so it is not worth introducing a kernel macro for this.
I will modify it.
Zhichang
>
> For the benefit of anybody else reviewing this; the DC ZVA instruction still
> works for normal, non-cacheable memory.
>
> The comments I made on the earlier patch wrt quality of comments and labels
> seem to apply to all of the patches in this series.
>
> Will
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 4/6] arm64: lib: Implement optimized memcmp routine
2013-12-16 16:56 ` Will Deacon
@ 2013-12-19 8:18 ` Deepak Saxena
2013-12-19 16:14 ` Catalin Marinas
0 siblings, 1 reply; 14+ messages in thread
From: Deepak Saxena @ 2013-12-19 8:18 UTC (permalink / raw)
To: linux-arm-kernel
On 16 December 2013 08:56, Will Deacon <will.deacon@arm.com> wrote:
> On Wed, Dec 11, 2013 at 06:24:40AM +0000, zhichang.yuan at linaro.org wrote:
>> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
>>
>> This patch, based on Linaro's Cortex Strings library, adds
>> an assembly optimized memcmp() function.
>
> [...]
>
>> +#ifndef __ARM64EB__
>> + rev diff, diff
>> + rev data1, data1
>> + rev data2, data2
>> +#endif
>
> Given that I can't see you defining __ARM64EB__, I'd take a guess at you
> never having tested this code on a big-endian machine.
We'll test it, but I believe __ARM64BE__ is a compiler constant that
is automagically present when building big endian, similar to
__ARMBE__
>
> Please can you fix that?
>
> Will
--
Deepak Saxena - Kernel Working Group Lead
Linaro.org | Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 4/6] arm64: lib: Implement optimized memcmp routine
2013-12-19 8:18 ` Deepak Saxena
@ 2013-12-19 16:14 ` Catalin Marinas
0 siblings, 0 replies; 14+ messages in thread
From: Catalin Marinas @ 2013-12-19 16:14 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Dec 19, 2013 at 08:18:59AM +0000, Deepak Saxena wrote:
> On 16 December 2013 08:56, Will Deacon <will.deacon@arm.com> wrote:
> > On Wed, Dec 11, 2013 at 06:24:40AM +0000, zhichang.yuan at linaro.org wrote:
> >> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
> >>
> >> This patch, based on Linaro's Cortex Strings library, adds
> >> an assembly optimized memcmp() function.
> >
> > [...]
> >
> >> +#ifndef __ARM64EB__
> >> + rev diff, diff
> >> + rev data1, data1
> >> + rev data2, data2
> >> +#endif
> >
> > Given that I can't see you defining __ARM64EB__, I'd take a guess at you
> > never having tested this code on a big-endian machine.
>
> We'll test it, but I believe __ARM64BE__ is a compiler constant that
> is automagically present when building big endian, similar to
> __ARMBE__
It's __AARCH64EB__ (that's how we spot people not testing on big-endian ;)).
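
For reference, a tiny probe (illustrative, not part of the patches) shows which
endianness macro the AArch64 compiler predefines for a given build:

#include <stdio.h>

int main(void)
{
#if defined(__AARCH64EB__)
	puts("__AARCH64EB__ defined: big-endian AArch64 build");
#elif defined(__AARCH64EL__)
	puts("__AARCH64EL__ defined: little-endian AArch64 build");
#else
	puts("neither macro defined: not an AArch64 target");
#endif
	return 0;
}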
--
Catalin
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2013-12-19 16:14 UTC | newest]
Thread overview: 14+ messages
2013-12-11 6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 1/6] arm64: lib: Implement optimized memcpy routine zhichang.yuan at linaro.org
2013-12-16 16:08 ` Will Deacon
2013-12-18 1:54 ` zhichang.yuan
2013-12-11 6:24 ` [PATCH 2/6] arm64: lib: Implement optimized memmove routine zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 3/6] arm64: lib: Implement optimized memset routine zhichang.yuan at linaro.org
2013-12-16 16:55 ` Will Deacon
2013-12-18 2:37 ` zhichang.yuan
2013-12-11 6:24 ` [PATCH 4/6] arm64: lib: Implement optimized memcmp routine zhichang.yuan at linaro.org
2013-12-16 16:56 ` Will Deacon
2013-12-19 8:18 ` Deepak Saxena
2013-12-19 16:14 ` Catalin Marinas
2013-12-11 6:24 ` [PATCH 5/6] arm64: lib: Implement optimized string compare routines zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 6/6] arm64: lib: Implement optimized string length routines zhichang.yuan at linaro.org