From: zhichang.yuan@linaro.org (zhichang.yuan)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH 1/6] arm64: lib: Implement optimized memcpy routine
Date: Wed, 18 Dec 2013 09:54:15 +0800 [thread overview]
Message-ID: <52B10047.2020501@linaro.org> (raw)
In-Reply-To: <20131216160845.GD20193@mudshark.cambridge.arm.com>
Hi Will
Thanks for your reply.
I think your comments are ok, i will modify the patches involved those questions.
After those fixes are ready, the patch v2 will be submitted.
Thanks again!
Zhichang
On 2013?12?17? 00:08, Will Deacon wrote:
> On Wed, Dec 11, 2013 at 06:24:37AM +0000, zhichang.yuan at linaro.org wrote:
>> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
>>
>> This patch, based on Linaro's Cortex Strings library, improves
>> the performance of the assembly optimized memcpy() function.
>>
>> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>> ---
>> arch/arm64/lib/memcpy.S | 182 +++++++++++++++++++++++++++++++++++++++++------
>> 1 file changed, 160 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S
>> index 27b5003..e3bab71 100644
>> --- a/arch/arm64/lib/memcpy.S
>> +++ b/arch/arm64/lib/memcpy.S
>> @@ -1,5 +1,13 @@
>> /*
>> * Copyright (C) 2013 ARM Ltd.
>> + * Copyright (C) 2013 Linaro.
>> + *
>> + * This code is based on glibc cortex strings work originally authored by Linaro
>> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
>> + * be found @
>> + *
>> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
>> + * files/head:/src/aarch64/
>> *
>> * This program is free software; you can redistribute it and/or modify
>> * it under the terms of the GNU General Public License version 2 as
>> @@ -27,27 +35,157 @@
>> * Returns:
>> * x0 - dest
>> */
>> +#define dstin x0
>> +#define src x1
>> +#define count x2
>> +#define tmp1 x3
>> +#define tmp1w w3
>> +#define tmp2 x4
>> +#define tmp2w w4
>> +#define tmp3 x5
>> +#define tmp3w w5
>> +#define dst x6
>> +
>> +#define A_l x7
>> +#define A_h x8
>> +#define B_l x9
>> +#define B_h x10
>> +#define C_l x11
>> +#define C_h x12
>> +#define D_l x13
>> +#define D_h x14
>
> Use .req instead of #define?
>
>> ENTRY(memcpy)
>> - mov x4, x0
>> - subs x2, x2, #8
>> - b.mi 2f
>> -1: ldr x3, [x1], #8
>> - subs x2, x2, #8
>> - str x3, [x4], #8
>> - b.pl 1b
>> -2: adds x2, x2, #4
>> - b.mi 3f
>> - ldr w3, [x1], #4
>> - sub x2, x2, #4
>> - str w3, [x4], #4
>> -3: adds x2, x2, #2
>> - b.mi 4f
>> - ldrh w3, [x1], #2
>> - sub x2, x2, #2
>> - strh w3, [x4], #2
>> -4: adds x2, x2, #1
>> - b.mi 5f
>> - ldrb w3, [x1]
>> - strb w3, [x4]
>> -5: ret
>> + mov dst, dstin
>> + cmp count, #16
>> + /*If memory length is less than 16, stp or ldp can not be used.*/
>> + b.lo .Ltiny15
>> +.Lover16:
>> + neg tmp2, src
>> + ands tmp2, tmp2, #15/* Bytes to reach alignment. */
>> + b.eq .LSrcAligned
>> + sub count, count, tmp2
>> + /*
>> + * Use ldp and sdp to copy 16 bytes,then backward the src to
>> + * aligned address.This way is more efficient.
>> + * But the risk overwriting the source area exists.Here,prefer to
>> + * access memory forward straight,no backward.It will need a bit
>> + * more instructions, but on the same time,the accesses are aligned.
>> + */
>
> This comment reads very badly:
>
> - sdp doesn't exist
> - `more efficient' than what? How is this measured?
> - `access memory forward straight,no backward' What?
>
> Please re-write it in a clearer fashion, so that reviewers have some
> understanding of your optimisations when potentially trying to change the
> code later on.
>
>> + tbz tmp2, #0, 1f
>> + ldrb tmp1w, [src], #1
>> + strb tmp1w, [dst], #1
>> +1:
>> + tbz tmp2, #1, 1f
>> + ldrh tmp1w, [src], #2
>> + strh tmp1w, [dst], #2
>> +1:
>> + tbz tmp2, #2, 1f
>> + ldr tmp1w, [src], #4
>> + str tmp1w, [dst], #4
>> +1:
>
> Three labels called '1:' ? Can you make them unique please (the old code
> just incremented a counter).
>
>> + tbz tmp2, #3, .LSrcAligned
>> + ldr tmp1, [src],#8
>> + str tmp1, [dst],#8
>> +
>> +.LSrcAligned:
>> + cmp count, #64
>> + b.ge .Lcpy_over64
>> + /*
>> + * Deal with small copies quickly by dropping straight into the
>> + * exit block.
>> + */
>> +.Ltail63:
>> + /*
>> + * Copy up to 48 bytes of data. At this point we only need the
>> + * bottom 6 bits of count to be accurate.
>> + */
>> + ands tmp1, count, #0x30
>> + b.eq .Ltiny15
>> + cmp tmp1w, #0x20
>> + b.eq 1f
>> + b.lt 2f
>> + ldp A_l, A_h, [src], #16
>> + stp A_l, A_h, [dst], #16
>> +1:
>> + ldp A_l, A_h, [src], #16
>> + stp A_l, A_h, [dst], #16
>> +2:
>> + ldp A_l, A_h, [src], #16
>> + stp A_l, A_h, [dst], #16
>> +.Ltiny15:
>> + /*
>> + * To make memmove simpler, here don't make src backwards.
>> + * since backwards will probably overwrite the src area where src
>> + * data for nex copy located,if dst is not so far from src.
>> + */
>
> Another awful comment...
>
>> + tbz count, #3, 1f
>> + ldr tmp1, [src], #8
>> + str tmp1, [dst], #8
>> +1:
>> + tbz count, #2, 1f
>> + ldr tmp1w, [src], #4
>> + str tmp1w, [dst], #4
>> +1:
>> + tbz count, #1, 1f
>> + ldrh tmp1w, [src], #2
>> + strh tmp1w, [dst], #2
>> +1:
>
> ... and more of these labels.
>
>> + tbz count, #0, .Lexitfunc
>> + ldrb tmp1w, [src]
>> + strb tmp1w, [dst]
>> +
>> +.Lexitfunc:
>> + ret
>> +
>> +.Lcpy_over64:
>> + subs count, count, #128
>> + b.ge .Lcpy_body_large
>> + /*
>> + * Less than 128 bytes to copy, so handle 64 here and then jump
>> + * to the tail.
>> + */
>> + ldp A_l, A_h, [src],#16
>> + stp A_l, A_h, [dst],#16
>> + ldp B_l, B_h, [src],#16
>> + ldp C_l, C_h, [src],#16
>> + stp B_l, B_h, [dst],#16
>> + stp C_l, C_h, [dst],#16
>> + ldp D_l, D_h, [src],#16
>> + stp D_l, D_h, [dst],#16
>> +
>> + tst count, #0x3f
>> + b.ne .Ltail63
>> + ret
>> +
>> + /*
>> + * Critical loop. Start at a new cache line boundary. Assuming
>> + * 64 bytes per line this ensures the entire loop is in one line.
>> + */
>> + .p2align 6
>
> Can you parameterise this with L1_CACHE_SHIFT instead?
>
> Will
>
next prev parent reply other threads:[~2013-12-18 1:54 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-12-11 6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 1/6] arm64: lib: Implement optimized memcpy routine zhichang.yuan at linaro.org
2013-12-16 16:08 ` Will Deacon
2013-12-18 1:54 ` zhichang.yuan [this message]
2013-12-11 6:24 ` [PATCH 2/6] arm64: lib: Implement optimized memmove routine zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 3/6] arm64: lib: Implement optimized memset routine zhichang.yuan at linaro.org
2013-12-16 16:55 ` Will Deacon
2013-12-18 2:37 ` zhichang.yuan
2013-12-11 6:24 ` [PATCH 4/6] arm64: lib: Implement optimized memcmp routine zhichang.yuan at linaro.org
2013-12-16 16:56 ` Will Deacon
2013-12-19 8:18 ` Deepak Saxena
2013-12-19 16:14 ` Catalin Marinas
2013-12-11 6:24 ` [PATCH 5/6] arm64: lib: Implement optimized string compare routines zhichang.yuan at linaro.org
2013-12-11 6:24 ` [PATCH 6/6] arm64: lib: Implement optimized string length routines zhichang.yuan at linaro.org
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=52B10047.2020501@linaro.org \
--to=zhichang.yuan@linaro.org \
--cc=linux-arm-kernel@lists.infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.