All of lore.kernel.org
 help / color / mirror / Atom feed
From: zhichang.yuan@linaro.org (zhichang.yuan)
To: linux-arm-kernel@lists.infradead.org
Subject: [PATCH 1/6] arm64: lib: Implement optimized memcpy routine
Date: Wed, 18 Dec 2013 09:54:15 +0800	[thread overview]
Message-ID: <52B10047.2020501@linaro.org> (raw)
In-Reply-To: <20131216160845.GD20193@mudshark.cambridge.arm.com>

Hi Will

Thanks for your reply.
I think your comments are ok, i will modify the patches involved those questions.
After those fixes are ready, the patch v2 will be submitted.

Thanks again!
Zhichang




On 2013?12?17? 00:08, Will Deacon wrote:
> On Wed, Dec 11, 2013 at 06:24:37AM +0000, zhichang.yuan at linaro.org wrote:
>> From: "zhichang.yuan" <zhichang.yuan@linaro.org>
>>
>> This patch, based on Linaro's Cortex Strings library, improves
>> the performance of the assembly optimized memcpy() function.
>>
>> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>> ---
>>  arch/arm64/lib/memcpy.S |  182 +++++++++++++++++++++++++++++++++++++++++------
>>  1 file changed, 160 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/arm64/lib/memcpy.S b/arch/arm64/lib/memcpy.S
>> index 27b5003..e3bab71 100644
>> --- a/arch/arm64/lib/memcpy.S
>> +++ b/arch/arm64/lib/memcpy.S
>> @@ -1,5 +1,13 @@
>>  /*
>>   * Copyright (C) 2013 ARM Ltd.
>> + * Copyright (C) 2013 Linaro.
>> + *
>> + * This code is based on glibc cortex strings work originally authored by Linaro
>> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
>> + * be found @
>> + *
>> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
>> + * files/head:/src/aarch64/
>>   *
>>   * This program is free software; you can redistribute it and/or modify
>>   * it under the terms of the GNU General Public License version 2 as
>> @@ -27,27 +35,157 @@
>>   * Returns:
>>   *	x0 - dest
>>   */
>> +#define dstin	x0
>> +#define src	x1
>> +#define count	x2
>> +#define tmp1	x3
>> +#define tmp1w	w3
>> +#define tmp2	x4
>> +#define tmp2w	w4
>> +#define tmp3	x5
>> +#define tmp3w	w5
>> +#define dst	x6
>> +
>> +#define A_l	x7
>> +#define A_h	x8
>> +#define B_l	x9
>> +#define B_h	x10
>> +#define C_l	x11
>> +#define C_h	x12
>> +#define D_l	x13
>> +#define D_h	x14
> 
> Use .req instead of #define?
> 
>>  ENTRY(memcpy)
>> -	mov	x4, x0
>> -	subs	x2, x2, #8
>> -	b.mi	2f
>> -1:	ldr	x3, [x1], #8
>> -	subs	x2, x2, #8
>> -	str	x3, [x4], #8
>> -	b.pl	1b
>> -2:	adds	x2, x2, #4
>> -	b.mi	3f
>> -	ldr	w3, [x1], #4
>> -	sub	x2, x2, #4
>> -	str	w3, [x4], #4
>> -3:	adds	x2, x2, #2
>> -	b.mi	4f
>> -	ldrh	w3, [x1], #2
>> -	sub	x2, x2, #2
>> -	strh	w3, [x4], #2
>> -4:	adds	x2, x2, #1
>> -	b.mi	5f
>> -	ldrb	w3, [x1]
>> -	strb	w3, [x4]
>> -5:	ret
>> +	mov	dst, dstin
>> +	cmp	count, #16
>> +	/*If memory length is less than 16, stp or ldp can not be used.*/
>> +	b.lo	.Ltiny15
>> +.Lover16:
>> +	neg	tmp2, src
>> +	ands	tmp2, tmp2, #15/* Bytes to reach alignment. */
>> +	b.eq	.LSrcAligned
>> +	sub	count, count, tmp2
>> +	/*
>> +	* Use ldp and sdp to copy 16 bytes,then backward the src to
>> +	* aligned address.This way is more efficient.
>> +	* But the risk overwriting the source area exists.Here,prefer to
>> +	* access memory forward straight,no backward.It will need a bit
>> +	* more instructions, but on the same time,the accesses are aligned.
>> +	*/
> 
> This comment reads very badly:
> 
>   - sdp doesn't exist
>   - `more efficient' than what? How is this measured?
>   - `access memory forward straight,no backward' What?
> 
> Please re-write it in a clearer fashion, so that reviewers have some
> understanding of your optimisations when potentially trying to change the
> code later on.
> 
>> +	tbz	tmp2, #0, 1f
>> +	ldrb	tmp1w, [src], #1
>> +	strb	tmp1w, [dst], #1
>> +1:
>> +	tbz	tmp2, #1, 1f
>> +	ldrh	tmp1w, [src], #2
>> +	strh	tmp1w, [dst], #2
>> +1:
>> +	tbz	tmp2, #2, 1f
>> +	ldr	tmp1w, [src], #4
>> +	str	tmp1w, [dst], #4
>> +1:
> 
> Three labels called '1:' ? Can you make them unique please (the old code
> just incremented a counter).
> 
>> +	tbz	tmp2, #3, .LSrcAligned
>> +	ldr	tmp1, [src],#8
>> +	str	tmp1, [dst],#8
>> +
>> +.LSrcAligned:
>> +	cmp	count, #64
>> +	b.ge	.Lcpy_over64
>> +	/*
>> +	* Deal with small copies quickly by dropping straight into the
>> +	* exit block.
>> +	*/
>> +.Ltail63:
>> +	/*
>> +	* Copy up to 48 bytes of data. At this point we only need the
>> +	* bottom 6 bits of count to be accurate.
>> +	*/
>> +	ands	tmp1, count, #0x30
>> +	b.eq	.Ltiny15
>> +	cmp	tmp1w, #0x20
>> +	b.eq	1f
>> +	b.lt	2f
>> +	ldp	A_l, A_h, [src], #16
>> +	stp	A_l, A_h, [dst], #16
>> +1:
>> +	ldp	A_l, A_h, [src], #16
>> +	stp	A_l, A_h, [dst], #16
>> +2:
>> +	ldp	A_l, A_h, [src], #16
>> +	stp	A_l, A_h, [dst], #16
>> +.Ltiny15:
>> +	/*
>> +	* To make memmove simpler, here don't make src backwards.
>> +	* since backwards will probably overwrite the src area where src
>> +	* data for nex copy located,if dst is not so far from src.
>> +	*/
> 
> Another awful comment...
> 
>> +	tbz	count, #3, 1f
>> +	ldr	tmp1, [src], #8
>> +	str	tmp1, [dst], #8
>> +1:
>> +	tbz	count, #2, 1f
>> +	ldr	tmp1w, [src], #4
>> +	str	tmp1w, [dst], #4
>> +1:
>> +	tbz	count, #1, 1f
>> +	ldrh	tmp1w, [src], #2
>> +	strh	tmp1w, [dst], #2
>> +1:
> 
> ... and more of these labels.
> 
>> +	tbz	count, #0, .Lexitfunc
>> +	ldrb	tmp1w, [src]
>> +	strb	tmp1w, [dst]
>> +
>> +.Lexitfunc:
>> +	ret
>> +
>> +.Lcpy_over64:
>> +	subs	count, count, #128
>> +	b.ge	.Lcpy_body_large
>> +	/*
>> +	* Less than 128 bytes to copy, so handle 64 here and then jump
>> +	* to the tail.
>> +	*/
>> +	ldp	A_l, A_h, [src],#16
>> +	stp	A_l, A_h, [dst],#16
>> +	ldp	B_l, B_h, [src],#16
>> +	ldp	C_l, C_h, [src],#16
>> +	stp	B_l, B_h, [dst],#16
>> +	stp	C_l, C_h, [dst],#16
>> +	ldp	D_l, D_h, [src],#16
>> +	stp	D_l, D_h, [dst],#16
>> +
>> +	tst	count, #0x3f
>> +	b.ne	.Ltail63
>> +	ret
>> +
>> +	/*
>> +	* Critical loop.  Start at a new cache line boundary.  Assuming
>> +	* 64 bytes per line this ensures the entire loop is in one line.
>> +	*/
>> +	.p2align	6
> 
> Can you parameterise this with L1_CACHE_SHIFT instead?
> 
> Will
> 

  reply	other threads:[~2013-12-18  1:54 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-11  6:24 [PATCH 0/6] arm64:lib: the optimized string library routines for armv8 processors zhichang.yuan at linaro.org
2013-12-11  6:24 ` [PATCH 1/6] arm64: lib: Implement optimized memcpy routine zhichang.yuan at linaro.org
2013-12-16 16:08   ` Will Deacon
2013-12-18  1:54     ` zhichang.yuan [this message]
2013-12-11  6:24 ` [PATCH 2/6] arm64: lib: Implement optimized memmove routine zhichang.yuan at linaro.org
2013-12-11  6:24 ` [PATCH 3/6] arm64: lib: Implement optimized memset routine zhichang.yuan at linaro.org
2013-12-16 16:55   ` Will Deacon
2013-12-18  2:37     ` zhichang.yuan
2013-12-11  6:24 ` [PATCH 4/6] arm64: lib: Implement optimized memcmp routine zhichang.yuan at linaro.org
2013-12-16 16:56   ` Will Deacon
2013-12-19  8:18     ` Deepak Saxena
2013-12-19 16:14       ` Catalin Marinas
2013-12-11  6:24 ` [PATCH 5/6] arm64: lib: Implement optimized string compare routines zhichang.yuan at linaro.org
2013-12-11  6:24 ` [PATCH 6/6] arm64: lib: Implement optimized string length routines zhichang.yuan at linaro.org

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52B10047.2020501@linaro.org \
    --to=zhichang.yuan@linaro.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.