Re: [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6

From: Julien Grall <julien.grall@linaro.org>
To: Ian Campbell <ian.campbell@citrix.com>, xen-devel@lists.xen.org
Cc: tim@xen.org, stefano.stabellini@eu.citrix.com
Subject: Re: [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6
Date: Fri, 25 Jul 2014 16:36:23 +0100
Message-ID: <53D27977.9000307@linaro.org>
In-Reply-To: <80a33cc325055bc9d63e4ef272c5b7f68f8fa812.1406301772.git.ian.campbell@citrix.com>

Hi Ian,

On 07/25/2014 04:22 PM, Ian Campbell wrote:
> The only really interesting changes here are the updates to mem*, which switch
> to genuinely optimised versions and introduce an optimised memcmp.

I didn't read all of the code, as I assume it's just a copy from Linux
with a few changes.

Acked-by: Julien Grall <julien.grall@linaro.org>

Regards,

> bitops: No change to the bits we import. Record new baseline.
> 
> cmpxchg: Import:
>   60010e5 arm64: cmpxchg: update macros to prevent warnings
>     Author: Mark Hambleton <mahamble@broadcom.com>
>     Signed-off-by: Mark Hambleton <mahamble@broadcom.com>
>     Signed-off-by: Mark Brown <broonie@linaro.org>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> 
>   e1dfda9 arm64: xchg: prevent warning if return value is unused
>     Author: Will Deacon <will.deacon@arm.com>
>     Signed-off-by: Will Deacon <will.deacon@arm.com>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> 
>   e1dfda9 resolves the warning which previously caused us to skip 60010e508111.
> 
>   Since arm32 and arm64 now differ (as do Linux arm and arm64) here the
>   existing definition in asm/system.h gets moved to asm/arm32/cmpxchg.h.
>   Previously this was shadowing the arm64 one but they happened to be identical.
> 
> atomics: Import:
>   8715466 arch,arm64: Convert smp_mb__*()
>     Author: Peter Zijlstra <peterz@infradead.org>
>     Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> 
>   This just drops some unused (by us) smp_mb__*_atomic_*.
> 
> spinlocks: No change. Record new baseline.
> 
> mem*: Import:
>   808dbac arm64: lib: Implement optimized memcpy routine
>     Author: zhichang.yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>   280adc1 arm64: lib: Implement optimized memmove routine
>     Author: zhichang.yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>   b29a51f arm64: lib: Implement optimized memset routine
>     Author: zhichang.yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>   d875c9b arm64: lib: Implement optimized memcmp routine
>     Author: zhichang.yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
>     Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
>     Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> 
>   These import various routines from Linaro's Cortex Strings library.
> 
>   Added assembler.h, similar to the one on arm32, to define the various magic
>   symbols which these imported routines depend on (e.g. CPU_LE() and CPU_BE()).
> 
> str*: No changes. Record new baseline.
> 
>   Correct the paths in the README.
> 
> *_page: No changes. Record new baseline.
> 
>   README previously said clear_page was unused while copy_page was used, which
>   was backwards.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> ---
>  xen/arch/arm/README.LinuxPrimitives |   36 +++--
>  xen/arch/arm/arm64/lib/Makefile     |    2 +-
>  xen/arch/arm/arm64/lib/assembler.h  |   13 ++
>  xen/arch/arm/arm64/lib/memchr.S     |    1 +
>  xen/arch/arm/arm64/lib/memcmp.S     |  258 +++++++++++++++++++++++++++++++++++
>  xen/arch/arm/arm64/lib/memcpy.S     |  193 +++++++++++++++++++++++---
>  xen/arch/arm/arm64/lib/memmove.S    |  191 ++++++++++++++++++++++----
>  xen/arch/arm/arm64/lib/memset.S     |  208 +++++++++++++++++++++++++---
>  xen/include/asm-arm/arm32/cmpxchg.h |    3 +
>  xen/include/asm-arm/arm64/atomic.h  |    5 -
>  xen/include/asm-arm/arm64/cmpxchg.h |   35 +++--
>  xen/include/asm-arm/string.h        |    5 +
>  xen/include/asm-arm/system.h        |    3 -
>  13 files changed, 844 insertions(+), 109 deletions(-)
>  create mode 100644 xen/arch/arm/arm64/lib/assembler.h
>  create mode 100644 xen/arch/arm/arm64/lib/memcmp.S
> 
> diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives
> index 6cd03ca..69eeb70 100644
> --- a/xen/arch/arm/README.LinuxPrimitives
> +++ b/xen/arch/arm/README.LinuxPrimitives
> @@ -6,29 +6,26 @@ were last updated.
>  arm64:
>  =====================================================================
>  
> -bitops: last sync @ v3.14-rc7 (last commit: 8e86f0b)
> +bitops: last sync @ v3.16-rc6 (last commit: 8715466b6027)
>  
>  linux/arch/arm64/lib/bitops.S           xen/arch/arm/arm64/lib/bitops.S
>  linux/arch/arm64/include/asm/bitops.h   xen/include/asm-arm/arm64/bitops.h
>  
>  ---------------------------------------------------------------------
>  
> -cmpxchg: last sync @ v3.14-rc7 (last commit: 95c4189)
> +cmpxchg: last sync @ v3.16-rc6 (last commit: e1dfda9ced9b)
>  
>  linux/arch/arm64/include/asm/cmpxchg.h  xen/include/asm-arm/arm64/cmpxchg.h
>  
> -Skipped:
> -  60010e5 arm64: cmpxchg: update macros to prevent warnings
> -
>  ---------------------------------------------------------------------
>  
> -atomics: last sync @ v3.14-rc7 (last commit: 95c4189)
> +atomics: last sync @ v3.16-rc6 (last commit: 8715466b6027)
>  
>  linux/arch/arm64/include/asm/atomic.h   xen/include/asm-arm/arm64/atomic.h
>  
>  ---------------------------------------------------------------------
>  
> -spinlocks: last sync @ v3.14-rc7 (last commit: 95c4189)
> +spinlocks: last sync @ v3.16-rc6 (last commit: 95c4189689f9)
>  
>  linux/arch/arm64/include/asm/spinlock.h xen/include/asm-arm/arm64/spinlock.h
>  
> @@ -38,30 +35,31 @@ Skipped:
>  
>  ---------------------------------------------------------------------
>  
> -mem*: last sync @ v3.14-rc7 (last commit: 4a89922)
> +mem*: last sync @ v3.16-rc6 (last commit: d875c9b37240)
>  
> -linux/arch/arm64/lib/memchr.S             xen/arch/arm/arm64/lib/memchr.S
> -linux/arch/arm64/lib/memcpy.S             xen/arch/arm/arm64/lib/memcpy.S
> -linux/arch/arm64/lib/memmove.S            xen/arch/arm/arm64/lib/memmove.S
> -linux/arch/arm64/lib/memset.S             xen/arch/arm/arm64/lib/memset.S
> +linux/arch/arm64/lib/memchr.S           xen/arch/arm/arm64/lib/memchr.S
> +linux/arch/arm64/lib/memcmp.S           xen/arch/arm/arm64/lib/memcmp.S
> +linux/arch/arm64/lib/memcpy.S           xen/arch/arm/arm64/lib/memcpy.S
> +linux/arch/arm64/lib/memmove.S          xen/arch/arm/arm64/lib/memmove.S
> +linux/arch/arm64/lib/memset.S           xen/arch/arm/arm64/lib/memset.S
>  
> -for i in memchr.S memcpy.S memmove.S memset.S ; do
> +for i in memchr.S memcmp.S memcpy.S memmove.S memset.S ; do
>      diff -u linux/arch/arm64/lib/$i xen/arch/arm/arm64/lib/$i
>  done
>  
>  ---------------------------------------------------------------------
>  
> -str*: last sync @ v3.14-rc7 (last commit: 2b8cac8)
> +str*: last sync @ v3.16-rc6 (last commit: 2b8cac814cd5)
>  
> -linux/arch/arm/lib/strchr.S             xen/arch/arm/arm64/lib/strchr.S
> -linux/arch/arm/lib/strrchr.S            xen/arch/arm/arm64/lib/strrchr.S
> +linux/arch/arm64/lib/strchr.S           xen/arch/arm/arm64/lib/strchr.S
> +linux/arch/arm64/lib/strrchr.S          xen/arch/arm/arm64/lib/strrchr.S
>  
>  ---------------------------------------------------------------------
>  
> -{clear,copy}_page: last sync @ v3.14-rc7 (last commit: f27bb13)
> +{clear,copy}_page: last sync @ v3.16-rc6 (last commit: f27bb139c387)
>  
> -linux/arch/arm64/lib/clear_page.S       unused in Xen
> -linux/arch/arm64/lib/copy_page.S        xen/arch/arm/arm64/lib/copy_page.S
> +linux/arch/arm64/lib/clear_page.S       xen/arch/arm/arm64/lib/clear_page.S
> +linux/arch/arm64/lib/copy_page.S        unused in Xen
>  
>  =====================================================================
>  arm32
> diff --git a/xen/arch/arm/arm64/lib/Makefile b/xen/arch/arm/arm64/lib/Makefile
> index b895afa..2e7fb64 100644
> --- a/xen/arch/arm/arm64/lib/Makefile
> +++ b/xen/arch/arm/arm64/lib/Makefile
> @@ -1,4 +1,4 @@
> -obj-y += memcpy.o memmove.o memset.o memchr.o
> +obj-y += memcpy.o memcmp.o memmove.o memset.o memchr.o
>  obj-y += clear_page.o
>  obj-y += bitops.o find_next_bit.o
>  obj-y += strchr.o strrchr.o
> diff --git a/xen/arch/arm/arm64/lib/assembler.h b/xen/arch/arm/arm64/lib/assembler.h
> new file mode 100644
> index 0000000..84669d1
> --- /dev/null
> +++ b/xen/arch/arm/arm64/lib/assembler.h
> @@ -0,0 +1,13 @@
> +#ifndef __ASM_ASSEMBLER_H__
> +#define __ASM_ASSEMBLER_H__
> +
> +#ifndef __ASSEMBLY__
> +#error "Only include this from assembly code"
> +#endif
> +
> +/* Only LE support so far */
> +#define CPU_BE(x...)
> +#define CPU_LE(x...) x
> +
> +#endif /* __ASM_ASSEMBLER_H__ */
> +
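
Illustration only (not part of the patch): a minimal C sketch of how the CPU_BE()/CPU_LE() pair above selects code on a little-endian-only build. The two macro definitions mirror the quoted assembler.h; tail_mask() is a hypothetical helper that imitates the mask shift used in the new memcmp.S. The named variadic parameter (x...) is a GNU extension, so GCC or Clang is assumed.

```c
#include <stdint.h>

/* Same shape as the quoted assembler.h: only little-endian is supported. */
#define CPU_BE(x...)
#define CPU_LE(x...) x

/*
 * Hypothetical helper: build a mask that covers everything above the
 * lowest valid_bits bits of a 64-bit word (on little-endian those are
 * the bytes past the valid tail). valid_bits must be in 1..63.
 */
static uint64_t tail_mask(unsigned int valid_bits)
{
    uint64_t mask = ~UINT64_C(0);

    CPU_BE(mask >>= valid_bits;)   /* expands to nothing */
    CPU_LE(mask <<= valid_bits;)   /* the only statement that is compiled */
    return mask;
}
```

Swapping the two definitions would be the natural way to add big-endian support later without touching the imported routines.
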
> diff --git a/xen/arch/arm/arm64/lib/memchr.S b/xen/arch/arm/arm64/lib/memchr.S
> index 3cc1b01..b04590c 100644
> --- a/xen/arch/arm/arm64/lib/memchr.S
> +++ b/xen/arch/arm/arm64/lib/memchr.S
> @@ -18,6 +18,7 @@
>   */
>  
>  #include <xen/config.h>
> +#include "assembler.h"
>  
>  /*
>   * Find a character in an area of memory.
> diff --git a/xen/arch/arm/arm64/lib/memcmp.S b/xen/arch/arm/arm64/lib/memcmp.S
> new file mode 100644
> index 0000000..9aad925
> --- /dev/null
> +++ b/xen/arch/arm/arm64/lib/memcmp.S
> @@ -0,0 +1,258 @@
> +/*
> + * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <xen/config.h>
> +#include "assembler.h"
> +
> +/*
> +* compare memory areas(when two memory areas' offset are different,
> +* alignment handled by the hardware)
> +*
> +* Parameters:
> +*  x0 - const memory area 1 pointer
> +*  x1 - const memory area 2 pointer
> +*  x2 - the maximal compare byte length
> +* Returns:
> +*  x0 - a compare result, maybe less than, equal to, or greater than ZERO
> +*/
> +
> +/* Parameters and result.  */
> +src1		.req	x0
> +src2		.req	x1
> +limit		.req	x2
> +result		.req	x0
> +
> +/* Internal variables.  */
> +data1		.req	x3
> +data1w		.req	w3
> +data2		.req	x4
> +data2w		.req	w4
> +has_nul		.req	x5
> +diff		.req	x6
> +endloop		.req	x7
> +tmp1		.req	x8
> +tmp2		.req	x9
> +tmp3		.req	x10
> +pos		.req	x11
> +limit_wd	.req	x12
> +mask		.req	x13
> +
> +ENTRY(memcmp)
> +	cbz	limit, .Lret0
> +	eor	tmp1, src1, src2
> +	tst	tmp1, #7
> +	b.ne	.Lmisaligned8
> +	ands	tmp1, src1, #7
> +	b.ne	.Lmutual_align
> +	sub	limit_wd, limit, #1 /* limit != 0, so no underflow.  */
> +	lsr	limit_wd, limit_wd, #3 /* Convert to Dwords.  */
> +	/*
> +	* The input source addresses are at alignment boundary.
> +	* Directly compare eight bytes each time.
> +	*/
> +.Lloop_aligned:
> +	ldr	data1, [src1], #8
> +	ldr	data2, [src2], #8
> +.Lstart_realigned:
> +	subs	limit_wd, limit_wd, #1
> +	eor	diff, data1, data2	/* Non-zero if differences found.  */
> +	csinv	endloop, diff, xzr, cs	/* Last Dword or differences.  */
> +	cbz	endloop, .Lloop_aligned
> +
> +	/* Not reached the limit, must have found a diff.  */
> +	tbz	limit_wd, #63, .Lnot_limit
> +
> +	/* Limit % 8 == 0 => the diff is in the last 8 bytes. */
> +	ands	limit, limit, #7
> +	b.eq	.Lnot_limit
> +	/*
> +	* The remained bytes less than 8. It is needed to extract valid data
> +	* from last eight bytes of the intended memory range.
> +	*/
> +	lsl	limit, limit, #3	/* bytes-> bits.  */
> +	mov	mask, #~0
> +CPU_BE( lsr	mask, mask, limit )
> +CPU_LE( lsl	mask, mask, limit )
> +	bic	data1, data1, mask
> +	bic	data2, data2, mask
> +
> +	orr	diff, diff, mask
> +	b	.Lnot_limit
> +
> +.Lmutual_align:
> +	/*
> +	* Sources are mutually aligned, but are not currently at an
> +	* alignment boundary. Round down the addresses and then mask off
> +	* the bytes that precede the start point.
> +	*/
> +	bic	src1, src1, #7
> +	bic	src2, src2, #7
> +	ldr	data1, [src1], #8
> +	ldr	data2, [src2], #8
> +	/*
> +	* We can not add limit with alignment offset(tmp1) here. Since the
> +	* addition probably make the limit overflown.
> +	*/
> +	sub	limit_wd, limit, #1/*limit != 0, so no underflow.*/
> +	and	tmp3, limit_wd, #7
> +	lsr	limit_wd, limit_wd, #3
> +	add	tmp3, tmp3, tmp1
> +	add	limit_wd, limit_wd, tmp3, lsr #3
> +	add	limit, limit, tmp1/* Adjust the limit for the extra.  */
> +
> +	lsl	tmp1, tmp1, #3/* Bytes beyond alignment -> bits.*/
> +	neg	tmp1, tmp1/* Bits to alignment -64.  */
> +	mov	tmp2, #~0
> +	/*mask off the non-intended bytes before the start address.*/
> +CPU_BE( lsl	tmp2, tmp2, tmp1 )/*Big-endian.Early bytes are at MSB*/
> +	/* Little-endian.  Early bytes are at LSB.  */
> +CPU_LE( lsr	tmp2, tmp2, tmp1 )
> +
> +	orr	data1, data1, tmp2
> +	orr	data2, data2, tmp2
> +	b	.Lstart_realigned
> +
> +	/*src1 and src2 have different alignment offset.*/
> +.Lmisaligned8:
> +	cmp	limit, #8
> +	b.lo	.Ltiny8proc /*limit < 8: compare byte by byte*/
> +
> +	and	tmp1, src1, #7
> +	neg	tmp1, tmp1
> +	add	tmp1, tmp1, #8/*valid length in the first 8 bytes of src1*/
> +	and	tmp2, src2, #7
> +	neg	tmp2, tmp2
> +	add	tmp2, tmp2, #8/*valid length in the first 8 bytes of src2*/
> +	subs	tmp3, tmp1, tmp2
> +	csel	pos, tmp1, tmp2, hi /*Choose the maximum.*/
> +
> +	sub	limit, limit, pos
> +	/*compare the proceeding bytes in the first 8 byte segment.*/
> +.Ltinycmp:
> +	ldrb	data1w, [src1], #1
> +	ldrb	data2w, [src2], #1
> +	subs	pos, pos, #1
> +	ccmp	data1w, data2w, #0, ne  /* NZCV = 0b0000.  */
> +	b.eq	.Ltinycmp
> +	cbnz	pos, 1f /*diff occurred before the last byte.*/
> +	cmp	data1w, data2w
> +	b.eq	.Lstart_align
> +1:
> +	sub	result, data1, data2
> +	ret
> +
> +.Lstart_align:
> +	lsr	limit_wd, limit, #3
> +	cbz	limit_wd, .Lremain8
> +
> +	ands	xzr, src1, #7
> +	b.eq	.Lrecal_offset
> +	/*process more leading bytes to make src1 aligned...*/
> +	add	src1, src1, tmp3 /*backwards src1 to alignment boundary*/
> +	add	src2, src2, tmp3
> +	sub	limit, limit, tmp3
> +	lsr	limit_wd, limit, #3
> +	cbz	limit_wd, .Lremain8
> +	/*load 8 bytes from aligned SRC1..*/
> +	ldr	data1, [src1], #8
> +	ldr	data2, [src2], #8
> +
> +	subs	limit_wd, limit_wd, #1
> +	eor	diff, data1, data2  /*Non-zero if differences found.*/
> +	csinv	endloop, diff, xzr, ne
> +	cbnz	endloop, .Lunequal_proc
> +	/*How far is the current SRC2 from the alignment boundary...*/
> +	and	tmp3, tmp3, #7
> +
> +.Lrecal_offset:/*src1 is aligned now..*/
> +	neg	pos, tmp3
> +.Lloopcmp_proc:
> +	/*
> +	* Divide the eight bytes into two parts. First,backwards the src2
> +	* to an alignment boundary,load eight bytes and compare from
> +	* the SRC2 alignment boundary. If all 8 bytes are equal,then start
> +	* the second part's comparison. Otherwise finish the comparison.
> +	* This special handle can garantee all the accesses are in the
> +	* thread/task space in avoid to overrange access.
> +	*/
> +	ldr	data1, [src1,pos]
> +	ldr	data2, [src2,pos]
> +	eor	diff, data1, data2  /* Non-zero if differences found.  */
> +	cbnz	diff, .Lnot_limit
> +
> +	/*The second part process*/
> +	ldr	data1, [src1], #8
> +	ldr	data2, [src2], #8
> +	eor	diff, data1, data2  /* Non-zero if differences found.  */
> +	subs	limit_wd, limit_wd, #1
> +	csinv	endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/
> +	cbz	endloop, .Lloopcmp_proc
> +.Lunequal_proc:
> +	cbz	diff, .Lremain8
> +
> +/*There is differnence occured in the latest comparison.*/
> +.Lnot_limit:
> +/*
> +* For little endian,reverse the low significant equal bits into MSB,then
> +* following CLZ can find how many equal bits exist.
> +*/
> +CPU_LE( rev	diff, diff )
> +CPU_LE( rev	data1, data1 )
> +CPU_LE( rev	data2, data2 )
> +
> +	/*
> +	* The MS-non-zero bit of DIFF marks either the first bit
> +	* that is different, or the end of the significant data.
> +	* Shifting left now will bring the critical information into the
> +	* top bits.
> +	*/
> +	clz	pos, diff
> +	lsl	data1, data1, pos
> +	lsl	data2, data2, pos
> +	/*
> +	* We need to zero-extend (char is unsigned) the value and then
> +	* perform a signed subtraction.
> +	*/
> +	lsr	data1, data1, #56
> +	sub	result, data1, data2, lsr #56
> +	ret
> +
> +.Lremain8:
> +	/* Limit % 8 == 0 =>. all data are equal.*/
> +	ands	limit, limit, #7
> +	b.eq	.Lret0
> +
> +.Ltiny8proc:
> +	ldrb	data1w, [src1], #1
> +	ldrb	data2w, [src2], #1
> +	subs	limit, limit, #1
> +
> +	ccmp	data1w, data2w, #0, ne  /* NZCV = 0b0000. */
> +	b.eq	.Ltiny8proc
> +	sub	result, data1, data2
> +	ret
> +.Lret0:
> +	mov	result, #0
> +	ret
> +ENDPROC(memcmp)
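
Illustration only (not part of the patch): a rough C sketch of what the .Lnot_limit tail above appears to do once a differing 8-byte word has been found. It XORs the two words, counts leading zeros to locate the first differing bit, shifts that bit to the top and subtracts the top bytes. The names load_be64() and cmp8() are hypothetical, and __builtin_clzll assumes GCC or Clang. Assembling the word byte by byte stands in for the ldr + rev pair on little-endian.

```c
#include <stdint.h>

/* Assemble 8 bytes so that the lowest address ends up in the top byte. */
static uint64_t load_be64(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Compare exactly 8 bytes; returns <0, 0 or >0 like memcmp. */
static int cmp8(const unsigned char *s1, const unsigned char *s2)
{
    uint64_t a = load_be64(s1);
    uint64_t b = load_be64(s2);
    uint64_t diff = a ^ b;          /* non-zero if differences found */
    int pos;

    if (diff == 0)
        return 0;

    pos = __builtin_clzll(diff);    /* number of equal leading bits */

    /* Bring the first differing bit to the top, then compare top bytes. */
    return (int)((a << pos) >> 56) - (int)((b << pos) >> 56);
}
```

As in the assembly, only the sign of the result is meaningful; the magnitude is not the plain byte difference because the shift is by bits rather than whole bytes.
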
> diff --git a/xen/arch/arm/arm64/lib/memcpy.S b/xen/arch/arm/arm64/lib/memcpy.S
> index c8197c6..7cc885d 100644
> --- a/xen/arch/arm/arm64/lib/memcpy.S
> +++ b/xen/arch/arm/arm64/lib/memcpy.S
> @@ -1,5 +1,13 @@
>  /*
>   * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
> @@ -15,6 +23,8 @@
>   */
>  
>  #include <xen/config.h>
> +#include <asm/cache.h>
> +#include "assembler.h"
>  
>  /*
>   * Copy a buffer from src to dest (alignment handled by the hardware)
> @@ -26,27 +36,166 @@
>   * Returns:
>   *	x0 - dest
>   */
> +dstin	.req	x0
> +src	.req	x1
> +count	.req	x2
> +tmp1	.req	x3
> +tmp1w	.req	w3
> +tmp2	.req	x4
> +tmp2w	.req	w4
> +tmp3	.req	x5
> +tmp3w	.req	w5
> +dst	.req	x6
> +
> +A_l	.req	x7
> +A_h	.req	x8
> +B_l	.req	x9
> +B_h	.req	x10
> +C_l	.req	x11
> +C_h	.req	x12
> +D_l	.req	x13
> +D_h	.req	x14
> +
>  ENTRY(memcpy)
> -	mov	x4, x0
> -	subs	x2, x2, #8
> -	b.mi	2f
> -1:	ldr	x3, [x1], #8
> -	subs	x2, x2, #8
> -	str	x3, [x4], #8
> -	b.pl	1b
> -2:	adds	x2, x2, #4
> -	b.mi	3f
> -	ldr	w3, [x1], #4
> -	sub	x2, x2, #4
> -	str	w3, [x4], #4
> -3:	adds	x2, x2, #2
> -	b.mi	4f
> -	ldrh	w3, [x1], #2
> -	sub	x2, x2, #2
> -	strh	w3, [x4], #2
> -4:	adds	x2, x2, #1
> -	b.mi	5f
> -	ldrb	w3, [x1]
> -	strb	w3, [x4]
> -5:	ret
> +	mov	dst, dstin
> +	cmp	count, #16
> +	/*When memory length is less than 16, the accessed are not aligned.*/
> +	b.lo	.Ltiny15
> +
> +	neg	tmp2, src
> +	ands	tmp2, tmp2, #15/* Bytes to reach alignment. */
> +	b.eq	.LSrcAligned
> +	sub	count, count, tmp2
> +	/*
> +	* Copy the leading memory data from src to dst in an increasing
> +	* address order.By this way,the risk of overwritting the source
> +	* memory data is eliminated when the distance between src and
> +	* dst is less than 16. The memory accesses here are alignment.
> +	*/
> +	tbz	tmp2, #0, 1f
> +	ldrb	tmp1w, [src], #1
> +	strb	tmp1w, [dst], #1
> +1:
> +	tbz	tmp2, #1, 2f
> +	ldrh	tmp1w, [src], #2
> +	strh	tmp1w, [dst], #2
> +2:
> +	tbz	tmp2, #2, 3f
> +	ldr	tmp1w, [src], #4
> +	str	tmp1w, [dst], #4
> +3:
> +	tbz	tmp2, #3, .LSrcAligned
> +	ldr	tmp1, [src],#8
> +	str	tmp1, [dst],#8
> +
> +.LSrcAligned:
> +	cmp	count, #64
> +	b.ge	.Lcpy_over64
> +	/*
> +	* Deal with small copies quickly by dropping straight into the
> +	* exit block.
> +	*/
> +.Ltail63:
> +	/*
> +	* Copy up to 48 bytes of data. At this point we only need the
> +	* bottom 6 bits of count to be accurate.
> +	*/
> +	ands	tmp1, count, #0x30
> +	b.eq	.Ltiny15
> +	cmp	tmp1w, #0x20
> +	b.eq	1f
> +	b.lt	2f
> +	ldp	A_l, A_h, [src], #16
> +	stp	A_l, A_h, [dst], #16
> +1:
> +	ldp	A_l, A_h, [src], #16
> +	stp	A_l, A_h, [dst], #16
> +2:
> +	ldp	A_l, A_h, [src], #16
> +	stp	A_l, A_h, [dst], #16
> +.Ltiny15:
> +	/*
> +	* Prefer to break one ldp/stp into several load/store to access
> +	* memory in an increasing address order,rather than to load/store 16
> +	* bytes from (src-16) to (dst-16) and to backward the src to aligned
> +	* address,which way is used in original cortex memcpy. If keeping
> +	* the original memcpy process here, memmove need to satisfy the
> +	* precondition that src address is at least 16 bytes bigger than dst
> +	* address,otherwise some source data will be overwritten when memove
> +	* call memcpy directly. To make memmove simpler and decouple the
> +	* memcpy's dependency on memmove, withdrew the original process.
> +	*/
> +	tbz	count, #3, 1f
> +	ldr	tmp1, [src], #8
> +	str	tmp1, [dst], #8
> +1:
> +	tbz	count, #2, 2f
> +	ldr	tmp1w, [src], #4
> +	str	tmp1w, [dst], #4
> +2:
> +	tbz	count, #1, 3f
> +	ldrh	tmp1w, [src], #2
> +	strh	tmp1w, [dst], #2
> +3:
> +	tbz	count, #0, .Lexitfunc
> +	ldrb	tmp1w, [src]
> +	strb	tmp1w, [dst]
> +
> +.Lexitfunc:
> +	ret
> +
> +.Lcpy_over64:
> +	subs	count, count, #128
> +	b.ge	.Lcpy_body_large
> +	/*
> +	* Less than 128 bytes to copy, so handle 64 here and then jump
> +	* to the tail.
> +	*/
> +	ldp	A_l, A_h, [src],#16
> +	stp	A_l, A_h, [dst],#16
> +	ldp	B_l, B_h, [src],#16
> +	ldp	C_l, C_h, [src],#16
> +	stp	B_l, B_h, [dst],#16
> +	stp	C_l, C_h, [dst],#16
> +	ldp	D_l, D_h, [src],#16
> +	stp	D_l, D_h, [dst],#16
> +
> +	tst	count, #0x3f
> +	b.ne	.Ltail63
> +	ret
> +
> +	/*
> +	* Critical loop.  Start at a new cache line boundary.  Assuming
> +	* 64 bytes per line this ensures the entire loop is in one line.
> +	*/
> +	.p2align	L1_CACHE_SHIFT
> +.Lcpy_body_large:
> +	/* pre-get 64 bytes data. */
> +	ldp	A_l, A_h, [src],#16
> +	ldp	B_l, B_h, [src],#16
> +	ldp	C_l, C_h, [src],#16
> +	ldp	D_l, D_h, [src],#16
> +1:
> +	/*
> +	* interlace the load of next 64 bytes data block with store of the last
> +	* loaded 64 bytes data.
> +	*/
> +	stp	A_l, A_h, [dst],#16
> +	ldp	A_l, A_h, [src],#16
> +	stp	B_l, B_h, [dst],#16
> +	ldp	B_l, B_h, [src],#16
> +	stp	C_l, C_h, [dst],#16
> +	ldp	C_l, C_h, [src],#16
> +	stp	D_l, D_h, [dst],#16
> +	ldp	D_l, D_h, [src],#16
> +	subs	count, count, #64
> +	b.ge	1b
> +	stp	A_l, A_h, [dst],#16
> +	stp	B_l, B_h, [dst],#16
> +	stp	C_l, C_h, [dst],#16
> +	stp	D_l, D_h, [dst],#16
> +
> +	tst	count, #0x3f
> +	b.ne	.Ltail63
> +	ret
>  ENDPROC(memcpy)
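
Illustration only (not part of the patch): a small C sketch of the .Ltiny15 tail above, which copies the last 0 to 15 bytes by testing bits 3, 2, 1 and 0 of count (the tbz instructions) and always moving forward through memory. copy_tail15() is a hypothetical name; the fixed-size memcpy() calls stand in for the ldr/str, ldrh/strh and ldrb/strb pairs.

```c
#include <stddef.h>
#include <string.h>

/* Copy the low 4 bits' worth of count (0..15 bytes) in increasing address order. */
static void copy_tail15(unsigned char *dst, const unsigned char *src, size_t count)
{
    if (count & 8) { memcpy(dst, src, 8); dst += 8; src += 8; }  /* tbz count, #3 */
    if (count & 4) { memcpy(dst, src, 4); dst += 4; src += 4; }  /* tbz count, #2 */
    if (count & 2) { memcpy(dst, src, 2); dst += 2; src += 2; }  /* tbz count, #1 */
    if (count & 1) *dst = *src;                                  /* tbz count, #0 */
}
```

Only the low four bits of count are inspected, which matches the comment in the patch that the tail code needs just the bottom bits of count to be accurate.
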
> diff --git a/xen/arch/arm/arm64/lib/memmove.S b/xen/arch/arm/arm64/lib/memmove.S
> index 1bf0936..f4065b9 100644
> --- a/xen/arch/arm/arm64/lib/memmove.S
> +++ b/xen/arch/arm/arm64/lib/memmove.S
> @@ -1,5 +1,13 @@
>  /*
>   * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
> @@ -15,6 +23,8 @@
>   */
>  
>  #include <xen/config.h>
> +#include <asm/cache.h>
> +#include "assembler.h"
>  
>  /*
>   * Move a buffer from src to test (alignment handled by the hardware).
> @@ -27,30 +37,161 @@
>   * Returns:
>   *	x0 - dest
>   */
> +dstin	.req	x0
> +src	.req	x1
> +count	.req	x2
> +tmp1	.req	x3
> +tmp1w	.req	w3
> +tmp2	.req	x4
> +tmp2w	.req	w4
> +tmp3	.req	x5
> +tmp3w	.req	w5
> +dst	.req	x6
> +
> +A_l	.req	x7
> +A_h	.req	x8
> +B_l	.req	x9
> +B_h	.req	x10
> +C_l	.req	x11
> +C_h	.req	x12
> +D_l	.req	x13
> +D_h	.req	x14
> +
>  ENTRY(memmove)
> -	cmp	x0, x1
> -	b.ls	memcpy
> -	add	x4, x0, x2
> -	add	x1, x1, x2
> -	subs	x2, x2, #8
> -	b.mi	2f
> -1:	ldr	x3, [x1, #-8]!
> -	subs	x2, x2, #8
> -	str	x3, [x4, #-8]!
> -	b.pl	1b
> -2:	adds	x2, x2, #4
> -	b.mi	3f
> -	ldr	w3, [x1, #-4]!
> -	sub	x2, x2, #4
> -	str	w3, [x4, #-4]!
> -3:	adds	x2, x2, #2
> -	b.mi	4f
> -	ldrh	w3, [x1, #-2]!
> -	sub	x2, x2, #2
> -	strh	w3, [x4, #-2]!
> -4:	adds	x2, x2, #1
> -	b.mi	5f
> -	ldrb	w3, [x1, #-1]
> -	strb	w3, [x4, #-1]
> -5:	ret
> +	cmp	dstin, src
> +	b.lo	memcpy
> +	add	tmp1, src, count
> +	cmp	dstin, tmp1
> +	b.hs	memcpy		/* No overlap.  */
> +
> +	add	dst, dstin, count
> +	add	src, src, count
> +	cmp	count, #16
> +	b.lo	.Ltail15  /*probably non-alignment accesses.*/
> +
> +	ands	tmp2, src, #15     /* Bytes to reach alignment.  */
> +	b.eq	.LSrcAligned
> +	sub	count, count, tmp2
> +	/*
> +	* process the aligned offset length to make the src aligned firstly.
> +	* those extra instructions' cost is acceptable. It also make the
> +	* coming accesses are based on aligned address.
> +	*/
> +	tbz	tmp2, #0, 1f
> +	ldrb	tmp1w, [src, #-1]!
> +	strb	tmp1w, [dst, #-1]!
> +1:
> +	tbz	tmp2, #1, 2f
> +	ldrh	tmp1w, [src, #-2]!
> +	strh	tmp1w, [dst, #-2]!
> +2:
> +	tbz	tmp2, #2, 3f
> +	ldr	tmp1w, [src, #-4]!
> +	str	tmp1w, [dst, #-4]!
> +3:
> +	tbz	tmp2, #3, .LSrcAligned
> +	ldr	tmp1, [src, #-8]!
> +	str	tmp1, [dst, #-8]!
> +
> +.LSrcAligned:
> +	cmp	count, #64
> +	b.ge	.Lcpy_over64
> +
> +	/*
> +	* Deal with small copies quickly by dropping straight into the
> +	* exit block.
> +	*/
> +.Ltail63:
> +	/*
> +	* Copy up to 48 bytes of data. At this point we only need the
> +	* bottom 6 bits of count to be accurate.
> +	*/
> +	ands	tmp1, count, #0x30
> +	b.eq	.Ltail15
> +	cmp	tmp1w, #0x20
> +	b.eq	1f
> +	b.lt	2f
> +	ldp	A_l, A_h, [src, #-16]!
> +	stp	A_l, A_h, [dst, #-16]!
> +1:
> +	ldp	A_l, A_h, [src, #-16]!
> +	stp	A_l, A_h, [dst, #-16]!
> +2:
> +	ldp	A_l, A_h, [src, #-16]!
> +	stp	A_l, A_h, [dst, #-16]!
> +
> +.Ltail15:
> +	tbz	count, #3, 1f
> +	ldr	tmp1, [src, #-8]!
> +	str	tmp1, [dst, #-8]!
> +1:
> +	tbz	count, #2, 2f
> +	ldr	tmp1w, [src, #-4]!
> +	str	tmp1w, [dst, #-4]!
> +2:
> +	tbz	count, #1, 3f
> +	ldrh	tmp1w, [src, #-2]!
> +	strh	tmp1w, [dst, #-2]!
> +3:
> +	tbz	count, #0, .Lexitfunc
> +	ldrb	tmp1w, [src, #-1]
> +	strb	tmp1w, [dst, #-1]
> +
> +.Lexitfunc:
> +	ret
> +
> +.Lcpy_over64:
> +	subs	count, count, #128
> +	b.ge	.Lcpy_body_large
> +	/*
> +	* Less than 128 bytes to copy, so handle 64 bytes here and then jump
> +	* to the tail.
> +	*/
> +	ldp	A_l, A_h, [src, #-16]
> +	stp	A_l, A_h, [dst, #-16]
> +	ldp	B_l, B_h, [src, #-32]
> +	ldp	C_l, C_h, [src, #-48]
> +	stp	B_l, B_h, [dst, #-32]
> +	stp	C_l, C_h, [dst, #-48]
> +	ldp	D_l, D_h, [src, #-64]!
> +	stp	D_l, D_h, [dst, #-64]!
> +
> +	tst	count, #0x3f
> +	b.ne	.Ltail63
> +	ret
> +
> +	/*
> +	* Critical loop. Start at a new cache line boundary. Assuming
> +	* 64 bytes per line this ensures the entire loop is in one line.
> +	*/
> +	.p2align	L1_CACHE_SHIFT
> +.Lcpy_body_large:
> +	/* pre-load 64 bytes data. */
> +	ldp	A_l, A_h, [src, #-16]
> +	ldp	B_l, B_h, [src, #-32]
> +	ldp	C_l, C_h, [src, #-48]
> +	ldp	D_l, D_h, [src, #-64]!
> +1:
> +	/*
> +	* interlace the load of next 64 bytes data block with store of the last
> +	* loaded 64 bytes data.
> +	*/
> +	stp	A_l, A_h, [dst, #-16]
> +	ldp	A_l, A_h, [src, #-16]
> +	stp	B_l, B_h, [dst, #-32]
> +	ldp	B_l, B_h, [src, #-32]
> +	stp	C_l, C_h, [dst, #-48]
> +	ldp	C_l, C_h, [src, #-48]
> +	stp	D_l, D_h, [dst, #-64]!
> +	ldp	D_l, D_h, [src, #-64]!
> +	subs	count, count, #64
> +	b.ge	1b
> +	stp	A_l, A_h, [dst, #-16]
> +	stp	B_l, B_h, [dst, #-32]
> +	stp	C_l, C_h, [dst, #-48]
> +	stp	D_l, D_h, [dst, #-64]!
> +
> +	tst	count, #0x3f
> +	b.ne	.Ltail63
> +	ret
>  ENDPROC(memmove)
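
Illustration only (not part of the patch): a C sketch of the overlap check at the top of the new memmove above. The routine branches to memcpy when dst is below src (b.lo) or when dst is at or beyond src + count (b.hs, "No overlap"), and otherwise copies backwards from the end of the buffers. pick_direction() and the enum are hypothetical names used only for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

enum copy_dir { COPY_FORWARD, COPY_BACKWARD };

/* Decide which way to copy so overlapping source bytes are read before being overwritten. */
static enum copy_dir pick_direction(uintptr_t dst, uintptr_t src, size_t count)
{
    if (dst < src)
        return COPY_FORWARD;    /* cmp dstin, src ; b.lo memcpy */
    if (dst >= src + count)
        return COPY_FORWARD;    /* cmp dstin, tmp1 ; b.hs memcpy (no overlap) */
    return COPY_BACKWARD;       /* dst lies inside [src, src + count) */
}
```

The forward case relies on memcpy copying in increasing address order, which is what the long comment in the new memcpy.S tail describes.
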
> diff --git a/xen/arch/arm/arm64/lib/memset.S b/xen/arch/arm/arm64/lib/memset.S
> index 25a4fb6..4ee714d 100644
> --- a/xen/arch/arm/arm64/lib/memset.S
> +++ b/xen/arch/arm/arm64/lib/memset.S
> @@ -1,5 +1,13 @@
>  /*
>   * Copyright (C) 2013 ARM Ltd.
> + * Copyright (C) 2013 Linaro.
> + *
> + * This code is based on glibc cortex strings work originally authored by Linaro
> + * and re-licensed under GPLv2 for the Linux kernel. The original code can
> + * be found @
> + *
> + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
> + * files/head:/src/aarch64/
>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License version 2 as
> @@ -15,6 +23,8 @@
>   */
>  
>  #include <xen/config.h>
> +#include <asm/cache.h>
> +#include "assembler.h"
>  
>  /*
>   * Fill in the buffer with character c (alignment handled by the hardware)
> @@ -26,27 +36,181 @@
>   * Returns:
>   *	x0 - buf
>   */
> +
> +dstin		.req	x0
> +val		.req	w1
> +count		.req	x2
> +tmp1		.req	x3
> +tmp1w		.req	w3
> +tmp2		.req	x4
> +tmp2w		.req	w4
> +zva_len_x	.req	x5
> +zva_len		.req	w5
> +zva_bits_x	.req	x6
> +
> +A_l		.req	x7
> +A_lw		.req	w7
> +dst		.req	x8
> +tmp3w		.req	w9
> +tmp3		.req	x9
> +
>  ENTRY(memset)
> -	mov	x4, x0
> -	and	w1, w1, #0xff
> -	orr	w1, w1, w1, lsl #8
> -	orr	w1, w1, w1, lsl #16
> -	orr	x1, x1, x1, lsl #32
> -	subs	x2, x2, #8
> -	b.mi	2f
> -1:	str	x1, [x4], #8
> -	subs	x2, x2, #8
> -	b.pl	1b
> -2:	adds	x2, x2, #4
> -	b.mi	3f
> -	sub	x2, x2, #4
> -	str	w1, [x4], #4
> -3:	adds	x2, x2, #2
> -	b.mi	4f
> -	sub	x2, x2, #2
> -	strh	w1, [x4], #2
> -4:	adds	x2, x2, #1
> -	b.mi	5f
> -	strb	w1, [x4]
> -5:	ret
> +	mov	dst, dstin	/* Preserve return value.  */
> +	and	A_lw, val, #255
> +	orr	A_lw, A_lw, A_lw, lsl #8
> +	orr	A_lw, A_lw, A_lw, lsl #16
> +	orr	A_l, A_l, A_l, lsl #32
> +
> +	cmp	count, #15
> +	b.hi	.Lover16_proc
> +	/*All store maybe are non-aligned..*/
> +	tbz	count, #3, 1f
> +	str	A_l, [dst], #8
> +1:
> +	tbz	count, #2, 2f
> +	str	A_lw, [dst], #4
> +2:
> +	tbz	count, #1, 3f
> +	strh	A_lw, [dst], #2
> +3:
> +	tbz	count, #0, 4f
> +	strb	A_lw, [dst]
> +4:
> +	ret
> +
> +.Lover16_proc:
> +	/*Whether  the start address is aligned with 16.*/
> +	neg	tmp2, dst
> +	ands	tmp2, tmp2, #15
> +	b.eq	.Laligned
> +/*
> +* The count is not less than 16, we can use stp to store the start 16 bytes,
> +* then adjust the dst aligned with 16.This process will make the current
> +* memory address at alignment boundary.
> +*/
> +	stp	A_l, A_l, [dst] /*non-aligned store..*/
> +	/*make the dst aligned..*/
> +	sub	count, count, tmp2
> +	add	dst, dst, tmp2
> +
> +.Laligned:
> +	cbz	A_l, .Lzero_mem
> +
> +.Ltail_maybe_long:
> +	cmp	count, #64
> +	b.ge	.Lnot_short
> +.Ltail63:
> +	ands	tmp1, count, #0x30
> +	b.eq	3f
> +	cmp	tmp1w, #0x20
> +	b.eq	1f
> +	b.lt	2f
> +	stp	A_l, A_l, [dst], #16
> +1:
> +	stp	A_l, A_l, [dst], #16
> +2:
> +	stp	A_l, A_l, [dst], #16
> +/*
> +* The last store length is less than 16,use stp to write last 16 bytes.
> +* It will lead some bytes written twice and the access is non-aligned.
> +*/
> +3:
> +	ands	count, count, #15
> +	cbz	count, 4f
> +	add	dst, dst, count
> +	stp	A_l, A_l, [dst, #-16]	/* Repeat some/all of last store. */
> +4:
> +	ret
> +
> +	/*
> +	* Critical loop. Start at a new cache line boundary. Assuming
> +	* 64 bytes per line, this ensures the entire loop is in one line.
> +	*/
> +	.p2align	L1_CACHE_SHIFT
> +.Lnot_short:
> +	sub	dst, dst, #16/* Pre-bias.  */
> +	sub	count, count, #64
> +1:
> +	stp	A_l, A_l, [dst, #16]
> +	stp	A_l, A_l, [dst, #32]
> +	stp	A_l, A_l, [dst, #48]
> +	stp	A_l, A_l, [dst, #64]!
> +	subs	count, count, #64
> +	b.ge	1b
> +	tst	count, #0x3f
> +	add	dst, dst, #16
> +	b.ne	.Ltail63
> +.Lexitfunc:
> +	ret
> +
> +	/*
> +	* For zeroing memory, check to see if we can use the ZVA feature to
> +	* zero entire 'cache' lines.
> +	*/
> +.Lzero_mem:
> +	cmp	count, #63
> +	b.le	.Ltail63
> +	/*
> +	* For zeroing small amounts of memory, it's not worth setting up
> +	* the line-clear code.
> +	*/
> +	cmp	count, #128
> +	b.lt	.Lnot_short /*count is at least  128 bytes*/
> +
> +	mrs	tmp1, dczid_el0
> +	tbnz	tmp1, #4, .Lnot_short
> +	mov	tmp3w, #4
> +	and	zva_len, tmp1w, #15	/* Safety: other bits reserved.  */
> +	lsl	zva_len, tmp3w, zva_len
> +
> +	ands	tmp3w, zva_len, #63
> +	/*
> +	* ensure the zva_len is not less than 64.
> +	* It is not meaningful to use ZVA if the block size is less than 64.
> +	*/
> +	b.ne	.Lnot_short
> +.Lzero_by_line:
> +	/*
> +	* Compute how far we need to go to become suitably aligned. We're
> +	* already at quad-word alignment.
> +	*/
> +	cmp	count, zva_len_x
> +	b.lt	.Lnot_short		/* Not enough to reach alignment.  */
> +	sub	zva_bits_x, zva_len_x, #1
> +	neg	tmp2, dst
> +	ands	tmp2, tmp2, zva_bits_x
> +	b.eq	2f			/* Already aligned.  */
> +	/* Not aligned, check that there's enough to copy after alignment.*/
> +	sub	tmp1, count, tmp2
> +	/*
> +	* Guarantee that the remaining length to be zeroed by ZVA is at least 64,
> +	* so that the processing at 2f does not run past the memory range.*/
> +	cmp	tmp1, #64
> +	ccmp	tmp1, zva_len_x, #8, ge	/* NZCV=0b1000 */
> +	b.lt	.Lnot_short
> +	/*
> +	* We know that there's at least 64 bytes to zero and that it's safe
> +	* to overrun by 64 bytes.
> +	*/
> +	mov	count, tmp1
> +1:
> +	stp	A_l, A_l, [dst]
> +	stp	A_l, A_l, [dst, #16]
> +	stp	A_l, A_l, [dst, #32]
> +	subs	tmp2, tmp2, #64
> +	stp	A_l, A_l, [dst, #48]
> +	add	dst, dst, #64
> +	b.ge	1b
> +	/* We've overrun a bit, so adjust dst downwards.*/
> +	add	dst, dst, tmp2
> +2:
> +	sub	count, count, zva_len_x
> +3:
> +	dc	zva, dst
> +	add	dst, dst, zva_len_x
> +	subs	count, count, zva_len_x
> +	b.ge	3b
> +	ands	count, count, zva_bits_x
> +	b.ne	.Ltail_maybe_long
> +	ret
>  ENDPROC(memset)
> diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h
> index 3f4e7a1..9a511f2 100644
> --- a/xen/include/asm-arm/arm32/cmpxchg.h
> +++ b/xen/include/asm-arm/arm32/cmpxchg.h
> @@ -40,6 +40,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
>  	return ret;
>  }
>  
> +#define xchg(ptr,x) \
> +	((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
> +
>  /*
>   * Atomic compare and exchange.  Compare OLD with MEM, if identical,
>   * store NEW in MEM.  Return the initial value in MEM.  Success is
> diff --git a/xen/include/asm-arm/arm64/atomic.h b/xen/include/asm-arm/arm64/atomic.h
> index b5d50f2..b49219e 100644
> --- a/xen/include/asm-arm/arm64/atomic.h
> +++ b/xen/include/asm-arm/arm64/atomic.h
> @@ -136,11 +136,6 @@ static inline int __atomic_add_unless(atomic_t *v, int a, int u)
>  
>  #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
>  
> -#define smp_mb__before_atomic_dec()	smp_mb()
> -#define smp_mb__after_atomic_dec()	smp_mb()
> -#define smp_mb__before_atomic_inc()	smp_mb()
> -#define smp_mb__after_atomic_inc()	smp_mb()
> -
>  #endif
>  /*
>   * Local variables:
> diff --git a/xen/include/asm-arm/arm64/cmpxchg.h b/xen/include/asm-arm/arm64/cmpxchg.h
> index 4e930ce..ae42b2f 100644
> --- a/xen/include/asm-arm/arm64/cmpxchg.h
> +++ b/xen/include/asm-arm/arm64/cmpxchg.h
> @@ -54,7 +54,12 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
>  }
>  
>  #define xchg(ptr,x) \
> -	((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
> +({ \
> +	__typeof__(*(ptr)) __ret; \
> +	__ret = (__typeof__(*(ptr))) \
> +		__xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \
> +	__ret; \
> +})
>  
>  extern void __bad_cmpxchg(volatile void *ptr, int size);
>  
> @@ -144,17 +149,23 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
>  	return ret;
>  }
>  
> -#define cmpxchg(ptr,o,n)						\
> -	((__typeof__(*(ptr)))__cmpxchg_mb((ptr),			\
> -					  (unsigned long)(o),		\
> -					  (unsigned long)(n),		\
> -					  sizeof(*(ptr))))
> -
> -#define cmpxchg_local(ptr,o,n)						\
> -	((__typeof__(*(ptr)))__cmpxchg((ptr),				\
> -				       (unsigned long)(o),		\
> -				       (unsigned long)(n),		\
> -				       sizeof(*(ptr))))
> +#define cmpxchg(ptr, o, n) \
> +({ \
> +	__typeof__(*(ptr)) __ret; \
> +	__ret = (__typeof__(*(ptr))) \
> +		__cmpxchg_mb((ptr), (unsigned long)(o), (unsigned long)(n), \
> +			     sizeof(*(ptr))); \
> +	__ret; \
> +})
> +
> +#define cmpxchg_local(ptr, o, n) \
> +({ \
> +	__typeof__(*(ptr)) __ret; \
> +	__ret = (__typeof__(*(ptr))) \
> +		__cmpxchg((ptr), (unsigned long)(o), \
> +			  (unsigned long)(n), sizeof(*(ptr))); \
> +	__ret; \
> +})
>  
>  #endif
>  /*
> diff --git a/xen/include/asm-arm/string.h b/xen/include/asm-arm/string.h
> index 3242762..dfad1fe 100644
> --- a/xen/include/asm-arm/string.h
> +++ b/xen/include/asm-arm/string.h
> @@ -17,6 +17,11 @@ extern char * strchr(const char * s, int c);
>  #define __HAVE_ARCH_MEMCPY
>  extern void * memcpy(void *, const void *, __kernel_size_t);
>  
> +#if defined(CONFIG_ARM_64)
> +#define __HAVE_ARCH_MEMCMP
> +extern int memcmp(const void *, const void *, __kernel_size_t);
> +#endif
> +
>  /* Some versions of gcc don't have this builtin. It's non-critical anyway. */
>  #define __HAVE_ARCH_MEMMOVE
>  extern void *memmove(void *dest, const void *src, size_t n);
> diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h
> index 7aaaf50..ce3d38a 100644
> --- a/xen/include/asm-arm/system.h
> +++ b/xen/include/asm-arm/system.h
> @@ -33,9 +33,6 @@
>  
>  #define smp_wmb()       dmb(ishst)
>  
> -#define xchg(ptr,x) \
> -        ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
> -
>  /*
>   * This is used to ensure the compiler did actually allocate the register we
>   * asked it for some inline assembly sequences.  Apparently we can't trust
> 


-- 
Julien Grall

Thread overview: 13+ messages
2014-07-25 15:22 [PATCH 1/2] xen: arm: update arm64 assembly primitives to Linux v3.16-rc6 Ian Campbell
2014-07-25 15:22 ` [PATCH 2/2] xen: arm: update arm32 " Ian Campbell
2014-07-25 15:42   ` Julien Grall
2014-07-25 15:48     ` Ian Campbell
2014-07-25 15:48       ` Julien Grall
2014-07-25 16:03         ` Ian Campbell
2014-07-25 16:13           ` Ian Campbell
2014-07-25 16:20             ` Julien Grall
2014-07-25 16:17           ` Julien Grall
2014-07-25 16:23             ` Ian Campbell
2014-07-25 15:36 ` Julien Grall [this message]
2014-08-04 16:16   ` [PATCH 1/2] xen: arm: update arm64 " Ian Campbell
2014-07-25 15:43 ` Ian Campbell
