From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jerin Jacob
Subject: Re: [PATCH v2] arch/arm: optimization for memcpy on AArch64
Date: Fri, 15 Dec 2017 09:11:29 +0530
Message-ID: <20171215034127.GA5874@jerin>
References: <1511768985-21639-1-git-send-email-herbert.guan@arm.com>
 <1512453723-4513-1-git-send-email-herbert.guan@arm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: dev@dpdk.org, pbhagavatula@caviumnetworks.com, jianbo.liu@arm.com
To: Herbert Guan
Return-path: 
Received: from NAM03-CO1-obe.outbound.protection.outlook.com (mail-co1nam03on0046.outbound.protection.outlook.com [104.47.40.46]) by dpdk.org (Postfix) with ESMTP id 988752B9F for ; Fri, 15 Dec 2017 04:41:50 +0100 (CET)
Content-Disposition: inline
In-Reply-To: <1512453723-4513-1-git-send-email-herbert.guan@arm.com>
List-Id: DPDK patches and discussions
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
Errors-To: dev-bounces@dpdk.org
Sender: "dev"

-----Original Message-----
> Date: Tue, 5 Dec 2017 14:02:03 +0800
> From: Herbert Guan
> To: dev@dpdk.org
> CC: jerin.jacob@caviumnetworks.com, pbhagavatula@caviumnetworks.com,
>  jianbo.liu@arm.com, Herbert Guan
> Subject: [PATCH v2] arch/arm: optimization for memcpy on AArch64
> X-Mailer: git-send-email 1.8.3.1
>
> This patch provides an option to do rte_memcpy() using the 'restrict'
> qualifier, which can induce GCC to do optimizations by using more
> efficient instructions, providing some performance gain over memcpy()
> on some AArch64 platforms/environments.
>
> The memory copy performance differs between different AArch64
> platforms. And a more recent glibc (e.g. 2.23 or later)
> can provide a better memcpy() performance compared to old glibc
> versions. It's always suggested to use a more recent glibc if
> possible, from which the entire system can benefit. If for some
> reason an old glibc has to be used, this patch is provided as an
> alternative.
>
> This implementation can improve memory copy on some AArch64
> platforms, when an old glibc (e.g. 2.19, 2.17...) is being used.
> It is disabled by default and needs "RTE_ARCH_ARM64_MEMCPY"
> defined to activate. It does not always provide better performance
> than memcpy(), so users need to run the DPDK unit test
> "memcpy_perf_autotest" and customize parameters in the "customization
> section" in rte_memcpy_64.h for best performance.
>
> Compiler version will also impact the rte_memcpy() performance.
> It's been observed on some platforms that, with the same code, a GCC
> 7.2.0 compiled binary can provide better performance than a GCC 4.8.5
> compiled one. It's suggested to use GCC 5.4.0 or later.

Description looks good.

>
> Signed-off-by: Herbert Guan
> ---
>  config/common_armv8a_linuxapp                      |   6 +
>  .../common/include/arch/arm/rte_memcpy_64.h        | 195 +++++++++++++++++++++
>  2 files changed, 201 insertions(+)
>
> diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
> index 6732d1e..158ce00 100644
> --- a/config/common_armv8a_linuxapp
> +++ b/config/common_armv8a_linuxapp
> @@ -44,6 +44,12 @@ CONFIG_RTE_FORCE_INTRINSICS=y
>  # to address minimum DMA alignment across all arm64 implementations.
>  CONFIG_RTE_CACHE_LINE_SIZE=128
>
> +# Accelerate rte_memcpy. Be sure to run unit test to determine the
> +# best threshold in code. Refer to notes in source file
> +# (lib/librte_eam/common/include/arch/arm/rte_memcpy_64.h) for more

s/librte_eam/librte_eal

> +# info.
> +CONFIG_RTE_ARCH_ARM64_MEMCPY=n
> +
>  CONFIG_RTE_LIBRTE_FM10K_PMD=n
>  CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
>  CONFIG_RTE_LIBRTE_AVP_PMD=n
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> index b80d8ba..a6ad286 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> @@ -42,6 +42,199 @@
>
>  #include "generic/rte_memcpy.h"
>
> +#ifdef RTE_ARCH_ARM64_MEMCPY
> +#include
> +#include
> +
> +/*******************************************************************************

Please remove "*******************************". Standard C comments don't have that.

> + * The memory copy performance differs on different AArch64 micro-architectures.
> + * And the most recent glibc (e.g. 2.23 or later) can provide a better memcpy()
> + * performance compared to old glibc versions. It's always suggested to use a
> + * more recent glibc if possible, from which the entire system can get benefit.
> + *
> + * This implementation improves memory copy on some aarch64 micro-architectures,
> + * when an old glibc (e.g. 2.19, 2.17...) is being used. It is disabled by
> + * default and needs "RTE_ARCH_ARM64_MEMCPY" defined to activate. It's not
> + * always providing better performance than memcpy() so users need to run unit
> + * test "memcpy_perf_autotest" and customize parameters in customization section
> + * below for best performance.
> + *
> + * Compiler version will also impact the rte_memcpy() performance. It's observed
> + * on some platforms and with the same code, GCC 7.2.0 compiled binaries can
> + * provide better performance than GCC 4.8.5 compiled binaries.
> + ******************************************************************************/
> +
> +/**************************************
> + * Beginning of customization section
> + **************************************/
> +#define ALIGNMENT_MASK 0x0F
> +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> +/* Only src unalignment will be treated as unaligned copy */
> +#define IS_UNALIGNED_COPY(dst, src) ((uintptr_t)(dst) & ALIGNMENT_MASK)
> +#else
> +/* Both dst and src unalignment will be treated as unaligned copy */
> +#define IS_UNALIGNED_COPY(dst, src) \
> +	(((uintptr_t)(dst) | (uintptr_t)(src)) & ALIGNMENT_MASK)
> +#endif
> +
> +
> +/*
> + * If copy size is larger than threshold, memcpy() will be used.
> + * Run "memcpy_perf_autotest" to determine the proper threshold.
> + */
> +#define ALIGNED_THRESHOLD ((size_t)(0xffffffff))
> +#define UNALIGNED_THRESHOLD ((size_t)(0xffffffff))
> +
> +
> +/**************************************
> + * End of customization section
> + **************************************/
> +#ifdef RTE_TOOLCHAIN_GCC
> +#if (GCC_VERSION < 50400)
> +#warning "The GCC version is quite old, which may result in sub-optimal \
> +performance of the compiled code. It is suggested that at least GCC 5.4.0 \
> +be used."
> +#endif
> +#endif
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov16(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	__uint128_t *restrict dst128 = (__uint128_t *restrict)dst;
> +	const __uint128_t *restrict src128 = (const __uint128_t *restrict)src;
> +	*dst128 = *src128;
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov32(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	__uint128_t *restrict dst128 = (__uint128_t *restrict)dst;
> +	const __uint128_t *restrict src128 = (const __uint128_t *restrict)src;
> +	dst128[0] = src128[0];
> +	dst128[1] = src128[1];
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov48(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	__uint128_t *restrict dst128 = (__uint128_t *restrict)dst;
> +	const __uint128_t *restrict src128 = (const __uint128_t *restrict)src;
> +	dst128[0] = src128[0];
> +	dst128[1] = src128[1];
> +	dst128[2] = src128[2];
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov64(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	__uint128_t *restrict dst128 = (__uint128_t *restrict)dst;
> +	const __uint128_t *restrict src128 = (const __uint128_t *restrict)src;
> +	dst128[0] = src128[0];
> +	dst128[1] = src128[1];
> +	dst128[2] = src128[2];
> +	dst128[3] = src128[3];
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov128(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	rte_mov64(dst, src);
> +	rte_mov64(dst + 64, src + 64);
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov256(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	rte_mov128(dst, src);
> +	rte_mov128(dst + 128, src + 128);
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_memcpy_lt16(uint8_t *restrict dst, const uint8_t *restrict src, size_t n)
> +{
> +	if (n & 0x08) {
> +		/* copy 8 ~ 15 bytes */
> +		*(uint64_t *)dst = *(const uint64_t *)src;
> +		*(uint64_t *)(dst - 8 + n) = *(const uint64_t *)(src - 8 + n);
> +	} else if (n & 0x04) {
> +		/* copy 4 ~ 7 bytes */
> +		*(uint32_t *)dst = *(const uint32_t *)src;
> +		*(uint32_t *)(dst - 4 + n) = *(const uint32_t *)(src - 4 + n);
> +	} else if (n & 0x02) {
> +		/* copy 2 ~ 3 bytes */
> +		*(uint16_t *)dst = *(const uint16_t *)src;
> +		*(uint16_t *)(dst - 2 + n) = *(const uint16_t *)(src - 2 + n);
> +	} else if (n & 0x01) {
> +		/* copy 1 byte */
> +		*dst = *src;
> +	}
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_memcpy_ge16_lt64
> +(uint8_t *restrict dst, const uint8_t *restrict src, size_t n)
> +{
> +	if (n == 16) {
> +		rte_mov16(dst, src);
> +	} else if (n <= 32) {
> +		rte_mov16(dst, src);
> +		rte_mov16(dst - 16 + n, src - 16 + n);
> +	} else if (n <= 48) {
> +		rte_mov32(dst, src);
> +		rte_mov16(dst - 16 + n, src - 16 + n);
> +	} else {
> +		rte_mov48(dst, src);
> +		rte_mov16(dst - 16 + n, src - 16 + n);
> +	}
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_memcpy_ge64(uint8_t *restrict dst, const uint8_t *restrict src, size_t n)
> +{
> +	do {
> +		rte_mov64(dst, src);
> +		src += 64;
> +		dst += 64;
> +		n -= 64;
> +	} while (likely(n >= 64));
> +
> +	if (likely(n)) {
> +		if (n > 48)
> +			rte_mov64(dst - 64 + n, src - 64 + n);
> +		else if (n > 32)
> +			rte_mov48(dst - 48 + n, src - 48 + n);
> +		else if (n > 16)
> +			rte_mov32(dst - 32 + n, src - 32 + n);
> +		else
> +			rte_mov16(dst - 16 + n, src - 16 + n);
> +	}
> +}
> +
> +static inline void *__attribute__ ((__always_inline__))
> +rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> +{
> +	if (n < 16) {
> +		rte_memcpy_lt16((uint8_t *)dst, (const uint8_t *)src, n);
> +		return dst;
> +	}
> +	if (n < 64) {
> +		rte_memcpy_ge16_lt64((uint8_t *)dst, (const uint8_t *)src, n);
> +		return dst;
> +	}

I have a comment here, I will reply to the original thread.