Re: [PATCH v4] arch/arm: optimization for memcpy on AArch64

From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
To: Herbert Guan <herbert.guan@arm.com>
Cc: dev@dpdk.org
Subject: Re: [PATCH v4] arch/arm: optimization for memcpy on AArch64
Date: Wed, 3 Jan 2018 19:05:15 +0530
Message-ID: <20180103133513.GA30368@jerin>
In-Reply-To: <1513834427-12635-1-git-send-email-herbert.guan@arm.com>

-----Original Message-----
> Date: Thu, 21 Dec 2017 13:33:47 +0800
> From: Herbert Guan <herbert.guan@arm.com>
> To: dev@dpdk.org, jerin.jacob@caviumnetworks.com
> CC: Herbert Guan <herbert.guan@arm.com>
> Subject: [PATCH v4] arch/arm: optimization for memcpy on AArch64
> X-Mailer: git-send-email 1.8.3.1
> 
> This patch provides an option to do rte_memcpy() using 'restrict'
> qualifier, which can induce GCC to do optimizations by using more
> efficient instructions, providing some performance gain over memcpy()
> on some AArch64 platforms/environments.
> 
> The memory copy performance differs between different AArch64
> platforms. And a more recent glibc (e.g. 2.23 or later)
> can provide a better memcpy() performance compared to old glibc
> versions. It's always suggested to use a more recent glibc if
> possible, from which the entire system can get benefit. If for some
> reason an old glibc has to be used, this patch is provided for an
> alternative.
> 
> This implementation can improve memory copy on some AArch64
> platforms, when an old glibc (e.g. 2.19, 2.17...) is being used.
> It is disabled by default and needs "RTE_ARCH_ARM64_MEMCPY"
> defined to activate. It does not always provide better performance
> than memcpy(), so users need to run the DPDK unit test
> "memcpy_perf_autotest" and customize parameters in "customization
> section" in rte_memcpy_64.h for best performance.
> 
> Compiler version will also impact the rte_memcpy() performance.
> It's observed on some platforms and with the same code, GCC 7.2.0
> compiled binary can provide better performance than GCC 4.8.5. It's
> suggested to use GCC 5.4.0 or later.
> 
> Signed-off-by: Herbert Guan <herbert.guan@arm.com>

Looks good. Please find inline requests for some minor changes.
Feel free to add my Acked-by with those changes.


> ---
>  config/common_armv8a_linuxapp                      |   6 +
>  .../common/include/arch/arm/rte_memcpy_64.h        | 287 +++++++++++++++++++++
>  2 files changed, 293 insertions(+)
> 
> diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
> index 6732d1e..8f0cbed 100644
> --- a/config/common_armv8a_linuxapp
> +++ b/config/common_armv8a_linuxapp
> @@ -44,6 +44,12 @@ CONFIG_RTE_FORCE_INTRINSICS=y
>  # to address minimum DMA alignment across all arm64 implementations.
>  CONFIG_RTE_CACHE_LINE_SIZE=128
>  
> +# Accelerate rte_memcpy.  Be sure to run unit test to determine the
> +# best threshold in code.  Refer to notes in source file
> +# (lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h) for more
> +# info.
> +CONFIG_RTE_ARCH_ARM64_MEMCPY=n
> +
>  CONFIG_RTE_LIBRTE_FM10K_PMD=n
>  CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
>  CONFIG_RTE_LIBRTE_AVP_PMD=n
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> index b80d8ba..b269f34 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> @@ -42,6 +42,291 @@
>  
>  #include "generic/rte_memcpy.h"
>  
> +#ifdef RTE_ARCH_ARM64_MEMCPY
> +#include <rte_common.h>
> +#include <rte_branch_prediction.h>
> +
> +/*
> + * The memory copy performance differs on different AArch64 micro-architectures.
> + * And the most recent glibc (e.g. 2.23 or later) can provide a better memcpy()
> + * performance compared to old glibc versions. It's always suggested to use a
> + * more recent glibc if possible, from which the entire system can get benefit.
> + *
> + * This implementation improves memory copy on some aarch64 micro-architectures,
> + * when an old glibc (e.g. 2.19, 2.17...) is being used. It is disabled by
> + * default and needs "RTE_ARCH_ARM64_MEMCPY" defined to activate. It does not
> + * always provide better performance than memcpy(), so users need to run unit
> + * test "memcpy_perf_autotest" and customize parameters in customization section
> + * below for best performance.
> + *
> + * Compiler version will also impact the rte_memcpy() performance. It's observed
> + * on some platforms and with the same code, GCC 7.2.0 compiled binaries can
> + * provide better performance than GCC 4.8.5 compiled binaries.
> + */
> +
> +/**************************************
> + * Beginning of customization section
> + **************************************/
> +#define RTE_ARM64_MEMCPY_ALIGN_MASK 0x0F
> +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> +/* Only src unalignment will be treated as unaligned copy */
> +#define IS_UNALIGNED_COPY(dst, src) \

Better to change this to RTE_ARM64_MEMCPY_IS_UNALIGNED_COPY, as it is
defined in a public DPDK header file.


> +	((uintptr_t)(dst) & RTE_ARM64_MEMCPY_ALIGN_MASK)
> +#else
> +/* Both dst and src unalignment will be treated as unaligned copy */
> +#define IS_UNALIGNED_COPY(dst, src) \
> +	(((uintptr_t)(dst) | (uintptr_t)(src)) & RTE_ARM64_MEMCPY_ALIGN_MASK)

Same as above: use the RTE_ARM64_MEMCPY_ prefix here as well.

> +#endif
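
To illustrate the rename, here is a minimal standalone sketch of both
variants under the prefixed name. The macro bodies are copied from the
patch as posted. Note that the default variant tests dst, although its
comment in the patch says src, so one of the two may need adjusting.
sketch_is_unaligned() is a hypothetical helper that exists only to
exercise the macro; it is not part of the patch.

```c
#include <stdint.h>

#define RTE_ARM64_MEMCPY_ALIGN_MASK 0x0F

#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
/* Default mode: only one address is tested (dst, as posted in v4). */
#define RTE_ARM64_MEMCPY_IS_UNALIGNED_COPY(dst, src) \
	((uintptr_t)(dst) & RTE_ARM64_MEMCPY_ALIGN_MASK)
#else
/* Strict mode: either address being unaligned counts. */
#define RTE_ARM64_MEMCPY_IS_UNALIGNED_COPY(dst, src) \
	(((uintptr_t)(dst) | (uintptr_t)(src)) & RTE_ARM64_MEMCPY_ALIGN_MASK)
#endif

/* Hypothetical helper, only here to exercise the macro. */
static inline int
sketch_is_unaligned(const void *dst, const void *src)
{
	(void)src; /* not referenced by the default-mode macro */
	return RTE_ARM64_MEMCPY_IS_UNALIGNED_COPY(dst, src) != 0;
}
```

With the 0x0F mask, any address that is not a multiple of 16 is
reported as unaligned.
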
> +
> +
> +/*
> + * If copy size is larger than threshold, memcpy() will be used.
> + * Run "memcpy_perf_autotest" to determine the proper threshold.
> + */
> +#define RTE_ARM64_MEMCPY_ALIGNED_THRESHOLD       ((size_t)(0xffffffff))
> +#define RTE_ARM64_MEMCPY_UNALIGNED_THRESHOLD     ((size_t)(0xffffffff))
> +
> +/*
> + * The logic of USE_RTE_MEMCPY() can also be modified to best fit platform.
> + */
> +#define USE_RTE_MEMCPY(dst, src, n) \
> +((!IS_UNALIGNED_COPY(dst, src) && n <= RTE_ARM64_MEMCPY_ALIGNED_THRESHOLD) \
> +|| (IS_UNALIGNED_COPY(dst, src) && n <= RTE_ARM64_MEMCPY_UNALIGNED_THRESHOLD))
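
A side note on the macro above: n is not parenthesized, so an argument
such as "a ? b : c" would not group as intended. One possible way to
avoid this is a function form, sketched below. This is only an
illustration, assuming the default (non-strict) alignment test exactly
as posted; sketch_use_rte_memcpy() is a hypothetical name and not part
of the patch. The result is the same as the macro's: n is compared
against the threshold that matches the alignment case.

```c
#include <stddef.h>
#include <stdint.h>

#define RTE_ARM64_MEMCPY_ALIGN_MASK 0x0F
#define RTE_ARM64_MEMCPY_ALIGNED_THRESHOLD   ((size_t)(0xffffffff))
#define RTE_ARM64_MEMCPY_UNALIGNED_THRESHOLD ((size_t)(0xffffffff))

/* Hypothetical function form of USE_RTE_MEMCPY(); not part of the patch. */
static inline int
sketch_use_rte_memcpy(const void *dst, const void *src, size_t n)
{
	/* Default (non-strict) alignment test, as posted in v4. */
	int unaligned = ((uintptr_t)dst & RTE_ARM64_MEMCPY_ALIGN_MASK) != 0;

	(void)src; /* only consulted in the strict-align variant */

	/* Pick the threshold for this alignment case, then compare once. */
	return unaligned ? n <= RTE_ARM64_MEMCPY_UNALIGNED_THRESHOLD
			 : n <= RTE_ARM64_MEMCPY_ALIGNED_THRESHOLD;
}
```

Because the arguments are evaluated once, expressions with side
effects are also safe to pass.
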
> +
> +
> +/**************************************
> + * End of customization section
> + **************************************/
> +#if defined(RTE_TOOLCHAIN_GCC) && !defined(RTE_AARCH64_SKIP_GCC_VERSION_CHECK)

To maintain consistency
s/RTE_AARCH64_SKIP_GCC_VERSION_CHECK/RTE_ARM64_MEMCPY_SKIP_GCC_VERSION_CHECK

> +#if (GCC_VERSION < 50400)
> +#warning "The GCC version is quite old, which may result in sub-optimal \
> +performance of the compiled code. It is suggested that at least GCC 5.4.0 \
> +be used."
> +#endif
> +#endif
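
For reference, a standalone sketch of the same check with the renamed
skip macro. It assumes that GCC_VERSION encodes the version as
major * 10000 + minor * 100 + patchlevel, which is what the 50400
comparison implies for 5.4.0. The SKETCH_ names are hypothetical, and
"__GNUC__ without __clang__" stands in for RTE_TOOLCHAIN_GCC here.

```c
/* Hypothetical stand-in for the GCC_VERSION encoding (an assumption). */
#define SKETCH_VERSION(major, minor, patch) \
	((major) * 10000 + (minor) * 100 + (patch))

/* Stand-in for RTE_TOOLCHAIN_GCC: GCC proper, not clang. */
#if defined(__GNUC__) && !defined(__clang__)
#define SKETCH_GCC_VERSION \
	SKETCH_VERSION(__GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__)
#endif

/* Warn on old GCC unless the user opted out via the renamed macro. */
#if defined(SKETCH_GCC_VERSION) && \
	!defined(RTE_ARM64_MEMCPY_SKIP_GCC_VERSION_CHECK)
#if (SKETCH_GCC_VERSION < SKETCH_VERSION(5, 4, 0))
#warning "GCC older than 5.4.0 may generate sub-optimal copy code."
#endif
#endif
```

Under this encoding GCC 4.8.5 maps to 40805 and triggers the warning,
while GCC 7.2.0 maps to 70200 and does not.
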
> +
> +static __rte_always_inline void rte_mov16(uint8_t *dst, const uint8_t *src)

Please split the declaration so that the function name starts its
own line:

static __rte_always_inline
void rte_mov16(uint8_t *dst, const uint8_t *src)

> +{
> +	__uint128_t *dst128 = (__uint128_t *)dst;
> +	const __uint128_t *src128 = (const __uint128_t *)src;
> +	*dst128 = *src128;
> +}
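
A compilable sketch of the requested declaration layout, for
illustration only. It assumes a GCC-compatible compiler.
SKETCH_ALWAYS_INLINE is a hypothetical stand-in for
__rte_always_inline, and the body swaps the patch's __uint128_t cast
for a fixed-size memcpy() so that the sketch makes no alignment or
aliasing assumptions. It is not meant to replace the body in the
patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for DPDK's __rte_always_inline. */
#define SKETCH_ALWAYS_INLINE inline __attribute__((always_inline))

static SKETCH_ALWAYS_INLINE
void sketch_mov16(uint8_t *dst, const uint8_t *src)
{
	/* Fixed-size memcpy in place of the patch's __uint128_t cast. */
	memcpy(dst, src, 16);
}
```

The same layout would apply to rte_mov32() through rte_mov256().
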
> +
> +static __rte_always_inline void rte_mov32(uint8_t *dst, const uint8_t *src)

See above

> +{
> +	__uint128_t *dst128 = (__uint128_t *)dst;
> +	const __uint128_t *src128 = (const __uint128_t *)src;
> +	const __uint128_t x0 = src128[0], x1 = src128[1];
> +	dst128[0] = x0;
> +	dst128[1] = x1;
> +}
> +
> +static __rte_always_inline void rte_mov48(uint8_t *dst, const uint8_t *src)
> +{

See above

> +	__uint128_t *dst128 = (__uint128_t *)dst;
> +	const __uint128_t *src128 = (const __uint128_t *)src;
> +	const __uint128_t x0 = src128[0], x1 = src128[1], x2 = src128[2];
> +	dst128[0] = x0;
> +	dst128[1] = x1;
> +	dst128[2] = x2;
> +}
> +
> +static __rte_always_inline void rte_mov64(uint8_t *dst, const uint8_t *src)
> +{

See above

> +	__uint128_t *dst128 = (__uint128_t *)dst;
> +	const __uint128_t *src128 = (const __uint128_t *)src;
> +	const __uint128_t
> +		x0 = src128[0], x1 = src128[1], x2 = src128[2], x3 = src128[3];
> +	dst128[0] = x0;
> +	dst128[1] = x1;
> +	dst128[2] = x2;
> +	dst128[3] = x3;
> +}
> +
> +static __rte_always_inline void rte_mov128(uint8_t *dst, const uint8_t *src)
> +{

See above

> +	__uint128_t *dst128 = (__uint128_t *)dst;
> +	const __uint128_t *src128 = (const __uint128_t *)src;
> +	/* Keep below declaration & copy sequence for optimized instructions */
> +	const __uint128_t
> +		x0 = src128[0], x1 = src128[1], x2 = src128[2], x3 = src128[3];
> +	dst128[0] = x0;
> +	__uint128_t x4 = src128[4];
> +	dst128[1] = x1;
> +	__uint128_t x5 = src128[5];
> +	dst128[2] = x2;
> +	__uint128_t x6 = src128[6];
> +	dst128[3] = x3;
> +	__uint128_t x7 = src128[7];
> +	dst128[4] = x4;
> +	dst128[5] = x5;
> +	dst128[6] = x6;
> +	dst128[7] = x7;
> +}
> +
> +static __rte_always_inline void rte_mov256(uint8_t *dst, const uint8_t *src)
> +{

See above
