From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jerin Jacob
Subject: Re: [PATCH v2] arch/arm: optimization for memcpy on AArch64
Date: Fri, 15 Dec 2017 09:11:29 +0530
Message-ID: <20171215034127.GA5874@jerin>
References: <1511768985-21639-1-git-send-email-herbert.guan@arm.com>
 <1512453723-4513-1-git-send-email-herbert.guan@arm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: dev@dpdk.org, pbhagavatula@caviumnetworks.com, jianbo.liu@arm.com
To: Herbert Guan
Return-path: 
Received: from NAM03-CO1-obe.outbound.protection.outlook.com (mail-co1nam03on0046.outbound.protection.outlook.com [104.47.40.46]) by dpdk.org (Postfix) with ESMTP id 988752B9F for ; Fri, 15 Dec 2017 04:41:50 +0100 (CET)
Content-Disposition: inline
In-Reply-To: <1512453723-4513-1-git-send-email-herbert.guan@arm.com>
List-Id: DPDK patches and discussions
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
Errors-To: dev-bounces@dpdk.org
Sender: "dev"

-----Original Message-----
> Date: Tue, 5 Dec 2017 14:02:03 +0800
> From: Herbert Guan
> To: dev@dpdk.org
> CC: jerin.jacob@caviumnetworks.com, pbhagavatula@caviumnetworks.com,
>  jianbo.liu@arm.com, Herbert Guan
> Subject: [PATCH v2] arch/arm: optimization for memcpy on AArch64
> X-Mailer: git-send-email 1.8.3.1
>
> This patch provides an option to do rte_memcpy() using the 'restrict'
> qualifier, which can induce GCC to do optimizations by using more
> efficient instructions, providing some performance gain over memcpy()
> on some AArch64 platforms/environments.
>
> The memory copy performance differs between different AArch64
> platforms. And a more recent glibc (e.g. 2.23 or later)
> can provide a better memcpy() performance compared to old glibc
> versions. It's always suggested to use a more recent glibc if
> possible, from which the entire system can benefit. If for some
> reason an old glibc has to be used, this patch is provided as an
> alternative.
>
> This implementation can improve memory copy on some AArch64
> platforms, when an old glibc (e.g. 2.19, 2.17...) is being used.
> It is disabled by default and needs "RTE_ARCH_ARM64_MEMCPY"
> defined to activate. It does not always provide better performance
> than memcpy(), so users need to run the DPDK unit test
> "memcpy_perf_autotest" and customize parameters in the "customization
> section" in rte_memcpy_64.h for best performance.
>
> Compiler version will also impact the rte_memcpy() performance.
> It's been observed on some platforms that, with the same code, a GCC
> 7.2.0 compiled binary can provide better performance than a GCC 4.8.5
> compiled one. It's suggested to use GCC 5.4.0 or later.

Description looks good.

>
> Signed-off-by: Herbert Guan
> ---
>  config/common_armv8a_linuxapp                      |   6 +
>  .../common/include/arch/arm/rte_memcpy_64.h        | 195 +++++++++++++++++++++
>  2 files changed, 201 insertions(+)
>
> diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
> index 6732d1e..158ce00 100644
> --- a/config/common_armv8a_linuxapp
> +++ b/config/common_armv8a_linuxapp
> @@ -44,6 +44,12 @@ CONFIG_RTE_FORCE_INTRINSICS=y
>  # to address minimum DMA alignment across all arm64 implementations.
>  CONFIG_RTE_CACHE_LINE_SIZE=128
>
> +# Accelerate rte_memcpy. Be sure to run unit test to determine the
> +# best threshold in code. Refer to notes in source file
> +# (lib/librte_eam/common/include/arch/arm/rte_memcpy_64.h) for more

s/librte_eam/librte_eal

> +# info.
> +CONFIG_RTE_ARCH_ARM64_MEMCPY=n
> +
>  CONFIG_RTE_LIBRTE_FM10K_PMD=n
>  CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
>  CONFIG_RTE_LIBRTE_AVP_PMD=n
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> index b80d8ba..a6ad286 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> @@ -42,6 +42,199 @@
>
>  #include "generic/rte_memcpy.h"
>
> +#ifdef RTE_ARCH_ARM64_MEMCPY
> +#include
> +#include
> +
> +/*******************************************************************************

Please remove "*******************************". Standard C comments don't have that.

> + * The memory copy performance differs on different AArch64 micro-architectures.
> + * And the most recent glibc (e.g. 2.23 or later) can provide a better memcpy()
> + * performance compared to old glibc versions. It's always suggested to use a
> + * more recent glibc if possible, from which the entire system can get benefit.
> + *
> + * This implementation improves memory copy on some aarch64 micro-architectures,
> + * when an old glibc (e.g. 2.19, 2.17...) is being used. It is disabled by
> + * default and needs "RTE_ARCH_ARM64_MEMCPY" defined to activate. It's not
> + * always providing better performance than memcpy() so users need to run unit
> + * test "memcpy_perf_autotest" and customize parameters in customization section
> + * below for best performance.
> + *
> + * Compiler version will also impact the rte_memcpy() performance. It's observed
> + * on some platforms and with the same code, GCC 7.2.0 compiled binaries can
> + * provide better performance than GCC 4.8.5 compiled binaries.
> + ******************************************************************************/
> +
> +/**************************************
> + * Beginning of customization section
> + **************************************/
> +#define ALIGNMENT_MASK 0x0F
> +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> +/* Only src unalignment will be treated as unaligned copy */
> +#define IS_UNALIGNED_COPY(dst, src) ((uintptr_t)(dst) & ALIGNMENT_MASK)
> +#else
> +/* Both dst and src unalignment will be treated as unaligned copy */
> +#define IS_UNALIGNED_COPY(dst, src) \
> +	(((uintptr_t)(dst) | (uintptr_t)(src)) & ALIGNMENT_MASK)
> +#endif
> +
> +
> +/*
> + * If copy size is larger than threshold, memcpy() will be used.
> + * Run "memcpy_perf_autotest" to determine the proper threshold.
> + */
> +#define ALIGNED_THRESHOLD ((size_t)(0xffffffff))
> +#define UNALIGNED_THRESHOLD ((size_t)(0xffffffff))
> +
> +
> +/**************************************
> + * End of customization section
> + **************************************/
> +#ifdef RTE_TOOLCHAIN_GCC
> +#if (GCC_VERSION < 50400)
> +#warning "The GCC version is quite old, which may result in sub-optimal \
> +performance of the compiled code. It is suggested that at least GCC 5.4.0 \
> +be used."
> +#endif
> +#endif
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov16(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	__uint128_t *restrict dst128 = (__uint128_t *restrict)dst;
> +	const __uint128_t *restrict src128 = (const __uint128_t *restrict)src;
> +	*dst128 = *src128;
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov32(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	__uint128_t *restrict dst128 = (__uint128_t *restrict)dst;
> +	const __uint128_t *restrict src128 = (const __uint128_t *restrict)src;
> +	dst128[0] = src128[0];
> +	dst128[1] = src128[1];
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov48(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	__uint128_t *restrict dst128 = (__uint128_t *restrict)dst;
> +	const __uint128_t *restrict src128 = (const __uint128_t *restrict)src;
> +	dst128[0] = src128[0];
> +	dst128[1] = src128[1];
> +	dst128[2] = src128[2];
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov64(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	__uint128_t *restrict dst128 = (__uint128_t *restrict)dst;
> +	const __uint128_t *restrict src128 = (const __uint128_t *restrict)src;
> +	dst128[0] = src128[0];
> +	dst128[1] = src128[1];
> +	dst128[2] = src128[2];
> +	dst128[3] = src128[3];
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov128(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	rte_mov64(dst, src);
> +	rte_mov64(dst + 64, src + 64);
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov256(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	rte_mov128(dst, src);
> +	rte_mov128(dst + 128, src + 128);
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_memcpy_lt16(uint8_t *restrict dst, const uint8_t *restrict src, size_t n)
> +{
> +	if (n & 0x08) {
> +		/* copy 8 ~ 15 bytes */
> +		*(uint64_t *)dst = *(const uint64_t *)src;
> +		*(uint64_t *)(dst - 8 + n) = *(const uint64_t *)(src - 8 + n);
> +	} else if (n & 0x04) {
> +		/* copy 4 ~ 7 bytes */
> +		*(uint32_t *)dst = *(const uint32_t *)src;
> +		*(uint32_t *)(dst - 4 + n) = *(const uint32_t *)(src - 4 + n);
> +	} else if (n & 0x02) {
> +		/* copy 2 ~ 3 bytes */
> +		*(uint16_t *)dst = *(const uint16_t *)src;
> +		*(uint16_t *)(dst - 2 + n) = *(const uint16_t *)(src - 2 + n);
> +	} else if (n & 0x01) {
> +		/* copy 1 byte */
> +		*dst = *src;
> +	}
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_memcpy_ge16_lt64
> +(uint8_t *restrict dst, const uint8_t *restrict src, size_t n)
> +{
> +	if (n == 16) {
> +		rte_mov16(dst, src);
> +	} else if (n <= 32) {
> +		rte_mov16(dst, src);
> +		rte_mov16(dst - 16 + n, src - 16 + n);
> +	} else if (n <= 48) {
> +		rte_mov32(dst, src);
> +		rte_mov16(dst - 16 + n, src - 16 + n);
> +	} else {
> +		rte_mov48(dst, src);
> +		rte_mov16(dst - 16 + n, src - 16 + n);
> +	}
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_memcpy_ge64(uint8_t *restrict dst, const uint8_t *restrict src, size_t n)
> +{
> +	do {
> +		rte_mov64(dst, src);
> +		src += 64;
> +		dst += 64;
> +		n -= 64;
> +	} while (likely(n >= 64));
> +
> +	if (likely(n)) {
> +		if (n > 48)
> +			rte_mov64(dst - 64 + n, src - 64 + n);
> +		else if (n > 32)
> +			rte_mov48(dst - 48 + n, src - 48 + n);
> +		else if (n > 16)
> +			rte_mov32(dst - 32 + n, src - 32 + n);
> +		else
> +			rte_mov16(dst - 16 + n, src - 16 + n);
> +	}
> +}
> +
> +static inline void *__attribute__ ((__always_inline__))
> +rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> +{
> +	if (n < 16) {
> +		rte_memcpy_lt16((uint8_t *)dst, (const uint8_t *)src, n);
> +		return dst;
> +	}
> +	if (n < 64) {
> +		rte_memcpy_ge16_lt64((uint8_t *)dst, (const uint8_t *)src, n);
> +		return dst;
> +	}

I have a comment here, I will reply to the original thread.