From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jerin Jacob Subject: Re: [PATCH v3] arch/arm: optimization for memcpy on AArch64 Date: Tue, 19 Dec 2017 12:54:32 +0530 Message-ID: <20171219072431.GA19364@jerin> References: <1511768985-21639-1-git-send-email-herbert.guan@arm.com> <1513565664-19509-1-git-send-email-herbert.guan@arm.com> <20171218074349.GA16659@jerin> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: "dev@dpdk.org" , nd To: Herbert Guan Return-path: Received: from NAM03-CO1-obe.outbound.protection.outlook.com (mail-co1nam03on0080.outbound.protection.outlook.com [104.47.40.80]) by dpdk.org (Postfix) with ESMTP id F08752C18 for ; Tue, 19 Dec 2017 08:24:57 +0100 (CET) Content-Disposition: inline In-Reply-To: List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" -----Original Message----- > Date: Tue, 19 Dec 2017 05:33:19 +0000 > From: Herbert Guan > To: Jerin Jacob > CC: "dev@dpdk.org" , nd > Subject: RE: [PATCH v3] arch/arm: optimization for memcpy on AArch64 > > Jerin, > > Thanks for review and comments. Please find my feedbacks below inline. > > > -----Original Message----- > > From: Jerin Jacob [mailto:jerin.jacob@caviumnetworks.com] > > Sent: Monday, December 18, 2017 15:44 > > To: Herbert Guan > > Cc: dev@dpdk.org > > Subject: Re: [PATCH v3] arch/arm: optimization for memcpy on AArch64 > > > > -----Original Message----- > > > Date: Mon, 18 Dec 2017 10:54:24 +0800 > > > From: Herbert Guan > > > To: dev@dpdk.org, jerin.jacob@caviumnetworks.com > > > CC: Herbert Guan > > > Subject: [PATCH v3] arch/arm: optimization for memcpy on AArch64 > > > X-Mailer: git-send-email 1.8.3.1 > > > > > > Signed-off-by: Herbert Guan > > > --- > > > config/common_armv8a_linuxapp | 6 + > > > .../common/include/arch/arm/rte_memcpy_64.h | 292 > > +++++++++++++++++++++ > > > 2 files changed, 298 insertions(+) > > > > > > diff --git a/config/common_armv8a_linuxapp > > > b/config/common_armv8a_linuxapp index 6732d1e..8f0cbed 100644 > > > --- a/config/common_armv8a_linuxapp > > > +++ b/config/common_armv8a_linuxapp > > > @@ -44,6 +44,12 @@ CONFIG_RTE_FORCE_INTRINSICS=y # to address > > minimum > > > DMA alignment across all arm64 implementations. > > > CONFIG_RTE_CACHE_LINE_SIZE=128 > > > > > > +# Accelarate rte_memcpy. Be sure to run unit test to determine the > > > > Additional space before "Be". Rather than just mentioning the unit test, > > mention the absolute test case name(memcpy_perf_autotest) > > > > > +# best threshold in code. Refer to notes in source file > > > > Additional space before "Refer" > > Fixed in new version. > > > > > > +# (lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h) for more > > # > > > +info. > > > +CONFIG_RTE_ARCH_ARM64_MEMCPY=n > > > + > > > CONFIG_RTE_LIBRTE_FM10K_PMD=n > > > CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n > > > CONFIG_RTE_LIBRTE_AVP_PMD=n > > > diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h > > > b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h > > > index b80d8ba..1ea275d 100644 > > > --- a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h > > > +++ b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h > > > @@ -42,6 +42,296 @@ > > > > > > #include "generic/rte_memcpy.h" > > > > > > +#ifdef RTE_ARCH_ARM64_MEMCPY > > > > See the comment below at "(GCC_VERSION < 50400)" check > > > > > +#include > > > +#include > > > + > > > +/* > > > + * The memory copy performance differs on different AArch64 micro- > > architectures. > > > + * And the most recent glibc (e.g. 2.23 or later) can provide a > > > +better memcpy() > > > + * performance compared to old glibc versions. It's always suggested > > > +to use a > > > + * more recent glibc if possible, from which the entire system can get > > benefit. > > > + * > > > + * This implementation improves memory copy on some aarch64 > > > +micro-architectures, > > > + * when an old glibc (e.g. 2.19, 2.17...) is being used. It is > > > +disabled by > > > + * default and needs "RTE_ARCH_ARM64_MEMCPY" defined to activate. > > > +It's not > > > + * always providing better performance than memcpy() so users need to > > > +run unit > > > + * test "memcpy_perf_autotest" and customize parameters in > > > +customization section > > > + * below for best performance. > > > + * > > > + * Compiler version will also impact the rte_memcpy() performance. > > > +It's observed > > > + * on some platforms and with the same code, GCC 7.2.0 compiled > > > +binaries can > > > + * provide better performance than GCC 4.8.5 compiled binaries. > > > + */ > > > + > > > +/************************************** > > > + * Beginning of customization section > > > +**************************************/ > > > +#define ALIGNMENT_MASK 0x0F > > > > This symbol will be included in public rte_memcpy.h version for arm64 DPDK > > build. > > Please use RTE_ prefix to avoid multi > > definition.(RTE_ARCH_ARM64_ALIGN_MASK ? or any shorter name) > > > Changed to RTE_AARCH64_ALIGN_MASK in new version. Since it is something to do with memcpy and arm64, I prefer, RTE_ARM64_MEMCPY_ALIGN_MASK > > > > +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN > > > +/* Only src unalignment will be treaed as unaligned copy */ #define > > > +IS_UNALIGNED_COPY(dst, src) ((uintptr_t)(dst) & ALIGNMENT_MASK) > > #else > > > +/* Both dst and src unalignment will be treated as unaligned copy */ > > > +#define IS_UNALIGNED_COPY(dst, src) \ > > > + (((uintptr_t)(dst) | (uintptr_t)(src)) & ALIGNMENT_MASK) > > #endif > > > + > > > + > > > +/* > > > + * If copy size is larger than threshold, memcpy() will be used. > > > + * Run "memcpy_perf_autotest" to determine the proper threshold. > > > + */ > > > +#define ALIGNED_THRESHOLD ((size_t)(0xffffffff)) > > > +#define UNALIGNED_THRESHOLD ((size_t)(0xffffffff)) > > > > Same as above comment. > Added RTE_AARCH64_ prefix in new version. Same as above. > > > > > > + > > > +/************************************** > > > + * End of customization section > > > + **************************************/ > > > +#ifdef RTE_TOOLCHAIN_GCC > > > +#if (GCC_VERSION < 50400) > > > +#warning "The GCC version is quite old, which may result in sub-optimal \ > > > +performance of the compiled code. It is suggested that at least GCC 5.4.0 \ > > > +be used." > > > > Even though it is warning, based on where this file get included it will > > generate error(see below) > > How about, selecting optimized memcpy when RTE_ARCH_ARM64_MEMCPY > > && if (GCC_VERSION >= 50400) ? > > > Fully understand that. While I'm not tending to make it 'silent'. GCC 4.x is just > quite old and may not provide best optimized code -- not only for DPDK app. > We can provide another option RTE_AARCH64_SKIP_GCC_VERSION_CHECK to allow > skipping the GCC version check. How do you think? I prefer to reduce the options. But, No strong opinion on this as this the RTE_ARCH_ARM64_MEMCPY option is by default disabled(ie. No risk).