Subject: RE: [PATCH v7] eal/x86: optimize memcpy of small sizes
Date: Wed, 11 Mar 2026 19:29:38 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35F6577E@smartserver.smartshare.dk>
References: <20251120114554.950287-1-mb@smartsharesystems.com> <20260220110824.235784-1-mb@smartsharesystems.com>
From: Morten Brørup
To: "Bruce Richardson"
Cc: "Konstantin Ananyev", "Vipin Varghese", "Stephen Hemminger", "Liangxing Wang", "Thiyagarajan P", "Bala Murali Krishna"
List-Id: DPDK patches and discussions

> From: Bruce Richardson
> [mailto:bruce.richardson@intel.com]
> Sent: Wednesday, 11 March 2026 17.59
>
> On Fri, Feb 20, 2026 at 11:08:24AM +0000, Morten Brørup wrote:
> > The implementation for copying up to 64 bytes does not depend on address
> > alignment with the size of the CPU's vector registers. Nonetheless, the
> > exact same code for copying up to 64 bytes was present in both the
> > aligned copy function and all the CPU vector register size specific
> > variants of the unaligned copy functions.
> > With this patch, the implementation for copying up to 64 bytes was
> > consolidated into one instance, located in the common copy function,
> > before checking alignment requirements.
> > This provides three benefits:
> > 1. No copy-paste in the source code.
> > 2. A performance gain for copying up to 64 bytes, because the
> > address alignment check is avoided in this case.
> > 3. Reduced instruction memory footprint, because the compiler only
> > generates one instance of the function for copying up to 64 bytes,
> > instead of two instances (one in the unaligned copy function, and one
> > in the aligned copy function).
> >
> > Furthermore, the function for copying less than 16 bytes was replaced
> > with a smarter implementation using fewer branches and potentially
> > fewer load/store operations.
> > This function was also extended to handle copying of up to 16 bytes,
> > instead of up to 15 bytes.
> > This small extension reduces the code path, and thus improves the
> > performance, for copying two pointers on 64-bit architectures and four
> > pointers on 32-bit architectures.
> >
> > Also, __rte_restrict was added to source and destination addresses.
> >
> > And finally, the missing implementation of rte_mov48() was added.
> >
> > Regarding performance, the memcpy performance test showed cache-to-cache
> > copying of up to 32 bytes now takes 2 cycles, versus ca. 6.5 cycles
> > before this patch.
> > Copying 64 bytes now takes 4 cycles, versus 7 cycles before.
> >
> > Signed-off-by: Morten Brørup
> > ---
> > v7:
> > * Updated patch description. Mainly to clarify that the changes related
> >   to copying up to 64 bytes simply replace multiple instances of
> >   copy-pasted code with one common instance.
> > * Fixed copy of build time known 16 bytes in rte_mov17_to_32(). (Vipin)
> > * Rebased.
> > v6:
> > * Went back to using rte_uintN_alias structures for copying instead of
> >   using memcpy(). They were there for a reason.
> >   (Inspired by the discussion about optimizing the checksum function.)
> > * Removed note about copying uninitialized data.
> > * Added __rte_restrict to source and destination addresses.
> >   Updated function descriptions from "should" to "must" not overlap.
> > * Changed rte_mov48() AVX implementation to copy 32+16 bytes instead of
> >   copying 32 + 32 overlapping bytes. (Konstantin)
> > * Ignoring "-Wstringop-overflow" is not needed, so it was removed.
> > v5:
> > * Reverted v4: Replace SSE2 _mm_loadu_si128() with SSE3
> >   _mm_lddqu_si128(). It was slower.
> > * Improved some comments. (Konstantin Ananyev)
> > * Moved the size range 17..32 inside the size <= 64 branch, so when
> >   building for SSE, the generated code can start copying the first
> >   16 bytes before comparing if the size is greater than 32 or not.
> > * Just require RTE_MEMCPY_AVX for using rte_mov32() in
> >   rte_mov33_to_64().
> > v4:
> > * Replaced SSE2 _mm_loadu_si128() with SSE3 _mm_lddqu_si128().
> > v3:
> > * Fixed typo in comment.
> > v2:
> > * Updated patch title to reflect that the performance is improved.
> > * Use the design pattern of two overlapping stores for small copies too.
> > * Expanded first branch from size < 16 to size <= 16.
> > * Handle more build time constant copy sizes.
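The "two overlapping stores" pattern from the v2 notes can be sketched minimally as follows. This is not the patch's code (per the v6 notes, the patch uses rte_uintN_alias structures); the helper name is hypothetical, and the fixed-size memcpy() calls are illustrative stand-ins, relying on compilers lowering constant-size memcpy() to single load/store instructions:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy 9..16 bytes without a byte loop or a size
 * branch, by loading the first and the last 8 bytes and storing both.
 * For n < 16 the two stores overlap, rewriting the overlapping bytes
 * with identical data, so the result is still correct. */
static inline void
copy9_to_16(void *dst, const void *src, size_t n)
{
	uint64_t head, tail;

	memcpy(&head, src, 8);                          /* first 8 bytes */
	memcpy(&tail, (const uint8_t *)src + n - 8, 8); /* last 8 bytes */
	memcpy(dst, &head, 8);
	memcpy((uint8_t *)dst + n - 8, &tail, 8);
}
```

The same trick generalizes to other size ranges (e.g. 17..32 bytes with two 16-byte vector accesses), which is how fewer branches and fewer load/store operations are achieved for small copies.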
> > ---
> >  lib/eal/x86/include/rte_memcpy.h | 526 ++++++++++++++++++++-----------
> >  1 file changed, 348 insertions(+), 178 deletions(-)
> >
>
> I'm a little unhappy to see the amount of memcpy code growing rather than
> shrinking, but since it improves performance I'm OK with it. We should
> keep it under constant review though.

Agree!

I just counted; 149 of the added lines are for handling __rte_constant(n).
So it's not as bad as it looks. But it is still growing, which was not the
intention.

When I started working on this patch, the intention was to consolidate the
copy-pasted instances for handling up to 64 bytes into one instance. This
should have reduced the amount of code. But then it somehow grew anyway.

>
> > diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> > index 46d34b8081..ed8e5f8dc4 100644
> > --- a/lib/eal/x86/include/rte_memcpy.h
> > +++ b/lib/eal/x86/include/rte_memcpy.h
> > @@ -22,11 +22,6 @@
> >  extern "C" {
> >  #endif
> >
> > -#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> > -#pragma GCC diagnostic push
> > -#pragma GCC diagnostic ignored "-Wstringop-overflow"
> > -#endif
> > -
> >  /*
> >   * GCC older than version 11 doesn't compile AVX properly, so use SSE instead.
> >   * There are no problems with AVX2.
> > @@ -40,9 +35,6 @@ extern "C" {
> >  /**
> >   * Copy bytes from one location to another. The locations must not overlap.
> >   *
> > - * @note This is implemented as a macro, so it's address should not be taken
> > - * and care is needed as parameter expressions may be evaluated multiple times.
> > - *
>
> I'd be wary about completely removing this comment, as we may well want
> to go back to a macro in the future, e.g. if we decide to remove the
> custom rte_memcpy altogether. Therefore, rather than removing the
> comment, can we tweak it to say "This may be implemented as a macro..."
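The consolidation Morten describes above can be sketched roughly as follows. This is a hypothetical shape, not the patch itself; copy_le_64(), copy_aligned() and copy_unaligned() are illustrative stand-ins (implemented here with plain memcpy()) for the real size- and alignment-specialized helpers:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the real specialized copy helpers. */
static inline void *
copy_le_64(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
static inline void *
copy_aligned(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
static inline void *
copy_unaligned(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

/* Hypothetical sketch of the dispatch order after the patch: the n <= 64
 * path is alignment-agnostic and checked first, so small copies never pay
 * for the alignment test; only larger copies branch on whether both
 * addresses are aligned to the vector register size. */
static inline void *
memcpy_dispatch(void *dst, const void *src, size_t n)
{
	if (n <= 64)
		return copy_le_64(dst, src, n);
	if ((((uintptr_t)dst | (uintptr_t)src) & 31) == 0) /* e.g. 32-byte vectors */
		return copy_aligned(dst, src, n);
	return copy_unaligned(dst, src, n);
}
```

Because the n <= 64 body exists only once, the compiler emits one instance of it instead of duplicating it in both the aligned and unaligned functions, which is where the footprint reduction comes from.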
The comment is still present in the "generic" header file used for the
Doxygen documentation:
https://elixir.bootlin.com/dpdk/v26.03-rc1/source/lib/eal/include/generic/rte_memcpy.h#L99

All other architectures rely on the "generic" header file, and have no
Doxygen comments at all.
We could also remove them from the x86 implementation. That would shrink
the file even more. ;-)
But I'd rather keep the comments - at least for now.

>
>
> Acked-by: Bruce Richardson

Thank you for the quick response, Bruce.

>
> PS: If we want a little further cleanup, I'd consider removing the
> RTE_MEMCPY_AVX macro and replacing it with a straight check for __AVX2__.
> CPUs with AVX2 were introduced in 2013, and checking Claude and Wikipedia
> says that AMD parts started having it in 2015, meaning that there were
> only a few generations of CPUs >10 years ago which had AVX but not AVX2.
> [There were later CPUs, e.g. lower-end parts, which didn't have AVX2, but
> they didn't have AVX1 either, so SSE is the only choice there.]
> Not a big cleanup if we did remove it, but sometimes every little helps!

Good idea. But let's not do it now.
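For illustration, the simplification suggested in the PS would amount to something like the following sketch. The macro name is hypothetical, and the real condition in rte_memcpy.h may differ:

```c
/* Hypothetical sketch of gating the 32-byte copy paths directly on
 * __AVX2__ instead of a separate RTE_MEMCPY_AVX macro. COPY_USE_AVX is
 * an illustrative name, not a real DPDK macro. */
#ifdef __AVX2__
#define COPY_USE_AVX 1	/* AVX2 implies AVX; use the 32-byte paths */
#else
#define COPY_USE_AVX 0	/* no AVX2: fall back to the SSE paths */
#endif
```

The trade-off discussed above is that AVX-without-AVX2 parts (roughly 2011-2015 era) would silently drop from the AVX path to the SSE path.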