From: Stephen Hemminger <stephen@networkplumber.org>
To: "Morten Brørup" <mb@smartsharesystems.com>
Cc: dev@dpdk.org, Bruce Richardson <bruce.richardson@intel.com>,
Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>,
Vipin Varghese <vipin.varghese@amd.com>,
Liangxing Wang <wangliangxing@hygon.cn>,
Thiyagarajan P <Thiyagarajan.P@amd.com>,
Bala Murali Krishna <Bala.MuraliKrishna@amd.com>,
Anatoly Burakov <anatoly.burakov@intel.com>,
Vladimir Medvedkin <vladimir.medvedkin@intel.com>,
Konstantin Ananyev <konstantin.ananyev@huawei.com>
Subject: Re: [PATCH v11] eal/x86: optimize memcpy of small sizes
Date: Thu, 21 May 2026 15:42:14 -0700 [thread overview]
Message-ID: <20260521154214.1c171a74@phoenix.local> (raw)
In-Reply-To: <20260521185631.116046-1-mb@smartsharesystems.com>
On Thu, 21 May 2026 18:56:31 +0000
Morten Brørup <mb@smartsharesystems.com> wrote:
> The implementation for copying up to 64 bytes does not depend on address
> alignment with the size of the CPU's vector registers. Nonetheless, the
> exact same code for copying up to 64 bytes was present in both the aligned
> copy function and all the CPU vector register size specific variants of
> the unaligned copy functions.
> With this patch, the implementation for copying up to 64 bytes was
> consolidated into one instance, located in the common copy function,
> before checking alignment requirements.
> This provides three benefits:
> 1. No copy-paste in the source code.
> 2. A performance gain for copying up to 64 bytes, because the
> address alignment check is avoided in this case.
> 3. Reduced instruction memory footprint, because the compiler only
> generates one instance of the function for copying up to 64 bytes, instead
> of two instances (one in the unaligned copy function, and one in the
> aligned copy function).
>
> Furthermore, __rte_restrict was added to source and destination addresses.
>
> Also, the missing implementation of rte_mov48() was added.
>
> Until recently, some drivers required disabling stringop-overflow warnings
> when using rte_memcpy().
> For some strange reason, these warnings were disabled in the rte_memcpy
> header file, instead of in the problematic drivers.
> With series-38174 ("remove use of rte_memcpy from net/intel"), the
> problematic drivers were updated to use memcpy() instead of rte_memcpy(),
> so disabling these warnings is no longer required, and was removed.
>
> Regarding performance...
> The memcpy performance test (cache-to-cache copy) shows:
> Copying up to 15 bytes takes ca. 4.5 cycles, versus ca. 6.5 cycles before.
> Copying 8 bytes takes 4 cycles, versus 7 cycles before.
> Copying 16 bytes takes 2 cycles, versus 4 cycles before.
> Copying 64 bytes takes 4 cycles, versus 7 cycles before.
>
> Depends-on: series-38174 ("remove use of rte_memcpy from net/intel")
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> ---
Here is the full wordy all providers reviews.
Mostly this is test to make sure script works.
$ ./devtools/ai/compare-patch-reviews.sh ~/Downloads/v11-eal-x86-optimize-memcpy-of-small-sizes.patch
Reviewing patch: v11-eal-x86-optimize-memcpy-of-small-sizes.patch
Providers: anthropic,openai,xai,google
Format: text
========================================
>>> Running review with: anthropic
=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Claude) ===
Review date: 2026-05-21
# Review of v11-eal-x86-optimize-memcpy-of-small-sizes.patch
## Correctness Issues
### Errors
**Use-after-free pattern in overlapping copy optimization (lines 672-673, 684, 704, 707)**
The pattern `(uint8_t *)dst - 16 + n` performs pointer arithmetic that, when `n < 16`, results in accessing memory before the `dst` pointer. This is used in an optimization technique (two overlapping stores) that is correct when `n >= 16`, but the code paths in question guarantee `n` is in specific ranges where this is safe. However, the readability is poor and the intent is not obvious from the code.
**Confidence: 80%** - This pattern is widely used in memcpy implementations and appears intentional, but deserves a comment explaining the technique to avoid confusion in future reviews.
---
## Warnings
**Missing documentation for new function rte_mov48() (line 130)**
The patch adds `rte_mov48()` but does not mention it in the commit message's "Also, the missing implementation of rte_mov48() was added." The implementation itself is correct, but there's no context about why it was missing or whether it's used elsewhere in DPDK.
**Suggestion:** Verify that `rte_mov48()` is actually used somewhere, or document why it's being added speculatively.
---
**Inconsistent handling of compile-time constant size (lines 682-683)**
The code checks `__rte_constant(n) && n == 16` to avoid a "harmless duplicate copy" but does not apply this optimization consistently. For example:
- Line 682: avoids duplicate for `n == 16`
- Line 690: avoids duplicate for `n == 32`
- Line 693: avoids duplicate for `n == 64`
But in the 33-64 byte range (lines 695-707), there's no similar optimization. This is not a bug, but the inconsistency in optimization strategy could be noted.
**Suggestion:** Add a comment explaining the rationale for when this optimization matters vs when the "harmless duplicate" is acceptable.
---
**Potential alignment assumption in AVX path (line 699)**
The AVX path at line 699 uses two overlapping `rte_mov32()` calls for sizes 33-64 bytes:
```c
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
```
For `n = 33`, the second call accesses `dst + 1` and `src + 1`, which may be misaligned for the 32-byte `__m256i` operations. However, `_mm256_loadu_si256()` and `_mm256_storeu_si256()` are documented as supporting unaligned access, so this is safe.
**Confidence: 90%** - Not a bug, but worth noting that the AVX path relies on unaligned load/store support.
---
## Style and Process Issues
### Warnings
**Unconventional comment style in rte_mov15_or_less() (lines 67-72)**
The multi-line comment explaining the rationale for the `rte_uint*_alias` structures does not follow the standard DPDK comment block format with the `*` on each line. However, this may be intentional for a technical explanation.
---
**Return value inconsistency (lines 682, 690-691, 693, 707-708)**
The function returns `dst` in some branches but the original `ret` pointer in others:
- Lines 682, 690, 707: return `dst`
- Line 714: return from `rte_memcpy_aligned_more_than_64(dst, src, n)` which returns `ret = dst`
- Line 716: return from `rte_memcpy_generic_more_than_64(dst, src, n)` which returns `ret = dst`
Both `ret` and `dst` point to the same location, so this is harmless, but using `dst` consistently would be clearer. The patch description states "Pointer to the destination data" is returned, and both satisfy this.
**Confidence: 95%** - Not an error, but inconsistent style within the same function.
---
## Documentation Issues
### Warnings
**Missing documentation update for removed warning suppression (lines removed at top)**
The patch removes:
```c
#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wstringop-overflow"
#endif
```
The commit message explains this is because the problematic drivers were updated, with a `Depends-on: series-38174`. However, there's no comment in the code explaining why the suppression was removed or a note in `doc/guides/rel_notes/` documenting this change.
**Suggestion:** Add a release note entry documenting the removal of the warning suppression and the dependency on the driver updates.
---
## Summary
**Priority findings:**
1. **(Info)** The overlapping copy optimization `dst - 16 + n` is correct but deserves a comment explaining the technique.
2. **(Warning)** Missing documentation for why `rte_mov48()` was added.
3. **(Warning)** Missing release notes entry for removal of stringop-overflow warning suppression.
4. **(Info)** Return value uses both `dst` and `ret` inconsistently (harmless but could be unified).
**Overall assessment:** The patch appears functionally correct. The main concerns are documentation completeness (missing context for `rte_mov48()`, missing release notes for warning removal) and code clarity (overlapping copy pattern could use explanatory comments). No blocking correctness issues were identified.
(anthropic: review reported errors)
========================================
>>> Running review with: openai
=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via ChatGPT) ===
Review date: 2026-05-21
**DPDK Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch**
---
### Errors
None found.
- The patch does not introduce resource leaks, use-after-free, race conditions, or missing synchronization.
- There are no buffer overflows or out-of-bounds accesses in the memcpy routines; all copy size calculations and code paths for small copies (<64 bytes) are explicit and conservative.
- No uninitialized or stale variable use detected.
- All error paths are clean (these are computational functions; return values are always valid).
---
### Warnings
1. **Potential inconsistent comment phrasing regarding overlap requirements.**
- Some Doxygen comments use "locations should not overlap" while others now correctly say "locations must not overlap".
- While this isn't a correctness bug, consistent strong wording clarifies requirement for users.
**Suggested fix:** Standardize all relevant Doxygen comments to use "locations must not overlap."
```c
/**
* Copy bytes from one location to another,
* locations must not overlap.
*/
```
2. **Redundant Doxygen param alignment in function headers.**
- Some parameter descriptions ("Pointer to the destination of the data"/"Pointer to the destination data") remain a bit unclear or inconsistent. If updating comments, consider making param descriptions exact and explicit.
**Suggested fix:** For each `@param` clarify if it's the destination or source pointer. For `rte_memcpy`, `"Pointer to the destination."` and `"Pointer to the source."` suffice.
---
### Info
- Use of `__rte_restrict` is correct and improves optimization; matches DPDK style for pointer restrict qualifiers.
- Removal of `#pragma GCC diagnostic ignored "-Wstringop-overflow"` is correct because the problematic drivers now use standard `memcpy()`.
- All new `rte_mov*` functions properly use `static __rte_always_inline` and restrict-qualified pointers.
- Macro duplication is avoided by consolidation, in line with maintainability guidelines.
- No forbidden tokens, banned constructs, or use of deprecated APIs/macros were added.
- No symbol namespace pollution or missing `rte_`/`RTE_` prefix issues present.
- Coding style is maintained in indentation, tabs use, and function/variable names.
- No API, public header, or C11 atomic/volatile/race issues are present.
- No documentation or API tag requirements are triggered (no public API change).
- Patch is self-contained and does not require changes to release notes or test infrastructure.
- No RST or external documentation is involved in this patch.
---
### Summary
**No correctness/blocking issues found.**
Minor documentation consistency updates suggested for comment clarity, but these do not affect correctness.
**This patch is in good shape and complies with DPDK coding and architectural guidelines.**
(openai: review reported errors)
========================================
>>> Running review with: xai
=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Grok) ===
Review date: 2026-05-21
**Review of patch: v11-eal-x86-optimize-memcpy-of-small-sizes.patch**
### Errors
**1. Missing `__rte_restrict` in rte_mov48() implementation (consistency / correctness risk)**
The new `rte_mov48()` function is declared with `__rte_restrict`, but the implementation casts away the restrict qualifier when calling `rte_mov32()` and `rte_mov16()`:
```c
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
```
**Suggested fix**: Keep the restrict-qualified parameters in the implementation or cast only after documenting the aliasing contract. This is a minor style inconsistency but worth fixing for uniformity.
**2. Potential sign-extension issue on 64-bit offset calculations (left-shift of narrow type)**
In `rte_mov15_or_less()`, the expression `(n & 8)` is `size_t` (ok), but the subsequent pointer arithmetic:
```c
*(uint64_t *)((uint8_t *)dst + n - 8) = *(const uint64_t *)((const uint8_t *)src + n - 8);
```
relies on implicit promotion. While currently safe, the pattern of adding a `uint8_t`/`size_t` value to a pointer after a narrow mask (`n & 7`) could trigger Coverity SIGN_EXTENSION warnings on some toolchains. Consider an explicit cast:
```c
uint8_t *d = (uint8_t *)dst + (size_t)(n - 8);
```
**3. Use of `__rte_constant` inside hot path without documented behavior**
The code repeatedly uses `__rte_constant(n)` to optimize compile-time known sizes. This is a DPDK internal macro. While acceptable for performance-critical code, it should be documented (or at least noted) that the optimization only triggers for constant `n`. No correctness bug, but worth a comment.
### Warnings
**1. Duplicate function definitions for AVX/SSE variants**
The patch introduces three nearly identical `rte_memcpy_generic_more_than_64()` and `rte_memcpy_aligned_more_than_64()` functions (one per vector size). This increases maintenance burden and instruction cache pressure. Consider factoring the common >64-byte logic into a single helper.
**2. Hard-coded cast chains in rte_mov48()**
The SSE implementation of `rte_mov48()` uses explicit `(uint8_t *)` casts on restrict-qualified pointers. While functionally correct, it reduces readability. A cleaner pattern would be:
```c
rte_mov16(dst + 0, src + 0);
rte_mov16(dst + 16, src + 16);
rte_mov16(dst + 32, src + 32);
```
**3. Inconsistent use of `return dst` vs `return ret`**
In the new `rte_memcpy()` implementation the fast path returns `dst`, while the old `rte_memcpy_generic()` returned the saved `ret`. Although `dst` and `ret` are identical, using `ret` consistently would avoid any future divergence.
**4. Missing Doxygen for new rte_mov48()**
The new function has a Doxygen comment, but the description is minimal. Consider adding a brief note that it is an internal helper (or mark it `__rte_internal` if it should not be public).
### Info
- The removal of the `#pragma GCC diagnostic` block for `-Wstringop-overflow` is correct and aligns with the series that removed `rte_memcpy` usage from drivers.
- Adding `__rte_restrict` to `rte_memcpy()` and helper functions is a good improvement and matches modern DPDK style.
- Performance claims in the commit message are outside the scope of this review (they are not code).
### Summary
No critical correctness bugs (use-after-free, leaks, races, or missing error checks) were identified. The main issues are minor style/consistency items around restrict qualifiers, pointer arithmetic, and code duplication. The patch is otherwise clean and follows DPDK coding guidelines for the x86 EAL memcpy implementation.
(xai: review reported errors)
========================================
>>> Running review with: google
=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Gemini) ===
Review date: 2026-05-21
The following is a review of the patch against DPDK coding standards and guidelines.
### Warnings
**Implicit integer comparison in rte_memcpy.**
DPDK coding style requires explicit comparison against 0 for integer types, rather than using truthiness/logical negation.
```c
/* Current implementation */
if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
/* Suggested fix */
if ((((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK) == 0)
```
### Info
**Use of bitwise logic in rte_mov15_or_less.**
While this patch only adds `__rte_restrict` to the signature, the existing
(google: review reported warnings)
========================================
Review comparison complete.
Summary across 4 provider(s): clean=0 warnings=1 errors=3 failed=0
prev parent reply other threads:[~2026-05-21 22:42 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-20 11:45 [PATCH] eal/x86: reduce memcpy code duplication Morten Brørup
2025-11-21 10:35 ` [PATCH v2] eal/x86: optimize memcpy of small sizes Morten Brørup
2025-11-21 16:57 ` Stephen Hemminger
2025-11-21 17:02 ` Bruce Richardson
2025-11-21 17:11 ` Stephen Hemminger
2025-11-21 21:36 ` Morten Brørup
2025-11-21 10:40 ` Morten Brørup
2025-11-21 10:40 ` [PATCH v3] " Morten Brørup
2025-11-24 13:36 ` Morten Brørup
2025-11-24 15:46 ` Patrick Robb
2025-11-28 14:02 ` Konstantin Ananyev
2025-11-28 15:55 ` Morten Brørup
2025-11-28 18:10 ` Konstantin Ananyev
2025-11-29 2:17 ` Morten Brørup
2025-12-01 9:35 ` Konstantin Ananyev
2025-12-01 10:41 ` Morten Brørup
2025-11-24 20:31 ` [PATCH v4] " Morten Brørup
2025-11-25 8:19 ` Morten Brørup
2025-12-01 15:55 ` [PATCH v5] " Morten Brørup
2025-12-03 13:29 ` Morten Brørup
2026-01-03 17:53 ` Morten Brørup
2026-01-09 15:05 ` Varghese, Vipin
2026-01-11 15:52 ` Konstantin Ananyev
2026-01-11 16:01 ` Stephen Hemminger
2026-01-12 8:02 ` Morten Brørup
2026-01-12 16:00 ` Scott Mitchell
2026-01-13 0:39 ` Stephen Hemminger
2026-01-12 12:03 ` [PATCH v6] " Morten Brørup
2026-01-13 23:19 ` Stephen Hemminger
2026-01-20 11:00 ` Varghese, Vipin
2026-01-20 11:19 ` Varghese, Vipin
2026-01-20 11:22 ` Morten Brørup
2026-01-21 11:48 ` Varghese, Vipin
2026-01-22 6:59 ` Varghese, Vipin
2026-01-22 7:28 ` Liangxing Wang
2026-01-23 6:58 ` Varghese, Vipin
2026-02-20 11:08 ` [PATCH v7] " Morten Brørup
2026-03-11 7:28 ` Morten Brørup
2026-03-11 16:58 ` Bruce Richardson
2026-03-11 18:29 ` Morten Brørup
2026-03-11 19:09 ` Bruce Richardson
2026-03-12 8:33 ` Konstantin Ananyev
2026-03-19 15:55 ` Morten Brørup
2026-04-29 9:36 ` [PATCH v8] " Morten Brørup
2026-04-29 10:35 ` [PATCH v9] " Morten Brørup
2026-04-29 11:24 ` Morten Brørup
2026-05-08 6:32 ` Morten Brørup
2026-05-21 10:54 ` [TEST PATCH " Morten Brørup
2026-05-08 9:58 ` [PATCH v10] " Morten Brørup
2026-05-21 18:56 ` [PATCH v11] " Morten Brørup
2026-05-21 19:48 ` Stephen Hemminger
2026-05-21 22:42 ` Stephen Hemminger [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260521154214.1c171a74@phoenix.local \
--to=stephen@networkplumber.org \
--cc=Bala.MuraliKrishna@amd.com \
--cc=Thiyagarajan.P@amd.com \
--cc=anatoly.burakov@intel.com \
--cc=bruce.richardson@intel.com \
--cc=dev@dpdk.org \
--cc=konstantin.ananyev@huawei.com \
--cc=konstantin.v.ananyev@yandex.ru \
--cc=mb@smartsharesystems.com \
--cc=vipin.varghese@amd.com \
--cc=vladimir.medvedkin@intel.com \
--cc=wangliangxing@hygon.cn \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox