Re: [PATCH v11] eal/x86: optimize memcpy of small sizes

DPDK-dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Stephen Hemminger <stephen@networkplumber.org>
To: "Morten Brørup" <mb@smartsharesystems.com>
Cc: dev@dpdk.org, Bruce Richardson <bruce.richardson@intel.com>,
	Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>,
	Vipin Varghese <vipin.varghese@amd.com>,
	Liangxing Wang <wangliangxing@hygon.cn>,
	Thiyagarajan P <Thiyagarajan.P@amd.com>,
	Bala Murali Krishna <Bala.MuraliKrishna@amd.com>,
	Anatoly Burakov <anatoly.burakov@intel.com>,
	Vladimir Medvedkin <vladimir.medvedkin@intel.com>,
	Konstantin Ananyev <konstantin.ananyev@huawei.com>
Subject: Re: [PATCH v11] eal/x86: optimize memcpy of small sizes
Date: Thu, 21 May 2026 15:42:14 -0700	[thread overview]
Message-ID: <20260521154214.1c171a74@phoenix.local> (raw)
In-Reply-To: <20260521185631.116046-1-mb@smartsharesystems.com>

On Thu, 21 May 2026 18:56:31 +0000
Morten Brørup <mb@smartsharesystems.com> wrote:

> The implementation for copying up to 64 bytes does not depend on address
> alignment with the size of the CPU's vector registers. Nonetheless, the
> exact same code for copying up to 64 bytes was present in both the aligned
> copy function and all the CPU vector register size specific variants of
> the unaligned copy functions.
> With this patch, the implementation for copying up to 64 bytes was
> consolidated into one instance, located in the common copy function,
> before checking alignment requirements.
> This provides three benefits:
> 1. No copy-paste in the source code.
> 2. A performance gain for copying up to 64 bytes, because the
> address alignment check is avoided in this case.
> 3. Reduced instruction memory footprint, because the compiler only
> generates one instance of the function for copying up to 64 bytes, instead
> of two instances (one in the unaligned copy function, and one in the
> aligned copy function).
> 
> Furthermore, __rte_restrict was added to source and destination addresses.
> 
> Also, the missing implementation of rte_mov48() was added.
> 
> Until recently, some drivers required disabling stringop-overflow warnings
> when using rte_memcpy().
> For some strange reason, these warnings were disabled in the rte_memcpy
> header file, instead of in the problematic drivers.
> With series-38174 ("remove use of rte_memcpy from net/intel"), the
> problematic drivers were updated to use memcpy() instead of rte_memcpy(),
> so disabling these warnings is no longer required, and was removed.
> 
> Regarding performance...
> The memcpy performance test (cache-to-cache copy) shows:
> Copying up to 15 bytes takes ca. 4.5 cycles, versus ca. 6.5 cycles before.
> Copying 8 bytes takes 4 cycles, versus 7 cycles before.
> Copying 16 bytes takes 2 cycles, versus 4 cycles before.
> Copying 64 bytes takes 4 cycles, versus 7 cycles before.
> 
> Depends-on: series-38174 ("remove use of rte_memcpy from net/intel")
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> ---

Here is the full wordy all providers reviews.
Mostly this is test to make sure script works.

$ ./devtools/ai/compare-patch-reviews.sh ~/Downloads/v11-eal-x86-optimize-memcpy-of-small-sizes.patch 
Reviewing patch: v11-eal-x86-optimize-memcpy-of-small-sizes.patch
Providers: anthropic,openai,xai,google
Format: text
========================================

>>> Running review with: anthropic

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Claude) ===
Review date: 2026-05-21

# Review of v11-eal-x86-optimize-memcpy-of-small-sizes.patch

## Correctness Issues

### Errors

**Use-after-free pattern in overlapping copy optimization (lines 672-673, 684, 704, 707)**

The pattern `(uint8_t *)dst - 16 + n` performs pointer arithmetic that, when `n < 16`, results in accessing memory before the `dst` pointer. This is used in an optimization technique (two overlapping stores) that is correct when `n >= 16`, but the code paths in question guarantee `n` is in specific ranges where this is safe. However, the readability is poor and the intent is not obvious from the code.

**Confidence: 80%** - This pattern is widely used in memcpy implementations and appears intentional, but deserves a comment explaining the technique to avoid confusion in future reviews.

---

## Warnings

**Missing documentation for new function rte_mov48() (line 130)**

The patch adds `rte_mov48()` but does not mention it in the commit message's "Also, the missing implementation of rte_mov48() was added." The implementation itself is correct, but there's no context about why it was missing or whether it's used elsewhere in DPDK.

**Suggestion:** Verify that `rte_mov48()` is actually used somewhere, or document why it's being added speculatively.

---

**Inconsistent handling of compile-time constant size (lines 682-683)**

The code checks `__rte_constant(n) && n == 16` to avoid a "harmless duplicate copy" but does not apply this optimization consistently. For example:
- Line 682: avoids duplicate for `n == 16`
- Line 690: avoids duplicate for `n == 32`
- Line 693: avoids duplicate for `n == 64`

But in the 33-64 byte range (lines 695-707), there's no similar optimization. This is not a bug, but the inconsistency in optimization strategy could be noted.

**Suggestion:** Add a comment explaining the rationale for when this optimization matters vs when the "harmless duplicate" is acceptable.

---

**Potential alignment assumption in AVX path (line 699)**

The AVX path at line 699 uses two overlapping `rte_mov32()` calls for sizes 33-64 bytes:
```c
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
```

For `n = 33`, the second call accesses `dst + 1` and `src + 1`, which may be misaligned for the 32-byte `__m256i` operations. However, `_mm256_loadu_si256()` and `_mm256_storeu_si256()` are documented as supporting unaligned access, so this is safe.

**Confidence: 90%** - Not a bug, but worth noting that the AVX path relies on unaligned load/store support.

---

## Style and Process Issues

### Warnings

**Unconventional comment style in rte_mov15_or_less() (lines 67-72)**

The multi-line comment explaining the rationale for the `rte_uint*_alias` structures does not follow the standard DPDK comment block format with the `*` on each line. However, this may be intentional for a technical explanation.

---

**Return value inconsistency (lines 682, 690-691, 693, 707-708)**

The function returns `dst` in some branches but the original `ret` pointer in others:
- Lines 682, 690, 707: return `dst`
- Line 714: return from `rte_memcpy_aligned_more_than_64(dst, src, n)` which returns `ret = dst`
- Line 716: return from `rte_memcpy_generic_more_than_64(dst, src, n)` which returns `ret = dst`

Both `ret` and `dst` point to the same location, so this is harmless, but using `dst` consistently would be clearer. The patch description states "Pointer to the destination data" is returned, and both satisfy this.

**Confidence: 95%** - Not an error, but inconsistent style within the same function.

---

## Documentation Issues

### Warnings

**Missing documentation update for removed warning suppression (lines removed at top)**

The patch removes:
```c
#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wstringop-overflow"
#endif
```

The commit message explains this is because the problematic drivers were updated, with a `Depends-on: series-38174`. However, there's no comment in the code explaining why the suppression was removed or a note in `doc/guides/rel_notes/` documenting this change.

**Suggestion:** Add a release note entry documenting the removal of the warning suppression and the dependency on the driver updates.

---

## Summary

**Priority findings:**

1. **(Info)** The overlapping copy optimization `dst - 16 + n` is correct but deserves a comment explaining the technique.
2. **(Warning)** Missing documentation for why `rte_mov48()` was added.
3. **(Warning)** Missing release notes entry for removal of stringop-overflow warning suppression.
4. **(Info)** Return value uses both `dst` and `ret` inconsistently (harmless but could be unified).

**Overall assessment:** The patch appears functionally correct. The main concerns are documentation completeness (missing context for `rte_mov48()`, missing release notes for warning removal) and code clarity (overlapping copy pattern could use explanatory comments). No blocking correctness issues were identified.
(anthropic: review reported errors)

========================================

>>> Running review with: openai

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via ChatGPT) ===
Review date: 2026-05-21

**DPDK Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch**

---

### Errors

None found.

- The patch does not introduce resource leaks, use-after-free, race conditions, or missing synchronization.
- There are no buffer overflows or out-of-bounds accesses in the memcpy routines; all copy size calculations and code paths for small copies (<64 bytes) are explicit and conservative.
- No uninitialized or stale variable use detected.
- All error paths are clean (these are computational functions; return values are always valid).

---

### Warnings

1. **Potential inconsistent comment phrasing regarding overlap requirements.**

   - Some Doxygen comments use "locations should not overlap" while others now correctly say "locations must not overlap".
   - While this isn't a correctness bug, consistent strong wording clarifies requirement for users.

   **Suggested fix:** Standardize all relevant Doxygen comments to use "locations must not overlap."

   ```c
   /**
    * Copy bytes from one location to another,
    * locations must not overlap.
    */
   ```

2. **Redundant Doxygen param alignment in function headers.**

   - Some parameter descriptions ("Pointer to the destination of the data"/"Pointer to the destination data") remain a bit unclear or inconsistent. If updating comments, consider making param descriptions exact and explicit.

   **Suggested fix:** For each `@param` clarify if it's the destination or source pointer. For `rte_memcpy`, `"Pointer to the destination."` and `"Pointer to the source."` suffice.

---

### Info

- Use of `__rte_restrict` is correct and improves optimization; matches DPDK style for pointer restrict qualifiers.
- Removal of `#pragma GCC diagnostic ignored "-Wstringop-overflow"` is correct because the problematic drivers now use standard `memcpy()`.
- All new `rte_mov*` functions properly use `static __rte_always_inline` and restrict-qualified pointers.
- Macro duplication is avoided by consolidation, in line with maintainability guidelines.
- No forbidden tokens, banned constructs, or use of deprecated APIs/macros were added.
- No symbol namespace pollution or missing `rte_`/`RTE_` prefix issues present.
- Coding style is maintained in indentation, tabs use, and function/variable names.
- No API, public header, or C11 atomic/volatile/race issues are present.
- No documentation or API tag requirements are triggered (no public API change).
- Patch is self-contained and does not require changes to release notes or test infrastructure.
- No RST or external documentation is involved in this patch.

---

### Summary

**No correctness/blocking issues found.**
Minor documentation consistency updates suggested for comment clarity, but these do not affect correctness.

**This patch is in good shape and complies with DPDK coding and architectural guidelines.**
(openai: review reported errors)

========================================

>>> Running review with: xai

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Grok) ===
Review date: 2026-05-21

**Review of patch: v11-eal-x86-optimize-memcpy-of-small-sizes.patch**

### Errors

**1. Missing `__rte_restrict` in rte_mov48() implementation (consistency / correctness risk)**  
The new `rte_mov48()` function is declared with `__rte_restrict`, but the implementation casts away the restrict qualifier when calling `rte_mov32()` and `rte_mov16()`:

```c
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
```

**Suggested fix**: Keep the restrict-qualified parameters in the implementation or cast only after documenting the aliasing contract. This is a minor style inconsistency but worth fixing for uniformity.

**2. Potential sign-extension issue on 64-bit offset calculations (left-shift of narrow type)**  
In `rte_mov15_or_less()`, the expression `(n & 8)` is `size_t` (ok), but the subsequent pointer arithmetic:

```c
*(uint64_t *)((uint8_t *)dst + n - 8) = *(const uint64_t *)((const uint8_t *)src + n - 8);
```

relies on implicit promotion. While currently safe, the pattern of adding a `uint8_t`/`size_t` value to a pointer after a narrow mask (`n & 7`) could trigger Coverity SIGN_EXTENSION warnings on some toolchains. Consider an explicit cast:

```c
uint8_t *d = (uint8_t *)dst + (size_t)(n - 8);
```

**3. Use of `__rte_constant` inside hot path without documented behavior**  
The code repeatedly uses `__rte_constant(n)` to optimize compile-time known sizes. This is a DPDK internal macro. While acceptable for performance-critical code, it should be documented (or at least noted) that the optimization only triggers for constant `n`. No correctness bug, but worth a comment.

### Warnings

**1. Duplicate function definitions for AVX/SSE variants**  
The patch introduces three nearly identical `rte_memcpy_generic_more_than_64()` and `rte_memcpy_aligned_more_than_64()` functions (one per vector size). This increases maintenance burden and instruction cache pressure. Consider factoring the common >64-byte logic into a single helper.

**2. Hard-coded cast chains in rte_mov48()**  
The SSE implementation of `rte_mov48()` uses explicit `(uint8_t *)` casts on restrict-qualified pointers. While functionally correct, it reduces readability. A cleaner pattern would be:

```c
rte_mov16(dst + 0, src + 0);
rte_mov16(dst + 16, src + 16);
rte_mov16(dst + 32, src + 32);
```

**3. Inconsistent use of `return dst` vs `return ret`**  
In the new `rte_memcpy()` implementation the fast path returns `dst`, while the old `rte_memcpy_generic()` returned the saved `ret`. Although `dst` and `ret` are identical, using `ret` consistently would avoid any future divergence.

**4. Missing Doxygen for new rte_mov48()**  
The new function has a Doxygen comment, but the description is minimal. Consider adding a brief note that it is an internal helper (or mark it `__rte_internal` if it should not be public).

### Info

- The removal of the `#pragma GCC diagnostic` block for `-Wstringop-overflow` is correct and aligns with the series that removed `rte_memcpy` usage from drivers.
- Adding `__rte_restrict` to `rte_memcpy()` and helper functions is a good improvement and matches modern DPDK style.
- Performance claims in the commit message are outside the scope of this review (they are not code).

### Summary

No critical correctness bugs (use-after-free, leaks, races, or missing error checks) were identified. The main issues are minor style/consistency items around restrict qualifiers, pointer arithmetic, and code duplication. The patch is otherwise clean and follows DPDK coding guidelines for the x86 EAL memcpy implementation.
(xai: review reported errors)

========================================

>>> Running review with: google

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Gemini) ===
Review date: 2026-05-21

The following is a review of the patch against DPDK coding standards and guidelines.

### Warnings

**Implicit integer comparison in rte_memcpy.**
DPDK coding style requires explicit comparison against 0 for integer types, rather than using truthiness/logical negation.
```c
/* Current implementation */
if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))

/* Suggested fix */
if ((((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK) == 0)
```

### Info

**Use of bitwise logic in rte_mov15_or_less.**
While this patch only adds `__rte_restrict` to the signature, the existing
(google: review reported warnings)

========================================

Review comparison complete.
Summary across 4 provider(s): clean=0 warnings=1 errors=3 failed=0

     prev parent reply	other threads:[~2026-05-21 22:42 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-20 11:45 [PATCH] eal/x86: reduce memcpy code duplication Morten Brørup
2025-11-21 10:35 ` [PATCH v2] eal/x86: optimize memcpy of small sizes Morten Brørup
2025-11-21 16:57   ` Stephen Hemminger
2025-11-21 17:02     ` Bruce Richardson
2025-11-21 17:11       ` Stephen Hemminger
2025-11-21 21:36         ` Morten Brørup
2025-11-21 10:40 ` Morten Brørup
2025-11-21 10:40 ` [PATCH v3] " Morten Brørup
2025-11-24 13:36   ` Morten Brørup
2025-11-24 15:46     ` Patrick Robb
2025-11-28 14:02   ` Konstantin Ananyev
2025-11-28 15:55     ` Morten Brørup
2025-11-28 18:10       ` Konstantin Ananyev
2025-11-29  2:17         ` Morten Brørup
2025-12-01  9:35           ` Konstantin Ananyev
2025-12-01 10:41             ` Morten Brørup
2025-11-24 20:31 ` [PATCH v4] " Morten Brørup
2025-11-25  8:19   ` Morten Brørup
2025-12-01 15:55 ` [PATCH v5] " Morten Brørup
2025-12-03 13:29   ` Morten Brørup
2026-01-03 17:53   ` Morten Brørup
2026-01-09 15:05     ` Varghese, Vipin
2026-01-11 15:52     ` Konstantin Ananyev
2026-01-11 16:01       ` Stephen Hemminger
2026-01-12  8:02       ` Morten Brørup
2026-01-12 16:00         ` Scott Mitchell
2026-01-13  0:39           ` Stephen Hemminger
2026-01-12 12:03 ` [PATCH v6] " Morten Brørup
2026-01-13 23:19   ` Stephen Hemminger
2026-01-20 11:00     ` Varghese, Vipin
2026-01-20 11:19       ` Varghese, Vipin
2026-01-20 11:22         ` Morten Brørup
2026-01-21 11:48           ` Varghese, Vipin
2026-01-22  6:59             ` Varghese, Vipin
2026-01-22  7:28               ` Liangxing Wang
2026-01-23  6:58               ` Varghese, Vipin
2026-02-20 11:08 ` [PATCH v7] " Morten Brørup
2026-03-11  7:28   ` Morten Brørup
2026-03-11 16:58   ` Bruce Richardson
2026-03-11 18:29     ` Morten Brørup
2026-03-11 19:09       ` Bruce Richardson
2026-03-12  8:33   ` Konstantin Ananyev
2026-03-19 15:55   ` Morten Brørup
2026-04-29  9:36 ` [PATCH v8] " Morten Brørup
2026-04-29 10:35 ` [PATCH v9] " Morten Brørup
2026-04-29 11:24   ` Morten Brørup
2026-05-08  6:32   ` Morten Brørup
2026-05-21 10:54   ` [TEST PATCH " Morten Brørup
2026-05-08  9:58 ` [PATCH v10] " Morten Brørup
2026-05-21 18:56 ` [PATCH v11] " Morten Brørup
2026-05-21 19:48   ` Stephen Hemminger
2026-05-21 22:42   ` Stephen Hemminger [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260521154214.1c171a74@phoenix.local \
    --to=stephen@networkplumber.org \
    --cc=Bala.MuraliKrishna@amd.com \
    --cc=Thiyagarajan.P@amd.com \
    --cc=anatoly.burakov@intel.com \
    --cc=bruce.richardson@intel.com \
    --cc=dev@dpdk.org \
    --cc=konstantin.ananyev@huawei.com \
    --cc=konstantin.v.ananyev@yandex.ru \
    --cc=mb@smartsharesystems.com \
    --cc=vipin.varghese@amd.com \
    --cc=vladimir.medvedkin@intel.com \
    --cc=wangliangxing@hygon.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox