Re: [PATCH v11] eal/x86: optimize memcpy of small sizes

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Stephen Hemminger <stephen@networkplumber.org>
To: "Morten Brørup" <mb@smartsharesystems.com>
Cc: dev@dpdk.org, Bruce Richardson <bruce.richardson@intel.com>,
	Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>,
	Vipin Varghese <vipin.varghese@amd.com>,
	Liangxing Wang <wangliangxing@hygon.cn>,
	Thiyagarajan P <Thiyagarajan.P@amd.com>,
	Bala Murali Krishna <Bala.MuraliKrishna@amd.com>,
	Anatoly Burakov <anatoly.burakov@intel.com>,
	Vladimir Medvedkin <vladimir.medvedkin@intel.com>,
	Konstantin Ananyev <konstantin.ananyev@huawei.com>
Subject: Re: [PATCH v11] eal/x86: optimize memcpy of small sizes
Date: Thu, 21 May 2026 15:42:14 -0700	[thread overview]
Message-ID: <20260521154214.1c171a74@phoenix.local> (raw)
In-Reply-To: <20260521185631.116046-1-mb@smartsharesystems.com>

On Thu, 21 May 2026 18:56:31 +0000
Morten Brørup <mb@smartsharesystems.com> wrote:

> The implementation for copying up to 64 bytes does not depend on address
> alignment with the size of the CPU's vector registers. Nonetheless, the
> exact same code for copying up to 64 bytes was present in both the aligned
> copy function and all the CPU vector register size specific variants of
> the unaligned copy functions.
> With this patch, the implementation for copying up to 64 bytes was
> consolidated into one instance, located in the common copy function,
> before checking alignment requirements.
> This provides three benefits:
> 1. No copy-paste in the source code.
> 2. A performance gain for copying up to 64 bytes, because the
> address alignment check is avoided in this case.
> 3. Reduced instruction memory footprint, because the compiler only
> generates one instance of the function for copying up to 64 bytes, instead
> of two instances (one in the unaligned copy function, and one in the
> aligned copy function).
> 
> Furthermore, __rte_restrict was added to source and destination addresses.
> 
> Also, the missing implementation of rte_mov48() was added.
> 
> Until recently, some drivers required disabling stringop-overflow warnings
> when using rte_memcpy().
> For some strange reason, these warnings were disabled in the rte_memcpy
> header file, instead of in the problematic drivers.
> With series-38174 ("remove use of rte_memcpy from net/intel"), the
> problematic drivers were updated to use memcpy() instead of rte_memcpy(),
> so disabling these warnings is no longer required, and was removed.
> 
> Regarding performance...
> The memcpy performance test (cache-to-cache copy) shows:
> Copying up to 15 bytes takes ca. 4.5 cycles, versus ca. 6.5 cycles before.
> Copying 8 bytes takes 4 cycles, versus 7 cycles before.
> Copying 16 bytes takes 2 cycles, versus 4 cycles before.
> Copying 64 bytes takes 4 cycles, versus 7 cycles before.
> 
> Depends-on: series-38174 ("remove use of rte_memcpy from net/intel")
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> ---

Here is the full wordy all providers reviews.
Mostly this is test to make sure script works.

$ ./devtools/ai/compare-patch-reviews.sh ~/Downloads/v11-eal-x86-optimize-memcpy-of-small-sizes.patch 
Reviewing patch: v11-eal-x86-optimize-memcpy-of-small-sizes.patch
Providers: anthropic,openai,xai,google
Format: text
========================================

>>> Running review with: anthropic

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Claude) ===
Review date: 2026-05-21

# Review of v11-eal-x86-optimize-memcpy-of-small-sizes.patch

## Correctness Issues

### Errors

**Use-after-free pattern in overlapping copy optimization (lines 672-673, 684, 704, 707)**

The pattern `(uint8_t *)dst - 16 + n` performs pointer arithmetic that, when `n < 16`, results in accessing memory before the `dst` pointer. This is used in an optimization technique (two overlapping stores) that is correct when `n >= 16`, but the code paths in question guarantee `n` is in specific ranges where this is safe. However, the readability is poor and the intent is not obvious from the code.

**Confidence: 80%** - This pattern is widely used in memcpy implementations and appears intentional, but deserves a comment explaining the technique to avoid confusion in future reviews.

---

## Warnings

**Missing documentation for new function rte_mov48() (line 130)**

The patch adds `rte_mov48()` but does not mention it in the commit message's "Also, the missing implementation of rte_mov48() was added." The implementation itself is correct, but there's no context about why it was missing or whether it's used elsewhere in DPDK.

**Suggestion:** Verify that `rte_mov48()` is actually used somewhere, or document why it's being added speculatively.

---

**Inconsistent handling of compile-time constant size (lines 682-683)**

The code checks `__rte_constant(n) && n == 16` to avoid a "harmless duplicate copy" but does not apply this optimization consistently. For example:
- Line 682: avoids duplicate for `n == 16`
- Line 690: avoids duplicate for `n == 32`
- Line 693: avoids duplicate for `n == 64`

But in the 33-64 byte range (lines 695-707), there's no similar optimization. This is not a bug, but the inconsistency in optimization strategy could be noted.

**Suggestion:** Add a comment explaining the rationale for when this optimization matters vs when the "harmless duplicate" is acceptable.

---

**Potential alignment assumption in AVX path (line 699)**

The AVX path at line 699 uses two overlapping `rte_mov32()` calls for sizes 33-64 bytes:
```c
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
```

For `n = 33`, the second call accesses `dst + 1` and `src + 1`, which may be misaligned for the 32-byte `__m256i` operations. However, `_mm256_loadu_si256()` and `_mm256_storeu_si256()` are documented as supporting unaligned access, so this is safe.

**Confidence: 90%** - Not a bug, but worth noting that the AVX path relies on unaligned load/store support.

---

## Style and Process Issues

### Warnings

**Unconventional comment style in rte_mov15_or_less() (lines 67-72)**

The multi-line comment explaining the rationale for the `rte_uint*_alias` structures does not follow the standard DPDK comment block format with the `*` on each line. However, this may be intentional for a technical explanation.

---

**Return value inconsistency (lines 682, 690-691, 693, 707-708)**

The function returns `dst` in some branches but the original `ret` pointer in others:
- Lines 682, 690, 707: return `dst`
- Line 714: return from `rte_memcpy_aligned_more_than_64(dst, src, n)` which returns `ret = dst`
- Line 716: return from `rte_memcpy_generic_more_than_64(dst, src, n)` which returns `ret = dst`

Both `ret` and `dst` point to the same location, so this is harmless, but using `dst` consistently would be clearer. The patch description states "Pointer to the destination data" is returned, and both satisfy this.

**Confidence: 95%** - Not an error, but inconsistent style within the same function.

---

## Documentation Issues

### Warnings

**Missing documentation update for removed warning suppression (lines removed at top)**

The patch removes:
```c
#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wstringop-overflow"
#endif
```

The commit message explains this is because the problematic drivers were updated, with a `Depends-on: series-38174`. However, there's no comment in the code explaining why the suppression was removed or a note in `doc/guides/rel_notes/` documenting this change.

**Suggestion:** Add a release note entry documenting the removal of the warning suppression and the dependency on the driver updates.

---

## Summary

**Priority findings:**

1. **(Info)** The overlapping copy optimization `dst - 16 + n` is correct but deserves a comment explaining the technique.
2. **(Warning)** Missing documentation for why `rte_mov48()` was added.
3. **(Warning)** Missing release notes entry for removal of stringop-overflow warning suppression.
4. **(Info)** Return value uses both `dst` and `ret` inconsistently (harmless but could be unified).

**Overall assessment:** The patch appears functionally correct. The main concerns are documentation completeness (missing context for `rte_mov48()`, missing release notes for warning removal) and code clarity (overlapping copy pattern could use explanatory comments). No blocking correctness issues were identified.
(anthropic: review reported errors)

========================================

>>> Running review with: openai

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via ChatGPT) ===
Review date: 2026-05-21

**DPDK Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch**

---

### Errors

None found.

- The patch does not introduce resource leaks, use-after-free, race conditions, or missing synchronization.
- There are no buffer overflows or out-of-bounds accesses in the memcpy routines; all copy size calculations and code paths for small copies (<64 bytes) are explicit and conservative.
- No uninitialized or stale variable use detected.
- All error paths are clean (these are computational functions; return values are always valid).

---

### Warnings

1. **Potential inconsistent comment phrasing regarding overlap requirements.**

   - Some Doxygen comments use "locations should not overlap" while others now correctly say "locations must not overlap".
   - While this isn't a correctness bug, consistent strong wording clarifies requirement for users.

   **Suggested fix:** Standardize all relevant Doxygen comments to use "locations must not overlap."

   ```c
   /**
    * Copy bytes from one location to another,
    * locations must not overlap.
    */
   ```

2. **Redundant Doxygen param alignment in function headers.**

   - Some parameter descriptions ("Pointer to the destination of the data"/"Pointer to the destination data") remain a bit unclear or inconsistent. If updating comments, consider making param descriptions exact and explicit.

   **Suggested fix:** For each `@param` clarify if it's the destination or source pointer. For `rte_memcpy`, `"Pointer to the destination."` and `"Pointer to the source."` suffice.

---

### Info

- Use of `__rte_restrict` is correct and improves optimization; matches DPDK style for pointer restrict qualifiers.
- Removal of `#pragma GCC diagnostic ignored "-Wstringop-overflow"` is correct because the problematic drivers now use standard `memcpy()`.
- All new `rte_mov*` functions properly use `static __rte_always_inline` and restrict-qualified pointers.
- Macro duplication is avoided by consolidation, in line with maintainability guidelines.
- No forbidden tokens, banned constructs, or use of deprecated APIs/macros were added.
- No symbol namespace pollution or missing `rte_`/`RTE_` prefix issues present.
- Coding style is maintained in indentation, tabs use, and function/variable names.
- No API, public header, or C11 atomic/volatile/race issues are present.
- No documentation or API tag requirements are triggered (no public API change).
- Patch is self-contained and does not require changes to release notes or test infrastructure.
- No RST or external documentation is involved in this patch.

---

### Summary

**No correctness/blocking issues found.**
Minor documentation consistency updates suggested for comment clarity, but these do not affect correctness.

**This patch is in good shape and complies with DPDK coding and architectural guidelines.**
(openai: review reported errors)

========================================

>>> Running review with: xai

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Grok) ===
Review date: 2026-05-21

**Review of patch: v11-eal-x86-optimize-memcpy-of-small-sizes.patch**

### Errors

**1. Missing `__rte_restrict` in rte_mov48() implementation (consistency / correctness risk)**  
The new `rte_mov48()` function is declared with `__rte_restrict`, but the implementation casts away the restrict qualifier when calling `rte_mov32()` and `rte_mov16()`:

```c
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
```

**Suggested fix**: Keep the restrict-qualified parameters in the implementation or cast only after documenting the aliasing contract. This is a minor style inconsistency but worth fixing for uniformity.

**2. Potential sign-extension issue on 64-bit offset calculations (left-shift of narrow type)**  
In `rte_mov15_or_less()`, the expression `(n & 8)` is `size_t` (ok), but the subsequent pointer arithmetic:

```c
*(uint64_t *)((uint8_t *)dst + n - 8) = *(const uint64_t *)((const uint8_t *)src + n - 8);
```

relies on implicit promotion. While currently safe, the pattern of adding a `uint8_t`/`size_t` value to a pointer after a narrow mask (`n & 7`) could trigger Coverity SIGN_EXTENSION warnings on some toolchains. Consider an explicit cast:

```c
uint8_t *d = (uint8_t *)dst + (size_t)(n - 8);
```

**3. Use of `__rte_constant` inside hot path without documented behavior**  
The code repeatedly uses `__rte_constant(n)` to optimize compile-time known sizes. This is a DPDK internal macro. While acceptable for performance-critical code, it should be documented (or at least noted) that the optimization only triggers for constant `n`. No correctness bug, but worth a comment.

### Warnings

**1. Duplicate function definitions for AVX/SSE variants**  
The patch introduces three nearly identical `rte_memcpy_generic_more_than_64()` and `rte_memcpy_aligned_more_than_64()` functions (one per vector size). This increases maintenance burden and instruction cache pressure. Consider factoring the common >64-byte logic into a single helper.

**2. Hard-coded cast chains in rte_mov48()**  
The SSE implementation of `rte_mov48()` uses explicit `(uint8_t *)` casts on restrict-qualified pointers. While functionally correct, it reduces readability. A cleaner pattern would be:

```c
rte_mov16(dst + 0, src + 0);
rte_mov16(dst + 16, src + 16);
rte_mov16(dst + 32, src + 32);
```

**3. Inconsistent use of `return dst` vs `return ret`**  
In the new `rte_memcpy()` implementation the fast path returns `dst`, while the old `rte_memcpy_generic()` returned the saved `ret`. Although `dst` and `ret` are identical, using `ret` consistently would avoid any future divergence.

**4. Missing Doxygen for new rte_mov48()**  
The new function has a Doxygen comment, but the description is minimal. Consider adding a brief note that it is an internal helper (or mark it `__rte_internal` if it should not be public).

### Info

- The removal of the `#pragma GCC diagnostic` block for `-Wstringop-overflow` is correct and aligns with the series that removed `rte_memcpy` usage from drivers.
- Adding `__rte_restrict` to `rte_memcpy()` and helper functions is a good improvement and matches modern DPDK style.
- Performance claims in the commit message are outside the scope of this review (they are not code).

### Summary

No critical correctness bugs (use-after-free, leaks, races, or missing error checks) were identified. The main issues are minor style/consistency items around restrict qualifiers, pointer arithmetic, and code duplication. The patch is otherwise clean and follows DPDK coding guidelines for the x86 EAL memcpy implementation.
(xai: review reported errors)

========================================

>>> Running review with: google

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Gemini) ===
Review date: 2026-05-21

The following is a review of the patch against DPDK coding standards and guidelines.

### Warnings

**Implicit integer comparison in rte_memcpy.**
DPDK coding style requires explicit comparison against 0 for integer types, rather than using truthiness/logical negation.
```c
/* Current implementation */
if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))

/* Suggested fix */
if ((((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK) == 0)
```

### Info

**Use of bitwise logic in rte_mov15_or_less.**
While this patch only adds `__rte_restrict` to the signature, the existing
(google: review reported warnings)

========================================

Review comparison complete.
Summary across 4 provider(s): clean=0 warnings=1 errors=3 failed=0

next prev parent reply	other threads:[~2026-05-21 22:42 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-20 11:45 [PATCH] eal/x86: reduce memcpy code duplication Morten Brørup
2025-11-21 10:35 ` [PATCH v2] eal/x86: optimize memcpy of small sizes Morten Brørup
2025-11-21 16:57   ` Stephen Hemminger
2025-11-21 17:02     ` Bruce Richardson
2025-11-21 17:11       ` Stephen Hemminger
2025-11-21 21:36         ` Morten Brørup
2025-11-21 10:40 ` Morten Brørup
2025-11-21 10:40 ` [PATCH v3] " Morten Brørup
2025-11-24 13:36   ` Morten Brørup
2025-11-24 15:46     ` Patrick Robb
2025-11-28 14:02   ` Konstantin Ananyev
2025-11-28 15:55     ` Morten Brørup
2025-11-28 18:10       ` Konstantin Ananyev
2025-11-29  2:17         ` Morten Brørup
2025-12-01  9:35           ` Konstantin Ananyev
2025-12-01 10:41             ` Morten Brørup
2025-11-24 20:31 ` [PATCH v4] " Morten Brørup
2025-11-25  8:19   ` Morten Brørup
2025-12-01 15:55 ` [PATCH v5] " Morten Brørup
2025-12-03 13:29   ` Morten Brørup
2026-01-03 17:53   ` Morten Brørup
2026-01-09 15:05     ` Varghese, Vipin
2026-01-11 15:52     ` Konstantin Ananyev
2026-01-11 16:01       ` Stephen Hemminger
2026-01-12  8:02       ` Morten Brørup
2026-01-12 16:00         ` Scott Mitchell
2026-01-13  0:39           ` Stephen Hemminger
2026-01-12 12:03 ` [PATCH v6] " Morten Brørup
2026-01-13 23:19   ` Stephen Hemminger
2026-01-20 11:00     ` Varghese, Vipin
2026-01-20 11:19       ` Varghese, Vipin
2026-01-20 11:22         ` Morten Brørup
2026-01-21 11:48           ` Varghese, Vipin
2026-01-22  6:59             ` Varghese, Vipin
2026-01-22  7:28               ` Liangxing Wang
2026-01-23  6:58               ` Varghese, Vipin
2026-02-20 11:08 ` [PATCH v7] " Morten Brørup
2026-03-11  7:28   ` Morten Brørup
2026-03-11 16:58   ` Bruce Richardson
2026-03-11 18:29     ` Morten Brørup
2026-03-11 19:09       ` Bruce Richardson
2026-03-12  8:33   ` Konstantin Ananyev
2026-03-19 15:55   ` Morten Brørup
2026-04-29  9:36 ` [PATCH v8] " Morten Brørup
2026-04-29 10:35 ` [PATCH v9] " Morten Brørup
2026-04-29 11:24   ` Morten Brørup
2026-05-08  6:32   ` Morten Brørup
2026-05-21 10:54   ` [TEST PATCH " Morten Brørup
2026-05-08  9:58 ` [PATCH v10] " Morten Brørup
2026-05-21 18:56 ` [PATCH v11] " Morten Brørup
2026-05-21 19:48   ` Stephen Hemminger
2026-05-21 22:42   ` Stephen Hemminger [this message]
2026-06-01 13:38     ` Thomas Monjalon
2026-06-01 14:19       ` Morten Brørup
2026-06-01 19:48   ` Thomas Monjalon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260521154214.1c171a74@phoenix.local \
    --to=stephen@networkplumber.org \
    --cc=Bala.MuraliKrishna@amd.com \
    --cc=Thiyagarajan.P@amd.com \
    --cc=anatoly.burakov@intel.com \
    --cc=bruce.richardson@intel.com \
    --cc=dev@dpdk.org \
    --cc=konstantin.ananyev@huawei.com \
    --cc=konstantin.v.ananyev@yandex.ru \
    --cc=mb@smartsharesystems.com \
    --cc=vipin.varghese@amd.com \
    --cc=vladimir.medvedkin@intel.com \
    --cc=wangliangxing@hygon.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.