public inbox for dev@dpdk.org
* [PATCH v14 0/2] net: optimize __rte_raw_cksum
@ 2026-01-12 12:04 scott.k.mitch1
  2026-01-12 12:04 ` [PATCH v14 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
                   ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-12 12:04 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

This series optimizes __rte_raw_cksum by replacing memcpy with direct
pointer access, enabling compiler vectorization on both GCC and Clang.

Patch 1 adds __rte_may_alias to unaligned typedefs to prevent a GCC
strict-aliasing bug where struct initialization is incorrectly elided.

Patch 2 uses the improved unaligned_uint16_t type in __rte_raw_cksum
to enable compiler optimizations while maintaining correctness across
all architectures (including strict-alignment platforms).

Performance results show significant improvements (40% for small buffers,
up to 8x for larger buffers) on Intel Xeon with Clang 18.1.

Changes in v14:
- Split into two patches: EAL typedef fix and checksum optimization
- Use unaligned_uint16_t directly instead of wrapper struct
- Added __rte_may_alias to unaligned typedefs to prevent GCC bug

Scott Mitchell (2):
  eal: add __rte_may_alias to unaligned typedefs
  net: __rte_raw_cksum pointers enable compiler optimizations

 app/test/meson.build         |   1 +
 app/test/test_cksum_fuzz.c   | 240 +++++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c   |   2 +-
 lib/eal/include/rte_common.h |  34 ++---
 lib/net/rte_cksum.h          |  14 +-
 5 files changed, 266 insertions(+), 25 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

--
2.39.5 (Apple Git-154)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v14 1/2] eal: add __rte_may_alias to unaligned typedefs
  2026-01-12 12:04 [PATCH v14 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
@ 2026-01-12 12:04 ` scott.k.mitch1
  2026-01-12 13:28   ` Morten Brørup
  2026-01-12 12:04 ` [PATCH v14 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
  2026-01-17 21:21 ` [PATCH v15 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2 siblings, 1 reply; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-12 12:04 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

Add __rte_may_alias attribute to unaligned_uint{16,32,64}_t typedefs
to prevent GCC strict-aliasing optimization bugs. GCC has a bug where
it incorrectly elides struct initialization when strict aliasing is
enabled, causing reads from uninitialized memory.

The __rte_may_alias attribute signals to the compiler that these types
can alias other types, preventing the incorrect optimization.
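
As a minimal standalone illustration (not part of this patch; the typedef
and helper names below are hypothetical stand-ins), a may_alias typedef
lets a pointer of that type legally read memory written through another
type, which is exactly how the unaligned typedefs are used:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for unaligned_uint16_t after this patch (the
 * real typedef additionally gets __rte_aligned(1) on strict-alignment
 * architectures). may_alias exempts it from type-based alias analysis. */
typedef uint16_t u16_alias __attribute__((__may_alias__));

/* Read 16 bits from an arbitrary byte pointer. Without may_alias,
 * dereferencing a plain uint16_t * that really points into a byte
 * buffer would violate strict-aliasing rules, letting the compiler
 * reorder or elide the stores that filled the buffer. */
static inline uint16_t
read_u16(const void *p)
{
	return *(const u16_alias *)p;
}
```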

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 lib/eal/include/rte_common.h | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/lib/eal/include/rte_common.h b/lib/eal/include/rte_common.h
index 9e7d84f929..ac70270cfb 100644
--- a/lib/eal/include/rte_common.h
+++ b/lib/eal/include/rte_common.h
@@ -121,14 +121,27 @@ extern "C" {
 #define __rte_aligned(a) __attribute__((__aligned__(a)))
 #endif
 
+/**
+ * Macro to mark a type that is not subject to type-based aliasing rules
+ */
+#ifdef RTE_TOOLCHAIN_MSVC
+#define __rte_may_alias
+#else
+#define __rte_may_alias __attribute__((__may_alias__))
+#endif
+
+/**
+ * __rte_may_alias avoids compiler bugs (GCC) that elide initialization
+ * of memory when strict-aliasing is enabled.
+ */
 #ifdef RTE_ARCH_STRICT_ALIGN
-typedef uint64_t unaligned_uint64_t __rte_aligned(1);
-typedef uint32_t unaligned_uint32_t __rte_aligned(1);
-typedef uint16_t unaligned_uint16_t __rte_aligned(1);
+typedef uint64_t unaligned_uint64_t __rte_may_alias __rte_aligned(1);
+typedef uint32_t unaligned_uint32_t __rte_may_alias __rte_aligned(1);
+typedef uint16_t unaligned_uint16_t __rte_may_alias __rte_aligned(1);
 #else
-typedef uint64_t unaligned_uint64_t;
-typedef uint32_t unaligned_uint32_t;
-typedef uint16_t unaligned_uint16_t;
+typedef uint64_t unaligned_uint64_t __rte_may_alias;
+typedef uint32_t unaligned_uint32_t __rte_may_alias;
+typedef uint16_t unaligned_uint16_t __rte_may_alias;
 #endif
 
 /**
@@ -159,15 +172,6 @@ typedef uint16_t unaligned_uint16_t;
 #define __rte_packed_end __attribute__((__packed__))
 #endif
 
-/**
- * Macro to mark a type that is not subject to type-based aliasing rules
- */
-#ifdef RTE_TOOLCHAIN_MSVC
-#define __rte_may_alias
-#else
-#define __rte_may_alias __attribute__((__may_alias__))
-#endif
-
 /******* Macro to mark functions and fields scheduled for removal *****/
 #ifdef RTE_TOOLCHAIN_MSVC
 #define __rte_deprecated
-- 
2.39.5 (Apple Git-154)



* [PATCH v14 2/2] net: __rte_raw_cksum pointers enable compiler optimizations
  2026-01-12 12:04 [PATCH v14 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2026-01-12 12:04 ` [PATCH v14 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
@ 2026-01-12 12:04 ` scott.k.mitch1
  2026-01-17 21:21 ` [PATCH v15 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2 siblings, 0 replies; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-12 12:04 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

__rte_raw_cksum uses a loop with memcpy on each iteration.
GCC 15+ is able to vectorize the loop but Clang 18.1 is not.

Replace memcpy with direct pointer access using unaligned_uint16_t.
This enables both GCC and Clang to vectorize the loop while handling
unaligned access safely on all architectures.

Performance results from cksum_perf_autotest on Intel Xeon
(Cascade Lake, AVX-512) built with Clang 18.1 (TSC cycles/byte):

  Block size    Before    After    Improvement
         100      0.40     0.24        ~40%
        1500      0.50     0.06        ~8x
        9000      0.49     0.06        ~8x
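
For reference, a standalone sketch of the two loop styles (assumed
equivalents of the before/after code; typedef and function names are
hypothetical), showing the direct-pointer walk computes the same sum as
the memcpy-based original:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed equivalent of unaligned_uint16_t from patch 1. */
typedef uint16_t u16_ua __attribute__((__may_alias__));

/* Original style: one memcpy per iteration (always well-defined, but
 * Clang 18.1 fails to vectorize it). */
static uint32_t
cksum_memcpy(const uint8_t *buf, size_t len, uint32_t sum)
{
	size_t i;
	uint16_t v;

	for (i = 0; i + 1 < len; i += 2) {
		memcpy(&v, buf + i, sizeof(v));
		sum += v;
	}
	if (len % 2) {
		uint16_t left = 0;

		memcpy(&left, buf + len - 1, 1);
		sum += left;
	}
	return sum;
}

/* Patched style: direct pointer walk over may_alias uint16s, which
 * both GCC and Clang can vectorize. */
static uint32_t
cksum_direct(const uint8_t *buf, size_t len, uint32_t sum)
{
	const u16_ua *p = (const u16_ua *)buf;
	const u16_ua *end = p + len / sizeof(*p);

	for (; p != end; p++)
		sum += *p;
	if (len % 2) {
		uint16_t left = 0;

		memcpy(&left, buf + len - 1, 1);
		sum += left;
	}
	return sum;
}
```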

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 app/test/meson.build       |   1 +
 app/test/test_cksum_fuzz.c | 240 +++++++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c |   2 +-
 lib/net/rte_cksum.h        |  14 +--
 4 files changed, 247 insertions(+), 10 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

diff --git a/app/test/meson.build b/app/test/meson.build
index efec42a6bf..c92325ad58 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -38,6 +38,7 @@ source_file_deps = {
     'test_byteorder.c': [],
     'test_cfgfile.c': ['cfgfile'],
     'test_cksum.c': ['net'],
+    'test_cksum_fuzz.c': ['net'],
     'test_cksum_perf.c': ['net'],
     'test_cmdline.c': [],
     'test_cmdline_cirbuf.c': [],
diff --git a/app/test/test_cksum_fuzz.c b/app/test/test_cksum_fuzz.c
new file mode 100644
index 0000000000..839861f57d
--- /dev/null
+++ b/app/test/test_cksum_fuzz.c
@@ -0,0 +1,240 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Apple Inc.
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_hexdump.h>
+#include <rte_cksum.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+/*
+ * Fuzz test for __rte_raw_cksum optimization.
+ * Compares the optimized implementation against the original reference
+ * implementation across random data of various lengths.
+ */
+
+#define DEFAULT_ITERATIONS 1000
+#define MAX_TEST_LEN 65536  /* 64K to match GRO frame sizes */
+
+/*
+ * Original (reference) implementation of __rte_raw_cksum from DPDK v23.11.
+ * This is retained here for comparison testing against the optimized version.
+ */
+static inline uint32_t
+__rte_raw_cksum_reference(const void *buf, size_t len, uint32_t sum)
+{
+	const void *end;
+
+	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
+	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
+		uint16_t v;
+
+		memcpy(&v, buf, sizeof(uint16_t));
+		sum += v;
+	}
+
+	/* if length is odd, keeping it byte order independent */
+	if (unlikely(len % 2)) {
+		uint16_t left = 0;
+
+		memcpy(&left, end, 1);
+		sum += left;
+	}
+
+	return sum;
+}
+
+static void
+init_random_buffer(uint8_t *buf, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		buf[i] = (uint8_t)rte_rand();
+}
+
+static inline uint32_t
+get_initial_sum(bool random_initial_sum)
+{
+	return random_initial_sum ? (rte_rand() & 0xFFFFFFFF) : 0;
+}
+
+/*
+ * Test a single buffer length with specific alignment and initial sum
+ */
+static int
+test_cksum_fuzz_length_aligned(size_t len, bool aligned, uint32_t initial_sum)
+{
+	uint8_t *data;
+	uint8_t *buf;
+	size_t alloc_size;
+	uint32_t sum_ref, sum_opt;
+
+	if (len == 0 && !aligned) {
+		/* Skip unaligned test for zero length - nothing to test */
+		return TEST_SUCCESS;
+	}
+
+	/* Allocate exact size for aligned, +1 for unaligned offset */
+	alloc_size = aligned ? len : len + 1;
+	if (alloc_size == 0)
+		alloc_size = 1;  /* rte_malloc doesn't like 0 */
+
+	data = rte_malloc(NULL, alloc_size, 64);
+	if (data == NULL) {
+		printf("Failed to allocate %zu bytes\n", alloc_size);
+		return TEST_FAILED;
+	}
+
+	buf = aligned ? data : (data + 1);
+
+	init_random_buffer(buf, len);
+
+	sum_ref = __rte_raw_cksum_reference(buf, len, initial_sum);
+	sum_opt = __rte_raw_cksum(buf, len, initial_sum);
+
+	if (sum_ref != sum_opt) {
+		printf("MISMATCH at len=%zu aligned='%s' initial_sum=0x%08x ref=0x%08x opt=0x%08x\n",
+		       len, aligned ? "aligned" : "unaligned",
+		       initial_sum, sum_ref, sum_opt);
+		rte_hexdump(stdout, "failing buffer", buf, len);
+		rte_free(data);
+		return TEST_FAILED;
+	}
+
+	rte_free(data);
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test a length with both alignments
+ */
+static int
+test_cksum_fuzz_length(size_t len, uint32_t initial_sum)
+{
+	int rc;
+
+	/* Test aligned */
+	rc = test_cksum_fuzz_length_aligned(len, true, initial_sum);
+	if (rc != TEST_SUCCESS)
+		return rc;
+
+	/* Test unaligned */
+	rc = test_cksum_fuzz_length_aligned(len, false, initial_sum);
+
+	return rc;
+}
+
+/*
+ * Test specific edge case lengths
+ */
+static int
+test_cksum_fuzz_edge_cases(void)
+{
+	/* Edge case lengths that might trigger bugs */
+	static const size_t edge_lengths[] = {
+		0, 1, 2, 3, 4, 5, 6, 7, 8,
+		15, 16, 17,
+		31, 32, 33,
+		63, 64, 65,
+		127, 128, 129,
+		255, 256, 257,
+		511, 512, 513,
+		1023, 1024, 1025,
+		1500, 1501,  /* MTU boundaries */
+		2047, 2048, 2049,
+		4095, 4096, 4097,
+		8191, 8192, 8193,
+		16383, 16384, 16385,
+		32767, 32768, 32769,
+		65534, 65535, 65536  /* 64K GRO boundaries */
+	};
+	unsigned int i;
+	int rc;
+
+	printf("Testing edge case lengths...\n");
+
+	for (i = 0; i < RTE_DIM(edge_lengths); i++) {
+		/* Test with zero initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], 0);
+		if (rc != TEST_SUCCESS)
+			return rc;
+
+		/* Test with random initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], get_initial_sum(true));
+		if (rc != TEST_SUCCESS)
+			return rc;
+	}
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test random lengths with optional random initial sums
+ */
+static int
+test_cksum_fuzz_random(unsigned int iterations, bool random_initial_sum)
+{
+	unsigned int i;
+	int rc;
+
+	printf("Testing random lengths (0-%d)%s...\n", MAX_TEST_LEN,
+	       random_initial_sum ? " with random initial sums" : "");
+
+	for (i = 0; i < iterations; i++) {
+		size_t len = rte_rand() % (MAX_TEST_LEN + 1);
+
+		rc = test_cksum_fuzz_length(len, get_initial_sum(random_initial_sum));
+		if (rc != TEST_SUCCESS) {
+			printf("Failed at len=%zu\n", len);
+			return rc;
+		}
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cksum_fuzz(void)
+{
+	int rc;
+	unsigned int iterations = DEFAULT_ITERATIONS;
+	printf("### __rte_raw_cksum optimization fuzz test ###\n");
+	printf("Iterations per test: %u\n\n", iterations);
+
+	/* Test edge cases */
+	rc = test_cksum_fuzz_edge_cases();
+	if (rc != TEST_SUCCESS) {
+		printf("Edge case test FAILED\n");
+		return rc;
+	}
+	printf("Edge case test PASSED\n\n");
+
+	/* Test random lengths with zero initial sum */
+	rc = test_cksum_fuzz_random(iterations, false);
+	if (rc != TEST_SUCCESS) {
+		printf("Random length test FAILED\n");
+		return rc;
+	}
+	printf("Random length test PASSED\n\n");
+
+	/* Test random lengths with random initial sums */
+	rc = test_cksum_fuzz_random(iterations, true);
+	if (rc != TEST_SUCCESS) {
+		printf("Random initial sum test FAILED\n");
+		return rc;
+	}
+	printf("Random initial sum test PASSED\n\n");
+
+	printf("All fuzz tests PASSED!\n");
+	return TEST_SUCCESS;
+}
+
+REGISTER_FAST_TEST(cksum_fuzz_autotest, true, true, test_cksum_fuzz);
diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
index 0b919cd59f..6b1d4589e0 100644
--- a/app/test/test_cksum_perf.c
+++ b/app/test/test_cksum_perf.c
@@ -15,7 +15,7 @@
 #define NUM_BLOCKS 10
 #define ITERATIONS 1000000
 
-static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
+static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501, 9000, 9001, 65536, 65537 };
 
 static __rte_noinline uint16_t
 do_rte_raw_cksum(const void *buf, size_t len)
diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
index a8e8927952..f04b46a6c3 100644
--- a/lib/net/rte_cksum.h
+++ b/lib/net/rte_cksum.h
@@ -42,15 +42,11 @@ extern "C" {
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-	const void *end;
-
-	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
-	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
-		uint16_t v;
-
-		memcpy(&v, buf, sizeof(uint16_t));
-		sum += v;
-	}
+	/* Process uint16 chunks to preserve overflow/carry math. GCC/Clang vectorize the loop. */
+	const unaligned_uint16_t *buf16 = (const unaligned_uint16_t *)buf;
+	const unaligned_uint16_t *end = buf16 + (len / sizeof(*buf16));
+	for (; buf16 != end; buf16++)
+		sum += *buf16;
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
-- 
2.39.5 (Apple Git-154)



* RE: [PATCH v14 1/2] eal: add __rte_may_alias to unaligned typedefs
  2026-01-12 12:04 ` [PATCH v14 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
@ 2026-01-12 13:28   ` Morten Brørup
  2026-01-12 15:00     ` Scott Mitchell
  0 siblings, 1 reply; 39+ messages in thread
From: Morten Brørup @ 2026-01-12 13:28 UTC (permalink / raw)
  To: scott.k.mitch1, dev; +Cc: stephen

> From: Scott Mitchell <scott.k.mitch1@gmail.com>
> 
> Add __rte_may_alias attribute to unaligned_uint{16,32,64}_t typedefs
> to prevent GCC strict-aliasing optimization bugs. GCC has a bug where
> it incorrectly elides struct initialization when strict aliasing is
> enabled, causing reads from uninitialized memory.
> 
> The __rte_may_alias attribute signals to the compiler that these types
> can alias other types, preventing the incorrect optimization.

I'm wondering if this is the right place to add __rte_may_alias, i.e. if the scope of the workaround is correct.

Are the unaligned_uintNN_t types only used in a way where they are affected by the GCC bug?
If not, adding __rte_may_alias to the types themselves may be too broad.

Does the GCC bug only affect the unaligned_uintNN_t types?
Or does it occur elsewhere or for other types too? Then this workaround only solves the problem for parts of the code.

Minor detail:
If the bug only occurs on GCC, not Clang, please make the workaround GCC-only, using the preprocessor.

> 
> Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
> ---
>  lib/eal/include/rte_common.h | 34 +++++++++++++++++++---------------
>  1 file changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/lib/eal/include/rte_common.h
> b/lib/eal/include/rte_common.h
> index 9e7d84f929..ac70270cfb 100644
> --- a/lib/eal/include/rte_common.h
> +++ b/lib/eal/include/rte_common.h
> @@ -121,14 +121,27 @@ extern "C" {
>  #define __rte_aligned(a) __attribute__((__aligned__(a)))
>  #endif
> 
> +/**
> + * Macro to mark a type that is not subject to type-based aliasing
> rules
> + */
> +#ifdef RTE_TOOLCHAIN_MSVC
> +#define __rte_may_alias
> +#else
> +#define __rte_may_alias __attribute__((__may_alias__))
> +#endif
> +
> +/**
> + * __rte_may_alias avoids compiler bugs (GCC) that elide
> initialization
> + * of memory when strict-aliasing is enabled.
> + */
>  #ifdef RTE_ARCH_STRICT_ALIGN
> -typedef uint64_t unaligned_uint64_t __rte_aligned(1);
> -typedef uint32_t unaligned_uint32_t __rte_aligned(1);
> -typedef uint16_t unaligned_uint16_t __rte_aligned(1);
> +typedef uint64_t unaligned_uint64_t __rte_may_alias __rte_aligned(1);
> +typedef uint32_t unaligned_uint32_t __rte_may_alias __rte_aligned(1);
> +typedef uint16_t unaligned_uint16_t __rte_may_alias __rte_aligned(1);
>  #else
> -typedef uint64_t unaligned_uint64_t;
> -typedef uint32_t unaligned_uint32_t;
> -typedef uint16_t unaligned_uint16_t;
> +typedef uint64_t unaligned_uint64_t __rte_may_alias;
> +typedef uint32_t unaligned_uint32_t __rte_may_alias;
> +typedef uint16_t unaligned_uint16_t __rte_may_alias;
>  #endif
> 
>  /**
> @@ -159,15 +172,6 @@ typedef uint16_t unaligned_uint16_t;
>  #define __rte_packed_end __attribute__((__packed__))
>  #endif
> 
> -/**
> - * Macro to mark a type that is not subject to type-based aliasing
> rules
> - */
> -#ifdef RTE_TOOLCHAIN_MSVC
> -#define __rte_may_alias
> -#else
> -#define __rte_may_alias __attribute__((__may_alias__))
> -#endif
> -
>  /******* Macro to mark functions and fields scheduled for removal
> *****/
>  #ifdef RTE_TOOLCHAIN_MSVC
>  #define __rte_deprecated
> --
> 2.39.5 (Apple Git-154)



* Re: [PATCH v14 1/2] eal: add __rte_may_alias to unaligned typedefs
  2026-01-12 13:28   ` Morten Brørup
@ 2026-01-12 15:00     ` Scott Mitchell
  0 siblings, 0 replies; 39+ messages in thread
From: Scott Mitchell @ 2026-01-12 15:00 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev, stephen

> I'm wondering if this is the right place to add __rte_may_alias, i.e. if the scope of the workaround is correct.
>
> Are the unaligned_uintNN_t types only used in a way where they are affected by the GCC bug?

All uses in DPDK are for aliasing (casting pointers to access existing
memory). There are no cases where these types declare actual data.
Therefore adding __rte_may_alias is semantically correct and not too
broad.

> If not, adding __rte_may_alias to the types themselves may be too broad.
>
> Does the GCC bug only affect the unaligned_uintNN_t types?
> Or does it occur elsewhere or for other types too? Then this workaround only solves the problem for parts of the code.
>

The GCC strict-aliasing bug is broader and can occur with other
aliasing patterns involving struct initialization. This patch targets
the unaligned_uintNN_t types specifically because:
1. They are known to trigger the bug in practice (reproduced in testing)
2. They are explicitly designed for aliasing
3. All existing DPDK usage is for aliasing

Added benefits of this approach:
1. Simplifies existing workarounds: we can remove the intermediate
packed structs in rte_memcpy.h for x86 and use unaligned_uintNN_t
directly
(https://elixir.bootlin.com/dpdk/v25.11/source/lib/eal/x86/include/rte_memcpy.h#L66)
2. Provides a safe aliasing primitive: if other code needs to alias
types and wants to avoid potential bugs, the unaligned_uintNN_t types
are now a correct, safe option

> Minor detail:
> If the bug only occurs on GCC, not Clang, please make the workaround GCC-only, using the preprocessor.

I've only reproduced the bug on GCC, but __rte_may_alias is
semantically correct for these types on all compilers, since they are
exclusively used for aliasing. The attribute:
- Has no negative impact on GCC/Clang (verified on Godbolt - still
  optimizes correctly: https://godbolt.org/z/Gj9EfqMTn)
- Makes the code semantically accurate about its intent
- Avoids #ifdef complexity

However, if you prefer a GCC-only workaround for minimal change, I'm
happy to add the preprocessor conditionals. Please let me know your
preference.
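
To sketch the rte_memcpy.h simplification mentioned above (hypothetical
helper name; assumes a GCC/Clang toolchain and the typedef shape from
patch 1), a may_alias + aligned(1) typedef can replace the intermediate
packed struct for unaligned word copies:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 64-bit variant: may_alias plus 1-byte alignment, as
 * unaligned_uint64_t would look after this patch on strict-alignment
 * targets. */
typedef uint64_t u64_ua __attribute__((__may_alias__, __aligned__(1)));

/* Unaligned 8-byte copy through the typedef, standing in for the
 * packed-struct trick currently used in x86 rte_memcpy.h. */
static inline void
copy64(void *dst, const void *src)
{
	*(u64_ua *)dst = *(const u64_ua *)src;
}
```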


* [PATCH v15 0/2] net: optimize __rte_raw_cksum
  2026-01-12 12:04 [PATCH v14 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2026-01-12 12:04 ` [PATCH v14 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
  2026-01-12 12:04 ` [PATCH v14 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
@ 2026-01-17 21:21 ` scott.k.mitch1
  2026-01-17 21:21   ` [PATCH v15 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
                     ` (3 more replies)
  2 siblings, 4 replies; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-17 21:21 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

This series optimizes __rte_raw_cksum by replacing memcpy with direct
pointer access, enabling compiler vectorization on both GCC and Clang.

Patch 1 adds __rte_may_alias to unaligned typedefs to prevent a GCC
strict-aliasing bug where struct initialization is incorrectly elided.

Patch 2 uses the improved unaligned_uint16_t type in __rte_raw_cksum
to enable compiler optimizations while maintaining correctness across
all architectures (including strict-alignment platforms).

Performance results show significant improvements (40% for small buffers,
up to 8x for larger buffers) on Intel Xeon with Clang 18.1.

Changes in v15:
- Use NOHUGE_OK and ASAN_OK constants in REGISTER_FAST_TEST

Changes in v14:
- Split into two patches: EAL typedef fix and checksum optimization
- Use unaligned_uint16_t directly instead of wrapper struct
- Added __rte_may_alias to unaligned typedefs to prevent GCC bug

Scott Mitchell (2):
  eal: add __rte_may_alias to unaligned typedefs
  net: __rte_raw_cksum pointers enable compiler optimizations

 app/test/meson.build         |   1 +
 app/test/test_cksum_fuzz.c   | 240 +++++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c   |   2 +-
 lib/eal/include/rte_common.h |  34 ++---
 lib/net/rte_cksum.h          |  14 +-
 5 files changed, 266 insertions(+), 25 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

--
2.39.5 (Apple Git-154)



* [PATCH v15 1/2] eal: add __rte_may_alias to unaligned typedefs
  2026-01-17 21:21 ` [PATCH v15 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
@ 2026-01-17 21:21   ` scott.k.mitch1
  2026-01-20 15:23     ` Morten Brørup
  2026-01-17 21:21   ` [PATCH v15 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-17 21:21 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

Add __rte_may_alias attribute to unaligned_uint{16,32,64}_t typedefs
to prevent GCC strict-aliasing optimization bugs. GCC has a bug where
it incorrectly elides struct initialization when strict aliasing is
enabled, causing reads from uninitialized memory.

The __rte_may_alias attribute signals to the compiler that these types
can alias other types, preventing the incorrect optimization.

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 lib/eal/include/rte_common.h | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/lib/eal/include/rte_common.h b/lib/eal/include/rte_common.h
index 9e7d84f929..ac70270cfb 100644
--- a/lib/eal/include/rte_common.h
+++ b/lib/eal/include/rte_common.h
@@ -121,14 +121,27 @@ extern "C" {
 #define __rte_aligned(a) __attribute__((__aligned__(a)))
 #endif
 
+/**
+ * Macro to mark a type that is not subject to type-based aliasing rules
+ */
+#ifdef RTE_TOOLCHAIN_MSVC
+#define __rte_may_alias
+#else
+#define __rte_may_alias __attribute__((__may_alias__))
+#endif
+
+/**
+ * __rte_may_alias avoids compiler bugs (GCC) that elide initialization
+ * of memory when strict-aliasing is enabled.
+ */
 #ifdef RTE_ARCH_STRICT_ALIGN
-typedef uint64_t unaligned_uint64_t __rte_aligned(1);
-typedef uint32_t unaligned_uint32_t __rte_aligned(1);
-typedef uint16_t unaligned_uint16_t __rte_aligned(1);
+typedef uint64_t unaligned_uint64_t __rte_may_alias __rte_aligned(1);
+typedef uint32_t unaligned_uint32_t __rte_may_alias __rte_aligned(1);
+typedef uint16_t unaligned_uint16_t __rte_may_alias __rte_aligned(1);
 #else
-typedef uint64_t unaligned_uint64_t;
-typedef uint32_t unaligned_uint32_t;
-typedef uint16_t unaligned_uint16_t;
+typedef uint64_t unaligned_uint64_t __rte_may_alias;
+typedef uint32_t unaligned_uint32_t __rte_may_alias;
+typedef uint16_t unaligned_uint16_t __rte_may_alias;
 #endif
 
 /**
@@ -159,15 +172,6 @@ typedef uint16_t unaligned_uint16_t;
 #define __rte_packed_end __attribute__((__packed__))
 #endif
 
-/**
- * Macro to mark a type that is not subject to type-based aliasing rules
- */
-#ifdef RTE_TOOLCHAIN_MSVC
-#define __rte_may_alias
-#else
-#define __rte_may_alias __attribute__((__may_alias__))
-#endif
-
 /******* Macro to mark functions and fields scheduled for removal *****/
 #ifdef RTE_TOOLCHAIN_MSVC
 #define __rte_deprecated
-- 
2.39.5 (Apple Git-154)



* [PATCH v15 2/2] net: __rte_raw_cksum pointers enable compiler optimizations
  2026-01-17 21:21 ` [PATCH v15 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2026-01-17 21:21   ` [PATCH v15 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
@ 2026-01-17 21:21   ` scott.k.mitch1
  2026-01-17 22:08   ` [PATCH v15 0/2] net: optimize __rte_raw_cksum Stephen Hemminger
  2026-01-23 16:02   ` [PATCH v16 " scott.k.mitch1
  3 siblings, 0 replies; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-17 21:21 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

__rte_raw_cksum uses a loop with memcpy on each iteration.
GCC 15+ is able to vectorize the loop but Clang 18.1 is not.

Replace memcpy with direct pointer access using unaligned_uint16_t.
This enables both GCC and Clang to vectorize the loop while handling
unaligned access safely on all architectures.

Performance results from cksum_perf_autotest on Intel Xeon
(Cascade Lake, AVX-512) built with Clang 18.1 (TSC cycles/byte):

  Block size    Before    After    Improvement
         100      0.40     0.24        ~40%
        1500      0.50     0.06        ~8x
        9000      0.49     0.06        ~8x

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 app/test/meson.build       |   1 +
 app/test/test_cksum_fuzz.c | 240 +++++++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c |   2 +-
 lib/net/rte_cksum.h        |  14 +--
 4 files changed, 247 insertions(+), 10 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

diff --git a/app/test/meson.build b/app/test/meson.build
index efec42a6bf..c92325ad58 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -38,6 +38,7 @@ source_file_deps = {
     'test_byteorder.c': [],
     'test_cfgfile.c': ['cfgfile'],
     'test_cksum.c': ['net'],
+    'test_cksum_fuzz.c': ['net'],
     'test_cksum_perf.c': ['net'],
     'test_cmdline.c': [],
     'test_cmdline_cirbuf.c': [],
diff --git a/app/test/test_cksum_fuzz.c b/app/test/test_cksum_fuzz.c
new file mode 100644
index 0000000000..3df11e3dc2
--- /dev/null
+++ b/app/test/test_cksum_fuzz.c
@@ -0,0 +1,240 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Apple Inc.
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_hexdump.h>
+#include <rte_cksum.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+/*
+ * Fuzz test for __rte_raw_cksum optimization.
+ * Compares the optimized implementation against the original reference
+ * implementation across random data of various lengths.
+ */
+
+#define DEFAULT_ITERATIONS 1000
+#define MAX_TEST_LEN 65536  /* 64K to match GRO frame sizes */
+
+/*
+ * Original (reference) implementation of __rte_raw_cksum from DPDK v23.11.
+ * This is retained here for comparison testing against the optimized version.
+ */
+static inline uint32_t
+__rte_raw_cksum_reference(const void *buf, size_t len, uint32_t sum)
+{
+	const void *end;
+
+	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
+	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
+		uint16_t v;
+
+		memcpy(&v, buf, sizeof(uint16_t));
+		sum += v;
+	}
+
+	/* if length is odd, keeping it byte order independent */
+	if (unlikely(len % 2)) {
+		uint16_t left = 0;
+
+		memcpy(&left, end, 1);
+		sum += left;
+	}
+
+	return sum;
+}
+
+static void
+init_random_buffer(uint8_t *buf, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		buf[i] = (uint8_t)rte_rand();
+}
+
+static inline uint32_t
+get_initial_sum(bool random_initial_sum)
+{
+	return random_initial_sum ? (rte_rand() & 0xFFFFFFFF) : 0;
+}
+
+/*
+ * Test a single buffer length with specific alignment and initial sum
+ */
+static int
+test_cksum_fuzz_length_aligned(size_t len, bool aligned, uint32_t initial_sum)
+{
+	uint8_t *data;
+	uint8_t *buf;
+	size_t alloc_size;
+	uint32_t sum_ref, sum_opt;
+
+	if (len == 0 && !aligned) {
+		/* Skip unaligned test for zero length - nothing to test */
+		return TEST_SUCCESS;
+	}
+
+	/* Allocate exact size for aligned, +1 for unaligned offset */
+	alloc_size = aligned ? len : len + 1;
+	if (alloc_size == 0)
+		alloc_size = 1;  /* rte_malloc doesn't like 0 */
+
+	data = rte_malloc(NULL, alloc_size, 64);
+	if (data == NULL) {
+		printf("Failed to allocate %zu bytes\n", alloc_size);
+		return TEST_FAILED;
+	}
+
+	buf = aligned ? data : (data + 1);
+
+	init_random_buffer(buf, len);
+
+	sum_ref = __rte_raw_cksum_reference(buf, len, initial_sum);
+	sum_opt = __rte_raw_cksum(buf, len, initial_sum);
+
+	if (sum_ref != sum_opt) {
+		printf("MISMATCH at len=%zu aligned='%s' initial_sum=0x%08x ref=0x%08x opt=0x%08x\n",
+		       len, aligned ? "aligned" : "unaligned",
+		       initial_sum, sum_ref, sum_opt);
+		rte_hexdump(stdout, "failing buffer", buf, len);
+		rte_free(data);
+		return TEST_FAILED;
+	}
+
+	rte_free(data);
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test a length with both alignments
+ */
+static int
+test_cksum_fuzz_length(size_t len, uint32_t initial_sum)
+{
+	int rc;
+
+	/* Test aligned */
+	rc = test_cksum_fuzz_length_aligned(len, true, initial_sum);
+	if (rc != TEST_SUCCESS)
+		return rc;
+
+	/* Test unaligned */
+	rc = test_cksum_fuzz_length_aligned(len, false, initial_sum);
+
+	return rc;
+}
+
+/*
+ * Test specific edge case lengths
+ */
+static int
+test_cksum_fuzz_edge_cases(void)
+{
+	/* Edge case lengths that might trigger bugs */
+	static const size_t edge_lengths[] = {
+		0, 1, 2, 3, 4, 5, 6, 7, 8,
+		15, 16, 17,
+		31, 32, 33,
+		63, 64, 65,
+		127, 128, 129,
+		255, 256, 257,
+		511, 512, 513,
+		1023, 1024, 1025,
+		1500, 1501,  /* MTU boundaries */
+		2047, 2048, 2049,
+		4095, 4096, 4097,
+		8191, 8192, 8193,
+		16383, 16384, 16385,
+		32767, 32768, 32769,
+		65534, 65535, 65536  /* 64K GRO boundaries */
+	};
+	unsigned int i;
+	int rc;
+
+	printf("Testing edge case lengths...\n");
+
+	for (i = 0; i < RTE_DIM(edge_lengths); i++) {
+		/* Test with zero initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], 0);
+		if (rc != TEST_SUCCESS)
+			return rc;
+
+		/* Test with random initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], get_initial_sum(true));
+		if (rc != TEST_SUCCESS)
+			return rc;
+	}
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test random lengths with optional random initial sums
+ */
+static int
+test_cksum_fuzz_random(unsigned int iterations, bool random_initial_sum)
+{
+	unsigned int i;
+	int rc;
+
+	printf("Testing random lengths (0-%d)%s...\n", MAX_TEST_LEN,
+	       random_initial_sum ? " with random initial sums" : "");
+
+	for (i = 0; i < iterations; i++) {
+		size_t len = rte_rand() % (MAX_TEST_LEN + 1);
+
+		rc = test_cksum_fuzz_length(len, get_initial_sum(random_initial_sum));
+		if (rc != TEST_SUCCESS) {
+			printf("Failed at len=%zu\n", len);
+			return rc;
+		}
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cksum_fuzz(void)
+{
+	int rc;
+	unsigned int iterations = DEFAULT_ITERATIONS;
+	printf("### __rte_raw_cksum optimization fuzz test ###\n");
+	printf("Iterations per test: %u\n\n", iterations);
+
+	/* Test edge cases */
+	rc = test_cksum_fuzz_edge_cases();
+	if (rc != TEST_SUCCESS) {
+		printf("Edge case test FAILED\n");
+		return rc;
+	}
+	printf("Edge case test PASSED\n\n");
+
+	/* Test random lengths with zero initial sum */
+	rc = test_cksum_fuzz_random(iterations, false);
+	if (rc != TEST_SUCCESS) {
+		printf("Random length test FAILED\n");
+		return rc;
+	}
+	printf("Random length test PASSED\n\n");
+
+	/* Test random lengths with random initial sums */
+	rc = test_cksum_fuzz_random(iterations, true);
+	if (rc != TEST_SUCCESS) {
+		printf("Random initial sum test FAILED\n");
+		return rc;
+	}
+	printf("Random initial sum test PASSED\n\n");
+
+	printf("All fuzz tests PASSED!\n");
+	return TEST_SUCCESS;
+}
+
+REGISTER_FAST_TEST(cksum_fuzz_autotest, NOHUGE_OK, ASAN_OK, test_cksum_fuzz);
diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
index 0b919cd59f..6b1d4589e0 100644
--- a/app/test/test_cksum_perf.c
+++ b/app/test/test_cksum_perf.c
@@ -15,7 +15,7 @@
 #define NUM_BLOCKS 10
 #define ITERATIONS 1000000
 
-static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
+static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501, 9000, 9001, 65536, 65537 };
 
 static __rte_noinline uint16_t
 do_rte_raw_cksum(const void *buf, size_t len)
diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
index a8e8927952..f04b46a6c3 100644
--- a/lib/net/rte_cksum.h
+++ b/lib/net/rte_cksum.h
@@ -42,15 +42,11 @@ extern "C" {
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-	const void *end;
-
-	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
-	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
-		uint16_t v;
-
-		memcpy(&v, buf, sizeof(uint16_t));
-		sum += v;
-	}
+	/* Process uint16 chunks to preserve overflow/carry math. GCC/Clang vectorize the loop. */
+	const unaligned_uint16_t *buf16 = (const unaligned_uint16_t *)buf;
+	const unaligned_uint16_t *end = buf16 + (len / sizeof(*buf16));
+	for (; buf16 != end; buf16++)
+		sum += *buf16;
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v15 0/2] net: optimize __rte_raw_cksum
  2026-01-17 21:21 ` [PATCH v15 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2026-01-17 21:21   ` [PATCH v15 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
  2026-01-17 21:21   ` [PATCH v15 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
@ 2026-01-17 22:08   ` Stephen Hemminger
  2026-01-20 12:45     ` Morten Brørup
  2026-01-23 16:02   ` [PATCH v16 " scott.k.mitch1
  3 siblings, 1 reply; 39+ messages in thread
From: Stephen Hemminger @ 2026-01-17 22:08 UTC (permalink / raw)
  To: scott.k.mitch1; +Cc: dev, mb

On Sat, 17 Jan 2026 13:21:12 -0800
scott.k.mitch1@gmail.com wrote:

> From: Scott Mitchell <scott.k.mitch1@gmail.com>
> 
> This series optimizes __rte_raw_cksum by replacing memcpy with direct
> pointer access, enabling compiler vectorization on both GCC and Clang.
> 
> Patch 1 adds __rte_may_alias to unaligned typedefs to prevent a GCC
> strict-aliasing bug where struct initialization is incorrectly elided.
> 
> Patch 2 uses the improved unaligned_uint16_t type in __rte_raw_cksum
> to enable compiler optimizations while maintaining correctness across
> all architectures (including strict-alignment platforms).
> 
> Performance results show significant improvements (40% for small buffers,
> up to 8x for larger buffers) on Intel Xeon with Clang 18.1.
> 
> Changes in v15:
> - Use NOHUGE_OK and ASAN_OK constants in REGISTER_FAST_TEST
> 
> Changes in v14:
> - Split into two patches: EAL typedef fix and checksum optimization
> - Use unaligned_uint16_t directly instead of wrapper struct
> - Added __rte_may_alias to unaligned typedefs to prevent GCC bug
> 
> Scott Mitchell (2):
>   eal: add __rte_may_alias to unaligned typedefs
>   net: __rte_raw_cksum pointers enable compiler optimizations
> 
>  app/test/meson.build         |   1 +
>  app/test/test_cksum_fuzz.c   | 240 +++++++++++++++++++++++++++++++++++
>  app/test/test_cksum_perf.c   |   2 +-
>  lib/eal/include/rte_common.h |  34 ++---
>  lib/net/rte_cksum.h          |  14 +-
>  5 files changed, 266 insertions(+), 25 deletions(-)
>  create mode 100644 app/test/test_cksum_fuzz.c
> 
> --
> 2.39.5 (Apple Git-154)
> 

Looks good now.
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

AI review agrees with me...


## Patch Review: [PATCH v15 1/2] eal: add __rte_may_alias to unaligned typedefs

### Commit Message Analysis

| Criterion | Status | Notes |
|-----------|--------|-------|
| Subject ≤60 chars | ✅ Pass | 47 characters |
| Lowercase after colon | ✅ Pass | "add __rte_may_alias..." |
| Imperative mood | ✅ Pass | "add" |
| No trailing period | ✅ Pass | |
| Correct prefix | ✅ Pass | "eal:" for lib/eal/ files |
| Body ≤75 chars/line | ✅ Pass | Lines appear within limit |
| Body doesn't start with "It" | ✅ Pass | Starts with "Add" |
| Signed-off-by present | ✅ Pass | `Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>` |

### Missing Tags (Warning)

**No `Fixes:` tag**: The commit message describes fixing "GCC strict-aliasing optimization bugs" and "incorrect optimization." This sounds like a bug fix that should reference the original commit introducing the unaligned typedefs. Consider adding:
```
Fixes: <12-char-sha> ("original commit introducing unaligned typedefs")
```

**No `Cc: stable@dpdk.org`**: If this fixes a real bug causing reads from uninitialized memory, it's likely a stable backport candidate.

### Code Review

**Positive aspects:**
- Proper Doxygen comment added for `__rte_may_alias` macro
- Good explanation of the GCC bug workaround
- MSVC fallback handled correctly
- Macro moved before its use (necessary for the typedefs)

**Minor observations:**
- The second comment block (lines 121-124 in the diff) is somewhat redundant with the first Doxygen comment. Consider consolidating.

---

## Patch Review: [PATCH v15 2/2] net: use unaligned type for raw checksum

### Commit Message Analysis

The mbox was truncated, but based on what's visible:

| Criterion | Status | Notes |
|-----------|--------|-------|
| Correct prefix | ✅ Pass | "net:" for lib/net/ files |

### Code Review - lib/net/rte_cksum.h

**The core change:**
```c
// OLD (memcpy-based):
for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
    uint16_t v;
    memcpy(&v, buf, sizeof(uint16_t));
    sum += v;
}

// NEW (direct access via unaligned type):
const unaligned_uint16_t *buf16 = (const unaligned_uint16_t *)buf;
const unaligned_uint16_t *end = buf16 + (len / sizeof(*buf16));
for (; buf16 != end; buf16++)
    sum += *buf16;
```

**Positive aspects:**
- Cleaner, more readable code
- Relies on the `__rte_may_alias` attribute from patch 1 to prevent aliasing bugs
- Comment explains vectorization benefit: "GCC/Clang vectorize the loop"
- Good dependency ordering (patch 1 must come before patch 2)

**Style observations:**
- ✅ Line length within 100 chars
- ✅ Proper use of `const`

### Code Review - app/test/test_cksum_fuzz.c (New File)

**Positive aspects:**
- ✅ Uses `TEST_SUCCESS`/`TEST_FAILED` correctly
- ✅ Uses `REGISTER_FAST_TEST` macro properly
- ✅ `printf()` usage is acceptable in test code per AGENTS.md
- ✅ `rte_malloc()` usage acceptable in test code
- ✅ Comprehensive edge case testing (power-of-2 boundaries, MTU sizes, GRO boundaries)
- ✅ Tests both aligned and unaligned cases
- ✅ Tests with zero and random initial sums

**Issues to verify** (file header not visible in truncated mbox):
- Ensure SPDX license identifier present on first line
- Ensure copyright line follows SPDX
- Ensure blank line before includes

**Style warning (lines 394-396):**
```c
printf("MISMATCH at len=%zu aligned='%s' initial_sum=0x%08x ref=0x%08x opt=0x%08x\n",
       len, aligned ? "aligned" : "unaligned",
       initial_sum, sum_ref, sum_opt);
```
Line length appears to be ~95 chars, which is acceptable (<100).

### Code Review - app/test/test_cksum_perf.c

Minor change extending test coverage - looks fine.

---

## Summary

### Errors (Must Fix)
None identified.

### Warnings (Should Fix)

| Issue | Patch | Recommendation |
|-------|-------|----------------|
| Missing `Fixes:` tag | 1/2 | Add if this fixes a regression from a specific commit |
| Missing `Cc: stable@dpdk.org` | 1/2 | Consider if this should be backported |
| Verify SPDX header | 2/2 | Ensure test_cksum_fuzz.c has proper license header |

### Info (Consider)

1. **Patch 1**: The two comment blocks for `__rte_may_alias` could be consolidated into a single, more comprehensive Doxygen comment.

2. **Patch 2**: The new fuzz test is well-structured and follows DPDK test conventions. Good use of the `unit_test_suite_runner`-style approach with `REGISTER_FAST_TEST`.

3. **Series overall**: Good logical ordering - patch 1 provides the infrastructure, patch 2 uses it. Each commit should compile independently.

---

**Verdict**: This is a well-structured patch series at v15. The code changes are clean and the test coverage is thorough. The main actionable items are adding appropriate `Fixes:` and `Cc: stable` tags if this is indeed a bug fix worth backporting.



* RE: [PATCH v15 0/2] net: optimize __rte_raw_cksum
  2026-01-17 22:08   ` [PATCH v15 0/2] net: optimize __rte_raw_cksum Stephen Hemminger
@ 2026-01-20 12:45     ` Morten Brørup
  2026-01-23 15:43       ` Scott Mitchell
  0 siblings, 1 reply; 39+ messages in thread
From: Morten Brørup @ 2026-01-20 12:45 UTC (permalink / raw)
  To: scott.k.mitch1; +Cc: dev, Stephen Hemminger

> > From: Scott Mitchell <scott.k.mitch1@gmail.com>
> >
> > This series optimizes __rte_raw_cksum by replacing memcpy with direct
> > pointer access, enabling compiler vectorization on both GCC and
> Clang.
> >
> > Patch 1 adds __rte_may_alias to unaligned typedefs to prevent a GCC
> > strict-aliasing bug where struct initialization is incorrectly
> elided.
> >
> > Patch 2 uses the improved unaligned_uint16_t type in __rte_raw_cksum
> > to enable compiler optimizations while maintaining correctness across
> > all architectures (including strict-alignment platforms).
> >
> > Performance results show significant improvements (40% for small
> buffers,
> > up to 8x for larger buffers) on Intel Xeon with Clang 18.1.
> >
> > Changes in v15:
> > - Use NOHUGE_OK and ASAN_OK constants in REGISTER_FAST_TEST
> >
> > Changes in v14:
> > - Split into two patches: EAL typedef fix and checksum optimization
> > - Use unaligned_uint16_t directly instead of wrapper struct
> > - Added __rte_may_alias to unaligned typedefs to prevent GCC bug
> >
> > Scott Mitchell (2):
> >   eal: add __rte_may_alias to unaligned typedefs
> >   net: __rte_raw_cksum pointers enable compiler optimizations
> >
> >  app/test/meson.build         |   1 +
> >  app/test/test_cksum_fuzz.c   | 240
> +++++++++++++++++++++++++++++++++++
> >  app/test/test_cksum_perf.c   |   2 +-
> >  lib/eal/include/rte_common.h |  34 ++---
> >  lib/net/rte_cksum.h          |  14 +-
> >  5 files changed, 266 insertions(+), 25 deletions(-)
> >  create mode 100644 app/test/test_cksum_fuzz.c
> >
> > --
> > 2.39.5 (Apple Git-154)
> >
> 
> Looks good now.
> Acked-by: Stephen Hemminger <stephen@networkplumber.org>

LGTM too.
Acked-by: Morten Brørup <mb@smartsharesystems.com>

Thank you for the effort and prompt reaction to feedback, Scott.
It has been a pleasure reviewing this series!



* RE: [PATCH v15 1/2] eal: add __rte_may_alias to unaligned typedefs
  2026-01-17 21:21   ` [PATCH v15 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
@ 2026-01-20 15:23     ` Morten Brørup
  2026-01-23 14:34       ` Scott Mitchell
  0 siblings, 1 reply; 39+ messages in thread
From: Morten Brørup @ 2026-01-20 15:23 UTC (permalink / raw)
  To: stable; +Cc: scott.k.mitch1, dev, stephen

> From: scott.k.mitch1@gmail.com [mailto:scott.k.mitch1@gmail.com]
> Sent: Saturday, 17 January 2026 22.21
> To: dev@dpdk.org
> Cc: Morten Brørup; stephen@networkplumber.org; Scott Mitchell
> Subject: [PATCH v15 1/2] eal: add __rte_may_alias to unaligned typedefs
> 
> From: Scott Mitchell <scott.k.mitch1@gmail.com>
> 
> Add __rte_may_alias attribute to unaligned_uint{16,32,64}_t typedefs
> to prevent GCC strict-aliasing optimization bugs. GCC has a bug where
> it incorrectly elides struct initialization when strict aliasing is
> enabled, causing reads from uninitialized memory.
> 
> The __rte_may_alias attribute signals to the compiler that these types
> can alias other types, preventing the incorrect optimization.

Although this is a workaround to fix bugs in GCC, and not DPDK itself,
I think it should be backported.
It may fix (GCC induced) bugs in applications using these types.
That's my opinion; let's get more opinions.

If so, add:

Fixes: 7621d6a8d0bd ("eal: add and use unaligned integer types")

> 
> Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
> ---
>  lib/eal/include/rte_common.h | 34 +++++++++++++++++++---------------
>  1 file changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/lib/eal/include/rte_common.h
> b/lib/eal/include/rte_common.h
> index 9e7d84f929..ac70270cfb 100644
> --- a/lib/eal/include/rte_common.h
> +++ b/lib/eal/include/rte_common.h
> @@ -121,14 +121,27 @@ extern "C" {
>  #define __rte_aligned(a) __attribute__((__aligned__(a)))
>  #endif
> 
> +/**
> + * Macro to mark a type that is not subject to type-based aliasing
> rules
> + */
> +#ifdef RTE_TOOLCHAIN_MSVC
> +#define __rte_may_alias
> +#else
> +#define __rte_may_alias __attribute__((__may_alias__))
> +#endif
> +
> +/**
> + * __rte_may_alias avoids compiler bugs (GCC) that elide
> initialization
> + * of memory when strict-aliasing is enabled.
> + */
>  #ifdef RTE_ARCH_STRICT_ALIGN
> -typedef uint64_t unaligned_uint64_t __rte_aligned(1);
> -typedef uint32_t unaligned_uint32_t __rte_aligned(1);
> -typedef uint16_t unaligned_uint16_t __rte_aligned(1);
> +typedef uint64_t unaligned_uint64_t __rte_may_alias __rte_aligned(1);
> +typedef uint32_t unaligned_uint32_t __rte_may_alias __rte_aligned(1);
> +typedef uint16_t unaligned_uint16_t __rte_may_alias __rte_aligned(1);
>  #else
> -typedef uint64_t unaligned_uint64_t;
> -typedef uint32_t unaligned_uint32_t;
> -typedef uint16_t unaligned_uint16_t;
> +typedef uint64_t unaligned_uint64_t __rte_may_alias;
> +typedef uint32_t unaligned_uint32_t __rte_may_alias;
> +typedef uint16_t unaligned_uint16_t __rte_may_alias;
>  #endif
> 
>  /**
> @@ -159,15 +172,6 @@ typedef uint16_t unaligned_uint16_t;
>  #define __rte_packed_end __attribute__((__packed__))
>  #endif
> 
> -/**
> - * Macro to mark a type that is not subject to type-based aliasing
> rules
> - */
> -#ifdef RTE_TOOLCHAIN_MSVC
> -#define __rte_may_alias
> -#else
> -#define __rte_may_alias __attribute__((__may_alias__))
> -#endif
> -
>  /******* Macro to mark functions and fields scheduled for removal
> *****/
>  #ifdef RTE_TOOLCHAIN_MSVC
>  #define __rte_deprecated
> --
> 2.39.5 (Apple Git-154)



* Re: [PATCH v15 1/2] eal: add __rte_may_alias to unaligned typedefs
  2026-01-20 15:23     ` Morten Brørup
@ 2026-01-23 14:34       ` Scott Mitchell
  0 siblings, 0 replies; 39+ messages in thread
From: Scott Mitchell @ 2026-01-23 14:34 UTC (permalink / raw)
  To: Morten Brørup; +Cc: stable, dev, stephen

>
> Although this is a workaround to fix bugs in GCC, and not DPDK itself,
> I think it should be backported.
> It may fix (GCC induced) bugs in applications using these types.
> That's my opinion; let's get more opinions.
>
> If so, add:
>
> Fixes: 7621d6a8d0bd ("eal: add and use unaligned integer types")
>

I agree. I will submit a v16 with this.


* Re: [PATCH v15 0/2] net: optimize __rte_raw_cksum
  2026-01-20 12:45     ` Morten Brørup
@ 2026-01-23 15:43       ` Scott Mitchell
  0 siblings, 0 replies; 39+ messages in thread
From: Scott Mitchell @ 2026-01-23 15:43 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev, Stephen Hemminger

Awesome! Thanks Morten & Stephen for the review and constructive
feedback, leading to a better result in the end!

I will submit a v16 with 1/2 including "Fixes: 7621d6a8d0bd ("eal: add
and use unaligned integer types")" as requested.


* [PATCH v16 0/2] net: optimize __rte_raw_cksum
  2026-01-17 21:21 ` [PATCH v15 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
                     ` (2 preceding siblings ...)
  2026-01-17 22:08   ` [PATCH v15 0/2] net: optimize __rte_raw_cksum Stephen Hemminger
@ 2026-01-23 16:02   ` scott.k.mitch1
  2026-01-23 16:02     ` [PATCH v16 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
                       ` (3 more replies)
  3 siblings, 4 replies; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-23 16:02 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott

From: Scott <scott_mitchell@apple.com>

This series optimizes __rte_raw_cksum by replacing memcpy with direct
pointer access, enabling compiler vectorization on both GCC and Clang.

Patch 1 adds __rte_may_alias to unaligned typedefs to prevent a GCC
strict-aliasing bug where struct initialization is incorrectly elided.

Patch 2 uses the improved unaligned_uint16_t type in __rte_raw_cksum
to enable compiler optimizations while maintaining correctness across
all architectures (including strict-alignment platforms).

Performance results show significant improvements (40% for small buffers,
up to 8x for larger buffers) on Intel Xeon with Clang 18.1.

Changes in v16:
- Add Fixes tag and Cc stable/author for backporting (patch 1)

Changes in v15:
- Use NOHUGE_OK and ASAN_OK constants in REGISTER_FAST_TEST

Changes in v14:
- Split into two patches: EAL typedef fix and checksum optimization
- Use unaligned_uint16_t directly instead of wrapper struct
- Added __rte_may_alias to unaligned typedefs to prevent GCC bug

Scott Mitchell (2):
  eal: add __rte_may_alias to unaligned typedefs
  net: __rte_raw_cksum pointers enable compiler optimizations

 app/test/meson.build         |   1 +
 app/test/test_cksum_fuzz.c   | 240 +++++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c   |   2 +-
 lib/eal/include/rte_common.h |  34 ++---
 lib/net/rte_cksum.h          |  14 +-
 5 files changed, 266 insertions(+), 25 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

-- 
2.39.5 (Apple Git-154)



* [PATCH v16 1/2] eal: add __rte_may_alias to unaligned typedefs
  2026-01-23 16:02   ` [PATCH v16 " scott.k.mitch1
@ 2026-01-23 16:02     ` scott.k.mitch1
  2026-01-23 16:02     ` [PATCH v16 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-23 16:02 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott Mitchell, Cyril Chemparathy, stable

From: Scott Mitchell <scott.k.mitch1@gmail.com>

Add __rte_may_alias attribute to unaligned_uint{16,32,64}_t typedefs
to prevent GCC strict-aliasing optimization bugs. GCC has a bug where
it incorrectly elides struct initialization when strict aliasing is
enabled, causing reads from uninitialized memory.

The __rte_may_alias attribute signals to the compiler that these types
can alias other types, preventing the incorrect optimization.

Fixes: 7621d6a8d0bd ("eal: add and use unaligned integer types")
Cc: Cyril Chemparathy <cchemparathy@ezchip.com>
Cc: stable@dpdk.org
Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 lib/eal/include/rte_common.h | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/lib/eal/include/rte_common.h b/lib/eal/include/rte_common.h
index 573bf4f2ce..8a9623ea74 100644
--- a/lib/eal/include/rte_common.h
+++ b/lib/eal/include/rte_common.h
@@ -121,14 +121,27 @@ extern "C" {
 #define __rte_aligned(a) __attribute__((__aligned__(a)))
 #endif
 
+/**
+ * Macro to mark a type that is not subject to type-based aliasing rules
+ */
+#ifdef RTE_TOOLCHAIN_MSVC
+#define __rte_may_alias
+#else
+#define __rte_may_alias __attribute__((__may_alias__))
+#endif
+
+/**
+ * __rte_may_alias avoids compiler bugs (GCC) that elide initialization
+ * of memory when strict-aliasing is enabled.
+ */
 #ifdef RTE_ARCH_STRICT_ALIGN
-typedef uint64_t unaligned_uint64_t __rte_aligned(1);
-typedef uint32_t unaligned_uint32_t __rte_aligned(1);
-typedef uint16_t unaligned_uint16_t __rte_aligned(1);
+typedef uint64_t unaligned_uint64_t __rte_may_alias __rte_aligned(1);
+typedef uint32_t unaligned_uint32_t __rte_may_alias __rte_aligned(1);
+typedef uint16_t unaligned_uint16_t __rte_may_alias __rte_aligned(1);
 #else
-typedef uint64_t unaligned_uint64_t;
-typedef uint32_t unaligned_uint32_t;
-typedef uint16_t unaligned_uint16_t;
+typedef uint64_t unaligned_uint64_t __rte_may_alias;
+typedef uint32_t unaligned_uint32_t __rte_may_alias;
+typedef uint16_t unaligned_uint16_t __rte_may_alias;
 #endif
 
 /**
@@ -159,15 +172,6 @@ typedef uint16_t unaligned_uint16_t;
 #define __rte_packed_end __attribute__((__packed__))
 #endif
 
-/**
- * Macro to mark a type that is not subject to type-based aliasing rules
- */
-#ifdef RTE_TOOLCHAIN_MSVC
-#define __rte_may_alias
-#else
-#define __rte_may_alias __attribute__((__may_alias__))
-#endif
-
 /******* Macro to mark functions and fields scheduled for removal *****/
 #ifdef RTE_TOOLCHAIN_MSVC
 #define __rte_deprecated
-- 
2.39.5 (Apple Git-154)



* [PATCH v16 2/2] net: __rte_raw_cksum pointers enable compiler optimizations
  2026-01-23 16:02   ` [PATCH v16 " scott.k.mitch1
  2026-01-23 16:02     ` [PATCH v16 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
@ 2026-01-23 16:02     ` scott.k.mitch1
  2026-01-28 11:05       ` David Marchand
  2026-01-24  8:23     ` [PATCH v16 0/2] net: optimize __rte_raw_cksum Morten Brørup
  2026-01-28 18:05     ` [PATCH v17 " scott.k.mitch1
  3 siblings, 1 reply; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-23 16:02 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

__rte_raw_cksum uses a loop with memcpy on each iteration.
GCC 15+ is able to vectorize the loop but Clang 18.1 is not.

Replace memcpy with direct pointer access using unaligned_uint16_t.
This enables both GCC and Clang to vectorize the loop while handling
unaligned access safely on all architectures.

Performance results from cksum_perf_autotest on Intel Xeon
(Cascade Lake, AVX-512) built with Clang 18.1 (TSC cycles/byte):

  Block size    Before    After    Improvement
         100      0.40     0.24        ~40%
        1500      0.50     0.06        ~8x
        9000      0.49     0.06        ~8x

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 app/test/meson.build       |   1 +
 app/test/test_cksum_fuzz.c | 240 +++++++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c |   2 +-
 lib/net/rte_cksum.h        |  14 +--
 4 files changed, 247 insertions(+), 10 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

diff --git a/app/test/meson.build b/app/test/meson.build
index f4d04a6e42..2ca17716b9 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -38,6 +38,7 @@ source_file_deps = {
     'test_byteorder.c': [],
     'test_cfgfile.c': ['cfgfile'],
     'test_cksum.c': ['net'],
+    'test_cksum_fuzz.c': ['net'],
     'test_cksum_perf.c': ['net'],
     'test_cmdline.c': [],
     'test_cmdline_cirbuf.c': [],
diff --git a/app/test/test_cksum_fuzz.c b/app/test/test_cksum_fuzz.c
new file mode 100644
index 0000000000..3df11e3dc2
--- /dev/null
+++ b/app/test/test_cksum_fuzz.c
@@ -0,0 +1,240 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Apple Inc.
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_hexdump.h>
+#include <rte_cksum.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+/*
+ * Fuzz test for __rte_raw_cksum optimization.
+ * Compares the optimized implementation against the original reference
+ * implementation across random data of various lengths.
+ */
+
+#define DEFAULT_ITERATIONS 1000
+#define MAX_TEST_LEN 65536  /* 64K to match GRO frame sizes */
+
+/*
+ * Original (reference) implementation of __rte_raw_cksum from DPDK v23.11.
+ * This is retained here for comparison testing against the optimized version.
+ */
+static inline uint32_t
+__rte_raw_cksum_reference(const void *buf, size_t len, uint32_t sum)
+{
+	const void *end;
+
+	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
+	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
+		uint16_t v;
+
+		memcpy(&v, buf, sizeof(uint16_t));
+		sum += v;
+	}
+
+	/* if length is odd, keeping it byte order independent */
+	if (unlikely(len % 2)) {
+		uint16_t left = 0;
+
+		memcpy(&left, end, 1);
+		sum += left;
+	}
+
+	return sum;
+}
+
+static void
+init_random_buffer(uint8_t *buf, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		buf[i] = (uint8_t)rte_rand();
+}
+
+static inline uint32_t
+get_initial_sum(bool random_initial_sum)
+{
+	return random_initial_sum ? (rte_rand() & 0xFFFFFFFF) : 0;
+}
+
+/*
+ * Test a single buffer length with specific alignment and initial sum
+ */
+static int
+test_cksum_fuzz_length_aligned(size_t len, bool aligned, uint32_t initial_sum)
+{
+	uint8_t *data;
+	uint8_t *buf;
+	size_t alloc_size;
+	uint32_t sum_ref, sum_opt;
+
+	if (len == 0 && !aligned) {
+		/* Skip unaligned test for zero length - nothing to test */
+		return TEST_SUCCESS;
+	}
+
+	/* Allocate exact size for aligned, +1 for unaligned offset */
+	alloc_size = aligned ? len : len + 1;
+	if (alloc_size == 0)
+		alloc_size = 1;  /* rte_malloc doesn't like 0 */
+
+	data = rte_malloc(NULL, alloc_size, 64);
+	if (data == NULL) {
+		printf("Failed to allocate %zu bytes\n", alloc_size);
+		return TEST_FAILED;
+	}
+
+	buf = aligned ? data : (data + 1);
+
+	init_random_buffer(buf, len);
+
+	sum_ref = __rte_raw_cksum_reference(buf, len, initial_sum);
+	sum_opt = __rte_raw_cksum(buf, len, initial_sum);
+
+	if (sum_ref != sum_opt) {
+		printf("MISMATCH at len=%zu aligned='%s' initial_sum=0x%08x ref=0x%08x opt=0x%08x\n",
+		       len, aligned ? "aligned" : "unaligned",
+		       initial_sum, sum_ref, sum_opt);
+		rte_hexdump(stdout, "failing buffer", buf, len);
+		rte_free(data);
+		return TEST_FAILED;
+	}
+
+	rte_free(data);
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test a length with both alignments
+ */
+static int
+test_cksum_fuzz_length(size_t len, uint32_t initial_sum)
+{
+	int rc;
+
+	/* Test aligned */
+	rc = test_cksum_fuzz_length_aligned(len, true, initial_sum);
+	if (rc != TEST_SUCCESS)
+		return rc;
+
+	/* Test unaligned */
+	rc = test_cksum_fuzz_length_aligned(len, false, initial_sum);
+
+	return rc;
+}
+
+/*
+ * Test specific edge case lengths
+ */
+static int
+test_cksum_fuzz_edge_cases(void)
+{
+	/* Edge case lengths that might trigger bugs */
+	static const size_t edge_lengths[] = {
+		0, 1, 2, 3, 4, 5, 6, 7, 8,
+		15, 16, 17,
+		31, 32, 33,
+		63, 64, 65,
+		127, 128, 129,
+		255, 256, 257,
+		511, 512, 513,
+		1023, 1024, 1025,
+		1500, 1501,  /* MTU boundaries */
+		2047, 2048, 2049,
+		4095, 4096, 4097,
+		8191, 8192, 8193,
+		16383, 16384, 16385,
+		32767, 32768, 32769,
+		65534, 65535, 65536  /* 64K GRO boundaries */
+	};
+	unsigned int i;
+	int rc;
+
+	printf("Testing edge case lengths...\n");
+
+	for (i = 0; i < RTE_DIM(edge_lengths); i++) {
+		/* Test with zero initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], 0);
+		if (rc != TEST_SUCCESS)
+			return rc;
+
+		/* Test with random initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], get_initial_sum(true));
+		if (rc != TEST_SUCCESS)
+			return rc;
+	}
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test random lengths with optional random initial sums
+ */
+static int
+test_cksum_fuzz_random(unsigned int iterations, bool random_initial_sum)
+{
+	unsigned int i;
+	int rc;
+
+	printf("Testing random lengths (0-%d)%s...\n", MAX_TEST_LEN,
+	       random_initial_sum ? " with random initial sums" : "");
+
+	for (i = 0; i < iterations; i++) {
+		size_t len = rte_rand() % (MAX_TEST_LEN + 1);
+
+		rc = test_cksum_fuzz_length(len, get_initial_sum(random_initial_sum));
+		if (rc != TEST_SUCCESS) {
+			printf("Failed at len=%zu\n", len);
+			return rc;
+		}
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cksum_fuzz(void)
+{
+	int rc;
+	unsigned int iterations = DEFAULT_ITERATIONS;
+	printf("### __rte_raw_cksum optimization fuzz test ###\n");
+	printf("Iterations per test: %u\n\n", iterations);
+
+	/* Test edge cases */
+	rc = test_cksum_fuzz_edge_cases();
+	if (rc != TEST_SUCCESS) {
+		printf("Edge case test FAILED\n");
+		return rc;
+	}
+	printf("Edge case test PASSED\n\n");
+
+	/* Test random lengths with zero initial sum */
+	rc = test_cksum_fuzz_random(iterations, false);
+	if (rc != TEST_SUCCESS) {
+		printf("Random length test FAILED\n");
+		return rc;
+	}
+	printf("Random length test PASSED\n\n");
+
+	/* Test random lengths with random initial sums */
+	rc = test_cksum_fuzz_random(iterations, true);
+	if (rc != TEST_SUCCESS) {
+		printf("Random initial sum test FAILED\n");
+		return rc;
+	}
+	printf("Random initial sum test PASSED\n\n");
+
+	printf("All fuzz tests PASSED!\n");
+	return TEST_SUCCESS;
+}
+
+REGISTER_FAST_TEST(cksum_fuzz_autotest, NOHUGE_OK, ASAN_OK, test_cksum_fuzz);
diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
index 0b919cd59f..6b1d4589e0 100644
--- a/app/test/test_cksum_perf.c
+++ b/app/test/test_cksum_perf.c
@@ -15,7 +15,7 @@
 #define NUM_BLOCKS 10
 #define ITERATIONS 1000000
 
-static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
+static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501, 9000, 9001, 65536, 65537 };
 
 static __rte_noinline uint16_t
 do_rte_raw_cksum(const void *buf, size_t len)
diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
index a8e8927952..f04b46a6c3 100644
--- a/lib/net/rte_cksum.h
+++ b/lib/net/rte_cksum.h
@@ -42,15 +42,11 @@ extern "C" {
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-	const void *end;
-
-	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
-	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
-		uint16_t v;
-
-		memcpy(&v, buf, sizeof(uint16_t));
-		sum += v;
-	}
+	/* Process uint16 chunks to preserve overflow/carry math. GCC/Clang vectorize the loop. */
+	const unaligned_uint16_t *buf16 = (const unaligned_uint16_t *)buf;
+	const unaligned_uint16_t *end = buf16 + (len / sizeof(*buf16));
+	for (; buf16 != end; buf16++)
+		sum += *buf16;
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
-- 
2.39.5 (Apple Git-154)



* RE: [PATCH v16 0/2] net: optimize __rte_raw_cksum
  2026-01-23 16:02   ` [PATCH v16 " scott.k.mitch1
  2026-01-23 16:02     ` [PATCH v16 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
  2026-01-23 16:02     ` [PATCH v16 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
@ 2026-01-24  8:23     ` Morten Brørup
  2026-01-28 18:05     ` [PATCH v17 " scott.k.mitch1
  3 siblings, 0 replies; 39+ messages in thread
From: Morten Brørup @ 2026-01-24  8:23 UTC (permalink / raw)
  To: scott.k.mitch1, dev; +Cc: stephen, Scott

> From: Scott <scott_mitchell@apple.com>
> 
> This series optimizes __rte_raw_cksum by replacing memcpy with direct
> pointer access, enabling compiler vectorization on both GCC and Clang.
> 
> Patch 1 adds __rte_may_alias to unaligned typedefs to prevent a GCC
> strict-aliasing bug where struct initialization is incorrectly elided.
> 
> Patch 2 uses the improved unaligned_uint16_t type in __rte_raw_cksum
> to enable compiler optimizations while maintaining correctness across
> all architectures (including strict-alignment platforms).
> 
> Performance results show significant improvements (40% for small buffers,
> up to 8x for larger buffers) on Intel Xeon with Clang 18.1.

It's usually allowed to carry forward ACKs from previous versions.
With major changes between versions, the author should consider whether previous ACKs can remain or not.

Carrying forward from v15 of the series,
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v16 2/2] net: __rte_raw_cksum pointers enable compiler optimizations
  2026-01-23 16:02     ` [PATCH v16 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
@ 2026-01-28 11:05       ` David Marchand
  2026-01-28 17:39         ` Scott Mitchell
  0 siblings, 1 reply; 39+ messages in thread
From: David Marchand @ 2026-01-28 11:05 UTC (permalink / raw)
  To: scott.k.mitch1; +Cc: dev, mb, stephen

Hello Scott,

On Fri, 23 Jan 2026 at 17:03, <scott.k.mitch1@gmail.com> wrote:
>
> From: Scott Mitchell <scott.k.mitch1@gmail.com>
>
> __rte_raw_cksum uses a loop with memcpy on each iteration.
> GCC 15+ is able to vectorize the loop but Clang 18.1 is not.
>
> Replace memcpy with direct pointer access using unaligned_uint16_t.
> This enables both GCC and Clang to vectorize the loop while handling
> unaligned access safely on all architectures.
>
> Performance results from cksum_perf_autotest on Intel Xeon
> (Cascade Lake, AVX-512) built with Clang 18.1 (TSC cycles/byte):
>
>   Block size    Before    After    Improvement
>          100      0.40     0.24        ~40%
>         1500      0.50     0.06        ~8x
>         9000      0.49     0.06        ~8x
>
> Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>

Unfortunately, clang 14 (Ubuntu 22.04) is complaining about unaligned
access in the new test.
Could you have a look?

RTE>>cksum_fuzz_autotest
../lib/net/rte_cksum.h:49:10: runtime error: load of misaligned
address 0x0001816c2e81 for type 'const unaligned_uint16_t' (aka 'const
unsigned short'), which requires 2 byte alignment
0x0001816c2e81: note: pointer points here
 00 00 00  00 70 f2 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00
00 00 00 00 00  00 00 00 00 00
              ^
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
../lib/net/rte_cksum.h:49:10 in

The whole backtrace is as follows:

RTE>>cksum_fuzz_autotest
../lib/net/rte_cksum.h:49:10: runtime error: load of misaligned
address 0x0001816c2e81 for type 'const unaligned_uint16_t' (aka 'const
unsigned short'), which requires 2 byte alignment
0x0001816c2e81: note: pointer points here
 00 00 00  00 0e ce 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00
00 00 00 00 00  00 00 00 00 00
              ^
    #0 0x55a725ec25e7 in __rte_raw_cksum test_cksum_fuzz.c
    #1 0x55a725ec21ce in test_cksum_fuzz_length_aligned test_cksum_fuzz.c
    #2 0x55a725ec1f65 in test_cksum_fuzz_length test_cksum_fuzz.c
    #3 0x55a725ec1c8f in test_cksum_fuzz_edge_cases test_cksum_fuzz.c
    #4 0x55a725ec1ab2 in test_cksum_fuzz test_cksum_fuzz.c
    #5 0x55a725ceece9 in cmd_autotest_parsed commands.c
    #6 0x7fdb96d7e668 in __cmdline_parse cmdline_parse.c
    #7 0x7fdb96d7dcb1 in cmdline_parse
(/home/runner/work/dpdk/dpdk/build/app/../lib/librte_cmdline.so.26+0x1bcb1)
(BuildId: bcf9387da4939ba68c89cec1938166c878fca318)
    #8 0x7fdb96d74b69 in cmdline_valid_buffer cmdline.c
    #9 0x7fdb96d8b9c3 in rdline_char_in
(/home/runner/work/dpdk/dpdk/build/app/../lib/librte_cmdline.so.26+0x299c3)
(BuildId: bcf9387da4939ba68c89cec1938166c878fca318)
    #10 0x7fdb96d752d3 in cmdline_in
(/home/runner/work/dpdk/dpdk/build/app/../lib/librte_cmdline.so.26+0x132d3)
(BuildId: bcf9387da4939ba68c89cec1938166c878fca318)
    #11 0x55a725cf0f0b in main
(/home/runner/work/dpdk/dpdk/build/app/dpdk-test+0x4ddf0b) (BuildId:
5905b821f00329f9c5b95c7064ea051d7aacac48)
    #12 0x7fdb94629d8f in __libc_start_call_main
csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #13 0x7fdb94629e3f in __libc_start_main csu/../csu/libc-start.c:392:3
    #14 0x55a725cc5ed4 in _start
(/home/runner/work/dpdk/dpdk/build/app/dpdk-test+0x4b2ed4) (BuildId:
5905b821f00329f9c5b95c7064ea051d7aacac48)

[snip]

> diff --git a/app/test/test_cksum_fuzz.c b/app/test/test_cksum_fuzz.c
> new file mode 100644
> index 0000000000..3df11e3dc2
> --- /dev/null
> +++ b/app/test/test_cksum_fuzz.c
> @@ -0,0 +1,240 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2026 Apple Inc.
> + */
> +
> +#include <stdio.h>
> +#include <string.h>
> +
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_hexdump.h>
> +#include <rte_cksum.h>
> +#include <rte_malloc.h>
> +#include <rte_random.h>
> +
> +#include "test.h"
> +
> +/*
> + * Fuzz test for __rte_raw_cksum optimization.
> + * Compares the optimized implementation against the original reference
> + * implementation across random data of various lengths.
> + */
> +
> +#define DEFAULT_ITERATIONS 1000
> +#define MAX_TEST_LEN 65536  /* 64K to match GRO frame sizes */
> +
> +/*
> + * Original (reference) implementation of __rte_raw_cksum from DPDK v23.11.
> + * This is retained here for comparison testing against the optimized version.
> + */
> +static inline uint32_t
> +__rte_raw_cksum_reference(const void *buf, size_t len, uint32_t sum)
> +{

Just a nit: I'd prefer we not declare test functions with the same
prefix as a public DPDK API. It is confusing when reading the test
code.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v16 2/2] net: __rte_raw_cksum pointers enable compiler optimizations
  2026-01-28 11:05       ` David Marchand
@ 2026-01-28 17:39         ` Scott Mitchell
  0 siblings, 0 replies; 39+ messages in thread
From: Scott Mitchell @ 2026-01-28 17:39 UTC (permalink / raw)
  To: David Marchand; +Cc: dev, mb, stephen

> Unfortunately, clang 14 (Ubuntu 22.04) is complaining about unaligned
> access in the new test.
> Could you have a look?

Yes, thanks for flagging this. I think the unaligned types need both
`__rte_may_alias` and `__rte_aligned(1)` unconditionally, and I will
submit a v17. I verified that the asm generated by clang/gcc on x86
(-mavx512cd) and armv8 (-msve-vector-bits=512) is identical when
adding `__rte_aligned(1)`:
https://godbolt.org/z/fdYPdoTa5

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v17 0/2] net: optimize __rte_raw_cksum
  2026-01-23 16:02   ` [PATCH v16 " scott.k.mitch1
                       ` (2 preceding siblings ...)
  2026-01-24  8:23     ` [PATCH v16 0/2] net: optimize __rte_raw_cksum Morten Brørup
@ 2026-01-28 18:05     ` scott.k.mitch1
  2026-01-28 18:05       ` [PATCH v17 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
                         ` (2 more replies)
  3 siblings, 3 replies; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-28 18:05 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, bruce.richardson, david.marchand, Scott

From: Scott <scott.k.mitch1@gmail.com>

This series optimizes __rte_raw_cksum by replacing memcpy with direct
pointer access, enabling compiler vectorization on both GCC and Clang.

Patch 1 adds __rte_may_alias and __rte_aligned(1) to unaligned typedefs
to prevent a GCC strict-aliasing bug where struct initialization is
incorrectly elided, and to avoid undefined behavior by making accesses
from any address well defined.

Patch 2 uses the improved unaligned_uint16_t type in __rte_raw_cksum
to enable compiler optimizations while maintaining correctness across
all architectures (including strict-alignment platforms).

Performance results show significant improvements (40% for small buffers,
up to 8x for larger buffers) on Intel Xeon with Clang 18.1.

Changes in v17:
- Use __rte_aligned(1) unconditionally on unaligned type aliases
- test_cksum_fuzz uses unit_test_suite_runner
- test_cksum_fuzz reference method renamed to test_cksum_fuzz_cksum_reference

Changes in v16:
- Add Fixes tag and Cc stable/author for backporting (patch 1)

Changes in v15:
- Use NOHUGE_OK and ASAN_OK constants in REGISTER_FAST_TEST

Changes in v14:
- Split into two patches: EAL typedef fix and checksum optimization
- Use unaligned_uint16_t directly instead of wrapper struct
- Added __rte_may_alias to unaligned typedefs to prevent GCC bug

Scott Mitchell (2):
  eal: add __rte_may_alias and __rte_aligned to unaligned typedefs
  net: __rte_raw_cksum pointers enable compiler optimizations

 app/test/meson.build         |   1 +
 app/test/test_cksum_fuzz.c   | 234 +++++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c   |   2 +-
 lib/eal/include/rte_common.h |  39 +++---
 lib/net/rte_cksum.h          |  14 +--
 5 files changed, 264 insertions(+), 26 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

--
2.39.5 (Apple Git-154)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v17 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs
  2026-01-28 18:05     ` [PATCH v17 " scott.k.mitch1
@ 2026-01-28 18:05       ` scott.k.mitch1
  2026-01-28 18:05       ` [PATCH v17 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
  2026-01-28 19:41       ` [PATCH v18 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2 siblings, 0 replies; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-28 18:05 UTC (permalink / raw)
  To: dev
  Cc: mb, stephen, bruce.richardson, david.marchand, Scott Mitchell,
	Cyril Chemparathy, stable

From: Scott Mitchell <scott.k.mitch1@gmail.com>

Add __rte_may_alias attribute to unaligned_uint{16,32,64}_t typedefs
to prevent GCC strict-aliasing optimization bugs. GCC has a bug where
it incorrectly elides struct initialization when strict aliasing is
enabled, causing reads from uninitialized memory.

Add __rte_aligned(1) attribute to unaligned_uint{16,32,64}_t typedefs,
which allows safe access at any alignment. Without it, accessing a
uint16_t at an odd address is undefined behavior, which UBSan reports
as `UndefinedBehaviorSanitizer: undefined-behavior`.

Fixes: 7621d6a8d0bd ("eal: add and use unaligned integer types")
Cc: Cyril Chemparathy <cchemparathy@ezchip.com>
Cc: stable@dpdk.org
Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 lib/eal/include/rte_common.h | 39 +++++++++++++++++++++---------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/lib/eal/include/rte_common.h b/lib/eal/include/rte_common.h
index 573bf4f2ce..15d379619a 100644
--- a/lib/eal/include/rte_common.h
+++ b/lib/eal/include/rte_common.h
@@ -121,16 +121,32 @@ extern "C" {
 #define __rte_aligned(a) __attribute__((__aligned__(a)))
 #endif
 
-#ifdef RTE_ARCH_STRICT_ALIGN
-typedef uint64_t unaligned_uint64_t __rte_aligned(1);
-typedef uint32_t unaligned_uint32_t __rte_aligned(1);
-typedef uint16_t unaligned_uint16_t __rte_aligned(1);
+/**
+ * Macro to mark a type that is not subject to type-based aliasing rules
+ */
+#ifdef RTE_TOOLCHAIN_MSVC
+#define __rte_may_alias
 #else
-typedef uint64_t unaligned_uint64_t;
-typedef uint32_t unaligned_uint32_t;
-typedef uint16_t unaligned_uint16_t;
+#define __rte_may_alias __attribute__((__may_alias__))
 #endif
 
+/**
+ * Types for potentially unaligned access.
+ *
+ * __rte_aligned(1) - Reduces alignment requirement to 1 byte, allowing
+ *                    these types to safely access memory at any address.
+ *                    Without this, accessing a uint16_t at an odd address
+ *                    is undefined behavior (even on x86 where hardware
+ *                    handles it).
+ *
+ * __rte_may_alias  - Prevents strict-aliasing optimization bugs where
+ *                    compilers may incorrectly elide memory operations
+ *                    when casting between pointer types.
+ */
+typedef uint64_t unaligned_uint64_t __rte_may_alias __rte_aligned(1);
+typedef uint32_t unaligned_uint32_t __rte_may_alias __rte_aligned(1);
+typedef uint16_t unaligned_uint16_t __rte_may_alias __rte_aligned(1);
+
 /**
  * @deprecated
  * @see __rte_packed_begin
@@ -159,15 +175,6 @@ typedef uint16_t unaligned_uint16_t;
 #define __rte_packed_end __attribute__((__packed__))
 #endif
 
-/**
- * Macro to mark a type that is not subject to type-based aliasing rules
- */
-#ifdef RTE_TOOLCHAIN_MSVC
-#define __rte_may_alias
-#else
-#define __rte_may_alias __attribute__((__may_alias__))
-#endif
-
 /******* Macro to mark functions and fields scheduled for removal *****/
 #ifdef RTE_TOOLCHAIN_MSVC
 #define __rte_deprecated
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v17 2/2] net: __rte_raw_cksum pointers enable compiler optimizations
  2026-01-28 18:05     ` [PATCH v17 " scott.k.mitch1
  2026-01-28 18:05       ` [PATCH v17 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
@ 2026-01-28 18:05       ` scott.k.mitch1
  2026-01-28 19:41       ` [PATCH v18 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2 siblings, 0 replies; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-28 18:05 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, bruce.richardson, david.marchand, Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

__rte_raw_cksum uses a loop with memcpy on each iteration.
GCC 15+ is able to vectorize the loop but Clang 18.1 is not.

Replace memcpy with direct pointer access using unaligned_uint16_t.
This enables both GCC and Clang to vectorize the loop while handling
unaligned access safely on all architectures.

Performance results from cksum_perf_autotest on Intel Xeon
(Cascade Lake, AVX-512) built with Clang 18.1 (TSC cycles/byte):

  Block size    Before    After    Improvement
         100      0.40     0.24        ~40%
        1500      0.50     0.06        ~8x
        9000      0.49     0.06        ~8x

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 app/test/meson.build       |   1 +
 app/test/test_cksum_fuzz.c | 234 +++++++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c |   2 +-
 lib/net/rte_cksum.h        |  14 +--
 4 files changed, 241 insertions(+), 10 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

diff --git a/app/test/meson.build b/app/test/meson.build
index f4d04a6e42..2ca17716b9 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -38,6 +38,7 @@ source_file_deps = {
     'test_byteorder.c': [],
     'test_cfgfile.c': ['cfgfile'],
     'test_cksum.c': ['net'],
+    'test_cksum_fuzz.c': ['net'],
     'test_cksum_perf.c': ['net'],
     'test_cmdline.c': [],
     'test_cmdline_cirbuf.c': [],
diff --git a/app/test/test_cksum_fuzz.c b/app/test/test_cksum_fuzz.c
new file mode 100644
index 0000000000..33b4c77f51
--- /dev/null
+++ b/app/test/test_cksum_fuzz.c
@@ -0,0 +1,234 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Apple Inc.
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_hexdump.h>
+#include <rte_cksum.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+/*
+ * Fuzz test for __rte_raw_cksum optimization.
+ * Compares the optimized implementation against the original reference
+ * implementation across random data of various lengths.
+ */
+
+#define DEFAULT_ITERATIONS 1000
+#define MAX_TEST_LEN 65536  /* 64K to match GRO frame sizes */
+
+/*
+ * Original (reference) implementation of __rte_raw_cksum from DPDK v23.11.
+ * This is retained here for comparison testing against the optimized version.
+ */
+static inline uint32_t
+test_cksum_fuzz_cksum_reference(const void *buf, size_t len, uint32_t sum)
+{
+	const void *end;
+
+	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
+	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
+		uint16_t v;
+
+		memcpy(&v, buf, sizeof(uint16_t));
+		sum += v;
+	}
+
+	/* if length is odd, keeping it byte order independent */
+	if (unlikely(len % 2)) {
+		uint16_t left = 0;
+
+		memcpy(&left, end, 1);
+		sum += left;
+	}
+
+	return sum;
+}
+
+static void
+init_random_buffer(uint8_t *buf, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		buf[i] = (uint8_t)rte_rand();
+}
+
+static inline uint32_t
+get_initial_sum(bool random_initial_sum)
+{
+	return random_initial_sum ? (rte_rand() & 0xFFFFFFFF) : 0;
+}
+
+/*
+ * Test a single buffer length with specific alignment and initial sum
+ */
+static int
+test_cksum_fuzz_length_aligned(size_t len, bool aligned, uint32_t initial_sum)
+{
+	uint8_t *data;
+	uint8_t *buf;
+	size_t alloc_size;
+	uint32_t sum_ref, sum_opt;
+
+	if (len == 0 && !aligned) {
+		/* Skip unaligned test for zero length - nothing to test */
+		return TEST_SUCCESS;
+	}
+
+	/* Allocate exact size for aligned, +1 for unaligned offset */
+	alloc_size = aligned ? len : len + 1;
+	if (alloc_size == 0)
+		alloc_size = 1;  /* rte_malloc doesn't like 0 */
+
+	data = rte_malloc(NULL, alloc_size, 64);
+	if (data == NULL) {
+		printf("Failed to allocate %zu bytes\n", alloc_size);
+		return TEST_FAILED;
+	}
+
+	buf = aligned ? data : (data + 1);
+
+	init_random_buffer(buf, len);
+
+	sum_ref = test_cksum_fuzz_cksum_reference(buf, len, initial_sum);
+	sum_opt = __rte_raw_cksum(buf, len, initial_sum);
+
+	if (sum_ref != sum_opt) {
+		printf("MISMATCH at len=%zu aligned='%s' initial_sum=0x%08x ref=0x%08x opt=0x%08x\n",
+		       len, aligned ? "aligned" : "unaligned",
+		       initial_sum, sum_ref, sum_opt);
+		rte_hexdump(stdout, "failing buffer", buf, len);
+		rte_free(data);
+		return TEST_FAILED;
+	}
+
+	rte_free(data);
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test a length with both alignments
+ */
+static int
+test_cksum_fuzz_length(size_t len, uint32_t initial_sum)
+{
+	int rc;
+
+	/* Test aligned */
+	rc = test_cksum_fuzz_length_aligned(len, true, initial_sum);
+	if (rc != TEST_SUCCESS)
+		return rc;
+
+	/* Test unaligned */
+	rc = test_cksum_fuzz_length_aligned(len, false, initial_sum);
+
+	return rc;
+}
+
+/*
+ * Test specific edge case lengths
+ */
+static int
+test_cksum_fuzz_edge_cases(void)
+{
+	/* Edge case lengths that might trigger bugs */
+	static const size_t edge_lengths[] = {
+		0, 1, 2, 3, 4, 5, 6, 7, 8,
+		15, 16, 17,
+		31, 32, 33,
+		63, 64, 65,
+		127, 128, 129,
+		255, 256, 257,
+		511, 512, 513,
+		1023, 1024, 1025,
+		1500, 1501,  /* MTU boundaries */
+		2047, 2048, 2049,
+		4095, 4096, 4097,
+		8191, 8192, 8193,
+		16383, 16384, 16385,
+		32767, 32768, 32769,
+		65534, 65535, 65536  /* 64K GRO boundaries */
+	};
+	unsigned int i;
+	int rc;
+
+	printf("Testing edge case lengths...\n");
+
+	for (i = 0; i < RTE_DIM(edge_lengths); i++) {
+		/* Test with zero initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], 0);
+		if (rc != TEST_SUCCESS)
+			return rc;
+
+		/* Test with random initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], get_initial_sum(true));
+		if (rc != TEST_SUCCESS)
+			return rc;
+	}
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test random lengths with optional random initial sums
+ */
+static int
+test_cksum_fuzz_random(unsigned int iterations, bool random_initial_sum)
+{
+	unsigned int i;
+	int rc;
+
+	printf("Testing random lengths (0-%d)%s...\n", MAX_TEST_LEN,
+	       random_initial_sum ? " with random initial sums" : "");
+
+	for (i = 0; i < iterations; i++) {
+		size_t len = rte_rand() % (MAX_TEST_LEN + 1);
+
+		rc = test_cksum_fuzz_length(len, get_initial_sum(random_initial_sum));
+		if (rc != TEST_SUCCESS) {
+			printf("Failed at len=%zu\n", len);
+			return rc;
+		}
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cksum_fuzz_random_zero_sum(void)
+{
+	return test_cksum_fuzz_random(DEFAULT_ITERATIONS, false);
+}
+
+static int
+test_cksum_fuzz_random_random_sum(void)
+{
+	return test_cksum_fuzz_random(DEFAULT_ITERATIONS, true);
+}
+
+static struct unit_test_suite ptr_cksum_fuzz_suite = {
+	.suite_name = "cksum fuzz autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_cksum_fuzz_edge_cases),
+		TEST_CASE(test_cksum_fuzz_random_zero_sum),
+		TEST_CASE(test_cksum_fuzz_random_random_sum),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_cksum_fuzz_suite(void)
+{
+	return unit_test_suite_runner(&ptr_cksum_fuzz_suite);
+}
+
+REGISTER_FAST_TEST(cksum_fuzz_autotest, NOHUGE_OK, ASAN_OK, test_cksum_fuzz_suite);
diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
index 0b919cd59f..6b1d4589e0 100644
--- a/app/test/test_cksum_perf.c
+++ b/app/test/test_cksum_perf.c
@@ -15,7 +15,7 @@
 #define NUM_BLOCKS 10
 #define ITERATIONS 1000000
 
-static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
+static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501, 9000, 9001, 65536, 65537 };
 
 static __rte_noinline uint16_t
 do_rte_raw_cksum(const void *buf, size_t len)
diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
index a8e8927952..f04b46a6c3 100644
--- a/lib/net/rte_cksum.h
+++ b/lib/net/rte_cksum.h
@@ -42,15 +42,11 @@ extern "C" {
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-	const void *end;
-
-	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
-	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
-		uint16_t v;
-
-		memcpy(&v, buf, sizeof(uint16_t));
-		sum += v;
-	}
+	/* Process uint16 chunks to preserve overflow/carry math. GCC/Clang vectorize the loop. */
+	const unaligned_uint16_t *buf16 = (const unaligned_uint16_t *)buf;
+	const unaligned_uint16_t *end = buf16 + (len / sizeof(*buf16));
+	for (; buf16 != end; buf16++)
+		sum += *buf16;
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v18 0/2] net: optimize __rte_raw_cksum
  2026-01-28 18:05     ` [PATCH v17 " scott.k.mitch1
  2026-01-28 18:05       ` [PATCH v17 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
  2026-01-28 18:05       ` [PATCH v17 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
@ 2026-01-28 19:41       ` scott.k.mitch1
  2026-01-28 19:41         ` [PATCH v18 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
                           ` (2 more replies)
  2 siblings, 3 replies; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-28 19:41 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, bruce.richardson, david.marchand, Scott

From: Scott <scott.k.mitch1@gmail.com>

This series optimizes __rte_raw_cksum by replacing memcpy with direct
pointer access, enabling compiler vectorization on both GCC and Clang.

Patch 1 adds __rte_may_alias and __rte_aligned(1) to unaligned typedefs
to prevent a GCC strict-aliasing bug where struct initialization is
incorrectly elided, and to avoid undefined behavior by making accesses
from any address well defined.

Patch 2 uses the improved unaligned_uint16_t type in __rte_raw_cksum
to enable compiler optimizations while maintaining correctness across
all architectures (including strict-alignment platforms).

Performance results show significant improvements (40% for small buffers,
up to 8x for larger buffers) on Intel Xeon with Clang 18.1.

Changes in v18:
- Fix MSVC compile error: __rte_aligned(1) must come before the type
- Fix test_hash_functions incorrect usage of unaligned_uint32_t

Changes in v17:
- Use __rte_aligned(1) unconditionally on unaligned type aliases
- test_cksum_fuzz uses unit_test_suite_runner
- test_cksum_fuzz reference method renamed to test_cksum_fuzz_cksum_reference

Changes in v16:
- Add Fixes tag and Cc stable/author for backporting (patch 1)

Changes in v15:
- Use NOHUGE_OK and ASAN_OK constants in REGISTER_FAST_TEST

Changes in v14:
- Split into two patches: EAL typedef fix and checksum optimization
- Use unaligned_uint16_t directly instead of wrapper struct
- Added __rte_may_alias to unaligned typedefs to prevent GCC bug

Scott Mitchell (2):
  eal: add __rte_may_alias and __rte_aligned to unaligned typedefs
  net: __rte_raw_cksum pointers enable compiler optimizations

 app/test/meson.build           |   1 +
 app/test/test_cksum_fuzz.c     | 234 +++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c     |   2 +-
 app/test/test_hash_functions.c |   2 +-
 lib/eal/include/rte_common.h   |  45 ++++---
 lib/net/rte_cksum.h            |  14 +-
 6 files changed, 271 insertions(+), 27 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

--
2.39.5 (Apple Git-154)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v18 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs
  2026-01-28 19:41       ` [PATCH v18 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
@ 2026-01-28 19:41         ` scott.k.mitch1
  2026-01-29  8:28           ` Morten Brørup
  2026-01-28 19:41         ` [PATCH v18 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
  2026-02-02  4:48         ` [PATCH v19 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2 siblings, 1 reply; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-28 19:41 UTC (permalink / raw)
  To: dev
  Cc: mb, stephen, bruce.richardson, david.marchand, Scott Mitchell,
	Cyril Chemparathy, stable

From: Scott Mitchell <scott.k.mitch1@gmail.com>

Add __rte_may_alias attribute to unaligned_uint{16,32,64}_t typedefs
to prevent GCC strict-aliasing optimization bugs. GCC has a bug where
it incorrectly elides struct initialization when strict aliasing is
enabled, causing reads from uninitialized memory.

Add __rte_aligned(1) attribute to unaligned_uint{16,32,64}_t typedefs,
which allows safe access at any alignment. Without it, accessing a
uint16_t at an odd address is undefined behavior, which UBSan reports
as `UndefinedBehaviorSanitizer: undefined-behavior`.

Fixes: 7621d6a8d0bd ("eal: add and use unaligned integer types")
Cc: Cyril Chemparathy <cchemparathy@ezchip.com>
Cc: stable@dpdk.org
Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 app/test/test_hash_functions.c |  2 +-
 lib/eal/include/rte_common.h   | 45 ++++++++++++++++++++++------------
 2 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/app/test/test_hash_functions.c b/app/test/test_hash_functions.c
index 70820d1f19..5b8b9c3e5d 100644
--- a/app/test/test_hash_functions.c
+++ b/app/test/test_hash_functions.c
@@ -199,7 +199,7 @@ verify_jhash_32bits(void)
 				hash = rte_jhash(key, hashtest_key_lens[i],
 						hashtest_initvals[j]);
 				/* Divide key length by 4 in rte_jhash for 32 bits */
-				hash32 = rte_jhash_32b((const unaligned_uint32_t *)key,
+				hash32 = rte_jhash_32b((const uint32_t *)key,
 						hashtest_key_lens[i] >> 2,
 						hashtest_initvals[j]);
 				if (hash != hash32) {
diff --git a/lib/eal/include/rte_common.h b/lib/eal/include/rte_common.h
index 573bf4f2ce..b10816d0d7 100644
--- a/lib/eal/include/rte_common.h
+++ b/lib/eal/include/rte_common.h
@@ -121,14 +121,36 @@ extern "C" {
 #define __rte_aligned(a) __attribute__((__aligned__(a)))
 #endif
 
-#ifdef RTE_ARCH_STRICT_ALIGN
-typedef uint64_t unaligned_uint64_t __rte_aligned(1);
-typedef uint32_t unaligned_uint32_t __rte_aligned(1);
-typedef uint16_t unaligned_uint16_t __rte_aligned(1);
+/**
+ * Macro to mark a type that is not subject to type-based aliasing rules
+ */
+#ifdef RTE_TOOLCHAIN_MSVC
+#define __rte_may_alias
 #else
-typedef uint64_t unaligned_uint64_t;
-typedef uint32_t unaligned_uint32_t;
-typedef uint16_t unaligned_uint16_t;
+#define __rte_may_alias __attribute__((__may_alias__))
+#endif
+
+/**
+ * Types for potentially unaligned access.
+ *
+ * __rte_aligned(1) - Reduces alignment requirement to 1 byte, allowing
+ *                    these types to safely access memory at any address.
+ *                    Without this, accessing a uint16_t at an odd address
+ *                    is undefined behavior (even on x86 where hardware
+ *                    handles it).
+ *
+ * __rte_may_alias  - Prevents strict-aliasing optimization bugs where
+ *                    compilers may incorrectly elide memory operations
+ *                    when casting between pointer types.
+ */
+#ifdef RTE_TOOLCHAIN_MSVC
+typedef __rte_may_alias __rte_aligned(1) uint64_t unaligned_uint64_t;
+typedef __rte_may_alias __rte_aligned(1) uint32_t unaligned_uint32_t;
+typedef __rte_may_alias __rte_aligned(1) uint16_t unaligned_uint16_t;
+#else
+typedef uint64_t unaligned_uint64_t __rte_may_alias __rte_aligned(1);
+typedef uint32_t unaligned_uint32_t __rte_may_alias __rte_aligned(1);
+typedef uint16_t unaligned_uint16_t __rte_may_alias __rte_aligned(1);
 #endif
 
 /**
@@ -159,15 +181,6 @@ typedef uint16_t unaligned_uint16_t;
 #define __rte_packed_end __attribute__((__packed__))
 #endif
 
-/**
- * Macro to mark a type that is not subject to type-based aliasing rules
- */
-#ifdef RTE_TOOLCHAIN_MSVC
-#define __rte_may_alias
-#else
-#define __rte_may_alias __attribute__((__may_alias__))
-#endif
-
 /******* Macro to mark functions and fields scheduled for removal *****/
 #ifdef RTE_TOOLCHAIN_MSVC
 #define __rte_deprecated
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v18 2/2] net: __rte_raw_cksum pointers enable compiler optimizations
  2026-01-28 19:41       ` [PATCH v18 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2026-01-28 19:41         ` [PATCH v18 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
@ 2026-01-28 19:41         ` scott.k.mitch1
  2026-01-29  8:31           ` Morten Brørup
  2026-02-02  4:48         ` [PATCH v19 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2 siblings, 1 reply; 39+ messages in thread
From: scott.k.mitch1 @ 2026-01-28 19:41 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, bruce.richardson, david.marchand, Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

__rte_raw_cksum uses a loop with memcpy on each iteration.
GCC 15+ is able to vectorize the loop but Clang 18.1 is not.

Replace memcpy with direct pointer access using unaligned_uint16_t.
This enables both GCC and Clang to vectorize the loop while handling
unaligned access safely on all architectures.

Performance results from cksum_perf_autotest on Intel Xeon
(Cascade Lake, AVX-512) built with Clang 18.1 (TSC cycles/byte):

  Block size    Before    After    Improvement
         100      0.40     0.24        ~40%
        1500      0.50     0.06        ~8x
        9000      0.49     0.06        ~8x

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 app/test/meson.build       |   1 +
 app/test/test_cksum_fuzz.c | 234 +++++++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c |   2 +-
 lib/net/rte_cksum.h        |  14 +--
 4 files changed, 241 insertions(+), 10 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

diff --git a/app/test/meson.build b/app/test/meson.build
index f4d04a6e42..2ca17716b9 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -38,6 +38,7 @@ source_file_deps = {
     'test_byteorder.c': [],
     'test_cfgfile.c': ['cfgfile'],
     'test_cksum.c': ['net'],
+    'test_cksum_fuzz.c': ['net'],
     'test_cksum_perf.c': ['net'],
     'test_cmdline.c': [],
     'test_cmdline_cirbuf.c': [],
diff --git a/app/test/test_cksum_fuzz.c b/app/test/test_cksum_fuzz.c
new file mode 100644
index 0000000000..33b4c77f51
--- /dev/null
+++ b/app/test/test_cksum_fuzz.c
@@ -0,0 +1,234 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Apple Inc.
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_hexdump.h>
+#include <rte_cksum.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+/*
+ * Fuzz test for __rte_raw_cksum optimization.
+ * Compares the optimized implementation against the original reference
+ * implementation across random data of various lengths.
+ */
+
+#define DEFAULT_ITERATIONS 1000
+#define MAX_TEST_LEN 65536  /* 64K to match GRO frame sizes */
+
+/*
+ * Original (reference) implementation of __rte_raw_cksum from DPDK v23.11.
+ * This is retained here for comparison testing against the optimized version.
+ */
+static inline uint32_t
+test_cksum_fuzz_cksum_reference(const void *buf, size_t len, uint32_t sum)
+{
+	const void *end;
+
+	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
+	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
+		uint16_t v;
+
+		memcpy(&v, buf, sizeof(uint16_t));
+		sum += v;
+	}
+
+	/* if length is odd, keeping it byte order independent */
+	if (unlikely(len % 2)) {
+		uint16_t left = 0;
+
+		memcpy(&left, end, 1);
+		sum += left;
+	}
+
+	return sum;
+}
+
+static void
+init_random_buffer(uint8_t *buf, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		buf[i] = (uint8_t)rte_rand();
+}
+
+static inline uint32_t
+get_initial_sum(bool random_initial_sum)
+{
+	return random_initial_sum ? (rte_rand() & 0xFFFFFFFF) : 0;
+}
+
+/*
+ * Test a single buffer length with specific alignment and initial sum
+ */
+static int
+test_cksum_fuzz_length_aligned(size_t len, bool aligned, uint32_t initial_sum)
+{
+	uint8_t *data;
+	uint8_t *buf;
+	size_t alloc_size;
+	uint32_t sum_ref, sum_opt;
+
+	if (len == 0 && !aligned) {
+		/* Skip unaligned test for zero length - nothing to test */
+		return TEST_SUCCESS;
+	}
+
+	/* Allocate exact size for aligned, +1 for unaligned offset */
+	alloc_size = aligned ? len : len + 1;
+	if (alloc_size == 0)
+		alloc_size = 1;  /* rte_malloc doesn't like 0 */
+
+	data = rte_malloc(NULL, alloc_size, 64);
+	if (data == NULL) {
+		printf("Failed to allocate %zu bytes\n", alloc_size);
+		return TEST_FAILED;
+	}
+
+	buf = aligned ? data : (data + 1);
+
+	init_random_buffer(buf, len);
+
+	sum_ref = test_cksum_fuzz_cksum_reference(buf, len, initial_sum);
+	sum_opt = __rte_raw_cksum(buf, len, initial_sum);
+
+	if (sum_ref != sum_opt) {
+		printf("MISMATCH at len=%zu aligned='%s' initial_sum=0x%08x ref=0x%08x opt=0x%08x\n",
+		       len, aligned ? "aligned" : "unaligned",
+		       initial_sum, sum_ref, sum_opt);
+		rte_hexdump(stdout, "failing buffer", buf, len);
+		rte_free(data);
+		return TEST_FAILED;
+	}
+
+	rte_free(data);
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test a length with both alignments
+ */
+static int
+test_cksum_fuzz_length(size_t len, uint32_t initial_sum)
+{
+	int rc;
+
+	/* Test aligned */
+	rc = test_cksum_fuzz_length_aligned(len, true, initial_sum);
+	if (rc != TEST_SUCCESS)
+		return rc;
+
+	/* Test unaligned */
+	rc = test_cksum_fuzz_length_aligned(len, false, initial_sum);
+
+	return rc;
+}
+
+/*
+ * Test specific edge case lengths
+ */
+static int
+test_cksum_fuzz_edge_cases(void)
+{
+	/* Edge case lengths that might trigger bugs */
+	static const size_t edge_lengths[] = {
+		0, 1, 2, 3, 4, 5, 6, 7, 8,
+		15, 16, 17,
+		31, 32, 33,
+		63, 64, 65,
+		127, 128, 129,
+		255, 256, 257,
+		511, 512, 513,
+		1023, 1024, 1025,
+		1500, 1501,  /* MTU boundaries */
+		2047, 2048, 2049,
+		4095, 4096, 4097,
+		8191, 8192, 8193,
+		16383, 16384, 16385,
+		32767, 32768, 32769,
+		65534, 65535, 65536  /* 64K GRO boundaries */
+	};
+	unsigned int i;
+	int rc;
+
+	printf("Testing edge case lengths...\n");
+
+	for (i = 0; i < RTE_DIM(edge_lengths); i++) {
+		/* Test with zero initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], 0);
+		if (rc != TEST_SUCCESS)
+			return rc;
+
+		/* Test with random initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], get_initial_sum(true));
+		if (rc != TEST_SUCCESS)
+			return rc;
+	}
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test random lengths with optional random initial sums
+ */
+static int
+test_cksum_fuzz_random(unsigned int iterations, bool random_initial_sum)
+{
+	unsigned int i;
+	int rc;
+
+	printf("Testing random lengths (0-%d)%s...\n", MAX_TEST_LEN,
+	       random_initial_sum ? " with random initial sums" : "");
+
+	for (i = 0; i < iterations; i++) {
+		size_t len = rte_rand() % (MAX_TEST_LEN + 1);
+
+		rc = test_cksum_fuzz_length(len, get_initial_sum(random_initial_sum));
+		if (rc != TEST_SUCCESS) {
+			printf("Failed at len=%zu\n", len);
+			return rc;
+		}
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cksum_fuzz_random_zero_sum(void)
+{
+	return test_cksum_fuzz_random(DEFAULT_ITERATIONS, false);
+}
+
+static int
+test_cksum_fuzz_random_random_sum(void)
+{
+	return test_cksum_fuzz_random(DEFAULT_ITERATIONS, true);
+}
+
+static struct unit_test_suite ptr_cksum_fuzz_suite = {
+	.suite_name = "cksum fuzz autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_cksum_fuzz_edge_cases),
+		TEST_CASE(test_cksum_fuzz_random_zero_sum),
+		TEST_CASE(test_cksum_fuzz_random_random_sum),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_cksum_fuzz_suite(void)
+{
+	return unit_test_suite_runner(&ptr_cksum_fuzz_suite);
+}
+
+REGISTER_FAST_TEST(cksum_fuzz_autotest, NOHUGE_OK, ASAN_OK, test_cksum_fuzz_suite);
diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
index 0b919cd59f..6b1d4589e0 100644
--- a/app/test/test_cksum_perf.c
+++ b/app/test/test_cksum_perf.c
@@ -15,7 +15,7 @@
 #define NUM_BLOCKS 10
 #define ITERATIONS 1000000
 
-static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
+static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501, 9000, 9001, 65536, 65537 };
 
 static __rte_noinline uint16_t
 do_rte_raw_cksum(const void *buf, size_t len)
diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
index a8e8927952..f04b46a6c3 100644
--- a/lib/net/rte_cksum.h
+++ b/lib/net/rte_cksum.h
@@ -42,15 +42,11 @@ extern "C" {
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-	const void *end;
-
-	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
-	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
-		uint16_t v;
-
-		memcpy(&v, buf, sizeof(uint16_t));
-		sum += v;
-	}
+	/* Process uint16 chunks to preserve overflow/carry math. GCC/Clang vectorize the loop. */
+	const unaligned_uint16_t *buf16 = (const unaligned_uint16_t *)buf;
+	const unaligned_uint16_t *end = buf16 + (len / sizeof(*buf16));
+	for (; buf16 != end; buf16++)
+		sum += *buf16;
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* RE: [PATCH v18 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs
  2026-01-28 19:41         ` [PATCH v18 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
@ 2026-01-29  8:28           ` Morten Brørup
  2026-02-02  4:31             ` Scott Mitchell
  0 siblings, 1 reply; 39+ messages in thread
From: Morten Brørup @ 2026-01-29  8:28 UTC (permalink / raw)
  To: scott.k.mitch1, dev
  Cc: stephen, bruce.richardson, david.marchand, Cyril Chemparathy,
	stable

> @@ -199,7 +199,7 @@ verify_jhash_32bits(void)
>  				hash = rte_jhash(key, hashtest_key_lens[i],
>  						hashtest_initvals[j]);
>  				/* Divide key length by 4 in rte_jhash for 32
> bits */
> -				hash32 = rte_jhash_32b((const
> unaligned_uint32_t *)key,
> +				hash32 = rte_jhash_32b((const uint32_t *)key,
>  						hashtest_key_lens[i] >> 2,
>  						hashtest_initvals[j]);
>  				if (hash != hash32) {

rte_jhash_32b() correctly takes a pointer to (aligned) uint32_t, not unaligned, so casting to unaligned might be introducing a bug. (The automatically aligned allocation of the local "key" variable prevents this bug from occurring, but anyway.)
Instead of changing the type cast, I'd prefer fixing this as follows:
Add a local variable uint32_t key32[sizeof(key)/sizeof(uint32_t)], and memcpy(key32,key,sizeof(key)), and then call rte_jhash_32b(key32,...) without type casting.

> +/**
> + * Types for potentially unaligned access.
> + *
> + * __rte_aligned(1) - Reduces alignment requirement to 1 byte,
> allowing
> + *                    these types to safely access memory at any
> address.
> + *                    Without this, accessing a uint16_t at an odd
> address
> + *                    is undefined behavior (even on x86 where
> hardware
> + *                    handles it).
> + *
> + * __rte_may_alias  - Prevents strict-aliasing optimization bugs where
> + *                    compilers may incorrectly elide memory
> operations
> + *                    when casting between pointer types.
> + */
> +#ifdef RTE_TOOLCHAIN_MSVC
> +typedef __rte_may_alias __rte_aligned(1) uint64_t unaligned_uint64_t;
> +typedef __rte_may_alias __rte_aligned(1) uint32_t unaligned_uint32_t;
> +typedef __rte_may_alias __rte_aligned(1) uint16_t unaligned_uint16_t;
> +#else
> +typedef uint64_t unaligned_uint64_t __rte_may_alias __rte_aligned(1);
> +typedef uint32_t unaligned_uint32_t __rte_may_alias __rte_aligned(1);
> +typedef uint16_t unaligned_uint16_t __rte_may_alias __rte_aligned(1);
>  #endif

Skimming GCC documentation, it looks like older versions required placing such attributes after the type, but newer versions seem to recommend placing them before, like qualifiers (const, volatile, ...).
Placing them before the type, like qualifiers, seems more natural to me.
And apparently, MSVC requires it.
Does it work for GCC and Clang if they are placed before, like MSVC?
Then we can get rid of the #ifdef RTE_TOOLCHAIN_MSVC.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH v18 2/2] net: __rte_raw_cksum pointers enable compiler optimizations
  2026-01-28 19:41         ` [PATCH v18 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
@ 2026-01-29  8:31           ` Morten Brørup
  0 siblings, 0 replies; 39+ messages in thread
From: Morten Brørup @ 2026-01-29  8:31 UTC (permalink / raw)
  To: scott.k.mitch1, dev; +Cc: stephen, bruce.richardson, david.marchand

Acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v18 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs
  2026-01-29  8:28           ` Morten Brørup
@ 2026-02-02  4:31             ` Scott Mitchell
  0 siblings, 0 replies; 39+ messages in thread
From: Scott Mitchell @ 2026-02-02  4:31 UTC (permalink / raw)
  To: Morten Brørup
  Cc: dev, stephen, bruce.richardson, david.marchand, Cyril Chemparathy,
	stable

> > +                             hash32 = rte_jhash_32b((const uint32_t *)key,
> >                                               hashtest_key_lens[i] >> 2,
> >                                               hashtest_initvals[j]);
> >                               if (hash != hash32) {
>
> rte_jhash_32b() correctly takes a pointer to (aligned) uint32_t, not unaligned, so casting to unaligned might be introducing a bug. (The automatically aligned allocation of the local "key" variable prevents this bug from occurring, but anyway.)
> Instead of changing the type cast, I'd prefer fixing this as follows:
> Add a local variable uint32_t key32[sizeof(key)/sizeof(uint32_t)], and memcpy(key32,key,sizeof(key)), and then call rte_jhash_32b(key32,...) without type casting.

Sounds good, fix coming in v19.

> > +/**
> > + * Types for potentially unaligned access.
> > + *
> > + * __rte_aligned(1) - Reduces alignment requirement to 1 byte,
> > allowing
> > + *                    these types to safely access memory at any
> > address.
> > + *                    Without this, accessing a uint16_t at an odd
> > address
> > + *                    is undefined behavior (even on x86 where
> > hardware
> > + *                    handles it).
> > + *
> > + * __rte_may_alias  - Prevents strict-aliasing optimization bugs where
> > + *                    compilers may incorrectly elide memory
> > operations
> > + *                    when casting between pointer types.
> > + */
> > +#ifdef RTE_TOOLCHAIN_MSVC
> > +typedef __rte_may_alias __rte_aligned(1) uint64_t unaligned_uint64_t;
> > +typedef __rte_may_alias __rte_aligned(1) uint32_t unaligned_uint32_t;
> > +typedef __rte_may_alias __rte_aligned(1) uint16_t unaligned_uint16_t;
> > +#else
> > +typedef uint64_t unaligned_uint64_t __rte_may_alias __rte_aligned(1);
> > +typedef uint32_t unaligned_uint32_t __rte_may_alias __rte_aligned(1);
> > +typedef uint16_t unaligned_uint16_t __rte_may_alias __rte_aligned(1);
> >  #endif
>
> Skimming GCC documentation, it looks like older versions required placing such attributes after the type, but newer versions seem to recommend placing them before, like qualifiers (const, volatile, ...).
> Placing them before the type, like qualifiers, seems more natural to me.
> And apparently, MSVC requires it.
> Does it work for GCC and Clang if they are placed before, like MSVC?
> Then we can get rid of the #ifdef RTE_TOOLCHAIN_MSVC.

Good call! Per https://godbolt.org/z/oYrnfsMM3, gcc 8 and clang 7 both
accept the attributes placed before the type.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v19 0/2] net: optimize __rte_raw_cksum
  2026-01-28 19:41       ` [PATCH v18 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2026-01-28 19:41         ` [PATCH v18 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
  2026-01-28 19:41         ` [PATCH v18 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
@ 2026-02-02  4:48         ` scott.k.mitch1
  2026-02-02  4:48           ` [PATCH v19 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
                             ` (2 more replies)
  2 siblings, 3 replies; 39+ messages in thread
From: scott.k.mitch1 @ 2026-02-02  4:48 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, bruce.richardson, david.marchand, Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

This series optimizes __rte_raw_cksum by replacing memcpy with direct
pointer access, enabling compiler vectorization on both GCC and Clang.

Patch 1 adds __rte_may_alias and __rte_aligned(1) to unaligned typedefs
to prevent a GCC strict-aliasing bug where struct initialization is
incorrectly elided, and avoid UB by clarifying access can be from any
address.

Patch 2 uses the improved unaligned_uint16_t type in __rte_raw_cksum
to enable compiler optimizations while maintaining correctness across
all architectures (including strict-alignment platforms).

Performance results show significant improvements (40% for small buffers,
up to 8x for larger buffers) on Intel Xeon with Clang 18.1.

Changes in v19:
- Move qualifiers before typedef on all platforms
- test_hash_functions explicit 32 bit variable use

Changes in v18:
- Fix MSVC compile error __rte_aligned(1) must come before type
- Fix test_hash_functions incorrect usage of unaligned_uint32_t

Changes in v17:
- Use __rte_aligned(1) unconditionally on unaligned type aliases
- test_cksum_fuzz uses unit_test_suite_runner
- test_cksum_fuzz reference method rename to
test_cksum_fuzz_cksum_reference

Changes in v16:
- Add Fixes tag and Cc stable/author for backporting (patch 1)

Changes in v15:
- Use NOHUGE_OK and ASAN_OK constants in REGISTER_FAST_TEST

Changes in v14:
- Split into two patches: EAL typedef fix and checksum optimization
- Use unaligned_uint16_t directly instead of wrapper struct
- Added __rte_may_alias to unaligned typedefs to prevent GCC bug

Scott Mitchell (2):
  eal: add __rte_may_alias and __rte_aligned to unaligned typedefs
  net: __rte_raw_cksum pointers enable compiler optimizations

 app/test/meson.build           |   1 +
 app/test/test_cksum_fuzz.c     | 234 +++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c     |   2 +-
 app/test/test_hash_functions.c |   6 +-
 lib/eal/include/rte_common.h   |  49 ++++---
 lib/net/rte_cksum.h            |  14 +-
 6 files changed, 279 insertions(+), 27 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

-- 
2.39.5 (Apple Git-154)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v19 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs
  2026-02-02  4:48         ` [PATCH v19 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
@ 2026-02-02  4:48           ` scott.k.mitch1
  2026-02-03  8:18             ` Morten Brørup
  2026-02-16 14:29             ` David Marchand
  2026-02-02  4:48           ` [PATCH v19 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
  2026-02-06 14:54           ` [PATCH v19 0/2] net: optimize __rte_raw_cksum David Marchand
  2 siblings, 2 replies; 39+ messages in thread
From: scott.k.mitch1 @ 2026-02-02  4:48 UTC (permalink / raw)
  To: dev
  Cc: mb, stephen, bruce.richardson, david.marchand, Scott Mitchell,
	Cyril Chemparathy, stable

From: Scott Mitchell <scott.k.mitch1@gmail.com>

Add __rte_may_alias attribute to unaligned_uint{16,32,64}_t typedefs
to prevent GCC strict-aliasing optimization bugs. GCC has a bug where
it incorrectly elides struct initialization when strict aliasing is
enabled, causing reads from uninitialized memory.

Add __rte_aligned(1) attribute to unaligned_uint{16,32,64}_t typedefs
to allow safe access at any alignment. Without it, accessing a
uint16_t at an odd address is undefined behavior, which UBSan reports
as `UndefinedBehaviorSanitizer: undefined-behavior`.

Fixes: 7621d6a8d0bd ("eal: add and use unaligned integer types")
Cc: Cyril Chemparathy <cchemparathy@ezchip.com>
Cc: stable@dpdk.org
Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 app/test/test_hash_functions.c |  6 ++++-
 lib/eal/include/rte_common.h   | 49 +++++++++++++++++++++++-----------
 2 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/app/test/test_hash_functions.c b/app/test/test_hash_functions.c
index 70820d1f19..9524e3135f 100644
--- a/app/test/test_hash_functions.c
+++ b/app/test/test_hash_functions.c
@@ -187,11 +187,15 @@ verify_jhash_32bits(void)
 {
 	unsigned i, j;
 	uint8_t key[64];
+	/* to guarantee alignment for rte_jhash_32b, use u32 and copy data */
+	uint32_t key32[sizeof(key) / sizeof(uint32_t)];
 	uint32_t hash, hash32;
 
 	for (i = 0; i < 64; i++)
 		key[i] = rand() & 0xff;
 
+	memcpy(key32, key, sizeof(key));
+
 	for (i = 0; i < RTE_DIM(hashtest_key_lens); i++) {
 		for (j = 0; j < RTE_DIM(hashtest_initvals); j++) {
 			/* Key size must be multiple of 4 (32 bits) */
@@ -199,7 +203,7 @@ verify_jhash_32bits(void)
 				hash = rte_jhash(key, hashtest_key_lens[i],
 						hashtest_initvals[j]);
 				/* Divide key length by 4 in rte_jhash for 32 bits */
-				hash32 = rte_jhash_32b((const unaligned_uint32_t *)key,
+				hash32 = rte_jhash_32b(key32,
 						hashtest_key_lens[i] >> 2,
 						hashtest_initvals[j]);
 				if (hash != hash32) {
diff --git a/lib/eal/include/rte_common.h b/lib/eal/include/rte_common.h
index 573bf4f2ce..7b36966019 100644
--- a/lib/eal/include/rte_common.h
+++ b/lib/eal/include/rte_common.h
@@ -121,16 +121,42 @@ extern "C" {
 #define __rte_aligned(a) __attribute__((__aligned__(a)))
 #endif
 
-#ifdef RTE_ARCH_STRICT_ALIGN
-typedef uint64_t unaligned_uint64_t __rte_aligned(1);
-typedef uint32_t unaligned_uint32_t __rte_aligned(1);
-typedef uint16_t unaligned_uint16_t __rte_aligned(1);
+/**
+ * Macro to mark a type that is not subject to type-based aliasing rules
+ */
+#ifdef RTE_TOOLCHAIN_MSVC
+#define __rte_may_alias
 #else
-typedef uint64_t unaligned_uint64_t;
-typedef uint32_t unaligned_uint32_t;
-typedef uint16_t unaligned_uint16_t;
+#define __rte_may_alias __attribute__((__may_alias__))
 #endif
 
+/* Unaligned types implementation notes:
+ * __rte_aligned(1) - Reduces alignment requirement to 1 byte, allowing
+ *                    these types to safely access memory at any address.
+ *                    Without this, accessing a uint16_t at an odd address
+ *                    is undefined behavior (even on x86 where hardware
+ *                    handles it).
+ *
+ * __rte_may_alias  - Prevents strict-aliasing optimization bugs where
+ *                    compilers may incorrectly elide memory operations
+ *                    when casting between pointer types.
+ */
+
+/**
+ * Type for safe unaligned u64 access.
+ */
+typedef __rte_may_alias __rte_aligned(1) uint64_t unaligned_uint64_t;
+
+/**
+ * Type for safe unaligned u32 access.
+ */
+typedef __rte_may_alias __rte_aligned(1) uint32_t unaligned_uint32_t;
+
+/**
+ * Type for safe unaligned u16 access.
+ */
+typedef __rte_may_alias __rte_aligned(1) uint16_t unaligned_uint16_t;
+
 /**
  * @deprecated
  * @see __rte_packed_begin
@@ -159,15 +185,6 @@ typedef uint16_t unaligned_uint16_t;
 #define __rte_packed_end __attribute__((__packed__))
 #endif
 
-/**
- * Macro to mark a type that is not subject to type-based aliasing rules
- */
-#ifdef RTE_TOOLCHAIN_MSVC
-#define __rte_may_alias
-#else
-#define __rte_may_alias __attribute__((__may_alias__))
-#endif
-
 /******* Macro to mark functions and fields scheduled for removal *****/
 #ifdef RTE_TOOLCHAIN_MSVC
 #define __rte_deprecated
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v19 2/2] net: __rte_raw_cksum pointers enable compiler optimizations
  2026-02-02  4:48         ` [PATCH v19 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2026-02-02  4:48           ` [PATCH v19 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
@ 2026-02-02  4:48           ` scott.k.mitch1
  2026-02-03  8:19             ` Morten Brørup
  2026-02-06 14:54           ` [PATCH v19 0/2] net: optimize __rte_raw_cksum David Marchand
  2 siblings, 1 reply; 39+ messages in thread
From: scott.k.mitch1 @ 2026-02-02  4:48 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, bruce.richardson, david.marchand, Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

__rte_raw_cksum uses a loop with memcpy on each iteration.
GCC 15+ is able to vectorize the loop but Clang 18.1 is not.

Replace memcpy with direct pointer access using unaligned_uint16_t.
This enables both GCC and Clang to vectorize the loop while handling
unaligned access safely on all architectures.

Performance results from cksum_perf_autotest on Intel Xeon
(Cascade Lake, AVX-512) built with Clang 18.1 (TSC cycles/byte):

  Block size    Before    After    Improvement
         100      0.40     0.24        ~40%
        1500      0.50     0.06        ~8x
        9000      0.49     0.06        ~8x

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 app/test/meson.build       |   1 +
 app/test/test_cksum_fuzz.c | 234 +++++++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c |   2 +-
 lib/net/rte_cksum.h        |  14 +--
 4 files changed, 241 insertions(+), 10 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

diff --git a/app/test/meson.build b/app/test/meson.build
index f4d04a6e42..2ca17716b9 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -38,6 +38,7 @@ source_file_deps = {
     'test_byteorder.c': [],
     'test_cfgfile.c': ['cfgfile'],
     'test_cksum.c': ['net'],
+    'test_cksum_fuzz.c': ['net'],
     'test_cksum_perf.c': ['net'],
     'test_cmdline.c': [],
     'test_cmdline_cirbuf.c': [],
diff --git a/app/test/test_cksum_fuzz.c b/app/test/test_cksum_fuzz.c
new file mode 100644
index 0000000000..33b4c77f51
--- /dev/null
+++ b/app/test/test_cksum_fuzz.c
@@ -0,0 +1,234 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Apple Inc.
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_hexdump.h>
+#include <rte_cksum.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+/*
+ * Fuzz test for __rte_raw_cksum optimization.
+ * Compares the optimized implementation against the original reference
+ * implementation across random data of various lengths.
+ */
+
+#define DEFAULT_ITERATIONS 1000
+#define MAX_TEST_LEN 65536  /* 64K to match GRO frame sizes */
+
+/*
+ * Original (reference) implementation of __rte_raw_cksum from DPDK v23.11.
+ * This is retained here for comparison testing against the optimized version.
+ */
+static inline uint32_t
+test_cksum_fuzz_cksum_reference(const void *buf, size_t len, uint32_t sum)
+{
+	const void *end;
+
+	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
+	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
+		uint16_t v;
+
+		memcpy(&v, buf, sizeof(uint16_t));
+		sum += v;
+	}
+
+	/* if length is odd, keeping it byte order independent */
+	if (unlikely(len % 2)) {
+		uint16_t left = 0;
+
+		memcpy(&left, end, 1);
+		sum += left;
+	}
+
+	return sum;
+}
+
+static void
+init_random_buffer(uint8_t *buf, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		buf[i] = (uint8_t)rte_rand();
+}
+
+static inline uint32_t
+get_initial_sum(bool random_initial_sum)
+{
+	return random_initial_sum ? (rte_rand() & 0xFFFFFFFF) : 0;
+}
+
+/*
+ * Test a single buffer length with specific alignment and initial sum
+ */
+static int
+test_cksum_fuzz_length_aligned(size_t len, bool aligned, uint32_t initial_sum)
+{
+	uint8_t *data;
+	uint8_t *buf;
+	size_t alloc_size;
+	uint32_t sum_ref, sum_opt;
+
+	if (len == 0 && !aligned) {
+		/* Skip unaligned test for zero length - nothing to test */
+		return TEST_SUCCESS;
+	}
+
+	/* Allocate exact size for aligned, +1 for unaligned offset */
+	alloc_size = aligned ? len : len + 1;
+	if (alloc_size == 0)
+		alloc_size = 1;  /* rte_malloc doesn't like 0 */
+
+	data = rte_malloc(NULL, alloc_size, 64);
+	if (data == NULL) {
+		printf("Failed to allocate %zu bytes\n", alloc_size);
+		return TEST_FAILED;
+	}
+
+	buf = aligned ? data : (data + 1);
+
+	init_random_buffer(buf, len);
+
+	sum_ref = test_cksum_fuzz_cksum_reference(buf, len, initial_sum);
+	sum_opt = __rte_raw_cksum(buf, len, initial_sum);
+
+	if (sum_ref != sum_opt) {
+		printf("MISMATCH at len=%zu aligned='%s' initial_sum=0x%08x ref=0x%08x opt=0x%08x\n",
+		       len, aligned ? "aligned" : "unaligned",
+		       initial_sum, sum_ref, sum_opt);
+		rte_hexdump(stdout, "failing buffer", buf, len);
+		rte_free(data);
+		return TEST_FAILED;
+	}
+
+	rte_free(data);
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test a length with both alignments
+ */
+static int
+test_cksum_fuzz_length(size_t len, uint32_t initial_sum)
+{
+	int rc;
+
+	/* Test aligned */
+	rc = test_cksum_fuzz_length_aligned(len, true, initial_sum);
+	if (rc != TEST_SUCCESS)
+		return rc;
+
+	/* Test unaligned */
+	rc = test_cksum_fuzz_length_aligned(len, false, initial_sum);
+
+	return rc;
+}
+
+/*
+ * Test specific edge case lengths
+ */
+static int
+test_cksum_fuzz_edge_cases(void)
+{
+	/* Edge case lengths that might trigger bugs */
+	static const size_t edge_lengths[] = {
+		0, 1, 2, 3, 4, 5, 6, 7, 8,
+		15, 16, 17,
+		31, 32, 33,
+		63, 64, 65,
+		127, 128, 129,
+		255, 256, 257,
+		511, 512, 513,
+		1023, 1024, 1025,
+		1500, 1501,  /* MTU boundaries */
+		2047, 2048, 2049,
+		4095, 4096, 4097,
+		8191, 8192, 8193,
+		16383, 16384, 16385,
+		32767, 32768, 32769,
+		65534, 65535, 65536  /* 64K GRO boundaries */
+	};
+	unsigned int i;
+	int rc;
+
+	printf("Testing edge case lengths...\n");
+
+	for (i = 0; i < RTE_DIM(edge_lengths); i++) {
+		/* Test with zero initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], 0);
+		if (rc != TEST_SUCCESS)
+			return rc;
+
+		/* Test with random initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], get_initial_sum(true));
+		if (rc != TEST_SUCCESS)
+			return rc;
+	}
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test random lengths with optional random initial sums
+ */
+static int
+test_cksum_fuzz_random(unsigned int iterations, bool random_initial_sum)
+{
+	unsigned int i;
+	int rc;
+
+	printf("Testing random lengths (0-%d)%s...\n", MAX_TEST_LEN,
+	       random_initial_sum ? " with random initial sums" : "");
+
+	for (i = 0; i < iterations; i++) {
+		size_t len = rte_rand() % (MAX_TEST_LEN + 1);
+
+		rc = test_cksum_fuzz_length(len, get_initial_sum(random_initial_sum));
+		if (rc != TEST_SUCCESS) {
+			printf("Failed at len=%zu\n", len);
+			return rc;
+		}
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cksum_fuzz_random_zero_sum(void)
+{
+	return test_cksum_fuzz_random(DEFAULT_ITERATIONS, false);
+}
+
+static int
+test_cksum_fuzz_random_random_sum(void)
+{
+	return test_cksum_fuzz_random(DEFAULT_ITERATIONS, true);
+}
+
+static struct unit_test_suite ptr_cksum_fuzz_suite = {
+	.suite_name = "cksum fuzz autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_cksum_fuzz_edge_cases),
+		TEST_CASE(test_cksum_fuzz_random_zero_sum),
+		TEST_CASE(test_cksum_fuzz_random_random_sum),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_cksum_fuzz_suite(void)
+{
+	return unit_test_suite_runner(&ptr_cksum_fuzz_suite);
+}
+
+REGISTER_FAST_TEST(cksum_fuzz_autotest, NOHUGE_OK, ASAN_OK, test_cksum_fuzz_suite);
diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
index 0b919cd59f..6b1d4589e0 100644
--- a/app/test/test_cksum_perf.c
+++ b/app/test/test_cksum_perf.c
@@ -15,7 +15,7 @@
 #define NUM_BLOCKS 10
 #define ITERATIONS 1000000
 
-static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
+static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501, 9000, 9001, 65536, 65537 };
 
 static __rte_noinline uint16_t
 do_rte_raw_cksum(const void *buf, size_t len)
diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
index a8e8927952..f04b46a6c3 100644
--- a/lib/net/rte_cksum.h
+++ b/lib/net/rte_cksum.h
@@ -42,15 +42,11 @@ extern "C" {
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-	const void *end;
-
-	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
-	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
-		uint16_t v;
-
-		memcpy(&v, buf, sizeof(uint16_t));
-		sum += v;
-	}
+	/* Process uint16 chunks to preserve overflow/carry math. GCC/Clang vectorize the loop. */
+	const unaligned_uint16_t *buf16 = (const unaligned_uint16_t *)buf;
+	const unaligned_uint16_t *end = buf16 + (len / sizeof(*buf16));
+	for (; buf16 != end; buf16++)
+		sum += *buf16;
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* RE: [PATCH v19 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs
  2026-02-02  4:48           ` [PATCH v19 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
@ 2026-02-03  8:18             ` Morten Brørup
  2026-02-16 14:29             ` David Marchand
  1 sibling, 0 replies; 39+ messages in thread
From: Morten Brørup @ 2026-02-03  8:18 UTC (permalink / raw)
  To: scott.k.mitch1, dev
  Cc: stephen, bruce.richardson, david.marchand, Cyril Chemparathy,
	stable

Acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH v19 2/2] net: __rte_raw_cksum pointers enable compiler optimizations
  2026-02-02  4:48           ` [PATCH v19 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
@ 2026-02-03  8:19             ` Morten Brørup
  0 siblings, 0 replies; 39+ messages in thread
From: Morten Brørup @ 2026-02-03  8:19 UTC (permalink / raw)
  To: scott.k.mitch1, dev; +Cc: stephen, bruce.richardson, david.marchand

Acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v19 0/2] net: optimize __rte_raw_cksum
  2026-02-02  4:48         ` [PATCH v19 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
  2026-02-02  4:48           ` [PATCH v19 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
  2026-02-02  4:48           ` [PATCH v19 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
@ 2026-02-06 14:54           ` David Marchand
  2026-02-07  1:29             ` Scott Mitchell
  2 siblings, 1 reply; 39+ messages in thread
From: David Marchand @ 2026-02-06 14:54 UTC (permalink / raw)
  To: scott.k.mitch1; +Cc: dev, mb, stephen, bruce.richardson, Thomas Monjalon

Hi Scott,

On Mon, 2 Feb 2026 at 05:48, <scott.k.mitch1@gmail.com> wrote:
>
> From: Scott <scott.k.mitch1@gmail.com>
>
> This series optimizes __rte_raw_cksum by replacing memcpy with direct
> pointer access, enabling compiler vectorization on both GCC and Clang.
>
> Patch 1 adds __rte_may_alias and __rte_aligned(1) to unaligned typedefs
> to prevent a GCC strict-aliasing bug where struct initialization is
> incorrectly elided, and avoid UB by clarifying access can be from any
> address.
>
> Patch 2 uses the improved unaligned_uint16_t type in __rte_raw_cksum
> to enable compiler optimizations while maintaining correctness across
> all architectures (including strict-alignment platforms).
>
> Performance results show significant improvements (40% for small buffers,
> up to 8x for larger buffers) on Intel Xeon with Clang 18.1.
>
> Changes in v19:
> - Move qualifiers before typedef on all platforms
> - test_hash_functions explicit 32 bit variable use
>
> Changes in v18:
> - Fix MSVC compile error __rte_aligned(1) must come before type
> - Fix test_hash_functions incorrect usage of unaligned_uint32_t
>
> Changes in v17:
> - Use __rte_aligned(1) unconditionally on unaligned type aliases
> - test_cksum_fuzz uses unit_test_suite_runner
> - test_cksum_fuzz reference method rename to
> test_cksum_fuzz_cksum_reference
>
> Changes in v16:
> - Add Fixes tag and Cc stable/author for backporting (patch 1)
>
> Changes in v15:
> - Use NOHUGE_OK and ASAN_OK constants in REGISTER_FAST_TEST
>
> Changes in v14:
> - Split into two patches: EAL typedef fix and checksum optimization
> - Use unaligned_uint16_t directly instead of wrapper struct
> - Added __rte_may_alias to unaligned typedefs to prevent GCC bug
>
> Scott Mitchell (2):
>   eal: add __rte_may_alias and __rte_aligned to unaligned typedefs
>   net: __rte_raw_cksum pointers enable compiler optimizations
>
>  app/test/meson.build           |   1 +
>  app/test/test_cksum_fuzz.c     | 234 +++++++++++++++++++++++++++++++++
>  app/test/test_cksum_perf.c     |   2 +-
>  app/test/test_hash_functions.c |   6 +-
>  lib/eal/include/rte_common.h   |  49 ++++---
>  lib/net/rte_cksum.h            |  14 +-
>  6 files changed, 279 insertions(+), 27 deletions(-)
>  create mode 100644 app/test/test_cksum_fuzz.c

I have been trying to reproduce the numbers with one (venerable)
Skylake processor but I see no difference before/after the series.
Numbers are in the same range with gcc (11) and clang (20) on this
RHEL 9 system.

RTE>>cksum_perf_autotest
### rte_raw_cksum() performance ###
Alignment  Block size    TSC cycles/block  TSC cycles/byte
Aligned           20                13.0             0.65
Unaligned         20                13.0             0.65
Aligned           21                14.0             0.67
Unaligned         21                14.0             0.67
Aligned          100                19.1             0.19
Unaligned        100                19.4             0.19
Aligned          101                20.1             0.20
Unaligned        101                22.1             0.22
Aligned         1500               132.5             0.09
Unaligned       1500               134.9             0.09
Aligned         1501               133.1             0.09
Unaligned       1501               146.3             0.10
Aligned         9000               766.7             0.09
Unaligned       9000               802.2             0.09
Aligned         9001               767.6             0.09
Unaligned       9001               800.3             0.09
Aligned        65536              5404.8             0.08
Unaligned      65536              5596.3             0.09
Aligned        65537              5406.8             0.08
Unaligned      65537              5604.5             0.09


Is the improvement only affecting clang18?
Other things I should check?


-- 
David Marchand


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v19 0/2] net: optimize __rte_raw_cksum
  2026-02-06 14:54           ` [PATCH v19 0/2] net: optimize __rte_raw_cksum David Marchand
@ 2026-02-07  1:29             ` Scott Mitchell
  2026-02-10 11:53               ` Thomas Monjalon
  2026-02-16 14:04               ` David Marchand
  0 siblings, 2 replies; 39+ messages in thread
From: Scott Mitchell @ 2026-02-07  1:29 UTC (permalink / raw)
  To: David Marchand; +Cc: dev, mb, stephen, bruce.richardson, Thomas Monjalon

Thanks for testing! I included my build/host config, results on the
main branch, and then with this patch applied below. What are your build
flags/configuration (e.g. cpu_instruction_set, march, optimization
level, etc.)? I wasn't able to get any Clang version (18, 19, 20) to
vectorize on Godbolt https://godbolt.org/z/8149r7sq8, and I am curious
whether your config enables vectorization.

#### build / host config
  User defined options
    b_lto              : false
    buildtype          : release
    c_args             : -fno-omit-frame-pointer
-DPACKET_QDISC_BYPASS=1 -DRTE_MEMCPY_AVX512=1
    cpu_instruction_set: cascadelake
    default_library    : static
    max_lcores         : 128
    optimization       : 3
$ clang --version
clang version 18.1.8 (Red Hat, Inc. 18.1.8-3.el9)
$ cat /etc/redhat-release
Red Hat Enterprise Linux release 9.4 (Plow)

#### main branch
$ echo "cksum_perf_autotest" | /usr/local/bin/dpdk-test
### rte_raw_cksum() performance ###
Alignment  Block size    TSC cycles/block  TSC cycles/byte
Aligned           20                10.0             0.50
Unaligned         20                10.1             0.50
Aligned           21                11.1             0.53
Unaligned         21                11.6             0.55
Aligned          100                39.4             0.39
Unaligned        100                67.3             0.67
Aligned          101                43.3             0.43
Unaligned        101                41.5             0.41
Aligned         1500               728.2             0.49
Unaligned       1500               805.8             0.54
Aligned         1501               768.8             0.51
Unaligned       1501               787.3             0.52
Test OK

#### with this patch
$ echo "cksum_perf_autotest" | /usr/local/bin/dpdk-test
### rte_raw_cksum() performance ###
Alignment  Block size    TSC cycles/block  TSC cycles/byte
Aligned           20                12.6             0.63
Unaligned         20                12.3             0.62
Aligned           21                13.6             0.65
Unaligned         21                13.6             0.65
Aligned          100                22.7             0.23
Unaligned        100                22.6             0.23
Aligned          101                47.4             0.47
Unaligned        101                23.9             0.24
Aligned         1500                73.9             0.05
Unaligned       1500                73.9             0.05
Aligned         1501                95.7             0.06
Unaligned       1501                73.9             0.05
Aligned         9000               459.8             0.05
Unaligned       9000               523.5             0.06
Aligned         9001               536.7             0.06
Unaligned       9001               507.5             0.06
Aligned        65536              3158.4             0.05
Unaligned      65536              3506.1             0.05
Aligned        65537              3277.6             0.05
Unaligned      65537              3697.6             0.06
Test OK

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v19 0/2] net: optimize __rte_raw_cksum
  2026-02-07  1:29             ` Scott Mitchell
@ 2026-02-10 11:53               ` Thomas Monjalon
  2026-02-16 14:04               ` David Marchand
  1 sibling, 0 replies; 39+ messages in thread
From: Thomas Monjalon @ 2026-02-10 11:53 UTC (permalink / raw)
  To: Scott Mitchell; +Cc: David Marchand, dev, mb, stephen, bruce.richardson

Here are my test results:

    buildtype             : debugoptimized
    default_library       : shared
    -march=x86-64-v4 (Cascade Lake)
    gcc 15.2.1
    clang 21.1.6

GCC - BEFORE
Alignment  Block size    TSC cycles/block  TSC cycles/byte
Aligned           20                20.5             1.02
Unaligned         20                14.1             0.70
Aligned           21                15.8             0.75
Unaligned         21                15.8             0.75
Aligned         1500               148.2             0.10
Unaligned       1500               148.3             0.10
Aligned         1501               148.4             0.10
Unaligned       1501               148.2             0.10

GCC - AFTER
Alignment  Block size    TSC cycles/block  TSC cycles/byte
Aligned           20                20.8             1.04
Unaligned         20                15.6             0.78
Aligned           21                16.9             0.81
Unaligned         21                16.9             0.80
Aligned         1500               109.5             0.07
Unaligned       1500               111.6             0.07
Aligned         1501               111.1             0.07
Unaligned       1501               113.0             0.08
Aligned         9000               612.4             0.07
Unaligned       9000               612.6             0.07
Aligned         9001               581.5             0.06
Unaligned       9001               601.7             0.07

CLANG - BEFORE
Alignment  Block size    TSC cycles/block  TSC cycles/byte
Aligned           20                14.2             0.71
Unaligned         20                 9.5             0.47
Aligned           21                11.7             0.56
Unaligned         21                11.8             0.56
Aligned         1500               610.7             0.41
Unaligned       1500               632.0             0.42
Aligned         1501               610.4             0.41
Unaligned       1501               627.6             0.42

CLANG - AFTER
Alignment  Block size    TSC cycles/block  TSC cycles/byte
Aligned           20                14.0             0.70
Unaligned         20                 9.1             0.45
Aligned           21                 9.7             0.46
Unaligned         21                 9.6             0.46
Aligned         1500                77.9             0.05
Unaligned       1500                79.4             0.05
Aligned         1501                79.4             0.05
Unaligned       1501                80.4             0.05
Aligned         9000               447.8             0.05
Unaligned       9000               492.1             0.05
Aligned         9001               448.5             0.05
Unaligned       9001               492.6             0.05

Before your patch,
With small block size, clang is better than GCC.
With large block size, GCC is better than clang.
After your patch, clang is always better than GCC.


07/02/2026 02:29, Scott Mitchell:
> Thanks for testing! I included my build/host config, results on the
> main branch, and then with this patch applied below. What are your build
> flags/configuration (e.g. cpu_instruction_set, march, optimization
> level, etc.)? I wasn't able to get any Clang version (18, 19, 20) to
> vectorize on Godbolt https://godbolt.org/z/8149r7sq8, and I am curious
> whether your config enables vectorization.
> 
> #### build / host config
>   User defined options
>     b_lto              : false
>     buildtype          : release
>     c_args             : -fno-omit-frame-pointer
> -DPACKET_QDISC_BYPASS=1 -DRTE_MEMCPY_AVX512=1
>     cpu_instruction_set: cascadelake
>     default_library    : static
>     max_lcores         : 128
>     optimization       : 3
> $ clang --version
> clang version 18.1.8 (Red Hat, Inc. 18.1.8-3.el9)
> $ cat /etc/redhat-release
> Red Hat Enterprise Linux release 9.4 (Plow)
> 
> #### main branch
> $ echo "cksum_perf_autotest" | /usr/local/bin/dpdk-test
> ### rte_raw_cksum() performance ###
> Alignment  Block size    TSC cycles/block  TSC cycles/byte
> Aligned           20                10.0             0.50
> Unaligned         20                10.1             0.50
> Aligned           21                11.1             0.53
> Unaligned         21                11.6             0.55
> Aligned          100                39.4             0.39
> Unaligned        100                67.3             0.67
> Aligned          101                43.3             0.43
> Unaligned        101                41.5             0.41
> Aligned         1500               728.2             0.49
> Unaligned       1500               805.8             0.54
> Aligned         1501               768.8             0.51
> Unaligned       1501               787.3             0.52
> Test OK
> 
> #### with this patch
> $ echo "cksum_perf_autotest" | /usr/local/bin/dpdk-test
> ### rte_raw_cksum() performance ###
> Alignment  Block size    TSC cycles/block  TSC cycles/byte
> Aligned           20                12.6             0.63
> Unaligned         20                12.3             0.62
> Aligned           21                13.6             0.65
> Unaligned         21                13.6             0.65
> Aligned          100                22.7             0.23
> Unaligned        100                22.6             0.23
> Aligned          101                47.4             0.47
> Unaligned        101                23.9             0.24
> Aligned         1500                73.9             0.05
> Unaligned       1500                73.9             0.05
> Aligned         1501                95.7             0.06
> Unaligned       1501                73.9             0.05
> Aligned         9000               459.8             0.05
> Unaligned       9000               523.5             0.06
> Aligned         9001               536.7             0.06
> Unaligned       9001               507.5             0.06
> Aligned        65536              3158.4             0.05
> Unaligned      65536              3506.1             0.05
> Aligned        65537              3277.6             0.05
> Unaligned      65537              3697.6             0.06
> Test OK
> 






^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v19 0/2] net: optimize __rte_raw_cksum
  2026-02-07  1:29             ` Scott Mitchell
  2026-02-10 11:53               ` Thomas Monjalon
@ 2026-02-16 14:04               ` David Marchand
  1 sibling, 0 replies; 39+ messages in thread
From: David Marchand @ 2026-02-16 14:04 UTC (permalink / raw)
  To: Scott Mitchell; +Cc: dev, mb, stephen, bruce.richardson, Thomas Monjalon

On Sat, 7 Feb 2026 at 02:29, Scott Mitchell <scott.k.mitch1@gmail.com> wrote:
>
> Thanks for testing! I included my build/host config, results on the
> main branch, and then with this patch applied below. What are your build
> flags/configuration (e.g. cpu_instruction_set, march, optimization
> level, etc.)? I wasn't able to get any Clang version (18, 19, 20) to
> vectorize on Godbolt https://godbolt.org/z/8149r7sq8, and I am curious
> whether your config enables vectorization.
>
> #### build / host config
>   User defined options
>     b_lto              : false
>     buildtype          : release
>     c_args             : -fno-omit-frame-pointer
> -DPACKET_QDISC_BYPASS=1 -DRTE_MEMCPY_AVX512=1
>     cpu_instruction_set: cascadelake
>     default_library    : static
>     max_lcores         : 128
>     optimization       : 3
> $ clang --version
> clang version 18.1.8 (Red Hat, Inc. 18.1.8-3.el9)
> $ cat /etc/redhat-release
> Red Hat Enterprise Linux release 9.4 (Plow)
>
> #### main branch
> $ echo "cksum_perf_autotest" | /usr/local/bin/dpdk-test
> ### rte_raw_cksum() performance ###
> Alignment  Block size    TSC cycles/block  TSC cycles/byte
> Aligned           20                10.0             0.50
> Unaligned         20                10.1             0.50
> Aligned           21                11.1             0.53
> Unaligned         21                11.6             0.55
> Aligned          100                39.4             0.39
> Unaligned        100                67.3             0.67
> Aligned          101                43.3             0.43
> Unaligned        101                41.5             0.41
> Aligned         1500               728.2             0.49
> Unaligned       1500               805.8             0.54
> Aligned         1501               768.8             0.51
> Unaligned       1501               787.3             0.52
> Test OK
>
> #### with this patch
> $ echo "cksum_perf_autotest" | /usr/local/bin/dpdk-test
> ### rte_raw_cksum() performance ###
> Alignment  Block size    TSC cycles/block  TSC cycles/byte
> Aligned           20                12.6             0.63
> Unaligned         20                12.3             0.62
> Aligned           21                13.6             0.65
> Unaligned         21                13.6             0.65
> Aligned          100                22.7             0.23
> Unaligned        100                22.6             0.23
> Aligned          101                47.4             0.47
> Unaligned        101                23.9             0.24
> Aligned         1500                73.9             0.05
> Unaligned       1500                73.9             0.05
> Aligned         1501                95.7             0.06
> Unaligned       1501                73.9             0.05
> Aligned         9000               459.8             0.05
> Unaligned       9000               523.5             0.06
> Aligned         9001               536.7             0.06
> Unaligned       9001               507.5             0.06
> Aligned        65536              3158.4             0.05
> Unaligned      65536              3506.1             0.05
> Aligned        65537              3277.6             0.05
> Unaligned      65537              3697.6             0.06
> Test OK

I redid my bench from scratch and I do see an improvement for clang.
-Aligned         1500               905.3             0.60
-Unaligned       1500               924.9             0.62
-Aligned         1501               907.6             0.60
-Unaligned       1501               932.1             0.62
-Aligned         9000              5252.1             0.58
-Unaligned       9000              5433.0             0.60
-Aligned         9001              5260.9             0.58
-Unaligned       9001              5440.4             0.60
-Aligned        65536             38395.2             0.59
-Unaligned      65536             39639.5             0.60
-Aligned        65537             38030.3             0.58
-Unaligned      65537             39292.7             0.60

+Aligned         1500               104.0             0.07
+Unaligned       1500               106.5             0.07
+Aligned         1501               104.1             0.07
+Unaligned       1501               107.0             0.07
+Aligned         9000               596.7             0.07
+Unaligned       9000               655.1             0.07
+Aligned         9001               597.6             0.07
+Unaligned       9001               657.2             0.07
+Aligned        65536              4139.3             0.06
+Unaligned      65536              4583.2             0.07
+Aligned        65537              4139.9             0.06
+Unaligned      65537              4585.9             0.07

Something was most likely wrong in my earlier test (and seeing how the
gcc and clang numbers looked so close... I may have been benchmarking
the gcc binary...).
The improvement is noticeable with clang, with no special
cpu_instruction_set or compiler optimisation level set.

I'll finish my checks and merge this nice improvement for rc1.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v19 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs
  2026-02-02  4:48           ` [PATCH v19 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
  2026-02-03  8:18             ` Morten Brørup
@ 2026-02-16 14:29             ` David Marchand
  2026-02-16 15:00               ` Morten Brørup
  1 sibling, 1 reply; 39+ messages in thread
From: David Marchand @ 2026-02-16 14:29 UTC (permalink / raw)
  To: scott.k.mitch1, Andre Muezerie, Tyler Retzlaff
  Cc: dev, mb, stephen, bruce.richardson, Cyril Chemparathy, stable

Hello Scott, Andre, Tyler,

On Mon, 2 Feb 2026 at 05:48, <scott.k.mitch1@gmail.com> wrote:
>
> From: Scott Mitchell <scott.k.mitch1@gmail.com>
>
> Add __rte_may_alias attribute to unaligned_uint{16,32,64}_t typedefs
> to prevent GCC strict-aliasing optimization bugs. GCC has a bug where
> it incorrectly elides struct initialization when strict aliasing is
> enabled, causing reads from uninitialized memory.
>
> Add __rte_aligned(1) attribute to unaligned_uint{16,32,64}_t typedefs
> which allows for safe access at any alignment. Without this, accessing
> a uint16_t at an odd address is undefined behavior. Without this
> UBSan detects `UndefinedBehaviorSanitizer: undefined-behavior`.
>
> Fixes: 7621d6a8d0bd ("eal: add and use unaligned integer types")
> Cc: stable@dpdk.org
>
> Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>

[snip]

> diff --git a/lib/eal/include/rte_common.h b/lib/eal/include/rte_common.h
> index 573bf4f2ce..7b36966019 100644
> --- a/lib/eal/include/rte_common.h
> +++ b/lib/eal/include/rte_common.h
> @@ -121,16 +121,42 @@ extern "C" {
>  #define __rte_aligned(a) __attribute__((__aligned__(a)))
>  #endif
>
> -#ifdef RTE_ARCH_STRICT_ALIGN
> -typedef uint64_t unaligned_uint64_t __rte_aligned(1);
> -typedef uint32_t unaligned_uint32_t __rte_aligned(1);
> -typedef uint16_t unaligned_uint16_t __rte_aligned(1);
> +/**
> + * Macro to mark a type that is not subject to type-based aliasing rules
> + */
> +#ifdef RTE_TOOLCHAIN_MSVC
> +#define __rte_may_alias
>  #else
> -typedef uint64_t unaligned_uint64_t;
> -typedef uint32_t unaligned_uint32_t;
> -typedef uint16_t unaligned_uint16_t;
> +#define __rte_may_alias __attribute__((__may_alias__))
>  #endif
>
> +/* Unaligned types implementation notes:
> + * __rte_aligned(1) - Reduces alignment requirement to 1 byte, allowing
> + *                    these types to safely access memory at any address.
> + *                    Without this, accessing a uint16_t at an odd address
> + *                    is undefined behavior (even on x86 where hardware
> + *                    handles it).
> + *
> + * __rte_may_alias  - Prevents strict-aliasing optimization bugs where
> + *                    compilers may incorrectly elide memory operations
> + *                    when casting between pointer types.
> + */
> +
> +/**
> + * Type for safe unaligned u64 access.
> + */
> +typedef __rte_may_alias __rte_aligned(1) uint64_t unaligned_uint64_t;
> +
> +/**
> + * Type for safe unaligned u32 access.
> + */
> +typedef __rte_may_alias __rte_aligned(1) uint32_t unaligned_uint32_t;
> +
> +/**
> + * Type for safe unaligned u16 access.
> + */
> +typedef __rte_may_alias __rte_aligned(1) uint16_t unaligned_uint16_t;
> +
>  /**
>   * @deprecated
>   * @see __rte_packed_begin
> @@ -159,15 +185,6 @@ typedef uint16_t unaligned_uint16_t;
>  #define __rte_packed_end __attribute__((__packed__))
>  #endif
>
> -/**
> - * Macro to mark a type that is not subject to type-based aliasing rules
> - */
> -#ifdef RTE_TOOLCHAIN_MSVC
> -#define __rte_may_alias
> -#else
> -#define __rte_may_alias __attribute__((__may_alias__))
> -#endif
> -
>  /******* Macro to mark functions and fields scheduled for removal *****/
>  #ifdef RTE_TOOLCHAIN_MSVC
>  #define __rte_deprecated

This change raises a warning in checkpatch.
https://mails.dpdk.org/archives/test-report/2026-February/955237.html

IIRC, we added this check for MSVC support, making sure no
__rte_aligned() would be added in unsupported locations.

@Microsoft guys, do you have a suggestion?


-- 
David Marchand


^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH v19 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs
  2026-02-16 14:29             ` David Marchand
@ 2026-02-16 15:00               ` Morten Brørup
  0 siblings, 0 replies; 39+ messages in thread
From: Morten Brørup @ 2026-02-16 15:00 UTC (permalink / raw)
  To: David Marchand, scott.k.mitch1, Andre Muezerie, Tyler Retzlaff
  Cc: dev, stephen, bruce.richardson, Cyril Chemparathy, stable

> From: David Marchand [mailto:david.marchand@redhat.com]
> Sent: Monday, 16 February 2026 15.29
> 
> Hello Scott, Andre, Tyler,
> 
> On Mon, 2 Feb 2026 at 05:48, <scott.k.mitch1@gmail.com> wrote:
> >
> > From: Scott Mitchell <scott.k.mitch1@gmail.com>
> >
> > Add __rte_may_alias attribute to unaligned_uint{16,32,64}_t typedefs
> > to prevent GCC strict-aliasing optimization bugs. GCC has a bug where
> > it incorrectly elides struct initialization when strict aliasing is
> > enabled, causing reads from uninitialized memory.
> >
> > Add __rte_aligned(1) attribute to unaligned_uint{16,32,64}_t typedefs
> > which allows for safe access at any alignment. Without this,
> accessing
> > a uint16_t at an odd address is undefined behavior. Without this
> > UBSan detects `UndefinedBehaviorSanitizer: undefined-behavior`.
> >
> > Fixes: 7621d6a8d0bd ("eal: add and use unaligned integer types")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
> 
> [snip]
> 
> > diff --git a/lib/eal/include/rte_common.h
> b/lib/eal/include/rte_common.h
> > index 573bf4f2ce..7b36966019 100644
> > --- a/lib/eal/include/rte_common.h
> > +++ b/lib/eal/include/rte_common.h
> > @@ -121,16 +121,42 @@ extern "C" {
> >  #define __rte_aligned(a) __attribute__((__aligned__(a)))
> >  #endif
> >
> > -#ifdef RTE_ARCH_STRICT_ALIGN
> > -typedef uint64_t unaligned_uint64_t __rte_aligned(1);
> > -typedef uint32_t unaligned_uint32_t __rte_aligned(1);
> > -typedef uint16_t unaligned_uint16_t __rte_aligned(1);
> > +/**
> > + * Macro to mark a type that is not subject to type-based aliasing
> rules
> > + */
> > +#ifdef RTE_TOOLCHAIN_MSVC
> > +#define __rte_may_alias
> >  #else
> > -typedef uint64_t unaligned_uint64_t;
> > -typedef uint32_t unaligned_uint32_t;
> > -typedef uint16_t unaligned_uint16_t;
> > +#define __rte_may_alias __attribute__((__may_alias__))
> >  #endif
> >
> > +/* Unaligned types implementation notes:
> > + * __rte_aligned(1) - Reduces alignment requirement to 1 byte,
> allowing
> > + *                    these types to safely access memory at any
> address.
> > + *                    Without this, accessing a uint16_t at an odd
> address
> > + *                    is undefined behavior (even on x86 where
> hardware
> > + *                    handles it).
> > + *
> > + * __rte_may_alias  - Prevents strict-aliasing optimization bugs
> where
> > + *                    compilers may incorrectly elide memory
> operations
> > + *                    when casting between pointer types.
> > + */
> > +
> > +/**
> > + * Type for safe unaligned u64 access.
> > + */
> > +typedef __rte_may_alias __rte_aligned(1) uint64_t
> unaligned_uint64_t;
> > +
> > +/**
> > + * Type for safe unaligned u32 access.
> > + */
> > +typedef __rte_may_alias __rte_aligned(1) uint32_t
> unaligned_uint32_t;
> > +
> > +/**
> > + * Type for safe unaligned u16 access.
> > + */
> > +typedef __rte_may_alias __rte_aligned(1) uint16_t
> unaligned_uint16_t;
> > +
> >  /**
> >   * @deprecated
> >   * @see __rte_packed_begin
> > @@ -159,15 +185,6 @@ typedef uint16_t unaligned_uint16_t;
> >  #define __rte_packed_end __attribute__((__packed__))
> >  #endif
> >
> > -/**
> > - * Macro to mark a type that is not subject to type-based aliasing
> rules
> > - */
> > -#ifdef RTE_TOOLCHAIN_MSVC
> > -#define __rte_may_alias
> > -#else
> > -#define __rte_may_alias __attribute__((__may_alias__))
> > -#endif
> > -
> >  /******* Macro to mark functions and fields scheduled for removal
> *****/
> >  #ifdef RTE_TOOLCHAIN_MSVC
> >  #define __rte_deprecated
> 
> This change raises a warning in checkpatch.
> https://mails.dpdk.org/archives/test-report/2026-February/955237.html
> 
> IIRC, we added this check for MSVC support, making sure no
> __rte_aligned() would be added in unsupported locations.

It looks like MSVC can use alignment for type definitions too:
https://learn.microsoft.com/en-us/cpp/cpp/align-cpp?view=msvc-170#vclrf_declspecaligntypedef

It is applied on a structure, though, so may not be viable for scalar types. IDK.

But it looks like MSVC can only increase alignment:
https://learn.microsoft.com/en-us/cpp/cpp/align-cpp?view=msvc-170#:~:text=__declspec(align(%23))%20can%20only%20increase%20alignment%20restrictions.

So packing may be needed too.
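The constraint above can be made concrete. One illustrative way to get a
1-byte-aligned 16-bit load on both toolchain families is a packed
single-member struct on MSVC (since __declspec(align) can only raise
alignment) and the attributed scalar typedef on GCC/Clang. Names here are
hypothetical; this is a sketch of the option space, not a proposal for
rte_common.h:

```c
#include <stdint.h>
#include <stddef.h>

#ifdef _MSC_VER
/* MSVC: __declspec(align()) cannot lower alignment, so wrap the scalar
 * in a packed struct whose alignment requirement is 1 byte. */
#pragma pack(push, 1)
typedef struct { uint16_t v; } unaligned_u16;
#pragma pack(pop)
#define U16_LOAD(p) (((const unaligned_u16 *)(p))->v)
#else
/* GCC/Clang: put may_alias and aligned(1) on the scalar typedef itself. */
typedef uint16_t unaligned_u16 __attribute__((__may_alias__, __aligned__(1)));
#define U16_LOAD(p) (*(const unaligned_u16 *)(p))
#endif
```

The struct wrapper changes the access syntax (`->v` vs plain dereference),
which is why hiding it behind a load macro, or keeping two typedef shapes,
would be needed if DPDK went this route.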

> 
> @Microsoft guys, do you have a suggestion?
> 
> 
> --
> David Marchand


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2026-02-16 15:00 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-12 12:04 [PATCH v14 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
2026-01-12 12:04 ` [PATCH v14 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
2026-01-12 13:28   ` Morten Brørup
2026-01-12 15:00     ` Scott Mitchell
2026-01-12 12:04 ` [PATCH v14 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
2026-01-17 21:21 ` [PATCH v15 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
2026-01-17 21:21   ` [PATCH v15 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
2026-01-20 15:23     ` Morten Brørup
2026-01-23 14:34       ` Scott Mitchell
2026-01-17 21:21   ` [PATCH v15 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
2026-01-17 22:08   ` [PATCH v15 0/2] net: optimize __rte_raw_cksum Stephen Hemminger
2026-01-20 12:45     ` Morten Brørup
2026-01-23 15:43       ` Scott Mitchell
2026-01-23 16:02   ` [PATCH v16 " scott.k.mitch1
2026-01-23 16:02     ` [PATCH v16 1/2] eal: add __rte_may_alias to unaligned typedefs scott.k.mitch1
2026-01-23 16:02     ` [PATCH v16 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
2026-01-28 11:05       ` David Marchand
2026-01-28 17:39         ` Scott Mitchell
2026-01-24  8:23     ` [PATCH v16 0/2] net: optimize __rte_raw_cksum Morten Brørup
2026-01-28 18:05     ` [PATCH v17 " scott.k.mitch1
2026-01-28 18:05       ` [PATCH v17 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
2026-01-28 18:05       ` [PATCH v17 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
2026-01-28 19:41       ` [PATCH v18 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
2026-01-28 19:41         ` [PATCH v18 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
2026-01-29  8:28           ` Morten Brørup
2026-02-02  4:31             ` Scott Mitchell
2026-01-28 19:41         ` [PATCH v18 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
2026-01-29  8:31           ` Morten Brørup
2026-02-02  4:48         ` [PATCH v19 0/2] net: optimize __rte_raw_cksum scott.k.mitch1
2026-02-02  4:48           ` [PATCH v19 1/2] eal: add __rte_may_alias and __rte_aligned to unaligned typedefs scott.k.mitch1
2026-02-03  8:18             ` Morten Brørup
2026-02-16 14:29             ` David Marchand
2026-02-16 15:00               ` Morten Brørup
2026-02-02  4:48           ` [PATCH v19 2/2] net: __rte_raw_cksum pointers enable compiler optimizations scott.k.mitch1
2026-02-03  8:19             ` Morten Brørup
2026-02-06 14:54           ` [PATCH v19 0/2] net: optimize __rte_raw_cksum David Marchand
2026-02-07  1:29             ` Scott Mitchell
2026-02-10 11:53               ` Thomas Monjalon
2026-02-16 14:04               ` David Marchand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox