[PATCH v2 0/6] fix test failures on larger core systems

public inbox for dev@dpdk.org

* [PATCH v2 0/6] fix test failures on larger core systems
       [not found] <0260118201223.323024-1-stephen@networkplumber.org>
@ 2026-01-20  1:55 ` Stephen Hemminger
  2026-01-20  1:55   ` [PATCH v2 1/6] test: add pause to synchronization spinloops Stephen Hemminger
                     ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Stephen Hemminger @ 2026-01-20  1:55 UTC (permalink / raw)
  To: dev; +Cc: Stephen Hemminger

This series addresses several test failures that occur sporadically on
systems with many cores (32+), particularly on AMD Zen architectures.
I think Ferruh may have addressed similar problems in earlier
releases.

The root causes fall into three categories:

1. Missing rte_pause() in synchronization spinloops (patch 1)
   Tight spinloops without pause cause SMT thread starvation and
   unpredictable timing behavior.

2. Fixed iteration counts that don't scale (patch 2)
   The atomic test performs 1M iterations per worker regardless of
   core count. With 32+ cores, contention causes timeout failures.

3. File-prefix collisions during parallel test execution (patches 5-6)
   Multiple tests using the default "rte" prefix compete for the same
   fbarray files, causing EAL initialization failures.

Additionally, the series includes two BPF-related fixes for failures
that I was seeing on this system.

4. Lack of error checking in BPF elf load test (patch 3)

5. Unsupported BPF instructions with newer clang (patch 4)
   Clang 20+ generates JMP32 instructions that DPDK BPF doesn't support.

v2 - Drop the unnecessary fsync()
   - Rework the file prefix handling for trace tests

Stephen Hemminger (6):
  test: add pause to synchronization spinloops
  test: fix timeout for atomic test on high core count systems
  test: fix error handling in ELF load tests
  test: fix unsupported BPF instructions in elf load test
  test: add file-prefix for all fast-tests on Linux
  test: fix trace_autotest_with_traces parallel execution

 app/test/bpf/meson.build    |  3 +-
 app/test/suites/meson.build | 23 +++++++++----
 app/test/test_atomic.c      | 67 ++++++++++++++++++++++---------------
 app/test/test_bpf.c         |  3 +-
 app/test/test_threads.c     | 17 +++++-----
 5 files changed, 70 insertions(+), 43 deletions(-)

-- 
2.51.0


* [PATCH v2 1/6] test: add pause to synchronization spinloops
  2026-01-20  1:55 ` [PATCH v2 0/6] fix test failures on larger core systems Stephen Hemminger
@ 2026-01-20  1:55   ` Stephen Hemminger
  2026-01-21 16:10     ` Bruce Richardson
  2026-01-20  1:55   ` [PATCH v2 2/6] test: fix timeout for atomic test on high core count systems Stephen Hemminger
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2026-01-20  1:55 UTC (permalink / raw)
  To: dev; +Cc: Stephen Hemminger, stable

The atomic and thread tests use tight spinloops to synchronize.
These spinloops lack rte_pause() which causes problems on high core
count systems, particularly AMD Zen architectures where:

- Tight spinloops without pause can starve SMT sibling threads
- Memory ordering and store-buffer forwarding behave differently
- Higher core counts amplify timing windows for race conditions

This manifests as sporadic test failures on systems with 32+ cores
that don't reproduce on smaller core count systems.

Add rte_pause() to all of the synchronization spinloops (seven in
test_atomic.c and eight in test_threads.c) to allow proper CPU
resource sharing and improve memory ordering behavior.
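
For readers outside DPDK, the pattern can be sketched in plain C11.
This is a minimal illustration only: cpu_relax() and spin_wait_flag()
are hypothetical names, cpu_relax() is a stand-in for rte_pause()
rather than its actual implementation, and the spin bound exists only
so the sketch cannot hang.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for rte_pause(): hint that this is a spin-wait. */
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();                    /* x86 PAUSE instruction */
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");    /* arm64 YIELD hint */
#else
	atomic_signal_fence(memory_order_seq_cst); /* compiler barrier only */
#endif
}

/*
 * Poll *flag until it becomes non-zero, relaxing the CPU between polls.
 * Gives up after max_spins polls; returns true if the flag was seen set.
 */
static bool spin_wait_flag(atomic_int *flag, unsigned long max_spins)
{
	assert(flag != NULL);

	for (unsigned long i = 0; i < max_spins; i++) {
		if (atomic_load_explicit(flag, memory_order_acquire) != 0)
			return true;
		cpu_relax();
	}
	return false;
}
```

The loops in the tests are unbounded; the only change in this patch is
the call placed in the loop body.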

Fixes: af75078fece3 ("first public release")
Cc: stable@dpdk.org

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 app/test/test_atomic.c  | 15 ++++++++-------
 app/test/test_threads.c | 17 +++++++++--------
 2 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/app/test/test_atomic.c b/app/test/test_atomic.c
index 8160a33e0e..b1a0d40ece 100644
--- a/app/test/test_atomic.c
+++ b/app/test/test_atomic.c
@@ -15,6 +15,7 @@
 #include <rte_atomic.h>
 #include <rte_eal.h>
 #include <rte_lcore.h>
+#include <rte_pause.h>
 #include <rte_random.h>
 #include <rte_hash_crc.h>
 
@@ -114,7 +115,7 @@ test_atomic_usual(__rte_unused void *arg)
 	unsigned i;
 
 	while (rte_atomic32_read(&synchro) == 0)
-		;
+		rte_pause();
 
 	for (i = 0; i < N; i++)
 		rte_atomic16_inc(&a16);
@@ -150,7 +151,7 @@ static int
 test_atomic_tas(__rte_unused void *arg)
 {
 	while (rte_atomic32_read(&synchro) == 0)
-		;
+		rte_pause();
 
 	if (rte_atomic16_test_and_set(&a16))
 		rte_atomic64_inc(&count);
@@ -171,7 +172,7 @@ test_atomic_addsub_and_return(__rte_unused void *arg)
 	unsigned i;
 
 	while (rte_atomic32_read(&synchro) == 0)
-		;
+		rte_pause();
 
 	for (i = 0; i < N; i++) {
 		tmp16 = rte_atomic16_add_return(&a16, 1);
@@ -210,7 +211,7 @@ static int
 test_atomic_inc_and_test(__rte_unused void *arg)
 {
 	while (rte_atomic32_read(&synchro) == 0)
-		;
+		rte_pause();
 
 	if (rte_atomic16_inc_and_test(&a16)) {
 		rte_atomic64_inc(&count);
@@ -237,7 +238,7 @@ static int
 test_atomic_dec_and_test(__rte_unused void *arg)
 {
 	while (rte_atomic32_read(&synchro) == 0)
-		;
+		rte_pause();
 
 	if (rte_atomic16_dec_and_test(&a16))
 		rte_atomic64_inc(&count);
@@ -269,7 +270,7 @@ test_atomic128_cmp_exchange(__rte_unused void *arg)
 	unsigned int i;
 
 	while (rte_atomic32_read(&synchro) == 0)
-		;
+		rte_pause();
 
 	expected = count128;
 
@@ -407,7 +408,7 @@ test_atomic_exchange(__rte_unused void *arg)
 
 	/* Wait until all of the other threads have been dispatched */
 	while (rte_atomic32_read(&synchro) == 0)
-		;
+		rte_pause();
 
 	/*
 	 * Let the battle begin! Every thread attempts to steal the current
diff --git a/app/test/test_threads.c b/app/test/test_threads.c
index 5cd8bd4559..e2700b4a92 100644
--- a/app/test/test_threads.c
+++ b/app/test/test_threads.c
@@ -7,6 +7,7 @@
 #include <rte_thread.h>
 #include <rte_debug.h>
 #include <rte_stdatomic.h>
+#include <rte_pause.h>
 
 #include "test.h"
 
@@ -23,7 +24,7 @@ thread_main(void *arg)
 	rte_atomic_store_explicit(&thread_id_ready, 1, rte_memory_order_release);
 
 	while (rte_atomic_load_explicit(&thread_id_ready, rte_memory_order_acquire) == 1)
-		;
+		rte_pause();
 
 	return 0;
 }
@@ -39,7 +40,7 @@ test_thread_create_join(void)
 		"Failed to create thread.");
 
 	while (rte_atomic_load_explicit(&thread_id_ready, rte_memory_order_acquire) == 0)
-		;
+		rte_pause();
 
 	RTE_TEST_ASSERT(rte_thread_equal(thread_id, thread_main_id) != 0,
 		"Unexpected thread id.");
@@ -63,7 +64,7 @@ test_thread_create_detach(void)
 		&thread_main_id) == 0, "Failed to create thread.");
 
 	while (rte_atomic_load_explicit(&thread_id_ready, rte_memory_order_acquire) == 0)
-		;
+		rte_pause();
 
 	RTE_TEST_ASSERT(rte_thread_equal(thread_id, thread_main_id) != 0,
 		"Unexpected thread id.");
@@ -87,7 +88,7 @@ test_thread_priority(void)
 		"Failed to create thread");
 
 	while (rte_atomic_load_explicit(&thread_id_ready, rte_memory_order_acquire) == 0)
-		;
+		rte_pause();
 
 	priority = RTE_THREAD_PRIORITY_NORMAL;
 	RTE_TEST_ASSERT(rte_thread_set_priority(thread_id, priority) == 0,
@@ -139,7 +140,7 @@ test_thread_affinity(void)
 		"Failed to create thread");
 
 	while (rte_atomic_load_explicit(&thread_id_ready, rte_memory_order_acquire) == 0)
-		;
+		rte_pause();
 
 	RTE_TEST_ASSERT(rte_thread_get_affinity_by_id(thread_id, &cpuset0) == 0,
 		"Failed to get thread affinity");
@@ -192,7 +193,7 @@ test_thread_attributes_affinity(void)
 		"Failed to create attributes affinity thread.");
 
 	while (rte_atomic_load_explicit(&thread_id_ready, rte_memory_order_acquire) == 0)
-		;
+		rte_pause();
 
 	RTE_TEST_ASSERT(rte_thread_get_affinity_by_id(thread_id, &cpuset1) == 0,
 		"Failed to get attributes thread affinity");
@@ -221,7 +222,7 @@ test_thread_attributes_priority(void)
 		"Failed to create attributes priority thread.");
 
 	while (rte_atomic_load_explicit(&thread_id_ready, rte_memory_order_acquire) == 0)
-		;
+		rte_pause();
 
 	RTE_TEST_ASSERT(rte_thread_get_priority(thread_id, &priority) == 0,
 		"Failed to get thread priority");
@@ -245,7 +246,7 @@ test_thread_control_create_join(void)
 		"Failed to create thread.");
 
 	while (rte_atomic_load_explicit(&thread_id_ready, rte_memory_order_acquire) == 0)
-		;
+		rte_pause();
 
 	RTE_TEST_ASSERT(rte_thread_equal(thread_id, thread_main_id) != 0,
 		"Unexpected thread id.");
-- 
2.51.0



* [PATCH v2 2/6] test: fix timeout for atomic test on high core count systems
  2026-01-20  1:55 ` [PATCH v2 0/6] fix test failures on larger core systems Stephen Hemminger
  2026-01-20  1:55   ` [PATCH v2 1/6] test: add pause to synchronization spinloops Stephen Hemminger
@ 2026-01-20  1:55   ` Stephen Hemminger
  2026-01-21 16:11     ` Bruce Richardson
  2026-01-20  1:55   ` [PATCH v2 3/6] test: fix error handling in ELF load tests Stephen Hemminger
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2026-01-20  1:55 UTC (permalink / raw)
  To: dev; +Cc: Stephen Hemminger, stable

The atomic test uses tight spinloops to synchronize worker threads
and performs a fixed 1,000,000 iterations per worker. This causes
a problem on high core count systems:

With many cores (e.g., 32), the massive contention on shared
atomic variables causes the test to exceed the 10 second timeout.

Scale iterations inversely with core count to maintain roughly
constant test duration regardless of system size.

With 32 cores, iterations drop from 1,000,000 to 31,250 per worker,
which keeps the test well within the timeout while still providing
meaningful coverage.
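
The arithmetic can be checked in isolation. The sketch below reuses
the N_BASE and N_MIN values from the diff; scaled_iterations() is a
hypothetical helper name, and the core count is passed in as a
parameter instead of being read from rte_lcore_count().

```c
#include <assert.h>

#define N_BASE 1000000u  /* iterations per worker on a single core */
#define N_MIN  10000u    /* floor so large systems still get coverage */

/* Iterations per worker: N_BASE / cores, but never below N_MIN. */
static unsigned int scaled_iterations(unsigned int lcore_count)
{
	assert(lcore_count > 0);

	unsigned int n = N_BASE / lcore_count;

	return n > N_MIN ? n : N_MIN;
}
```

With these constants the floor takes effect above 100 cores, so on
larger systems each worker still runs 10,000 iterations and total work
grows with core count again.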

Bugzilla ID: 952
Fixes: af75078fece3 ("first public release")
Cc: stable@dpdk.org

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 app/test/test_atomic.c | 52 ++++++++++++++++++++++++++----------------
 1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/app/test/test_atomic.c b/app/test/test_atomic.c
index b1a0d40ece..ccd8e5d29b 100644
--- a/app/test/test_atomic.c
+++ b/app/test/test_atomic.c
@@ -10,6 +10,7 @@
 #include <sys/queue.h>
 
 #include <rte_memory.h>
+#include <rte_common.h>
 #include <rte_per_lcore.h>
 #include <rte_launch.h>
 #include <rte_atomic.h>
@@ -101,7 +102,15 @@
 
 #define NUM_ATOMIC_TYPES 3
 
-#define N 1000000
+#define N_BASE 1000000u
+#define N_MIN  10000u
+
+/*
+ * Number of iterations for each test, scaled inversely with core count.
+ * More cores means more contention which increases time per operation.
+ * Calculated once at test start to avoid repeated computation in workers.
+ */
+static unsigned int num_iterations;
 
 static rte_atomic16_t a16;
 static rte_atomic32_t a32;
@@ -112,36 +121,36 @@ static rte_atomic32_t synchro;
 static int
 test_atomic_usual(__rte_unused void *arg)
 {
-	unsigned i;
+	unsigned int i;
 
 	while (rte_atomic32_read(&synchro) == 0)
 		rte_pause();
 
-	for (i = 0; i < N; i++)
+	for (i = 0; i < num_iterations; i++)
 		rte_atomic16_inc(&a16);
-	for (i = 0; i < N; i++)
+	for (i = 0; i < num_iterations; i++)
 		rte_atomic16_dec(&a16);
-	for (i = 0; i < (N / 5); i++)
+	for (i = 0; i < (num_iterations / 5); i++)
 		rte_atomic16_add(&a16, 5);
-	for (i = 0; i < (N / 5); i++)
+	for (i = 0; i < (num_iterations / 5); i++)
 		rte_atomic16_sub(&a16, 5);
 
-	for (i = 0; i < N; i++)
+	for (i = 0; i < num_iterations; i++)
 		rte_atomic32_inc(&a32);
-	for (i = 0; i < N; i++)
+	for (i = 0; i < num_iterations; i++)
 		rte_atomic32_dec(&a32);
-	for (i = 0; i < (N / 5); i++)
+	for (i = 0; i < (num_iterations / 5); i++)
 		rte_atomic32_add(&a32, 5);
-	for (i = 0; i < (N / 5); i++)
+	for (i = 0; i < (num_iterations / 5); i++)
 		rte_atomic32_sub(&a32, 5);
 
-	for (i = 0; i < N; i++)
+	for (i = 0; i < num_iterations; i++)
 		rte_atomic64_inc(&a64);
-	for (i = 0; i < N; i++)
+	for (i = 0; i < num_iterations; i++)
 		rte_atomic64_dec(&a64);
-	for (i = 0; i < (N / 5); i++)
+	for (i = 0; i < (num_iterations / 5); i++)
 		rte_atomic64_add(&a64, 5);
-	for (i = 0; i < (N / 5); i++)
+	for (i = 0; i < (num_iterations / 5); i++)
 		rte_atomic64_sub(&a64, 5);
 
 	return 0;
@@ -169,12 +178,12 @@ test_atomic_addsub_and_return(__rte_unused void *arg)
 	uint32_t tmp16;
 	uint32_t tmp32;
 	uint64_t tmp64;
-	unsigned i;
+	unsigned int i;
 
 	while (rte_atomic32_read(&synchro) == 0)
 		rte_pause();
 
-	for (i = 0; i < N; i++) {
+	for (i = 0; i < num_iterations; i++) {
 		tmp16 = rte_atomic16_add_return(&a16, 1);
 		rte_atomic64_add(&count, tmp16);
 
@@ -274,7 +283,7 @@ test_atomic128_cmp_exchange(__rte_unused void *arg)
 
 	expected = count128;
 
-	for (i = 0; i < N; i++) {
+	for (i = 0; i < num_iterations; i++) {
 		do {
 			rte_int128_t desired;
 
@@ -401,7 +410,7 @@ get_crc8(uint8_t *message, int length)
 static int
 test_atomic_exchange(__rte_unused void *arg)
 {
-	int i;
+	unsigned int i;
 	test16_t nt16, ot16; /* new token, old token */
 	test32_t nt32, ot32;
 	test64_t nt64, ot64;
@@ -417,7 +426,7 @@ test_atomic_exchange(__rte_unused void *arg)
 	 * appropriate crc32 hash for the data) then the test iteration has
 	 * passed.  If the token is invalid, increment the counter.
 	 */
-	for (i = 0; i < N; i++) {
+	for (i = 0; i < num_iterations; i++) {
 
 		/* Test 64bit Atomic Exchange */
 		nt64.u64 = rte_rand();
@@ -446,6 +455,9 @@ test_atomic_exchange(__rte_unused void *arg)
 static int
 test_atomic(void)
 {
+	/* Scale iterations by number of cores to keep test duration reasonable */
+	num_iterations = RTE_MAX(N_BASE / rte_lcore_count(), N_MIN);
+
 	rte_atomic16_init(&a16);
 	rte_atomic32_init(&a32);
 	rte_atomic64_init(&a64);
@@ -593,7 +605,7 @@ test_atomic(void)
 	rte_atomic32_clear(&synchro);
 
 	iterations = count128.val[0] - count128.val[1];
-	if (iterations != (uint64_t)4*N*(rte_lcore_count()-1)) {
+	if (iterations != (uint64_t)4*num_iterations*(rte_lcore_count()-1)) {
 		printf("128-bit compare and swap failed\n");
 		return -1;
 	}
-- 
2.51.0



* [PATCH v2 3/6] test: fix error handling in ELF load tests
  2026-01-20  1:55 ` [PATCH v2 0/6] fix test failures on larger core systems Stephen Hemminger
  2026-01-20  1:55   ` [PATCH v2 1/6] test: add pause to synchronization spinloops Stephen Hemminger
  2026-01-20  1:55   ` [PATCH v2 2/6] test: fix timeout for atomic test on high core count systems Stephen Hemminger
@ 2026-01-20  1:55   ` Stephen Hemminger
  2026-01-20 12:08     ` Marat Khalili
  2026-01-20  1:55   ` [PATCH v2 4/6] test: fix unsupported BPF instructions in elf load test Stephen Hemminger
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2026-01-20  1:55 UTC (permalink / raw)
  To: dev; +Cc: Stephen Hemminger, stable

Address related issues found during review:
- Add missing TEST_ASSERT for mempool creation in test_bpf_elf_tx_load
- Initialize port variable in test_bpf_elf_rx_load to avoid undefined
  behavior in cleanup path if null_vdev_setup fails early
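
The second point is the usual sentinel pattern for goto-based cleanup.
A minimal sketch follows, with hypothetical helpers
(fake_vdev_setup(), fake_vdev_teardown(), run_case()) standing in for
the real test code. How the actual cleanup path in test_bpf.c treats
the value is not shown in this patch, so the check under the cleanup
label is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PORT_INVALID UINT16_MAX  /* sentinel: no port was set up */

static int teardown_calls;       /* counts cleanup calls, for illustration */

/* Hypothetical setup helper: fails without writing *port when asked to. */
static int fake_vdev_setup(uint16_t *port, bool fail)
{
	if (fail)
		return -1;
	*port = 3;
	return 0;
}

/* Hypothetical teardown helper: must only see a port that was set up. */
static void fake_vdev_teardown(uint16_t port)
{
	assert(port != PORT_INVALID);
	teardown_calls++;
}

static int run_case(bool fail_setup)
{
	uint16_t port = PORT_INVALID;  /* defined value even if setup fails */
	int ret;

	ret = fake_vdev_setup(&port, fail_setup);
	if (ret != 0)
		goto cleanup;

	/* ... the body of the test would run here ... */

cleanup:
	if (port != PORT_INVALID)
		fake_vdev_teardown(port);
	return ret;
}
```

Without the initializer, the comparison under the cleanup label would
read an indeterminate value whenever setup fails early.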

Fixes: cf1e03f881af ("test/bpf: add ELF loading")
Cc: stable@dpdk.org

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 app/test/test_bpf.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/app/test/test_bpf.c b/app/test/test_bpf.c
index a7d56f8d86..0e969f9f13 100644
--- a/app/test/test_bpf.c
+++ b/app/test/test_bpf.c
@@ -3580,6 +3580,7 @@ test_bpf_elf_tx_load(void)
 	mb_pool = rte_pktmbuf_pool_create("bpf_tx_test_pool", BPF_TEST_POOLSIZE,
 					  0, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
 					  SOCKET_ID_ANY);
+	TEST_ASSERT(mb_pool != NULL, "failed to create mempool");
 
 	ret = null_vdev_setup(null_dev, &port, mb_pool);
 	if (ret != 0)
@@ -3664,7 +3665,7 @@ test_bpf_elf_rx_load(void)
 	static const char null_dev[] = "net_null_bpf0";
 	struct rte_mempool *pool = NULL;
 	char *tmpfile = NULL;
-	uint16_t port;
+	uint16_t port = UINT16_MAX;
 	int ret;
 
 	printf("%s start\n", __func__);
-- 
2.51.0



* [PATCH v2 4/6] test: fix unsupported BPF instructions in elf load test
  2026-01-20  1:55 ` [PATCH v2 0/6] fix test failures on larger core systems Stephen Hemminger
                     ` (2 preceding siblings ...)
  2026-01-20  1:55   ` [PATCH v2 3/6] test: fix error handling in ELF load tests Stephen Hemminger
@ 2026-01-20  1:55   ` Stephen Hemminger
  2026-01-21 16:20     ` Bruce Richardson
  2026-01-20  1:55   ` [PATCH v2 5/6] test: add file-prefix for all fast-tests on Linux Stephen Hemminger
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2026-01-20  1:55 UTC (permalink / raw)
  To: dev; +Cc: Stephen Hemminger, stable, Marat Khalili

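The DPDK BPF library only handles the base BPF instructions.
It does not handle JMP32, which clang 20 and later generate, so the
bpf_elf_load test fails when built with those compilers. Build the
test programs with -mcpu=v2 to restrict them to instructions that
DPDK BPF supports.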
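
To see which class an instruction belongs to, look at the low three
bits of its opcode byte: 0x05 is the 64-bit jump class and 0x06 is
JMP32. The sketch below scans a program for JMP32 instructions. It is
an illustration only, not the DPDK validator; struct insn,
INSN_CLASS() and find_jmp32() are hypothetical names, and the two
register nibbles are folded into one byte for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INSN_CLASS(code) ((code) & 0x07)
#define CLASS_JMP   0x05  /* 64-bit compare-and-jump class */
#define CLASS_JMP32 0x06  /* 32-bit compare-and-jump class */

/* One eBPF instruction (8 bytes). */
struct insn {
	uint8_t code;  /* opcode; low three bits select the class */
	uint8_t regs;  /* destination and source register nibbles */
	int16_t off;   /* jump or memory offset */
	int32_t imm;   /* immediate operand */
};

/* Return the index of the first JMP32-class instruction, or -1 if none. */
static long find_jmp32(const struct insn *ins, size_t n)
{
	assert(ins != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (INSN_CLASS(ins[i].code) == CLASS_JMP32)
			return (long)i;
	}
	return -1;
}
```

For example, opcode 0x15 (a 64-bit jump-if-equal against an immediate)
is in class 0x05, while 0x16 is the same comparison in the JMP32
class.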

Bugzilla ID: 1844
Fixes: cf1e03f881af ("test/bpf: add ELF loading")
Cc: stable@dpdk.org

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Marat Khalili <marat.khalili@huawei.com>
---
 app/test/bpf/meson.build | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/app/test/bpf/meson.build b/app/test/bpf/meson.build
index aaecfa7018..91c1b434f8 100644
--- a/app/test/bpf/meson.build
+++ b/app/test/bpf/meson.build
@@ -24,7 +24,8 @@ if not xxd.found()
 endif
 
 # BPF compiler flags
-bpf_cflags = [ '-O2', '-target', 'bpf', '-g', '-c']
+# At present: DPDK BPF does not support v3 or later
+bpf_cflags = [ '-O2', '-target', 'bpf', '-mcpu=v2', '-g', '-c']
 
 # Enable test in test_bpf.c
 cflags += '-DTEST_BPF_ELF_LOAD'
-- 
2.51.0



* [PATCH v2 5/6] test: add file-prefix for all fast-tests on Linux
  2026-01-20  1:55 ` [PATCH v2 0/6] fix test failures on larger core systems Stephen Hemminger
                     ` (3 preceding siblings ...)
  2026-01-20  1:55   ` [PATCH v2 4/6] test: fix unsupported BPF instructions in elf load test Stephen Hemminger
@ 2026-01-20  1:55   ` Stephen Hemminger
  2026-01-21 16:22     ` Bruce Richardson
  2026-01-20  1:55   ` [PATCH v2 6/6] test: fix trace_autotest_with_traces parallel execution Stephen Hemminger
  2026-01-21 16:31   ` [PATCH v2 0/6] fix test failures on larger core systems Bruce Richardson
  6 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2026-01-20  1:55 UTC (permalink / raw)
  To: dev; +Cc: Stephen Hemminger, stable, Marat Khalili

When running tests in parallel on systems with many cores, multiple test
processes collide on the default "rte" file-prefix, causing EAL
initialization failures:

  EAL: Cannot allocate memzone list: Device or resource busy
  EAL: Cannot init memzone

This occurs because all DPDK tests (including --no-huge tests) use
file-backed arrays for memzone tracking. These files are created at
/var/run/dpdk/<prefix>/fbarray_memzone and require exclusive locking
during initialization. When multiple tests run in parallel with the
same file-prefix, they compete for this lock.

The original implementation included --file-prefix for Linux to
prevent this collision. This was later removed during test
infrastructure refactoring.

Restore the --file-prefix argument for all fast-tests on Linux,
regardless of whether they use hugepages. Tests that exercise
file-prefix functionality (like eal_flags_file_prefix_autotest)
spawn child processes with their own hardcoded prefixes and use
get_current_prefix() to verify the parent's resources, so they work
correctly regardless of what prefix the parent process uses.
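
For reference, the argument built in meson.build is the test name with
every character that is not an ASCII letter or digit replaced by '_',
which is what Meson's underscorify() does. A C sketch of the same
transformation is below; make_prefix_arg() is a hypothetical helper
written for illustration and is not part of the test suite.

```c
#include <assert.h>
#include <string.h>

/*
 * Write "--file-prefix=<name>" into out, replacing every character of
 * name that is not an ASCII letter or digit with '_'.
 * Returns 0 on success, -1 if out is too small.
 */
static int make_prefix_arg(const char *name, char *out, size_t out_len)
{
	static const char opt[] = "--file-prefix=";
	const size_t opt_len = sizeof(opt) - 1;
	size_t name_len;

	assert(name != NULL && out != NULL);

	name_len = strlen(name);
	if (opt_len + name_len + 1 > out_len)
		return -1;

	memcpy(out, opt, opt_len);
	for (size_t i = 0; i < name_len; i++) {
		char c = name[i];
		int keep = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
			   (c >= '0' && c <= '9');

		out[opt_len + i] = keep ? c : '_';
	}
	out[opt_len + name_len] = '\0';
	return 0;
}
```

Because test names are unique within the suite, each test gets its own
runtime directory and no longer contends for the same fbarray lock.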

Fixes: 50823f30f0c8 ("test: build using per-file dependencies")
Cc: stable@dpdk.org

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Marat Khalili <marat.khalili@huawei.com>
---
 app/test/suites/meson.build | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/app/test/suites/meson.build b/app/test/suites/meson.build
index 1010150eee..4c815ea097 100644
--- a/app/test/suites/meson.build
+++ b/app/test/suites/meson.build
@@ -85,11 +85,15 @@ foreach suite:test_suites
             if nohuge
                 test_args += test_no_huge_args
             elif not has_hugepage
-                continue  #skip this tests
+                continue  # skip this test
             endif
             if not asan and get_option('b_sanitize').contains('address')
                 continue  # skip this test
             endif
+            if is_linux
+                # use unique file-prefix to allow parallel runs
+                test_args += ['--file-prefix=' + test_name.underscorify()]
+            endif
 
             if get_option('default_library') == 'shared'
                 test_args += ['-d', dpdk_drivers_build_dir]
-- 
2.51.0



* [PATCH v2 6/6] test: fix trace_autotest_with_traces parallel execution
  2026-01-20  1:55 ` [PATCH v2 0/6] fix test failures on larger core systems Stephen Hemminger
                     ` (4 preceding siblings ...)
  2026-01-20  1:55   ` [PATCH v2 5/6] test: add file-prefix for all fast-tests on Linux Stephen Hemminger
@ 2026-01-20  1:55   ` Stephen Hemminger
  2026-01-21 16:29     ` Bruce Richardson
  2026-01-21 16:31   ` [PATCH v2 0/6] fix test failures on larger core systems Bruce Richardson
  6 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2026-01-20  1:55 UTC (permalink / raw)
  To: dev; +Cc: Stephen Hemminger, stable

The trace_autotest_with_traces test needs a unique file-prefix to avoid
collisions when running in parallel with other tests.

Rather than duplicating test argument construction, restructure to add
file-prefix as the last step. This allows reusing test_args for the
trace variant by concatenating the trace-specific arguments and a
different file-prefix at the end.

Fixes: 0aeaf75df879 ("test: define unit tests suites based on test types")
Cc: stable@dpdk.org

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 app/test/suites/meson.build | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/app/test/suites/meson.build b/app/test/suites/meson.build
index 4c815ea097..fdc0b77149 100644
--- a/app/test/suites/meson.build
+++ b/app/test/suites/meson.build
@@ -90,26 +90,33 @@ foreach suite:test_suites
             if not asan and get_option('b_sanitize').contains('address')
                 continue  # skip this test
             endif
-            if is_linux
-                # use unique file-prefix to allow parallel runs
-                test_args += ['--file-prefix=' + test_name.underscorify()]
-            endif
-
             if get_option('default_library') == 'shared'
                 test_args += ['-d', dpdk_drivers_build_dir]
             endif
 
+            # use unique file-prefix to allow parallel runs
+            if is_linux
+                file_prefix = ['--file-prefix=' + test_name.underscorify()]
+            else
+                file_prefix = []
+            endif
+
             test(test_name, dpdk_test,
-                args : test_args,
+                args : test_args + file_prefix,
                 env: ['DPDK_TEST=' + test_name],
                 timeout : timeout_seconds_fast,
                 is_parallel : false,
                 suite : 'fast-tests')
             if not is_windows and test_name == 'trace_autotest'
-                test_args += ['--trace=.*']
-                test_args += ['--trace-dir=@0@'.format(meson.current_build_dir())]
+                trace_extra = ['--trace=.*',
+                               '--trace-dir=@0@'.format(meson.current_build_dir())]
+                if is_linux
+                    trace_prefix = ['--file-prefix=trace_autotest_with_traces']
+                else
+                    trace_prefix = []
+                endif
                 test(test_name + '_with_traces', dpdk_test,
-                    args : test_args,
+                    args : test_args + trace_extra + trace_prefix,
                     env: ['DPDK_TEST=' + test_name],
                     timeout : timeout_seconds_fast,
                     is_parallel : false,
-- 
2.51.0



* RE: [PATCH v2 3/6] test: fix error handling in ELF load tests
  2026-01-20  1:55   ` [PATCH v2 3/6] test: fix error handling in ELF load tests Stephen Hemminger
@ 2026-01-20 12:08     ` Marat Khalili
  0 siblings, 0 replies; 14+ messages in thread
From: Marat Khalili @ 2026-01-20 12:08 UTC (permalink / raw)
  To: Stephen Hemminger, dev@dpdk.org; +Cc: stable@dpdk.org

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday 20 January 2026 01:55
> To: dev@dpdk.org
> Cc: Stephen Hemminger <stephen@networkplumber.org>; stable@dpdk.org
> Subject: [PATCH v2 3/6] test: fix error handling in ELF load tests
> 
> Address related issues found during review:
> - Add missing TEST_ASSERT for mempool creation in test_bpf_elf_tx_load
> - Initialize port variable in test_bpf_elf_rx_load to avoid undefined
>   behavior in cleanup path if null_vdev_setup fails early
> 
> Fixes: cf1e03f881af ("test/bpf: add ELF loading")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
>  app/test/test_bpf.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/app/test/test_bpf.c b/app/test/test_bpf.c
> index a7d56f8d86..0e969f9f13 100644
> --- a/app/test/test_bpf.c
> +++ b/app/test/test_bpf.c
> @@ -3580,6 +3580,7 @@ test_bpf_elf_tx_load(void)
>  	mb_pool = rte_pktmbuf_pool_create("bpf_tx_test_pool", BPF_TEST_POOLSIZE,
>  					  0, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
>  					  SOCKET_ID_ANY);
> +	TEST_ASSERT(mb_pool != NULL, "failed to create mempool");
> 
>  	ret = null_vdev_setup(null_dev, &port, mb_pool);
>  	if (ret != 0)
> @@ -3664,7 +3665,7 @@ test_bpf_elf_rx_load(void)
>  	static const char null_dev[] = "net_null_bpf0";
>  	struct rte_mempool *pool = NULL;
>  	char *tmpfile = NULL;
> -	uint16_t port;
> +	uint16_t port = UINT16_MAX;
>  	int ret;
> 
>  	printf("%s start\n", __func__);
> --
> 2.51.0
> 

Acked-by: Marat Khalili <marat.khalili@huawei.com>


* Re: [PATCH v2 1/6] test: add pause to synchronization spinloops
  2026-01-20  1:55   ` [PATCH v2 1/6] test: add pause to synchronization spinloops Stephen Hemminger
@ 2026-01-21 16:10     ` Bruce Richardson
  0 siblings, 0 replies; 14+ messages in thread
From: Bruce Richardson @ 2026-01-21 16:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, stable

On Mon, Jan 19, 2026 at 05:55:04PM -0800, Stephen Hemminger wrote:
> The atomic and thread tests use tight spinloops to synchronize.
> These spinloops lack rte_pause() which causes problems on high core
> count systems, particularly AMD Zen architectures where:
> 
> - Tight spinloops without pause can starve SMT sibling threads
> - Memory ordering and store-buffer forwarding behave differently
> - Higher core counts amplify timing windows for race conditions
> 
> This manifests as sporadic test failures on systems with 32+ cores
> that don't reproduce on smaller core count systems.
> 
> Add rte_pause() to all of the synchronization spinloops (seven in
> test_atomic.c and eight in test_threads.c) to allow proper CPU
> resource sharing and improve memory ordering behavior.
> 
> Fixes: af75078fece3 ("first public release")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
>  app/test/test_atomic.c  | 15 ++++++++-------
>  app/test/test_threads.c | 17 +++++++++--------
>  2 files changed, 17 insertions(+), 15 deletions(-)
> 
Acked-by: Bruce Richardson <bruce.richardson@intel.com>



* Re: [PATCH v2 2/6] test: fix timeout for atomic test on high core count systems
  2026-01-20  1:55   ` [PATCH v2 2/6] test: fix timeout for atomic test on high core count systems Stephen Hemminger
@ 2026-01-21 16:11     ` Bruce Richardson
  0 siblings, 0 replies; 14+ messages in thread
From: Bruce Richardson @ 2026-01-21 16:11 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, stable

On Mon, Jan 19, 2026 at 05:55:05PM -0800, Stephen Hemminger wrote:
> The atomic test uses tight spinloops to synchronize worker threads
> and performs a fixed 1,000,000 iterations per worker. This causes
> a problem on high core count systems:
> 
> With many cores (e.g., 32), the massive contention on shared
> atomic variables causes the test to exceed the 10 second timeout.
> 
> Scale iterations inversely with core count to maintain roughly
> constant test duration regardless of system size.
> 
> With 32 cores, iterations drop from 1,000,000 to 31,250 per worker,
> which keeps the test well within the timeout while still providing
> meaningful coverage.
> 
> Bugzilla ID: 952
> Fixes: af75078fece3 ("first public release")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
>  app/test/test_atomic.c | 52 ++++++++++++++++++++++++++----------------
>  1 file changed, 32 insertions(+), 20 deletions(-)
> 
Tested-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>

Tested this on a system with 96 cores, and the test no longer times out
or fails for me.


* Re: [PATCH v2 4/6] test: fix unsupported BPF instructions in elf load test
  2026-01-20  1:55   ` [PATCH v2 4/6] test: fix unsupported BPF instructions in elf load test Stephen Hemminger
@ 2026-01-21 16:20     ` Bruce Richardson
  0 siblings, 0 replies; 14+ messages in thread
From: Bruce Richardson @ 2026-01-21 16:20 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, stable, Marat Khalili

On Mon, Jan 19, 2026 at 05:55:07PM -0800, Stephen Hemminger wrote:
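> The DPDK BPF library only handles the base BPF instructions.
> It does not handle JMP32, which clang 20 and later generate, so the
> bpf_elf_load test fails when built with those compilers. Build the
> test programs with -mcpu=v2 to restrict them to instructions that
> DPDK BPF supports.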
> 
> Bugzilla ID: 1844
> Fixes: cf1e03f881af ("test/bpf: add ELF loading")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> Acked-by: Marat Khalili <marat.khalili@huawei.com>
> ---
>  app/test/bpf/meson.build | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/app/test/bpf/meson.build b/app/test/bpf/meson.build
> index aaecfa7018..91c1b434f8 100644
> --- a/app/test/bpf/meson.build
> +++ b/app/test/bpf/meson.build
> @@ -24,7 +24,8 @@ if not xxd.found()
>  endif
>  
>  # BPF compiler flags
> -bpf_cflags = [ '-O2', '-target', 'bpf', '-g', '-c']
> +# At present: DPDK BPF does not support v3 or later
> +bpf_cflags = [ '-O2', '-target', 'bpf', '-mcpu=v2', '-g', '-c']
>  
>  # Enable test in test_bpf.c
>  cflags += '-DTEST_BPF_ELF_LOAD'
> -- 

One small additional thing in the bpf autotest is that the test fails if
the net/null driver is disabled. It would be good if it were reported as
skipped in that case.

/Bruce


* Re: [PATCH v2 5/6] test: add file-prefix for all fast-tests on Linux
  2026-01-20  1:55   ` [PATCH v2 5/6] test: add file-prefix for all fast-tests on Linux Stephen Hemminger
@ 2026-01-21 16:22     ` Bruce Richardson
  0 siblings, 0 replies; 14+ messages in thread
From: Bruce Richardson @ 2026-01-21 16:22 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, stable, Marat Khalili

On Mon, Jan 19, 2026 at 05:55:08PM -0800, Stephen Hemminger wrote:
> When running tests in parallel on systems with many cores, multiple test
> processes collide on the default "rte" file-prefix, causing EAL
> initialization failures:
> 
>   EAL: Cannot allocate memzone list: Device or resource busy
>   EAL: Cannot init memzone
> 
> This occurs because all DPDK tests (including --no-huge tests) use
> file-backed arrays for memzone tracking. These files are created at
> /var/run/dpdk/<prefix>/fbarray_memzone and require exclusive locking
> during initialization. When multiple tests run in parallel with the
> same file-prefix, they compete for this lock.
> 
> The original implementation included --file-prefix for Linux to
> prevent this collision. This was later removed during test
> infrastructure refactoring.
> 
> Restore the --file-prefix argument for all fast-tests on Linux,
> regardless of whether they use hugepages. Tests that exercise
> file-prefix functionality (like eal_flags_file_prefix_autotest)
> spawn child processes with their own hardcoded prefixes and use
> get_current_prefix() to verify the parent's resources, so they work
> correctly regardless of what prefix the parent process uses.
> 
> Fixes: 50823f30f0c8 ("test: build using per-file dependencies")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> Acked-by: Marat Khalili <marat.khalili@huawei.com>
> ---
>  app/test/suites/meson.build | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/app/test/suites/meson.build b/app/test/suites/meson.build
> index 1010150eee..4c815ea097 100644
> --- a/app/test/suites/meson.build
> +++ b/app/test/suites/meson.build
> @@ -85,11 +85,15 @@ foreach suite:test_suites
>              if nohuge
>                  test_args += test_no_huge_args
>              elif not has_hugepage
> -                continue  #skip this tests
> +                continue  # skip this test
>              endif
>              if not asan and get_option('b_sanitize').contains('address')
>                  continue  # skip this test
>              endif
> +            if is_linux
> +                # use unique file-prefix to allow parallel runs
> +                test_args += ['--file-prefix=' + test_name.underscorify()]
> +            endif
>  

No harm in this, even though I suspect parallel runs may hit other issues.

Acked-by: Bruce Richardson <bruce.richardson@intel.com>
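
For readers unfamiliar with Meson's underscorify(), here is a rough C
sketch of what the per-test argument ends up looking like. It assumes
underscorify() replaces every character that is not an ASCII letter or
digit with '_'. The function name make_file_prefix_arg() is hypothetical
and does not exist in DPDK.

```c
#include <ctype.h>
#include <stdio.h>

/*
 * Build "--file-prefix=<name>" where every character of the test name
 * that is not a letter or digit is replaced by '_'.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int
make_file_prefix_arg(const char *test_name, char *out, size_t out_len)
{
	static const char opt[] = "--file-prefix=";
	int n = snprintf(out, out_len, "%s%s", opt, test_name);

	if (n < 0 || (size_t)n >= out_len)
		return -1;

	/* Only rewrite the part after the option name itself. */
	for (char *p = out + sizeof(opt) - 1; *p != '\0'; p++) {
		if (!isalnum((unsigned char)*p))
			*p = '_';
	}
	return 0;
}
```

Since each fast-test has a distinct name, each one gets its own runtime
directory and therefore its own fbarray files, which is what avoids the
lock collision described in the commit message.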



* Re: [PATCH v2 6/6] test: fix trace_autotest_with_traces parallel execution
  2026-01-20  1:55   ` [PATCH v2 6/6] test: fix trace_autotest_with_traces parallel execution Stephen Hemminger
@ 2026-01-21 16:29     ` Bruce Richardson
  0 siblings, 0 replies; 14+ messages in thread
From: Bruce Richardson @ 2026-01-21 16:29 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, stable

On Mon, Jan 19, 2026 at 05:55:09PM -0800, Stephen Hemminger wrote:
> The trace_autotest_with_traces test needs a unique file-prefix to avoid
> collisions when running in parallel with other tests.
> 
> Rather than duplicating test argument construction, restructure to add
> file-prefix as the last step. This allows reusing test_args for the
> trace variant by concatenating the trace-specific arguments and a
> different file-prefix at the end.
> 
> Fixes: 0aeaf75df879 ("test: define unit tests suites based on test types")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
>  app/test/suites/meson.build | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/app/test/suites/meson.build b/app/test/suites/meson.build
> index 4c815ea097..fdc0b77149 100644
> --- a/app/test/suites/meson.build
> +++ b/app/test/suites/meson.build
> @@ -90,26 +90,33 @@ foreach suite:test_suites
>              if not asan and get_option('b_sanitize').contains('address')
>                  continue  # skip this test
>              endif
> -            if is_linux
> -                # use unique file-prefix to allow parallel runs
> -                test_args += ['--file-prefix=' + test_name.underscorify()]
> -            endif
> -
>              if get_option('default_library') == 'shared'
>                  test_args += ['-d', dpdk_drivers_build_dir]
>              endif
>  
> +            # use unique file-prefix to allow parallel runs
> +            if is_linux
> +                file_prefix = ['--file-prefix=' + test_name.underscorify()]
> +            else
> +                file_prefix = []
> +            endif
> +

I would tend to shorten this, and merge generating a trace prefix into it,
to avoid multiple if-else branches:

file_prefix = []
trace_file_prefix = []
if is_linux
    file_prefix = ['--file-prefix=' + test_name.underscorify()]
    trace_file_prefix = [file_prefix[0] + '_with_traces']
endif

>              test(test_name, dpdk_test,
> -                args : test_args,
> +                args : test_args + file_prefix,
>                  env: ['DPDK_TEST=' + test_name],
>                  timeout : timeout_seconds_fast,
>                  is_parallel : false,
>                  suite : 'fast-tests')
>              if not is_windows and test_name == 'trace_autotest'
> -                test_args += ['--trace=.*']
> -                test_args += ['--trace-dir=@0@'.format(meson.current_build_dir())]
> +                trace_extra = ['--trace=.*',
> +                               '--trace-dir=@0@'.format(meson.current_build_dir())]
> +                if is_linux
> +                    trace_prefix = ['--file-prefix=trace_autotest_with_traces']
> +                else
> +                    trace_prefix = []
> +                endif
>                  test(test_name + '_with_traces', dpdk_test,
> -                    args : test_args,
> +                    args : test_args + trace_extra + trace_prefix,
>                      env: ['DPDK_TEST=' + test_name],
>                      timeout : timeout_seconds_fast,
>                      is_parallel : false,
> -- 
> 2.51.0
> 


* Re: [PATCH v2 0/6] fix test failures on larger core systems
  2026-01-20  1:55 ` [PATCH v2 0/6] fix test failures on larger core systems Stephen Hemminger
                     ` (5 preceding siblings ...)
  2026-01-20  1:55   ` [PATCH v2 6/6] test: fix trace_autotest_with_traces parallel execution Stephen Hemminger
@ 2026-01-21 16:31   ` Bruce Richardson
  6 siblings, 0 replies; 14+ messages in thread
From: Bruce Richardson @ 2026-01-21 16:31 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On Mon, Jan 19, 2026 at 05:55:03PM -0800, Stephen Hemminger wrote:
> This series addresses several test failures that occur sporadically on
> systems with many cores (32+), particularly on AMD Zen architectures.
> I think Ferruh may have addressed similar problems in earlier
> releases.
> 
> The root causes fall into three categories:
> 
> 1. Missing rte_pause() in synchronization spinloops (patch 1)
>    Tight spinloops without pause cause SMT thread starvation and
>    unpredictable timing behavior.
> 
> 2. Fixed iteration counts that don't scale (patch 2)
>    The atomic test performs 1M iterations per worker regardless of
>    core count. With 32+ cores, contention causes timeout failures.
> 
Testing on a 96-core part, I still see timeouts (with -t2, so 20 seconds
allowed) in the mcslock, stack, stack_lf and timer tests. Limiting core
counts for those makes the failures go away, so it's likely the same issue
as you solved here.

/Bruce
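
One possible shape for scaling per-worker work with core count, as a
sketch only. This is not the formula used in patch 2, and the function
name scaled_iterations() and all of its parameters are hypothetical. The
idea is to keep the total amount of contended work roughly constant above
a reference worker count, with a floor so that each worker still does a
meaningful number of iterations.

```c
#include <stdint.h>

/*
 * Return the per-worker iteration count.
 *  - Up to ref_workers, every worker runs the full base count.
 *  - Above that, the count shrinks in proportion to the worker count,
 *    but never drops below min_iters.
 */
static uint32_t
scaled_iterations(uint32_t base, unsigned int ref_workers,
		  unsigned int workers, uint32_t min_iters)
{
	uint64_t scaled;

	if (workers <= ref_workers)
		return base;

	/* 64-bit intermediate avoids overflow in base * ref_workers. */
	scaled = (uint64_t)base * ref_workers / workers;
	return scaled < min_iters ? min_iters : (uint32_t)scaled;
}
```

With a base of 1M iterations and a reference of 32 workers, a 96-core
system would run about a third of the iterations per worker. Whether that
is enough to keep mcslock, stack, stack_lf and timer under the timeout
would need to be measured.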


end of thread, other threads:[~2026-01-21 16:31 UTC | newest]

Thread overview: 14+ messages
     [not found] <0260118201223.323024-1-stephen@networkplumber.org>
2026-01-20  1:55 ` [PATCH v2 0/6] fix test failures on larger core systems Stephen Hemminger
2026-01-20  1:55   ` [PATCH v2 1/6] test: add pause to synchronization spinloops Stephen Hemminger
2026-01-21 16:10     ` Bruce Richardson
2026-01-20  1:55   ` [PATCH v2 2/6] test: fix timeout for atomic test on high core count systems Stephen Hemminger
2026-01-21 16:11     ` Bruce Richardson
2026-01-20  1:55   ` [PATCH v2 3/6] test: fix error handling in ELF load tests Stephen Hemminger
2026-01-20 12:08     ` Marat Khalili
2026-01-20  1:55   ` [PATCH v2 4/6] test: fix unsupported BPF instructions in elf load test Stephen Hemminger
2026-01-21 16:20     ` Bruce Richardson
2026-01-20  1:55   ` [PATCH v2 5/6] test: add file-prefix for all fast-tests on Linux Stephen Hemminger
2026-01-21 16:22     ` Bruce Richardson
2026-01-20  1:55   ` [PATCH v2 6/6] test: fix trace_autotest_with_traces parallel execution Stephen Hemminger
2026-01-21 16:29     ` Bruce Richardson
2026-01-21 16:31   ` [PATCH v2 0/6] fix test failures on larger core systems Bruce Richardson
