* [PATCH] ring: avoid extra store at move head
@ 2026-06-01 18:15 Konstantin Ananyev
2026-06-01 22:23 ` Stephen Hemminger
0 siblings, 1 reply; 4+ messages in thread
From: Konstantin Ananyev @ 2026-06-01 18:15 UTC (permalink / raw)
To: dev; +Cc: stephen
C11 __rte_ring_headtail_move_head_mt() uses output
parameter: 'uint32_t *old_head' directly within CAS operation.
In x86_64 that cause gcc to generate extra instructions to
store return value of CAS (eax) within 'old_head' memory location,
even when CAS was not successful and another attempt should be
performed. In some cases, even extra branch can be observed.
To be more specific the code like that is generated:
// start of 'do { } while();' loop
.L2
...
lock cmpxchgl %r8d, (%rdi)
jne .L17 //
.L1: // <---- successful completion of CAS, finish
movl %edx, %eax
ret
.L17: // <---- unsuccessful completion of CAS, repeat
movl %eax, (%r9)
jmp .L2
In constrast, x86 specific version that uses
__sync_bool_compare_and_swap() doesn't exibit such problem,
as __sync_bool_compare_and_swap() doesn't update the 'old_head'
with new value, and we have to re-read it explicitly on each iteration.
Overcome that problem by using local variable 'head' inside the loop,
and updaing '*old_head' value only at exit.
With such change gcc manages to avoid extra store(/branch).
Depends-on: series-38225 ("deprecate rte_atomicNN family")
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
---
lib/ring/rte_ring_c11_pvt.h | 19 +++++++++++--------
lib/ring/rte_ring_elem_pvt.h | 5 -----
2 files changed, 11 insertions(+), 13 deletions(-)
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index 3efe011f08..ee98155bea 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -52,6 +52,7 @@ __rte_ring_headtail_move_head_mt(struct rte_ring_headtail *d,
unsigned int n, enum rte_ring_queue_behavior behavior,
uint32_t *old_head, uint32_t *new_head, uint32_t *entries)
{
+ uint32_t head;
unsigned int max = n;
/*
@@ -61,7 +62,7 @@ __rte_ring_headtail_move_head_mt(struct rte_ring_headtail *d,
* d->head.
* If not, an unsafe partial order may ensue.
*/
- *old_head = rte_atomic_load_explicit(&d->head, rte_memory_order_acquire);
+ head = rte_atomic_load_explicit(&d->head, rte_memory_order_acquire);
do {
/* Reset n to the initial burst count */
n = max;
@@ -76,10 +77,10 @@ __rte_ring_headtail_move_head_mt(struct rte_ring_headtail *d,
/* The subtraction is done between two unsigned 32bits value
* (the result is always modulo 32 bits even if we have
- * *old_head > s->tail). So 'entries' is always between 0
+ * head > s->tail). So 'entries' is always between 0
* and capacity (which is < size).
*/
- *entries = capacity + stail - *old_head;
+ *entries = capacity + stail - head;
/* check that we have enough room in ring */
if (unlikely(n > *entries))
@@ -87,11 +88,11 @@ __rte_ring_headtail_move_head_mt(struct rte_ring_headtail *d,
0 : *entries;
if (n == 0)
- return 0;
+ break;
- *new_head = *old_head + n;
+ *new_head = head + n;
- /* on failure, *old_head is updated */
+ /* on failure, head is updated */
/*
* R1/A2.
* R1: Establishes a synchronizing edge with A0 of a
@@ -99,11 +100,13 @@ __rte_ring_headtail_move_head_mt(struct rte_ring_headtail *d,
* A2: Establishes a synchronizing edge with R1 of a
* different thread to observe same value for stail
* observed by that thread on CAS failure (to retry
- * with an updated *old_head).
+ * with an updated head).
*/
} while (unlikely(!rte_atomic_compare_exchange_strong_explicit(
- &d->head, old_head, *new_head,
+ &d->head, &head, *new_head,
rte_memory_order_release, rte_memory_order_acquire)));
+
+ *old_head = head;
return n;
}
diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 9d1da12a92..51176b0405 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -396,12 +396,7 @@ __rte_ring_headtail_move_head_st(struct rte_ring_headtail *d,
return n;
}
-/* There are two choices because GCC optimizer does poorly on atomic_compare_exchange */
-#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
-#include "rte_ring_x86_pvt.h"
-#else
#include "rte_ring_c11_pvt.h"
-#endif
/**
* @internal This function updates the producer head for enqueue
--
2.51.0
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] ring: avoid extra store at move head
2026-06-01 18:15 [PATCH] ring: avoid extra store at move head Konstantin Ananyev
@ 2026-06-01 22:23 ` Stephen Hemminger
2026-06-02 9:02 ` Konstantin Ananyev
0 siblings, 1 reply; 4+ messages in thread
From: Stephen Hemminger @ 2026-06-01 22:23 UTC (permalink / raw)
To: Konstantin Ananyev; +Cc: dev
On Mon, 1 Jun 2026 19:15:09 +0100
Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:
> C11 __rte_ring_headtail_move_head_mt() uses output
> parameter: 'uint32_t *old_head' directly within CAS operation.
> In x86_64 that cause gcc to generate extra instructions to
> store return value of CAS (eax) within 'old_head' memory location,
> even when CAS was not successful and another attempt should be
> performed. In some cases, even extra branch can be observed.
> To be more specific the code like that is generated:
> // start of 'do { } while();' loop
> .L2
> ...
> lock cmpxchgl %r8d, (%rdi)
> jne .L17 //
> .L1: // <---- successful completion of CAS, finish
> movl %edx, %eax
> ret
> .L17: // <---- unsuccessful completion of CAS, repeat
> movl %eax, (%r9)
> jmp .L2
>
> In constrast, x86 specific version that uses
> __sync_bool_compare_and_swap() doesn't exibit such problem,
> as __sync_bool_compare_and_swap() doesn't update the 'old_head'
> with new value, and we have to re-read it explicitly on each iteration.
>
> Overcome that problem by using local variable 'head' inside the loop,
> and updaing '*old_head' value only at exit.
> With such change gcc manages to avoid extra store(/branch).
>
> Depends-on: series-38225 ("deprecate rte_atomicNN family")
>
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> ---
I used the standard ring perf tests and ran 10 times via:
! /bin/bash
if [ -z "$1" ]; then
echo "Usage $0 version"
exit 1
fi
VERSION=$1
for i in $(seq 1 10); do
sudo DPDK_TEST=ring_perf_autotest \
./build/app/dpdk-test -l 2-5 -n 4 --no-pci --file-prefix=run$i \
> ~/DPDK/ring_perf_results/${VERSION}_run${i}.log 2>&1
echo "${VERSION} run $i done"
done
Then had Claude compare results:
Key metric (two physical cores legacy MP/MC bulk n=128):
main: 5.380 cycles/elem
sync-bool: 5.377 cycles/elem (-0.07%)
avoid-store: 5.892 cycles/elem (+9.52%) ← regresses
Looking at the dissassembly of ring_enqueue_bulk:
The inner loop of main and sync-bool versions is:
mov 0x80(%rdi),%r11d ; load d->head via displacement
mov 0x104(%rdi),%ebx ; load s->tail
add %ecx,%ebx
sub %r11d,%ebx
cmp %ebx,%r12d
jae [exit]
lea (%r8,%r11,1),%r13d ; new_head = old_head + n
mov %r11d,%eax ; expected → eax
lock cmpxchg %r13d,0x80(%rdi) ; ← displacement addressing
jne [retry] ; ← direct jne, eax preserved
Using atomic_compare_exchange and your patch:
mov 0x38(%rdi),%r10d
mov 0x80(%rdi),%eax ; load d->head directly into %eax
lea 0x80(%rdi),%rcx ; ← MATERIALIZE &d->head into %rcx
lea -0x1(%r8),%r12d
mov 0x104(%rdi),%r11d
add %r10d,%r11d
sub %eax,%r11d
cmp %r11d,%r12d
jae [exit]
lea (%r8,%rax,1),%r13d ; new_head
lock cmpxchg %r13d,(%rcx) ; ← INDIRECT addressing via %rcx
mov %eax,%ebx ; ← EXTRA: save post-CAS %eax to %ebx
jne [retry]
Bottom line: good idea but still fighting with Gcc optimizer here.
^ permalink raw reply [flat|nested] 4+ messages in thread
* RE: [PATCH] ring: avoid extra store at move head
2026-06-01 22:23 ` Stephen Hemminger
@ 2026-06-02 9:02 ` Konstantin Ananyev
2026-06-02 9:21 ` Morten Brørup
0 siblings, 1 reply; 4+ messages in thread
From: Konstantin Ananyev @ 2026-06-02 9:02 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev@dpdk.org
> > C11 __rte_ring_headtail_move_head_mt() uses output
> > parameter: 'uint32_t *old_head' directly within CAS operation.
> > In x86_64 that cause gcc to generate extra instructions to
> > store return value of CAS (eax) within 'old_head' memory location,
> > even when CAS was not successful and another attempt should be
> > performed. In some cases, even extra branch can be observed.
> > To be more specific the code like that is generated:
> > // start of 'do { } while();' loop
> > .L2
> > ...
> > lock cmpxchgl %r8d, (%rdi)
> > jne .L17 //
> > .L1: // <---- successful completion of CAS, finish
> > movl %edx, %eax
> > ret
> > .L17: // <---- unsuccessful completion of CAS, repeat
> > movl %eax, (%r9)
> > jmp .L2
> >
> > In constrast, x86 specific version that uses
> > __sync_bool_compare_and_swap() doesn't exibit such problem,
> > as __sync_bool_compare_and_swap() doesn't update the 'old_head'
> > with new value, and we have to re-read it explicitly on each iteration.
> >
> > Overcome that problem by using local variable 'head' inside the loop,
> > and updaing '*old_head' value only at exit.
> > With such change gcc manages to avoid extra store(/branch).
> >
> > Depends-on: series-38225 ("deprecate rte_atomicNN family")
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > ---
>
> I used the standard ring perf tests and ran 10 times via:
> ! /bin/bash
> if [ -z "$1" ]; then
> echo "Usage $0 version"
> exit 1
> fi
>
> VERSION=$1
> for i in $(seq 1 10); do
> sudo DPDK_TEST=ring_perf_autotest \
> ./build/app/dpdk-test -l 2-5 -n 4 --no-pci --file-prefix=run$i \
> > ~/DPDK/ring_perf_results/${VERSION}_run${i}.log 2>&1
> echo "${VERSION} run $i done"
> done
>
>
> Then had Claude compare results:
>
> Key metric (two physical cores legacy MP/MC bulk n=128):
> main: 5.380 cycles/elem
> sync-bool: 5.377 cycles/elem (-0.07%)
> avoid-store: 5.892 cycles/elem (+9.52%) ← regresses
>
>
> Looking at the dissassembly of ring_enqueue_bulk:
>
> The inner loop of main and sync-bool versions is:
> mov 0x80(%rdi),%r11d ; load d->head via displacement
> mov 0x104(%rdi),%ebx ; load s->tail
> add %ecx,%ebx
> sub %r11d,%ebx
> cmp %ebx,%r12d
> jae [exit]
> lea (%r8,%r11,1),%r13d ; new_head = old_head + n
> mov %r11d,%eax ; expected → eax
> lock cmpxchg %r13d,0x80(%rdi) ; ← displacement addressing
> jne [retry] ; ← direct jne, eax preserved
>
> Using atomic_compare_exchange and your patch:
> mov 0x38(%rdi),%r10d
> mov 0x80(%rdi),%eax ; load d->head directly into %eax
> lea 0x80(%rdi),%rcx ; ← MATERIALIZE &d->head into %rcx
> lea -0x1(%r8),%r12d
> mov 0x104(%rdi),%r11d
> add %r10d,%r11d
> sub %eax,%r11d
> cmp %r11d,%r12d
> jae [exit]
> lea (%r8,%rax,1),%r13d ; new_head
> lock cmpxchg %r13d,(%rcx) ; ← INDIRECT addressing via %rcx
> mov %eax,%ebx ; ← EXTRA: save post-CAS %eax to %ebx
> jne [retry]
>
> Bottom line: good idea but still fighting with Gcc optimizer here.
Thanks for trying.
On my box (AMD EPYC 9534) with same test, there is no much difference between all of them:
use-sync-bool: 2.2273
use-c11-current-version: 2.2422
use-c11-patched: 2.2431
Anyway, -10% on some boxes - that's probably good enough reason to keep specific version
for __rte_ring_headtail_move_head_mt().
My ask would be to have some special macro for it, so users can enable/disable it via
'meson setup' at will.
Konstantin
^ permalink raw reply [flat|nested] 4+ messages in thread
* RE: [PATCH] ring: avoid extra store at move head
2026-06-02 9:02 ` Konstantin Ananyev
@ 2026-06-02 9:21 ` Morten Brørup
0 siblings, 0 replies; 4+ messages in thread
From: Morten Brørup @ 2026-06-02 9:21 UTC (permalink / raw)
To: Konstantin Ananyev, Stephen Hemminger; +Cc: dev
> > Then had Claude compare results:
> >
> > Key metric (two physical cores legacy MP/MC bulk n=128):
> > main: 5.380 cycles/elem
> > sync-bool: 5.377 cycles/elem (-0.07%)
> > avoid-store: 5.892 cycles/elem (+9.52%) ← regresses
> >
> >
> > Looking at the dissassembly of ring_enqueue_bulk:
> >
> > The inner loop of main and sync-bool versions is:
> > mov 0x80(%rdi),%r11d ; load d->head via displacement
> > mov 0x104(%rdi),%ebx ; load s->tail
> > add %ecx,%ebx
> > sub %r11d,%ebx
> > cmp %ebx,%r12d
> > jae [exit]
> > lea (%r8,%r11,1),%r13d ; new_head = old_head + n
> > mov %r11d,%eax ; expected → eax
> > lock cmpxchg %r13d,0x80(%rdi) ; ← displacement addressing
> > jne [retry] ; ← direct jne, eax preserved
> >
> > Using atomic_compare_exchange and your patch:
> > mov 0x38(%rdi),%r10d
> > mov 0x80(%rdi),%eax ; load d->head directly into %eax
> > lea 0x80(%rdi),%rcx ; ← MATERIALIZE &d->head into
> %rcx
> > lea -0x1(%r8),%r12d
> > mov 0x104(%rdi),%r11d
> > add %r10d,%r11d
> > sub %eax,%r11d
> > cmp %r11d,%r12d
> > jae [exit]
> > lea (%r8,%rax,1),%r13d ; new_head
> > lock cmpxchg %r13d,(%rcx) ; ← INDIRECT addressing via %rcx
> > mov %eax,%ebx ; ← EXTRA: save post-CAS %eax to
> %ebx
> > jne [retry]
> >
> > Bottom line: good idea but still fighting with Gcc optimizer here.
>
> Thanks for trying.
> On my box (AMD EPYC 9534) with same test, there is no much difference
> between all of them:
> use-sync-bool: 2.2273
> use-c11-current-version: 2.2422
> use-c11-patched: 2.2431
> Anyway, -10% on some boxes - that's probably good enough reason to keep
> specific version
> for __rte_ring_headtail_move_head_mt().
> My ask would be to have some special macro for it, so users can
> enable/disable it via 'meson setup' at will.
This seems very exotic as a meson command line option.
Either put it in rte_config.h, or make it CPU specific.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2026-06-02 9:21 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-01 18:15 [PATCH] ring: avoid extra store at move head Konstantin Ananyev
2026-06-01 22:23 ` Stephen Hemminger
2026-06-02 9:02 ` Konstantin Ananyev
2026-06-02 9:21 ` Morten Brørup
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox