RE: [PATCH v1 3/4] ring: fix potential sync issue between head and tail values

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
	"jerinj@marvell.com" <jerinj@marvell.com>,
	"hemant.agrawal@nxp.com" <hemant.agrawal@nxp.com>,
	"drc@linux.ibm.com" <drc@linux.ibm.com>, nd <nd@arm.com>,
	nd <nd@arm.com>
Subject: RE: [PATCH v1 3/4] ring: fix potential sync issue between head and tail values
Date: Mon, 2 Jun 2025 11:11:34 +0000	[thread overview]
Message-ID: <0832db1027614c918301e48010d92e4a@huawei.com> (raw)
In-Reply-To: <AS2PR08MB8902DE7ECC64BAFE65C02F9B9F64A@AS2PR08MB8902.eurprd08.prod.outlook.com>


Hi Wathsala,

> >
> > Thanks for the clarification, I now see your concern.
> >
> > So, in summary what you are saying is that tail update from
> > Thread #1 that happened at T0 is not observed by the thread #2
> > at T2 when it computed new_head and entries calculation.

Yes

> > That cannot happen in Arm v8/v9 because tail update is a
> > store-release and a load-acquire that program order follows it
> > can only be issued after all the cores have observed the
> > store-release (there is a synchronization with relationship to
> > store-release and load-acquire pairs).
> >
> > In the example you have provided Thread #1's
> > store(&prod.tail, 2, release) is observed by all the cores in the
> > system by the time same thread performs load(&prod.tail, acquire)
> > at T2. As per Arm v8/v9 memory model Thread #2 should observe
> > prod.tail=2 (not prod.tail=1).
> >
> > Arm Architecture Reference Manual section B2.10.11 states…
> >
> > "Where a Load-Acquire appears in program order after a Store-Release,
> > the memory access generated by the Store-Release instruction is
> > observed by each PE to the extent that PE is required to observe the
> > access coherently, before the memory access generated by the
> > Load-Acquire instruction is observed by that PE, to the extent that the
> > PE is required to observe the access coherently."

Interesting, thanks for pointing to that.
Indeed, according to the docs, acquire/release operations use a sequentially consistent model.
But we not always getting pure LDAR instruction here, it depends on compiler march/mcpu  flags.
For some HW arch gcc (and clang) generate LDAR instruction, for others - LDPAPR.
And according to the doc:
https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions
for Load-AcquirePC  the requirement that Load-Acquires are observed in order with preceding Store-Releases is dropped.  
To be more specific here, on my box:
1) meson with -Dmachine=default (gcc options: "-march=armv8-a+crc") generates code with LDAR instruction
https://godbolt.org/z/5cEscdxrb
2) meson with -Dmachine=native (gcc options: " -mcpu=neoverse-n1") generates code with LDAPR instruction 
https://godbolt.org/z/8KjcvYxch

soring_stress_test for 1) (LDAR) passes
soring_stress_test for 2) (LDAPR) fails with symptoms I described earlier 

> 
> In soring is this the pair that update the tail and move head?
> 
> __rte_soring_update_tail:
> 	__rte_ring_update_tail(&rht->ht, head, next, st, enq);

That function is used to update prod.tail and cons.tail

> __rte_soring_move_cons_head:
> 	 __rte_ring_headtail_move_head(&r->cons.ht,  &r->stage[stage].ht, 0, ...);
> 

__rte_ring_headtail_move_head is invoked to update prod.head (by __rte_soring_move_prod_head)
 and cons.head (by __rte_soring_move_cons_head). 

> If so, &rht->ht and &r->stage[stage].ht are the same address?

No.
In conventional ring we have just two stages: 'prod'  and 'cons'.
So moving prod.head depends on cons.tail and visa-versa: moving cons.head depends on prod.tail.
With soring we have multiple (N) extra stages in between.
So the dependency chain is a bit longer:
prod.head depends on cons.tail
stage[0].head depends on prod.tail,
stage[1].head depends on stage[0].tail
...
stage{N-1].head depends on stage[N-2].tail
cons.head depends on stage[N-1].tail

> If they are
> not, then you will run into the issue you have seen (a.k.a "Other-multi-copy
>  atomic" which is legit in Arm v8 and above).

Can you probably elaborate a bit more for me here?
Thanks
Konstantin
 

> Thanks.
> 
> --wathsala
> 
> 
> > > -----Original Message-----
> > > From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > > Sent: Monday, May 26, 2025 6:54 AM
> > > To: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>;
> > > dev@dpdk.org
> > > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> > > jerinj@marvell.com; hemant.agrawal@nxp.com; drc@linux.ibm.com; nd
> > > <nd@arm.com>
> > > Subject: RE: [PATCH v1 3/4] ring: fix potential sync issue between head and
> > tail
> > > values
> > >
> > > Hi Wathsala,
> > >
> > > >
> > > > Hi Konstanin,
> > > >
> > > > In rte_ring the store-release on tail update guarantees that CAS
> > > > won't get reordered with the store-released of the tail update.
> > > >
> > > > So, the sequence of events would look like this (combined view
> > > > of head and tail update)
> > > >
> > > > Releaxed-load(new_head,  N)                              ----------------> (A)
> > > > Relaxed-CAS(d->head, new_head, old_head)   ----------------> (B)
> > > > Store-release-store(d->tail, new_head)             ----------------> (C)
> > > >
> > > > If we look at address dependencies, then...
> > > >
> > > > (B) depends on (A) due to new_head address dependency.
> > > > (C) depends on (A) due to new_head address dependency.
> > > >
> > > > So, dependency graph looks like this
> > > >        (A)
> > > >     /       \
> > > >    v        v
> > > >  (B)     (C)
> > > >
> > > > There is no implicit dependence between (B) and (C), I think
> > > > this is the issue you are brining up.
> > > > Even though there is no dependence between the two,
> > > > the store-release of (C) ensures that (B) won't drop below it.
> > > > Therefore, the above graph can be turned into an ordered
> > > > sequence as shown below..
> > > >
> > > > (A) -> (B) -> (C)
> > >
> > > I do agree that with current implementation of
> > > __rte_ring_headtail_move_head()
> > > in lib/ring/rte_ring_c11_pvt.h the order of these 3 operations (A->B->C)
> > > should be sequential.
> > > The problem I am talking about is a different one:
> > > thread can see 'latest' 'cons.head' value, with 'previous' value for 'prod.tail' or
> > > visa-versa.
> > > In other words: 'cons.head' value depends on 'prod.tail', so before making
> > > latest 'cons.head'
> > > value visible to other threads, we need to ensure that latest 'prod.tail' is also
> > > visible.
> > > Let me try to explain it on the example:
> > >
> > > Suppose at some moment we have:
> > > prod={.head=2,.tail=1};
> > > cons={.head=0,.tail=0};
> > > I.e. thead #1 is in process to enqueue one more element into the ring.
> > >
> > >        Thread #1                                                                             Thread #2
> > > T0:
> > >   store(&prod.tail, 2, release);
> > >   /*AFAIU: this is somewhat equivalent to: wmb(); prod.tail=2;
> > >   * I.E. - it guarantees that all stores initiated before that operation will be
> > > visible
> > >   * by other threads at the same moment or before new value of prod.tail wlll
> > > become
> > >   * visible, but it doesn't guarantee that new prod.tail  value will be visible  to
> > > other
> > >   * threads immediately.
> > >   */
> > > ...
> > > move_cons_head(...,n=2)
> > move_cons_head(...,n=1)
> > > ...                                                                                              ...
> > > T1:
> > >   *old_head = load(&cons.head, relaxed);
> > >   fence(acquire);
> > >   /*old_head == 0, no surprises */
> > >   stail = load(&prod.tail, acquire);
> > >   /*stail == 2, no surprises */
> > >  *entries = (capacity + stail - *old_head);
> > > *new_head = *old_head + n;
> > >  /* *entries == (2 - 0) == 2; *new_head==2; all good */
> > > ...
> > > T2:
> > >                                                                                                 *old_head =
> > > load(&cons.head, relaxed);
> > >                                                                                                 fence(acquire);
> > >                                                                                                 /*old_head == 0, no
> > surprises
> > > */
> > >                                                                                                 stail = load(&prod.tail,
> > > acquire);
> > > /* !!!!! stail == 1  !!!!! for Thread 2
> > >  * Even though we do use acquire here - there was no *release* after
> > > store(&prod.tail).
> > >  * So, still no guarantee that Thread 2 will see latest prod.tail value.
> > >  */
> > >                                                                                                *entries = (capacity + stail -
> > > *old_head);
> > >                                                                                                /* *entries == (1 - 0) == 1,
> > still
> > > ok */
> > >                                                                                                *new_head = *old_head + n;
> > >                                                                                                /* *new_head == 1 */
> > > T3:
> > >   success = CAS(&cons.head,
> > >     old_head /*==0/, *new_head /*==2*/,
> > >     relaxed, relaxed);
> > >   /*success==1, cons.head==2*/
> > >  ...
> > > T4:
> > >                                                                                            success = CAS(&cons.head,
> > >                                                                                            old_head /*==0/, *new_head
> > > /*==1*/,
> > >                                                                                           relaxed, relaxed);
> > >                                                                                           /*success==0, *old_head==2*/
> > > /* CAS() failed and provided Thread 2 with latest valued for cons.head(==2)
> > >  *  Thread 2 repeats attempt, starts second iteration
> > >  */
> > >                                                                                           fence(acquire);
> > >                                                                                           stail = load(&prod.tail, acquire);
> > > /* !!!!! stail == 1  !!!!! for Thread 2
> > >  * Still no *release* had happened after store(&prod.tail) at T0.
> > >  * So, still no guarantee that Thread 2 will see latest prod.tail value.
> > >  */
> > >                                                                                           *entries = (capacity + stail -
> > > *old_head);
> > >                                                                                           *new_head = *old_head + n;
> > >
> > > /* !!!!! *entries == (1 - 2) == -1(UINT32_MAX); *new_head==(2+1)==3; !!!!!
> > >  *  we are about to corrupt our ring !!!!!
> > >  */
> > >
> > > >
> > > > I haven't looked at the so-ring yet. Could it be possible that the
> > > > issue is due to something else introduced in that code?
> > >
> > > Well, as I said, so far I wasn't able to re-produce this problem with
> > > conventional
> > > ring (ring_stress_autotest), only soring_stress_autotest is failing and
> > > for now - only on one specific ARM platform.
> > > Regarding soring specific fix (without touching common code) -
> > > sure it is also possible, pls see patch #2.
> > > There I just added 'fence(release);' straight after 'store(&tail);'
> > > That's seems enough to fix that problem within the soring only.
> > > Though, from my understanding rte_ring might also be affected,
> > > that's why I went ahead and prepared that patch.
> > > If you feel, that I a missing something here - pls shout.
> > > Konstantin
> > >
> > >
> > > > Thanks,
> > > >
> > > > --wathsala
> > > >
> > > > > This patch aims several purposes:
> > > > > - provide an alternative (and I think a better) way to fix the
> > > > >   issue discussed in previous patch:
> > > > >   "ring/soring: fix synchronization issue between head and tail values"
> > > > > - make sure that such problem wouldn’t happen within other usages of
> > > > >   __rte_ring_headtail_move_head() – both current rte_ring
> > > > >   implementation and possible future use-cases.
> > > > > - step towards unification of move_head() implementations and
> > > > >   removing rte_ring_generic_pvt.h
> > > > > It uses Acquire-Release memory ordering for CAS operation in
> > > move_head().
> > > > > That guarantees that corresponding ‘tail’ updates will be visible before
> > > current
> > > > > ‘head’ is updated.
> > > > > As I said before: I think that in theory the problem described in previous
> > > patch
> > > > > might happen with our conventional rte_ring too (when
> > > > > RTE_USE_C11_MEM_MODEL enabled).
> > > > > But, so far I didn’t manage to reproduce it in reality.
> > > > > For that reason and also because it touches a critical rte_ring code-path, I
> > > put
> > > > > these changes into a separate patch. Expect all interested stakeholders to
> > > come-
> > > > > up with their comments and observations.
> > > > > Regarding performance impact – on my boxes both ring_perf_autotest
> > and
> > > > > ring_stress_autotest – show a mixed set of results: some of them
> > become
> > > few
> > > > > cycles faster, another few cycles slower.
> > > > > But so far, I didn’t notice any real degradations with that patch.
> > > > >
> > > > > Fixes: b5458e2cc483 ("ring: introduce staged ordered ring")
> > > > > Fixes: 1cc363b8ce06 ("ring: introduce HTS ring mode")
> > > > > Fixes: e6ba4731c0f3 ("ring: introduce RTS ring mode")
> > > > > Fixes: 49594a63147a ("ring/c11: relax ordering for load and store of the
> > > head")
> > > > >
> > > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > > > > ---
> > > > >  lib/ring/rte_ring_c11_pvt.h      | 27 +++++++++++++++++----------
> > > > >  lib/ring/rte_ring_hts_elem_pvt.h |  6 ++++--
> > > lib/ring/rte_ring_rts_elem_pvt.h
> > > > > |  6 ++++--
> > > > >  lib/ring/soring.c                |  5 -----
> > > > >  4 files changed, 25 insertions(+), 19 deletions(-)
> > > > >
> > > > > diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> > index
> > > > > 0845cd6dcf..6d1c46df9a 100644
> > > > > --- a/lib/ring/rte_ring_c11_pvt.h
> > > > > +++ b/lib/ring/rte_ring_c11_pvt.h
> > > > > @@ -77,20 +77,19 @@ __rte_ring_headtail_move_head(struct
> > > > > rte_ring_headtail *d,
> > > > >  	int success;
> > > > >  	unsigned int max = n;
> > > > >
> > > > > +	/* Ensure the head is read before tail */
> > > > >  	*old_head = rte_atomic_load_explicit(&d->head,
> > > > > -			rte_memory_order_relaxed);
> > > > > +			rte_memory_order_acquire);
> > > > >  	do {
> > > > >  		/* Reset n to the initial burst count */
> > > > >  		n = max;
> > > > >
> > > > > -		/* Ensure the head is read before tail */
> > > > > -		rte_atomic_thread_fence(rte_memory_order_acquire);
> > > > > -
> > > > > -		/* load-acquire synchronize with store-release of ht->tail
> > > > > -		 * in update_tail.
> > > > > +		/*
> > > > > +		 * Read s->tail value. Note that it will be loaded after
> > > > > +		 * d->head load, but before CAS operation for the d->head.
> > > > >  		 */
> > > > >  		stail = rte_atomic_load_explicit(&s->tail,
> > > > > -					rte_memory_order_acquire);
> > > > > +					rte_memory_order_relaxed);
> > > > >
> > > > >  		/* The subtraction is done between two unsigned 32bits value
> > > > >  		 * (the result is always modulo 32 bits even if we have @@ -
> > > > > 112,11 +111,19 @@ __rte_ring_headtail_move_head(struct
> > > rte_ring_headtail
> > > > > *d,
> > > > >  			d->head = *new_head;
> > > > >  			success = 1;
> > > > >  		} else
> > > > > -			/* on failure, *old_head is updated */
> > > > > +			/*
> > > > > +			 * on failure, *old_head is updated.
> > > > > +			 * this CAS(ACQ_REL, ACQUIRE) serves as a hoist
> > > > > +			 * barrier to prevent:
> > > > > +			 *  - OOO reads of cons tail value
> > > > > +			 *  - OOO copy of elems from the ring
> > > > > +			 *  Also RELEASE guarantees that latest tail value
> > > > > +			 *  will become visible before the new head value.
> > > > > +			 */
> > > > >  			success =
> > > > > rte_atomic_compare_exchange_strong_explicit(
> > > > >  					&d->head, old_head, *new_head,
> > > > > -					rte_memory_order_relaxed,
> > > > > -					rte_memory_order_relaxed);
> > > > > +					rte_memory_order_acq_rel,
> > > > > +					rte_memory_order_acquire);
> > > > >  	} while (unlikely(success == 0));
> > > > >  	return n;
> > > > >  }
> > > > > diff --git a/lib/ring/rte_ring_hts_elem_pvt.h
> > > b/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > index c59e5f6420..cc593433b9 100644
> > > > > --- a/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > +++ b/lib/ring/rte_ring_hts_elem_pvt.h
> > > > > @@ -116,13 +116,15 @@ __rte_ring_hts_move_head(struct
> > > > > rte_ring_hts_headtail *d,
> > > > >  		np.pos.head = op.pos.head + n;
> > > > >
> > > > >  	/*
> > > > > -	 * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to prevent:
> > > > > +	 * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to prevent:
> > > > >  	 *  - OOO reads of cons tail value
> > > > >  	 *  - OOO copy of elems from the ring
> > > > > +	 *   Also RELEASE guarantees that latest tail value
> > > > > +	 *   will become visible before the new head value.
> > > > >  	 */
> > > > >  	} while (rte_atomic_compare_exchange_strong_explicit(&d->ht.raw,
> > > > >  			(uint64_t *)(uintptr_t)&op.raw, np.raw,
> > > > > -			rte_memory_order_acquire,
> > > > > +			rte_memory_order_acq_rel,
> > > > >  			rte_memory_order_acquire) == 0);
> > > > >
> > > > >  	*old_head = op.pos.head;
> > > > > diff --git a/lib/ring/rte_ring_rts_elem_pvt.h
> > > b/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > index 509fa674fb..860b13cc61 100644
> > > > > --- a/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > +++ b/lib/ring/rte_ring_rts_elem_pvt.h
> > > > > @@ -131,13 +131,15 @@ __rte_ring_rts_move_head(struct
> > > > > rte_ring_rts_headtail *d,
> > > > >  		nh.val.cnt = oh.val.cnt + 1;
> > > > >
> > > > >  	/*
> > > > > -	 * this CAS(ACQUIRE, ACQUIRE) serves as a hoist barrier to prevent:
> > > > > +	 * this CAS(ACQ_REL, ACQUIRE) serves as a hoist barrier to prevent:
> > > > >  	 *  - OOO reads of cons tail value
> > > > >  	 *  - OOO copy of elems to the ring
> > > > > +	 *  Also RELEASE guarantees that latest tail value
> > > > > +	 *  will become visible before the new head value.
> > > > >  	 */
> > > > >  	} while (rte_atomic_compare_exchange_strong_explicit(&d-
> > > >head.raw,
> > > > >  			(uint64_t *)(uintptr_t)&oh.raw, nh.raw,
> > > > > -			rte_memory_order_acquire,
> > > > > +			rte_memory_order_acq_rel,
> > > > >  			rte_memory_order_acquire) == 0);
> > > > >
> > > > >  	*old_head = oh.val.pos;
> > > > > diff --git a/lib/ring/soring.c b/lib/ring/soring.c index
> > > 7bcbf35516..21a1a27e24
> > > > > 100644
> > > > > --- a/lib/ring/soring.c
> > > > > +++ b/lib/ring/soring.c
> > > > > @@ -123,8 +123,6 @@ __rte_soring_stage_finalize(struct
> > > > > soring_stage_headtail *sht, uint32_t stage,
> > > > >  	rte_atomic_store_explicit(&sht->tail.raw, ot.raw,
> > > > >  			rte_memory_order_release);
> > > > >
> > > > > -	/* make sure that new tail value is visible */
> > > > > -	rte_atomic_thread_fence(rte_memory_order_release);
> > > > >  	return i;
> > > > >  }
> > > > >
> > > > > @@ -219,9 +217,6 @@ __rte_soring_update_tail(struct
> > > __rte_ring_headtail
> > > > > *rht,
> > > > >  		/* unsupported mode, shouldn't be here */
> > > > >  		RTE_ASSERT(0);
> > > > >  	}
> > > > > -
> > > > > -	/* make sure that new tail value is visible */
> > > > > -	rte_atomic_thread_fence(rte_memory_order_release);
> > > > >  }
> > > > >
> > > > >  static __rte_always_inline uint32_t
> > > > > --
> > > > > 2.43.0

next prev parent reply	other threads:[~2025-06-02 11:11 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-21 11:14 [PATCH v1 0/4] ring: some fixes and improvements Konstantin Ananyev
2025-05-21 11:14 ` [PATCH v1 1/4] ring: introduce extra run-time checks Konstantin Ananyev
2025-05-21 12:14   ` Morten Brørup
2025-05-21 12:34     ` Konstantin Ananyev
2025-05-21 18:36       ` Morten Brørup
2025-05-21 19:38         ` Konstantin Ananyev
2025-05-21 22:02           ` Morten Brørup
2025-05-26  8:39             ` Konstantin Ananyev
2025-05-26  9:57               ` Morten Brørup
2025-05-21 11:14 ` [PATCH v1 2/4] ring/soring: fix head-tail synchronization issue Konstantin Ananyev
2025-05-21 11:14 ` [PATCH v1 3/4] ring: fix potential sync issue between head and tail values Konstantin Ananyev
2025-05-21 20:26   ` Morten Brørup
2025-05-22 22:03   ` Wathsala Wathawana Vithanage
2025-05-26 11:54     ` Konstantin Ananyev
2025-05-27 21:33       ` Wathsala Wathawana Vithanage
2025-05-27 22:47         ` Wathsala Wathawana Vithanage
2025-06-02 11:11           ` Konstantin Ananyev [this message]
2025-06-04  3:47             ` Wathsala Wathawana Vithanage
2025-06-04 17:20               ` Wathsala Wathawana Vithanage
2025-06-12 11:39                 ` Konstantin Ananyev
2025-05-21 11:14 ` [PATCH v1 4/4] config/x86: enable RTE_USE_C11_MEM_MODEL by default Konstantin Ananyev
2025-05-21 19:47   ` Morten Brørup
2026-01-14 19:42 ` [PATCH v1 0/4] ring: some fixes and improvements Stephen Hemminger
2026-01-23  5:01 ` Stephen Hemminger

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0832db1027614c918301e48010d92e4a@huawei.com \
    --to=konstantin.ananyev@huawei.com \
    --cc=Honnappa.Nagarahalli@arm.com \
    --cc=dev@dpdk.org \
    --cc=drc@linux.ibm.com \
    --cc=hemant.agrawal@nxp.com \
    --cc=jerinj@marvell.com \
    --cc=nd@arm.com \
    --cc=wathsala.vithanage@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.