Re: [PATCH v11 01/14] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()

Linux Power Management development
 help / color / mirror / Atom feed

From: David Laight <david.laight.linux@gmail.com>
To: Ankur Arora <ankur.a.arora@oracle.com>
Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-pm@vger.kernel.org,
	bpf@vger.kernel.org, arnd@arndb.de, catalin.marinas@arm.com,
	will@kernel.org, peterz@infradead.org, akpm@linux-foundation.org,
	mark.rutland@arm.com, harisokn@amazon.com, cl@gentwo.org,
	ast@kernel.org, rafael@kernel.org, daniel.lezcano@linaro.org,
	memxor@gmail.com, zhenglifeng1@huawei.com,
	xueshuai@linux.alibaba.com, rdunlap@infradead.org,
	joao.m.martins@oracle.com, boris.ostrovsky@oracle.com,
	konrad.wilk@oracle.com, ashok.bhat@arm.com
Subject: Re: [PATCH v11 01/14] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
Date: Fri, 8 May 2026 09:32:06 +0100	[thread overview]
Message-ID: <20260508093206.389d9af2@pumpkin> (raw)
In-Reply-To: <87lddujttz.fsf@oracle.com>

On Thu, 07 May 2026 23:31:20 -0700
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> David Laight <david.laight.linux@gmail.com> writes:
> 
> > On Wed, 06 May 2026 13:54:06 -0700
> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >  
> >> David Laight <david.laight.linux@gmail.com> writes:
> >>  
> >> > On Wed, 06 May 2026 00:30:29 -0700
> >> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >> >  
> >> >> Ankur Arora <ankur.a.arora@oracle.com> writes:
> >> >>  
> >> >> > Add smp_cond_load_relaxed_timeout(), which extends
> >> >> > smp_cond_load_relaxed() to allow waiting for a duration.
> >> >> >
> >> >> > We loop around waiting for the condition variable to change while
> >> >> > peridically doing a time-check. The loop uses cpu_poll_relax() to slow
> >> >> > down the busy-wait, which, unless overridden by the architecture
> >> >> > code, amounts to a cpu_relax().
> >> >> >
> >> >> > Note that there are two ways for the time-check to fail: the timeout
> >> >> > case or, @time_expr_ns returning an invalid value (negative or zero).
> >> >> > The second failure mode allows for clocks attached to the clock-domain
> >> >> > of @cond_expr --  which might cease to operate meaningfully once some
> >> >> > state internal to @cond_expr has changed -- to fail.
> >> >> >
> >> >> > Evaluation of @time_expr_ns: in the fastpath we want to keep the
> >> >> > performance close to smp_cond_load_relaxed(). So defer evaluation
> >> >> > of the potentially costly @time_expr_ns to the slowpath.
> >> >> >
> >> >> > This also means that there will always be some hardware dependent
> >> >> > duration that has passed in cpu_poll_relax() iterations at the time
> >> >> > of first evaluation. Additionally cpu_poll_relax() is not guaranteed
> >> >> > to return at timeout boundary. In sum, expect timeout overshoot when
> >> >> > we exit due to expiration of the timeout.
> >> >> >
> >> >> > The number of spin iterations before time-check, SMP_TIMEOUT_POLL_COUNT
> >> >> > is chosen to be 200 by default. With a cpu_poll_relax() iteration
> >> >> > taking ~20-30 cycles (measured on a variety of x86 platforms), we
> >> >> > expect a time-check every ~4000-6000 cycles.
> >> >> >
> >> >> > The outer limit of the overshoot is double that when working with the
> >> >> > parameters above. This might be higher or lower depending on the
> >> >> > implementation of cpu_poll_relax() across architectures.
> >> >> >
> >> >> > Lastly, config option ARCH_HAS_CPU_RELAX indicates availability of a
> >> >> > cpu_poll_relax() that is cheaper than polling. This might be relevant
> >> >> > for cases with a long timeout.
> >> >> >
> >> >> > Cc: Arnd Bergmann <arnd@arndb.de>
> >> >> > Cc: Will Deacon <will@kernel.org>
> >> >> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> >> >> > Cc: Peter Zijlstra <peterz@infradead.org>
> >> >> > Cc: linux-arch@vger.kernel.org
> >> >> > Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> >> >> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> >> > ---
> >> >> > Notes:
> >> >> >    - add a comment mentioning that smp_cond_load_relaxed_timeout() might
> >> >> >      be using architectural primitives that don't support MMIO.
> >> >> >      (David Laight, Catalin Marinas)
> >> >> >
> >> >> >  include/asm-generic/barrier.h | 69 +++++++++++++++++++++++++++++++++++
> >> >> >  1 file changed, 69 insertions(+)
> >> >> >
> >> >> > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> >> >> > index d4f581c1e21d..e5a6a1c04649 100644
> >> >> > --- a/include/asm-generic/barrier.h
> >> >> > +++ b/include/asm-generic/barrier.h
> >> >> > @@ -273,6 +273,75 @@ do {									\
> >> >> >  })
> >> >> >  #endif
> >> >> >
> >> >> > +/*
> >> >> > + * Number of times we iterate in the loop before doing the time check.
> >> >> > + * Note that the iteration count assumes that the loop condition is
> >> >> > + * relatively cheap.
> >> >> > + */
> >> >> > +#ifndef SMP_TIMEOUT_POLL_COUNT
> >> >> > +#define SMP_TIMEOUT_POLL_COUNT		200
> >> >> > +#endif
> >> >> > +
> >> >> > +/*
> >> >> > + * Platforms with ARCH_HAS_CPU_RELAX have a cpu_poll_relax() implementation
> >> >> > + * that is expected to be cheaper (lower power) than pure polling.
> >> >> > + */
> >> >> > +#ifndef cpu_poll_relax
> >> >> > +#define cpu_poll_relax(ptr, val, timeout_ns)	cpu_relax()
> >> >> > +#endif
> >> >> > +
> >> >> > +/**
> >> >> > + * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
> >> >> > + * guarantees until a timeout expires.
> >> >> > + * @ptr: pointer to the variable to wait on.
> >> >> > + * @cond_expr: boolean expression to wait for.
> >> >> > + * @time_expr_ns: expression that evaluates to monotonic time (in ns) or,
> >> >> > + *  on failure, returns a negative value.
> >> >> > + * @timeout_ns: timeout value in ns
> >> >> > + * Both of the above are assumed to be compatible with s64; the signed
> >> >> > + * value is used to handle the failure case in @time_expr_ns.
> >> >> > + *
> >> >> > + * Equivalent to using READ_ONCE() on the condition variable.
> >> >> > + *
> >> >> > + * Callers that expect to wait for prolonged durations might want
> >> >> > + * to take into account the availability of ARCH_HAS_CPU_RELAX.
> >> >> > + *
> >> >> > + * Note that @ptr is expected to point to a memory address. Using this
> >> >> > + * interface with MMIO will be slower (since SMP_TIMEOUT_POLL_COUNT is
> >> >> > + * tuned for memory) and might also break in interesting architecture
> >> >> > + * dependent ways.
> >> >> > + */
> >> >> > +#ifndef smp_cond_load_relaxed_timeout
> >> >> > +#define smp_cond_load_relaxed_timeout(ptr, cond_expr,			\
> >> >> > +				      time_expr_ns, timeout_ns)		\
> >> >> > +({									\
> >> >> > +	typeof(ptr) __PTR = (ptr);					\  
> >
> > 		auto __PTR = ptr;
> >  
> >> >> > +	__unqual_scalar_typeof(*ptr) VAL;				\  
> >
> > It can't matter if integer promotions before assigning to VAL.
> > So something like:
> > 		auto VAL = 1 ? 0 : *__PTR + 0;
> > will generate a suitable writable variable.
> > (The '+ 0' is needed because some versions of gcc incorrectly propagate
> > 'const'.)  
> 
> Thanks. This is useful to know. However, we use the unqualified typeof
> dictum all over barrier.h. I didn't really see the need to depart from
> that.

The have been discussions in other threads, my interpretation of the result
is that the use of __unqual_scalar_typeof() should be limited to a few places
where there really is no other option.
If you've ever looked at the preprocessor output you'll understand part of the
problem.

> 
> >> >> > +	u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT;			\
> >> >> > +	s64 __timeout = (s64)timeout_ns;				\  
> >
> > The (s64) cast can only hide errors.
> >  
> >> >> > +	s64 __time_now, __time_end = 0;					\
> >> >> > +									\
> >> >> > +	for (;;) {							\
> >> >> > +		VAL = READ_ONCE(*__PTR);				\
> >> >> > +		if (cond_expr)						\
> >> >> > +			break;						\
> >> >> > +		cpu_poll_relax(__PTR, VAL, (u64)__timeout);		\  
> >
> > That doesn't look right, __timeout is relative; if the underlying code
> > uses the timeout then the code delays for 200 * timeout_ns before even
> > looking at the actual time.
> >
> > If you want to spin then you may not even want the cpu_relax() at all.
> > (Or with a very short timeout as in the version below.)  
> 
> Yeah, BPF uses this in the fastpath where we want to avoid looking at
> the clock in the fastpath.
> Overshooting the deadline was a minor problem in comparison.
> 
> But I agree the version below with the shorter timeout works better.
> Unfortunately it doesn't help on arm64 if we are using WFE.

Yes, the code is ok if cpu_poll_relax() ignores the timeout.
But for WFE it is all broken.

Perhaps refactoring/optimising the WFE code might help.
IIRC it converts the relative timeout_ns into an absolute 'end time'
(of some other clock).
So if you do that conversion once, returning 0 (usually compile-time)
when WFE isn't in use then it should be possible to make this code work.

Actually, if you assume/require that the converted time is either 0 or
greater than 200 then it can be used as the initialiser for __n.

Even if the timeout passed to cpu_poll_relax() is relative, it doesn't
necessarily need reducing.
I'd expect that the timeout is only a 'guard' timeout to detect lockups.
So just documenting that the timeout might be double the specified value
would solve that problem.

> 
> >> >> > +		if (++__n < __spin)					\
> >> >> > +			continue;					\
> >> >> > +		__time_now = (s64)(time_expr_ns);			\  
> >
> > Another cast that can only hide bugs.
> >  
> >> >> > +		if (unlikely(__time_end == 0))				\
> >> >> > +			__time_end = __time_now + __timeout;		\
> >> >> > +		__timeout = __time_end - __time_now;			\
> >> >> > +		if (__time_now <= 0 || __timeout <= 0) {		\
> >> >> > +			VAL = READ_ONCE(*__PTR);			\
> >> >> > +			break;						\
> >> >> > +		}							\
> >> >> > +		__n = 0;						\  
> >
> > Resetting the spin count doesn't look right at all.
> > In principle the code will delay for 200 * __timeout.
> > Possibly the earlier check should be:
> > 			if (__n < __spin) {
> > 				__n++;
> > 				continue;
> > 			}
> > (Or just don't worry that the code will spin again after 4M loops.
> > The problem you have is that if cpu_poll_relay() ignores the timeout you
> > probably want to spin 'for a bit' in code that only accesses local data
> > (in particular avoiding evaluating cond_expr or time_expr_ns).  
> 
> Yeah we do avoid evaluating the time_expr_ns. And I agree we don't want
> to hammer the cond_expr but the cpu_relax() should help with that.
> (In my measurements I see an IPC of ~0.05 in a cpu_relax() loop of this
> kind.)

I'm not sure what cpu_relax() really does.
I suspect it helps if hyperthreading is in use, and may reduce power a bit
(and may be quite slow).
But the cpu will still be hammering cond_expr.
This is probably fine most of the time - since it will just hit the local
d-cache. Even if the line is evicted I'd expect the cost to mainly be load
on the inter-cpu bus (slowing down cond_expr won't matter).

I have a feeling that one of the reasons that WFE was better than just
spinning was that the expressions read a per-cpu variable; and doing that
on arm64 required locked bus cycles so affected system performance.
I'm sure that using mwait (or equivalent) will almost always increase latency
over just reading the variable, it is lower power - but that is different.

...

> >> >> I think this is worth cleaning up a bit. The change is mostly around
> >> >> introducing a u32 __itertime and explicitly computing the waiting time.
> >> >> And adding a check to ensure that we start with a valid value.
> >> >>
> >> >> This does make the implementation a little more involved. So just wanted
> >> >> to see if people have any opinions on this?
> >> >>
> >> >> +#ifndef smp_cond_load_relaxed_timeout
> >> >> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr,          \
> >> >> +                                     time_expr_ns, timeout_ns) \
> >> >> +({                                                             \
> >> >> +       typeof(ptr) __PTR = (ptr);                              \
> >> >> +       __unqual_scalar_typeof(*(ptr)) VAL;                     \
> >> >> +       u32 __count = 0, __spin = SMP_TIMEOUT_POLL_COUNT;       \
> >> >> +       s64 __timeout = (s64)(timeout_ns);                      \
> >> >> +       s64 __time_now, __time_end = 0;                         \
> >> >> +       u32 __maybe_unused __itertime;                          \
> >> >> +                                                               \
> >> >> +       for (__itertime = NSEC_PER_USEC;                        \  
> >
> > Ok, so that limits the initial 'spin' to 200 usecs.
> > That gets added to any caller-specified timeout.
> >  
> >> >> +               VAL = READ_ONCE(*__PTR), __timeout > 0; ) {     \  
> >
> > Broken indentation.
> > I'd change it back to a for (;;) loop.
> >
> > If __timeout <= 0 then the code goes through the 'timer expired'
> > path (below) on the first iteration.
> > So the extra check is just bloat.  
> 
> Yes, but by the time of the first check we've done this
> computation with it:
> >> >> +               if (unlikely(__time_end == 0))                  \
> >> >> +                       __time_end = __time_now + __timeout;    \
> >> >> +               __timeout = __time_end - __time_now;            \  

Yep, but you don't expect 0 so there is no point optimising for it.

> >> >> +               if (cond_expr)                                  \
> >> >> +                       break;                                  \
> >> >> +               cpu_poll_relax(__PTR, VAL, __itertime);         \
> >> >> +               if (++__count < __spin)                         \
> >> >> +                       continue;                               \
> >> >> +               __time_now = (s64)(time_expr_ns);               \
> >> >> +               if (unlikely(__time_end == 0))                  \
> >> >> +                       __time_end = __time_now + __timeout;    \
> >> >> +               __timeout = __time_end - __time_now;            \
> >> >> +               if (__time_now <= 0 || __timeout <= 0) {        \
> >> >> +                       VAL = READ_ONCE(*__PTR);                \
> >> >> +                       break;                                  \
> >> >> +               }                                               \  
> >
> > How about:
> > 			if (unlikely(__time_end == 0)) {
> > 				if (__time_now <= 0)
> > 					goto timed_out;
> > 				__time_end = __time_now + __timeout;
> > 			} else {
> > 				if (__time_now >= __time_end) {
> > timed_out:
> > 					VAL = READ_ONCE(*__PTR);
> > 					break;
> > 				}
> > 				__timeout = __time_end - __time_now;
> > 			}  
> 
> I had a version like that for one of the iterations. One of the problems
> with it was that needed a named goto (because the whole thing is wrapped
> in a macro). I don't tihnk the extra check is expensive enough in the
> slowpath that it's worth rewriting this code.

It is probably possible to write it all in one condition - but it would
be unreadable :-)

Actually if you require that time_expr_ns be -1 on failure and use u64
(not s64) then you can do:
	if (!__time_end)
		__time_end = __time_now + __timeout;
	if (__time_now >= __time_end) {
		VAL = READ_ONCE(*__PTR);
		break;
	}
That relies on the '+' wrapping when the initial time_expr_ns fails.

> 
> >> >> +               __itertime = __timeout % NSEC_PER_MSEC +        \
> >> >> +                               NSEC_PER_USEC;                  \  
> >
> > That seems to just be putting a bound on the timeout.
> > So the '% NSEC_PER_MSEC' could be '& ((1u << 20) - 1)'
> > replacing an expensive signed divide with a cheap mask.  
> 
> I think this is a good idea. Let me do something like that instead.
> 
> > But overall this is a lot of code to inline.  
> 
> Sure. But it's a small number of callsites (and it's a relatively niche
> interface) so I don't think inlining it is a huge problem.

Until 'next week' :-)

	David

next prev parent reply	other threads:[~2026-05-08  8:32 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-08 12:25 [PATCH v11 00/14] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
2026-04-08 12:25 ` [PATCH v11 01/14] asm-generic: barrier: Add smp_cond_load_relaxed_timeout() Ankur Arora
2026-05-06  7:30   ` Ankur Arora
2026-05-06  8:58     ` David Laight
2026-05-06 20:54       ` Ankur Arora
2026-05-07  9:57         ` David Laight
2026-05-08  6:31           ` Ankur Arora
2026-05-08  8:32             ` David Laight [this message]
2026-04-08 12:25 ` [PATCH v11 02/14] arm64: barrier: Support smp_cond_load_relaxed_timeout() Ankur Arora
2026-04-08 12:25 ` [PATCH v11 03/14] arm64/delay: move some constants out to a separate header Ankur Arora
2026-04-08 12:25 ` [PATCH v11 04/14] arm64: support WFET in smp_cond_load_relaxed_timeout() Ankur Arora
2026-04-08 12:25 ` [PATCH v11 05/14] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait() Ankur Arora
2026-04-08 12:25 ` [PATCH v11 06/14] asm-generic: barrier: Add smp_cond_load_acquire_timeout() Ankur Arora
2026-04-08 12:25 ` [PATCH v11 07/14] atomic: Add atomic_cond_read_*_timeout() Ankur Arora
2026-04-08 12:25 ` [PATCH v11 08/14] locking/atomic: scripts: build atomic_long_cond_read_*_timeout() Ankur Arora
2026-04-08 12:25 ` [PATCH v11 09/14] bpf/rqspinlock: switch check_timeout() to a clock interface Ankur Arora
2026-04-08 12:25 ` [PATCH v11 10/14] bpf/rqspinlock: Use smp_cond_load_acquire_timeout() Ankur Arora
2026-04-08 12:25 ` [PATCH v11 11/14] sched: add need-resched timed wait interface Ankur Arora
2026-04-08 12:25 ` [PATCH v11 12/14] cpuidle/poll_state: Wait for need-resched via tif_need_resched_relaxed_wait() Ankur Arora
2026-04-20 16:57   ` Okanovic, Haris
2026-04-20 17:50     ` Ankur Arora
2026-04-21  7:15       ` Catalin Marinas
2026-04-20 22:12     ` Christoph Lameter (Ampere)
2026-04-08 12:25 ` [PATCH v11 13/14] kunit: enable testing smp_cond_load_relaxed_timeout() Ankur Arora
2026-04-08 12:25 ` [PATCH v11 14/14] kunit: add tests for smp_cond_load_relaxed_timeout() Ankur Arora
2026-05-21  8:30   ` [PATCH v11 1/2] kunit: add tests for smp_cond_load_*_timeout() Ankur Arora
2026-05-21  8:30     ` [PATCH v11 2/2] kunit: add clock tests for smp_cond_load_relaxed_timeout() Ankur Arora
2026-05-21 21:46     ` [PATCH v11 1/2] kunit: add tests for smp_cond_load_*_timeout() Andrew Morton
2026-05-21 21:57       ` Mark Brown
2026-05-22  5:10       ` Ankur Arora
2026-04-23 17:16 ` [PATCH v11 00/14] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Andrew Morton
2026-04-23 19:29   ` Ankur Arora
2026-04-24 14:16     ` [PATCH v11 00/14] barrier: Add smp_cond_load_{relaxed, acquire}_timeout() Okanovic, Haris
2026-04-24 14:10   ` Okanovic, Haris
2026-04-24 14:28 ` [PATCH v11 00/14] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Andrew Morton
2026-04-24 18:10   ` Ankur Arora

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260508093206.389d9af2@pumpkin \
    --to=david.laight.linux@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=ankur.a.arora@oracle.com \
    --cc=arnd@arndb.de \
    --cc=ashok.bhat@arm.com \
    --cc=ast@kernel.org \
    --cc=boris.ostrovsky@oracle.com \
    --cc=bpf@vger.kernel.org \
    --cc=catalin.marinas@arm.com \
    --cc=cl@gentwo.org \
    --cc=daniel.lezcano@linaro.org \
    --cc=harisokn@amazon.com \
    --cc=joao.m.martins@oracle.com \
    --cc=konrad.wilk@oracle.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=memxor@gmail.com \
    --cc=peterz@infradead.org \
    --cc=rafael@kernel.org \
    --cc=rdunlap@infradead.org \
    --cc=will@kernel.org \
    --cc=xueshuai@linux.alibaba.com \
    --cc=zhenglifeng1@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox