Re: [PATCH v11 01/14] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()

The Linux Kernel Mailing List

From: Ankur Arora <ankur.a.arora@oracle.com>
To: David Laight <david.laight.linux@gmail.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-pm@vger.kernel.org,
	bpf@vger.kernel.org, arnd@arndb.de, catalin.marinas@arm.com,
	will@kernel.org, peterz@infradead.org, akpm@linux-foundation.org,
	mark.rutland@arm.com, harisokn@amazon.com, cl@gentwo.org,
	ast@kernel.org, rafael@kernel.org, daniel.lezcano@linaro.org,
	memxor@gmail.com, zhenglifeng1@huawei.com,
	xueshuai@linux.alibaba.com, rdunlap@infradead.org,
	joao.m.martins@oracle.com, boris.ostrovsky@oracle.com,
	konrad.wilk@oracle.com, ashok.bhat@arm.com
Subject: Re: [PATCH v11 01/14] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
Date: Thu, 07 May 2026 23:31:20 -0700
Message-ID: <87lddujttz.fsf@oracle.com>
In-Reply-To: <20260507105721.66ba1e45@pumpkin>


David Laight <david.laight.linux@gmail.com> writes:

> On Wed, 06 May 2026 13:54:06 -0700
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> David Laight <david.laight.linux@gmail.com> writes:
>>
>> > On Wed, 06 May 2026 00:30:29 -0700
>> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >
>> >> Ankur Arora <ankur.a.arora@oracle.com> writes:
>> >>
>> >> > Add smp_cond_load_relaxed_timeout(), which extends
>> >> > smp_cond_load_relaxed() to allow waiting for a duration.
>> >> >
>> >> > We loop around waiting for the condition variable to change while
>> >> > periodically doing a time-check. The loop uses cpu_poll_relax() to slow
>> >> > down the busy-wait, which, unless overridden by the architecture
>> >> > code, amounts to a cpu_relax().
>> >> >
>> >> > Note that there are two ways for the time-check to fail: the timeout
>> >> > case or, @time_expr_ns returning an invalid value (negative or zero).
>> >> > The second failure mode allows for clocks attached to the clock-domain
>> >> > of @cond_expr --  which might cease to operate meaningfully once some
>> >> > state internal to @cond_expr has changed -- to fail.
>> >> >
>> >> > Evaluation of @time_expr_ns: in the fastpath we want to keep the
>> >> > performance close to smp_cond_load_relaxed(). So defer evaluation
>> >> > of the potentially costly @time_expr_ns to the slowpath.
>> >> >
>> >> > This also means that there will always be some hardware dependent
>> >> > duration that has passed in cpu_poll_relax() iterations at the time
>> >> > of first evaluation. Additionally cpu_poll_relax() is not guaranteed
>> >> > to return at timeout boundary. In sum, expect timeout overshoot when
>> >> > we exit due to expiration of the timeout.
>> >> >
>> >> > The number of spin iterations before time-check, SMP_TIMEOUT_POLL_COUNT
>> >> > is chosen to be 200 by default. With a cpu_poll_relax() iteration
>> >> > taking ~20-30 cycles (measured on a variety of x86 platforms), we
>> >> > expect a time-check every ~4000-6000 cycles.
>> >> >
>> >> > The outer limit of the overshoot is double that when working with the
>> >> > parameters above. This might be higher or lower depending on the
>> >> > implementation of cpu_poll_relax() across architectures.
>> >> >
>> >> > Lastly, config option ARCH_HAS_CPU_RELAX indicates availability of a
>> >> > cpu_poll_relax() that is cheaper than polling. This might be relevant
>> >> > for cases with a long timeout.
>> >> >
>> >> > Cc: Arnd Bergmann <arnd@arndb.de>
>> >> > Cc: Will Deacon <will@kernel.org>
>> >> > Cc: Catalin Marinas <catalin.marinas@arm.com>
>> >> > Cc: Peter Zijlstra <peterz@infradead.org>
>> >> > Cc: linux-arch@vger.kernel.org
>> >> > Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>> >> > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> >> > ---
>> >> > Notes:
>> >> >    - add a comment mentioning that smp_cond_load_relaxed_timeout() might
>> >> >      be using architectural primitives that don't support MMIO.
>> >> >      (David Laight, Catalin Marinas)
>> >> >
>> >> >  include/asm-generic/barrier.h | 69 +++++++++++++++++++++++++++++++++++
>> >> >  1 file changed, 69 insertions(+)
>> >> >
>> >> > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
>> >> > index d4f581c1e21d..e5a6a1c04649 100644
>> >> > --- a/include/asm-generic/barrier.h
>> >> > +++ b/include/asm-generic/barrier.h
>> >> > @@ -273,6 +273,75 @@ do {									\
>> >> >  })
>> >> >  #endif
>> >> >
>> >> > +/*
>> >> > + * Number of times we iterate in the loop before doing the time check.
>> >> > + * Note that the iteration count assumes that the loop condition is
>> >> > + * relatively cheap.
>> >> > + */
>> >> > +#ifndef SMP_TIMEOUT_POLL_COUNT
>> >> > +#define SMP_TIMEOUT_POLL_COUNT		200
>> >> > +#endif
>> >> > +
>> >> > +/*
>> >> > + * Platforms with ARCH_HAS_CPU_RELAX have a cpu_poll_relax() implementation
>> >> > + * that is expected to be cheaper (lower power) than pure polling.
>> >> > + */
>> >> > +#ifndef cpu_poll_relax
>> >> > +#define cpu_poll_relax(ptr, val, timeout_ns)	cpu_relax()
>> >> > +#endif
>> >> > +
>> >> > +/**
>> >> > + * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
>> >> > + * guarantees until a timeout expires.
>> >> > + * @ptr: pointer to the variable to wait on.
>> >> > + * @cond_expr: boolean expression to wait for.
>> >> > + * @time_expr_ns: expression that evaluates to monotonic time (in ns) or,
>> >> > + *  on failure, returns a negative value.
>> >> > + * @timeout_ns: timeout value in ns
>> >> > + * Both of the above are assumed to be compatible with s64; the signed
>> >> > + * value is used to handle the failure case in @time_expr_ns.
>> >> > + *
>> >> > + * Equivalent to using READ_ONCE() on the condition variable.
>> >> > + *
>> >> > + * Callers that expect to wait for prolonged durations might want
>> >> > + * to take into account the availability of ARCH_HAS_CPU_RELAX.
>> >> > + *
>> >> > + * Note that @ptr is expected to point to a memory address. Using this
>> >> > + * interface with MMIO will be slower (since SMP_TIMEOUT_POLL_COUNT is
>> >> > + * tuned for memory) and might also break in interesting architecture
>> >> > + * dependent ways.
>> >> > + */
>> >> > +#ifndef smp_cond_load_relaxed_timeout
>> >> > +#define smp_cond_load_relaxed_timeout(ptr, cond_expr,			\
>> >> > +				      time_expr_ns, timeout_ns)		\
>> >> > +({									\
>> >> > +	typeof(ptr) __PTR = (ptr);					\
>
> 		auto __PTR = ptr;
>
>> >> > +	__unqual_scalar_typeof(*ptr) VAL;				\
>
> It can't matter if integer promotions happen before assigning to VAL.
> So something like:
> 		auto VAL = 1 ? 0 : *__PTR + 0;
> will generate a suitable writable variable.
> (The '+ 0' is needed because some versions of gcc incorrectly propagate
> 'const'.)

Thanks. This is useful to know. However, we use the unqualified typeof
idiom all over barrier.h, and I didn't really see the need to depart from
that.
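
For reference, a small userspace sketch of the promotion trick suggested
above. It assumes GCC or Clang (__auto_type is a compiler extension), the
function names are made up for illustration, and none of this is the
barrier.h code. It also shows the one visible difference from the
unqualified-typeof idiom: narrow types get promoted to int.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustration only (hypothetical helper): read through a const volatile
 * pointer into a writable local. The conditional expression has the
 * promoted, unqualified type of *p, and *p is not evaluated by it.
 */
static int load_via_promotion(const volatile int *p)
{
	assert(p);

	__auto_type val = 1 ? 0 : *p + 0;	/* plain int, initialised to 0 */

	val = *p;				/* assignable: val is not const */
	return val;
}

/*
 * Illustration only (hypothetical helper): for a narrow type the local
 * ends up as int, not unsigned char, because of integer promotion.
 */
static size_t promoted_size_u8(const volatile unsigned char *p)
{
	assert(p);

	__auto_type val = 1 ? 0 : *p + 0;

	return sizeof(val);
}
```

Calling load_via_promotion() on a const volatile int returns its value, and
promoted_size_u8() reports sizeof(int) rather than 1.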

>> >> > +	u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT;			\
>> >> > +	s64 __timeout = (s64)timeout_ns;				\
>
> The (s64) cast can only hide errors.
>
>> >> > +	s64 __time_now, __time_end = 0;					\
>> >> > +									\
>> >> > +	for (;;) {							\
>> >> > +		VAL = READ_ONCE(*__PTR);				\
>> >> > +		if (cond_expr)						\
>> >> > +			break;						\
>> >> > +		cpu_poll_relax(__PTR, VAL, (u64)__timeout);		\
>
> That doesn't look right, __timeout is relative; if the underlying code
> uses the timeout then the code delays for 200 * timeout_ns before even
> looking at the actual time.
>
> If you want to spin then you may not even want the cpu_relax() at all.
> (Or with a very short timeout as in the version below.)

Yeah, BPF uses this in the fastpath, where we want to avoid looking at
the clock. Overshooting the deadline was a minor problem in comparison.

But I agree the version below with the shorter timeout works better.
Unfortunately it doesn't help on arm64 if we are using WFE.
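
To make the clock-read versus overshoot trade-off concrete, here is a rough
userspace model of the v11 loop. It assumes cpu_poll_relax() ignores its
timeout argument (the cpu_relax() case) and costs a fixed amount of time per
call, and that the condition never becomes true. struct fake_clock,
clock_read() and model_wait() are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

#define POLL_COUNT 200	/* mirrors SMP_TIMEOUT_POLL_COUNT */

struct fake_clock {
	int64_t now_ns;		/* current fake time */
	int64_t relax_ns;	/* time charged per cpu_poll_relax() call */
	unsigned int reads;	/* how often the clock was read */
};

/* Stand-in for time_expr_ns. */
static int64_t clock_read(struct fake_clock *c)
{
	c->reads++;
	return c->now_ns;
}

/*
 * Model of the wait loop when the condition stays false.
 * Returns the fake time at which the loop gives up.
 */
static int64_t model_wait(struct fake_clock *c, int64_t timeout_ns)
{
	uint32_t n = 0;
	int64_t now, end = 0, timeout = timeout_ns;

	assert(timeout_ns > 0);		/* caller passes a sane timeout */

	for (;;) {
		c->now_ns += c->relax_ns;	/* cpu_poll_relax() */
		if (++n < POLL_COUNT)
			continue;		/* fastpath: no clock read */
		now = clock_read(c);
		if (end == 0)
			end = now + timeout;	/* deadline armed late */
		timeout = end - now;
		if (now <= 0 || timeout <= 0)
			return now;
		n = 0;
	}
}
```

With 10ns per iteration, a 5000ns timeout and a start time of 1000, the
model reads the clock four times and gives up at t=9000, that is 8000ns
after entry: one batch of 200 iterations before the deadline is armed plus
part of one more batch past it.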

>> >> > +		if (++__n < __spin)					\
>> >> > +			continue;					\
>> >> > +		__time_now = (s64)(time_expr_ns);			\
>
> Another cast that can only hide bugs.
>
>> >> > +		if (unlikely(__time_end == 0))				\
>> >> > +			__time_end = __time_now + __timeout;		\
>> >> > +		__timeout = __time_end - __time_now;			\
>> >> > +		if (__time_now <= 0 || __timeout <= 0) {		\
>> >> > +			VAL = READ_ONCE(*__PTR);			\
>> >> > +			break;						\
>> >> > +		}							\
>> >> > +		__n = 0;						\
>
> Resetting the spin count doesn't look right at all.
> In principle the code will delay for 200 * __timeout.
> Possibly the earlier check should be:
> 			if (__n < __spin) {
> 				__n++;
> 				continue;
> 			}
> (Or just don't worry that the code will spin again after 4M loops.)
> The problem you have is that if cpu_poll_relax() ignores the timeout you
> probably want to spin 'for a bit' in code that only accesses local data
> (in particular avoiding evaluating cond_expr or time_expr_ns).

Yeah we do avoid evaluating the time_expr_ns. And I agree we don't want
to hammer the cond_expr but the cpu_relax() should help with that.
(In my measurements I see an IPC of ~0.05 in a cpu_relax() loop of this
kind.)

>> >> > +	}								\
>> >> > +	(typeof(*ptr))VAL;						\
>
> That cast is pointless; the return value will be subject to 'integer promotion'
> and converted to a rvalue - which removes any const/volatile qualifiers.
>
>> >> > +})
>> >> > +#endif
>> >> > +
>> >>
>> >> A cluster of issues that got flagged by sashiko was around timeout_ns
>> >> being specified as s64 and a bunch of potential edge cases around
>> >> that.
>> >>
>> >> These were mostly caused by an implicit assumption in the code that
>> >> the timeout specified by the caller is generally reasonable. So, way
>> >> below S64_MAX, not 0 etc.
>> >
>> > There are plenty of ways kernel code can break things.
>> > Provided this code doesn't itself overwrite anywhere (rather than
>> > just loop forever or return immediately etc) I'd be tempted to
>> > just document the valid range rather than slow everything down
>> > with the extra tests.
>>
>> I don't disagree. In this case, however, it's somewhat borderline.
>>
>> On the pro side, we get rid of some of the implicit type conversions
>> and assumptions around those.
>>
>> On the negative, it adds an extra modulo operation in the slow path.
>> And, the for loop is structured a little differently from the usual
>> version.
>>
>> On balance, I think this is a good change if only because it makes
>> the types a little more explicit.
>>
>> Ankur
>>
>> > 	David
>> >
>> >>
>> >> I think this is worth cleaning up a bit. The change is mostly around
>> >> introducing a u32 __itertime and explicitly computing the waiting time.
>> >> And adding a check to ensure that we start with a valid value.
>> >>
>> >> This does make the implementation a little more involved. So just wanted
>> >> to see if people have any opinions on this?
>> >>
>> >> +#ifndef smp_cond_load_relaxed_timeout
>> >> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr,          \
>> >> +                                     time_expr_ns, timeout_ns) \
>> >> +({                                                             \
>> >> +       typeof(ptr) __PTR = (ptr);                              \
>> >> +       __unqual_scalar_typeof(*(ptr)) VAL;                     \
>> >> +       u32 __count = 0, __spin = SMP_TIMEOUT_POLL_COUNT;       \
>> >> +       s64 __timeout = (s64)(timeout_ns);                      \
>> >> +       s64 __time_now, __time_end = 0;                         \
>> >> +       u32 __maybe_unused __itertime;                          \
>> >> +                                                               \
>> >> +       for (__itertime = NSEC_PER_USEC;                        \
>
> Ok, so that limits the initial 'spin' to 200 usecs.
> That gets added to any caller-specified timeout.
>
>> >> +               VAL = READ_ONCE(*__PTR), __timeout > 0; ) {     \
>
> Broken indentation.
> I'd change it back to a for (;;) loop.
>
> If __timeout <= 0 then the code goes through the 'timer expired'
> path (below) on the first iteration.
> So the extra check is just bloat.

Yes, but by the time of the first check we've done this
computation with it:
>> >> +               if (unlikely(__time_end == 0))                  \
>> >> +                       __time_end = __time_now + __timeout;    \
>> >> +               __timeout = __time_end - __time_now;            \



>> >> +               if (cond_expr)                                  \
>> >> +                       break;                                  \
>> >> +               cpu_poll_relax(__PTR, VAL, __itertime);         \
>> >> +               if (++__count < __spin)                         \
>> >> +                       continue;                               \
>> >> +               __time_now = (s64)(time_expr_ns);               \
>> >> +               if (unlikely(__time_end == 0))                  \
>> >> +                       __time_end = __time_now + __timeout;    \
>> >> +               __timeout = __time_end - __time_now;            \
>> >> +               if (__time_now <= 0 || __timeout <= 0) {        \
>> >> +                       VAL = READ_ONCE(*__PTR);                \
>> >> +                       break;                                  \
>> >> +               }                                               \
>
> How about:
> 			if (unlikely(__time_end == 0)) {
> 				if (__time_now <= 0)
> 					goto timed_out;
> 				__time_end = __time_now + __timeout;
> 			} else {
> 				if (__time_now >= __time_end) {
> timed_out:
> 					VAL = READ_ONCE(*__PTR);
> 					break;
> 				}
> 				__timeout = __time_end - __time_now;
> 			}

I had a version like that for one of the iterations. One of the problems
with it was that it needed a named goto label (because the whole thing is
wrapped in a macro). I don't think the extra check is expensive enough in
the slowpath that it's worth rewriting this code.
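
A side-by-side sketch of the two shapes of the time check, written as plain
userspace functions with made-up names (time_check() and
time_check_branchy() are not part of the patch). As functions the early
exits become ordinary returns; inside the statement-expression macro that
option does not exist, which is where the label comes from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Shape used in the patch: always recompute the remaining time, then do
 * one combined test. *end == 0 means the deadline is not armed yet.
 * Returns true when the wait should stop.
 */
static bool time_check(int64_t now, int64_t *end, int64_t *timeout)
{
	assert(end && timeout);

	if (*end == 0)
		*end = now + *timeout;
	*timeout = *end - now;
	return now <= 0 || *timeout <= 0;
}

/*
 * Shape suggested in review: separate the first check from later ones.
 * Each "return true" here corresponds to the timed_out label.
 */
static bool time_check_branchy(int64_t now, int64_t *end, int64_t *timeout)
{
	assert(end && timeout);

	if (*end == 0) {
		if (now <= 0)
			return true;	/* clock failed before arming */
		*end = now + *timeout;
		return false;
	}
	if (now >= *end)
		return true;		/* deadline passed */
	*timeout = *end - now;
	return false;
}
```

For a positive timeout and a clock that stays valid, both variants arm the
same deadline and stop at the same reading. They differ in that the second
one does not re-test for an invalid clock once the deadline is armed.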

>> >> +               __itertime = __timeout % NSEC_PER_MSEC +        \
>> >> +                               NSEC_PER_USEC;                  \
>
> That seems to just be putting a bound on the timeout.
> So the '% NSEC_PER_MSEC' could be '& ((1u << 20) - 1)'
> replacing an expensive signed divide with a cheap mask.

I think this is a good idea. Let me do something like that instead.
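
A quick userspace comparison of the two bounds, as a sketch only. The
constants are redefined locally and itertime_mod() and itertime_mask() are
hypothetical helper names. The mask caps the value at 2^20 - 1 ns (about
1.05ms) instead of just under 1ms, and the two agree for any remaining
timeout below 1ms.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC	1000
#define NSEC_PER_MSEC	1000000

/* Bound from the posted patch: signed 64-bit modulo. */
static uint32_t itertime_mod(int64_t timeout_ns)
{
	assert(timeout_ns > 0);
	return (uint32_t)(timeout_ns % NSEC_PER_MSEC) + NSEC_PER_USEC;
}

/* Bound suggested in review: keep the low 20 bits. */
static uint32_t itertime_mask(int64_t timeout_ns)
{
	assert(timeout_ns > 0);
	return (uint32_t)(timeout_ns & ((1u << 20) - 1)) + NSEC_PER_USEC;
}
```

For example, a remaining timeout of 2500ns gives 3500 with either helper,
while 5000000ns gives 1000 with the modulo and 806696 with the mask. Either
way the result stays bounded regardless of how large the remaining timeout
is.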

> But overall this is a lot of code to inline.

Sure. But it's a small number of callsites (and it's a relatively niche
interface) so I don't think inlining it is a huge problem.

> It has to be possible to get it down to something like:
> 	struct info info = { .tmo_ns = timeout_ns };
> 	for (;;) {
> 		VAL = READ_ONCE(PTR);
> 		if (cond_expr)
> 			break;
> 		if (_smp_cond_load_relaxed_timeout(&info, PTR, VAL))
> 			break;
> 	}
> 	VAL;
> (yes, I know it isn't that simple because the arm 'relax' code
> has a re-read in it so needs to know the size.)

Yeah and as you say above we want to minimize hammering on the cond_expr
and the time_expr_ns (but only on platforms without an event based wait).

So, we'll end up with similar issues inside this
__smp_cond_load_relaxed_timeout().


Ankur

> -- David
>
>
>> >> +               __count = 0;                                    \
>> >> +       }                                                       \
>> >> +       (typeof(*(ptr)))VAL;                                    \
>> >> +})
>> >> +#endif
>> >>
>> >> Thanks
>> >>
>> >> --
>> >> ankur
>> >>
>>
>>
>> --
>> ankur
>>


--
ankur

Thread overview: 6+ messages
     [not found] <20260408122538.3610871-1-ankur.a.arora@oracle.com>
     [not found] ` <20260408122538.3610871-2-ankur.a.arora@oracle.com>
2026-05-06  7:30   ` [PATCH v11 01/14] asm-generic: barrier: Add smp_cond_load_relaxed_timeout() Ankur Arora
2026-05-06  8:58     ` David Laight
2026-05-06 20:54       ` Ankur Arora
2026-05-07  9:57         ` David Laight
2026-05-08  6:31           ` Ankur Arora [this message]
2026-05-08  8:32             ` David Laight
