* [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
2025-10-28 5:31 [RESEND PATCH v7 0/7] barrier: Add smp_cond_load_*_timeout() Ankur Arora
@ 2025-10-28 5:31 ` Ankur Arora
2025-10-28 9:42 ` Arnd Bergmann
2025-10-28 16:13 ` Christoph Lameter (Ampere)
2025-10-28 5:31 ` [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout() Ankur Arora
` (5 subsequent siblings)
6 siblings, 2 replies; 32+ messages in thread
From: Ankur Arora @ 2025-10-28 5:31 UTC (permalink / raw)
To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
joao.m.martins, boris.ostrovsky, konrad.wilk
Add smp_cond_load_relaxed_timeout(), which extends
smp_cond_load_relaxed() to allow waiting for a duration.
The waiting loop uses cpu_poll_relax() to wait on the condition
variable with a periodic evaluation of a time-check.
Unless overridden by the arch code, cpu_poll_relax() amounts to
a cpu_relax().
The number of times we spin before evaluating the time-check is defined
by SMP_TIMEOUT_POLL_COUNT (200 by default). Assuming each
cpu_poll_relax() iteration takes around 20-30 cycles (measured on a
variety of x86 platforms), that amounts to ~4000-6000 cycles between
time-checks.
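For example, a caller might wait on a flag with a deadline derived from
local_clock() (an illustrative sketch only, not taken from this series;
'state' and the 100us budget are made up):
	u64 deadline = local_clock() + 100 * NSEC_PER_USEC;	/* ~100us budget */
	unsigned long v;
	/* Returns the last observed value; on timeout 'state' may still be 0. */
	v = smp_cond_load_relaxed_timeout(&state, VAL != 0,
					  local_clock() >= deadline);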
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/asm-generic/barrier.h | 41 +++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..0063b46ec065 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,47 @@ do { \
})
#endif
+#ifndef SMP_TIMEOUT_POLL_COUNT
+#define SMP_TIMEOUT_POLL_COUNT 200
+#endif
+
+#ifndef cpu_poll_relax
+#define cpu_poll_relax(ptr, val) cpu_relax()
+#endif
+
+/**
+ * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
+ * guarantees until a timeout expires.
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ * @time_check_expr: expression to decide when to bail out
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ */
+#ifndef smp_cond_load_relaxed_timeout
+#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr) \
+({ \
+ typeof(ptr) __PTR = (ptr); \
+ __unqual_scalar_typeof(*ptr) VAL; \
+ u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT; \
+ \
+ for (;;) { \
+ VAL = READ_ONCE(*__PTR); \
+ if (cond_expr) \
+ break; \
+ cpu_poll_relax(__PTR, VAL); \
+ if (++__n < __spin) \
+ continue; \
+ if (time_check_expr) { \
+ VAL = READ_ONCE(*__PTR); \
+ break; \
+ } \
+ __n = 0; \
+ } \
+ (typeof(*ptr))VAL; \
+})
+#endif
+
/*
* pmem_wmb() ensures that all stores for which the modification
* are written to persistent storage by preceding instructions have
--
2.43.5
* Re: [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
2025-10-28 5:31 ` [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout() Ankur Arora
@ 2025-10-28 9:42 ` Arnd Bergmann
2025-10-29 3:17 ` Ankur Arora
2025-10-28 16:13 ` Christoph Lameter (Ampere)
1 sibling, 1 reply; 32+ messages in thread
From: Arnd Bergmann @ 2025-10-28 9:42 UTC (permalink / raw)
To: Ankur Arora, linux-kernel, Linux-Arch, linux-arm-kernel, linux-pm,
bpf
Cc: Catalin Marinas, Will Deacon, Peter Zijlstra, Andrew Morton,
Mark Rutland, Haris Okanovic, Christoph Lameter (Ampere),
Alexei Starovoitov, Rafael J . Wysocki, Daniel Lezcano,
Kumar Kartikeya Dwivedi, zhenglifeng1, xueshuai, Joao Martins,
Boris Ostrovsky, Konrad Rzeszutek Wilk
On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
> + */
> +#ifndef smp_cond_load_relaxed_timeout
> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr) \
> +({ \
> + typeof(ptr) __PTR = (ptr); \
> + __unqual_scalar_typeof(*ptr) VAL; \
> + u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT; \
> + \
> + for (;;) { \
> + VAL = READ_ONCE(*__PTR); \
> + if (cond_expr) \
> + break; \
> + cpu_poll_relax(__PTR, VAL); \
> + if (++__n < __spin) \
> + continue; \
> + if (time_check_expr) { \
> + VAL = READ_ONCE(*__PTR); \
> + break; \
> + } \
> + __n = 0; \
> + } \
> + (typeof(*ptr))VAL; \
> +})
> +#endif
I'm trying to think of ideas for how this would be done on arm64
with FEAT_WFXT in a way that doesn't hurt other architectures.
The best idea I've come up with is to change that inner loop
to combine the cpu_poll_relax() with the timecheck and then
define the 'time_check_expr' so it has to return an approximate
(ceiling) number of nanoseconds of remaining time or zero if
expired.
The FEAT_WFXT version would then look something like
static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
{
unsigned long tmp;
asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
: "=&r" (tmp), "+Q" (*ptr)
: "r" (val), "r" (ns));
}
#define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK) \
({ \
u64 __t = TIMECHECK;
if (__t)
__cmpwait_u64_timeout(__PTR, VAL, __t);
})
while the 'wfe' version would continue to do the timecheck after the
wait.
I have two lesser concerns with the generic definition here:
- having both a timeout and a spin counter in the same loop
feels redundant and error-prone, as the behavior in practice
would likely depend a lot on the platform. What is the reason
for keeping the counter if we already have a fixed timeout
condition?
- I generally dislike the type-agnostic macros like this one,
it adds a lot of extra complexity here that I feel can be
completely avoided if we make explicitly 32-bit and 64-bit
wide versions of these macros. We probably won't be able
to resolve this as part of your series, but ideally I'd like
have explicitly-typed versions of cmpxchg(), smp_load_acquire()
and all the related ones, the same way we do for atomic_*()
and atomic64_*().
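Just to sketch the shape of that idea (these helpers are hypothetical,
not existing API):
static __always_inline u32 smp_load_acquire_u32(u32 *p)
{
	/* Fixed-width type: no __unqual_scalar_typeof() dance needed. */
	return smp_load_acquire(p);
}
static __always_inline u64 smp_load_acquire_u64(u64 *p)
{
	return smp_load_acquire(p);
}
with arch code then free to implement each width directly rather than
going through one type-dispatching macro.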
Arnd
* Re: [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
2025-10-28 9:42 ` Arnd Bergmann
@ 2025-10-29 3:17 ` Ankur Arora
2025-11-02 21:52 ` Arnd Bergmann
0 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2025-10-29 3:17 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Ankur Arora, linux-kernel, Linux-Arch, linux-arm-kernel, linux-pm,
bpf, Catalin Marinas, Will Deacon, Peter Zijlstra, Andrew Morton,
Mark Rutland, Haris Okanovic, Christoph Lameter (Ampere),
Alexei Starovoitov, Rafael J . Wysocki, Daniel Lezcano,
Kumar Kartikeya Dwivedi, zhenglifeng1, xueshuai, Joao Martins,
Boris Ostrovsky, Konrad Rzeszutek Wilk
Arnd Bergmann <arnd@arndb.de> writes:
> On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>
>> + */
>> +#ifndef smp_cond_load_relaxed_timeout
>> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr) \
>> +({ \
>> + typeof(ptr) __PTR = (ptr); \
>> + __unqual_scalar_typeof(*ptr) VAL; \
>> + u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT; \
>> + \
>> + for (;;) { \
>> + VAL = READ_ONCE(*__PTR); \
>> + if (cond_expr) \
>> + break; \
>> + cpu_poll_relax(__PTR, VAL); \
>> + if (++__n < __spin) \
>> + continue; \
>> + if (time_check_expr) { \
>> + VAL = READ_ONCE(*__PTR); \
>> + break; \
>> + } \
>> + __n = 0; \
>> + } \
>> + (typeof(*ptr))VAL; \
>> +})
>> +#endif
>
> I'm trying to think of ideas for how this would be done on arm64
> with FEAT_WFXT in a way that doesn't hurt other architectures.
>
> The best idea I've come up with is to change that inner loop
> to combine the cpu_poll_relax() with the timecheck and then
> define the 'time_check_expr' so it has to return an approximate
> (ceiling) number of nanoseconds of remaining time or zero if
> expired.
Agree that it's a pretty good idea :). I came up with something pretty
similar. Though that had taken a bunch of iterations.
> The FEAT_WFXT version would then look something like
>
> static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
> {
> unsigned long tmp;
> asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
> : "=&r" (tmp), "+Q" (*ptr)
> : "r" (val), "r" (ns));
> }
> #define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK) \
> ({ \
> u64 __t = TIMECHECK;
> if (__t)
> __cmpwait_u64_timeout(__PTR, VAL, __t);
> })
>
> while the 'wfe' version would continue to do the timecheck after the
> wait.
I think this is a good way to do it if we need the precision
at some point in the future.
> I have two lesser concerns with the generic definition here:
>
> - having both a timeout and a spin counter in the same loop
> feels redundant and error-prone, as the behavior in practice
> would likely depend a lot on the platform. What is the reason
> for keeping the counter if we already have a fixed timeout
> condition?
The main reason was that the time check is expensive in power terms.
Which is fine for platforms with a WFE like primitive but others
want to do the time check only infrequently. That's why poll_idle()
introduced a rate limit on polling (which the generic definition
reused here.)
commit 4dc2375c1a4e88ed2701f6961e0e4f9a7696ad3c
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date: Tue Mar 27 23:58:45 2018 +0200
cpuidle: poll_state: Avoid invoking local_clock() too often
Rik reports that he sees an increase in CPU use in one benchmark
due to commit 612f1a22f067 "cpuidle: poll_state: Add time limit to
poll_idle()" that caused poll_idle() to call local_clock() in every
iteration of the loop. Utilization increase generally means more
non-idle time with respect to total CPU time (on the average) which
implies reduced CPU frequency.
Doug reports that limiting the rate of local_clock() invocations
in there causes much less power to be drawn during a CPU-intensive
parallel workload (with idle states 1 and 2 disabled to enforce more
state 0 residency).
These two reports together suggest that executing local_clock() on
multiple CPUs in parallel at a high rate may cause chips to get hot
and trigger thermal/power limits on them to kick in, so reduce the
rate of local_clock() invocations in poll_idle() to avoid that issue.
> - I generally dislike the type-agnostic macros like this one,
> it adds a lot of extra complexity here that I feel can be
> completely avoided if we make explicitly 32-bit and 64-bit
> wide versions of these macros. We probably won't be able
> to resolve this as part of your series, but ideally I'd like
> have explicitly-typed versions of cmpxchg(), smp_load_acquire()
> and all the related ones, the same way we do for atomic_*()
> and atomic64_*().
Ah. And the caller uses say smp_load_acquire_long() or whatever, and
that resolves to whatever makes sense for the arch.
The __unqual_scalar_typeof() does look pretty ugly when looking at the
preprocessed version but other than that smp_cond_load() etc look
pretty straightforward. Just for my curiosity, could you elaborate on
the complexity?
--
ankur
* Re: [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
2025-10-29 3:17 ` Ankur Arora
@ 2025-11-02 21:52 ` Arnd Bergmann
2025-11-03 21:41 ` Ankur Arora
0 siblings, 1 reply; 32+ messages in thread
From: Arnd Bergmann @ 2025-11-02 21:52 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, Linux-Arch, linux-arm-kernel, linux-pm, bpf,
Catalin Marinas, Will Deacon, Peter Zijlstra, Andrew Morton,
Mark Rutland, Haris Okanovic, Christoph Lameter (Ampere),
Alexei Starovoitov, Rafael J . Wysocki, Daniel Lezcano,
Kumar Kartikeya Dwivedi, zhenglifeng1, xueshuai, Joao Martins,
Boris Ostrovsky, Konrad Rzeszutek Wilk
On Wed, Oct 29, 2025, at 04:17, Ankur Arora wrote:
> Arnd Bergmann <arnd@arndb.de> writes:
>> On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>> The FEAT_WFXT version would then look something like
>>
>> static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
>> {
>> unsigned long tmp;
>> asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
>> : "=&r" (tmp), "+Q" (*ptr)
>> : "r" (val), "r" (ns));
>> }
>> #define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK) \
>> ({ \
>> u64 __t = TIMECHECK;
>> if (__t)
>> __cmpwait_u64_timeout(__PTR, VAL, __t);
>> })
>>
>> while the 'wfe' version would continue to do the timecheck after the
>> wait.
>
> I think this is a good way to do it if we need the precision
> at some point in the future.
I'm sorry I wrote this in a confusing way, but I really don't
care about precision here. For power-saving reasons I
would like to use a wfe/wfet based wait here (like you do)
while at the same time being able to turn off the event
stream entirely as long as FEAT_WFXT is available, saving
additional power.
>> I have two lesser concerns with the generic definition here:
>>
>> - having both a timeout and a spin counter in the same loop
>> feels redundant and error-prone, as the behavior in practice
>> would likely depend a lot on the platform. What is the reason
>> for keeping the counter if we already have a fixed timeout
>> condition?
>
> The main reason was that the time check is expensive in power terms.
> Which is fine for platforms with a WFE like primitive but others
> want to do the time check only infrequently. That's why poll_idle()
> introduced a rate limit on polling (which the generic definition
> reused here.)
Right, I misunderstood how this works, so this part is fine.
>> - I generally dislike the type-agnostic macros like this one,
>> it adds a lot of extra complexity here that I feel can be
>> completely avoided if we make explicitly 32-bit and 64-bit
>> wide versions of these macros. We probably won't be able
>> to resolve this as part of your series, but ideally I'd like
>> have explicitly-typed versions of cmpxchg(), smp_load_acquire()
>> and all the related ones, the same way we do for atomic_*()
>> and atomic64_*().
>
> Ah. And the caller uses say smp_load_acquire_long() or whatever, and
> that resolves to whatever makes sense for the arch.
>
> The __unqual_scalar_typeof() does look pretty ugly when looking at the
> preprocessed version but other than that smp_cond_load() etc look
> pretty straightforward. Just for my curiosity, could you elaborate on
> the complexity?
The nested macros with __unqual_scalar_typeof() make the
preprocessed version completely unreadable, especially when
combined with other sets of complex macros like min()/max(),
tracepoints or arm64 atomics.
If something goes wrong inside of it, it becomes rather
hard for people to debug or even read the compiler warnings.
Since I do a lot of build testing, I do tend to run into those
more than others do.
I've also seen cases where excessively complex macros get
nested to the point of slowing down the kernel build, which
can happen once the nesting expands to megabytes of source
code.
Arnd
* Re: [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
2025-11-02 21:52 ` Arnd Bergmann
@ 2025-11-03 21:41 ` Ankur Arora
0 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-11-03 21:41 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Ankur Arora, linux-kernel, Linux-Arch, linux-arm-kernel, linux-pm,
bpf, Catalin Marinas, Will Deacon, Peter Zijlstra, Andrew Morton,
Mark Rutland, Haris Okanovic, Christoph Lameter (Ampere),
Alexei Starovoitov, Rafael J . Wysocki, Daniel Lezcano,
Kumar Kartikeya Dwivedi, zhenglifeng1, xueshuai, Joao Martins,
Boris Ostrovsky, Konrad Rzeszutek Wilk
Arnd Bergmann <arnd@arndb.de> writes:
> On Wed, Oct 29, 2025, at 04:17, Ankur Arora wrote:
>> Arnd Bergmann <arnd@arndb.de> writes:
>>> On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>>> The FEAT_WFXT version would then look something like
>>>
>>> static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
>>> {
>>> unsigned long tmp;
>>> asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
>>> : "=&r" (tmp), "+Q" (*ptr)
>>> : "r" (val), "r" (ns));
>>> }
>>> #define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK) \
>>> ({ \
>>> u64 __t = TIMECHECK;
>>> if (__t)
>>> __cmpwait_u64_timeout(__PTR, VAL, __t);
>>> })
>>>
>>> while the 'wfe' version would continue to do the timecheck after the
>>> wait.
>>
>> I think this is a good way to do it if we need the precision
>> at some point in the future.
>
> I'm sorry I wrote this in a confusing way, but I really don't
> care about precision here, but for power savings reasons I
> would like to use a wfe/wfet based wait here (like you do)
> while at the same time being able to turn off the event
> stream entirely as long as FEAR_WFXT is available, saving
> additional power.
>
>>> I have two lesser concerns with the generic definition here:
>>>
>>> - having both a timeout and a spin counter in the same loop
>>> feels redundant and error-prone, as the behavior in practice
>>> would likely depend a lot on the platform. What is the reason
>>> for keeping the counter if we already have a fixed timeout
>>> condition?
>>
>> The main reason was that the time check is expensive in power terms.
>> Which is fine for platforms with a WFE like primitive but others
>> want to do the time check only infrequently. That's why poll_idle()
>> introduced a rate limit on polling (which the generic definition
>> reused here.)
>
> Right, I misunderstood how this works, so this part is fine.
>
>>> - I generally dislike the type-agnostic macros like this one,
>>> it adds a lot of extra complexity here that I feel can be
>>> completely avoided if we make explicitly 32-bit and 64-bit
>>> wide versions of these macros. We probably won't be able
>>> to resolve this as part of your series, but ideally I'd like
>>> have explicitly-typed versions of cmpxchg(), smp_load_acquire()
>>> and all the related ones, the same way we do for atomic_*()
>>> and atomic64_*().
>>
>> Ah. And the caller uses say smp_load_acquire_long() or whatever, and
>> that resolves to whatever makes sense for the arch.
>>
>> The __unqual_scalar_typeof() does look pretty ugly when looking at the
>> preprocessed version but other than that smp_cond_load() etc look
>> pretty straightforward. Just for my curiosity, could you elaborate on
>> the complexity?
>
> The nested macros with __unqual_scalar_typeof() make the
> preprocessed version completely unreadable, especially when
> combined with other sets of complex macros like min()/max(),
> tracepoints or arm64 atomics.
>
> If something goes wrong inside of it, it becomes rather
> hard for people to debug or even read the compiler warnings.
> Since I do a lot of build testing, I do tend to run into those
> more than others do.
>
> I've also seen cases where excessively complex macros get
> nested to the point of slowing down the kernel build, which
> can happen once the nesting expands to megabytes of source
> code.
Thanks for the detail. Yeah, looking at the preprocessed code (as I had
to do when working on this) it all really feels quite excessive.
Combined with other complex macro constructs ...
--
ankur
* Re: [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
2025-10-28 5:31 ` [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout() Ankur Arora
2025-10-28 9:42 ` Arnd Bergmann
@ 2025-10-28 16:13 ` Christoph Lameter (Ampere)
1 sibling, 0 replies; 32+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-10-28 16:13 UTC (permalink / raw)
To: Ankur Arora
Cc: bpf, arnd, catalin.marinas, will, peterz, akpm, mark.rutland,
harisokn, ast, rafael, daniel.lezcano, memxor, zhenglifeng1,
xueshuai, joao.m.martins, boris.ostrovsky, konrad.wilk
Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
* [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout()
2025-10-28 5:31 [RESEND PATCH v7 0/7] barrier: Add smp_cond_load_*_timeout() Ankur Arora
2025-10-28 5:31 ` [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout() Ankur Arora
@ 2025-10-28 5:31 ` Ankur Arora
2025-10-28 8:42 ` Arnd Bergmann
2025-10-28 5:31 ` [RESEND PATCH v7 3/7] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait() Ankur Arora
` (4 subsequent siblings)
6 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2025-10-28 5:31 UTC (permalink / raw)
To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
joao.m.martins, boris.ostrovsky, konrad.wilk
Support waiting in smp_cond_load_relaxed_timeout() via
__cmpwait_relaxed(). Limit this to when the event-stream is enabled,
to ensure that we wake from WFE periodically and don't block forever
if there are no stores to the cacheline.
In the unlikely event that the event-stream is unavailable, fall back
to spin-waiting.
Also set SMP_TIMEOUT_POLL_COUNT to 1 so we do the time-check for each
iteration in smp_cond_load_relaxed_timeout().
Cc: linux-arm-kernel@lists.infradead.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/arm64/include/asm/barrier.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index f5801b0ba9e9..92c16dfb8ca6 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -219,6 +219,19 @@ do { \
(typeof(*ptr))VAL; \
})
+#define SMP_TIMEOUT_POLL_COUNT 1
+
+/* Re-declared here to avoid include dependency. */
+extern bool arch_timer_evtstrm_available(void);
+
+#define cpu_poll_relax(ptr, val) \
+do { \
+ if (arch_timer_evtstrm_available()) \
+ __cmpwait_relaxed(ptr, val); \
+ else \
+ cpu_relax(); \
+} while (0)
+
#include <asm-generic/barrier.h>
#endif /* __ASSEMBLY__ */
--
2.43.5
* Re: [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout()
2025-10-28 5:31 ` [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout() Ankur Arora
@ 2025-10-28 8:42 ` Arnd Bergmann
2025-10-28 16:21 ` Christoph Lameter (Ampere)
2025-10-28 18:01 ` Ankur Arora
0 siblings, 2 replies; 32+ messages in thread
From: Arnd Bergmann @ 2025-10-28 8:42 UTC (permalink / raw)
To: Ankur Arora, linux-kernel, Linux-Arch, linux-arm-kernel, linux-pm,
bpf
Cc: Catalin Marinas, Will Deacon, Peter Zijlstra, Andrew Morton,
Mark Rutland, Haris Okanovic, Christoph Lameter (Ampere),
Alexei Starovoitov, Rafael J . Wysocki, Daniel Lezcano,
Kumar Kartikeya Dwivedi, zhenglifeng1, xueshuai, Joao Martins,
Boris Ostrovsky, Konrad Rzeszutek Wilk
On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
> Support waiting in smp_cond_load_relaxed_timeout() via
> __cmpwait_relaxed(). Limit this to when the event-stream is enabled,
> to ensure that we wake from WFE periodically and don't block forever
> if there are no stores to the cacheline.
>
> In the unlikely event that the event-stream is unavailable, fallback
> to spin-waiting.
>
> Also set SMP_TIMEOUT_POLL_COUNT to 1 so we do the time-check for each
> iteration in smp_cond_load_relaxed_timeout().
After I looked at the entire series again, this one feels like
a missed opportunity. Especially on low-power systems but possibly
on any ARMv9.2+ implementation including Cortex-A320, it would
be nice to be able to both turn off the event stream and also
make this function take fewer wakeups:
> +/* Re-declared here to avoid include dependency. */
> +extern bool arch_timer_evtstrm_available(void);
> +
> +#define cpu_poll_relax(ptr, val) \
> +do { \
> + if (arch_timer_evtstrm_available()) \
> + __cmpwait_relaxed(ptr, val); \
> + else \
> + cpu_relax(); \
> +} while (0)
> +
Since the caller knows exactly how long it wants to wait for,
we should be able to fit a 'wfet' based primitive in here and
pass the timeout as another argument.
Arnd
* Re: [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout()
2025-10-28 8:42 ` Arnd Bergmann
@ 2025-10-28 16:21 ` Christoph Lameter (Ampere)
2025-10-28 18:01 ` Ankur Arora
1 sibling, 0 replies; 32+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-10-28 16:21 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Ankur Arora, linux-kernel, Linux-Arch, linux-arm-kernel, linux-pm,
bpf, Catalin Marinas, Will Deacon, Peter Zijlstra, Andrew Morton,
Mark Rutland, Haris Okanovic, Alexei Starovoitov,
Rafael J . Wysocki, Daniel Lezcano, Kumar Kartikeya Dwivedi,
zhenglifeng1, xueshuai, Joao Martins, Boris Ostrovsky,
Konrad Rzeszutek Wilk
On Tue, 28 Oct 2025, Arnd Bergmann wrote:
> After I looked at the entire series again, this one feels like
> a missed opportunity. Especially on low-power systems but possibly
> on any ARMv9.2+ implementation including Cortex-A320, it would
> be nice to be able to both turn off the event stream and also
> make this function take fewer wakeups:
That is certainly something that should be done at some point but for now
this patchset does the best possible with the available systems.
Switching off the event stream is another set of complexities that should
be handled separately later.
Let's just get this one finally in. We have been working on this issue
now for more than 2 years.
* Re: [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout()
2025-10-28 8:42 ` Arnd Bergmann
2025-10-28 16:21 ` Christoph Lameter (Ampere)
@ 2025-10-28 18:01 ` Ankur Arora
2025-10-28 21:17 ` Catalin Marinas
1 sibling, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2025-10-28 18:01 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Ankur Arora, linux-kernel, Linux-Arch, linux-arm-kernel, linux-pm,
bpf, Catalin Marinas, Will Deacon, Peter Zijlstra, Andrew Morton,
Mark Rutland, Haris Okanovic, Christoph Lameter (Ampere),
Alexei Starovoitov, Rafael J . Wysocki, Daniel Lezcano,
Kumar Kartikeya Dwivedi, zhenglifeng1, xueshuai, Joao Martins,
Boris Ostrovsky, Konrad Rzeszutek Wilk
Arnd Bergmann <arnd@arndb.de> writes:
> On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>> Support waiting in smp_cond_load_relaxed_timeout() via
>> __cmpwait_relaxed(). Limit this to when the event-stream is enabled,
>> to ensure that we wake from WFE periodically and don't block forever
>> if there are no stores to the cacheline.
>>
>> In the unlikely event that the event-stream is unavailable, fallback
>> to spin-waiting.
>>
>> Also set SMP_TIMEOUT_POLL_COUNT to 1 so we do the time-check for each
>> iteration in smp_cond_load_relaxed_timeout().
>
> After I looked at the entire series again, this one feels like
> a missed opportunity. Especially on low-power systems but possibly
> on any ARMv9.2+ implementation including Cortex-A320, it would
> be nice to be able to both turn off the event stream and also
> make this function take fewer wakeups:
>
>> +/* Re-declared here to avoid include dependency. */
>> +extern bool arch_timer_evtstrm_available(void);
>> +
>> +#define cpu_poll_relax(ptr, val) \
>> +do { \
>> + if (arch_timer_evtstrm_available()) \
>> + __cmpwait_relaxed(ptr, val); \
>> + else \
>> + cpu_relax(); \
>> +} while (0)
>> +
>
> Since the caller knows exactly how long it wants to wait for,
> we should be able to fit a 'wfet' based primitive in here and
> pass the timeout as another argument.
Per se, I don't disagree with this when it comes to WFET.
Handling a timeout, however, is messier when we use other mechanisms.
Some problems that came up in my earlier discussions with Catalin:
- when using WFE, we also need some notion of slack
- and if a caller specifies only a small or no slack, then we need
to combine WFE+cpu_relax()
- for platforms that only use a polling primitive, we want to check
the clock only intermittently for power reasons.
Now, this could be done with an architecture specific spin-count.
However, if the caller specifies a small slack, then we might need
to check the clock more often as we get closer to the deadline etc.
A smaller problem was that different users want different clocks and so
folding the timeout in a 'timeout_cond_expr' lets us do away with the
interface having to handle any of that.
I had earlier versions [v2] [v3] which had rather elaborate policies for
handling timeout, slack etc. But, given that the current users of the
interface don't actually care about precision, all of that seemed
a little overengineered.
[v2] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/#r
[v3] https://lore.kernel.org/lkml/20250627044805.945491-1-ankur.a.arora@oracle.com/
--
ankur
* Re: [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout()
2025-10-28 18:01 ` Ankur Arora
@ 2025-10-28 21:17 ` Catalin Marinas
2025-11-02 21:39 ` Arnd Bergmann
0 siblings, 1 reply; 32+ messages in thread
From: Catalin Marinas @ 2025-10-28 21:17 UTC (permalink / raw)
To: Ankur Arora
Cc: Arnd Bergmann, linux-kernel, Linux-Arch, linux-arm-kernel,
linux-pm, bpf, Will Deacon, Peter Zijlstra, Andrew Morton,
Mark Rutland, Haris Okanovic, Christoph Lameter (Ampere),
Alexei Starovoitov, Rafael J . Wysocki, Daniel Lezcano,
Kumar Kartikeya Dwivedi, zhenglifeng1, xueshuai, Joao Martins,
Boris Ostrovsky, Konrad Rzeszutek Wilk
On Tue, Oct 28, 2025 at 11:01:22AM -0700, Ankur Arora wrote:
> Arnd Bergmann <arnd@arndb.de> writes:
> > On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
> >> Support waiting in smp_cond_load_relaxed_timeout() via
> >> __cmpwait_relaxed(). Limit this to when the event-stream is enabled,
> >> to ensure that we wake from WFE periodically and don't block forever
> >> if there are no stores to the cacheline.
> >>
> >> In the unlikely event that the event-stream is unavailable, fallback
> >> to spin-waiting.
> >>
> >> Also set SMP_TIMEOUT_POLL_COUNT to 1 so we do the time-check for each
> >> iteration in smp_cond_load_relaxed_timeout().
> >
> > After I looked at the entire series again, this one feels like
> > a missed opportunity. Especially on low-power systems but possibly
> > on any ARMv9.2+ implementation including Cortex-A320, it would
> > be nice to be able to both turn off the event stream and also
> > make this function take fewer wakeups:
> >
> >> +/* Re-declared here to avoid include dependency. */
> >> +extern bool arch_timer_evtstrm_available(void);
> >> +
> >> +#define cpu_poll_relax(ptr, val) \
> >> +do { \
> >> + if (arch_timer_evtstrm_available()) \
> >> + __cmpwait_relaxed(ptr, val); \
> >> + else \
> >> + cpu_relax(); \
> >> +} while (0)
> >> +
> >
> > Since the caller knows exactly how long it wants to wait for,
> > we should be able to fit a 'wfet' based primitive in here and
> > pass the timeout as another argument.
>
> Per se, I don't disagree with this when it comes to WFET.
>
> Handling a timeout, however, is messier when we use other mechanisms.
>
> Some problems that came up in my earlier discussions with Catalin:
>
> - when using WFE, we also need some notion of slack
> - and if a caller specifies only a small or no slack, then we need
> to combine WFE+cpu_relax()
>
> - for platforms that only use a polling primitive, we want to check
> the clock only intermittently for power reasons.
> Now, this could be done with an architecture specific spin-count.
> However, if the caller specifies a small slack, then we might need
> to we check the clock more often as we get closer to the deadline etc.
>
> A smaller problem was that different users want different clocks and so
> folding the timeout in a 'timeout_cond_expr' lets us do away with the
> interface having to handle any of that.
>
> I had earlier versions [v2] [v3] which had rather elaborate policies for
> handling timeout, slack etc. But, given that the current users of the
> interface don't actually care about precision, all of that seemed
> a little overengineered.
Indeed, we've been through all these options and without a concrete user
that needs a more precise timeout, we decided it's not worth it. It can,
however, be improved later if such users appear.
> [v2] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/#r
> [v3] https://lore.kernel.org/lkml/20250627044805.945491-1-ankur.a.arora@oracle.com/
--
Catalin
* Re: [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout()
2025-10-28 21:17 ` Catalin Marinas
@ 2025-11-02 21:39 ` Arnd Bergmann
2025-11-03 21:00 ` Ankur Arora
0 siblings, 1 reply; 32+ messages in thread
From: Arnd Bergmann @ 2025-11-02 21:39 UTC (permalink / raw)
To: Catalin Marinas, Ankur Arora
Cc: linux-kernel, Linux-Arch, linux-arm-kernel, linux-pm, bpf,
Will Deacon, Peter Zijlstra, Andrew Morton, Mark Rutland,
Haris Okanovic, Christoph Lameter (Ampere), Alexei Starovoitov,
Rafael J . Wysocki, Daniel Lezcano, Kumar Kartikeya Dwivedi,
zhenglifeng1, xueshuai, Joao Martins, Boris Ostrovsky,
Konrad Rzeszutek Wilk
On Tue, Oct 28, 2025, at 22:17, Catalin Marinas wrote:
> On Tue, Oct 28, 2025 at 11:01:22AM -0700, Ankur Arora wrote:
>> Arnd Bergmann <arnd@arndb.de> writes:
>> > On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>> >> +
>> >
>> > Since the caller knows exactly how long it wants to wait for,
>> > we should be able to fit a 'wfet' based primitive in here and
>> > pass the timeout as another argument.
>>
>> Per se, I don't disagree with this when it comes to WFET.
>>
>> Handling a timeout, however, is messier when we use other mechanisms.
>>
>> Some problems that came up in my earlier discussions with Catalin:
>>
>> - when using WFE, we also need some notion of slack
>> - and if a caller specifies only a small or no slack, then we need
>> to combine WFE+cpu_relax()
I don't see the difference to what you have: with the event stream,
you implicitly define a slack to be the programmed event stream rate
of ~100µs.
I'm not asking for anything better in this case, only for machines
with WFET but no event stream to also avoid the spin loop.
>> - for platforms that only use a polling primitive, we want to check
>> the clock only intermittently for power reasons.
Right, I missed that bit.
>> Now, this could be done with an architecture specific spin-count.
>> However, if the caller specifies a small slack, then we might need
>> to check the clock more often as we get closer to the deadline etc.
Again, I think this is solved by defining the slack as architecture
specific as well rather than an explicit argument, which is essentially
what we already have.
>> A smaller problem was that different users want different clocks and so
>> folding the timeout in a 'timeout_cond_expr' lets us do away with the
>> interface having to handle any of that.
>>
>> I had earlier versions [v2] [v3] which had rather elaborate policies for
>> handling timeout, slack etc. But, given that the current users of the
>> interface don't actually care about precision, all of that seemed
>> a little overengineered.
>
> Indeed, we've been through all these options and without a concrete user
> that needs a more precise timeout, we decided it's not worth it. It can,
> however, be improved later if such users appear.
The main worry I have is that if we get too many users of cpu_poll_relax()
hardcoding the use of the event stream without a timeout argument, it
becomes too hard to change later without introducing regressions
from the behavior change.
As far as I can tell, the only place that currently uses the
event stream on a functional level is the delay() loop, and that
has a working wfet based version.
Arnd
* Re: [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout()
2025-11-02 21:39 ` Arnd Bergmann
@ 2025-11-03 21:00 ` Ankur Arora
2025-11-04 13:55 ` Catalin Marinas
0 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2025-11-03 21:00 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Catalin Marinas, Ankur Arora, linux-kernel, Linux-Arch,
linux-arm-kernel, linux-pm, bpf, Will Deacon, Peter Zijlstra,
Andrew Morton, Mark Rutland, Haris Okanovic,
Christoph Lameter (Ampere), Alexei Starovoitov,
Rafael J . Wysocki, Daniel Lezcano, Kumar Kartikeya Dwivedi,
zhenglifeng1, xueshuai, Joao Martins, Boris Ostrovsky,
Konrad Rzeszutek Wilk
Arnd Bergmann <arnd@arndb.de> writes:
> On Tue, Oct 28, 2025, at 22:17, Catalin Marinas wrote:
>> On Tue, Oct 28, 2025 at 11:01:22AM -0700, Ankur Arora wrote:
>>> Arnd Bergmann <arnd@arndb.de> writes:
>>> > On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>>> >> +
>>> >
>>> > Since the caller knows exactly how long it wants to wait for,
>>> > we should be able to fit a 'wfet' based primitive in here and
>>> > pass the timeout as another argument.
>>>
>>> Per se, I don't disagree with this when it comes to WFET.
>>>
>>> Handling a timeout, however, is messier when we use other mechanisms.
>>>
>>> Some problems that came up in my earlier discussions with Catalin:
>>>
>>> - when using WFE, we also need some notion of slack
>>> - and if a caller specifies only a small or no slack, then we need
>>> to combine WFE+cpu_relax()
>
> I don't see the difference to what you have: with the event stream,
> you implicitly define a slack to be the programmed event stream rate
> of ~100µs.
True. The thinking was that adding an explicit timeout just begs the
question of how closely the interface adheres to the timeout, and I guess
the final interface tried to sidestep all of that.
> I'm not asking for anything better in this case, only for machines
> with WFET but no event stream to also avoid the spin loop.
That makes sense. It's a good point that the WFET+event-stream-off case
would just end up using the spin loop, which is quite suboptimal.
>>> - for platforms that only use a polling primitive, we want to check
>>> the clock only intermittently for power reasons.
>
> Right, I missed that bit.
>
>>> Now, this could be done with an architecture specific spin-count.
>>> However, if the caller specifies a small slack, then we might need
>>> to check the clock more often as we get closer to the deadline etc.
>
> Again, I think this is solved by defining the slack as architecture
> specific as well rather than an explicit argument, which is essentially
> what we already have.
Great. I think that means that I can keep more or less the same interface
with an explicit time_end, which allows WFET to do the right thing.
And WFE can have an architecture-specific slack (the event-stream period).
>>> A smaller problem was that different users want different clocks and so
>>> folding the timeout in a 'timeout_cond_expr' lets us do away with the
>>> interface having to handle any of that.
>>>
>>> I had earlier versions [v2] [v3] which had rather elaborate policies for
>>> handling timeout, slack etc. But, given that the current users of the
>>> interface don't actually care about precision, all of that seemed
>>> a little overengineered.
>>
>> Indeed, we've been through all these options and without a concrete user
>> that needs a more precise timeout, we decided it's not worth it. It can,
>> however, be improved later if such users appear.
>
> The main worry I have is that if we get too many users of cpu_poll_relax()
> hardcoding the use of the event stream without a timeout argument, it
> becomes too hard to change later without introducing regressions
> from the behavior change.
True.
> As far as I can tell, the only place that currently uses the
> event stream on a functional level is the delay() loop, and that
> has a working wfet based version.
Will send out the next version with an interface along the following lines:
/**
* smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
* guarantees until a timeout expires.
* @ptr: pointer to the variable to wait on
* @cond: boolean expression to wait for
* @time_expr: time expression in caller's preferred clock
* @time_end_ns: end time in nanoseconds (compared against time_expr;
* might also be used for setting up a future event.)
*
* Equivalent to using READ_ONCE() on the condition variable.
*
* Note that the expiration of the timeout might have an architecture specific
* delay.
*/
#ifndef smp_cond_load_relaxed_timeout
#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr, time_end_ns) \
({ \
typeof(ptr) __PTR = (ptr); \
__unqual_scalar_typeof(*ptr) VAL; \
u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT; \
u64 __time_end_ns = (time_end_ns); \
\
for (;;) { \
VAL = READ_ONCE(*__PTR); \
if (cond_expr) \
break; \
cpu_poll_relax(__PTR, VAL, __time_end_ns); \
if (++__n < __spin) \
continue; \
if ((time_expr) >= __time_end_ns) { \
VAL = READ_ONCE(*__PTR); \
break; \
} \
__n = 0; \
} \
(typeof(*ptr))VAL; \
})
#endif
That allows for a __cmpwait_timeout() as you had outlined, similar to
these two patches:
https://lore.kernel.org/lkml/20241107190818.522639-15-ankur.a.arora@oracle.com/
https://lore.kernel.org/lkml/20241107190818.522639-16-ankur.a.arora@oracle.com/
(this one incorporating some changes that Catalin had suggested:
https://lore.kernel.org/lkml/aKRTRyQAaWFtRvDv@arm.com/)
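On the arm64 side, cpu_poll_relax() could then dispatch on FEAT_WFXT
(rough sketch only; __cmpwait_timeout() is assumed to be the WFET helper
outlined above, and this ignores the ns to counter conversion for now):
/*
 * Sketch, not a real patch: prefer WFET when available so the event
 * stream isn't needed; otherwise fall back to event-stream WFE or a
 * plain spin.
 */
#define SMP_TIMEOUT_POLL_COUNT	1
#define cpu_poll_relax(ptr, val, time_end_ns)				\
do {									\
	if (alternative_has_cap_unlikely(ARM64_HAS_WFXT))		\
		__cmpwait_timeout(ptr, val, time_end_ns);		\
	else if (arch_timer_evtstrm_available())			\
		__cmpwait_relaxed(ptr, val);				\
	else								\
		cpu_relax();						\
} while (0)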
--
ankur
* Re: [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout()
2025-11-03 21:00 ` Ankur Arora
@ 2025-11-04 13:55 ` Catalin Marinas
2025-11-05 8:27 ` Ankur Arora
0 siblings, 1 reply; 32+ messages in thread
From: Catalin Marinas @ 2025-11-04 13:55 UTC (permalink / raw)
To: Ankur Arora
Cc: Arnd Bergmann, linux-kernel, Linux-Arch, linux-arm-kernel,
linux-pm, bpf, Will Deacon, Peter Zijlstra, Andrew Morton,
Mark Rutland, Haris Okanovic, Christoph Lameter (Ampere),
Alexei Starovoitov, Rafael J . Wysocki, Daniel Lezcano,
Kumar Kartikeya Dwivedi, zhenglifeng1, xueshuai, Joao Martins,
Boris Ostrovsky, Konrad Rzeszutek Wilk
On Mon, Nov 03, 2025 at 01:00:33PM -0800, Ankur Arora wrote:
> /**
> * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
> * guarantees until a timeout expires.
> * @ptr: pointer to the variable to wait on
> * @cond: boolean expression to wait for
> * @time_expr: time expression in caller's preferred clock
> * @time_end: end time in nanosecond (compared against time_expr;
> * might also be used for setting up a future event.)
> *
> * Equivalent to using READ_ONCE() on the condition variable.
> *
> * Note that the expiration of the timeout might have an architecture specific
> * delay.
> */
> #ifndef smp_cond_load_relaxed_timeout
> #define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr, time_end_ns) \
> ({ \
> typeof(ptr) __PTR = (ptr); \
> __unqual_scalar_typeof(*ptr) VAL; \
> u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT; \
> u64 __time_end_ns = (time_end_ns); \
> \
> for (;;) { \
> VAL = READ_ONCE(*__PTR); \
> if (cond_expr) \
> break; \
> cpu_poll_relax(__PTR, VAL, __time_end_ns); \
With time_end_ns being passed to cpu_poll_relax(), we assume that this
is always the absolute time. Do we still need time_expr in this case?
It works for WFET as long as we can map this time_end_ns onto the
hardware CNTVCT.
Alternatively, we could pass something like remaining_ns, though not
sure how smp_cond_load_relaxed_timeout() can decide to spin before
checking time_expr again (we probably went over this in the past two
years ;)).
> if (++__n < __spin) \
> continue; \
> if ((time_expr) >= __time_end_ns) { \
> VAL = READ_ONCE(*__PTR); \
> break; \
> } \
> __n = 0; \
> } \
> (typeof(*ptr))VAL; \
> })
> #endif
--
Catalin
* Re: [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout()
2025-11-04 13:55 ` Catalin Marinas
@ 2025-11-05 8:27 ` Ankur Arora
2025-11-05 10:37 ` Arnd Bergmann
0 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2025-11-05 8:27 UTC (permalink / raw)
To: Catalin Marinas
Cc: Ankur Arora, Arnd Bergmann, linux-kernel, Linux-Arch,
linux-arm-kernel, linux-pm, bpf, Will Deacon, Peter Zijlstra,
Andrew Morton, Mark Rutland, Haris Okanovic,
Christoph Lameter (Ampere), Alexei Starovoitov,
Rafael J . Wysocki, Daniel Lezcano, Kumar Kartikeya Dwivedi,
zhenglifeng1, xueshuai, Joao Martins, Boris Ostrovsky,
Konrad Rzeszutek Wilk
Catalin Marinas <catalin.marinas@arm.com> writes:
> On Mon, Nov 03, 2025 at 01:00:33PM -0800, Ankur Arora wrote:
>> /**
>> * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
>> * guarantees until a timeout expires.
>> * @ptr: pointer to the variable to wait on
>> * @cond: boolean expression to wait for
>> * @time_expr: time expression in caller's preferred clock
>> * @time_end: end time in nanosecond (compared against time_expr;
>> * might also be used for setting up a future event.)
>> *
>> * Equivalent to using READ_ONCE() on the condition variable.
>> *
>> * Note that the expiration of the timeout might have an architecture specific
>> * delay.
>> */
>> #ifndef smp_cond_load_relaxed_timeout
>> #define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr, time_end_ns) \
>> ({ \
>> typeof(ptr) __PTR = (ptr); \
>> __unqual_scalar_typeof(*ptr) VAL; \
>> u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT; \
>> u64 __time_end_ns = (time_end_ns); \
>> \
>> for (;;) { \
>> VAL = READ_ONCE(*__PTR); \
>> if (cond_expr) \
>> break; \
>> cpu_poll_relax(__PTR, VAL, __time_end_ns); \
>
> With time_end_ns being passed to cpu_poll_relax(), we assume that this
> is always the absolute time. Do we still need time_expr in this case?
> It works for WFET as long as we can map this time_end_ns onto the
> hardware CNTVCT.
So I like this idea. Given that we only promise a coarse granularity we
should be able to get by with using a coarse clock of our choosing.
However, maybe some callers need a globally consistent clock just in
case they could migrate and do something stateful in the cond_expr?
(For instance rqspinlock wants ktime_mono. Though I don't think these
callers can migrate.)
> Alternatively, we could pass something like remaining_ns, though not
> sure how smp_cond_load_relaxed_timeout() can decide to spin before
> checking time_expr again (we probably went over this in the past two
> years ;)).
I'm sure it is in there somewhere :).
This one?: https://lore.kernel.org/lkml/aJy414YufthzC1nv@arm.com/.
Though the whole wait_policy thing confused the issue somewhat there.
Though that problem exists for both remaining_ns and for time_end_ns
with WFE. I think we are fine so long as SMP_TIMEOUT_POLL_COUNT is
defined to be 1.
For now, I think it makes sense to always pass the absolute deadline
even if the caller uses remaining_ns. So:
#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr, remaining_ns) \
({ \
typeof(ptr) __PTR = (ptr); \
__unqual_scalar_typeof(*ptr) VAL; \
u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT; \
u64 __time_start_ns = (time_expr); \
s64 __time_end_ns = __time_start_ns + (remaining_ns); \
\
for (;;) { \
VAL = READ_ONCE(*__PTR); \
if (cond_expr) \
break; \
cpu_poll_relax(__PTR, VAL, __time_end_ns); \
if (++__n < __spin) \
continue; \
if ((time_expr) >= __time_end_ns) { \
VAL = READ_ONCE(*__PTR); \
break; \
} \
__n = 0; \
} \
(typeof(*ptr))VAL; \
})
--
ankur
* Re: [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout()
2025-11-05 8:27 ` Ankur Arora
@ 2025-11-05 10:37 ` Arnd Bergmann
2025-11-06 0:36 ` Ankur Arora
0 siblings, 1 reply; 32+ messages in thread
From: Arnd Bergmann @ 2025-11-05 10:37 UTC (permalink / raw)
To: Ankur Arora, Catalin Marinas
Cc: linux-kernel, Linux-Arch, linux-arm-kernel, linux-pm, bpf,
Will Deacon, Peter Zijlstra, Andrew Morton, Mark Rutland,
Haris Okanovic, Christoph Lameter (Ampere), Alexei Starovoitov,
Rafael J . Wysocki, Daniel Lezcano, Kumar Kartikeya Dwivedi,
zhenglifeng1, xueshuai, Joao Martins, Boris Ostrovsky,
Konrad Rzeszutek Wilk
On Wed, Nov 5, 2025, at 09:27, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
>> On Mon, Nov 03, 2025 at 01:00:33PM -0800, Ankur Arora wrote:
>> With time_end_ns being passed to cpu_poll_relax(), we assume that this
>> is always the absolute time. Do we still need time_expr in this case?
>> It works for WFET as long as we can map this time_end_ns onto the
>> hardware CNTVCT.
>>
>> Alternatively, we could pass something like remaining_ns, though not
>> sure how smp_cond_load_relaxed_timeout() can decide to spin before
>> checking time_expr again (we probably went over this in the past two
>> years ;)).
>
> I'm sure it is in there somewhere :).
> This one?: https://lore.kernel.org/lkml/aJy414YufthzC1nv@arm.com/.
> Though the whole wait_policy thing confused the issue somewhat there.
>
> Though that problem exists for both remaining_ns and for time_end_ns
> with WFE. I think we are fine so long as SMP_TIMEOUT_POLL_COUNT is
> defined to be 1.
We need to be careful with the definition of the time_expr() if
cpu_poll_relax requires the absolute time in CNTVCT domain.
We're probably fine here, but it feels like a bit of a layering
violation. I think passing relative time into it would be cleaner
because it avoids the ambiguity, but probably requires an extra
access to the timer register that is hopefully fast on arm64.
I'm ok with either way.
> For now, I think it makes sense to always pass the absolute deadline
> even if the caller uses remaining_ns. So:
>
> #define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr, remaining_ns) \
> ({ \
> typeof(ptr) __PTR = (ptr); \
> __unqual_scalar_typeof(*ptr) VAL; \
> u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT; \
> u64 __time_start_ns = (time_expr); \
> s64 __time_end_ns = __time_start_ns + (remaining_ns); \
> \
> for (;;) { \
> VAL = READ_ONCE(*__PTR); \
> if (cond_expr) \
> break; \
> cpu_poll_relax(__PTR, VAL, __time_end_ns); \
This looks perfectly fine to me, thanks for the update!
Arnd
* Re: [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout()
2025-11-05 10:37 ` Arnd Bergmann
@ 2025-11-06 0:36 ` Ankur Arora
0 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-11-06 0:36 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Ankur Arora, Catalin Marinas, linux-kernel, Linux-Arch,
linux-arm-kernel, linux-pm, bpf, Will Deacon, Peter Zijlstra,
Andrew Morton, Mark Rutland, Haris Okanovic,
Christoph Lameter (Ampere), Alexei Starovoitov,
Rafael J . Wysocki, Daniel Lezcano, Kumar Kartikeya Dwivedi,
zhenglifeng1, xueshuai, Joao Martins, Boris Ostrovsky,
Konrad Rzeszutek Wilk
Arnd Bergmann <arnd@arndb.de> writes:
> On Wed, Nov 5, 2025, at 09:27, Ankur Arora wrote:
>> Catalin Marinas <catalin.marinas@arm.com> writes:
>>> On Mon, Nov 03, 2025 at 01:00:33PM -0800, Ankur Arora wrote:
>>> With time_end_ns being passed to cpu_poll_relax(), we assume that this
>>> is always the absolute time. Do we still need time_expr in this case?
>>> It works for WFET as long as we can map this time_end_ns onto the
>>> hardware CNTVCT.
>>>
>>> Alternatively, we could pass something like remaining_ns, though not
>>> sure how smp_cond_load_relaxed_timeout() can decide to spin before
>>> checking time_expr again (we probably went over this in the past two
>>> years ;)).
>>
>> I'm sure it is in there somewhere :).
>> This one?: https://lore.kernel.org/lkml/aJy414YufthzC1nv@arm.com/.
>> Though the whole wait_policy thing confused the issue somewhat there.
>>
>> Though that problem exists for both remaining_ns and for time_end_ns
>> with WFE. I think we are fine so long as SMP_TIMEOUT_POLL_COUNT is
>> defined to be 1.
>
> We need to be careful with the definition of the time_expr() if
> cpu_poll_relax requires the absolute time in CNTVCT domain.
True. The absolute time assumes that CPU time and CNTVCT domain
times are freely translatable, and won't drift.
> We're probably fine here, but it feels like a bit of a layering
> violation. I think passing relative time into it would be cleaner
> because it avoids the ambiguity,
I'll play around a little; see if we can pass the relative time and
yet not depend on the conjoined value of SMP_TIMEOUT_POLL_COUNT.
> but probably requires an extra
> access to the timer register that is hopefully fast on arm64.
>
> I'm ok with either way.
I'll keep the caller parameter to be remaining_ns. This way we
can internally switch over to relative/absolute time if needed.
>> For now, I think it makes sense to always pass the absolute deadline
>> even if the caller uses remaining_ns. So:
>>
>> #define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr, remaining_ns) \
>> ({ \
>> typeof(ptr) __PTR = (ptr); \
>> __unqual_scalar_typeof(*ptr) VAL; \
>> u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT; \
>> u64 __time_start_ns = (time_expr); \
>> s64 __time_end_ns = __time_start_ns + (remaining_ns); \
>> \
>> for (;;) { \
>> VAL = READ_ONCE(*__PTR); \
>> if (cond_expr) \
>> break; \
>> cpu_poll_relax(__PTR, VAL, __time_end_ns); \
>
> This looks perfectly fine to me, thanks for the update!
Thanks for the review comments!
--
ankur
* [RESEND PATCH v7 3/7] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait()
2025-10-28 5:31 [RESEND PATCH v7 0/7] barrier: Add smp_cond_load_*_timeout() Ankur Arora
2025-10-28 5:31 ` [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout() Ankur Arora
2025-10-28 5:31 ` [RESEND PATCH v7 2/7] arm64: barrier: Support smp_cond_load_relaxed_timeout() Ankur Arora
@ 2025-10-28 5:31 ` Ankur Arora
2025-10-28 5:31 ` [RESEND PATCH v7 4/7] asm-generic: barrier: Add smp_cond_load_acquire_timeout() Ankur Arora
` (3 subsequent siblings)
6 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-10-28 5:31 UTC (permalink / raw)
To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
joao.m.martins, boris.ostrovsky, konrad.wilk
In preparation for defining smp_cond_load_acquire_timeout(), remove
the private copy. Without it, the rqspinlock code falls back to using
smp_cond_load_acquire().
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/arm64/include/asm/rqspinlock.h | 85 -----------------------------
1 file changed, 85 deletions(-)
diff --git a/arch/arm64/include/asm/rqspinlock.h b/arch/arm64/include/asm/rqspinlock.h
index 9ea0a74e5892..a385603436e9 100644
--- a/arch/arm64/include/asm/rqspinlock.h
+++ b/arch/arm64/include/asm/rqspinlock.h
@@ -3,91 +3,6 @@
#define _ASM_RQSPINLOCK_H
#include <asm/barrier.h>
-
-/*
- * Hardcode res_smp_cond_load_acquire implementations for arm64 to a custom
- * version based on [0]. In rqspinlock code, our conditional expression involves
- * checking the value _and_ additionally a timeout. However, on arm64, the
- * WFE-based implementation may never spin again if no stores occur to the
- * locked byte in the lock word. As such, we may be stuck forever if
- * event-stream based unblocking is not available on the platform for WFE spin
- * loops (arch_timer_evtstrm_available).
- *
- * Once support for smp_cond_load_acquire_timewait [0] lands, we can drop this
- * copy-paste.
- *
- * While we rely on the implementation to amortize the cost of sampling
- * cond_expr for us, it will not happen when event stream support is
- * unavailable, time_expr check is amortized. This is not the common case, and
- * it would be difficult to fit our logic in the time_expr_ns >= time_limit_ns
- * comparison, hence just let it be. In case of event-stream, the loop is woken
- * up at microsecond granularity.
- *
- * [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
- */
-
-#ifndef smp_cond_load_acquire_timewait
-
-#define smp_cond_time_check_count 200
-
-#define __smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns, \
- time_limit_ns) ({ \
- typeof(ptr) __PTR = (ptr); \
- __unqual_scalar_typeof(*ptr) VAL; \
- unsigned int __count = 0; \
- for (;;) { \
- VAL = READ_ONCE(*__PTR); \
- if (cond_expr) \
- break; \
- cpu_relax(); \
- if (__count++ < smp_cond_time_check_count) \
- continue; \
- if ((time_expr_ns) >= (time_limit_ns)) \
- break; \
- __count = 0; \
- } \
- (typeof(*ptr))VAL; \
-})
-
-#define __smp_cond_load_acquire_timewait(ptr, cond_expr, \
- time_expr_ns, time_limit_ns) \
-({ \
- typeof(ptr) __PTR = (ptr); \
- __unqual_scalar_typeof(*ptr) VAL; \
- for (;;) { \
- VAL = smp_load_acquire(__PTR); \
- if (cond_expr) \
- break; \
- __cmpwait_relaxed(__PTR, VAL); \
- if ((time_expr_ns) >= (time_limit_ns)) \
- break; \
- } \
- (typeof(*ptr))VAL; \
-})
-
-#define smp_cond_load_acquire_timewait(ptr, cond_expr, \
- time_expr_ns, time_limit_ns) \
-({ \
- __unqual_scalar_typeof(*ptr) _val; \
- int __wfe = arch_timer_evtstrm_available(); \
- \
- if (likely(__wfe)) { \
- _val = __smp_cond_load_acquire_timewait(ptr, cond_expr, \
- time_expr_ns, \
- time_limit_ns); \
- } else { \
- _val = __smp_cond_load_relaxed_spinwait(ptr, cond_expr, \
- time_expr_ns, \
- time_limit_ns); \
- smp_acquire__after_ctrl_dep(); \
- } \
- (typeof(*ptr))_val; \
-})
-
-#endif
-
-#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire_timewait(v, c, 0, 1)
-
#include <asm-generic/rqspinlock.h>
#endif /* _ASM_RQSPINLOCK_H */
--
2.43.5
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [RESEND PATCH v7 4/7] asm-generic: barrier: Add smp_cond_load_acquire_timeout()
2025-10-28 5:31 [RESEND PATCH v7 0/7] barrier: Add smp_cond_load_*_timeout() Ankur Arora
` (2 preceding siblings ...)
2025-10-28 5:31 ` [RESEND PATCH v7 3/7] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait() Ankur Arora
@ 2025-10-28 5:31 ` Ankur Arora
2025-10-28 5:31 ` [RESEND PATCH v7 5/7] atomic: Add atomic_cond_read_*_timeout() Ankur Arora
` (2 subsequent siblings)
6 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-10-28 5:31 UTC (permalink / raw)
To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
joao.m.martins, boris.ostrovsky, konrad.wilk
Add the acquire variant of smp_cond_load_relaxed_timeout(). This
reuses the relaxed variant, with an additional LOAD->LOAD
ordering.
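As a usage sketch (the caller, flag word and deadline here are hypothetical,
not part of this series), waiting for a flag with ACQUIRE ordering and giving
up after a deadline would look roughly like:

	u64 deadline = local_clock() + budget_ns;
	unsigned long v;

	v = smp_cond_load_acquire_timeout(&obj->flag, VAL != 0,
					  local_clock() >= deadline);
	if (!v) {
		/* deadline hit before the flag was observed set */
	}

The returned value is the last one loaded, and the ACQUIRE ensures later
accesses are ordered after that load.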
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/asm-generic/barrier.h | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 0063b46ec065..9a218f558c5c 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -314,6 +314,28 @@ do { \
})
#endif
+/**
+ * smp_cond_load_acquire_timeout() - (Spin) wait for cond with ACQUIRE ordering
+ * until a timeout expires.
+ *
+ * Arguments: same as smp_cond_load_relaxed_timeout().
+ *
+ * Equivalent to using smp_cond_load_acquire() on the condition variable with
+ * a timeout.
+ */
+#ifndef smp_cond_load_acquire_timeout
+#define smp_cond_load_acquire_timeout(ptr, cond_expr, time_check_expr) \
+({ \
+ __unqual_scalar_typeof(*ptr) _val; \
+ _val = smp_cond_load_relaxed_timeout(ptr, cond_expr, \
+ time_check_expr); \
+ \
+ /* Depends on the control dependency of the wait above. */ \
+ smp_acquire__after_ctrl_dep(); \
+ (typeof(*ptr))_val; \
+})
+#endif
+
/*
* pmem_wmb() ensures that all stores for which the modification
* are written to persistent storage by preceding instructions have
--
2.43.5
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [RESEND PATCH v7 5/7] atomic: Add atomic_cond_read_*_timeout()
2025-10-28 5:31 [RESEND PATCH v7 0/7] barrier: Add smp_cond_load_*_timeout() Ankur Arora
` (3 preceding siblings ...)
2025-10-28 5:31 ` [RESEND PATCH v7 4/7] asm-generic: barrier: Add smp_cond_load_acquire_timeout() Ankur Arora
@ 2025-10-28 5:31 ` Ankur Arora
2025-10-28 5:31 ` [RESEND PATCH v7 6/7] rqspinlock: Use smp_cond_load_acquire_timeout() Ankur Arora
2025-10-28 5:31 ` [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout() Ankur Arora
6 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-10-28 5:31 UTC (permalink / raw)
To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
joao.m.martins, boris.ostrovsky, konrad.wilk
Add atomic load wrappers, atomic_cond_read_*_timeout() and
atomic64_cond_read_*_timeout() for the cond-load timeout interfaces.
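These simply forward &(v)->counter to the corresponding
smp_cond_load_*_timeout() macros, so atomic_t/atomic64_t waiters keep the
same calling convention. A sketch (field, mask and deadline hypothetical):

	int v;

	v = atomic_cond_read_acquire_timeout(&obj->state,
					     (VAL & READY_BIT),
					     local_clock() >= deadline);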
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/linux/atomic.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 8dd57c3a99e9..49f8100ad8af 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -31,6 +31,16 @@
#define atomic64_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
#define atomic64_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
+#define atomic_cond_read_acquire_timeout(v, c, t) \
+ smp_cond_load_acquire_timeout(&(v)->counter, (c), (t))
+#define atomic_cond_read_relaxed_timeout(v, c, t) \
+ smp_cond_load_relaxed_timeout(&(v)->counter, (c), (t))
+
+#define atomic64_cond_read_acquire_timeout(v, c, t) \
+ smp_cond_load_acquire_timeout(&(v)->counter, (c), (t))
+#define atomic64_cond_read_relaxed_timeout(v, c, t) \
+ smp_cond_load_relaxed_timeout(&(v)->counter, (c), (t))
+
/*
* The idea here is to build acquire/release variants by adding explicit
* barriers on top of the relaxed variant. In the case where the relaxed
--
2.43.5
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [RESEND PATCH v7 6/7] rqspinlock: Use smp_cond_load_acquire_timeout()
2025-10-28 5:31 [RESEND PATCH v7 0/7] barrier: Add smp_cond_load_*_timeout() Ankur Arora
` (4 preceding siblings ...)
2025-10-28 5:31 ` [RESEND PATCH v7 5/7] atomic: Add atomic_cond_read_*_timeout() Ankur Arora
@ 2025-10-28 5:31 ` Ankur Arora
2025-10-28 5:31 ` [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout() Ankur Arora
6 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-10-28 5:31 UTC (permalink / raw)
To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
joao.m.martins, boris.ostrovsky, konrad.wilk
Switch the conditional load interfaces used by rqspinlock over to
atomic_cond_read_acquire_timeout() and smp_cond_load_acquire_timeout().
Both of these handle the timeout and amortize its cost as needed, so
check_timeout() can be used directly as the time-check expression.
Also, for the spin-wait implementations, redefine SMP_TIMEOUT_POLL_COUNT
as 16k, similar to the spin count used in RES_CHECK_TIMEOUT().
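Since the time-check argument is an ordinary expression, the call sites can
record the timeout status as a side effect, as the hunks below do:

	smp_cond_load_acquire_timeout(&lock->locked, !VAL,
				      (ret = check_timeout(lock, _Q_LOCKED_MASK, &ts)));
	if (ret) {
		/* timed out: take the timeout/deadlock handling path */
	}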
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
kernel/bpf/rqspinlock.c | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)
diff --git a/kernel/bpf/rqspinlock.c b/kernel/bpf/rqspinlock.c
index a00561b1d3e5..f35f078a158d 100644
--- a/kernel/bpf/rqspinlock.c
+++ b/kernel/bpf/rqspinlock.c
@@ -241,20 +241,14 @@ static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
}
/*
- * Do not amortize with spins when res_smp_cond_load_acquire is defined,
- * as the macro does internal amortization for us.
+ * Amortize timeout check for busy-wait loops.
*/
-#ifndef res_smp_cond_load_acquire
#define RES_CHECK_TIMEOUT(ts, ret, mask) \
({ \
if (!(ts).spin++) \
(ret) = check_timeout((lock), (mask), &(ts)); \
(ret); \
})
-#else
-#define RES_CHECK_TIMEOUT(ts, ret, mask) \
- ({ (ret) = check_timeout((lock), (mask), &(ts)); })
-#endif
/*
* Initialize the 'spin' member.
@@ -268,6 +262,16 @@ static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
*/
#define RES_RESET_TIMEOUT(ts, _duration) ({ (ts).timeout_end = 0; (ts).duration = _duration; })
+/*
+ * Limit how often check_timeout() is invoked while spin-waiting in
+ * smp_cond_load_acquire_timeout() or atomic_cond_read_acquire_timeout().
+ * (ARM64 typically uses a wait-based implementation, so we exclude it.)
+ */
+#ifndef CONFIG_ARM64
+#undef SMP_TIMEOUT_POLL_COUNT
+#define SMP_TIMEOUT_POLL_COUNT (16*1024)
+#endif
+
/*
* Provide a test-and-set fallback for cases when queued spin lock support is
* absent from the architecture.
@@ -313,12 +316,6 @@ EXPORT_SYMBOL_GPL(resilient_tas_spin_lock);
*/
static DEFINE_PER_CPU_ALIGNED(struct qnode, rqnodes[_Q_MAX_NODES]);
-#ifndef res_smp_cond_load_acquire
-#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire(v, c)
-#endif
-
-#define res_atomic_cond_read_acquire(v, c) res_smp_cond_load_acquire(&(v)->counter, (c))
-
/**
* resilient_queued_spin_lock_slowpath - acquire the queued spinlock
* @lock: Pointer to queued spinlock structure
@@ -418,7 +415,8 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
*/
if (val & _Q_LOCKED_MASK) {
RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
- res_smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK));
+ smp_cond_load_acquire_timeout(&lock->locked, !VAL,
+ (ret = check_timeout(lock, _Q_LOCKED_MASK, &ts)));
}
if (ret) {
@@ -572,9 +570,8 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
* us.
*/
RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT * 2);
- val = res_atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
- RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK));
-
+ val = atomic_cond_read_acquire_timeout(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK),
+ (ret = check_timeout(lock, _Q_LOCKED_PENDING_MASK, &ts)));
waitq_timeout:
if (ret) {
/*
--
2.43.5
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout()
2025-10-28 5:31 [RESEND PATCH v7 0/7] barrier: Add smp_cond_load_*_timeout() Ankur Arora
` (5 preceding siblings ...)
2025-10-28 5:31 ` [RESEND PATCH v7 6/7] rqspinlock: Use smp_cond_load_acquire_timeout() Ankur Arora
@ 2025-10-28 5:31 ` Ankur Arora
2025-10-28 12:30 ` Rafael J. Wysocki
2025-10-28 16:16 ` Christoph Lameter (Ampere)
6 siblings, 2 replies; 32+ messages in thread
From: Ankur Arora @ 2025-10-28 5:31 UTC (permalink / raw)
To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
joao.m.martins, boris.ostrovsky, konrad.wilk
The inner loop in poll_idle() polls over the thread_info flags,
waiting to see if the thread has TIF_NEED_RESCHED set. The loop
exits once the condition is met, or if the poll time limit has
been exceeded.
To minimize the number of instructions executed in each iteration,
the time check is done only intermittently (once every
POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
executes cpu_relax() which on certain platforms provides a hint to
the pipeline that the loop busy-waits, allowing the processor to
reduce power consumption.
This is close to what smp_cond_load_relaxed_timeout() provides. So,
restructure the loop and fold the loop condition and the timeout check
in smp_cond_load_relaxed_timeout().
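Note that, with time_end = time_start + limit, the new timeout expression is
equivalent to the old check:

	local_clock_noinstr() - time_start > limit

and

	local_clock_noinstr() >= time_end

can only differ when the clock lands exactly on the limit, which is
immaterial at nanosecond granularity.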
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
1 file changed, 8 insertions(+), 21 deletions(-)
diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index 9b6d90a72601..dc7f4b424fec 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -8,35 +8,22 @@
#include <linux/sched/clock.h>
#include <linux/sched/idle.h>
-#define POLL_IDLE_RELAX_COUNT 200
-
static int __cpuidle poll_idle(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
- u64 time_start;
-
- time_start = local_clock_noinstr();
+ u64 time_end;
+ u32 flags = 0;
dev->poll_time_limit = false;
+ time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
+
raw_local_irq_enable();
if (!current_set_polling_and_test()) {
- unsigned int loop_count = 0;
- u64 limit;
-
- limit = cpuidle_poll_time(drv, dev);
-
- while (!need_resched()) {
- cpu_relax();
- if (loop_count++ < POLL_IDLE_RELAX_COUNT)
- continue;
-
- loop_count = 0;
- if (local_clock_noinstr() - time_start > limit) {
- dev->poll_time_limit = true;
- break;
- }
- }
+ flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
+ (VAL & _TIF_NEED_RESCHED),
+ (local_clock_noinstr() >= time_end));
+ dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
}
raw_local_irq_disable();
--
2.43.5
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout()
2025-10-28 5:31 ` [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout() Ankur Arora
@ 2025-10-28 12:30 ` Rafael J. Wysocki
2025-10-29 4:41 ` Ankur Arora
2025-10-28 16:16 ` Christoph Lameter (Ampere)
1 sibling, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2025-10-28 12:30 UTC (permalink / raw)
To: Ankur Arora
Cc: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf, arnd,
catalin.marinas, will, peterz, akpm, mark.rutland, harisokn, cl,
ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
joao.m.martins, boris.ostrovsky, konrad.wilk
On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> The inner loop in poll_idle() polls over the thread_info flags,
> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
> exits once the condition is met, or if the poll time limit has
> been exceeded.
>
> To minimize the number of instructions executed in each iteration,
> the time check is done only intermittently (once every
> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
> executes cpu_relax() which on certain platforms provides a hint to
> the pipeline that the loop busy-waits, allowing the processor to
> reduce power consumption.
>
> This is close to what smp_cond_load_relaxed_timeout() provides. So,
> restructure the loop and fold the loop condition and the timeout check
> in smp_cond_load_relaxed_timeout().
Well, it is close, but is it close enough?
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
> 1 file changed, 8 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> index 9b6d90a72601..dc7f4b424fec 100644
> --- a/drivers/cpuidle/poll_state.c
> +++ b/drivers/cpuidle/poll_state.c
> @@ -8,35 +8,22 @@
> #include <linux/sched/clock.h>
> #include <linux/sched/idle.h>
>
> -#define POLL_IDLE_RELAX_COUNT 200
> -
> static int __cpuidle poll_idle(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
> {
> - u64 time_start;
> -
> - time_start = local_clock_noinstr();
> + u64 time_end;
> + u32 flags = 0;
>
> dev->poll_time_limit = false;
>
> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
Is there any particular reason for doing this unconditionally? If
not, then it looks like an arbitrary unrelated change to me.
> +
> raw_local_irq_enable();
> if (!current_set_polling_and_test()) {
> - unsigned int loop_count = 0;
> - u64 limit;
> -
> - limit = cpuidle_poll_time(drv, dev);
> -
> - while (!need_resched()) {
> - cpu_relax();
> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> - continue;
> -
> - loop_count = 0;
> - if (local_clock_noinstr() - time_start > limit) {
> - dev->poll_time_limit = true;
> - break;
> - }
> - }
> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
> + (VAL & _TIF_NEED_RESCHED),
> + (local_clock_noinstr() >= time_end));
So my understanding of this is that it reduces duplication with some
other places doing similar things. Fair enough.
However, since there is "timeout" in the name, I'd expect it to take
the timeout as an argument.
> + dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
> }
> raw_local_irq_disable();
>
> --
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout()
2025-10-28 12:30 ` Rafael J. Wysocki
@ 2025-10-29 4:41 ` Ankur Arora
2025-10-29 18:53 ` Rafael J. Wysocki
0 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2025-10-29 4:41 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, linux-pm,
bpf, arnd, catalin.marinas, will, peterz, akpm, mark.rutland,
harisokn, cl, ast, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
joao.m.martins, boris.ostrovsky, konrad.wilk
Rafael J. Wysocki <rafael@kernel.org> writes:
> On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> The inner loop in poll_idle() polls over the thread_info flags,
>> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
>> exits once the condition is met, or if the poll time limit has
>> been exceeded.
>>
>> To minimize the number of instructions executed in each iteration,
>> the time check is done only intermittently (once every
>> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
>> executes cpu_relax() which on certain platforms provides a hint to
>> the pipeline that the loop busy-waits, allowing the processor to
>> reduce power consumption.
>>
>> This is close to what smp_cond_load_relaxed_timeout() provides. So,
>> restructure the loop and fold the loop condition and the timeout check
>> in smp_cond_load_relaxed_timeout().
>
> Well, it is close, but is it close enough?
I guess that's the question.
>> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
>> 1 file changed, 8 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> index 9b6d90a72601..dc7f4b424fec 100644
>> --- a/drivers/cpuidle/poll_state.c
>> +++ b/drivers/cpuidle/poll_state.c
>> @@ -8,35 +8,22 @@
>> #include <linux/sched/clock.h>
>> #include <linux/sched/idle.h>
>>
>> -#define POLL_IDLE_RELAX_COUNT 200
>> -
>> static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> struct cpuidle_driver *drv, int index)
>> {
>> - u64 time_start;
>> -
>> - time_start = local_clock_noinstr();
>> + u64 time_end;
>> + u32 flags = 0;
>>
>> dev->poll_time_limit = false;
>>
>> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
>
> Is there any particular reason for doing this unconditionally? If
> not, then it looks like an arbitrary unrelated change to me.
Agreed. Will fix.
>> +
>> raw_local_irq_enable();
>> if (!current_set_polling_and_test()) {
>> - unsigned int loop_count = 0;
>> - u64 limit;
>> -
>> - limit = cpuidle_poll_time(drv, dev);
>> -
>> - while (!need_resched()) {
>> - cpu_relax();
>> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> - continue;
>> -
>> - loop_count = 0;
>> - if (local_clock_noinstr() - time_start > limit) {
>> - dev->poll_time_limit = true;
>> - break;
>> - }
>> - }
>> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
>> + (VAL & _TIF_NEED_RESCHED),
>> + (local_clock_noinstr() >= time_end));
>
> So my understanding of this is that it reduces duplication with some
> other places doing similar things. Fair enough.
>
> However, since there is "timeout" in the name, I'd expect it to take
> the timeout as an argument.
The early versions did have a timeout but that complicated the
implementation significantly. And the current users, poll_idle() and
rqspinlock, don't need a precise timeout.
smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
The problem with all the suffixes I can think of is that they make the
interface itself non-obvious.
Possibly something with the sense of bail out might work.
--
ankur
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout()
2025-10-29 4:41 ` Ankur Arora
@ 2025-10-29 18:53 ` Rafael J. Wysocki
2025-10-29 19:13 ` Ankur Arora
0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2025-10-29 18:53 UTC (permalink / raw)
To: Ankur Arora
Cc: Rafael J. Wysocki, linux-kernel, linux-arch, linux-arm-kernel,
linux-pm, bpf, arnd, catalin.marinas, will, peterz, akpm,
mark.rutland, harisokn, cl, ast, daniel.lezcano, memxor,
zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
konrad.wilk
On Wed, Oct 29, 2025 at 5:42 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>
> Rafael J. Wysocki <rafael@kernel.org> writes:
>
> > On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >>
> >> The inner loop in poll_idle() polls over the thread_info flags,
> >> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
> >> exits once the condition is met, or if the poll time limit has
> >> been exceeded.
> >>
> >> To minimize the number of instructions executed in each iteration,
> >> the time check is done only intermittently (once every
> >> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
> >> executes cpu_relax() which on certain platforms provides a hint to
> >> the pipeline that the loop busy-waits, allowing the processor to
> >> reduce power consumption.
> >>
> >> This is close to what smp_cond_load_relaxed_timeout() provides. So,
> >> restructure the loop and fold the loop condition and the timeout check
> >> in smp_cond_load_relaxed_timeout().
> >
> > Well, it is close, but is it close enough?
>
> I guess that's the question.
>
> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> >> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> ---
> >> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
> >> 1 file changed, 8 insertions(+), 21 deletions(-)
> >>
> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> >> index 9b6d90a72601..dc7f4b424fec 100644
> >> --- a/drivers/cpuidle/poll_state.c
> >> +++ b/drivers/cpuidle/poll_state.c
> >> @@ -8,35 +8,22 @@
> >> #include <linux/sched/clock.h>
> >> #include <linux/sched/idle.h>
> >>
> >> -#define POLL_IDLE_RELAX_COUNT 200
> >> -
> >> static int __cpuidle poll_idle(struct cpuidle_device *dev,
> >> struct cpuidle_driver *drv, int index)
> >> {
> >> - u64 time_start;
> >> -
> >> - time_start = local_clock_noinstr();
> >> + u64 time_end;
> >> + u32 flags = 0;
> >>
> >> dev->poll_time_limit = false;
> >>
> >> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
> >
> > Is there any particular reason for doing this unconditionally? If
> > not, then it looks like an arbitrary unrelated change to me.
>
> Agreed. Will fix.
>
> >> +
> >> raw_local_irq_enable();
> >> if (!current_set_polling_and_test()) {
> >> - unsigned int loop_count = 0;
> >> - u64 limit;
> >> -
> >> - limit = cpuidle_poll_time(drv, dev);
> >> -
> >> - while (!need_resched()) {
> >> - cpu_relax();
> >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> >> - continue;
> >> -
> >> - loop_count = 0;
> >> - if (local_clock_noinstr() - time_start > limit) {
> >> - dev->poll_time_limit = true;
> >> - break;
> >> - }
> >> - }
> >> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
> >> + (VAL & _TIF_NEED_RESCHED),
> >> + (local_clock_noinstr() >= time_end));
> >
> > So my understanding of this is that it reduces duplication with some
> > other places doing similar things. Fair enough.
> >
> > However, since there is "timeout" in the name, I'd expect it to take
> > the timeout as an argument.
>
> The early versions did have a timeout but that complicated the
> implementation significantly. And the current users poll_idle(),
> rqspinlock don't need a precise timeout.
>
> smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
>
> The problem with all suffixes I can think of is that it makes the
> interface itself nonobvious.
>
> Possibly something with the sense of bail out might work.
It basically has two conditions, one of which is checked in every step
of the internal loop and the other one is checked every
SMP_TIMEOUT_POLL_COUNT steps of it. That isn't particularly
straightforward IMV.
Honestly, I prefer the existing code. It is much easier to follow and
I don't see why the new code would be better. Sorry.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout()
2025-10-29 18:53 ` Rafael J. Wysocki
@ 2025-10-29 19:13 ` Ankur Arora
2025-10-29 20:29 ` Rafael J. Wysocki
0 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2025-10-29 19:13 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, linux-pm,
bpf, arnd, catalin.marinas, will, peterz, akpm, mark.rutland,
harisokn, cl, ast, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
joao.m.martins, boris.ostrovsky, konrad.wilk
Rafael J. Wysocki <rafael@kernel.org> writes:
> On Wed, Oct 29, 2025 at 5:42 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>
>> Rafael J. Wysocki <rafael@kernel.org> writes:
>>
>> > On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >>
>> >> The inner loop in poll_idle() polls over the thread_info flags,
>> >> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
>> >> exits once the condition is met, or if the poll time limit has
>> >> been exceeded.
>> >>
>> >> To minimize the number of instructions executed in each iteration,
>> >> the time check is done only intermittently (once every
>> >> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
>> >> executes cpu_relax() which on certain platforms provides a hint to
>> >> the pipeline that the loop busy-waits, allowing the processor to
>> >> reduce power consumption.
>> >>
>> >> This is close to what smp_cond_load_relaxed_timeout() provides. So,
>> >> restructure the loop and fold the loop condition and the timeout check
>> >> in smp_cond_load_relaxed_timeout().
>> >
>> > Well, it is close, but is it close enough?
>>
>> I guess that's the question.
>>
>> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> >> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
>> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> >> ---
>> >> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
>> >> 1 file changed, 8 insertions(+), 21 deletions(-)
>> >>
>> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> >> index 9b6d90a72601..dc7f4b424fec 100644
>> >> --- a/drivers/cpuidle/poll_state.c
>> >> +++ b/drivers/cpuidle/poll_state.c
>> >> @@ -8,35 +8,22 @@
>> >> #include <linux/sched/clock.h>
>> >> #include <linux/sched/idle.h>
>> >>
>> >> -#define POLL_IDLE_RELAX_COUNT 200
>> >> -
>> >> static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> >> struct cpuidle_driver *drv, int index)
>> >> {
>> >> - u64 time_start;
>> >> -
>> >> - time_start = local_clock_noinstr();
>> >> + u64 time_end;
>> >> + u32 flags = 0;
>> >>
>> >> dev->poll_time_limit = false;
>> >>
>> >> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
>> >
>> > Is there any particular reason for doing this unconditionally? If
>> > not, then it looks like an arbitrary unrelated change to me.
>>
>> Agreed. Will fix.
>>
>> >> +
>> >> raw_local_irq_enable();
>> >> if (!current_set_polling_and_test()) {
>> >> - unsigned int loop_count = 0;
>> >> - u64 limit;
>> >> -
>> >> - limit = cpuidle_poll_time(drv, dev);
>> >> -
>> >> - while (!need_resched()) {
>> >> - cpu_relax();
>> >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> >> - continue;
>> >> -
>> >> - loop_count = 0;
>> >> - if (local_clock_noinstr() - time_start > limit) {
>> >> - dev->poll_time_limit = true;
>> >> - break;
>> >> - }
>> >> - }
>> >> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
>> >> + (VAL & _TIF_NEED_RESCHED),
>> >> + (local_clock_noinstr() >= time_end));
>> >
>> > So my understanding of this is that it reduces duplication with some
>> > other places doing similar things. Fair enough.
>> >
>> > However, since there is "timeout" in the name, I'd expect it to take
>> > the timeout as an argument.
>>
>> The early versions did have a timeout but that complicated the
>> implementation significantly. And the current users poll_idle(),
>> rqspinlock don't need a precise timeout.
>>
>> smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
>>
>> The problem with all suffixes I can think of is that it makes the
>> interface itself nonobvious.
>>
>> Possibly something with the sense of bail out might work.
>
> It basically has two conditions, one of which is checked in every step
> of the internal loop and the other one is checked every
> SMP_TIMEOUT_POLL_COUNT steps of it. That isn't particularly
> straightforward IMV.
Right. And that's similar to what poll_idle() does.
> Honestly, I prefer the existing code. It is much easier to follow and
> I don't see why the new code would be better. Sorry.
I don't think there's any problem with the current code. However, I'd like
to add support for poll_idle() on arm64 (and maybe other platforms) where
instead of spinning in a cpu_relax() loop, you wait on a cacheline.
And that's what using something like smp_cond_load_relaxed_timeout()
would enable.
Something like the series here:
https://lore.kernel.org/lkml/87wmaljd81.fsf@oracle.com/
(Sorry, should have mentioned this in the commit message.)
--
ankur
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout()
2025-10-29 19:13 ` Ankur Arora
@ 2025-10-29 20:29 ` Rafael J. Wysocki
2025-10-29 21:01 ` Ankur Arora
0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2025-10-29 20:29 UTC (permalink / raw)
To: Ankur Arora
Cc: Rafael J. Wysocki, linux-kernel, linux-arch, linux-arm-kernel,
linux-pm, bpf, arnd, catalin.marinas, will, peterz, akpm,
mark.rutland, harisokn, cl, ast, daniel.lezcano, memxor,
zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
konrad.wilk
On Wed, Oct 29, 2025 at 8:13 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>
> Rafael J. Wysocki <rafael@kernel.org> writes:
>
> > On Wed, Oct 29, 2025 at 5:42 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >>
> >>
> >> Rafael J. Wysocki <rafael@kernel.org> writes:
> >>
> >> > On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >> >>
> >> >> The inner loop in poll_idle() polls over the thread_info flags,
> >> >> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
> >> >> exits once the condition is met, or if the poll time limit has
> >> >> been exceeded.
> >> >>
> >> >> To minimize the number of instructions executed in each iteration,
> >> >> the time check is done only intermittently (once every
> >> >> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
> >> >> executes cpu_relax() which on certain platforms provides a hint to
> >> >> the pipeline that the loop busy-waits, allowing the processor to
> >> >> reduce power consumption.
> >> >>
> >> >> This is close to what smp_cond_load_relaxed_timeout() provides. So,
> >> >> restructure the loop and fold the loop condition and the timeout check
> >> >> in smp_cond_load_relaxed_timeout().
> >> >
> >> > Well, it is close, but is it close enough?
> >>
> >> I guess that's the question.
> >>
> >> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> >> >> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> >> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> >> ---
> >> >> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
> >> >> 1 file changed, 8 insertions(+), 21 deletions(-)
> >> >>
> >> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> >> >> index 9b6d90a72601..dc7f4b424fec 100644
> >> >> --- a/drivers/cpuidle/poll_state.c
> >> >> +++ b/drivers/cpuidle/poll_state.c
> >> >> @@ -8,35 +8,22 @@
> >> >> #include <linux/sched/clock.h>
> >> >> #include <linux/sched/idle.h>
> >> >>
> >> >> -#define POLL_IDLE_RELAX_COUNT 200
> >> >> -
> >> >> static int __cpuidle poll_idle(struct cpuidle_device *dev,
> >> >> struct cpuidle_driver *drv, int index)
> >> >> {
> >> >> - u64 time_start;
> >> >> -
> >> >> - time_start = local_clock_noinstr();
> >> >> + u64 time_end;
> >> >> + u32 flags = 0;
> >> >>
> >> >> dev->poll_time_limit = false;
> >> >>
> >> >> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
> >> >
> >> > Is there any particular reason for doing this unconditionally? If
> >> > not, then it looks like an arbitrary unrelated change to me.
> >>
> >> Agreed. Will fix.
> >>
> >> >> +
> >> >> raw_local_irq_enable();
> >> >> if (!current_set_polling_and_test()) {
> >> >> - unsigned int loop_count = 0;
> >> >> - u64 limit;
> >> >> -
> >> >> - limit = cpuidle_poll_time(drv, dev);
> >> >> -
> >> >> - while (!need_resched()) {
> >> >> - cpu_relax();
> >> >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> >> >> - continue;
> >> >> -
> >> >> - loop_count = 0;
> >> >> - if (local_clock_noinstr() - time_start > limit) {
> >> >> - dev->poll_time_limit = true;
> >> >> - break;
> >> >> - }
> >> >> - }
> >> >> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
> >> >> + (VAL & _TIF_NEED_RESCHED),
> >> >> + (local_clock_noinstr() >= time_end));
> >> >
> >> > So my understanding of this is that it reduces duplication with some
> >> > other places doing similar things. Fair enough.
> >> >
> >> > However, since there is "timeout" in the name, I'd expect it to take
> >> > the timeout as an argument.
> >>
> >> The early versions did have a timeout but that complicated the
> >> implementation significantly. And the current users poll_idle(),
> >> rqspinlock don't need a precise timeout.
> >>
> >> smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
> >>
> >> The problem with all suffixes I can think of is that it makes the
> >> interface itself nonobvious.
> >>
> >> Possibly something with the sense of bail out might work.
> >
> > It basically has two conditions, one of which is checked in every step
> > of the internal loop and the other one is checked every
> > SMP_TIMEOUT_POLL_COUNT steps of it. That isn't particularly
> > straightforward IMV.
>
> Right. And that's similar to what poll_idle().
My point is that the macro in its current form is not particularly
straightforward.
The code in poll_idle() does what it needs to do.
> > Honestly, I prefer the existing code. It is much easier to follow and
> > I don't see why the new code would be better. Sorry.
>
> I don't think there's any problem with the current code. However, I'd like
> to add support for poll_idle() on arm64 (and maybe other platforms) where
> instead of spinning in a cpu_relax() loop, you wait on a cacheline.
Well, there is MWAIT on x86, but it is not used here. It just takes
too much time to wake up from. There are "fast" variants of that too,
but they have been designed with user space in mind, so somewhat
cumbersome for kernel use.
> And that's what using something like smp_cond_load_relaxed_timeout()
> would enable.
>
> Something like the series here:
> https://lore.kernel.org/lkml/87wmaljd81.fsf@oracle.com/
>
> (Sorry, should have mentioned this in the commit message.)
I'm not sure how you can combine that with a proper timeout. The
timeout is needed because you want to break out of this when it starts
to take too much time, so you can go back to the idle loop and maybe
select a better idle state.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout()
2025-10-29 20:29 ` Rafael J. Wysocki
@ 2025-10-29 21:01 ` Ankur Arora
2025-11-04 18:07 ` Rafael J. Wysocki
0 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2025-10-29 21:01 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, linux-pm,
bpf, arnd, catalin.marinas, will, peterz, akpm, mark.rutland,
harisokn, cl, ast, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
joao.m.martins, boris.ostrovsky, konrad.wilk
Rafael J. Wysocki <rafael@kernel.org> writes:
> On Wed, Oct 29, 2025 at 8:13 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>
>> Rafael J. Wysocki <rafael@kernel.org> writes:
>>
>> > On Wed, Oct 29, 2025 at 5:42 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >>
>> >>
>> >> Rafael J. Wysocki <rafael@kernel.org> writes:
>> >>
>> >> > On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >> >>
>> >> >> The inner loop in poll_idle() polls over the thread_info flags,
>> >> >> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
>> >> >> exits once the condition is met, or if the poll time limit has
>> >> >> been exceeded.
>> >> >>
>> >> >> To minimize the number of instructions executed in each iteration,
>> >> >> the time check is done only intermittently (once every
>> >> >> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
>> >> >> executes cpu_relax() which on certain platforms provides a hint to
>> >> >> the pipeline that the loop busy-waits, allowing the processor to
>> >> >> reduce power consumption.
>> >> >>
>> >> >> This is close to what smp_cond_load_relaxed_timeout() provides. So,
>> >> >> restructure the loop and fold the loop condition and the timeout check
>> >> >> in smp_cond_load_relaxed_timeout().
>> >> >
>> >> > Well, it is close, but is it close enough?
>> >>
>> >> I guess that's the question.
>> >>
>> >> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> >> >> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
>> >> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> >> >> ---
>> >> >> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
>> >> >> 1 file changed, 8 insertions(+), 21 deletions(-)
>> >> >>
>> >> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> >> >> index 9b6d90a72601..dc7f4b424fec 100644
>> >> >> --- a/drivers/cpuidle/poll_state.c
>> >> >> +++ b/drivers/cpuidle/poll_state.c
>> >> >> @@ -8,35 +8,22 @@
>> >> >> #include <linux/sched/clock.h>
>> >> >> #include <linux/sched/idle.h>
>> >> >>
>> >> >> -#define POLL_IDLE_RELAX_COUNT 200
>> >> >> -
>> >> >> static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> >> >> struct cpuidle_driver *drv, int index)
>> >> >> {
>> >> >> - u64 time_start;
>> >> >> -
>> >> >> - time_start = local_clock_noinstr();
>> >> >> + u64 time_end;
>> >> >> + u32 flags = 0;
>> >> >>
>> >> >> dev->poll_time_limit = false;
>> >> >>
>> >> >> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
>> >> >
>> >> > Is there any particular reason for doing this unconditionally? If
>> >> > not, then it looks like an arbitrary unrelated change to me.
>> >>
>> >> Agreed. Will fix.
>> >>
>> >> >> +
>> >> >> raw_local_irq_enable();
>> >> >> if (!current_set_polling_and_test()) {
>> >> >> - unsigned int loop_count = 0;
>> >> >> - u64 limit;
>> >> >> -
>> >> >> - limit = cpuidle_poll_time(drv, dev);
>> >> >> -
>> >> >> - while (!need_resched()) {
>> >> >> - cpu_relax();
>> >> >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> >> >> - continue;
>> >> >> -
>> >> >> - loop_count = 0;
>> >> >> - if (local_clock_noinstr() - time_start > limit) {
>> >> >> - dev->poll_time_limit = true;
>> >> >> - break;
>> >> >> - }
>> >> >> - }
>> >> >> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
>> >> >> + (VAL & _TIF_NEED_RESCHED),
>> >> >> + (local_clock_noinstr() >= time_end));
>> >> >
>> >> > So my understanding of this is that it reduces duplication with some
>> >> > other places doing similar things. Fair enough.
>> >> >
>> >> > However, since there is "timeout" in the name, I'd expect it to take
>> >> > the timeout as an argument.
>> >>
>> >> The early versions did have a timeout but that complicated the
>> >> implementation significantly. And the current users poll_idle(),
>> >> rqspinlock don't need a precise timeout.
>> >>
>> >> smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
>> >>
>> >> The problem with all suffixes I can think of is that it makes the
>> >> interface itself nonobvious.
>> >>
>> >> Possibly something with the sense of bail out might work.
>> >
>> > It basically has two conditions, one of which is checked in every step
>> > of the internal loop and the other one is checked every
>> > SMP_TIMEOUT_POLL_COUNT steps of it. That isn't particularly
>> > straightforward IMV.
>>
>> Right. And that's similar to what poll_idle().
>
> My point is that the macro in its current form is not particularly
> straightforward.
>
> The code in poll_idle() does what it needs to do.
>
>> > Honestly, I prefer the existing code. It is much easier to follow and
>> > I don't see why the new code would be better. Sorry.
>>
>> I don't think there's any problem with the current code. However, I'd like
>> to add support for poll_idle() on arm64 (and maybe other platforms) where
>> instead of spinning in a cpu_relax() loop, you wait on a cacheline.
>
> Well, there is MWAIT on x86, but it is not used here. It just takes
> too much time to wake up from. There are "fast" variants of that too,
> but they have been designed with user space in mind, so somewhat
> cumbersome for kernel use.
>
>> And that's what using something like smp_cond_load_relaxed_timeout()
>> would enable.
>>
>> Something like the series here:
>> https://lore.kernel.org/lkml/87wmaljd81.fsf@oracle.com/
>>
>> (Sorry, should have mentioned this in the commit message.)
>
> I'm not sure how you can combine that with a proper timeout.
Would taking the timeout as a separate argument work?
flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
(VAL & _TIF_NEED_RESCHED),
local_clock_noinstr(), time_end);
Or you are thinking of something on different lines from the smp_cond_load
kind of interface?
> The
> timeout is needed because you want to break out of this when it starts
> to take too much time, so you can go back to the idle loop and maybe
> select a better idle state.
Agreed. And that will happen with the version in the patch:
flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
(VAL & _TIF_NEED_RESCHED),
(local_clock_noinstr() >= time_end));
Just that, in the wait-based mode on arm64, the timeout might be delayed
depending on the granularity of the event stream.
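For a rough sense of the slack (assuming the usual ~10 kHz event stream
frequency, which is worth checking per platform):

	1 / 10 kHz = ~100 us

so the exit can lag the nominal deadline by up to about one such period.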
Thanks
--
ankur
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout()
2025-10-29 21:01 ` Ankur Arora
@ 2025-11-04 18:07 ` Rafael J. Wysocki
2025-11-05 8:30 ` Ankur Arora
0 siblings, 1 reply; 32+ messages in thread
From: Rafael J. Wysocki @ 2025-11-04 18:07 UTC (permalink / raw)
To: Ankur Arora
Cc: Rafael J. Wysocki, linux-kernel, linux-arch, linux-arm-kernel,
linux-pm, bpf, arnd, catalin.marinas, will, peterz, akpm,
mark.rutland, harisokn, cl, ast, daniel.lezcano, memxor,
zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
konrad.wilk
On Wed, Oct 29, 2025 at 10:01 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>
> Rafael J. Wysocki <rafael@kernel.org> writes:
>
> > On Wed, Oct 29, 2025 at 8:13 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >>
> >>
> >> Rafael J. Wysocki <rafael@kernel.org> writes:
> >>
> >> > On Wed, Oct 29, 2025 at 5:42 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >> >>
> >> >>
> >> >> Rafael J. Wysocki <rafael@kernel.org> writes:
> >> >>
> >> >> > On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >> >> >>
> >> >> >> The inner loop in poll_idle() polls over the thread_info flags,
> >> >> >> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
> >> >> >> exits once the condition is met, or if the poll time limit has
> >> >> >> been exceeded.
> >> >> >>
> >> >> >> To minimize the number of instructions executed in each iteration,
> >> >> >> the time check is done only intermittently (once every
> >> >> >> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
> >> >> >> executes cpu_relax() which on certain platforms provides a hint to
> >> >> >> the pipeline that the loop busy-waits, allowing the processor to
> >> >> >> reduce power consumption.
> >> >> >>
> >> >> >> This is close to what smp_cond_load_relaxed_timeout() provides. So,
> >> >> >> restructure the loop and fold the loop condition and the timeout check
> >> >> >> in smp_cond_load_relaxed_timeout().
> >> >> >
> >> >> > Well, it is close, but is it close enough?
> >> >>
> >> >> I guess that's the question.
> >> >>
> >> >> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> >> >> >> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> >> >> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> >> >> ---
> >> >> >> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
> >> >> >> 1 file changed, 8 insertions(+), 21 deletions(-)
> >> >> >>
> >> >> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> >> >> >> index 9b6d90a72601..dc7f4b424fec 100644
> >> >> >> --- a/drivers/cpuidle/poll_state.c
> >> >> >> +++ b/drivers/cpuidle/poll_state.c
> >> >> >> @@ -8,35 +8,22 @@
> >> >> >> #include <linux/sched/clock.h>
> >> >> >> #include <linux/sched/idle.h>
> >> >> >>
> >> >> >> -#define POLL_IDLE_RELAX_COUNT 200
> >> >> >> -
> >> >> >> static int __cpuidle poll_idle(struct cpuidle_device *dev,
> >> >> >> struct cpuidle_driver *drv, int index)
> >> >> >> {
> >> >> >> - u64 time_start;
> >> >> >> -
> >> >> >> - time_start = local_clock_noinstr();
> >> >> >> + u64 time_end;
> >> >> >> + u32 flags = 0;
> >> >> >>
> >> >> >> dev->poll_time_limit = false;
> >> >> >>
> >> >> >> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
> >> >> >
> >> >> > Is there any particular reason for doing this unconditionally? If
> >> >> > not, then it looks like an arbitrary unrelated change to me.
> >> >>
> >> >> Agreed. Will fix.
> >> >>
> >> >> >> +
> >> >> >> raw_local_irq_enable();
> >> >> >> if (!current_set_polling_and_test()) {
> >> >> >> - unsigned int loop_count = 0;
> >> >> >> - u64 limit;
> >> >> >> -
> >> >> >> - limit = cpuidle_poll_time(drv, dev);
> >> >> >> -
> >> >> >> - while (!need_resched()) {
> >> >> >> - cpu_relax();
> >> >> >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> >> >> >> - continue;
> >> >> >> -
> >> >> >> - loop_count = 0;
> >> >> >> - if (local_clock_noinstr() - time_start > limit) {
> >> >> >> - dev->poll_time_limit = true;
> >> >> >> - break;
> >> >> >> - }
> >> >> >> - }
> >> >> >> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
> >> >> >> + (VAL & _TIF_NEED_RESCHED),
> >> >> >> + (local_clock_noinstr() >= time_end));
> >> >> >
> >> >> > So my understanding of this is that it reduces duplication with some
> >> >> > other places doing similar things. Fair enough.
> >> >> >
> >> >> > However, since there is "timeout" in the name, I'd expect it to take
> >> >> > the timeout as an argument.
> >> >>
> >> >> The early versions did have a timeout but that complicated the
> >> >> implementation significantly. And the current users poll_idle(),
> >> >> rqspinlock don't need a precise timeout.
> >> >>
> >> >> smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
> >> >>
> >> >> The problem with all suffixes I can think of is that it makes the
> >> >> interface itself nonobvious.
> >> >>
> >> >> Possibly something with the sense of bail out might work.
> >> >
> >> > It basically has two conditions, one of which is checked in every step
> >> > of the internal loop and the other one is checked every
> >> > SMP_TIMEOUT_POLL_COUNT steps of it. That isn't particularly
> >> > straightforward IMV.
> >>
> >> Right. And that's similar to what poll_idle().
> >
> > My point is that the macro in its current form is not particularly
> > straightforward.
> >
> > The code in poll_idle() does what it needs to do.
> >
> >> > Honestly, I prefer the existing code. It is much easier to follow and
> >> > I don't see why the new code would be better. Sorry.
> >>
> >> I don't think there's any problem with the current code. However, I'd like
> >> to add support for poll_idle() on arm64 (and maybe other platforms) where
> >> instead of spinning in a cpu_relax() loop, you wait on a cacheline.
> >
> > Well, there is MWAIT on x86, but it is not used here. It just takes
> > too much time to wake up from. There are "fast" variants of that too,
> > but they have been designed with user space in mind, so somewhat
> > cumbersome for kernel use.
> >
> >> And that's what using something like smp_cond_load_relaxed_timeout()
> >> would enable.
> >>
> >> Something like the series here:
> >> https://lore.kernel.org/lkml/87wmaljd81.fsf@oracle.com/
> >>
> >> (Sorry, should have mentioned this in the commit message.)
> >
> > I'm not sure how you can combine that with a proper timeout.
>
> Would taking the timeout as a separate argument work?
>
> flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
> (VAL & _TIF_NEED_RESCHED),
> local_clock_noinstr(), time_end);
>
> Or you are thinking of something on different lines from the smp_cond_load
> kind of interface?
I would like it to be something along the lines of
arch_busy_wait_for_need_resched(time_limit);
dev->poll_time_limit = !need_resched();
and I don't care much about how exactly this is done in the arch code,
so long as it does what it says.
> > The timeout is needed because you want to break out of this when it starts
> > to take too much time, so you can go back to the idle loop and maybe
> > select a better idle state.
>
> Agreed. And that will happen with the version in the patch:
>
> flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
> (VAL & _TIF_NEED_RESCHED),
> (local_clock_noinstr() >= time_end));
>
> Just that with waited mode on arm64 the timeout might be delayed depending
> on granularity of the event stream.
That's fine. cpuidle_poll_time() is not exact anyway.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout()
2025-11-04 18:07 ` Rafael J. Wysocki
@ 2025-11-05 8:30 ` Ankur Arora
0 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-11-05 8:30 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, linux-pm,
bpf, arnd, catalin.marinas, will, peterz, akpm, mark.rutland,
harisokn, cl, ast, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
joao.m.martins, boris.ostrovsky, konrad.wilk
Rafael J. Wysocki <rafael@kernel.org> writes:
> On Wed, Oct 29, 2025 at 10:01 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>
>> Rafael J. Wysocki <rafael@kernel.org> writes:
>>
>> > On Wed, Oct 29, 2025 at 8:13 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >>
>> >>
>> >> Rafael J. Wysocki <rafael@kernel.org> writes:
>> >>
>> >> > On Wed, Oct 29, 2025 at 5:42 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >> >>
>> >> >>
>> >> >> Rafael J. Wysocki <rafael@kernel.org> writes:
>> >> >>
>> >> >> > On Tue, Oct 28, 2025 at 6:32 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >> >> >>
>> >> >> >> The inner loop in poll_idle() polls over the thread_info flags,
>> >> >> >> waiting to see if the thread has TIF_NEED_RESCHED set. The loop
>> >> >> >> exits once the condition is met, or if the poll time limit has
>> >> >> >> been exceeded.
>> >> >> >>
>> >> >> >> To minimize the number of instructions executed in each iteration,
>> >> >> >> the time check is done only intermittently (once every
>> >> >> >> POLL_IDLE_RELAX_COUNT iterations). In addition, each loop iteration
>> >> >> >> executes cpu_relax() which on certain platforms provides a hint to
>> >> >> >> the pipeline that the loop busy-waits, allowing the processor to
>> >> >> >> reduce power consumption.
>> >> >> >>
>> >> >> >> This is close to what smp_cond_load_relaxed_timeout() provides. So,
>> >> >> >> restructure the loop and fold the loop condition and the timeout check
>> >> >> >> in smp_cond_load_relaxed_timeout().
>> >> >> >
>> >> >> > Well, it is close, but is it close enough?
>> >> >>
>> >> >> I guess that's the question.
>> >> >>
>> >> >> >> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> >> >> >> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
>> >> >> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> >> >> >> ---
>> >> >> >> drivers/cpuidle/poll_state.c | 29 ++++++++---------------------
>> >> >> >> 1 file changed, 8 insertions(+), 21 deletions(-)
>> >> >> >>
>> >> >> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> >> >> >> index 9b6d90a72601..dc7f4b424fec 100644
>> >> >> >> --- a/drivers/cpuidle/poll_state.c
>> >> >> >> +++ b/drivers/cpuidle/poll_state.c
>> >> >> >> @@ -8,35 +8,22 @@
>> >> >> >> #include <linux/sched/clock.h>
>> >> >> >> #include <linux/sched/idle.h>
>> >> >> >>
>> >> >> >> -#define POLL_IDLE_RELAX_COUNT 200
>> >> >> >> -
>> >> >> >> static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> >> >> >> struct cpuidle_driver *drv, int index)
>> >> >> >> {
>> >> >> >> - u64 time_start;
>> >> >> >> -
>> >> >> >> - time_start = local_clock_noinstr();
>> >> >> >> + u64 time_end;
>> >> >> >> + u32 flags = 0;
>> >> >> >>
>> >> >> >> dev->poll_time_limit = false;
>> >> >> >>
>> >> >> >> + time_end = local_clock_noinstr() + cpuidle_poll_time(drv, dev);
>> >> >> >
>> >> >> > Is there any particular reason for doing this unconditionally? If
>> >> >> > not, then it looks like an arbitrary unrelated change to me.
>> >> >>
>> >> >> Agreed. Will fix.
>> >> >>
>> >> >> >> +
>> >> >> >> raw_local_irq_enable();
>> >> >> >> if (!current_set_polling_and_test()) {
>> >> >> >> - unsigned int loop_count = 0;
>> >> >> >> - u64 limit;
>> >> >> >> -
>> >> >> >> - limit = cpuidle_poll_time(drv, dev);
>> >> >> >> -
>> >> >> >> - while (!need_resched()) {
>> >> >> >> - cpu_relax();
>> >> >> >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> >> >> >> - continue;
>> >> >> >> -
>> >> >> >> - loop_count = 0;
>> >> >> >> - if (local_clock_noinstr() - time_start > limit) {
>> >> >> >> - dev->poll_time_limit = true;
>> >> >> >> - break;
>> >> >> >> - }
>> >> >> >> - }
>> >> >> >> + flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
>> >> >> >> + (VAL & _TIF_NEED_RESCHED),
>> >> >> >> + (local_clock_noinstr() >= time_end));
>> >> >> >
>> >> >> > So my understanding of this is that it reduces duplication with some
>> >> >> > other places doing similar things. Fair enough.
>> >> >> >
>> >> >> > However, since there is "timeout" in the name, I'd expect it to take
>> >> >> > the timeout as an argument.
>> >> >>
>> >> >> The early versions did have a timeout but that complicated the
>> >> >> implementation significantly. And the current users poll_idle(),
>> >> >> rqspinlock don't need a precise timeout.
>> >> >>
>> >> >> smp_cond_load_relaxed_timed(), smp_cond_load_relaxed_timecheck()?
>> >> >>
>> >> >> The problem with all suffixes I can think of is that it makes the
>> >> >> interface itself nonobvious.
>> >> >>
>> >> >> Possibly something with the sense of bail out might work.
>> >> >
>> >> > It basically has two conditions, one of which is checked in every step
>> >> > of the internal loop and the other one is checked every
>> >> > SMP_TIMEOUT_POLL_COUNT steps of it. That isn't particularly
>> >> > straightforward IMV.
>> >>
>> >> Right. And that's similar to what poll_idle().
>> >
>> > My point is that the macro in its current form is not particularly
>> > straightforward.
>> >
>> > The code in poll_idle() does what it needs to do.
>> >
>> >> > Honestly, I prefer the existing code. It is much easier to follow and
>> >> > I don't see why the new code would be better. Sorry.
>> >>
>> >> I don't think there's any problem with the current code. However, I'd like
>> >> to add support for poll_idle() on arm64 (and maybe other platforms) where
>> >> instead of spinning in a cpu_relax() loop, you wait on a cacheline.
>> >
>> > Well, there is MWAIT on x86, but it is not used here. It just takes
>> > too much time to wake up from. There are "fast" variants of that too,
>> > but they have been designed with user space in mind, so somewhat
>> > cumbersome for kernel use.
>> >
>> >> And that's what using something like smp_cond_load_relaxed_timeout()
>> >> would enable.
>> >>
>> >> Something like the series here:
>> >> https://lore.kernel.org/lkml/87wmaljd81.fsf@oracle.com/
>> >>
>> >> (Sorry, should have mentioned this in the commit message.)
>> >
>> > I'm not sure how you can combine that with a proper timeout.
>>
>> Would taking the timeout as a separate argument work?
>>
>> flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
>> (VAL & _TIF_NEED_RESCHED),
>> local_clock_noinstr(), time_end);
>>
>> Or you are thinking of something on different lines from the smp_cond_load
>> kind of interface?
>
> I would like it to be something along the lines of
>
> arch_busy_wait_for_need_resched(time_limit);
> dev->poll_time_limit = !need_resched();
>
> and I don't care much about how exactly this is done in the arch code,
> so long as it does what it says.
This looks great. I think it could just be:
tif_need_resched_wait(time_limit);
And, given that this is tied in with scheduling contexts, this interface
should be able to use local_clock()/sched_clock().
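A minimal sketch of that shape, assuming it just wraps the new primitive
(the name follows the discussion above, everything else is hypothetical):

	/* Poll until TIF_NEED_RESCHED is set or time_limit (ns) passes. */
	static __always_inline bool tif_need_resched_wait(u64 time_limit)
	{
		unsigned long flags;

		flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
						      (VAL & _TIF_NEED_RESCHED),
						      (local_clock() >= time_limit));
		return flags & _TIF_NEED_RESCHED;
	}

with poll_idle() then reducing to roughly:

	dev->poll_time_limit = !tif_need_resched_wait(time_end);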
--
ankur
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout()
2025-10-28 5:31 ` [RESEND PATCH v7 7/7] cpuidle/poll_state: Poll via smp_cond_load_relaxed_timeout() Ankur Arora
2025-10-28 12:30 ` Rafael J. Wysocki
@ 2025-10-28 16:16 ` Christoph Lameter (Ampere)
1 sibling, 0 replies; 32+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-10-28 16:16 UTC (permalink / raw)
To: Ankur Arora
Cc: bpf, arnd, catalin.marinas, will, peterz, akpm, mark.rutland,
harisokn, ast, rafael, daniel.lezcano, memxor, zhenglifeng1,
xueshuai, joao.m.martins, boris.ostrovsky, konrad.wilk
Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
^ permalink raw reply [flat|nested] 32+ messages in thread