linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v3 0/5] barrier: Add smp_cond_load_*_timewait()
@ 2025-06-27  4:48 Ankur Arora
  2025-06-27  4:48 ` [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Ankur Arora @ 2025-06-27  4:48 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, memxor, zhenglifeng1, xueshuai, joao.m.martins,
	boris.ostrovsky, konrad.wilk

Hi,

This series adds waited variants of the smp_cond_load() primitives:
smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().

Why?: as the name suggests, the new interfaces are meant for contexts
where you want to wait on a condition variable for a finite duration.
This is easy enough to do with a loop around cpu_relax(). However,
some architectures (ex. arm64) also allow waiting on a cacheline. So,
these interfaces handle a mixture of spin/wait with a smp_cond_load()
thrown in.
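
For context, the open-coded pattern that these interfaces replace looks
roughly like this (just a sketch; "deadline" stands in for whatever
clock value the caller computes):

	/* spin until *ptr becomes non-zero or the deadline passes */
	while (!READ_ONCE(*ptr)) {
		if (local_clock() >= deadline)
			break;
		cpu_relax();
	}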

There are two known users for these interfaces:

 - poll_idle() [1]
 - resilient queued spinlocks [2]

The interfaces are:
   smp_cond_load_relaxed_timewait(ptr, cond_expr,
                                  time_expr, time_end, slack)
   smp_cond_load_acquire_timewait(ptr, cond_expr,
                                  time_expr, time_end, slack)

The added parameters pertain to the timeout checks and a measure of how
much slack the caller can tolerate in the timeout. The slack matters
mainly in the wait state, where wakeup depends on an asynchronous event.
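
As an illustration, a (hypothetical) caller waiting on a flag might look
like the following, where time_now_us(), obj, DONE and timeout_us are
all stand-ins for whatever the caller already has:

	u64 end = time_now_us() + timeout_us;

	flags = smp_cond_load_relaxed_timewait(&obj->flags,
					       VAL & DONE,	/* cond_expr */
					       time_now_us(),	/* time_expr */
					       end,		/* time_end  */
					       0);		/* slack     */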

Changelog:
  v2 [3]:
    - simplified the interface (suggested by Catalin Marinas)
       - get rid of wait_policy, and a multitude of constants
       - adds a slack parameter
      This helped remove a fair amount of code duplication and, in
      hindsight, unnecessary constants.

  v1 [4]:
     - add wait_policy (coarse and fine)
     - derive spin-count etc at runtime instead of using arbitrary
       constants.

Haris Okanovic had tested an earlier version of this series with 
poll_idle()/haltpoll patches. [5]

Any comments appreciated!

Ankur

[1] https://lore.kernel.org/lkml/20241107190818.522639-3-ankur.a.arora@oracle.com/
[2] Uses the smp_cond_load_acquire_timewait() from v1
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/rqspinlock.h
[3] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/
[4] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
[5] https://lore.kernel.org/lkml/f2f5d09e79539754ced085ed89865787fa668695.camel@amazon.com

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: linux-arch@vger.kernel.org

Ankur Arora (5):
  asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  asm-generic: barrier: Handle spin-wait in
    smp_cond_load_relaxed_timewait()
  asm-generic: barrier: Add smp_cond_load_acquire_timewait()
  arm64: barrier: Support waiting in smp_cond_load_relaxed_timewait()
  arm64: barrier: Handle waiting in smp_cond_load_relaxed_timewait()

 arch/arm64/include/asm/barrier.h    |  54 +++++++++++
 arch/arm64/include/asm/rqspinlock.h |   2 +-
 include/asm-generic/barrier.h       | 137 ++++++++++++++++++++++++++++
 3 files changed, 192 insertions(+), 1 deletion(-)

-- 
2.43.5




* [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-06-27  4:48 [PATCH v3 0/5] barrier: Add smp_cond_load_*_timewait() Ankur Arora
@ 2025-06-27  4:48 ` Ankur Arora
  2025-08-08 10:51   ` Catalin Marinas
  2025-06-27  4:48 ` [PATCH v3 2/5] asm-generic: barrier: Handle spin-wait in smp_cond_load_relaxed_timewait() Ankur Arora
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Ankur Arora @ 2025-06-27  4:48 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, memxor, zhenglifeng1, xueshuai, joao.m.martins,
	boris.ostrovsky, konrad.wilk

Add smp_cond_load_relaxed_timewait(), which extends
smp_cond_load_relaxed() to allow waiting for a finite duration.

Additional parameters allow for timeout checks and a measure of
how much slack the caller can tolerate in the timeout.

The waiting is done via the usual cpu_relax() spin-wait around the
condition variable, with periodic evaluation of the time-check, and,
optionally, via architectural primitives that provide cheaper
mechanisms such as waiting on a cacheline with an out-of-band timeout.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/asm-generic/barrier.h | 95 +++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..d33c2701c9ee 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,101 @@ do {									\
 })
 #endif
 
+#ifndef SMP_TIMEWAIT_SPIN_BASE
+#define SMP_TIMEWAIT_SPIN_BASE		16
+#endif
+
+/*
+ * Policy handler that adjusts the number of times we spin or
+ * wait for cacheline to change before evaluating the time-expr.
+ *
+ * The generic version only supports spinning.
+ */
+static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
+				       u32 *spin, bool *wait, u64 slack)
+{
+	if (now >= end)
+		return 0;
+
+	*spin = SMP_TIMEWAIT_SPIN_BASE;
+	*wait = false;
+	return now;
+}
+
+#ifndef __smp_cond_policy
+#define __smp_cond_policy ___smp_cond_spinwait
+#endif
+
+/*
+ * Non-spin primitive that allows waiting for stores to an address,
+ * with support for a timeout. This works in conjunction with an
+ * architecturally defined policy.
+ */
+#ifndef __smp_timewait_store
+#define __smp_timewait_store(ptr, val)	do { } while (0)
+#endif
+
+#ifndef __smp_cond_load_relaxed_timewait
+#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,	\
+					 time_expr, time_end,		\
+					 slack) ({			\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;			\
+	u64 __prev = 0, __end = (time_end);				\
+	u64 __slack = slack;						\
+	bool __wait = false;						\
+									\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		cpu_relax();						\
+		if (++__n < __spin)					\
+			continue;					\
+		if (!(__prev = policy((time_expr), __prev, __end,	\
+					  &__spin, &__wait, __slack)))	\
+			break;						\
+		if (__wait)						\
+			__smp_timewait_store(__PTR, VAL);		\
+		__n = 0;						\
+	}								\
+	(typeof(*ptr))VAL;						\
+})
+#endif
+
+#define __check_time_types(type, a, b)			\
+		(__same_type(typeof(a), type) &&	\
+		 __same_type(typeof(b), type))
+
+/**
+ * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
+ * guarantees until a timeout expires.
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ * @time_expr: monotonic expression that evaluates to the current time
+ * @time_end: end time, compared against time_expr
+ * @slack: how much timer overshoot can the caller tolerate?
+ * Useful for when we go into wait states. A value of 0 indicates a high
+ * tolerance.
+ *
+ * Note that all times (time_expr, time_end, and slack) are in microseconds,
+ * with no mandated precision.
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ */
+#define smp_cond_load_relaxed_timewait(ptr, cond_expr, time_expr,	\
+				       time_end, slack) ({		\
+	__unqual_scalar_typeof(*ptr) _val;				\
+	BUILD_BUG_ON_MSG(!__check_time_types(u64, time_expr, time_end),	\
+			 "incompatible time units");			\
+	_val = __smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
+						__smp_cond_policy,	\
+						time_expr, time_end,	\
+						slack);			\
+	(typeof(*ptr))_val;						\
+})
+
 /*
  * pmem_wmb() ensures that all stores for which the modification
  * are written to persistent storage by preceding instructions have
-- 
2.43.5




* [PATCH v3 2/5] asm-generic: barrier: Handle spin-wait in smp_cond_load_relaxed_timewait()
  2025-06-27  4:48 [PATCH v3 0/5] barrier: Add smp_cond_load_*_timewait() Ankur Arora
  2025-06-27  4:48 ` [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
@ 2025-06-27  4:48 ` Ankur Arora
  2025-06-27  4:48 ` [PATCH v3 3/5] asm-generic: barrier: Add smp_cond_load_acquire_timewait() Ankur Arora
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Ankur Arora @ 2025-06-27  4:48 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, memxor, zhenglifeng1, xueshuai, joao.m.martins,
	boris.ostrovsky, konrad.wilk

smp_cond_load_relaxed_timewait() waits on a condition variable,
while watching the clock.

The generic code presents the simple case where the waiting is done
via a cpu_relax() spin-wait loop. To keep the pipeline as idle as
possible, we want to do the relatively expensive time check only
intermittently.

Add ___smp_cond_spinwait() which handles adjustments to the spin-count
based on the deadline.
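
As a concrete illustration: starting from the base spin-count of 16, the
handler either doubles the count (16 -> 32) when the previous time-check
interval came in under the target derived from the remaining time, or
scales it back by a quarter (16 -> 12) otherwise, so the frequency of
the time check tracks the deadline instead of using a fixed constant.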

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/asm-generic/barrier.h | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d33c2701c9ee..8299c57d1110 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -15,6 +15,7 @@
 
 #include <linux/compiler.h>
 #include <linux/kcsan-checks.h>
+#include <linux/minmax.h>
 #include <asm/rwonce.h>
 
 #ifndef nop
@@ -286,11 +287,30 @@ do {									\
 static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
 				       u32 *spin, bool *wait, u64 slack)
 {
+	u64 time_check;
+	u64 remaining = end - now;
+
 	if (now >= end)
 		return 0;
-
-	*spin = SMP_TIMEWAIT_SPIN_BASE;
+	/*
+	 * Use a floor spin-count as it might be artificially low if we are
+	 * transitioning from wait to spin, or because we got interrupted.
+	 */
+	*spin = max(*spin, SMP_TIMEWAIT_SPIN_BASE);
 	*wait = false;
+
+	/*
+	 * We will map the time_check interval to the spin-count by scaling
+	 * based on the previous time-check interval. This is imprecise, so
+	 * use a safety margin.
+	 */
+	time_check = min(remaining/4, 1UL);
+
+	if ((now - prev) < time_check)
+		*spin <<= 1;
+	else
+		*spin = ((*spin >> 1) + (*spin >> 2));
+
 	return now;
 }
 
-- 
2.43.5




* [PATCH v3 3/5] asm-generic: barrier: Add smp_cond_load_acquire_timewait()
  2025-06-27  4:48 [PATCH v3 0/5] barrier: Add smp_cond_load_*_timewait() Ankur Arora
  2025-06-27  4:48 ` [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
  2025-06-27  4:48 ` [PATCH v3 2/5] asm-generic: barrier: Handle spin-wait in smp_cond_load_relaxed_timewait() Ankur Arora
@ 2025-06-27  4:48 ` Ankur Arora
  2025-08-08  9:38   ` Catalin Marinas
  2025-06-27  4:48 ` [PATCH v3 4/5] arm64: barrier: Support waiting in smp_cond_load_relaxed_timewait() Ankur Arora
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Ankur Arora @ 2025-06-27  4:48 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, memxor, zhenglifeng1, xueshuai, joao.m.martins,
	boris.ostrovsky, konrad.wilk

Add the acquire variant of smp_cond_load_relaxed_timewait(). This
reuses the relaxed variant, with an additional LOAD->LOAD
ordering via smp_acquire__after_ctrl_dep().

Also, the rqspinlock implementation has a locally cached copy of
this interface. Fixup the parameters used there.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/rqspinlock.h |  2 +-
 include/asm-generic/barrier.h       | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/rqspinlock.h b/arch/arm64/include/asm/rqspinlock.h
index 9ea0a74e5892..f1b6a428013e 100644
--- a/arch/arm64/include/asm/rqspinlock.h
+++ b/arch/arm64/include/asm/rqspinlock.h
@@ -86,7 +86,7 @@
 
 #endif
 
-#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire_timewait(v, c, 0, 1)
+#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire_timewait(v, c, 0ULL, 1ULL, 0)
 
 #include <asm-generic/rqspinlock.h>
 
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 8299c57d1110..dd7c9ca2dff3 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -388,6 +388,28 @@ static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
 	(typeof(*ptr))_val;						\
 })
 
+/**
+ * smp_cond_load_acquire_timewait() - (Spin) wait for cond with ACQUIRE ordering
+ * until a timeout expires.
+ *
+ * Arguments: same as smp_cond_load_relaxed_timeout().
+ *
+ * Equivalent to using smp_cond_load_acquire() on the condition variable with
+ * a timeout.
+ */
+#ifndef smp_cond_load_acquire_timewait
+#define smp_cond_load_acquire_timewait(ptr, cond_expr,			\
+				       time_expr, time_end,		\
+				       slack) ({			\
+	__unqual_scalar_typeof(*ptr) _val;				\
+	_val = smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
+					      time_expr, time_end,	\
+					      slack);			\
+	/* Depends on the control dependency of the wait above. */	\
+	smp_acquire__after_ctrl_dep();					\
+	(typeof(*ptr))_val;						\
+})
+#endif
 /*
  * pmem_wmb() ensures that all stores for which the modification
  * are written to persistent storage by preceding instructions have
-- 
2.43.5




* [PATCH v3 4/5] arm64: barrier: Support waiting in smp_cond_load_relaxed_timewait()
  2025-06-27  4:48 [PATCH v3 0/5] barrier: Add smp_cond_load_*_timewait() Ankur Arora
                   ` (2 preceding siblings ...)
  2025-06-27  4:48 ` [PATCH v3 3/5] asm-generic: barrier: Add smp_cond_load_acquire_timewait() Ankur Arora
@ 2025-06-27  4:48 ` Ankur Arora
  2025-06-27  4:48 ` [PATCH v3 5/5] arm64: barrier: Handle " Ankur Arora
  2025-07-28 19:03 ` [PATCH v3 0/5] barrier: Add smp_cond_load_*_timewait() Ankur Arora
  5 siblings, 0 replies; 26+ messages in thread
From: Ankur Arora @ 2025-06-27  4:48 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, memxor, zhenglifeng1, xueshuai, joao.m.martins,
	boris.ostrovsky, konrad.wilk

Define __smp_timewait_store() to support waiting for a condition
variable via __cmpwait_relaxed().

Used from smp_cond_load_{relaxed,acquire}_timewait().

Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/barrier.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 1ca947d5c939..7c56e2621c7d 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -216,6 +216,12 @@ do {									\
 	(typeof(*ptr))VAL;						\
 })
 
+/*
+ * Support waiting in smp_cond_load_{relaxed,acquire}_timewait().
+ */
+#define __smp_timewait_store(ptr, val)					\
+		__cmpwait_relaxed(ptr, val)
+
 #include <asm-generic/barrier.h>
 
 #endif	/* __ASSEMBLY__ */
-- 
2.43.5




* [PATCH v3 5/5] arm64: barrier: Handle waiting in smp_cond_load_relaxed_timewait()
  2025-06-27  4:48 [PATCH v3 0/5] barrier: Add smp_cond_load_*_timewait() Ankur Arora
                   ` (3 preceding siblings ...)
  2025-06-27  4:48 ` [PATCH v3 4/5] arm64: barrier: Support waiting in smp_cond_load_relaxed_timewait() Ankur Arora
@ 2025-06-27  4:48 ` Ankur Arora
  2025-06-30 16:33   ` Christoph Lameter (Ampere)
  2025-07-28 19:03 ` [PATCH v3 0/5] barrier: Add smp_cond_load_*_timewait() Ankur Arora
  5 siblings, 1 reply; 26+ messages in thread
From: Ankur Arora @ 2025-06-27  4:48 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, memxor, zhenglifeng1, xueshuai, joao.m.martins,
	boris.ostrovsky, konrad.wilk

smp_cond_load_{relaxed,acquire}_timewait() wait on a condition variable
until a timeout expires. This waiting is some mix of spinning while
dereferencing an address, and waiting in a WFE for a store event or
periodic events from the event-stream.

Handle the waiting part of the policy in ___smp_cond_timewait() while
offloading the spinning to the generic ___smp_cond_spinwait().

To minimize time spent spinning when the user can tolerate a large
overshoot, choose SMP_TIMEWAIT_DEFAULT_US to be the event-stream
period.

This would result in a worst case delay of ARCH_TIMER_EVT_STREAM_PERIOD_US.

Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/barrier.h | 48 ++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 7c56e2621c7d..a1367f2901f0 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -10,6 +10,7 @@
 #ifndef __ASSEMBLY__
 
 #include <linux/kasan-checks.h>
+#include <linux/minmax.h>
 
 #include <asm/alternative-macros.h>
 
@@ -222,6 +223,53 @@ do {									\
 #define __smp_timewait_store(ptr, val)					\
 		__cmpwait_relaxed(ptr, val)
 
+/*
+ * Redefine ARCH_TIMER_EVT_STREAM_PERIOD_US locally to avoid include hell.
+ */
+#define __ARCH_TIMER_EVT_STREAM_PERIOD_US 100UL
+extern bool arch_timer_evtstrm_available(void);
+
+static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
+				       u32 *spin, bool *wait, u64 slack);
+/*
+ * To minimize time spent spinning, we want to allow a large overshoot.
+ * So, choose a default slack value of the event-stream period.
+ */
+#define SMP_TIMEWAIT_DEFAULT_US __ARCH_TIMER_EVT_STREAM_PERIOD_US
+
+static inline u64 ___smp_cond_timewait(u64 now, u64 prev, u64 end,
+				       u32 *spin, bool *wait, u64 slack)
+{
+	bool wfet = alternative_has_cap_unlikely(ARM64_HAS_WFXT);
+	bool wfe, ev = arch_timer_evtstrm_available();
+	u64 evt_period = __ARCH_TIMER_EVT_STREAM_PERIOD_US;
+	u64 remaining = end - now;
+
+	if (now >= end)
+		return 0;
+	/*
+	 * Use WFE if there's enough slack to get an event-stream wakeup even
+	 * if we don't come out of the WFE due to natural causes.
+	 */
+	wfe = ev && ((remaining + slack) > evt_period);
+
+	if (wfe || wfet) {
+		*wait = true;
+		*spin = 0;
+		return now;
+	}
+
+	/*
+	 * The time remaining is shorter than our wait granularity. Let
+	 * the generic spinwait policy determine how to spin.
+	 */
+	return ___smp_cond_spinwait(now, prev, end, spin, wait, slack);
+}
+
+#ifndef __smp_cond_policy
+#define __smp_cond_policy ___smp_cond_timewait
+#endif
+
 #include <asm-generic/barrier.h>
 
 #endif	/* __ASSEMBLY__ */
-- 
2.43.5




* Re: [PATCH v3 5/5] arm64: barrier: Handle waiting in smp_cond_load_relaxed_timewait()
  2025-06-27  4:48 ` [PATCH v3 5/5] arm64: barrier: Handle " Ankur Arora
@ 2025-06-30 16:33   ` Christoph Lameter (Ampere)
  2025-06-30 21:05     ` Ankur Arora
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-06-30 16:33 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, bpf, arnd,
	catalin.marinas, will, peterz, akpm, mark.rutland, harisokn, ast,
	memxor, zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
	konrad.wilk

On Thu, 26 Jun 2025, Ankur Arora wrote:

> @@ -222,6 +223,53 @@ do {									\
>  #define __smp_timewait_store(ptr, val)					\
>  		__cmpwait_relaxed(ptr, val)
>
> +/*
> + * Redefine ARCH_TIMER_EVT_STREAM_PERIOD_US locally to avoid include hell.
> + */
> +#define __ARCH_TIMER_EVT_STREAM_PERIOD_US 100UL
> +extern bool arch_timer_evtstrm_available(void);
> +
> +static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
> +				       u32 *spin, bool *wait, u64 slack);
> +/*
> + * To minimize time spent spinning, we want to allow a large overshoot.
> + * So, choose a default slack value of the event-stream period.
> + */
> +#define SMP_TIMEWAIT_DEFAULT_US __ARCH_TIMER_EVT_STREAM_PERIOD_US
> +
> +static inline u64 ___smp_cond_timewait(u64 now, u64 prev, u64 end,
> +				       u32 *spin, bool *wait, u64 slack)
> +{
> +	bool wfet = alternative_has_cap_unlikely(ARM64_HAS_WFXT);
> +	bool wfe, ev = arch_timer_evtstrm_available();

An uninitialized and an initialized variable on the same line. Maybe separate
that. Looks confusing and unusual to me.

> +	u64 evt_period = __ARCH_TIMER_EVT_STREAM_PERIOD_US;
> +	u64 remaining = end - now;
> +
> +	if (now >= end)
> +		return 0;
> +	/*
> +	 * Use WFE if there's enough slack to get an event-stream wakeup even
> +	 * if we don't come out of the WFE due to natural causes.
> +	 */
> +	wfe = ev && ((remaining + slack) > evt_period);

The line above does not matter for the wfet case and the calculation is
ignored. We hope that in the future wfet will be the default case.



* Re: [PATCH v3 5/5] arm64: barrier: Handle waiting in smp_cond_load_relaxed_timewait()
  2025-06-30 16:33   ` Christoph Lameter (Ampere)
@ 2025-06-30 21:05     ` Ankur Arora
  2025-07-01  5:55       ` Ankur Arora
  0 siblings, 1 reply; 26+ messages in thread
From: Ankur Arora @ 2025-06-30 21:05 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, bpf,
	arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	ast, memxor, zhenglifeng1, xueshuai, joao.m.martins,
	boris.ostrovsky, konrad.wilk


Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Thu, 26 Jun 2025, Ankur Arora wrote:
>
>> @@ -222,6 +223,53 @@ do {									\
>>  #define __smp_timewait_store(ptr, val)					\
>>  		__cmpwait_relaxed(ptr, val)
>>
>> +/*
>> + * Redefine ARCH_TIMER_EVT_STREAM_PERIOD_US locally to avoid include hell.
>> + */
>> +#define __ARCH_TIMER_EVT_STREAM_PERIOD_US 100UL
>> +extern bool arch_timer_evtstrm_available(void);
>> +
>> +static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
>> +				       u32 *spin, bool *wait, u64 slack);
>> +/*
>> + * To minimize time spent spinning, we want to allow a large overshoot.
>> + * So, choose a default slack value of the event-stream period.
>> + */
>> +#define SMP_TIMEWAIT_DEFAULT_US __ARCH_TIMER_EVT_STREAM_PERIOD_US
>> +
>> +static inline u64 ___smp_cond_timewait(u64 now, u64 prev, u64 end,
>> +				       u32 *spin, bool *wait, u64 slack)
>> +{
>> +	bool wfet = alternative_has_cap_unlikely(ARM64_HAS_WFXT);
>> +	bool wfe, ev = arch_timer_evtstrm_available();
>
> An uninitialized and an initialized variable on the same line. Maybe separate
> that. Looks confusing and unusual to me.

Yeah, that makes sense. Will fix.

>> +	u64 evt_period = __ARCH_TIMER_EVT_STREAM_PERIOD_US;
>> +	u64 remaining = end - now;
>> +
>> +	if (now >= end)
>> +		return 0;
>> +	/*
>> +	 * Use WFE if there's enough slack to get an event-stream wakeup even
>> +	 * if we don't come out of the WFE due to natural causes.
>> +	 */
>> +	wfe = ev && ((remaining + slack) > evt_period);
>
> The line above does not matter for the wfet case and the calculation is
> ignored. We hope that in the future wfet will be the default case.

My assumption was that the compiler would only evaluate the evt_period
comparison if the wfet check evaluates false and it does need to check
if wfe is true or not.

But let me look at the generated code.

Thanks for the comments.

--
ankur



* Re: [PATCH v3 5/5] arm64: barrier: Handle waiting in smp_cond_load_relaxed_timewait()
  2025-06-30 21:05     ` Ankur Arora
@ 2025-07-01  5:55       ` Ankur Arora
  0 siblings, 0 replies; 26+ messages in thread
From: Ankur Arora @ 2025-07-01  5:55 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Christoph Lameter (Ampere), linux-kernel, linux-arch,
	linux-arm-kernel, bpf, arnd, catalin.marinas, will, peterz, akpm,
	mark.rutland, harisokn, ast, memxor, zhenglifeng1, xueshuai,
	joao.m.martins, boris.ostrovsky, konrad.wilk


Ankur Arora <ankur.a.arora@oracle.com> writes:

> Christoph Lameter (Ampere) <cl@gentwo.org> writes:
>
>> On Thu, 26 Jun 2025, Ankur Arora wrote:
>>
>>> @@ -222,6 +223,53 @@ do {									\
>>>  #define __smp_timewait_store(ptr, val)					\
>>>  		__cmpwait_relaxed(ptr, val)
>>>
>>> +/*
>>> + * Redefine ARCH_TIMER_EVT_STREAM_PERIOD_US locally to avoid include hell.
>>> + */
>>> +#define __ARCH_TIMER_EVT_STREAM_PERIOD_US 100UL
>>> +extern bool arch_timer_evtstrm_available(void);
>>> +
>>> +static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
>>> +				       u32 *spin, bool *wait, u64 slack);
>>> +/*
>>> + * To minimize time spent spinning, we want to allow a large overshoot.
>>> + * So, choose a default slack value of the event-stream period.
>>> + */
>>> +#define SMP_TIMEWAIT_DEFAULT_US __ARCH_TIMER_EVT_STREAM_PERIOD_US
>>> +
>>> +static inline u64 ___smp_cond_timewait(u64 now, u64 prev, u64 end,
>>> +				       u32 *spin, bool *wait, u64 slack)
>>> +{
>>> +	bool wfet = alternative_has_cap_unlikely(ARM64_HAS_WFXT);
>>> +	bool wfe, ev = arch_timer_evtstrm_available();
>>
>> An uninitialized and an initialized variable on the same line. Maybe separate
>> that. Looks confusing and unusual to me.
>
> Yeah, that makes sense. Will fix.
>
>>> +	u64 evt_period = __ARCH_TIMER_EVT_STREAM_PERIOD_US;
>>> +	u64 remaining = end - now;
>>> +
>>> +	if (now >= end)
>>> +		return 0;
>>> +	/*
>>> +	 * Use WFE if there's enough slack to get an event-stream wakeup even
>>> +	 * if we don't come out of the WFE due to natural causes.
>>> +	 */
>>> +	wfe = ev && ((remaining + slack) > evt_period);
>>
>> The line above does not matter for the wfet case and the calculation is
>> ignored. We hope that in the future wfet will be the default case.
>
> My assumption was that the compiler would only evaluate the evt_period
> comparison if the wfet check evaluates false and it does need to check
> if wfe is true or not.
>
> But let me look at the generated code.

So, I was too optimistic. The compiler doesn't do this optimization at
all. I'm guessing that's because of at least two reasons:

The wfet case is unlikely:

   bool wfet = alternative_has_cap_unlikely(ARM64_HAS_WFXT);

And, I'm testing for:

   if (wfe || wfet)

This last check should have been "if (wfet || wfe)" since the first
clause is pretty much free.

But even with that fixed, I don't think the compiler will do the right thing.

I could condition the computation on arch_timer_evtstrm_available(), but I
would prefer to keep this code straightforward since it will only be
evaluated infrequently.

But, let me see what I can do.
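
One possible shape (an untested sketch) that both reorders the test and
conditions the event-stream arithmetic on its availability:

	if (wfet ||
	    (arch_timer_evtstrm_available() &&
	     (remaining + slack) > __ARCH_TIMER_EVT_STREAM_PERIOD_US)) {
		*wait = true;
		*spin = 0;
		return now;
	}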

--
ankur



* Re: [PATCH v3 0/5] barrier: Add smp_cond_load_*_timewait()
  2025-06-27  4:48 [PATCH v3 0/5] barrier: Add smp_cond_load_*_timewait() Ankur Arora
                   ` (4 preceding siblings ...)
  2025-06-27  4:48 ` [PATCH v3 5/5] arm64: barrier: Handle " Ankur Arora
@ 2025-07-28 19:03 ` Ankur Arora
  5 siblings, 0 replies; 26+ messages in thread
From: Ankur Arora @ 2025-07-28 19:03 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, bpf, arnd,
	catalin.marinas, will, peterz, akpm, mark.rutland, harisokn, cl,
	ast, memxor, zhenglifeng1, xueshuai, joao.m.martins,
	boris.ostrovsky, konrad.wilk


Gentle ping for review.

Ankur

Ankur Arora <ankur.a.arora@oracle.com> writes:

> Hi,
>
> This series adds waited variants of the smp_cond_load() primitives:
> smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
>
> Why?: as the name suggests, the new interfaces are meant for contexts
> where you want to wait on a condition variable for a finite duration.
> This is easy enough to do with a loop around cpu_relax(). However,
> some architectures (ex. arm64) also allow waiting on a cacheline. So,
> these interfaces handle a mixture of spin/wait with a smp_cond_load()
> thrown in.
>
> There are two known users for these interfaces:
>
>  - poll_idle() [1]
>  - resilient queued spinlocks [2]
>
> The interfaces are:
>    smp_cond_load_relaxed_timewait(ptr, cond_expr,
>                                   time_expr, time_end, slack)
>    smp_cond_load_acquire_timewait(ptr, cond_expr,
>                                   time_expr, time_end, slack)
>
> The added parameters pertain to the timeout checks and a measure of how
> much slack the caller can tolerate in the timeout. The slack matters
> mainly in the wait state, where wakeup depends on an asynchronous event.
>
> Changelog:
>   v2 [3]:
>     - simplified the interface (suggested by Catalin Marinas)
>        - get rid of wait_policy, and a multitude of constants
>        - adds a slack parameter
>       This helped remove a fair amount of code duplication and, in
>       hindsight, unnecessary constants.
>
>   v1 [4]:
>      - add wait_policy (coarse and fine)
>      - derive spin-count etc at runtime instead of using arbitrary
>        constants.
>
> Haris Okanovic had tested an earlier version of this series with
> poll_idle()/haltpoll patches. [5]
>
> Any comments appreciated!
>
> Ankur
>
> [1] https://lore.kernel.org/lkml/20241107190818.522639-3-ankur.a.arora@oracle.com/
> [2] Uses the smp_cond_load_acquire_timewait() from v1
>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/rqspinlock.h
> [3] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/
> [4] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
> [5] https://lore.kernel.org/lkml/f2f5d09e79539754ced085ed89865787fa668695.camel@amazon.com
>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: linux-arch@vger.kernel.org
>
> Ankur Arora (5):
>   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
>   asm-generic: barrier: Handle spin-wait in
>     smp_cond_load_relaxed_timewait()
>   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
>   arm64: barrier: Support waiting in smp_cond_load_relaxed_timewait()
>   arm64: barrier: Handle waiting in smp_cond_load_relaxed_timewait()
>
>  arch/arm64/include/asm/barrier.h    |  54 +++++++++++
>  arch/arm64/include/asm/rqspinlock.h |   2 +-
>  include/asm-generic/barrier.h       | 137 ++++++++++++++++++++++++++++
>  3 files changed, 192 insertions(+), 1 deletion(-)


--
ankur



* Re: [PATCH v3 3/5] asm-generic: barrier: Add smp_cond_load_acquire_timewait()
  2025-06-27  4:48 ` [PATCH v3 3/5] asm-generic: barrier: Add smp_cond_load_acquire_timewait() Ankur Arora
@ 2025-08-08  9:38   ` Catalin Marinas
  2025-08-12  5:18     ` Ankur Arora
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2025-08-08  9:38 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, bpf, arnd, will,
	peterz, akpm, mark.rutland, harisokn, cl, ast, memxor,
	zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
	konrad.wilk

On Thu, Jun 26, 2025 at 09:48:03PM -0700, Ankur Arora wrote:
> diff --git a/arch/arm64/include/asm/rqspinlock.h b/arch/arm64/include/asm/rqspinlock.h
> index 9ea0a74e5892..f1b6a428013e 100644
> --- a/arch/arm64/include/asm/rqspinlock.h
> +++ b/arch/arm64/include/asm/rqspinlock.h
> @@ -86,7 +86,7 @@
>  
>  #endif
>  
> -#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire_timewait(v, c, 0, 1)
> +#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire_timewait(v, c, 0ULL, 1ULL, 0)
>  
>  #include <asm-generic/rqspinlock.h>
>  
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index 8299c57d1110..dd7c9ca2dff3 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -388,6 +388,28 @@ static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
>  	(typeof(*ptr))_val;						\
>  })
>  
> +/**
> + * smp_cond_load_acquire_timewait() - (Spin) wait for cond with ACQUIRE ordering
> + * until a timeout expires.
> + *
> + * Arguments: same as smp_cond_load_relaxed_timeout().
> + *
> + * Equivalent to using smp_cond_load_acquire() on the condition variable with
> + * a timeout.
> + */
> +#ifndef smp_cond_load_acquire_timewait
> +#define smp_cond_load_acquire_timewait(ptr, cond_expr,			\
> +				       time_expr, time_end,		\
> +				       slack) ({			\
> +	__unqual_scalar_typeof(*ptr) _val;				\
> +	_val = smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
> +					      time_expr, time_end,	\
> +					      slack);			\
> +	/* Depends on the control dependency of the wait above. */	\
> +	smp_acquire__after_ctrl_dep();					\
> +	(typeof(*ptr))_val;						\
> +})
> +#endif

Using #ifndef in the generic file is the correct thing to do, it allows
architectures to redefine it. Why we have a similar #ifndef in the arm64
rqspinlock.h, no idea, none of the arm64 maintainers acked that patch
(shouldn't have gone in really, we were still discussing the
implementation at the time; I also think it's slightly wrong).

Your change above to rqspinlock.h makes this even more confusing when
you look at the overall result with all the patches applied. We end up
with the same macro in asm/rqspinlock.h but with different number of
arguments.

I'd start with ripping out the current arm64 implementation, add a
generic implementation to barrier.h and then override it in the arch
code.

-- 
Catalin



* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-06-27  4:48 ` [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
@ 2025-08-08 10:51   ` Catalin Marinas
  2025-08-11 21:15     ` Ankur Arora
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2025-08-08 10:51 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, bpf, arnd, will,
	peterz, akpm, mark.rutland, harisokn, cl, ast, memxor,
	zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
	konrad.wilk

On Thu, Jun 26, 2025 at 09:48:01PM -0700, Ankur Arora wrote:
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..d33c2701c9ee 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -273,6 +273,101 @@ do {									\
>  })
>  #endif
>  
> +#ifndef SMP_TIMEWAIT_SPIN_BASE
> +#define SMP_TIMEWAIT_SPIN_BASE		16
> +#endif
> +
> +/*
> + * Policy handler that adjusts the number of times we spin or
> + * wait for cacheline to change before evaluating the time-expr.
> + *
> + * The generic version only supports spinning.
> + */
> +static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
> +				       u32 *spin, bool *wait, u64 slack)
> +{
> +	if (now >= end)
> +		return 0;
> +
> +	*spin = SMP_TIMEWAIT_SPIN_BASE;
> +	*wait = false;
> +	return now;
> +}
> +
> +#ifndef __smp_cond_policy
> +#define __smp_cond_policy ___smp_cond_spinwait
> +#endif
> +
> +/*
> + * Non-spin primitive that allows waiting for stores to an address,
> + * with support for a timeout. This works in conjunction with an
> + * architecturally defined policy.
> + */
> +#ifndef __smp_timewait_store
> +#define __smp_timewait_store(ptr, val)	do { } while (0)
> +#endif
> +
> +#ifndef __smp_cond_load_relaxed_timewait
> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,	\
> +					 time_expr, time_end,		\
> +					 slack) ({			\
> +	typeof(ptr) __PTR = (ptr);					\
> +	__unqual_scalar_typeof(*ptr) VAL;				\
> +	u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;			\
> +	u64 __prev = 0, __end = (time_end);				\
> +	u64 __slack = slack;						\
> +	bool __wait = false;						\
> +									\
> +	for (;;) {							\
> +		VAL = READ_ONCE(*__PTR);				\
> +		if (cond_expr)						\
> +			break;						\
> +		cpu_relax();						\
> +		if (++__n < __spin)					\
> +			continue;					\
> +		if (!(__prev = policy((time_expr), __prev, __end,	\
> +					  &__spin, &__wait, __slack)))	\
> +			break;						\
> +		if (__wait)						\
> +			__smp_timewait_store(__PTR, VAL);		\
> +		__n = 0;						\
> +	}								\
> +	(typeof(*ptr))VAL;						\
> +})
> +#endif

TBH, this still looks over-engineered to me, especially with the second
patch trying to reduce the spin loops based on the remaining time. Does
any of the current users of this interface need it to get more precise?

Also I feel the spinning added to poll_idle() is more of an architecture
choice as some CPUs could not cope with local_clock() being called too
frequently. The above generic implementation takes a spin into
consideration even if an arch implementation doesn't need it (e.g. WFET
or WFE). Yes, the arch policy could set a spin of 0 but it feels overly
complicated for the generic implementation.

Can we instead have the generic implementation without any spinning?
Just polling a variable with cpu_relax() like
smp_cond_load_acquire/relaxed() with the additional check for time. We
redefine it in the arch code.
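
Something along these lines, modulo the usual macro plumbing (a rough
sketch only):

	for (;;) {
		VAL = READ_ONCE(*__PTR);
		if (cond_expr)
			break;
		cpu_relax();
		if ((time_expr) >= (time_end))
			break;
	}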

> +#define __check_time_types(type, a, b)			\
> +		(__same_type(typeof(a), type) &&	\
> +		 __same_type(typeof(b), type))
> +
> +/**
> + * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
> + * guarantees until a timeout expires.
> + * @ptr: pointer to the variable to wait on
> + * @cond: boolean expression to wait for
> + * @time_expr: monotonic expression that evaluates to the current time
> + * @time_end: end time, compared against time_expr
> + * @slack: how much timer overshoot can the caller tolerate?
> + * Useful for when we go into wait states. A value of 0 indicates a high
> + * tolerance.
> + *
> + * Note that all times (time_expr, time_end, and slack) are in microseconds,
> + * with no mandated precision.
> + *
> + * Equivalent to using READ_ONCE() on the condition variable.
> + */
> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr, time_expr,	\
> +				       time_end, slack) ({		\
> +	__unqual_scalar_typeof(*ptr) _val;				\
> +	BUILD_BUG_ON_MSG(!__check_time_types(u64, time_expr, time_end),	\
> +			 "incompatible time units");			\
> +	_val = __smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
> +						__smp_cond_policy,	\
> +						time_expr, time_end,	\
> +						slack);			\
> +	(typeof(*ptr))_val;						\
> +})

Looking at the current user of the acquire variant (rqspinlock), it does
not even bother with a time_expr but rather adds the time condition to
cond_expr. I don't think it has any "slack" requirements, only that
there's no deadlock eventually.

About poll_idle(), are there any slack requirements, or can we get away
without one?

I think we have two ways forward (well, at least):

1. Clearly define what time_end is and we won't need a time_expr at all.
   This may work for poll_idle(), not sure about rqspinlock. The
   advantage is that we can drop the 'slack' argument since none of the
   current users seem to need it. The downside is that we need to know
   exactly what this time_end is to convert it to timer cycles for a
   WFET implementation on arm64.

2. Drop time_end and only leave time_expr as a bool (we don't care
   whether it uses ns, jiffies or whatever underneath, it's just a
   bool). In this case, we could use a 'slack' argument mostly to make a
   decision on whether we use WFET, WFE or just polling with
   cpu_relax(). For WFET, the wait time would be based on the slack
   value rather than some absolute end time which we won't have.

I'd go with (2), it looks simpler. Maybe even drop the 'slack' argument
for the time being until we have a clear user. The fallback on arm64
would be from wfe (if event streaming available), wfet with the same
period as the event stream (in the absence of a slack argument) or
cpu_relax().
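
With (2) and no slack argument, a call site would reduce to something
like this (a sketch; FLAG, now_ns() and deadline_ns are placeholders):

	val = smp_cond_load_relaxed_timewait(ptr, VAL & FLAG,
					     now_ns() < deadline_ns);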

-- 
Catalin



* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-08 10:51   ` Catalin Marinas
@ 2025-08-11 21:15     ` Ankur Arora
  2025-08-13 16:09       ` Catalin Marinas
  0 siblings, 1 reply; 26+ messages in thread
From: Ankur Arora @ 2025-08-11 21:15 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, bpf,
	arnd, will, peterz, akpm, mark.rutland, harisokn, cl, ast, memxor,
	zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
	konrad.wilk, rafael, daniel.lezcano


[ Added Rafael, Daniel. ]

Catalin Marinas <catalin.marinas@arm.com> writes:

> On Thu, Jun 26, 2025 at 09:48:01PM -0700, Ankur Arora wrote:
>> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
>> index d4f581c1e21d..d33c2701c9ee 100644
>> --- a/include/asm-generic/barrier.h
>> +++ b/include/asm-generic/barrier.h
>> @@ -273,6 +273,101 @@ do {									\
>>  })
>>  #endif
>>
>> +#ifndef SMP_TIMEWAIT_SPIN_BASE
>> +#define SMP_TIMEWAIT_SPIN_BASE		16
>> +#endif
>> +
>> +/*
>> + * Policy handler that adjusts the number of times we spin or
>> + * wait for cacheline to change before evaluating the time-expr.
>> + *
>> + * The generic version only supports spinning.
>> + */
>> +static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
>> +				       u32 *spin, bool *wait, u64 slack)
>> +{
>> +	if (now >= end)
>> +		return 0;
>> +
>> +	*spin = SMP_TIMEWAIT_SPIN_BASE;
>> +	*wait = false;
>> +	return now;
>> +}
>> +
>> +#ifndef __smp_cond_policy
>> +#define __smp_cond_policy ___smp_cond_spinwait
>> +#endif
>> +
>> +/*
>> + * Non-spin primitive that allows waiting for stores to an address,
>> + * with support for a timeout. This works in conjunction with an
>> + * architecturally defined policy.
>> + */
>> +#ifndef __smp_timewait_store
>> +#define __smp_timewait_store(ptr, val)	do { } while (0)
>> +#endif
>> +
>> +#ifndef __smp_cond_load_relaxed_timewait
>> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,	\
>> +					 time_expr, time_end,		\
>> +					 slack) ({			\
>> +	typeof(ptr) __PTR = (ptr);					\
>> +	__unqual_scalar_typeof(*ptr) VAL;				\
>> +	u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;			\
>> +	u64 __prev = 0, __end = (time_end);				\
>> +	u64 __slack = slack;						\
>> +	bool __wait = false;						\
>> +									\
>> +	for (;;) {							\
>> +		VAL = READ_ONCE(*__PTR);				\
>> +		if (cond_expr)						\
>> +			break;						\
>> +		cpu_relax();						\
>> +		if (++__n < __spin)					\
>> +			continue;					\
>> +		if (!(__prev = policy((time_expr), __prev, __end,	\
>> +					  &__spin, &__wait, __slack)))	\
>> +			break;						\
>> +		if (__wait)						\
>> +			__smp_timewait_store(__PTR, VAL);		\
>> +		__n = 0;						\
>> +	}								\
>> +	(typeof(*ptr))VAL;						\
>> +})
>> +#endif
>
> TBH, this still looks over-engineered to me, especially with the second
> patch trying to reduce the spin loops based on the remaining time. Does
> any of the current users of this interface need it to get more precise?

No, neither rqspinlock nor poll_idle() really cares about precision.
And, the slack even in this series is only useful for the waiting
implementation.

> Also I feel the spinning added to poll_idle() is more of an architecture
> choice as some CPUs could not cope with local_clock() being called too
> frequently.

Just on the frequency point -- I think it might be a more general
problem than one confined to specific architectures.

Architectures with GENERIC_SCHED_CLOCK could use a multitude of
clocksources and, from a quick look, some of them do iomem reads.
(AFAICT GENERIC_SCHED_CLOCK can also be selected by the clocksource
itself, so the choice of clock might not really be an arch choice
at all.)

Even for something like x86 which doesn't use GENERIC_SCHED_CLOCK,
we might be using tsc or jiffies or paravirt-clock all of which would
have very different performance characteristics. Or, just using a
clock more expensive than local_clock(); rqspinlock uses
ktime_get_mono_fast_ns().

So, I feel we do need a generic rate limiter.

> The above generic implementation takes a spin into
> consideration even if an arch implementation doesn't need it (e.g. WFET
> or WFE). Yes, the arch policy could set a spin of 0 but it feels overly
> complicated for the generic implementation.

Agree with the last point. My thought was that it might be okay to always
optimistically spin a little, just because WFE*/MWAITX etc might (?)
have an entry/exit cost even when the wakeup is immediate.

Though the code is wrong in that it always waits right after evaluating
the policy. I should have done something like this instead:

+#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,       \
+                                        time_expr, time_end,           \
+                                        slack) ({                      \
+       typeof(ptr) __PTR = (ptr);                                      \
+       __unqual_scalar_typeof(*ptr) VAL;                               \
+       u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;                   \
+       u64 __prev = 0, __end = (time_end);                             \
+       u64 __slack = slack;                                            \
+       bool __wait = false;                                            \
+                                                                       \
+       for (;;) {                                                      \
+               VAL = READ_ONCE(*__PTR);                                \
+               if (cond_expr)                                          \
+                       break;                                          \
+               cpu_relax();                                            \
+               if (++__n < __spin)                                     \
+                       continue;                                       \
+               if (__wait)                                             \
+                       __smp_timewait_store(__PTR, VAL);               \
+               if (!(__prev = policy((time_expr), __prev, __end,       \
+                                         &__spin, &__wait, __slack)))  \
+                       break;                                          \
+               __n = 0;                                                \
+       }                                                               \
+       (typeof(*ptr))VAL;                                              \
+})


> Can we instead have the generic implementation without any spinning?
> Just polling a variable with cpu_relax() like
> smp_cond_load_acquire/relaxed() with the additional check for time. We
> redefine it in the arch code.
>
>> +#define __check_time_types(type, a, b)			\
>> +		(__same_type(typeof(a), type) &&	\
>> +		 __same_type(typeof(b), type))
>> +
>> +/**
>> + * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
>> + * guarantees until a timeout expires.
>> + * @ptr: pointer to the variable to wait on
>> + * @cond: boolean expression to wait for
>> + * @time_expr: monotonic expression that evaluates to the current time
>> + * @time_end: end time, compared against time_expr
>> + * @slack: how much timer overshoot can the caller tolerate?
>> + * Useful for when we go into wait states. A value of 0 indicates a high
>> + * tolerance.
>> + *
>> + * Note that all times (time_expr, time_end, and slack) are in microseconds,
>> + * with no mandated precision.
>> + *
>> + * Equivalent to using READ_ONCE() on the condition variable.
>> + */
>> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr, time_expr,	\
>> +				       time_end, slack) ({		\
>> +	__unqual_scalar_typeof(*ptr) _val;				\
>> +	BUILD_BUG_ON_MSG(!__check_time_types(u64, time_expr, time_end),	\
>> +			 "incompatible time units");			\
>> +	_val = __smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
>> +						__smp_cond_policy,	\
>> +						time_expr, time_end,	\
>> +						slack);			\
>> +	(typeof(*ptr))_val;						\
>> +})
>
> Looking at the current user of the acquire variant - rqspinlock, it does
> not even bother with a time_expr but rather added the time condition to
> cond_expr. I don't think it has any "slack" requirements, only that
> there's no deadlock eventually.

So, that code only uses smp_cond_load_*_timewait() on arm64. Everywhere
else it just uses smp_cond_load_acquire() and because it jams both
of these interfaces together, it doesn't really use time_expr.

But, it needs more extensive rework so all platforms can use
__smp_cond_load_acquire_timewait with the deadlock check folded
inside its own policy handler.

Anyway let me detail that in my reply to your other mail.

> About poll_idle(), are there any slack requirement or we get away
> without?

I don't believe there are any slack requirements. Definitely not for
rqspinlock (given that it has a large timeout) and I believe also
not for poll_idle() since a timeout delay only leads to a slightly
delayed deeper sleep.

Question for Rafael, Daniel: With smp_cond_load_relaxed_timewait(), when
used in waiting mode instead of via the usual cpu_relax() spin, we
could overshoot by an architecturally defined granularity.
On arm64, that could be ~100us in the worst case. Do we have hard
requirements about timer overshoot in poll_idle()?

> I think we have two ways forward (well, at least):
>
> 1. Clearly define what time_end is and we won't need a time_expr at all.
>    This may work for poll_idle(), not sure about rqspinlock. The
>    advantage is that we can drop the 'slack' argument since none of the
>    current users seem to need it. The downside is that we need to know
>    exactly what this time_end is to convert it to timer cycles for a
>    WFET implementation on arm64.
>
> 2. Drop time_end and only leave time_expr as a bool (we don't care
>    whether it uses ns, jiffies or whatever underneath, it's just a
>    bool). In this case, we could use a 'slack' argument mostly to make a
>    decision on whether we use WFET, WFE or just polling with
>    cpu_relax(). For WFET, the wait time would be based on the slack
>    value rather than some absolute end time which we won't have.
>
> I'd go with (2), it looks simpler. Maybe even drop the 'slack' argument
> for the time being until we have a clear user. The fallback on arm64
> would be from wfe (if event streaming available), wfet with the same
> period as the event stream (in the absence of a slack argument) or
> cpu_relax().

So I like the approach with (2) quite a bit. It'll simplify the time
handling quite nicely. And, I think it is also good to drop slack
unless there's a use for it.

There's just one problem, which is that a notion of time-remaining
still seems quite important to me. Without it, it's difficult to know
how often to do the time-check. I could use an arbitrary parameter,
say evaluating time_expr once every N cpu_relax() loops, but that
seems worse than the current approach.

So, how about replacing the bool time_expr with a time_remaining_expr
(s32) which evaluates to a fixed time unit (ns)?

This also gives the WFET a clear end time (though it would still need
to be converted to timer cycles) but the WFE path could stay simple
by allowing an overshoot instead of falling back to polling.
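
Roughly, the call site would then become (again a sketch, with FLAG,
now_ns() and deadline_ns as placeholders):

	val = smp_cond_load_relaxed_timewait(ptr, VAL & FLAG,
					     deadline_ns - now_ns());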

--
ankur



* Re: [PATCH v3 3/5] asm-generic: barrier: Add smp_cond_load_acquire_timewait()
  2025-08-08  9:38   ` Catalin Marinas
@ 2025-08-12  5:18     ` Ankur Arora
  0 siblings, 0 replies; 26+ messages in thread
From: Ankur Arora @ 2025-08-12  5:18 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, bpf,
	arnd, will, peterz, akpm, mark.rutland, harisokn, cl, ast, memxor,
	zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
	konrad.wilk


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Thu, Jun 26, 2025 at 09:48:03PM -0700, Ankur Arora wrote:
>> diff --git a/arch/arm64/include/asm/rqspinlock.h b/arch/arm64/include/asm/rqspinlock.h
>> index 9ea0a74e5892..f1b6a428013e 100644
>> --- a/arch/arm64/include/asm/rqspinlock.h
>> +++ b/arch/arm64/include/asm/rqspinlock.h
>> @@ -86,7 +86,7 @@
>>
>>  #endif
>>
>> -#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire_timewait(v, c, 0, 1)
>> +#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire_timewait(v, c, 0ULL, 1ULL, 0)
>>
>>  #include <asm-generic/rqspinlock.h>
>>
>> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
>> index 8299c57d1110..dd7c9ca2dff3 100644
>> --- a/include/asm-generic/barrier.h
>> +++ b/include/asm-generic/barrier.h
>> @@ -388,6 +388,28 @@ static inline u64 ___smp_cond_spinwait(u64 now, u64 prev, u64 end,
>>  	(typeof(*ptr))_val;						\
>>  })
>>
>> +/**
>> + * smp_cond_load_acquire_timewait() - (Spin) wait for cond with ACQUIRE ordering
>> + * until a timeout expires.
>> + *
>> + * Arguments: same as smp_cond_load_relaxed_timeout().
>> + *
>> + * Equivalent to using smp_cond_load_acquire() on the condition variable with
>> + * a timeout.
>> + */
>> +#ifndef smp_cond_load_acquire_timewait
>> +#define smp_cond_load_acquire_timewait(ptr, cond_expr,			\
>> +				       time_expr, time_end,		\
>> +				       slack) ({			\
>> +	__unqual_scalar_typeof(*ptr) _val;				\
>> +	_val = smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
>> +					      time_expr, time_end,	\
>> +					      slack);			\
>> +	/* Depends on the control dependency of the wait above. */	\
>> +	smp_acquire__after_ctrl_dep();					\
>> +	(typeof(*ptr))_val;						\
>> +})
>> +#endif
>
> Using #ifndef in the generic file is the correct thing to do, it allows
> architectures to redefine it. Why we have a similar #ifndef in the arm64
> rqspinlock.h, no idea, none of the arm64 maintainers acked that patch
> (shouldn't have gone in really, we were still discussing the
> implementation at the time; I also think it's slightly wrong).
>
> Your change above to rqspinlock.h makes this even more confusing when
> you look at the overall result with all the patches applied. We end up
> with the same macro in asm/rqspinlock.h but with different number of
> arguments.

I agree that my change doesn't improve matters at all.

Just to lay out the problem, rqspinlock defines this in the common code:

  #ifndef res_smp_cond_load_acquire
  #define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire(v, c)
  #endif

And, the definition of res_smp_cond_load_acquire() (only on arm64)
essentially uses smp_cond_load_acquire_timewait() such that it will
be equivalent to smp_cond_load_acquire() but one that's guaranteed
to terminate.

> I'd start with ripping out the current arm64 implementation, add a
> generic implementation to barrier.h and then override it in the arch
> code.

The problem is that rqspinlock code is mostly written as if it is
working with smp_cond_load_acquire().

Fixing it needs some amount of refactoring. I had preliminary patches
to do that, but my preference was to send those out after
the barrier changes.

If you feel that is best done as part of this series, I can add those
patches to v4.


Thanks for the comments!

--
ankur



* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-11 21:15     ` Ankur Arora
@ 2025-08-13 16:09       ` Catalin Marinas
  2025-08-13 16:29         ` Arnd Bergmann
  2025-08-14  7:30         ` Ankur Arora
  0 siblings, 2 replies; 26+ messages in thread
From: Catalin Marinas @ 2025-08-13 16:09 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, bpf, arnd, will,
	peterz, akpm, mark.rutland, harisokn, cl, ast, memxor,
	zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
	konrad.wilk, rafael, daniel.lezcano

On Mon, Aug 11, 2025 at 02:15:56PM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
> > On Thu, Jun 26, 2025 at 09:48:01PM -0700, Ankur Arora wrote:
> >> +#ifndef __smp_cond_load_relaxed_timewait
> >> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,	\
> >> +					 time_expr, time_end,		\
> >> +					 slack) ({			\
> >> +	typeof(ptr) __PTR = (ptr);					\
> >> +	__unqual_scalar_typeof(*ptr) VAL;				\
> >> +	u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;			\
> >> +	u64 __prev = 0, __end = (time_end);				\
> >> +	u64 __slack = slack;						\
> >> +	bool __wait = false;						\
> >> +									\
> >> +	for (;;) {							\
> >> +		VAL = READ_ONCE(*__PTR);				\
> >> +		if (cond_expr)						\
> >> +			break;						\
> >> +		cpu_relax();						\
> >> +		if (++__n < __spin)					\
> >> +			continue;					\
> >> +		if (!(__prev = policy((time_expr), __prev, __end,	\
> >> +					  &__spin, &__wait, __slack)))	\
> >> +			break;						\
> >> +		if (__wait)						\
> >> +			__smp_timewait_store(__PTR, VAL);		\
> >> +		__n = 0;						\
> >> +	}								\
> >> +	(typeof(*ptr))VAL;						\
> >> +})
> >> +#endif
> >
> > TBH, this still looks over-engineered to me, especially with the second
> > patch trying to reduce the spin loops based on the remaining time. Does
> > any of the current users of this interface need it to get more precise?
> 
> No, neither of rqspinlock nor poll_idle() really care about precision.
> And, the slack even in this series is only useful for the waiting
> implementation.

I pretty much came to the same conclusion. I guess it depends on how we
implement it. With WFET, the precision depends on the hardware clock.
For WFE, we could use the 100us event stream only or we could do like
__delay() and fall back to busy spinning for smaller (or end of)
intervals.

The spinning variant may have some random slack depending on how long it
loops before checking the clock.

> > Also I feel the spinning added to poll_idle() is more of an architecture
> > choice as some CPUs could not cope with local_clock() being called too
> > frequently.
> 
> Just on the frequency point -- I think it might be a more general
> problem than just on specific architectures.
> 
> Architectures with GENERIC_SCHED_CLOCK could use a multitude of
> clocksources and from a quick look some of them do iomem reads.
> (AFAICT GENERIC_SCHED_CLOCK could also be selected by the clocksource
> itself, so an architecture header might not need to be an arch choice
> at  all.)
> 
> Even for something like x86 which doesn't use GENERIC_SCHED_CLOCK,
> we might be using tsc or jiffies or paravirt-clock all of which would
> have very different performance characteristics. Or, just using a
> clock more expensive than local_clock(); rqspinlock uses
> ktime_get_mono_fast_ns().
> 
> So, I feel we do need a generic rate limiter.

That's a good point but the rate limiting is highly dependent on the
architecture, what a CPU does in the loop, how fast a loop iteration is.
That's why I'd keep it hidden in the arch code.

> > The above generic implementation takes a spin into
> > consideration even if an arch implementation doesn't need it (e.g. WFET
> > or WFE). Yes, the arch policy could set a spin of 0 but it feels overly
> > complicated for the generic implementation.
> 
> Agree with the last point. My thought was that it might be okay to always
> optimistically spin a little, just because WFE*/MWAITX etc might (?)
>> have an entry/exit cost even when the wakeup is immediate.

The key is whether the cost is more expensive than some spinning. On
arm64, cpu_relax() doesn't do anything, so WFE is not any worse even if
it exits immediately (e.g. a SEVL+WFE loop). The only benefit would be
if the cost of local_clock() is more expensive than some busy spinning.
The arch code is better placed to know what kind of spinning is best.

> Though the code is wrong in that it always waits right after evaluating
> the policy. I should have done something like this instead:
> 
> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,       \
> +                                        time_expr, time_end,           \
> +                                        slack) ({                      \
> +       typeof(ptr) __PTR = (ptr);                                      \
> +       __unqual_scalar_typeof(*ptr) VAL;                               \
> +       u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;                   \
> +       u64 __prev = 0, __end = (time_end);                             \
> +       u64 __slack = slack;                                            \
> +       bool __wait = false;                                            \
> +                                                                       \
> +       for (;;) {                                                      \
> +               VAL = READ_ONCE(*__PTR);                                \
> +               if (cond_expr)                                          \
> +                       break;                                          \
> +               cpu_relax();                                            \
> +               if (++__n < __spin)                                     \
> +                       continue;                                       \
> +               if (__wait)                                             \
> +                       __smp_timewait_store(__PTR, VAL);               \
> +               if (!(__prev = policy((time_expr), __prev, __end,       \
> +                                         &__spin, &__wait, __slack)))  \
> +                       break;                                          \
> +               __n = 0;                                                \
> +       }                                                               \
> +       (typeof(*ptr))VAL;                                              \
> +})

I think both variants work just fine. The original one made more sense
with waiting immediately after the policy decided that it is needed.
With the above, you go through the loop one more time before detecting
__wait == true.

I thought the current __smp_cond_load_acquire_timewait() is not doing
the right thing but it does check cond_expr immediately after exiting
__cmpwait_relaxed() (if no timeout), so no unnecessary waiting.

> >> +#define __check_time_types(type, a, b)			\
> >> +		(__same_type(typeof(a), type) &&	\
> >> +		 __same_type(typeof(b), type))
> >> +
> >> +/**
> >> + * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
> >> + * guarantees until a timeout expires.
> >> + * @ptr: pointer to the variable to wait on
> >> + * @cond: boolean expression to wait for
> >> + * @time_expr: monotonic expression that evaluates to the current time
> >> + * @time_end: end time, compared against time_expr
> >> + * @slack: how much timer overshoot can the caller tolerate?
> >> + * Useful for when we go into wait states. A value of 0 indicates a high
> >> + * tolerance.
> >> + *
> >> + * Note that all times (time_expr, time_end, and slack) are in microseconds,
> >> + * with no mandated precision.
> >> + *
> >> + * Equivalent to using READ_ONCE() on the condition variable.
> >> + */
> >> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr, time_expr,	\
> >> +				       time_end, slack) ({		\
> >> +	__unqual_scalar_typeof(*ptr) _val;				\
> >> +	BUILD_BUG_ON_MSG(!__check_time_types(u64, time_expr, time_end),	\
> >> +			 "incompatible time units");			\
> >> +	_val = __smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
> >> +						__smp_cond_policy,	\
> >> +						time_expr, time_end,	\
> >> +						slack);			\
> >> +	(typeof(*ptr))_val;						\
> >> +})
> >
> > Looking at the current user of the acquire variant - rqspinlock, it does
> > not even bother with a time_expr but rather added the time condition to
> > cond_expr. I don't think it has any "slack" requirements, only that
> > there's no deadlock eventually.
> 
> So, that code only uses smp_cond_load_*_timewait() on arm64. Everywhere
> else it just uses smp_cond_load_acquire() and because it jams both
> of these interfaces together, it doesn't really use time_expr.
> 
> But, it needs more extensive rework so all platforms can use
> __smp_cond_load_acquire_timewait with the deadlock check folded
> inside its own policy handler.

We can have a generic:

#define smp_cond_load_acquire_timewait(ptr, cond_expr, time_expr) \
	smp_cond_load_acquire(ptr, (cond_expr) || (time_expr))

with some big comment that it may deadlock if an architecture does not
regularly check time_expr in the absence of a *ptr update.

Alternatively, just define res_smp_cond_load_acquire() in the bpf code
to take separate cond_expr and time_expr and do an 'or' between them in
the default implementation.
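
Something like the below, as an untested sketch (the three-argument
name is only for illustration):

	#ifndef res_smp_cond_load_acquire_timewait
	#define res_smp_cond_load_acquire_timewait(v, c, t)		\
		smp_cond_load_acquire(v, (c) || (t))
	#endif

with arm64 free to override it and feed 't' into the timewait primitive
instead of or-ing it into the condition.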

> > About poll_idle(), are there any slack requirement or we get away
> > without?
> 
> I don't believe there are any slack requirements. Definitely not for
> rqspinlock (given that it has a large timeout) and I believe also
> not for poll_idle() since a timeout delay only leads to a slightly
> delayed deeper sleep.
> 
> Question for Rafael, Daniel: With smp_cond_load_relaxed_timewait(), when
> used in waiting mode instead of via the usual cpu_relax() spin, we
> could overshoot by an architecturally defined granularity.
> On arm64, that could be ~100us in the worst case. Do we have hard
> requirements about timer overshoot in poll_idle()?

I can see a CPUIDLE_POLL_MIN and MAX defined as 10us and 250us
respectively (for a HZ of 250). Not sure it matters if we overshoot by
100us. If it does, we should do like the __delay() implementation with a
fall-back to a busy loop.

> > I think we have two ways forward (well, at least):
> >
> > 1. Clearly define what time_end is and we won't need a time_expr at all.
> >    This may work for poll_idle(), not sure about rqspinlock. The
> >    advantage is that we can drop the 'slack' argument since none of the
> >    current users seem to need it. The downside is that we need to know
> >    exactly what this time_end is to convert it to timer cycles for a
> >    WFET implementation on arm64.
> >
> > 2. Drop time_end and only leave time_expr as a bool (we don't care
> >    whether it uses ns, jiffies or whatever underneath, it's just a
> >    bool). In this case, we could use a 'slack' argument mostly to make a
> >    decision on whether we use WFET, WFE or just polling with
> >    cpu_relax(). For WFET, the wait time would be based on the slack
> >    value rather than some absolute end time which we won't have.
> >
> > I'd go with (2), it looks simpler. Maybe even drop the 'slack' argument
> > for the time being until we have a clear user. The fallback on arm64
> > would be from wfe (if event streaming available), wfet with the same
> > period as the event stream (in the absence of a slack argument) or
> > cpu_relax().
> 
> So I like the approach with (2) quite a bit. It'll simplify the time
> handling quite nicely. And, I think it is also good to drop slack
> unless there's a use for it.
> 
> There's just one problem, which is that a notion of time-remaining
> still seems quite important to me. Without it, it's difficult to know
> how often to do the time-check etc. I could use an arbitrary
> parameter, say evaluate time_expr once every N cpu_relax() loops etc
> but that seems worse than the current approach.
> 
> So, how about replacing the bool time_expr, with a time_remaining_expr
> (s32) which evaluates to a fixed time unit (ns).

I'd use ktime_t instead of s32. It is already signed and can represent
(negative) time deltas. The downside is that we need to use
ns_to_ktime() etc. for conversion.
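
E.g. (purely illustrative, with end_ns being some precomputed deadline
in ns), the caller side would end up doing something like:

	smp_cond_load_relaxed_timewait(ptr, cond,
			ns_to_ktime(end_ns - local_clock()));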

However, in the absence of some precision requirement for the potential
two users of this interface, I think we complicate things unnecessarily.
The only advantage is if you want to make it future proof, in case we
ever need more precision.

> This also gives the WFET a clear end time (though it would still need
> to be converted to timer cycles) but the WFE path could stay simple
> by allowing an overshoot instead of falling back to polling.

For arm64, both WFE and WFET would be woken up by the event stream
(which is enabled on all production systems). The only reason to use
WFET is if you need smaller granularity than the event stream period
(100us). In this case, we should probably also add a fallback from WFE
to a busy loop.

-- 
Catalin


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-13 16:09       ` Catalin Marinas
@ 2025-08-13 16:29         ` Arnd Bergmann
  2025-08-13 16:54           ` Christoph Lameter (Ampere)
  2025-08-14 13:00           ` Catalin Marinas
  2025-08-14  7:30         ` Ankur Arora
  1 sibling, 2 replies; 26+ messages in thread
From: Arnd Bergmann @ 2025-08-13 16:29 UTC (permalink / raw)
  To: Catalin Marinas, Ankur Arora
  Cc: linux-kernel, Linux-Arch, linux-arm-kernel, bpf, Will Deacon,
	Peter Zijlstra, Andrew Morton, Mark Rutland, harisokn, cl,
	Alexei Starovoitov, Kumar Kartikeya Dwivedi, zhenglifeng1,
	xueshuai, Joao Martins, Boris Ostrovsky, Konrad Rzeszutek Wilk,
	Rafael J . Wysocki, Daniel Lezcano

On Wed, Aug 13, 2025, at 18:09, Catalin Marinas wrote:
> On Mon, Aug 11, 2025 at 02:15:56PM -0700, Ankur Arora wrote:
>
>> This also gives the WFET a clear end time (though it would still need
>> to be converted to timer cycles) but the WFE path could stay simple
>> by allowing an overshoot instead of falling back to polling.
>
> For arm64, both WFE and WFET would be woken up by the event stream
> (which is enabled on all production systems). The only reason to use
> WFET is if you need smaller granularity than the event stream period
> (100us). In this case, we should probably also add a fallback from WFE
> to a busy loop.

I think there is a reasonable chance that in the future we may want
to turn the event stream off on systems that support WFET, since that
is better for both low-power systems and virtual machines with CPU
overcommit.

    Arnd


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-13 16:29         ` Arnd Bergmann
@ 2025-08-13 16:54           ` Christoph Lameter (Ampere)
  2025-08-14 13:00           ` Catalin Marinas
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-08-13 16:54 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Catalin Marinas, Ankur Arora, linux-kernel, Linux-Arch,
	linux-arm-kernel, bpf, Will Deacon, Peter Zijlstra, Andrew Morton,
	Mark Rutland, harisokn, Alexei Starovoitov,
	Kumar Kartikeya Dwivedi, zhenglifeng1, xueshuai, Joao Martins,
	Boris Ostrovsky, Konrad Rzeszutek Wilk, Rafael J . Wysocki,
	Daniel Lezcano

On Wed, 13 Aug 2025, Arnd Bergmann wrote:

> I think there is a reasonable chance that in the future we may want
> to turn the event stream off on systems that support WFET, since that
> is better for both low-power systems and virtual machines with CPU
> overcommit.

WFET is coming in with V9.something, I believe, so it's good to be
prepared, and it will simplify the logic.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-13 16:09       ` Catalin Marinas
  2025-08-13 16:29         ` Arnd Bergmann
@ 2025-08-14  7:30         ` Ankur Arora
  2025-08-14 11:39           ` Catalin Marinas
  1 sibling, 1 reply; 26+ messages in thread
From: Ankur Arora @ 2025-08-14  7:30 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, bpf,
	arnd, will, peterz, akpm, mark.rutland, harisokn, cl, ast, memxor,
	zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
	konrad.wilk, rafael, daniel.lezcano


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Mon, Aug 11, 2025 at 02:15:56PM -0700, Ankur Arora wrote:
>> Catalin Marinas <catalin.marinas@arm.com> writes:
>> > On Thu, Jun 26, 2025 at 09:48:01PM -0700, Ankur Arora wrote:
>> >> +#ifndef __smp_cond_load_relaxed_timewait
>> >> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,	\
>> >> +					 time_expr, time_end,		\
>> >> +					 slack) ({			\
>> >> +	typeof(ptr) __PTR = (ptr);					\
>> >> +	__unqual_scalar_typeof(*ptr) VAL;				\
>> >> +	u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;			\
>> >> +	u64 __prev = 0, __end = (time_end);				\
>> >> +	u64 __slack = slack;						\
>> >> +	bool __wait = false;						\
>> >> +									\
>> >> +	for (;;) {							\
>> >> +		VAL = READ_ONCE(*__PTR);				\
>> >> +		if (cond_expr)						\
>> >> +			break;						\
>> >> +		cpu_relax();						\
>> >> +		if (++__n < __spin)					\
>> >> +			continue;					\
>> >> +		if (!(__prev = policy((time_expr), __prev, __end,	\
>> >> +					  &__spin, &__wait, __slack)))	\
>> >> +			break;						\
>> >> +		if (__wait)						\
>> >> +			__smp_timewait_store(__PTR, VAL);		\
>> >> +		__n = 0;						\
>> >> +	}								\
>> >> +	(typeof(*ptr))VAL;						\
>> >> +})
>> >> +#endif
>> >
>> > TBH, this still looks over-engineered to me, especially with the second
>> > patch trying to reduce the spin loops based on the remaining time. Does
>> > any of the current users of this interface need it to get more precise?
>>
>> No, neither of rqspinlock nor poll_idle() really care about precision.
>> And, the slack even in this series is only useful for the waiting
>> implementation.
>
> I pretty much came to the same conclusion. I guess it depends on how we
> implement it. With WFET, the precision depends on the hardware clock.
> For WFE, we could use the 100us event stream only or we could do like
> __delay() and fall back to busy spinning for smaller (or end of)
> intervals.
>
> The spinning variant may have some random slack depending on how long it
> loops before checking the clock.

Yeah. My conclusion too. v2 was taking the slack into account while
spinning. For v3 I got rid of that. And, for v4 the slack goes away
entirely :).

>> > Also I feel the spinning added to poll_idle() is more of an architecture
>> > choice as some CPUs could not cope with local_clock() being called too
>> > frequently.
>>
>> Just on the frequency point -- I think it might be a more general
>> problem than just on specific architectures.
>>
>> Architectures with GENERIC_SCHED_CLOCK could use a multitude of
>> clocksources and from a quick look some of them do iomem reads.
>> (AFAICT GENERIC_SCHED_CLOCK could also be selected by the clocksource
>> itself, so an architecture header might not need to be an arch choice
>> at  all.)
>>
>> Even for something like x86 which doesn't use GENERIC_SCHED_CLOCK,
>> we might be using tsc or jiffies or paravirt-clock all of which would
>> have very different performance characteristics. Or, just using a
>> clock more expensive than local_clock(); rqspinlock uses
>> ktime_get_mono_fast_ns().
>>
>> So, I feel we do need a generic rate limiter.
>
> That's a good point but the rate limiting is highly dependent on the
> architecture, what a CPU does in the loop, how fast a loop iteration is.
>
> That's why I'd keep it hidden in the arch code.

Yeah, this makes sense. However, I would like to keep as much of the
code that does this common as possible.

Currently the spin rate limit is scaled up or down based on the current
spin (or SMP_TIMEWAIT_SPIN_BASE) in the common __smp_cond_spinwait()
which implements a default policy.
SMP_TIMEWAIT_SPIN_BASE can already be chosen by the architecture but
that doesn't allow it any runtime say in the spinning.

I think a good way to handle this might be the same way that the wait
policy is handled. When the architecture handler (__smp_cond_timewait()
on arm64) is used, it should be able to choose both spin and wait and
can feed those back into __smp_cond_spinwait() which can do the scaling
based on that.

>> > The above generic implementation takes a spin into
>> > consideration even if an arch implementation doesn't need it (e.g. WFET
>> > or WFE). Yes, the arch policy could set a spin of 0 but it feels overly
>> > complicated for the generic implementation.
>>
>> Agree with the last point. My thought was that it might be okay to always
>> optimistically spin a little, just because WFE*/MWAITX etc might (?)
>> have an entry/exit cost even when the wakeup is immediate.
>
> The key is whether the cost is more expensive than some spinning. On
> arm64, cpu_relax() doesn't do anything, so WFE is not any worse even if
> it exits immediately (e.g. a SEVL+WFE loop). The only benefit would be
> if the cost of local_clock() is more expensive than some busy spinning.
> The arch code is better placed to know what kind of spinning is best.

Yeah, that's a good point.

>> Though the code is wrong in that it always waits right after evaluating
>> the policy. I should have done something like this instead:
>>
>> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, policy,       \
>> +                                        time_expr, time_end,           \
>> +                                        slack) ({                      \
>> +       typeof(ptr) __PTR = (ptr);                                      \
>> +       __unqual_scalar_typeof(*ptr) VAL;                               \
>> +       u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_BASE;                   \
>> +       u64 __prev = 0, __end = (time_end);                             \
>> +       u64 __slack = slack;                                            \
>> +       bool __wait = false;                                            \
>> +                                                                       \
>> +       for (;;) {                                                      \
>> +               VAL = READ_ONCE(*__PTR);                                \
>> +               if (cond_expr)                                          \
>> +                       break;                                          \
>> +               cpu_relax();                                            \
>> +               if (++__n < __spin)                                     \
>> +                       continue;                                       \
>> +               if (__wait)                                             \
>> +                       __smp_timewait_store(__PTR, VAL);               \
>> +               if (!(__prev = policy((time_expr), __prev, __end,       \
>> +                                         &__spin, &__wait, __slack)))  \
>> +                       break;                                          \
>> +               __n = 0;                                                \
>> +       }                                                               \
>> +       (typeof(*ptr))VAL;                                              \
>> +})
>
> I think both variants work just fine. The original one made more sense
> with waiting immediately after the policy decided that it is needed.
> With the above, you go through the loop one more time before detecting
> __wait == true.

Yeah you are right. This version evaluates the time-check unnecessarily
even in the exit path.

> I thought the current __smp_cond_load_acquire_timewait() is not doing
> the right thing but it does check cond_expr immediately after exiting
> __cmpwait_relaxed() (if no timeout), so no unnecessary waiting.
>
>> >> +#define __check_time_types(type, a, b)			\
>> >> +		(__same_type(typeof(a), type) &&	\
>> >> +		 __same_type(typeof(b), type))
>> >> +
>> >> +/**
>> >> + * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
>> >> + * guarantees until a timeout expires.
>> >> + * @ptr: pointer to the variable to wait on
>> >> + * @cond: boolean expression to wait for
>> >> + * @time_expr: monotonic expression that evaluates to the current time
>> >> + * @time_end: end time, compared against time_expr
>> >> + * @slack: how much timer overshoot can the caller tolerate?
>> >> + * Useful for when we go into wait states. A value of 0 indicates a high
>> >> + * tolerance.
>> >> + *
>> >> + * Note that all times (time_expr, time_end, and slack) are in microseconds,
>> >> + * with no mandated precision.
>> >> + *
>> >> + * Equivalent to using READ_ONCE() on the condition variable.
>> >> + */
>> >> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr, time_expr,	\
>> >> +				       time_end, slack) ({		\
>> >> +	__unqual_scalar_typeof(*ptr) _val;				\
>> >> +	BUILD_BUG_ON_MSG(!__check_time_types(u64, time_expr, time_end),	\
>> >> +			 "incompatible time units");			\
>> >> +	_val = __smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
>> >> +						__smp_cond_policy,	\
>> >> +						time_expr, time_end,	\
>> >> +						slack);			\
>> >> +	(typeof(*ptr))_val;						\
>> >> +})
>> >
>> > Looking at the current user of the acquire variant - rqspinlock, it does
>> > not even bother with a time_expr but rather added the time condition to
>> > cond_expr. I don't think it has any "slack" requirements, only that
>> > there's no deadlock eventually.
>>
>> So, that code only uses smp_cond_load_*_timewait() on arm64. Everywhere
>> else it just uses smp_cond_load_acquire() and because it jams both
>> of these interfaces together, it doesn't really use time_expr.
>>
>> But, it needs more extensive rework so all platforms can use
>> __smp_cond_load_acquire_timewait with the deadlock check folded
>> inside its own policy handler.
>
> We can have a generic:
>
> #define smp_cond_load_acquire_timewait(ptr, cond_expr, time_expr) \
> 	smp_cond_load_acquire(ptr, (cond_expr) || (time_expr))
>
> with some big comment that it may deadlock if an architecture does not
> regularly check time_expr in the absence of a *ptr update.
>
> Alternatively, just define res_smp_cond_load_acquire() in the bpf code
> to take separate cond_expr and time_expr and do an 'or' between them in
> the default implementation.

Yeah, this seems like the way to go. Any cleanups can follow in a separate
series.

>> > About poll_idle(), are there any slack requirement or we get away
>> > without?
>>
>> I don't believe there are any slack requirements. Definitely not for
>> rqspinlock (given that it has a large timeout) and I believe also
>> not for poll_idle() since a timeout delay only leads to a slightly
>> delayed deeper sleep.
>>
>> Question for Rafael, Daniel: With smp_cond_load_relaxed_timewait(), when
>> used in waiting mode instead of via the usual cpu_relax() spin, we
>> could overshoot by an architecturally defined granularity.
>> On arm64, that could be ~100us in the worst case. Do we have hard
>> requirements about timer overshoot in poll_idle()?
>
> I can see a CPUIDLE_POLL_MIN and MAX defined as 10us and 250us
> respectively (for a HZ of 250). Not sure it matters if we overshoot by
> 100us. If it does, we should do like the __delay() implementation with a
> fall-back to a busy loop.

Sounds good to me.

>> > I think we have two ways forward (well, at least):
>> >
>> > 1. Clearly define what time_end is and we won't need a time_expr at all.
>> >    This may work for poll_idle(), not sure about rqspinlock. The
>> >    advantage is that we can drop the 'slack' argument since none of the
>> >    current users seem to need it. The downside is that we need to know
>> >    exactly what this time_end is to convert it to timer cycles for a
>> >    WFET implementation on arm64.
>> >
>> > 2. Drop time_end and only leave time_expr as a bool (we don't care
>> >    whether it uses ns, jiffies or whatever underneath, it's just a
>> >    bool). In this case, we could use a 'slack' argument mostly to make a
>> >    decision on whether we use WFET, WFE or just polling with
>> >    cpu_relax(). For WFET, the wait time would be based on the slack
>> >    value rather than some absolute end time which we won't have.
>> >
>> > I'd go with (2), it looks simpler. Maybe even drop the 'slack' argument
>> > for the time being until we have a clear user. The fallback on arm64
>> > would be from wfe (if event streaming available), wfet with the same
>> > period as the event stream (in the absence of a slack argument) or
>> > cpu_relax().
>>
>> So I like the approach with (2) quite a bit. It'll simplify the time
>> handling quite nicely. And, I think it is also good to drop slack
>> unless there's a use for it.
>>
>> There's just one problem, which is that a notion of time-remaining
>> still seems quite important to me. Without it, it's difficult to know
>> how often to do the time-check etc. I could use an arbitrary
>> parameter, say evaluate time_expr once every N cpu_relax() loops etc
>> but that seems worse than the current approach.
>>
>> So, how about replacing the bool time_expr, with a time_remaining_expr
>> (s32) which evaluates to a fixed time unit (ns).
>
> I'd use ktime_t instead of s32. It is already signed and can represent
> (negative) time deltas. The downside is that we need to use
> ns_to_ktime() etc. for conversion.

Apart from that aspect of it, ktime_t seems ideal.

> However, in the absence of some precision requirement for the potential
> two users of this interface, I think we complicate things unnecessarily.
> The only advantage is if you want to make it future proof, in case we
> ever need more precision.

There's the future proofing aspect but also having time-remaining
simplifies the rate limiting of the time-check because now it can
rate-limit depending on how often the policy handler is called and
the remaining time.

>> This also gives the WFET a clear end time (though it would still need
>> to be converted to timer cycles) but the WFE path could stay simple
>> by allowing an overshoot instead of falling back to polling.
>
> For arm64, both WFE and WFET would be woken up by the event stream
> (which is enabled on all production systems). The only reason to use
> WFET is if you need smaller granularity than the event stream period
> (100us). In this case, we should probably also add a fallback from WFE
> to a busy loop.

What do you think would be a good boundary for transitioning to a busy loop?

Say, we have < 100us left and the event-stream is 100us. We could do
what __delay() does and spin for the remaining time. But given that we
don't care about precision, at least until there's a need for it, it seems
better to err on the side of saving power.

So, how about switching to busy-looping when we get to event-stream-period/4?
(and note that in a comment block.)

--
ankur


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-14  7:30         ` Ankur Arora
@ 2025-08-14 11:39           ` Catalin Marinas
  2025-08-17 22:14             ` Ankur Arora
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2025-08-14 11:39 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, bpf, arnd, will,
	peterz, akpm, mark.rutland, harisokn, cl, ast, memxor,
	zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
	konrad.wilk, rafael, daniel.lezcano

On Thu, Aug 14, 2025 at 12:30:36AM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
> > On Mon, Aug 11, 2025 at 02:15:56PM -0700, Ankur Arora wrote:
> >> Catalin Marinas <catalin.marinas@arm.com> writes:
> >> > Also I feel the spinning added to poll_idle() is more of an architecture
> >> > choice as some CPUs could not cope with local_clock() being called too
> >> > frequently.
> >>
> >> Just on the frequency point -- I think it might be a more general
> >> problem than just on specific architectures.
> >>
> >> Architectures with GENERIC_SCHED_CLOCK could use a multitude of
> >> clocksources and from a quick look some of them do iomem reads.
> >> (AFAICT GENERIC_SCHED_CLOCK could also be selected by the clocksource
> >> itself, so an architecture header might not need to be an arch choice
> >> at  all.)
> >>
> >> Even for something like x86 which doesn't use GENERIC_SCHED_CLOCK,
> >> we might be using tsc or jiffies or paravirt-clock all of which would
> >> have very different performance characteristics. Or, just using a
> >> clock more expensive than local_clock(); rqspinlock uses
> >> ktime_get_mono_fast_ns().
> >>
> >> So, I feel we do need a generic rate limiter.
> >
> > That's a good point but the rate limiting is highly dependent on the
> > architecture, what a CPU does in the loop, how fast a loop iteration is.
> >
> > That's why I'd keep it hidden in the arch code.
> 
> Yeah, this makes sense. However, I would like to keep as much of the
> code that does this common as possible.

You can mimic what poll_idle() does for x86 in the generic
implementation, maybe with some comment referring to the poll_idle() CPU
usage of calling local_clock() in a loop. However, allow the arch code
to override the whole implementation and get rid of the policy. If an
arch wants to spin for some power reason, it can do it itself. The code
duplication for a while loop is much more readable than a policy setting
some spin/wait parameters just to have a single spin loop. If at some
point we see some pattern, we could revisit the common code.
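
From memory, the poll_idle() pattern is roughly the below (paraphrased,
not the exact code), and the generic fallback could mirror it:

	unsigned int loop_count = 0;
	u64 limit = cpuidle_poll_time(drv, dev);
	u64 start = local_clock();

	while (!need_resched()) {
		cpu_relax();
		if (loop_count++ < POLL_IDLE_RELAX_COUNT)
			continue;
		loop_count = 0;
		/* pay for the clock read only once every N iterations */
		if (local_clock() - start > limit)
			break;
	}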

For arm64, I doubt the extra spinning makes any difference. Our
cpu_relax() doesn't do anything (almost), it's probably about the same
cost as reading the monotonic clock. I also see a single definition
close enough to the logic in __delay() on arm64. It would be more
readable than a policy callback setting wait/spin with a separate call
for actually waiting. Well, gut feel, let's see how that would look.

> Currently the spin rate limit is scaled up or down based on the current
> spin (or SMP_TIMEWAIT_SPIN_BASE) in the common __smp_cond_spinwait()
> which implements a default policy.
> SMP_TIMEWAIT_SPIN_BASE can already be chosen by the architecture but
> that doesn't allow it any runtime say in the spinning.
> 
> I think a good way to handle this might be the same way that the wait
> policy is handled. When the architecture handler (__smp_cond_timewait()
> on arm64) is used, it should be able to choose both spin and wait and
> can feed those back into __smp_cond_spinwait() which can do the scaling
> based on that.

Exactly.

> > However, in the absence of some precision requirement for the potential
> > two users of this interface, I think we complicate things unnecessarily.
> > The only advantage is if you want to make it future proof, in case we
> > ever need more precision.
> 
> There's the future proofing aspect but also having time-remaining
> simplifies the rate limiting of the time-check because now it can
> rate-limit depending on how often the policy handler is called and
> the remaining time.

To keep things simple, I'd skip the automatic adjustment of the rate
limiting in the generic code. Just go for a fixed one until someone
complains about the slack.

> >> This also gives the WFET a clear end time (though it would still need
> >> to be converted to timer cycles) but the WFE path could stay simple
> >> by allowing an overshoot instead of falling back to polling.
> >
> > For arm64, both WFE and WFET would be woken up by the event stream
> > (which is enabled on all production systems). The only reason to use
> > WFET is if you need smaller granularity than the event stream period
> > (100us). In this case, we should probably also add a fallback from WFE
> > to a busy loop.
> 
> What do you think would be a good boundary for transitioning to a busy
> loop?
> 
> Say, we have < 100us left and the event-stream is 100us. We could do
> what __delay() does and spin for the remaining time. But given that we
> don't care about precision, at least until there's a need for it, it seems
> better to err on the side of saving power.
> 
> So, how about switching to busy-looping when we get to event-stream-period/4?
> (and note that in a comment block.)

Let's do like __delay() (the principle, not necessarily copying the
code) - start with WFET if available, WFE otherwise. For the latter,
when the time left is <100us, fall back to spinning. We can later get
someone to benchmark the power usage and we can revisit. Longer term,
WFET would be sufficient without any spinning.

That said, I think our __delay() implementation doesn't need spinning if
WFET is available:

diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index cb2062e7e234..5d4c28db399a 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -43,10 +43,10 @@ void __delay(unsigned long cycles)
 
 		while ((get_cycles() - start + timer_evt_period) < cycles)
 			wfe();
-	}
 
-	while ((get_cycles() - start) < cycles)
-		cpu_relax();
+		while ((get_cycles() - start) < cycles)
+			cpu_relax();
+	}
 }
 EXPORT_SYMBOL(__delay);
 
-- 
Catalin


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-13 16:29         ` Arnd Bergmann
  2025-08-13 16:54           ` Christoph Lameter (Ampere)
@ 2025-08-14 13:00           ` Catalin Marinas
  2025-08-18 11:51             ` Arnd Bergmann
  1 sibling, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2025-08-14 13:00 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Ankur Arora, linux-kernel, Linux-Arch, linux-arm-kernel, bpf,
	Will Deacon, Peter Zijlstra, Andrew Morton, Mark Rutland,
	harisokn, cl, Alexei Starovoitov, Kumar Kartikeya Dwivedi,
	zhenglifeng1, xueshuai, Joao Martins, Boris Ostrovsky,
	Konrad Rzeszutek Wilk, Rafael J . Wysocki, Daniel Lezcano

On Wed, Aug 13, 2025 at 06:29:37PM +0200, Arnd Bergmann wrote:
> On Wed, Aug 13, 2025, at 18:09, Catalin Marinas wrote:
> > On Mon, Aug 11, 2025 at 02:15:56PM -0700, Ankur Arora wrote:
> >> This also gives the WFET a clear end time (though it would still need
> >> to be converted to timer cycles) but the WFE path could stay simple
> >> by allowing an overshoot instead of falling back to polling.
> >
> > For arm64, both WFE and WFET would be woken up by the event stream
> > (which is enabled on all production systems). The only reason to use
> > WFET is if you need smaller granularity than the event stream period
> > (100us). In this case, we should probably also add a fallback from WFE
> > to a busy loop.
> 
> I think there is a reasonable chance that in the future we may want
> to turn the event stream off on systems that support WFET, since that
> is better for both low-power systems

Definitely better for low power.

> and virtual machines with CPU overcommit.

Not sure it helps here. With vCPU overcommit, KVM enables WFE trapping
and the event stream no longer has any effect (it's not like it
interrupts the host).

That said, my worry is that either broken hardware or software rely on
the event stream unknowingly, e.g. someone using WFE in a busy loop. And
for hardware errata, we've had a few where the wakeup events don't
propagate between clusters, though these we can toggle on a case by case
basis.

-- 
Catalin


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-14 11:39           ` Catalin Marinas
@ 2025-08-17 22:14             ` Ankur Arora
  2025-08-18 17:55               ` Catalin Marinas
  0 siblings, 1 reply; 26+ messages in thread
From: Ankur Arora @ 2025-08-17 22:14 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, bpf,
	arnd, will, peterz, akpm, mark.rutland, harisokn, cl, ast, memxor,
	zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
	konrad.wilk, rafael, daniel.lezcano


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Thu, Aug 14, 2025 at 12:30:36AM -0700, Ankur Arora wrote:
>> Catalin Marinas <catalin.marinas@arm.com> writes:
>> > On Mon, Aug 11, 2025 at 02:15:56PM -0700, Ankur Arora wrote:
>> >> Catalin Marinas <catalin.marinas@arm.com> writes:
>> >> > Also I feel the spinning added to poll_idle() is more of an architecture
>> >> > choice as some CPUs could not cope with local_clock() being called too
>> >> > frequently.
>> >>
>> >> Just on the frequency point -- I think it might be a more general
>> >> problem than just on specific architectures.
>> >>
>> >> Architectures with GENERIC_SCHED_CLOCK could use a multitude of
>> >> clocksources and from a quick look some of them do iomem reads.
>> >> (AFAICT GENERIC_SCHED_CLOCK could also be selected by the clocksource
>> >> itself, so an architecture header might not need to be an arch choice
>> >> at  all.)
>> >>
>> >> Even for something like x86 which doesn't use GENERIC_SCHED_CLOCK,
>> >> we might be using tsc or jiffies or paravirt-clock all of which would
>> >> have very different performance characteristics. Or, just using a
>> >> clock more expensive than local_clock(); rqspinlock uses
>> >> ktime_get_mono_fast_ns().
>> >>
>> >> So, I feel we do need a generic rate limiter.
>> >
>> > That's a good point but the rate limiting is highly dependent on the
>> > architecture, what a CPU does in the loop, how fast a loop iteration is.
>> >
>> > That's why I'd keep it hidden in the arch code.
>>
>> Yeah, this makes sense. However, I would like to keep as much of the
>> code that does this common as possible.
>
> You can mimic what poll_idle() does for x86 in the generic
> implementation, maybe with some comment referring to the poll_idle() CPU
> usage of calling local_clock() in a loop. However, allow the arch code
> to override the whole implementation and get rid of the policy. If an
> arch wants to spin for some power reason, it can do it itself. The code
> duplication for a while loop is much more readable than a policy setting
> some spin/wait parameters just to have a single spin loop. If at some
> point we see some pattern, we could revisit the common code.
>
> For arm64, I doubt the extra spinning makes any difference. Our
> cpu_relax() doesn't do anything (almost), it's probably about the same
> cost as reading the monotonic clock. I also see a single definition
> close enough to the logic in __delay() on arm64. It would be more
> readable than a policy callback setting wait/spin with a separate call
> for actually waiting. Well, gut feel, let's see how that would look.

So, I tried to pare back the code and the following (untested) is
what I came up with. Given the straight-forward rate-limiting, and the
current users not needing accurate timekeeping, this uses a
bool time_check_expr. Figured I'd keep it simple until someone actually
needs greater complexity as you suggested.

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..e8793347a395 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,34 @@ do {                                                                       \
 })
 #endif

+
+#ifndef SMP_TIMEWAIT_SPIN_COUNT
+#define SMP_TIMEWAIT_SPIN_COUNT                200
+#endif
+
+#ifndef smp_cond_load_relaxed_timewait
+#define smp_cond_load_relaxed_timewait(ptr, cond_expr,                 \
+                                       time_check_expr)                \
+({                                                                     \
+       typeof(ptr) __PTR = (ptr);                                      \
+       __unqual_scalar_typeof(*ptr) VAL;                               \
+       u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_COUNT;                  \
+                                                                       \
+       for (;;) {                                                      \
+               VAL = READ_ONCE(*__PTR);                                \
+               if (cond_expr)                                          \
+                       break;                                          \
+               cpu_relax();                                            \
+               if (++__n < __spin)                                     \
+                       continue;                                       \
+               if ((time_check_expr))                                  \
+                       break;                                          \
+               __n = 0;                                                \
+       }                                                               \
+       (typeof(*ptr))VAL;                                              \
+})
+#endif
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index f5801b0ba9e9..c9934ab68da2 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -219,6 +219,43 @@ do {                                                                       \
        (typeof(*ptr))VAL;                                              \
 })

+extern bool arch_timer_evtstrm_available(void);
+
+#ifndef SMP_TIMEWAIT_SPIN_COUNT
+#define SMP_TIMEWAIT_SPIN_COUNT                200
+#endif
+
+#define smp_cond_load_relaxed_timewait(ptr, cond_expr,                 \
+                                         time_check_expr)              \
+({                                                                     \
+       typeof(ptr) __PTR = (ptr);                                      \
+       __unqual_scalar_typeof(*ptr) VAL;                               \
+       u32 __n = 0, __spin = 0;                                        \
+       bool __wfet = alternative_has_cap_unlikely(ARM64_HAS_WFXT);     \
+       bool __wfe = arch_timer_evtstrm_available();                    \
+       bool __wait = false;                                            \
+                                                                       \
+       if (__wfet || __wfe)                                            \
+               __wait = true;                                          \
+       else                                                            \
+               __spin = SMP_TIMEWAIT_SPIN_COUNT;                       \
+                                                                       \
+       for (;;) {                                                      \
+               VAL = READ_ONCE(*__PTR);                                \
+               if (cond_expr)                                          \
+                       break;                                          \
+               cpu_relax();                                            \
+               if (++__n < __spin)                                     \
+                       continue;                                       \
+               if ((time_check_expr))                                  \
+                       break;                                          \
+               if (__wait)                                             \
+                       __cmpwait_relaxed(__PTR, VAL);                  \
+               __n = 0;                                                \
+       }                                                               \
+       (typeof(*ptr))VAL;                                              \
+})
+
 #include <asm-generic/barrier.h>

__cmpwait_relaxed() will need adjustment to set a deadline for WFET.

AFAICT the rqspinlock code should be able to work by specifying something
like:
  ((ktime_get_mono_fast_ns() > tval) || (deadlock_check(&lock_context)))
as the time_check_expr.
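
i.e. something like this, assuming the acquire wrapper grows the same
three arguments (tval and lock_context are just placeholder names):

	smp_cond_load_acquire_timewait(v, c,
			(ktime_get_mono_fast_ns() > tval) ||
			deadlock_check(&lock_context));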

I think they also want to rate limit how often deadlock_check() is
called, so they can redefine SMP_TIMEWAIT_SPIN_COUNT to some large
value for arm64.

How does this look?


Thanks

--
ankur


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-14 13:00           ` Catalin Marinas
@ 2025-08-18 11:51             ` Arnd Bergmann
  2025-08-18 18:28               ` Catalin Marinas
  0 siblings, 1 reply; 26+ messages in thread
From: Arnd Bergmann @ 2025-08-18 11:51 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-kernel, Linux-Arch, linux-arm-kernel, bpf,
	Will Deacon, Peter Zijlstra, Andrew Morton, Mark Rutland,
	harisokn, Christoph Lameter (Ampere), Alexei Starovoitov,
	Kumar Kartikeya Dwivedi, zhenglifeng1, xueshuai, Joao Martins,
	Boris Ostrovsky, Konrad Rzeszutek Wilk, Rafael J . Wysocki,
	Daniel Lezcano

On Thu, Aug 14, 2025, at 15:00, Catalin Marinas wrote:
> On Wed, Aug 13, 2025 at 06:29:37PM +0200, Arnd Bergmann wrote:
>> On Wed, Aug 13, 2025, at 18:09, Catalin Marinas wrote:
>> and virtual machines with CPU overcommit.
>
> Not sure it helps here. With vCPU overcommit, KVM enables WFE trapping
> and the event stream no longer has any effect (it's not like it
> interrupts the host).

I would expect a similar overhead for the WFE trapping as for the
bare-metal hardware case: When the WFE traps, the host has to
reschedule all guests that are in WFE periodically, while WFET
with event stream disabled means this can be driven by an accurate
host timer.

> That said, my worry is that either broken hardware or software rely on
> the event stream unknowingly, e.g. someone using WFE in a busy loop. And
> for hardware errata, we've had a few where the wakeup events don't
> propagate between clusters, though these we can toggle on a case by case
> basis.

Don't we already support hardware without a functional architected
timer event stream? Those don't use the event stream today even when
CONFIG_ARM_ARCH_TIMER_EVTSTREAM is enabled.

     Arnd


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-17 22:14             ` Ankur Arora
@ 2025-08-18 17:55               ` Catalin Marinas
  2025-08-18 19:15                 ` Ankur Arora
  0 siblings, 1 reply; 26+ messages in thread
From: Catalin Marinas @ 2025-08-18 17:55 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, bpf, arnd, will,
	peterz, akpm, mark.rutland, harisokn, cl, ast, memxor,
	zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
	konrad.wilk, rafael, daniel.lezcano

On Sun, Aug 17, 2025 at 03:14:26PM -0700, Ankur Arora wrote:
> So, I tried to pare back the code and the following (untested) is
> what I came up with. Given the straight-forward rate-limiting, and the
> current users not needing accurate timekeeping, this uses a
> bool time_check_expr. Figured I'd keep it simple until someone actually
> needs greater complexity as you suggested.
> 
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..e8793347a395 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -273,6 +273,34 @@ do {                                                                       \
>  })
>  #endif
> 
> +
> +#ifndef SMP_TIMEWAIT_SPIN_COUNT
> +#define SMP_TIMEWAIT_SPIN_COUNT                200
> +#endif
> +
> +#ifndef smp_cond_load_relaxed_timewait
> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr,                 \
> +                                       time_check_expr)                \
> +({                                                                     \
> +       typeof(ptr) __PTR = (ptr);                                      \
> +       __unqual_scalar_typeof(*ptr) VAL;                               \
> +       u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_COUNT;                  \
> +                                                                       \
> +       for (;;) {                                                      \
> +               VAL = READ_ONCE(*__PTR);                                \
> +               if (cond_expr)                                          \
> +                       break;                                          \
> +               cpu_relax();                                            \
> +               if (++__n < __spin)                                     \
> +                       continue;                                       \
> +               if ((time_check_expr))                                  \
> +                       break;                                          \
> +               __n = 0;                                                \
> +       }                                                               \
> +       (typeof(*ptr))VAL;                                              \
> +})
> +#endif

This looks fine, at least as it would be used by poll_idle(). The only
reason for not folding time_check_expr into cond_expr is the poll_idle()
requirement to avoid calling time_check_expr too often.

> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index f5801b0ba9e9..c9934ab68da2 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -219,6 +219,43 @@ do {                                                                       \
>         (typeof(*ptr))VAL;                                              \
>  })
> 
> +extern bool arch_timer_evtstrm_available(void);
> +
> +#ifndef SMP_TIMEWAIT_SPIN_COUNT
> +#define SMP_TIMEWAIT_SPIN_COUNT                200
> +#endif
> +
> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr,                 \
> +                                         time_check_expr)              \
> +({                                                                     \
> +       typeof(ptr) __PTR = (ptr);                                      \
> +       __unqual_scalar_typeof(*ptr) VAL;                               \
> +       u32 __n = 0, __spin = 0;                                        \
> +       bool __wfet = alternative_has_cap_unlikely(ARM64_HAS_WFXT);     \
> +       bool __wfe = arch_timer_evtstrm_available();                    \
> +       bool __wait = false;                                            \
> +                                                                       \
> +       if (__wfet || __wfe)                                            \
> +               __wait = true;                                          \
> +       else                                                            \
> +               __spin = SMP_TIMEWAIT_SPIN_COUNT;                       \
> +                                                                       \
> +       for (;;) {                                                      \
> +               VAL = READ_ONCE(*__PTR);                                \
> +               if (cond_expr)                                          \
> +                       break;                                          \
> +               cpu_relax();                                            \
> +               if (++__n < __spin)                                     \
> +                       continue;                                       \
> +               if ((time_check_expr))                                  \
> +                       break;                                          \
> +               if (__wait)                                             \
> +                       __cmpwait_relaxed(__PTR, VAL);                  \
> +               __n = 0;                                                \
> +       }                                                               \
> +       (typeof(*ptr))VAL;                                              \
> +})

For arm64, I wouldn't bother with the spin count. Since cpu_relax()
doesn't do anything, I doubt it makes any difference, especially as we
are likely to use WFE anyway. If we do add one, I'd like it backed by
some numbers to show it makes a difference in practice.

The question is whether 100us granularity is good enough for poll_idle()
(I came to the conclusion it's fine for rqspinlock, given their 1ms
deadlock check).

>  #include <asm-generic/barrier.h>
> 
> __cmpwait_relaxed() will need adjustment to set a deadline for WFET.

Yeah, __cmpwait_relaxed() doesn't use WFET as it doesn't need a timeout
(it just happens to have one with the event stream).

We could extend this or create a new one that uses WFET and takes an
argument. If extending this one, a timeout argument of 0 could, for
example, mean WFE, and a non-zero value the WFET deadline in cycles.
This adds a couple more instructions.

What I had in mind, with time_expr as a ktime_t, would be something like:

	for (;;) {
		VAL = READ_ONCE(*__PTR);
		if (cond_expr)
			break;

		cycles = some_func_of(time_expr);	// see __udelay()
		if (cycles <= 0)
			break;

		if (__wfet) {
			__cmpwait_relaxed(__PTR, VAL, get_cycles() + cycles);
		} else if (__wfe && cycles >= timer_evt_period) {
			__cmpwait_relaxed(__PTR, VAL, 0);
		} else {
			cpu_relax();
		}
	}

Now, if we don't care about the time check granularity (for now) and
time_check_expr is a bool (this seems to work better for rqspinlock), I
think we could do something like:

	for (;;) {
		VAL = READ_ONCE(*__PTR);
		if (cond_expr)
			break;
		if (time_check_expr)
			break;

		if (__wfe) {
			__cmpwait_relaxed(__PTR, VAL, 0);
		} else if (__wfet) {
			__cmpwait_relaxed(__PTR, VAL, get_cycles() + timer_evt_period);
		} else {
			cpu_relax();
		}
	}

We go with WFE first in this case to avoid get_cycles() unnecessarily.

I'd suggest we add the WFET support in __cmpwait_relaxed() (or a
different function) as a separate patch, doesn't even need to be part of
this series. WFE is good enough to get things moving. WFET will only
make a difference if (1) we disable the event stream or (2) we need
better accuracy of the timeout.

> AFAICT the rqspinlock code should be able to work by specifying something
> like:
>   ((ktime_get_mono_fast_ns() > tval) || deadlock_check(&lock_context))
> as the time_check_expr.

Why not the whole RES_CHECK_TIMEOUT(...) as in rqspinlock.c? It does the
deadlock check only after a timeout over a millisecond. Just follow the
res_atomic_cond_read_acquire() calls but replace '||' with a comma.
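
Concretely, the conversion would look something like this (sketch only;
the names and arguments follow rqspinlock.c from memory and may not
match exactly):

	/* today (roughly): the timeout/deadlock check is folded into the condition */
	res_atomic_cond_read_acquire(&lock->val,
			!(VAL & _Q_LOCKED_PENDING_MASK) ||
			RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK));

	/* with the new interface: the '||' becomes a comma */
	smp_cond_load_acquire_timewait(&lock->val.counter,
			!(VAL & _Q_LOCKED_PENDING_MASK),
			RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK));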

> I think they also want to rate limit how often deadlock_check() is
> called, so they can redefine SMP_TIMEWAIT_SPIN_COUNT to some large
> value for arm64.

Everyone would want a different rate of checking other stuff, so I think
this needs to go in their time_check_expr.

-- 
Catalin


* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-18 11:51             ` Arnd Bergmann
@ 2025-08-18 18:28               ` Catalin Marinas
  0 siblings, 0 replies; 26+ messages in thread
From: Catalin Marinas @ 2025-08-18 18:28 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Ankur Arora, linux-kernel, Linux-Arch, linux-arm-kernel, bpf,
	Will Deacon, Peter Zijlstra, Andrew Morton, Mark Rutland,
	harisokn, Christoph Lameter (Ampere), Alexei Starovoitov,
	Kumar Kartikeya Dwivedi, zhenglifeng1, xueshuai, Joao Martins,
	Boris Ostrovsky, Konrad Rzeszutek Wilk, Rafael J . Wysocki,
	Daniel Lezcano

On Mon, Aug 18, 2025 at 01:51:28PM +0200, Arnd Bergmann wrote:
> On Thu, Aug 14, 2025, at 15:00, Catalin Marinas wrote:
> > On Wed, Aug 13, 2025 at 06:29:37PM +0200, Arnd Bergmann wrote:
> >> On Wed, Aug 13, 2025, at 18:09, Catalin Marinas wrote:
> >> and virtual machines with CPU overcommit.
> >
> > Not sure it helps here. With vCPU overcommit, KVM enables WFE trapping
> > and the event stream no longer has any effect (it's not like it
> > interrupts the host).
> 
> I would expect a similar overhead for the WFE trapping as for the
> bare-metal hardware case: When the WFE traps, the host has to
> reschedule all guests that are in WFE periodically, while WFET
> with event stream disabled means this can be driven by an accurate
> host timer.

For WFIT, yes, this works: the hypervisor either gets an interrupt or
schedules a timer, so it can tell what the WFI* is going to be woken by.

With WFET, the hypervisor cannot tell if an event was generated by
another vCPU (e.g. clearing of the exclusive monitor) unless it does a
WFE itself. So it can't put the vCPU to sleep based on the WFET timeout.
IIUC, from kvm_handle_wfx(), the only difference is whether it returns
immediately to the vCPU if it timed out or tells the core KVM to
reschedule other vCPUs.

> > That said, my worry is that either broken hardware or software relies on
> > the event stream unknowingly, e.g. someone using WFE in a busy loop. And
> > for hardware errata, we've had a few where the wakeup events don't
> > propagate between clusters, though those we can toggle on a case-by-case
> > basis.
> 
> Don't we already support hardware without a functional architected
> timer anyway? Those don't use the event stream today even when
> CONFIG_ARM_ARCH_TIMER_EVTSTREAM is enabled.

Maybe we still have such hardware around (e.g. errata) but it shouldn't
be the norm, especially if vendors try to follow the *BSA specs.

-- 
Catalin


* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-18 17:55               ` Catalin Marinas
@ 2025-08-18 19:15                 ` Ankur Arora
  2025-08-19 10:34                   ` Catalin Marinas
  0 siblings, 1 reply; 26+ messages in thread
From: Ankur Arora @ 2025-08-18 19:15 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, bpf,
	arnd, will, peterz, akpm, mark.rutland, harisokn, cl, ast, memxor,
	zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
	konrad.wilk, rafael, daniel.lezcano


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Sun, Aug 17, 2025 at 03:14:26PM -0700, Ankur Arora wrote:
>> So, I tried to pare back the code and the following (untested) is
> >> what I came up with. Given the straightforward rate-limiting, and the
>> current users not needing accurate timekeeping, this uses a
>> bool time_check_expr. Figured I'd keep it simple until someone actually
>> needs greater complexity as you suggested.
>>
>> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
>> index d4f581c1e21d..e8793347a395 100644
>> --- a/include/asm-generic/barrier.h
>> +++ b/include/asm-generic/barrier.h
>> @@ -273,6 +273,34 @@ do {                                                                       \
>>  })
>>  #endif
>>
>> +
>> +#ifndef SMP_TIMEWAIT_SPIN_COUNT
>> +#define SMP_TIMEWAIT_SPIN_COUNT                200
>> +#endif
>> +
>> +#ifndef smp_cond_load_relaxed_timewait
>> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr,                 \
>> +                                       time_check_expr)                \
>> +({                                                                     \
>> +       typeof(ptr) __PTR = (ptr);                                      \
>> +       __unqual_scalar_typeof(*ptr) VAL;                               \
>> +       u32 __n = 0, __spin = SMP_TIMEWAIT_SPIN_COUNT;                  \
>> +                                                                       \
>> +       for (;;) {                                                      \
>> +               VAL = READ_ONCE(*__PTR);                                \
>> +               if (cond_expr)                                          \
>> +                       break;                                          \
>> +               cpu_relax();                                            \
>> +               if (++__n < __spin)                                     \
>> +                       continue;                                       \
>> +               if ((time_check_expr))                                  \
>> +                       break;                                          \
>> +               __n = 0;                                                \
>> +       }                                                               \
>> +       (typeof(*ptr))VAL;                                              \
>> +})
>> +#endif
>
> This looks fine, at least as it would be used by poll_idle(). The only
> reason for not folding time_check_expr into cond_expr is the poll_idle()
> requirement to avoid calling time_check_expr too often.

Yeah. Exactly.
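
For reference, the intended poll_idle() call would then look roughly
like this (sketch only, not the actual cpuidle patch; the time_limit
computation is omitted):

	/*
	 * The cheap condition is re-checked on every iteration; the time
	 * check only runs once per SMP_TIMEWAIT_SPIN_COUNT spins (or per
	 * wakeup on arm64).
	 */
	flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
					       VAL & _TIF_NEED_RESCHED,
					       local_clock_noinstr() >= time_limit);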

>> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
>> index f5801b0ba9e9..c9934ab68da2 100644
>> --- a/arch/arm64/include/asm/barrier.h
>> +++ b/arch/arm64/include/asm/barrier.h
>> @@ -219,6 +219,43 @@ do {                                                                       \
>>         (typeof(*ptr))VAL;                                              \
>>  })
>>
>> +extern bool arch_timer_evtstrm_available(void);
>> +
>> +#ifndef SMP_TIMEWAIT_SPIN_COUNT
>> +#define SMP_TIMEWAIT_SPIN_COUNT                200
>> +#endif
>> +
>> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr,                 \
>> +                                         time_check_expr)              \
>> +({                                                                     \
>> +       typeof(ptr) __PTR = (ptr);                                      \
>> +       __unqual_scalar_typeof(*ptr) VAL;                               \
>> +       u32 __n = 0, __spin = 0;                                        \
>> +       bool __wfet = alternative_has_cap_unlikely(ARM64_HAS_WFXT);     \
>> +       bool __wfe = arch_timer_evtstrm_available();                    \
>> +       bool __wait = false;                                            \
>> +                                                                       \
>> +       if (__wfet || __wfe)                                            \
>> +               __wait = true;                                          \
>> +       else                                                            \
>> +               __spin = SMP_TIMEWAIT_SPIN_COUNT;                       \
>> +                                                                       \
>> +       for (;;) {                                                      \
>> +               VAL = READ_ONCE(*__PTR);                                \
>> +               if (cond_expr)                                          \
>> +                       break;                                          \
>> +               cpu_relax();                                            \
>> +               if (++__n < __spin)                                     \
>> +                       continue;                                       \
>> +               if ((time_check_expr))                                  \
>> +                       break;                                          \
>> +               if (__wait)                                             \
>> +                       __cmpwait_relaxed(__PTR, VAL);                  \
>> +               __n = 0;                                                \
>> +       }                                                               \
>> +       (typeof(*ptr))VAL;                                              \
>> +})
>
> For arm64, I wouldn't bother with the spin count. Since cpu_relax()
> doesn't do anything, I doubt it makes any difference, especially as we
> are likely to use WFE anyway. If we do add one, I'd like it backed by
> some numbers to show it makes a difference in practice.

Okay. That makes sense.

> The question is whether 100us granularity is good enough for poll_idle()
> (I came to the conclusion it's fine for rqspinlock, given their 1ms
> deadlock check).
>
>>  #include <asm-generic/barrier.h>
>>
>> __cmpwait_relaxed() will need adjustment to set a deadline for WFET.
>
> Yeah, __cmpwait_relaxed() doesn't use WFET as it doesn't need a timeout
> (it just happens to have one with the event stream).
>
> We could extend this or create a new one that uses WFET and takes an
> argument. If extending this one, for example a timeout argument of 0
> means WFE, non-zero means WFET cycles. This adds a couple more
> instructions.

Though then we would need an ALTERNATIVE for WFET to fall back to WFE
where it is not available. This is a minor point, but how about just
always using WFE or WFET as appropriate, instead of choosing between the
two based on etime?

  static inline void __cmpwait_case_##sz(volatile void *ptr,              \
                                  unsigned long val,                      \
                                  unsigned long etime)                    \
  {                                                                       \
          unsigned long tmp;                                              \
                                                                          \
          const unsigned long ecycles = xloops_to_cycles(nsecs_to_xloops(etime)); \
          asm volatile(                                                   \
          "       sevl\n"                                                 \
          "       wfe\n"                                                  \
          "       ldxr" #sfx "\t%" #w "[tmp], %[v]\n"                     \
          "       eor     %" #w "[tmp], %" #w "[tmp], %" #w "[val]\n"     \
          "       cbnz    %" #w "[tmp], 1f\n"                             \
          ALTERNATIVE("wfe\n",                                            \
                  "msr s0_3_c1_c0_0, %[ecycles]\n",                       \
                  ARM64_HAS_WFXT)                                         \
          "1:"                                                            \
          : [tmp] "=&r" (tmp), [v] "+Q" (*(u##sz *)ptr)                   \
          : [val] "r" (val), [ecycles] "r" (ecycles));                    \
  }

This would cause us to compute the end time unnecessarily for WFE but,
given that nothing will use the output of that computation, wouldn't
WFE be able to execute before the result of that computation is available?
(Though I guess WFE is somewhat special, so the usual rules might not
apply.)

> What I had in mind, with time_expr being a ktime_t, was something like:
>
> 	for (;;) {
> 		VAL = READ_ONCE(*__PTR);
> 		if (cond_expr)
> 			break;
>
> 		cycles = some_func_of(time_expr);	// see __udelay()
> 		if (cycles <= 0)
> 			break;
>
> 		if (__wfet) {
> 			__cmpwait_relaxed(__PTR, VAL, get_cycles() + cycles);
> 		} else if (__wfe && cycles >= timer_evt_period) {
> 			__cmpwait_relaxed(__PTR, VAL, 0);
> 		} else {
> 			cpu_relax();
> 		}
> 	}
>
> Now, if we don't care about the time check granularity (for now) and
> time_check_expr is a bool (this seems to work better for rqspinlock), I
> think we could do something like:
>
> 	for (;;) {
> 		VAL = READ_ONCE(*__PTR);
> 		if (cond_expr)
> 			break;
> 		if (time_check_expr)
> 			break;
>
> 		if (__wfe) {
> 			__cmpwait_relaxed(__PTR, VAL, 0);
> 		} else if (__wfet) {
> 			__cmpwait_relaxed(__PTR, VAL, get_cycles() + timer_evt_period);
> 		} else {
> 			cpu_relax();
> 		}
> 	}

Yeah. This is simpler.

> We go with WFE first in this case to avoid get_cycles() unnecessarily.

Agreed.

> I'd suggest we add the WFET support in __cmpwait_relaxed() (or a
> different function) as a separate patch, doesn't even need to be part of
> this series. WFE is good enough to get things moving. WFET will only

Agreed.

> make a difference if (1) we disable the event stream or (2) we need
> better accuracy of the timeout.
>
>> AFAICT the rqspinlock code should be able to work by specifying something
>> like:
>>   ((ktime_get_mono_fast_ns() > tval) || deadlock_check(&lock_context))
>> as the time_check_expr.
>
> Why not the whole RES_CHECK_TIMEOUT(...) as in rqspinlock.c? It does the
> deadlock check only after a timeout over a millisecond. Just follow the
> res_atomic_cond_read_acquire() calls but replace '||' with a comma.

Yeah, for this series that's exactly what I was looking to do.

For later I think the rqspinlock code can be simplified so there
are fewer layers of checking.

>> I think they also want to rate limit how often deadlock_check() is
>> called, so they can redefine SMP_TIMEWAIT_SPIN_COUNT to some large
>> value for arm64.
>
> Everyone would want a different rate of checking other stuff, so I think
> this needs to go in their time_check_expr.

Yeah that makes a lot of sense. Trying to address all of that in this
interface just makes that more complicated.
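
(For example, a caller that wants an expensive check to run even less
often can rate limit it inside the expression itself; a rough sketch,
with my_deadlock_check() and the surrounding names being hypothetical:)

	u32 cnt = 0;

	/* time_check_expr only terminates the wait when both parts fire */
	val = smp_cond_load_relaxed_timewait(&lock->locked, !VAL,
			!(++cnt % 1024) && my_deadlock_check(&lock_context));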

Thanks for all the comments!

--
ankur


* Re: [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-08-18 19:15                 ` Ankur Arora
@ 2025-08-19 10:34                   ` Catalin Marinas
  0 siblings, 0 replies; 26+ messages in thread
From: Catalin Marinas @ 2025-08-19 10:34 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, bpf, arnd, will,
	peterz, akpm, mark.rutland, harisokn, cl, ast, memxor,
	zhenglifeng1, xueshuai, joao.m.martins, boris.ostrovsky,
	konrad.wilk, rafael, daniel.lezcano

On Mon, Aug 18, 2025 at 12:15:29PM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
> > On Sun, Aug 17, 2025 at 03:14:26PM -0700, Ankur Arora wrote:
> >> __cmpwait_relaxed() will need adjustment to set a deadline for WFET.
> >
> > Yeah, __cmpwait_relaxed() doesn't use WFET as it doesn't need a timeout
> > (it just happens to have one with the event stream).
> >
> > We could extend this or create a new one that uses WFET and takes an
> > argument. If extending this one, for example a timeout argument of 0
> > means WFE, non-zero means WFET cycles. This adds a couple more
> > instructions.
> 
> Though then we would need an ALTERNATIVE for WFET to fall back to WFE
> where it is not available. This is a minor point, but how about just
> always using WFE or WFET as appropriate, instead of choosing between the
> two based on etime?
> 
>   static inline void __cmpwait_case_##sz(volatile void *ptr,              \
>                                   unsigned long val,                      \
>                                   unsigned long etime)                    \
>   {                                                                       \
>           unsigned long tmp;                                              \
>                                                                           \
>           const unsigned long ecycles = xloops_to_cycles(nsecs_to_xloops(etime)); \
>           asm volatile(                                                   \
>           "       sevl\n"                                                 \
>           "       wfe\n"                                                  \
>           "       ldxr" #sfx "\t%" #w "[tmp], %[v]\n"                     \
>           "       eor     %" #w "[tmp], %" #w "[tmp], %" #w "[val]\n"     \
>           "       cbnz    %" #w "[tmp], 1f\n"                             \
>           ALTERNATIVE("wfe\n",                                            \
>                   "msr s0_3_c1_c0_0, %[ecycles]\n",                       \
>                   ARM64_HAS_WFXT)                                         \
>           "1:"                                                            \
>           : [tmp] "=&r" (tmp), [v] "+Q" (*(u##sz *)ptr)                   \
>           : [val] "r" (val), [ecycles] "r" (ecycles));                    \
>   }
> 
> This would cause us to compute the end time unnecessarily for WFE but,
> given that nothing will use the output of that computation, wouldn't
> WFE be able to execute before the result of that computation is available?
> (Though I guess WFE is somewhat special, so the usual rules might not
> apply.)

The compiler cannot tell what is happening inside the asm block, so it
will compute ecycles and place it in a register before the asm. The
hardware won't do anything smarter either, such as skipping the
computation just because the register holding ecycles is never used (or
is rewritten later). So I wouldn't want to penalise the existing
smp_cond_load_acquire(), which only needs a WFE.

We could patch WFET in and always pass -1UL in the non-timeout case but
I think we are better off just duplicating the whole thing. It's going
to be inlined anyway, so it's not like we end up with lots of these
functions.
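
Something along these lines, perhaps (untested sketch mirroring the
existing __CMPWAIT_CASE() shape; it assumes the caller has already
checked ARM64_HAS_WFXT and passes an absolute counter value as the
deadline):

	#define __CMPWAIT_TIMEOUT_CASE(w, sfx, sz)				\
	static inline void __cmpwait_timeout_case_##sz(volatile void *ptr,	\
							unsigned long val,	\
							u64 deadline)		\
	{									\
		unsigned long tmp;						\
										\
		asm volatile(							\
		"	sevl\n"							\
		"	wfe\n"							\
		"	ldxr" #sfx "\t%" #w "[tmp], %[v]\n"			\
		"	eor	%" #w "[tmp], %" #w "[tmp], %" #w "[val]\n"	\
		"	cbnz	%" #w "[tmp], 1f\n"				\
		"	msr	s0_3_c1_c0_0, %[deadline]\n"	/* wfet */	\
		"1:"								\
		: [tmp] "=&r" (tmp), [v] "+Q" (*(u##sz *)ptr)			\
		: [val] "r" (val), [deadline] "r" (deadline));			\
	}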

-- 
Catalin


end of thread, other threads:[~2025-08-19 12:39 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-27  4:48 [PATCH v3 0/5] barrier: Add smp_cond_load_*_timewait() Ankur Arora
2025-06-27  4:48 ` [PATCH v3 1/5] asm-generic: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
2025-08-08 10:51   ` Catalin Marinas
2025-08-11 21:15     ` Ankur Arora
2025-08-13 16:09       ` Catalin Marinas
2025-08-13 16:29         ` Arnd Bergmann
2025-08-13 16:54           ` Christoph Lameter (Ampere)
2025-08-14 13:00           ` Catalin Marinas
2025-08-18 11:51             ` Arnd Bergmann
2025-08-18 18:28               ` Catalin Marinas
2025-08-14  7:30         ` Ankur Arora
2025-08-14 11:39           ` Catalin Marinas
2025-08-17 22:14             ` Ankur Arora
2025-08-18 17:55               ` Catalin Marinas
2025-08-18 19:15                 ` Ankur Arora
2025-08-19 10:34                   ` Catalin Marinas
2025-06-27  4:48 ` [PATCH v3 2/5] asm-generic: barrier: Handle spin-wait in smp_cond_load_relaxed_timewait() Ankur Arora
2025-06-27  4:48 ` [PATCH v3 3/5] asm-generic: barrier: Add smp_cond_load_acquire_timewait() Ankur Arora
2025-08-08  9:38   ` Catalin Marinas
2025-08-12  5:18     ` Ankur Arora
2025-06-27  4:48 ` [PATCH v3 4/5] arm64: barrier: Support waiting in smp_cond_load_relaxed_timewait() Ankur Arora
2025-06-27  4:48 ` [PATCH v3 5/5] arm64: barrier: Handle " Ankur Arora
2025-06-30 16:33   ` Christoph Lameter (Ampere)
2025-06-30 21:05     ` Ankur Arora
2025-07-01  5:55       ` Ankur Arora
2025-07-28 19:03 ` [PATCH v3 0/5] barrier: Add smp_cond_load_*_timewait() Ankur Arora
