public inbox for linux-kernel@vger.kernel.org
* [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
@ 2026-03-16  1:36 Ankur Arora
  2026-03-16  1:36 ` [PATCH v10 01/12] asm-generic: barrier: Add smp_cond_load_relaxed_timeout() Ankur Arora
                   ` (12 more replies)
  0 siblings, 13 replies; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

Hi,

This series adds waited variants of the smp_cond_load() primitives:
smp_cond_load_relaxed_timeout(), and smp_cond_load_acquire_timeout().

With this version, the main remaining things are:

  - Review by PeterZ of the new interface tif_need_resched_relaxed_wait()
    (patch 11, "sched: add need-resched timed wait interface").

  - Review of the BPF changes. This version simplifies the rqspinlock
    changes by reusing the original error handling path
    (patches 9, 10 "bpf/rqspinlock: switch check_timeout() to a clock
    interface", "bpf/rqspinlock: Use smp_cond_load_acquire_timeout()").

  - Review of WFET handling. (patch 4, "arm64: support WFET in
    smp_cond_load_relaxed_timeout()").

The new interfaces are meant for contexts where you want to wait on a
condition variable for a finite duration. This is easy enough to do with
a loop around cpu_relax(). There are, however, architectures (e.g. arm64)
that allow waiting on a cacheline instead.

So, these interfaces handle a mixture of spin/wait with a
smp_cond_load() thrown in. The interfaces are:

   smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr, timeout)
   smp_cond_load_acquire_timeout(ptr, cond_expr, time_expr, timeout)

The parameters time_expr and timeout determine when to bail out.

Also add tif_need_resched_relaxed_wait(), which wraps the polling
pattern used in poll_idle() and abstracts away details of the interface
and of the scheduler.

In addition, add atomic_cond_read_*_timeout(), atomic64_cond_read_*_timeout(),
and atomic_long wrappers for the interfaces.

Finally update poll_idle() and resilient queued spinlocks to use them.

Changelog:
  v9 [9]:
   - s/@cond/@cond_expr/ (Randy Dunlap)
   - Clarify that SMP_TIMEOUT_POLL_COUNT is only around memory
     addresses. (David Laight)
   - Add the missing config ARCH_HAS_CPU_RELAX in arch/arm64/Kconfig.
     (Catalin Marinas).
   - Switch to arch_counter_get_cntvct_stable() (via __delay_cycles())
     in the cmpwait path instead of using arch_timer_read_counter().
     (Catalin Marinas)

  v8 [0]:
   - Defer evaluation of @time_expr_ns to when we hit the slowpath.
      (comment from Alexei Starovoitov).

   - Mention that cpu_poll_relax() is better than raw CPU polling
     only where ARCH_HAS_CPU_RELAX is defined.
     - also define ARCH_HAS_CPU_RELAX for arm64.
      (Came out of a discussion with Will Deacon.)

   - Split out WFET and WFE handling. I was doing both of these
     in a common handler.
     (From Will Deacon and in an earlier revision by Catalin Marinas.)

   - Add mentions of atomic_cond_read_{relaxed,acquire}(),
     atomic_cond_read_{relaxed,acquire}_timeout() in
     Documentation/atomic_t.txt.

   - Use the BIT() macro to do the checking in tif_bitset_relaxed_wait().

   - Cleanup unnecessary assignments, casts etc in poll_idle().
     (From Rafael Wysocki.)

   - Fixup warnings from kernel build robot


  v7 [1]:
   - change the interface to separately provide the timeout. This is
     useful for supporting WFET and similar primitives which can do
     timed waiting (suggested by Arnd Bergmann).

   - Adapting rqspinlock code to this changed interface also
     necessitated allowing time_expr to fail.
   - rqspinlock changes to adapt to the new smp_cond_load_acquire_timeout().

   - add WFET support (suggested by Arnd Bergmann).
   - add support for atomic-long wrappers.
   - add a new scheduler interface tif_need_resched_relaxed_wait() which
     encapsulates the polling logic used by poll_idle().
     - interface suggested by Rafael J. Wysocki.


  v6 [2]:
   - fixup missing timeout parameters in atomic64_cond_read_*_timeout()
   - remove a race between setting of TIF_NEED_RESCHED and the call to
     smp_cond_load_relaxed_timeout(). This would mean that dev->poll_time_limit
     would be set even if we hadn't spent any time waiting.
     (The original check compared against local_clock(), which would have been
     fine, but I was instead using a cheaper check against _TIF_NEED_RESCHED.)
   (Both from meta-CI bot)


  v5 [3]:
   - use cpu_poll_relax() instead of cpu_relax().
   - instead of defining an arm64 specific
     smp_cond_load_relaxed_timeout(), just define the appropriate
     cpu_poll_relax().
   - re-read the target pointer when we exit due to the time-check.
   - s/SMP_TIMEOUT_SPIN_COUNT/SMP_TIMEOUT_POLL_COUNT/
   (Suggested by Will Deacon)

   - add atomic_cond_read_*_timeout() and atomic64_cond_read_*_timeout()
     interfaces.
   - rqspinlock: use atomic_cond_read_acquire_timeout().
   - cpuidle: use smp_cond_load_relaxed_timeout() for polling.
   (Suggested by Catalin Marinas)

   - rqspinlock: define SMP_TIMEOUT_POLL_COUNT to be 16k for non arm64


  v4 [4]:
    - naming change 's/timewait/timeout/'
    - resilient spinlocks: get rid of res_smp_cond_load_acquire_waiting()
      and fixup use of RES_CHECK_TIMEOUT().
    (Both suggested by Catalin Marinas)

  v3 [5]:
    - further interface simplifications (suggested by Catalin Marinas)

  v2 [6]:
    - simplified the interface (suggested by Catalin Marinas)
       - get rid of wait_policy, and a multitude of constants
       - adds a slack parameter
      This helped remove a fair amount of code duplication and, in
      hindsight, unnecessary constants.

  v1 [7]:
     - add wait_policy (coarse and fine)
     - derive spin-count etc at runtime instead of using arbitrary
       constants.

Haris Okanovic tested v4 of this series with poll_idle()/haltpoll patches. [8]

Comments appreciated!

Thanks
Ankur

 [0] https://lore.kernel.org/lkml/20251215044919.460086-1-ankur.a.arora@oracle.com/
 [1] https://lore.kernel.org/lkml/20251028053136.692462-1-ankur.a.arora@oracle.com/
 [2] https://lore.kernel.org/lkml/20250911034655.3916002-1-ankur.a.arora@oracle.com/
 [3] https://lore.kernel.org/lkml/20250911034655.3916002-1-ankur.a.arora@oracle.com/
 [4] https://lore.kernel.org/lkml/20250829080735.3598416-1-ankur.a.arora@oracle.com/
 [5] https://lore.kernel.org/lkml/20250627044805.945491-1-ankur.a.arora@oracle.com/
 [6] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/
 [7] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
 [8] https://lore.kernel.org/lkml/2cecbf7fb23ee83a4ce027e1be3f46f97efd585c.camel@amazon.com/
 [9] https://lore.kernel.org/lkml/20260209023153.2661784-1-ankur.a.arora@oracle.com/

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: bpf@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-pm@vger.kernel.org

Ankur Arora (12):
  asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
  arm64: barrier: Support smp_cond_load_relaxed_timeout()
  arm64/delay: move some constants out to a separate header
  arm64: support WFET in smp_cond_load_relaxed_timeout()
  arm64: rqspinlock: Remove private copy of
    smp_cond_load_acquire_timewait()
  asm-generic: barrier: Add smp_cond_load_acquire_timeout()
  atomic: Add atomic_cond_read_*_timeout()
  locking/atomic: scripts: build atomic_long_cond_read_*_timeout()
  bpf/rqspinlock: switch check_timeout() to a clock interface
  bpf/rqspinlock: Use smp_cond_load_acquire_timeout()
  sched: add need-resched timed wait interface
  cpuidle/poll_state: Wait for need-resched via
    tif_need_resched_relaxed_wait()

 Documentation/atomic_t.txt           | 14 +++--
 arch/arm64/Kconfig                   |  3 +
 arch/arm64/include/asm/barrier.h     | 23 +++++++
 arch/arm64/include/asm/cmpxchg.h     | 62 +++++++++++++++----
 arch/arm64/include/asm/delay-const.h | 27 +++++++++
 arch/arm64/include/asm/rqspinlock.h  | 85 --------------------------
 arch/arm64/lib/delay.c               | 15 ++---
 drivers/cpuidle/poll_state.c         | 21 +------
 drivers/soc/qcom/rpmh-rsc.c          |  8 +--
 include/asm-generic/barrier.h        | 90 ++++++++++++++++++++++++++++
 include/linux/atomic.h               | 10 ++++
 include/linux/atomic/atomic-long.h   | 18 +++---
 include/linux/sched/idle.h           | 29 +++++++++
 kernel/bpf/rqspinlock.c              | 77 +++++++++++++++---------
 scripts/atomic/gen-atomic-long.sh    | 16 +++--
 15 files changed, 320 insertions(+), 178 deletions(-)
 create mode 100644 arch/arm64/include/asm/delay-const.h

-- 
2.31.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v10 01/12] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
@ 2026-03-16  1:36 ` Ankur Arora
  2026-03-16  1:36 ` [PATCH v10 02/12] arm64: barrier: Support smp_cond_load_relaxed_timeout() Ankur Arora
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

Add smp_cond_load_relaxed_timeout(), which extends
smp_cond_load_relaxed() to allow waiting for a duration.

We loop around waiting for the condition variable to change while
periodically doing a time-check. The loop uses cpu_poll_relax() to slow
down the busy-wait, which, unless overridden by the architecture
code, amounts to a cpu_relax().

Note that there are two ways for the time-check to fail: the timeout
expiring, or @time_expr_ns returning an invalid value (negative or
zero). The second failure mode allows clocks attached to the
clock-domain of @cond_expr -- which might cease to operate meaningfully
once some state internal to @cond_expr has changed -- to fail.

Evaluation of @time_expr_ns: in the fastpath we want to keep the
performance close to smp_cond_load_relaxed(). So defer evaluation
of the potentially costly @time_expr_ns to the slowpath.

This also means that there will always be some hardware-dependent
duration that has passed in cpu_poll_relax() iterations by the time
of the first evaluation. Additionally, cpu_poll_relax() is not
guaranteed to return at the timeout boundary. In sum, expect timeout
overshoot when we exit due to expiration of the timeout.

The number of spin iterations before a time-check,
SMP_TIMEOUT_POLL_COUNT, is chosen to be 200 by default. With a
cpu_poll_relax() iteration taking ~20-30 cycles (measured on a variety
of x86 platforms), we expect a time-check every ~4000-6000 cycles.

The outer limit of the overshoot is double that when working with the
parameters above. This might be higher or lower depending on the
implementation of cpu_poll_relax() across architectures.
Note that this analysis makes an assumption that evaluation of the
loop condition, @cond_expr, is relatively cheap. This might not be
the case if that involves MMIO access.

Lastly, config option ARCH_HAS_CPU_RELAX indicates availability of a
cpu_poll_relax() that is cheaper than polling. This might be relevant
for cases with a long timeout.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
Notes:
   - s/@cond/@cond_expr/ in the comment describing the macro (Randy Dunlap)
   - Clarify that SMP_TIMEOUT_POLL_COUNT is only around memory
     addresses. (David Laight)
   - Minor cleanups to the commit message.

 include/asm-generic/barrier.h | 64 +++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..8e6b85c3ed99 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,70 @@ do {									\
 })
 #endif
 
+/*
+ * Number of times we iterate in the loop before doing the time check.
+ * Note that the iteration count assumes that the loop condition is
+ * relatively cheap. This might not be true if that involves MMIO access.
+ */
+#ifndef SMP_TIMEOUT_POLL_COUNT
+#define SMP_TIMEOUT_POLL_COUNT		200
+#endif
+
+/*
+ * Platforms with ARCH_HAS_CPU_RELAX have a cpu_poll_relax() implementation
+ * that is expected to be cheaper (lower power) than pure polling.
+ */
+#ifndef cpu_poll_relax
+#define cpu_poll_relax(ptr, val, timeout_ns)	cpu_relax()
+#endif
+
+/**
+ * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
+ * guarantees until a timeout expires.
+ * @ptr: pointer to the variable to wait on.
+ * @cond_expr: boolean expression to wait for.
+ * @time_expr_ns: expression that evaluates to monotonic time (in ns) or,
+ *  on failure, returns a negative value.
+ * @timeout_ns: timeout value in ns
+ * Both of the above are assumed to be compatible with s64; the signed
+ * value is used to handle the failure case in @time_expr_ns.
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ *
+ * Callers that expect to wait for prolonged durations might want to
+ * take into account the availability of ARCH_HAS_CPU_RELAX.
+ */
+#ifndef smp_cond_load_relaxed_timeout
+#define smp_cond_load_relaxed_timeout(ptr, cond_expr,			\
+				      time_expr_ns, timeout_ns)		\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT;			\
+	s64 __timeout = (s64)timeout_ns;				\
+	s64 __time_now, __time_end = 0;					\
+									\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		cpu_poll_relax(__PTR, VAL, (u64)__timeout);		\
+		if (++__n < __spin)					\
+			continue;					\
+		__time_now = (s64)(time_expr_ns);			\
+		if (unlikely(__time_end == 0))				\
+			__time_end = __time_now + __timeout;		\
+		__timeout = __time_end - __time_now;			\
+		if (__time_now <= 0 || __timeout <= 0) {		\
+			VAL = READ_ONCE(*__PTR);			\
+			break;						\
+		}							\
+		__n = 0;						\
+	}								\
+	(typeof(*ptr))VAL;						\
+})
+#endif
+
 /*
  * pmem_wmb() ensures that all stores for which the modification
  * are written to persistent storage by preceding instructions have
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v10 02/12] arm64: barrier: Support smp_cond_load_relaxed_timeout()
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
  2026-03-16  1:36 ` [PATCH v10 01/12] asm-generic: barrier: Add smp_cond_load_relaxed_timeout() Ankur Arora
@ 2026-03-16  1:36 ` Ankur Arora
  2026-03-16  1:36 ` [PATCH v10 03/12] arm64/delay: move some constants out to a separate header Ankur Arora
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

Support waiting in smp_cond_load_relaxed_timeout() via
__cmpwait_relaxed(). To ensure that we wake from waiting in
WFE periodically and don't block forever if there are no stores
to ptr, this path is only used when the event-stream is enabled.

Note that when using __cmpwait_relaxed() we ignore the timeout
value, allowing an overshoot of up to the event-stream period.
And, in the unlikely event that the event-stream is unavailable,
we fall back to spin-waiting.

Also set SMP_TIMEOUT_POLL_COUNT to 1 so we do the time-check in
each iteration of smp_cond_load_relaxed_timeout().

And finally define ARCH_HAS_CPU_RELAX to indicate that we have
an optimized implementation of cpu_poll_relax().

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Suggested-by: Will Deacon <will@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
Note:
   - Add the missing config ARCH_HAS_CPU_RELAX in arch/arm64/Kconfig.
     (Catalin Marinas).

 arch/arm64/Kconfig               |  3 +++
 arch/arm64/include/asm/barrier.h | 21 +++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 38dba5f7e4d2..a3b42d9beed5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1627,6 +1627,9 @@ config ARCH_SUPPORTS_CRASH_DUMP
 config ARCH_DEFAULT_CRASH_DUMP
 	def_bool y
 
+config ARCH_HAS_CPU_RELAX
+	def_bool y
+
 config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
 	def_bool CRASH_RESERVE
 
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 9495c4441a46..6190e178db51 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -12,6 +12,7 @@
 #include <linux/kasan-checks.h>
 
 #include <asm/alternative-macros.h>
+#include <asm/vdso/processor.h>
 
 #define __nops(n)	".rept	" #n "\nnop\n.endr\n"
 #define nops(n)		asm volatile(__nops(n))
@@ -219,6 +220,26 @@ do {									\
 	(typeof(*ptr))VAL;						\
 })
 
+/* Re-declared here to avoid include dependency. */
+extern bool arch_timer_evtstrm_available(void);
+
+/*
+ * In the common case, cpu_poll_relax() sits waiting in __cmpwait_relaxed()
+ * for the ptr value to change.
+ *
+ * Since this period is reasonably long, choose SMP_TIMEOUT_POLL_COUNT
+ * to be 1, so smp_cond_load_{relaxed,acquire}_timeout() does a
+ * time-check in each iteration.
+ */
+#define SMP_TIMEOUT_POLL_COUNT	1
+
+#define cpu_poll_relax(ptr, val, timeout_ns) do {			\
+	if (arch_timer_evtstrm_available())				\
+		__cmpwait_relaxed(ptr, val);				\
+	else								\
+		cpu_relax();						\
+} while (0)
+
 #include <asm-generic/barrier.h>
 
 #endif	/* __ASSEMBLER__ */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v10 03/12] arm64/delay: move some constants out to a separate header
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
  2026-03-16  1:36 ` [PATCH v10 01/12] asm-generic: barrier: Add smp_cond_load_relaxed_timeout() Ankur Arora
  2026-03-16  1:36 ` [PATCH v10 02/12] arm64: barrier: Support smp_cond_load_relaxed_timeout() Ankur Arora
@ 2026-03-16  1:36 ` Ankur Arora
  2026-03-16  1:36 ` [PATCH v10 04/12] arm64: support WFET in smp_cond_load_relaxed_timeout() Ankur Arora
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora, Bjorn Andersson, Konrad Dybcio,
	Christoph Lameter

Move some constants and functions related to xloops/cycles computation
out to a new header. Also make __delay_cycles() available outside of
arch/arm64/lib/delay.c.

Rename some macros in qcom/rpmh-rsc.c that occupied the same
namespace.

No functional change.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Bjorn Andersson <andersson@kernel.org>
Cc: Konrad Dybcio <konradybcio@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
Note:
  - Makes __delay_cycles() available outside arch/arm64/lib/delay.c
    and changes the return type from cycles_t to u64.

 arch/arm64/include/asm/delay-const.h | 27 +++++++++++++++++++++++++++
 arch/arm64/lib/delay.c               | 15 ++++-----------
 drivers/soc/qcom/rpmh-rsc.c          |  8 ++++----
 3 files changed, 35 insertions(+), 15 deletions(-)
 create mode 100644 arch/arm64/include/asm/delay-const.h

diff --git a/arch/arm64/include/asm/delay-const.h b/arch/arm64/include/asm/delay-const.h
new file mode 100644
index 000000000000..cb3988ff4e41
--- /dev/null
+++ b/arch/arm64/include/asm/delay-const.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_DELAY_CONST_H
+#define _ASM_DELAY_CONST_H
+
+#include <asm/param.h>	/* For HZ */
+
+/* 2**32 / 1000000 (rounded up) */
+#define __usecs_to_xloops_mult	0x10C7UL
+
+/* 2**32 / 1000000000 (rounded up) */
+#define __nsecs_to_xloops_mult	0x5UL
+
+extern unsigned long loops_per_jiffy;
+static inline unsigned long xloops_to_cycles(unsigned long xloops)
+{
+	return (xloops * loops_per_jiffy * HZ) >> 32;
+}
+
+#define USECS_TO_CYCLES(time_usecs) \
+	xloops_to_cycles((time_usecs) * __usecs_to_xloops_mult)
+
+#define NSECS_TO_CYCLES(time_nsecs) \
+	xloops_to_cycles((time_nsecs) * __nsecs_to_xloops_mult)
+
+u64 notrace __delay_cycles(void);
+
+#endif	/* _ASM_DELAY_CONST_H */
diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index e278e060e78a..c660a7ea26dd 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -12,17 +12,10 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/timex.h>
+#include <asm/delay-const.h>
 
 #include <clocksource/arm_arch_timer.h>
 
-#define USECS_TO_CYCLES(time_usecs)			\
-	xloops_to_cycles((time_usecs) * 0x10C7UL)
-
-static inline unsigned long xloops_to_cycles(unsigned long xloops)
-{
-	return (xloops * loops_per_jiffy * HZ) >> 32;
-}
-
 /*
  * Force the use of CNTVCT_EL0 in order to have the same base as WFxT.
  * This avoids some annoying issues when CNTVOFF_EL2 is not reset 0 on a
@@ -32,7 +25,7 @@ static inline unsigned long xloops_to_cycles(unsigned long xloops)
  * Note that userspace cannot change the offset behind our back either,
  * as the vcpu mutex is held as long as KVM_RUN is in progress.
  */
-static cycles_t notrace __delay_cycles(void)
+u64 notrace __delay_cycles(void)
 {
 	guard(preempt_notrace)();
 	return __arch_counter_get_cntvct_stable();
@@ -73,12 +66,12 @@ EXPORT_SYMBOL(__const_udelay);
 
 void __udelay(unsigned long usecs)
 {
-	__const_udelay(usecs * 0x10C7UL); /* 2**32 / 1000000 (rounded up) */
+	__const_udelay(usecs * __usecs_to_xloops_mult);
 }
 EXPORT_SYMBOL(__udelay);
 
 void __ndelay(unsigned long nsecs)
 {
-	__const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
+	__const_udelay(nsecs * __nsecs_to_xloops_mult);
 }
 EXPORT_SYMBOL(__ndelay);
diff --git a/drivers/soc/qcom/rpmh-rsc.c b/drivers/soc/qcom/rpmh-rsc.c
index c6f7d5c9c493..ad5ec5c0de0a 100644
--- a/drivers/soc/qcom/rpmh-rsc.c
+++ b/drivers/soc/qcom/rpmh-rsc.c
@@ -146,10 +146,10 @@ enum {
  *  +---------------------------------------------------+
  */
 
-#define USECS_TO_CYCLES(time_usecs)			\
-	xloops_to_cycles((time_usecs) * 0x10C7UL)
+#define RPMH_USECS_TO_CYCLES(time_usecs)		\
+	rpmh_xloops_to_cycles((time_usecs) * 0x10C7UL)
 
-static inline unsigned long xloops_to_cycles(u64 xloops)
+static inline unsigned long rpmh_xloops_to_cycles(u64 xloops)
 {
 	return (xloops * loops_per_jiffy * HZ) >> 32;
 }
@@ -819,7 +819,7 @@ void rpmh_rsc_write_next_wakeup(struct rsc_drv *drv)
 	wakeup_us = ktime_to_us(wakeup);
 
 	/* Convert the wakeup to arch timer scale */
-	wakeup_cycles = USECS_TO_CYCLES(wakeup_us);
+	wakeup_cycles = RPMH_USECS_TO_CYCLES(wakeup_us);
 	wakeup_cycles += arch_timer_read_counter();
 
 exit:
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v10 04/12] arm64: support WFET in smp_cond_load_relaxed_timeout()
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
                   ` (2 preceding siblings ...)
  2026-03-16  1:36 ` [PATCH v10 03/12] arm64/delay: move some constants out to a separate header Ankur Arora
@ 2026-03-16  1:36 ` Ankur Arora
  2026-03-16  1:36 ` [PATCH v10 05/12] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait() Ankur Arora
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

To handle WFET, use __cmpwait_timeout() similarly to __cmpwait(). These
call out to the respective __cmpwait_case_timeout_##sz() and
__cmpwait_case_##sz() functions.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
Notes:
   - Switch to arch_counter_get_cntvct_stable() (via __delay_cycles())
     instead of using arch_timer_read_counter(). (Catalin Marinas)

 arch/arm64/include/asm/barrier.h |  8 +++--
 arch/arm64/include/asm/cmpxchg.h | 62 +++++++++++++++++++++++++-------
 2 files changed, 55 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 6190e178db51..fbd71cd4ef4e 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -224,8 +224,8 @@ do {									\
 extern bool arch_timer_evtstrm_available(void);
 
 /*
- * In the common case, cpu_poll_relax() sits waiting in __cmpwait_relaxed()
- * for the ptr value to change.
+ * In the common case, cpu_poll_relax() sits waiting in __cmpwait_relaxed()/
+ * __cmpwait_relaxed_timeout() for the ptr value to change.
  *
  * Since this period is reasonably long, choose SMP_TIMEOUT_POLL_COUNT
  * to be 1, so smp_cond_load_{relaxed,acquire}_timeout() does a
@@ -234,7 +234,9 @@ extern bool arch_timer_evtstrm_available(void);
 #define SMP_TIMEOUT_POLL_COUNT	1
 
 #define cpu_poll_relax(ptr, val, timeout_ns) do {			\
-	if (arch_timer_evtstrm_available())				\
+	if (alternative_has_cap_unlikely(ARM64_HAS_WFXT))		\
+		__cmpwait_relaxed_timeout(ptr, val, timeout_ns);	\
+	else if (arch_timer_evtstrm_available())			\
 		__cmpwait_relaxed(ptr, val);				\
 	else								\
 		cpu_relax();						\
diff --git a/arch/arm64/include/asm/cmpxchg.h b/arch/arm64/include/asm/cmpxchg.h
index 6cf3cd6873f5..9e4cdc9e41d1 100644
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -12,6 +12,7 @@
 
 #include <asm/barrier.h>
 #include <asm/lse.h>
+#include <asm/delay-const.h>
 
 /*
  * We need separate acquire parameters for ll/sc and lse, since the full
@@ -212,7 +213,8 @@ __CMPXCHG_GEN(_mb)
 
 #define __CMPWAIT_CASE(w, sfx, sz)					\
 static inline void __cmpwait_case_##sz(volatile void *ptr,		\
-				       unsigned long val)		\
+				       unsigned long val,		\
+				       u64 __maybe_unused timeout_ns)	\
 {									\
 	unsigned long tmp;						\
 									\
@@ -235,20 +237,52 @@ __CMPWAIT_CASE( ,  , 64);
 
 #undef __CMPWAIT_CASE
 
-#define __CMPWAIT_GEN(sfx)						\
-static __always_inline void __cmpwait##sfx(volatile void *ptr,		\
-				  unsigned long val,			\
-				  int size)				\
+#define __CMPWAIT_TIMEOUT_CASE(w, sfx, sz)				\
+static inline void __cmpwait_case_timeout_##sz(volatile void *ptr,	\
+					       unsigned long val,	\
+					       u64 timeout_ns)		\
+{									\
+	unsigned long tmp;						\
+	u64 ecycles = __delay_cycles() +				\
+			NSECS_TO_CYCLES(timeout_ns);			\
+	asm volatile(							\
+	"	sevl\n"							\
+	"	wfe\n"							\
+	"	ldxr" #sfx "\t%" #w "[tmp], %[v]\n"			\
+	"	eor	%" #w "[tmp], %" #w "[tmp], %" #w "[val]\n"	\
+	"	cbnz	%" #w "[tmp], 2f\n"				\
+	"	msr s0_3_c1_c0_0, %[ecycles]\n"				\
+	"2:"								\
+	: [tmp] "=&r" (tmp), [v] "+Q" (*(u##sz *)ptr)			\
+	: [val] "r" (val), [ecycles] "r" (ecycles));			\
+}
+
+__CMPWAIT_TIMEOUT_CASE(w, b, 8);
+__CMPWAIT_TIMEOUT_CASE(w, h, 16);
+__CMPWAIT_TIMEOUT_CASE(w,  , 32);
+__CMPWAIT_TIMEOUT_CASE( ,  , 64);
+
+#undef __CMPWAIT_TIMEOUT_CASE
+
+#define __CMPWAIT_GEN(timeout, sfx)					\
+static __always_inline void __cmpwait##timeout##sfx(volatile void *ptr,	\
+						    unsigned long val,	\
+						    u64 timeout_ns,	\
+						    int size)		\
 {									\
 	switch (size) {							\
 	case 1:								\
-		return __cmpwait_case##sfx##_8(ptr, (u8)val);		\
+		return __cmpwait_case##timeout##sfx##_8(ptr, (u8)val,	\
+							timeout_ns);	\
 	case 2:								\
-		return __cmpwait_case##sfx##_16(ptr, (u16)val);		\
+		return __cmpwait_case##timeout##sfx##_16(ptr, (u16)val,	\
+							 timeout_ns);	\
 	case 4:								\
-		return __cmpwait_case##sfx##_32(ptr, val);		\
+		return __cmpwait_case##timeout##sfx##_32(ptr, val,	\
+							 timeout_ns);	\
 	case 8:								\
-		return __cmpwait_case##sfx##_64(ptr, val);		\
+		return __cmpwait_case##timeout##sfx##_64(ptr, val,	\
+							 timeout_ns);	\
 	default:							\
 		BUILD_BUG();						\
 	}								\
@@ -256,11 +290,15 @@ static __always_inline void __cmpwait##sfx(volatile void *ptr,		\
 	unreachable();							\
 }
 
-__CMPWAIT_GEN()
+__CMPWAIT_GEN(        , )
+__CMPWAIT_GEN(_timeout, )
 
 #undef __CMPWAIT_GEN
 
-#define __cmpwait_relaxed(ptr, val) \
-	__cmpwait((ptr), (unsigned long)(val), sizeof(*(ptr)))
+#define __cmpwait_relaxed_timeout(ptr, val, timeout_ns)			\
+	__cmpwait_timeout((ptr), (unsigned long)(val), timeout_ns, sizeof(*(ptr)))
+
+#define __cmpwait_relaxed(ptr, val)					\
+	__cmpwait((ptr), (unsigned long)(val), 0, sizeof(*(ptr)))
 
 #endif	/* __ASM_CMPXCHG_H */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v10 05/12] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait()
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
                   ` (3 preceding siblings ...)
  2026-03-16  1:36 ` [PATCH v10 04/12] arm64: support WFET in smp_cond_load_relaxed_timeout() Ankur Arora
@ 2026-03-16  1:36 ` Ankur Arora
  2026-03-24  1:41   ` Kumar Kartikeya Dwivedi
  2026-03-16  1:36 ` [PATCH v10 06/12] asm-generic: barrier: Add smp_cond_load_acquire_timeout() Ankur Arora
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

In preparation for defining smp_cond_load_acquire_timeout(), remove
the private copy. In the interim, the rqspinlock code falls back to
using smp_cond_load_acquire().

Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/rqspinlock.h | 85 -----------------------------
 1 file changed, 85 deletions(-)

diff --git a/arch/arm64/include/asm/rqspinlock.h b/arch/arm64/include/asm/rqspinlock.h
index 9ea0a74e5892..a385603436e9 100644
--- a/arch/arm64/include/asm/rqspinlock.h
+++ b/arch/arm64/include/asm/rqspinlock.h
@@ -3,91 +3,6 @@
 #define _ASM_RQSPINLOCK_H
 
 #include <asm/barrier.h>
-
-/*
- * Hardcode res_smp_cond_load_acquire implementations for arm64 to a custom
- * version based on [0]. In rqspinlock code, our conditional expression involves
- * checking the value _and_ additionally a timeout. However, on arm64, the
- * WFE-based implementation may never spin again if no stores occur to the
- * locked byte in the lock word. As such, we may be stuck forever if
- * event-stream based unblocking is not available on the platform for WFE spin
- * loops (arch_timer_evtstrm_available).
- *
- * Once support for smp_cond_load_acquire_timewait [0] lands, we can drop this
- * copy-paste.
- *
- * While we rely on the implementation to amortize the cost of sampling
- * cond_expr for us, it will not happen when event stream support is
- * unavailable, time_expr check is amortized. This is not the common case, and
- * it would be difficult to fit our logic in the time_expr_ns >= time_limit_ns
- * comparison, hence just let it be. In case of event-stream, the loop is woken
- * up at microsecond granularity.
- *
- * [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
- */
-
-#ifndef smp_cond_load_acquire_timewait
-
-#define smp_cond_time_check_count	200
-
-#define __smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns,	\
-					 time_limit_ns) ({		\
-	typeof(ptr) __PTR = (ptr);					\
-	__unqual_scalar_typeof(*ptr) VAL;				\
-	unsigned int __count = 0;					\
-	for (;;) {							\
-		VAL = READ_ONCE(*__PTR);				\
-		if (cond_expr)						\
-			break;						\
-		cpu_relax();						\
-		if (__count++ < smp_cond_time_check_count)		\
-			continue;					\
-		if ((time_expr_ns) >= (time_limit_ns))			\
-			break;						\
-		__count = 0;						\
-	}								\
-	(typeof(*ptr))VAL;						\
-})
-
-#define __smp_cond_load_acquire_timewait(ptr, cond_expr,		\
-					 time_expr_ns, time_limit_ns)	\
-({									\
-	typeof(ptr) __PTR = (ptr);					\
-	__unqual_scalar_typeof(*ptr) VAL;				\
-	for (;;) {							\
-		VAL = smp_load_acquire(__PTR);				\
-		if (cond_expr)						\
-			break;						\
-		__cmpwait_relaxed(__PTR, VAL);				\
-		if ((time_expr_ns) >= (time_limit_ns))			\
-			break;						\
-	}								\
-	(typeof(*ptr))VAL;						\
-})
-
-#define smp_cond_load_acquire_timewait(ptr, cond_expr,			\
-				      time_expr_ns, time_limit_ns)	\
-({									\
-	__unqual_scalar_typeof(*ptr) _val;				\
-	int __wfe = arch_timer_evtstrm_available();			\
-									\
-	if (likely(__wfe)) {						\
-		_val = __smp_cond_load_acquire_timewait(ptr, cond_expr,	\
-							time_expr_ns,	\
-							time_limit_ns);	\
-	} else {							\
-		_val = __smp_cond_load_relaxed_spinwait(ptr, cond_expr,	\
-							time_expr_ns,	\
-							time_limit_ns);	\
-		smp_acquire__after_ctrl_dep();				\
-	}								\
-	(typeof(*ptr))_val;						\
-})
-
-#endif
-
-#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire_timewait(v, c, 0, 1)
-
 #include <asm-generic/rqspinlock.h>
 
 #endif /* _ASM_RQSPINLOCK_H */
-- 
2.31.1



* [PATCH v10 06/12] asm-generic: barrier: Add smp_cond_load_acquire_timeout()
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
                   ` (4 preceding siblings ...)
  2026-03-16  1:36 ` [PATCH v10 05/12] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait() Ankur Arora
@ 2026-03-16  1:36 ` Ankur Arora
  2026-03-16  1:36 ` [PATCH v10 07/12] atomic: Add atomic_cond_read_*_timeout() Ankur Arora
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

Add the acquire variant of smp_cond_load_relaxed_timeout(). This
reuses the relaxed variant, with additional LOAD->LOAD ordering.
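
The construction can be sketched in userspace with C11 atomics: spin with
relaxed loads and, once the condition holds, upgrade the ordering with an
acquire fence, which plays the role smp_acquire__after_ctrl_dep() plays in
the kernel. The function name is invented and the timeout handling is
elided to keep the sketch short.

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace analogue of building the acquire variant on top of the
 * relaxed one.  Hypothetical name; not the kernel implementation. */
static unsigned int cond_load_acquire_sketch(_Atomic unsigned int *ptr,
					     unsigned int mask)
{
	unsigned int val;

	/* spin with relaxed loads until the condition holds */
	do {
		val = atomic_load_explicit(ptr, memory_order_relaxed);
	} while (!(val & mask));

	/* upgrade to LOAD->LOAD ordering, as
	 * smp_acquire__after_ctrl_dep() does in the kernel */
	atomic_thread_fence(memory_order_acquire);
	return val;
}
```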

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/asm-generic/barrier.h | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 8e6b85c3ed99..998270df6133 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -337,6 +337,32 @@ do {									\
 })
 #endif
 
+/**
+ * smp_cond_load_acquire_timeout() - (Spin) wait for cond with ACQUIRE ordering
+ * until a timeout expires.
+ * @ptr: pointer to the variable to wait on.
+ * @cond_expr: boolean expression to wait for.
+ * @time_expr_ns: monotonic expression that evaluates to time in ns or,
+ *  on failure, returns a negative value.
+ * @timeout_ns: timeout value in ns
+ * (Both of the above are assumed to be compatible with s64.)
+ *
+ * Equivalent to using smp_cond_load_acquire() on the condition variable with
+ * a timeout.
+ */
+#ifndef smp_cond_load_acquire_timeout
+#define smp_cond_load_acquire_timeout(ptr, cond_expr,			\
+				      time_expr_ns, timeout_ns)		\
+({									\
+	__unqual_scalar_typeof(*ptr) _val;				\
+	_val = smp_cond_load_relaxed_timeout(ptr, cond_expr,		\
+					     time_expr_ns,		\
+					     timeout_ns);		\
+	smp_acquire__after_ctrl_dep();					\
+	(typeof(*ptr))_val;						\
+})
+#endif
+
 /*
  * pmem_wmb() ensures that all stores for which the modification
  * are written to persistent storage by preceding instructions have
-- 
2.31.1



* [PATCH v10 07/12] atomic: Add atomic_cond_read_*_timeout()
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
                   ` (5 preceding siblings ...)
  2026-03-16  1:36 ` [PATCH v10 06/12] asm-generic: barrier: Add smp_cond_load_acquire_timeout() Ankur Arora
@ 2026-03-16  1:36 ` Ankur Arora
  2026-03-16  1:36 ` [PATCH v10 08/12] locking/atomic: scripts: build atomic_long_cond_read_*_timeout() Ankur Arora
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora, Boqun Feng

Add atomic load wrappers, atomic_cond_read_*_timeout() and
atomic64_cond_read_*_timeout() for the cond-load timeout interfaces.

Also add short descriptions for the atomic_cond_read_{relaxed,acquire}()
and atomic_cond_read_{relaxed,acquire}_timeout() interfaces.

Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 Documentation/atomic_t.txt | 14 +++++++++-----
 include/linux/atomic.h     | 10 ++++++++++
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/Documentation/atomic_t.txt b/Documentation/atomic_t.txt
index bee3b1bca9a7..0e53f6ccb558 100644
--- a/Documentation/atomic_t.txt
+++ b/Documentation/atomic_t.txt
@@ -16,6 +16,10 @@ Non-RMW ops:
   atomic_read(), atomic_set()
   atomic_read_acquire(), atomic_set_release()
 
+Non-RMW, non-atomic_t ops:
+
+  atomic_cond_read_{relaxed,acquire}()
+  atomic_cond_read_{relaxed,acquire}_timeout()
 
 RMW atomic operations:
 
@@ -79,11 +83,11 @@ SEMANTICS
 
 Non-RMW ops:
 
-The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
-implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
-smp_store_release() respectively. Therefore, if you find yourself only using
-the Non-RMW operations of atomic_t, you do not in fact need atomic_t at all
-and are doing it wrong.
+The non-RMW ops are (typically) regular, or conditional LOADs and STOREs and
+are canonically implemented using READ_ONCE(), WRITE_ONCE(),
+smp_load_acquire() and smp_store_release() respectively. Therefore, if you
+find yourself only using the Non-RMW operations of atomic_t, you do not in
+fact need atomic_t at all and are doing it wrong.
 
 A note for the implementation of atomic_set{}() is that it must not break the
 atomicity of the RMW ops. That is:
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 8dd57c3a99e9..5bcb86e07784 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -31,6 +31,16 @@
 #define atomic64_cond_read_acquire(v, c) smp_cond_load_acquire(&(v)->counter, (c))
 #define atomic64_cond_read_relaxed(v, c) smp_cond_load_relaxed(&(v)->counter, (c))
 
+#define atomic_cond_read_acquire_timeout(v, c, e, t) \
+	smp_cond_load_acquire_timeout(&(v)->counter, (c), (e), (t))
+#define atomic_cond_read_relaxed_timeout(v, c, e, t) \
+	smp_cond_load_relaxed_timeout(&(v)->counter, (c), (e), (t))
+
+#define atomic64_cond_read_acquire_timeout(v, c, e, t) \
+	smp_cond_load_acquire_timeout(&(v)->counter, (c), (e), (t))
+#define atomic64_cond_read_relaxed_timeout(v, c, e, t) \
+	smp_cond_load_relaxed_timeout(&(v)->counter, (c), (e), (t))
+
 /*
  * The idea here is to build acquire/release variants by adding explicit
  * barriers on top of the relaxed variant. In the case where the relaxed
-- 
2.31.1



* [PATCH v10 08/12] locking/atomic: scripts: build atomic_long_cond_read_*_timeout()
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
                   ` (6 preceding siblings ...)
  2026-03-16  1:36 ` [PATCH v10 07/12] atomic: Add atomic_cond_read_*_timeout() Ankur Arora
@ 2026-03-16  1:36 ` Ankur Arora
  2026-03-16  1:36 ` [PATCH v10 09/12] bpf/rqspinlock: switch check_timeout() to a clock interface Ankur Arora
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora, Boqun Feng

Add the atomic long wrappers for the cond-load timeout interfaces.

Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/atomic/atomic-long.h | 18 +++++++++++-------
 scripts/atomic/gen-atomic-long.sh  | 16 ++++++++++------
 2 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/include/linux/atomic/atomic-long.h b/include/linux/atomic/atomic-long.h
index 6a4e47d2db35..553b6b0e0258 100644
--- a/include/linux/atomic/atomic-long.h
+++ b/include/linux/atomic/atomic-long.h
@@ -11,14 +11,18 @@
 
 #ifdef CONFIG_64BIT
 typedef atomic64_t atomic_long_t;
-#define ATOMIC_LONG_INIT(i)		ATOMIC64_INIT(i)
-#define atomic_long_cond_read_acquire	atomic64_cond_read_acquire
-#define atomic_long_cond_read_relaxed	atomic64_cond_read_relaxed
+#define ATOMIC_LONG_INIT(i)			ATOMIC64_INIT(i)
+#define atomic_long_cond_read_acquire		atomic64_cond_read_acquire
+#define atomic_long_cond_read_relaxed		atomic64_cond_read_relaxed
+#define atomic_long_cond_read_acquire_timeout	atomic64_cond_read_acquire_timeout
+#define atomic_long_cond_read_relaxed_timeout	atomic64_cond_read_relaxed_timeout
 #else
 typedef atomic_t atomic_long_t;
-#define ATOMIC_LONG_INIT(i)		ATOMIC_INIT(i)
-#define atomic_long_cond_read_acquire	atomic_cond_read_acquire
-#define atomic_long_cond_read_relaxed	atomic_cond_read_relaxed
+#define ATOMIC_LONG_INIT(i)			ATOMIC_INIT(i)
+#define atomic_long_cond_read_acquire		atomic_cond_read_acquire
+#define atomic_long_cond_read_relaxed		atomic_cond_read_relaxed
+#define atomic_long_cond_read_acquire_timeout	atomic_cond_read_acquire_timeout
+#define atomic_long_cond_read_relaxed_timeout	atomic_cond_read_relaxed_timeout
 #endif
 
 /**
@@ -1809,4 +1813,4 @@ raw_atomic_long_dec_if_positive(atomic_long_t *v)
 }
 
 #endif /* _LINUX_ATOMIC_LONG_H */
-// 4b882bf19018602c10816c52f8b4ae280adc887b
+// 79c1f4acb5774376ceed559843d5d9ed1348df99
diff --git a/scripts/atomic/gen-atomic-long.sh b/scripts/atomic/gen-atomic-long.sh
index 9826be3ba986..874643dc74bd 100755
--- a/scripts/atomic/gen-atomic-long.sh
+++ b/scripts/atomic/gen-atomic-long.sh
@@ -79,14 +79,18 @@ cat << EOF
 
 #ifdef CONFIG_64BIT
 typedef atomic64_t atomic_long_t;
-#define ATOMIC_LONG_INIT(i)		ATOMIC64_INIT(i)
-#define atomic_long_cond_read_acquire	atomic64_cond_read_acquire
-#define atomic_long_cond_read_relaxed	atomic64_cond_read_relaxed
+#define ATOMIC_LONG_INIT(i)			ATOMIC64_INIT(i)
+#define atomic_long_cond_read_acquire		atomic64_cond_read_acquire
+#define atomic_long_cond_read_relaxed		atomic64_cond_read_relaxed
+#define atomic_long_cond_read_acquire_timeout	atomic64_cond_read_acquire_timeout
+#define atomic_long_cond_read_relaxed_timeout	atomic64_cond_read_relaxed_timeout
 #else
 typedef atomic_t atomic_long_t;
-#define ATOMIC_LONG_INIT(i)		ATOMIC_INIT(i)
-#define atomic_long_cond_read_acquire	atomic_cond_read_acquire
-#define atomic_long_cond_read_relaxed	atomic_cond_read_relaxed
+#define ATOMIC_LONG_INIT(i)			ATOMIC_INIT(i)
+#define atomic_long_cond_read_acquire		atomic_cond_read_acquire
+#define atomic_long_cond_read_relaxed		atomic_cond_read_relaxed
+#define atomic_long_cond_read_acquire_timeout	atomic_cond_read_acquire_timeout
+#define atomic_long_cond_read_relaxed_timeout	atomic_cond_read_relaxed_timeout
 #endif
 
 EOF
-- 
2.31.1



* [PATCH v10 09/12] bpf/rqspinlock: switch check_timeout() to a clock interface
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
                   ` (7 preceding siblings ...)
  2026-03-16  1:36 ` [PATCH v10 08/12] locking/atomic: scripts: build atomic_long_cond_read_*_timeout() Ankur Arora
@ 2026-03-16  1:36 ` Ankur Arora
  2026-03-24  1:43   ` Kumar Kartikeya Dwivedi
  2026-03-16  1:36 ` [PATCH v10 10/12] bpf/rqspinlock: Use smp_cond_load_acquire_timeout() Ankur Arora
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

check_timeout() reads the current time and, depending on how much
time has passed, checks for deadlock or timeout, returning 0 on
success or a negative errno on deadlock or timeout.

Switch this out for a clock-style interface, where the function acts
as a clock in the "lock-domain", returning the current time until a
deadlock or timeout occurs. Once a deadlock or timeout has occurred,
it stops functioning as a clock and returns an error.

Also adjust the RES_CHECK_TIMEOUT macro to discard the clock value
when updating the explicit return status.
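
The clock-style contract can be sketched in userspace as follows. The
names, the fixed timeout, and the injected `now_ns` parameter are
assumptions for the example; the real clock_deadlock() also performs the
ABBA deadlock check, which is elided here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_TIMEOUT_NS 1000000000LL	/* 1s, arbitrary for the sketch */

struct demo_timeout {
	int64_t timeout_end;	/* 0 until the first call primes the clock */
	int64_t duration;
};

/* now_ns stands in for ktime_get_mono_fast_ns() */
static int64_t lockdomain_clock(struct demo_timeout *ts, int64_t now_ns)
{
	if (!ts->timeout_end) {		/* first call: prime the deadline */
		ts->timeout_end = now_ns + ts->duration;
		return now_ns;
	}
	if (now_ns > ts->timeout_end)	/* stop being a clock: report error */
		return -ETIMEDOUT;
	return now_ns;			/* still healthy: behave as a clock */
}
```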

Cc: bpf@vger.kernel.org
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
--
Notes:
  - adjust the RES_CHECK_TIMEOUT macro to provide the old behaviour.

 kernel/bpf/rqspinlock.c | 45 +++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/rqspinlock.c b/kernel/bpf/rqspinlock.c
index e4e338cdb437..0ec17ebb67c1 100644
--- a/kernel/bpf/rqspinlock.c
+++ b/kernel/bpf/rqspinlock.c
@@ -196,8 +196,12 @@ static noinline int check_deadlock_ABBA(rqspinlock_t *lock, u32 mask)
 	return 0;
 }
 
-static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
-				  struct rqspinlock_timeout *ts)
+/*
+ * Returns current monotonic time in ns on success or, negative errno
+ * value on failure due to timeout expiration or detection of deadlock.
+ */
+static noinline s64 clock_deadlock(rqspinlock_t *lock, u32 mask,
+				   struct rqspinlock_timeout *ts)
 {
 	u64 prev = ts->cur;
 	u64 time;
@@ -207,7 +211,7 @@ static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
 			return -EDEADLK;
 		ts->cur = ktime_get_mono_fast_ns();
 		ts->timeout_end = ts->cur + ts->duration;
-		return 0;
+		return (s64)ts->cur;
 	}
 
 	time = ktime_get_mono_fast_ns();
@@ -219,11 +223,15 @@ static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
 	 * checks.
 	 */
 	if (prev + NSEC_PER_MSEC < time) {
+		int ret;
 		ts->cur = time;
-		return check_deadlock_ABBA(lock, mask);
+		ret = check_deadlock_ABBA(lock, mask);
+		if (ret)
+			return ret;
+
 	}
 
-	return 0;
+	return (s64)time;
 }
 
 /*
@@ -231,15 +239,22 @@ static noinline int check_timeout(rqspinlock_t *lock, u32 mask,
  * as the macro does internal amortization for us.
  */
 #ifndef res_smp_cond_load_acquire
-#define RES_CHECK_TIMEOUT(ts, ret, mask)                              \
-	({                                                            \
-		if (!(ts).spin++)                                     \
-			(ret) = check_timeout((lock), (mask), &(ts)); \
-		(ret);                                                \
+#define RES_CHECK_TIMEOUT(ts, ret, mask)					\
+	({									\
+		s64 __timeval_err = 0;						\
+		if (!(ts).spin++)						\
+			__timeval_err = clock_deadlock((lock), (mask), &(ts));	\
+		(ret) = __timeval_err < 0 ? __timeval_err : 0;			\
+		__timeval_err;							\
 	})
 #else
-#define RES_CHECK_TIMEOUT(ts, ret, mask)			      \
-	({ (ret) = check_timeout((lock), (mask), &(ts)); })
+#define RES_CHECK_TIMEOUT(ts, ret, mask)					\
+	({									\
+		s64 __timeval_err;						\
+		__timeval_err = clock_deadlock((lock), (mask), &(ts));		\
+		(ret) = __timeval_err < 0 ? __timeval_err : 0;			\
+		__timeval_err;							\
+	})
 #endif
 
 /*
@@ -281,7 +296,7 @@ int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock)
 	val = atomic_read(&lock->val);
 
 	if (val || !atomic_try_cmpxchg(&lock->val, &val, 1)) {
-		if (RES_CHECK_TIMEOUT(ts, ret, ~0u))
+		if (RES_CHECK_TIMEOUT(ts, ret, ~0u) < 0)
 			goto out;
 		cpu_relax();
 		goto retry;
@@ -406,7 +421,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 */
 	if (val & _Q_LOCKED_MASK) {
 		RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
-		res_smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK));
+		res_smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK) < 0);
 	}
 
 	if (ret) {
@@ -568,7 +583,7 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 */
 	RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT * 2);
 	val = res_atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
-					   RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK));
+					   RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK) < 0);
 
 	/* Disable queue destruction when we detect deadlocks. */
 	if (ret == -EDEADLK) {
-- 
2.31.1



* [PATCH v10 10/12] bpf/rqspinlock: Use smp_cond_load_acquire_timeout()
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
                   ` (8 preceding siblings ...)
  2026-03-16  1:36 ` [PATCH v10 09/12] bpf/rqspinlock: switch check_timeout() to a clock interface Ankur Arora
@ 2026-03-16  1:36 ` Ankur Arora
  2026-03-24  1:46   ` Kumar Kartikeya Dwivedi
  2026-03-16  1:36 ` [PATCH v10 11/12] sched: add need-resched timed wait interface Ankur Arora
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

Switch out the conditional load interfaces used by rqspinlock
to smp_cond_load_acquire_timeout() and its wrapper,
atomic_cond_read_acquire_timeout().

Both of these handle the timeout and amortize as needed, so use the
non-amortized RES_CHECK_TIMEOUT.

RES_CHECK_TIMEOUT does double duty here: presenting the current
clock value and the timeout/deadlock error from clock_deadlock() to
the cond-load, and returning the error value via ret.

For correctness, we need to ensure that the error case of the
cond-load interface always agrees with that in clock_deadlock().

For the most part, this is fine because there's no independent clock,
or double reads from the clock in cond-load -- either of which could
lead to its internal state going out of sync from that of
clock_deadlock().

There is, however, an edge case where clock_deadlock() checks for:

        if (time > ts->timeout_end)
                return -ETIMEDOUT;

while smp_cond_load_acquire_timeout() checks for:

        __time_now = (time_expr_ns);
        if (__time_now <= 0 || __time_now >= __time_end) {
                VAL = READ_ONCE(*__PTR);
                break;
        }

This runs into a problem when (__time_now == __time_end):
clock_deadlock() does not treat that as a timeout condition, but the
second clause in the conditional above does.
So, add an equality check in clock_deadlock().
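
The boundary disagreement can be seen by extracting the two predicates
(the helper names are hypothetical; the expressions are the ones quoted
above):

```c
#include <assert.h>
#include <stdint.h>

/* clock_deadlock() before this patch */
static int clock_times_out_old(int64_t time, int64_t timeout_end)
{
	return time > timeout_end;
}

/* clock_deadlock() after adding the equality check */
static int clock_times_out_new(int64_t time, int64_t timeout_end)
{
	return time >= timeout_end;
}

/* the bail-out clause in smp_cond_load_*_timeout() */
static int cond_load_times_out(int64_t time_now, int64_t time_end)
{
	return time_now <= 0 || time_now >= time_end;
}
```

At equality the old clock check says "not timed out" while the cond-load
check says "timed out"; the new check restores agreement.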

Finally, redefine SMP_TIMEOUT_POLL_COUNT as 16k, similar to the
spin count used in the amortized version. We only do this for
non-arm64, since arm64 uses a waiting implementation.

Cc: bpf@vger.kernel.org
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 kernel/bpf/rqspinlock.c | 40 +++++++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/kernel/bpf/rqspinlock.c b/kernel/bpf/rqspinlock.c
index 0ec17ebb67c1..e5e27266b813 100644
--- a/kernel/bpf/rqspinlock.c
+++ b/kernel/bpf/rqspinlock.c
@@ -215,7 +215,7 @@ static noinline s64 clock_deadlock(rqspinlock_t *lock, u32 mask,
 	}
 
 	time = ktime_get_mono_fast_ns();
-	if (time > ts->timeout_end)
+	if (time >= ts->timeout_end)
 		return -ETIMEDOUT;
 
 	/*
@@ -235,11 +235,10 @@ static noinline s64 clock_deadlock(rqspinlock_t *lock, u32 mask,
 }
 
 /*
- * Do not amortize with spins when res_smp_cond_load_acquire is defined,
- * as the macro does internal amortization for us.
+ * Spin amortized version of RES_CHECK_TIMEOUT. Used when busy-waiting in
+ * atomic_try_cmpxchg().
  */
-#ifndef res_smp_cond_load_acquire
-#define RES_CHECK_TIMEOUT(ts, ret, mask)					\
+#define RES_CHECK_TIMEOUT_AMORTIZED(ts, ret, mask)				\
 	({									\
 		s64 __timeval_err = 0;						\
 		if (!(ts).spin++)						\
@@ -247,7 +246,7 @@ static noinline s64 clock_deadlock(rqspinlock_t *lock, u32 mask,
 		(ret) = __timeval_err < 0 ? __timeval_err : 0;			\
 		__timeval_err;							\
 	})
-#else
+
 #define RES_CHECK_TIMEOUT(ts, ret, mask)					\
 	({									\
 		s64 __timeval_err;						\
@@ -255,7 +254,6 @@ static noinline s64 clock_deadlock(rqspinlock_t *lock, u32 mask,
 		(ret) = __timeval_err < 0 ? __timeval_err : 0;			\
 		__timeval_err;							\
 	})
-#endif
 
 /*
  * Initialize the 'spin' member.
@@ -269,6 +267,17 @@ static noinline s64 clock_deadlock(rqspinlock_t *lock, u32 mask,
  */
 #define RES_RESET_TIMEOUT(ts, _duration) ({ (ts).timeout_end = 0; (ts).duration = _duration; })
 
+/*
+ * Limit how often we invoke clock_deadlock() while spin-waiting in
+ * smp_cond_load_acquire_timeout() or atomic_cond_read_acquire_timeout().
+ *
+ * We only override the default value, without superseding ARM64's override.
+ */
+#ifndef CONFIG_ARM64
+#undef SMP_TIMEOUT_POLL_COUNT
+#define SMP_TIMEOUT_POLL_COUNT	(16*1024)
+#endif
+
 /*
  * Provide a test-and-set fallback for cases when queued spin lock support is
  * absent from the architecture.
@@ -296,7 +305,7 @@ int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock)
 	val = atomic_read(&lock->val);
 
 	if (val || !atomic_try_cmpxchg(&lock->val, &val, 1)) {
-		if (RES_CHECK_TIMEOUT(ts, ret, ~0u) < 0)
+		if (RES_CHECK_TIMEOUT_AMORTIZED(ts, ret, ~0u) < 0)
 			goto out;
 		cpu_relax();
 		goto retry;
@@ -319,12 +328,6 @@ EXPORT_SYMBOL_GPL(resilient_tas_spin_lock);
  */
 static DEFINE_PER_CPU_ALIGNED(struct qnode, rqnodes[_Q_MAX_NODES]);
 
-#ifndef res_smp_cond_load_acquire
-#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire(v, c)
-#endif
-
-#define res_atomic_cond_read_acquire(v, c) res_smp_cond_load_acquire(&(v)->counter, (c))
-
 /**
  * resilient_queued_spin_lock_slowpath - acquire the queued spinlock
  * @lock: Pointer to queued spinlock structure
@@ -421,7 +424,9 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 */
 	if (val & _Q_LOCKED_MASK) {
 		RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT);
-		res_smp_cond_load_acquire(&lock->locked, !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK) < 0);
+		smp_cond_load_acquire_timeout(&lock->locked, !VAL,
+					      RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK),
+					      ts.duration);
 	}
 
 	if (ret) {
@@ -582,8 +587,9 @@ int __lockfunc resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val)
 	 * us.
 	 */
 	RES_RESET_TIMEOUT(ts, RES_DEF_TIMEOUT * 2);
-	val = res_atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK) ||
-					   RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK) < 0);
+	val = atomic_cond_read_acquire_timeout(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK),
+					       RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_PENDING_MASK),
+					       ts.duration);
 
 	/* Disable queue destruction when we detect deadlocks. */
 	if (ret == -EDEADLK) {
-- 
2.31.1



* [PATCH v10 11/12] sched: add need-resched timed wait interface
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
                   ` (9 preceding siblings ...)
  2026-03-16  1:36 ` [PATCH v10 10/12] bpf/rqspinlock: Use smp_cond_load_acquire_timeout() Ankur Arora
@ 2026-03-16  1:36 ` Ankur Arora
  2026-03-16  1:36 ` [PATCH v10 12/12] cpuidle/poll_state: Wait for need-resched via tif_need_resched_relaxed_wait() Ankur Arora
  2026-03-16  1:49 ` [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Andrew Morton
  12 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora, Ingo Molnar

Add tif_bitset_relaxed_wait() (and its wrapper,
tif_need_resched_relaxed_wait()), which takes the thread_info bit
and a timeout duration as parameters and waits until the bit is set
or the timeout expires.

The wait is implemented via smp_cond_load_relaxed_timeout().

smp_cond_load_relaxed_timeout() essentially provides the pattern used
in poll_idle() where we spin in a loop waiting for the flag to change
until a timeout occurs.

tif_need_resched_relaxed_wait() allows us to abstract away the
internals of the wait, scheduler-specific details, etc.

Placed in linux/sched/idle.h instead of linux/thread_info.h to work
around recursive include hell.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: linux-pm@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/sched/idle.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index 8465ff1f20d1..ddee9b019895 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -3,6 +3,7 @@
 #define _LINUX_SCHED_IDLE_H
 
 #include <linux/sched.h>
+#include <linux/sched/clock.h>
 
 enum cpu_idle_type {
 	__CPU_NOT_IDLE = 0,
@@ -113,4 +114,32 @@ static __always_inline void current_clr_polling(void)
 }
 #endif
 
+/*
+ * Caller needs to make sure that the thread context cannot be preempted
+ * or migrated, so current_thread_info() cannot change from under us.
+ *
+ * This also allows us to safely stay in the local_clock domain.
+ */
+static __always_inline bool tif_bitset_relaxed_wait(int tif, u64 timeout_ns)
+{
+	unsigned long flags;
+
+	flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
+					      (VAL & BIT(tif)),
+					      local_clock_noinstr(),
+					      timeout_ns);
+	return flags & BIT(tif);
+}
+
+/**
+ * tif_need_resched_relaxed_wait() - Wait for need-resched being set
+ * with no ordering guarantees until a timeout expires.
+ *
+ * @timeout_ns: timeout value.
+ */
+static __always_inline bool tif_need_resched_relaxed_wait(u64 timeout_ns)
+{
+	return tif_bitset_relaxed_wait(TIF_NEED_RESCHED, timeout_ns);
+}
+
 #endif /* _LINUX_SCHED_IDLE_H */
-- 
2.31.1



* [PATCH v10 12/12] cpuidle/poll_state: Wait for need-resched via tif_need_resched_relaxed_wait()
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
                   ` (10 preceding siblings ...)
  2026-03-16  1:36 ` [PATCH v10 11/12] sched: add need-resched timed wait interface Ankur Arora
@ 2026-03-16  1:36 ` Ankur Arora
  2026-03-16  1:49 ` [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Andrew Morton
  12 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2026-03-16  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf
  Cc: arnd, catalin.marinas, will, peterz, akpm, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

The inner loop in poll_idle() polls the thread_info flags, waiting
for TIF_NEED_RESCHED to be set on the thread. The loop exits once
the condition is met, or once the poll time limit has been exceeded.

To minimize the number of instructions executed in each iteration,
the time check is rate-limited. In addition, each loop iteration
executes cpu_relax() which on certain platforms provides a hint to
the pipeline that the loop busy-waits, allowing the processor to
reduce power consumption.
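
The shape of the loop being removed can be sketched as below. The
deterministic demo clock and the injected stop flag are stand-ins for
local_clock_noinstr() and need_resched(), purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define POLL_RELAX_COUNT 200	/* same rate limit the removed loop used */

/* Deterministic stand-in for local_clock_noinstr(): each read advances
 * the "clock" by 1000 ns. */
static int64_t demo_clock_now;

static int64_t demo_clock_ns(void)
{
	return demo_clock_now += 1000;
}

/* Poll a flag, with the (possibly expensive) clock read amortized over
 * POLL_RELAX_COUNT iterations.  Returns 1 on timeout, 0 otherwise. */
static int poll_until(volatile int *stop_flag, int64_t (*clock_ns)(void),
		      int64_t limit_ns)
{
	int64_t time_start = clock_ns();
	unsigned int loop_count = 0;
	int timed_out = 0;

	while (!*stop_flag) {
		/* cpu_relax() would go here in the kernel */
		if (loop_count++ < POLL_RELAX_COUNT)
			continue;

		loop_count = 0;
		if (clock_ns() - time_start > limit_ns) {
			timed_out = 1;
			break;
		}
	}
	return timed_out;
}
```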

Switch over to tif_need_resched_relaxed_wait(), since it provides
exactly this pattern.

However, since we want to minimize power consumption while idle, the
building of cpuidle/poll_state.c continues to depend on
CONFIG_ARCH_HAS_CPU_RELAX, as that serves as an indicator that the
platform supports an optimized version of
tif_need_resched_relaxed_wait() (via smp_cond_load_relaxed_timeout()).

Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: linux-pm@vger.kernel.org
Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/cpuidle/poll_state.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index c7524e4c522a..7443b3e971ba 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -6,41 +6,22 @@
 #include <linux/cpuidle.h>
 #include <linux/export.h>
 #include <linux/irqflags.h>
-#include <linux/sched.h>
-#include <linux/sched/clock.h>
 #include <linux/sched/idle.h>
 #include <linux/sprintf.h>
 #include <linux/types.h>
 
-#define POLL_IDLE_RELAX_COUNT	200
-
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
-	u64 time_start;
-
-	time_start = local_clock_noinstr();
-
 	dev->poll_time_limit = false;
 
 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		unsigned int loop_count = 0;
 		u64 limit;
 
 		limit = cpuidle_poll_time(drv, dev);
 
-		while (!need_resched()) {
-			cpu_relax();
-			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
-				continue;
-
-			loop_count = 0;
-			if (local_clock_noinstr() - time_start > limit) {
-				dev->poll_time_limit = true;
-				break;
-			}
-		}
+		dev->poll_time_limit = !tif_need_resched_relaxed_wait(limit);
 	}
 	raw_local_irq_disable();
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
  2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
                   ` (11 preceding siblings ...)
  2026-03-16  1:36 ` [PATCH v10 12/12] cpuidle/poll_state: Wait for need-resched via tif_need_resched_relaxed_wait() Ankur Arora
@ 2026-03-16  1:49 ` Andrew Morton
  2026-03-16 22:08   ` Ankur Arora
  12 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2026-03-16  1:49 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf, arnd,
	catalin.marinas, will, peterz, mark.rutland, harisokn, cl, ast,
	rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai, rdunlap,
	david.laight.linux, joao.m.martins, boris.ostrovsky, konrad.wilk

On Sun, 15 Mar 2026 18:36:39 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Hi,
> 
> This series adds waited variants of the smp_cond_load() primitives:
> smp_cond_load_relaxed_timeout(), and smp_cond_load_acquire_timeout().
>
> ...
>

How are we to determine that this change is successful, useful, etc? 
Reduced CPU consumption?  Reduced energy usage?  Improved latencies?

>
> Finally update poll_idle() and resilient queued spinlocks to use them.

Have you identified other suitable sites for conversion?

> 
>  Documentation/atomic_t.txt           | 14 +++--
>  arch/arm64/Kconfig                   |  3 +
>  arch/arm64/include/asm/barrier.h     | 23 +++++++
>  arch/arm64/include/asm/cmpxchg.h     | 62 +++++++++++++++----
>  arch/arm64/include/asm/delay-const.h | 27 +++++++++
>  arch/arm64/include/asm/rqspinlock.h  | 85 --------------------------
>  arch/arm64/lib/delay.c               | 15 ++---
>  drivers/cpuidle/poll_state.c         | 21 +------
>  drivers/soc/qcom/rpmh-rsc.c          |  8 +--
>  include/asm-generic/barrier.h        | 90 ++++++++++++++++++++++++++++
>  include/linux/atomic.h               | 10 ++++
>  include/linux/atomic/atomic-long.h   | 18 +++---
>  include/linux/sched/idle.h           | 29 +++++++++
>  kernel/bpf/rqspinlock.c              | 77 +++++++++++++++---------
>  scripts/atomic/gen-atomic-long.sh    | 16 +++--
>  15 files changed, 320 insertions(+), 178 deletions(-)
>  create mode 100644 arch/arm64/include/asm/delay-const.h

Some sort of testing in lib/tests/ would be appropriate and useful.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
  2026-03-16  1:49 ` [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Andrew Morton
@ 2026-03-16 22:08   ` Ankur Arora
  2026-03-16 23:37     ` David Laight
  0 siblings, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2026-03-16 22:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, linux-pm,
	bpf, arnd, catalin.marinas, will, peterz, mark.rutland, harisokn,
	cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk


Andrew Morton <akpm@linux-foundation.org> writes:

> On Sun, 15 Mar 2026 18:36:39 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Hi,
>>
>> This series adds waited variants of the smp_cond_load() primitives:
>> smp_cond_load_relaxed_timeout(), and smp_cond_load_acquire_timeout().
>>
>> ...
>>
>
> How are we to determine that this change is successful, useful, etc?

Good point. So this series was split off from this one here:
  https://lore.kernel.org/lkml/20250218213337.377987-1-ankur.a.arora@oracle.com/

The series enables ARCH_HAS_CPU_RELAX on arm64 which should allow
relatively cheap polling in idle on arm64.
However, it does need a few more patches from the series above to do that.

> Reduced CPU consumption?  Reduced energy usage?  Improved latencies?

With the additional patches this should improve wakeup latency:

  I ran the sched-pipe test with processes on VCPUs 4 and 5 with
  kvm-arm.wfi_trap_policy=notrap.

  # perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
  perf bench sched pipe -l 1000000 -c 4

  # No haltpoll (and, no TIF_POLLING_NRFLAG):

  Performance counter stats for 'CPU(s) 4,5' (5 runs):

         25,229.57 msec task-clock                       #    2.000 CPUs utilized               ( +-  7.75% )
    45,821,250,284      cycles                           #    1.816 GHz                         ( +- 10.07% )
    26,557,496,665      instructions                     #    0.58  insn per cycle              ( +-  0.21% )
                 0      sched:sched_wake_idle_without_ipi #    0.000 /sec

       12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )


  # Haltpoll:

  Performance counter stats for 'CPU(s) 4,5' (5 runs):

         15,131.58 msec task-clock                       #    2.000 CPUs utilized               ( +- 10.00% )
    34,158,188,839      cycles                           #    2.257 GHz                         ( +-  6.91% )
    20,824,950,916      instructions                     #    0.61  insn per cycle              ( +-  0.09% )
         1,983,822      sched:sched_wake_idle_without_ipi #  131.105 K/sec                       ( +-  0.78% )

        7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )

  We get a decent boost just because we are executing ~20% fewer
  instructions. Not sure how the cpu frequency scaling works in a VM but
  we also run at a higher frequency.

(That specifically applies to guests, but that series also enables this
with acpi-idle for baremetal.)

(From: https://lore.kernel.org/lkml/877c9zhk68.fsf@oracle.com/)

>> Finally update poll_idle() and resilient queued spinlocks to use them.
>
> Have you identified other suitable sites for conversion?

Haven't found other places in the core kernel where this could be used.
I think one reason is that the typical kernel wait is unbounded.

There are some in drivers/ that have this pattern. For instance I think
this in drivers/iommu/arm/arm-smmu-v3 could be converted:
__arm_smmu_cmdq_poll_until_msi().

However, as David Laight pointed out in this thread
(https://lore.kernel.org/lkml/20260214113122.70627a8b@pumpkin/),
this would be fine so long as the polling is on memory, but would
need some work to handle MMIO.

>>  Documentation/atomic_t.txt           | 14 +++--
>>  arch/arm64/Kconfig                   |  3 +
>>  arch/arm64/include/asm/barrier.h     | 23 +++++++
>>  arch/arm64/include/asm/cmpxchg.h     | 62 +++++++++++++++----
>>  arch/arm64/include/asm/delay-const.h | 27 +++++++++
>>  arch/arm64/include/asm/rqspinlock.h  | 85 --------------------------
>>  arch/arm64/lib/delay.c               | 15 ++---
>>  drivers/cpuidle/poll_state.c         | 21 +------
>>  drivers/soc/qcom/rpmh-rsc.c          |  8 +--
>>  include/asm-generic/barrier.h        | 90 ++++++++++++++++++++++++++++
>>  include/linux/atomic.h               | 10 ++++
>>  include/linux/atomic/atomic-long.h   | 18 +++---
>>  include/linux/sched/idle.h           | 29 +++++++++
>>  kernel/bpf/rqspinlock.c              | 77 +++++++++++++++---------
>>  scripts/atomic/gen-atomic-long.sh    | 16 +++--
>>  15 files changed, 320 insertions(+), 178 deletions(-)
>>  create mode 100644 arch/arm64/include/asm/delay-const.h
>
> Some sort of testing in lib/tests/ would be appropriate and useful.

Makes sense. Will add.

Thanks
--
ankur

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
  2026-03-16 22:08   ` Ankur Arora
@ 2026-03-16 23:37     ` David Laight
  2026-03-17  6:53       ` Ankur Arora
  2026-03-25 15:55       ` Catalin Marinas
  0 siblings, 2 replies; 30+ messages in thread
From: David Laight @ 2026-03-16 23:37 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Andrew Morton, linux-kernel, linux-arch, linux-arm-kernel,
	linux-pm, bpf, arnd, catalin.marinas, will, peterz, mark.rutland,
	harisokn, cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1,
	xueshuai, rdunlap, joao.m.martins, boris.ostrovsky, konrad.wilk

On Mon, 16 Mar 2026 15:08:07 -0700
Ankur Arora <ankur.a.arora@oracle.com> wrote:

...
> However, as David Laight pointed out in this thread
> (https://lore.kernel.org/lkml/20260214113122.70627a8b@pumpkin/)
> that this would be fine so long as the polling is on memory, but would
> need some work to handle MMIO.

I'm not sure the current code works with MMIO on arm64.
I was looking at the osq_lock() code, it uses smp_cond_load() with 'expr'
being 'VAL || need_resched()' expecting to get woken by the IPI associated
with the preemption being requested.
But the arm64 code relies on 'wfe' being woken when the memory write
'breaks' the 'ldx' for the monitored location.
That will only work for cached addresses.

For osq_lock(), while an IPI will wake it up, there is also a small timing
window where the IPI can happen before the ldx and so not actually wake it up.
This is true whenever 'expr' is non-trivial.
On arm64 I think you could use explicit sev and wfe - but that will wake all
'sleeping' cpus; and you may not want the 'thundering herd'.

	David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
  2026-03-16 23:37     ` David Laight
@ 2026-03-17  6:53       ` Ankur Arora
  2026-03-17  9:17         ` David Laight
  2026-03-25 15:55       ` Catalin Marinas
  1 sibling, 1 reply; 30+ messages in thread
From: Ankur Arora @ 2026-03-17  6:53 UTC (permalink / raw)
  To: David Laight
  Cc: Ankur Arora, Andrew Morton, linux-kernel, linux-arch,
	linux-arm-kernel, linux-pm, bpf, arnd, catalin.marinas, will,
	peterz, mark.rutland, harisokn, cl, ast, rafael, daniel.lezcano,
	memxor, zhenglifeng1, xueshuai, rdunlap, joao.m.martins,
	boris.ostrovsky, konrad.wilk


David Laight <david.laight.linux@gmail.com> writes:

> On Mon, 16 Mar 2026 15:08:07 -0700
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> ...
>> However, as David Laight pointed out in this thread
>> (https://lore.kernel.org/lkml/20260214113122.70627a8b@pumpkin/)
>> that this would be fine so long as the polling is on memory, but would
>> need some work to handle MMIO.
>
> I'm not sure the current code works with MMIO on arm64.
>
> I was looking at the osq_lock() code, it uses smp_cond_load() with 'expr'
> being 'VAL || need_resched()' expecting to get woken by the IPI associated
> with the preemption being requested.

Yeah, osq_lock() has:

  if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
                            vcpu_is_preempted(node_cpu(node->prev))))

This works on x86 even with a non-stub vcpu_is_preempted() because we
poll on the node->locked while evaluating the conditional.

Won't work on arm64 since there we are just waiting on node->locked.

(And so we depend on the IPI there.)

> But the arm64 code relies on 'wfe' being woken when the memory write
> 'breaks' the 'ldx' for the monitored location.

Agreed.

> That will only work for cached addresses.

Yeah I was forgetting that. LDXR won't really work with non-cacheable
memory. So a generic MMIO version might not be possible at all.

> For osq_lock(), while an IPI will wake it up, there is also a small timing
> window where the IPI can happen before the ldx and so not actually wake up it.
> This is true whenever 'expr' is non-trivial.

Yeah because we are checking the state of more than one address in the
conditional but the waiting monitor only does one memory location at a
time. If only CPUs had vectored waiting mechanisms...

I think the smp_cond_load_*() in poll_idle() and in rqspinlock monitor
a single address (which is the only thing their conditionals check), so
those two should be okay.

> On arm64 I think you could use explicit sev and wfe - but that will wake all
> 'sleeping' cpu; and you may not want the 'thundering herd'.

Wouldn't we still have the same narrow window where the CPU disregards the IPI?

--
ankur

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
  2026-03-17  6:53       ` Ankur Arora
@ 2026-03-17  9:17         ` David Laight
  2026-03-25 13:53           ` Catalin Marinas
  0 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2026-03-17  9:17 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Andrew Morton, linux-kernel, linux-arch, linux-arm-kernel,
	linux-pm, bpf, arnd, catalin.marinas, will, peterz, mark.rutland,
	harisokn, cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1,
	xueshuai, rdunlap, joao.m.martins, boris.ostrovsky, konrad.wilk

On Mon, 16 Mar 2026 23:53:22 -0700
Ankur Arora <ankur.a.arora@oracle.com> wrote:

> David Laight <david.laight.linux@gmail.com> writes:
...
> > On arm64 I think you could use explicit sev and wfe - but that will wake all
> > 'sleeping' cpu; and you may not want the 'thundering herd'.  
> 
> Wouldn't we still have the same narrow window where the CPU disregards the IPI?

You need a 'sevl' in the interrupt exit path.
(Or, more specifically, the ISR needs to exit with the flag set.)
Actually I think you need one anyway, if the ISR clears the flag
then the existing code will sleep until the following IRQ if an
interrupt happens between the ldx and wfe.
I've not looked at the ISR exit code.

Ignoring the vcpu check (which is fairly broken anyway), I think osq_lock()
would be ok if the 'osq node' were in the right cache line.
I've some patches pending (I need to sort out lots of comments) that reduce
the osq_node down to two cpu numbers; 8 bytes but possibly only 4
although that is harder without 16bit atomics.
That would work for arm32 (ldx uses a cache-line resolution) but I'm not
sure about similar functionality on other cpus.

	David

> 
> --
> ankur


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 05/12] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait()
  2026-03-16  1:36 ` [PATCH v10 05/12] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait() Ankur Arora
@ 2026-03-24  1:41   ` Kumar Kartikeya Dwivedi
  2026-03-25  5:58     ` Ankur Arora
  0 siblings, 1 reply; 30+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-03-24  1:41 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf, arnd,
	catalin.marinas, will, peterz, akpm, mark.rutland, harisokn, cl,
	ast, rafael, daniel.lezcano, zhenglifeng1, xueshuai, rdunlap,
	david.laight.linux, joao.m.martins, boris.ostrovsky, konrad.wilk

On Mon, 16 Mar 2026 at 02:37, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> In preparation for defining smp_cond_load_acquire_timeout(), remove
> the private copy. Lacking this, the rqspinlock code falls back to using
> smp_cond_load_acquire().

nit: This temporarily breaks bisection (or rather, leaves things
broken until later commits), but I don't care too much.

>
> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: bpf@vger.kernel.org
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> Reviewed-by: Haris Okanovic <harisokn@amazon.com>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

>  [...]
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 09/12] bpf/rqspinlock: switch check_timeout() to a clock interface
  2026-03-16  1:36 ` [PATCH v10 09/12] bpf/rqspinlock: switch check_timeout() to a clock interface Ankur Arora
@ 2026-03-24  1:43   ` Kumar Kartikeya Dwivedi
  2026-03-25  5:57     ` Ankur Arora
  0 siblings, 1 reply; 30+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-03-24  1:43 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf, arnd,
	catalin.marinas, will, peterz, akpm, mark.rutland, harisokn, cl,
	ast, rafael, daniel.lezcano, zhenglifeng1, xueshuai, rdunlap,
	david.laight.linux, joao.m.martins, boris.ostrovsky, konrad.wilk

On Mon, 16 Mar 2026 at 02:37, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> check_timeout() gets the current time value and depending on how
> much time has passed, checks for deadlock or times out, returning 0
> or -errno on deadlock or timeout.
>
> Switch this out to a clock style interface, where it functions as a
> clock in the "lock-domain", returning the current time until a
> deadlock or timeout occurs. Once a deadlock or timeout has occurred,
> it stops functioning as a clock and returns error.
>
> Also adjust the RES_CHECK_TIMEOUT macro to discard the clock value
> when updating the explicit return status.
>
> Cc: bpf@vger.kernel.org
> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> --

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

>  [...]
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 10/12] bpf/rqspinlock: Use smp_cond_load_acquire_timeout()
  2026-03-16  1:36 ` [PATCH v10 10/12] bpf/rqspinlock: Use smp_cond_load_acquire_timeout() Ankur Arora
@ 2026-03-24  1:46   ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 30+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-03-24  1:46 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, linux-pm, bpf, arnd,
	catalin.marinas, will, peterz, akpm, mark.rutland, harisokn, cl,
	ast, rafael, daniel.lezcano, zhenglifeng1, xueshuai, rdunlap,
	david.laight.linux, joao.m.martins, boris.ostrovsky, konrad.wilk

On Mon, 16 Mar 2026 at 02:37, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> Switch out the conditional load interfaces used by rqspinlock
> to smp_cond_read_acquire_timeout() and its wrapper,
> atomic_cond_read_acquire_timeout().
>
> Both these handle the timeout and amortize as needed, so use the
> non-amortized RES_CHECK_TIMEOUT.
>
> RES_CHECK_TIMEOUT does double duty here -- presenting the current
> clock value, the timeout/deadlock error from clock_deadlock() to
> the cond-load and, returning the error value via ret.
>
> For correctness, we need to ensure that the error case of the
> cond-load interface always agrees with that in clock_deadlock().
>
> For the most part, this is fine because there's no independent clock,
> or double reads from the clock in cond-load -- either of which could
> lead to its internal state going out of sync from that of
> clock_deadlock().
>
> There is, however, an edge case where clock_deadlock() checks for:
>
>         if (time > ts->timeout_end)
>                 return -ETIMEDOUT;
>
> while smp_cond_load_acquire_timeout() checks for:
>
>         __time_now = (time_expr_ns);
>         if (__time_now <= 0 || __time_now >= __time_end) {
>                 VAL = READ_ONCE(*__PTR);
>                 break;
>         }
>
> This runs into a problem when (__time_now == __time_end) since
> clock_deadlock() does not treat it as a timeout condition but
> the second clause in the conditional above does.
> So, add an equality check in clock_deadlock().
>
> Finally, redefine SMP_TIMEOUT_POLL_COUNT to be 16k to be similar to
> the spin-count used in the amortized version. We only do this for
> non-arm64 as that uses a waiting implementation.
>
> Cc: bpf@vger.kernel.org
> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

> [...]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 09/12] bpf/rqspinlock: switch check_timeout() to a clock interface
  2026-03-24  1:43   ` Kumar Kartikeya Dwivedi
@ 2026-03-25  5:57     ` Ankur Arora
  0 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2026-03-25  5:57 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, linux-pm,
	bpf, arnd, catalin.marinas, will, peterz, akpm, mark.rutland,
	harisokn, cl, ast, rafael, daniel.lezcano, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk


Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Mon, 16 Mar 2026 at 02:37, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> check_timeout() gets the current time value and depending on how
>> much time has passed, checks for deadlock or times out, returning 0
>> or -errno on deadlock or timeout.
>>
>> Switch this out to a clock style interface, where it functions as a
>> clock in the "lock-domain", returning the current time until a
>> deadlock or timeout occurs. Once a deadlock or timeout has occurred,
>> it stops functioning as a clock and returns error.
>>
>> Also adjust the RES_CHECK_TIMEOUT macro to discard the clock value
>> when updating the explicit return status.
>>
>> Cc: bpf@vger.kernel.org
>> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> --
>
> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Thanks Kumar.

--
ankur

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 05/12] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait()
  2026-03-24  1:41   ` Kumar Kartikeya Dwivedi
@ 2026-03-25  5:58     ` Ankur Arora
  0 siblings, 0 replies; 30+ messages in thread
From: Ankur Arora @ 2026-03-25  5:58 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, linux-pm,
	bpf, arnd, catalin.marinas, will, peterz, akpm, mark.rutland,
	harisokn, cl, ast, rafael, daniel.lezcano, zhenglifeng1, xueshuai,
	rdunlap, david.laight.linux, joao.m.martins, boris.ostrovsky,
	konrad.wilk


Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Mon, 16 Mar 2026 at 02:37, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> In preparation for defining smp_cond_load_acquire_timeout(), remove
>> the private copy. Lacking this, the rqspinlock code falls back to using
>> smp_cond_load_acquire().
>
> nit: This temporarily breaks bisection (or rather, leaves things
> broken until later commits), but I don't care too much.

Yeah. Would have avoided it if I easily could. But given that there
was a private interface involved this seemed the cleanest way.

>> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: bpf@vger.kernel.org
>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>> Reviewed-by: Haris Okanovic <harisokn@amazon.com>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>
> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Thanks.

--
ankur

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
  2026-03-17  9:17         ` David Laight
@ 2026-03-25 13:53           ` Catalin Marinas
  2026-03-25 15:42             ` David Laight
  0 siblings, 1 reply; 30+ messages in thread
From: Catalin Marinas @ 2026-03-25 13:53 UTC (permalink / raw)
  To: David Laight
  Cc: Ankur Arora, Andrew Morton, linux-kernel, linux-arch,
	linux-arm-kernel, linux-pm, bpf, arnd, will, peterz, mark.rutland,
	harisokn, cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1,
	xueshuai, rdunlap, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Mar 17, 2026 at 09:17:05AM +0000, David Laight wrote:
> On Mon, 16 Mar 2026 23:53:22 -0700
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
> > David Laight <david.laight.linux@gmail.com> writes:
> > > On arm64 I think you could use explicit sev and wfe - but that will wake all
> > > 'sleeping' cpu; and you may not want the 'thundering herd'.  
> > 
> > Wouldn't we still have the same narrow window where the CPU disregards the IPI?
> 
> You need a 'sevl' in the interrupt exit path.

No need to, see the rule below in
https://developer.arm.com/documentation/ddi0487/maa/2983-beijhbbd:

R_XRZRK
  The Event Register for a PE is set by any of the following:
  [...]
  - An exception return.

-- 
Catalin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
  2026-03-25 13:53           ` Catalin Marinas
@ 2026-03-25 15:42             ` David Laight
  2026-03-25 16:32               ` Catalin Marinas
  0 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2026-03-25 15:42 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, Andrew Morton, linux-kernel, linux-arch,
	linux-arm-kernel, linux-pm, bpf, arnd, will, peterz, mark.rutland,
	harisokn, cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1,
	xueshuai, rdunlap, joao.m.martins, boris.ostrovsky, konrad.wilk

On Wed, 25 Mar 2026 13:53:50 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Tue, Mar 17, 2026 at 09:17:05AM +0000, David Laight wrote:
> > On Mon, 16 Mar 2026 23:53:22 -0700
> > Ankur Arora <ankur.a.arora@oracle.com> wrote:  
> > > David Laight <david.laight.linux@gmail.com> writes:  
> > > > On arm64 I think you could use explicit sev and wfe - but that will wake all
> > > > 'sleeping' cpu; and you may not want the 'thundering herd'.    
> > > 
> > > Wouldn't we still have the same narrow window where the CPU disregards the IPI?  
> > 
> > You need a 'sevl' in the interrupt exit path.  
> 
> No need to, see the rule below in
> https://developer.arm.com/documentation/ddi0487/maa/2983-beijhbbd:
> 
> R_XRZRK
>   The Event Register for a PE is set by any of the following:
>   [...]
>   - An exception return.
> 

It is a shame the pages for the SEV and WFE instructions don't mention that.
And the copy I found doesn't have working hyperlinks to any other sections.
(Not even references to related instructions...)

You do need to at least comment that the "msr s0_3_c1_c0_0, %[ecycles]" is
actually WFET.
Is that using an absolute cycle count?
If so does it work if the time has already passed?
If it is absolute do you need to recalculate it every time around the loop?
__delay_cycles() contains guard(preempt_notrace()). I haven't looked at what
that does, but is it needed here since preemption is already disabled?

Looking at the code I think the "sevl; wfe" pair should be higher up.
If they were before the evaluation of the condition then an IPI that set
need_resched() just after it was tested would cause a wakeup.
Clearly that won't help if the condition does anything that executes 'wfe'
and won't sleep if it sets the event - but I suspect they are unlikely.

I also wonder how long it takes the cpu to leave any low power state.
We definitely found that was an issue on some x86 cpus and had to both
disable the lowest low power state and completely rework some wakeup
code that really wanted a 'thundering herd' rather than the very gentle
'bring each cpu out of low power one at a time' that cv_broadcast()
gave it.

	David


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
  2026-03-16 23:37     ` David Laight
  2026-03-17  6:53       ` Ankur Arora
@ 2026-03-25 15:55       ` Catalin Marinas
  2026-03-25 19:36         ` David Laight
  1 sibling, 1 reply; 30+ messages in thread
From: Catalin Marinas @ 2026-03-25 15:55 UTC (permalink / raw)
  To: David Laight
  Cc: Ankur Arora, Andrew Morton, linux-kernel, linux-arch,
	linux-arm-kernel, linux-pm, bpf, arnd, will, peterz, mark.rutland,
	harisokn, cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1,
	xueshuai, rdunlap, joao.m.martins, boris.ostrovsky, konrad.wilk

On Mon, Mar 16, 2026 at 11:37:12PM +0000, David Laight wrote:
> On Mon, 16 Mar 2026 15:08:07 -0700
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
> > However, as David Laight pointed out in this thread
> > (https://lore.kernel.org/lkml/20260214113122.70627a8b@pumpkin/)
> > that this would be fine so long as the polling is on memory, but would
> > need some work to handle MMIO.
> 
> I'm not sure the current code works with MMIO on arm64.

It won't but also passing an MMIO pointer to smp_cond_load() is wrong in
general. You'd need a new API that takes an __iomem pointer.

> I was looking at the osq_lock() code, it uses smp_cond_load() with 'expr'
> being 'VAL || need_resched()' expecting to get woken by the IPI associated
> with the preemption being requested.
> But the arm64 code relies on 'wfe' being woken when the memory write
> 'breaks' the 'ldx' for the monitored location.
> That will only work for cached addresses.

Even worse, depending on the hardware, you may even get a data abort
when attempting LDXR on Device memory.

> For osq_lock(), while an IPI will wake it up, there is also a small timing
> window where the IPI can happen before the ldx and so not actually wake up it.
> This is true whenever 'expr' is non-trivial.

Hmm, I thought this was fine because of the implicit SEVL on exception
return, but the arm64 __cmpwait_relaxed() does a SEVL+WFE which clears
any prior event, so it can in theory wait forever when the event stream
is disabled.

Expanding smp_cond_load_relaxed() into asm, we have something like:

	LDR	X0, [PTR]
	condition check for VAL || need_resched() with branch out
	SEVL
	WFE
	LDXR	X1, [PTR]
	EOR	X1, X1, X0
	CBNZ	out
	WFE
  out:

If the condition is updated to become true (need_resched()) after the
condition check but before the first WFE while *PTR remains unchanged,
the IPI won't do anything. Maybe we should revert 1cfc63b5ae60 ("arm64:
cmpwait: Clear event register before arming exclusive monitor"). Not
great but probably better than reverting f5bfdc8e3947 ("locking/osq: Use
optimized spinning loop for arm64").

Using SEV instead of IPI would have the same problem.

-- 
Catalin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
  2026-03-25 15:42             ` David Laight
@ 2026-03-25 16:32               ` Catalin Marinas
  2026-03-25 20:23                 ` David Laight
  0 siblings, 1 reply; 30+ messages in thread
From: Catalin Marinas @ 2026-03-25 16:32 UTC (permalink / raw)
  To: David Laight
  Cc: Ankur Arora, Andrew Morton, linux-kernel, linux-arch,
	linux-arm-kernel, linux-pm, bpf, arnd, will, peterz, mark.rutland,
	harisokn, cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1,
	xueshuai, rdunlap, joao.m.martins, boris.ostrovsky, konrad.wilk

On Wed, Mar 25, 2026 at 03:42:10PM +0000, David Laight wrote:
> On Wed, 25 Mar 2026 13:53:50 +0000
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> > On Tue, Mar 17, 2026 at 09:17:05AM +0000, David Laight wrote:
> > > On Mon, 16 Mar 2026 23:53:22 -0700
> > > Ankur Arora <ankur.a.arora@oracle.com> wrote:  
> > > > David Laight <david.laight.linux@gmail.com> writes:  
> > > > > On arm64 I think you could use explicit sev and wfe - but that will wake all
> > > > > 'sleeping' cpu; and you may not want the 'thundering herd'.    
> > > > 
> > > > Wouldn't we still have the same narrow window where the CPU disregards the IPI?  
> > > 
> > > You need a 'sevl' in the interrupt exit path.  
> > 
> > No need to, see the rule below in
> > https://developer.arm.com/documentation/ddi0487/maa/2983-beijhbbd:
> > 
> > R_XRZRK
> >   The Event Register for a PE is set by any of the following:
> >   [...]
> >   - An exception return.
> > 
> 
> It is a shame the pages for the SEV and WFE instructions don't mention that.
> And the copy I found doesn't have working hyperlinks to any other sections.
> (Not even references to related instructions...)

The latest architecture spec (M.a.a) has working hyperlinks.

> You do need to at least comment that the "msr s0_3_c1_c0_0, %[ecycles]" is
> actually WFET.
> Is that using an absolute cycle count?

Yes, compared to CNTVCT.

> If so does it work if the time has already passed?

Yes, it exits immediately. These counters are not going to wrap in our
(or the device's) lifetime.

> If it is absolute do you need to recalculate it every time around the loop?

No, but you do need to read CNTVCT; that's what __delay_cycles() does (it
does not wait).

> __delay_cycles() contains guard(preempt_notrace()). I haven't looked what
> that does but is it needed here since preemption is disabled?

The guard was added recently by commit e5cb94ba5f96 ("arm64: Fix
sampling the "stable" virtual counter in preemptible section"). It's
needed for the udelay() case but probably not for Ankur's series. Maybe
we can move the guard in the caller?

> Looking at the code I think the "sevl; wfe" pair should be higher up.

Yes, I replied to your other message. We could indeed move it higher,
before the condition check, but I can't get my head around the ordering.
Can the need_resched() check be speculated before the WFE? I need to
think some more.

> I also wonder how long it takes the cpu to leave any low power state.
> We definitely found that was an issue on some x86 cpu and had to both
> disable the lowest low power state and completely rework some wakeup
> code that really wanted a 'thundering herd' rather than the very gentle
> 'bring each cpu out of low power one at a time' that cv_broadcast()
> gave it.

WFE is a very shallow power state where all hardware state is retained.
We already have an event stream broadcast to all CPUs regularly (10KHz)
and I haven't heard people complaining about power degradation. If a CPU
is in a WFI state or even deeper into firmware (following a PSCI call),
an exclusive monitor event won't wake it up. It's only for those cores
waiting in WFE.

-- 
Catalin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
  2026-03-25 15:55       ` Catalin Marinas
@ 2026-03-25 19:36         ` David Laight
  0 siblings, 0 replies; 30+ messages in thread
From: David Laight @ 2026-03-25 19:36 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, Andrew Morton, linux-kernel, linux-arch,
	linux-arm-kernel, linux-pm, bpf, arnd, will, peterz, mark.rutland,
	harisokn, cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1,
	xueshuai, rdunlap, joao.m.martins, boris.ostrovsky, konrad.wilk

On Wed, 25 Mar 2026 15:55:21 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Mon, Mar 16, 2026 at 11:37:12PM +0000, David Laight wrote:
...
> > For osq_lock(), while an IPI will wake it up, there is also a small timing
> > window where the IPI can happen before the ldx and so not actually wake it up.
> > This is true whenever 'expr' is non-trivial.  
> 
> Hmm, I thought this was fine because of the implicit SEVL on exception
> return, but the arm64 __cmpwait_relaxed() does a SEVL+WFE, which clears
> any prior event; in theory it can then wait forever when the event stream
> is disabled.

Not forever; there will be a timer interrupt eventually.

> Expanding smp_cond_load_relaxed() into asm, we have something like:
> 
> 	LDR	X0, [PTR]
> 	condition check for VAL || need_resched() with branch out
> 	SEVL
> 	WFE
> 	LDXR	X1, [PTR]
> 	EOR	X1, X1, X0
> 	CBNZ	X1, out
> 	WFE
>   out:
> 
> If the condition is updated to become true (need_resched()) after the
> condition check but before the first WFE while *PTR remains unchanged,
> the IPI won't do anything. Maybe we should revert 1cfc63b5ae60 ("arm64:
> cmpwait: Clear event register before arming exclusive monitor"). Not
> great but probably better than reverting f5bfdc8e3947 ("locking/osq: Use
> optimized spinning loop for arm64").

Could you change the order to:
	LDR	X0, [PTR]
	SEVL
	WFE
	condition check for VAL || need_resched() with branch out
	LDXR	X1, [PTR]
	EOR	X1, X1, X0
	CBNZ	X1, out
	WFE
  out:

that closes the timing window for the interrupt provided the condition
check doesn't change the event register.

I must get back to the osq_lock code again.
I'm happy with the code - the per-cpu data is down to two cpu numbers.
(Apart from the acquire/release semantics in a few places.)
But the comments have got out of hand.
Writing succinct and accurate comments is hard - too verbose and they
hide the code.

	David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
  2026-03-25 16:32               ` Catalin Marinas
@ 2026-03-25 20:23                 ` David Laight
  2026-03-26 15:39                   ` Catalin Marinas
  0 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2026-03-25 20:23 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, Andrew Morton, linux-kernel, linux-arch,
	linux-arm-kernel, linux-pm, bpf, arnd, will, peterz, mark.rutland,
	harisokn, cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1,
	xueshuai, rdunlap, joao.m.martins, boris.ostrovsky, konrad.wilk

On Wed, 25 Mar 2026 16:32:49 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Wed, Mar 25, 2026 at 03:42:10PM +0000, David Laight wrote:
...
> > Looking at the code I think the "sevl; wfe" pair should be higher up.  
> 
> Yes, I replied to your other message. We could move it higher indeed,
> before the condition check, but I can't get my head around the ordering.
> Can need_resched() check be speculated before the WFE? I need to think
> some more.

I don't think speculation can matter.
Both SEVL and WFE must be serialised against any other instructions
that can change the event flag (as well as against each other), otherwise
everything is broken.
Apart from that it doesn't matter; what matters is the instruction
boundary at which the interrupt happens.

Actually both SEVL and WFE may be synchronising instructions and very slow.
So you may not want to put them in the fast path where the condition
is true on entry (or even true after a retry).
The code might then have to look like:
	for (;;) {
		VAL = mem;
		if (cond(VAL)) return;
		SEVL; WFE;
		if (cond(VAL)) return;
		v1 = LDX(mem);
		if (v1 == VAL)
			WFE;
	}

Definitely needs a comment that both 'return from exception' and
losing the exclusive access set the event flag.

The asm probably ought to have a full "memory" clobber, just in case
the compiler gets lively with re-ordering.

	David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout()
  2026-03-25 20:23                 ` David Laight
@ 2026-03-26 15:39                   ` Catalin Marinas
  0 siblings, 0 replies; 30+ messages in thread
From: Catalin Marinas @ 2026-03-26 15:39 UTC (permalink / raw)
  To: David Laight
  Cc: Ankur Arora, Andrew Morton, linux-kernel, linux-arch,
	linux-arm-kernel, linux-pm, bpf, arnd, will, peterz, mark.rutland,
	harisokn, cl, ast, rafael, daniel.lezcano, memxor, zhenglifeng1,
	xueshuai, rdunlap, joao.m.martins, boris.ostrovsky, konrad.wilk

On Wed, Mar 25, 2026 at 08:23:57PM +0000, David Laight wrote:
> On Wed, 25 Mar 2026 16:32:49 +0000
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> > On Wed, Mar 25, 2026 at 03:42:10PM +0000, David Laight wrote:
> ...
> > > Looking at the code I think the "sevl; wfe" pair should be higher up.  
> > 
> > Yes, I replied to your other message. We could move it higher indeed,
> > before the condition check, but I can't get my head around the ordering.
> > Can need_resched() check be speculated before the WFE? I need to think
> > some more.
> 
> I don't think speculation can matter.
> Both SEVL and WFE must be serialised against any other instructions
> that can change the event flag (as well as each other) otherwise
> everything is broken.

Welcome to the Arm memory model. We don't have any guarantee that an LDR
will only access memory after SEVL+WFE. They are not serialising.

> Apart from that it doesn't matter, what matters is the instruction
> boundary the interrupt happens at.

True. If an interrupt is taken before the LDR (that would be a
need_resched() check for example), then a prior WFE would not matter.
This won't work if we replace the IPI with a SEV though (suggested
somewhere in this thread).

> Actually both SEVL and WFE may be synchronising instructions and very slow.

Most likely not.

> So you may not want to put them in the fast path where the condition
> is true on entry (or even true after a retry).
> So the code might have to look like:
> 	for (;;) {
> 		VAL = mem;

If we only waited on the location passed to LDXR, things would have been
much simpler. But osq_lock() also wants to wait on the TIF flags via
need_resched() (and vcpu_is_preempted()).

> 		if (cond(VAL)) return;

So the cond(VAL) here is actually a series of other memory loads
unrelated to 'mem'.

> 		SEVL; WFE;
> 		if (cond(VAL)) return;

I think this will work in principle even if 'cond' accesses other memory
locations, though I wouldn't bother with an additional 'cond' call, I'd
expect SEVL+WFE to be mostly NOPs. However, 'cond' must not set a local
event, otherwise the power saving on waiting is gone.

> 		v1 = LDX(mem);
> 		if (v1 == VAL)
> 			WFE;
> 	}

I think it's cleaner to use Ankur's timeout API here for the very rare
case where an IPI hits at the wrong time. We then keep
smp_cond_load_relaxed() intact, as it's really not meant to wait for
multiple memory locations to change. Any change to
smp_cond_load_relaxed() that moves the WFE around is just a hack, not
the intended use of this API.

-- 
Catalin

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2026-03-26 15:40 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-16  1:36 [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Ankur Arora
2026-03-16  1:36 ` [PATCH v10 01/12] asm-generic: barrier: Add smp_cond_load_relaxed_timeout() Ankur Arora
2026-03-16  1:36 ` [PATCH v10 02/12] arm64: barrier: Support smp_cond_load_relaxed_timeout() Ankur Arora
2026-03-16  1:36 ` [PATCH v10 03/12] arm64/delay: move some constants out to a separate header Ankur Arora
2026-03-16  1:36 ` [PATCH v10 04/12] arm64: support WFET in smp_cond_load_relaxed_timeout() Ankur Arora
2026-03-16  1:36 ` [PATCH v10 05/12] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait() Ankur Arora
2026-03-24  1:41   ` Kumar Kartikeya Dwivedi
2026-03-25  5:58     ` Ankur Arora
2026-03-16  1:36 ` [PATCH v10 06/12] asm-generic: barrier: Add smp_cond_load_acquire_timeout() Ankur Arora
2026-03-16  1:36 ` [PATCH v10 07/12] atomic: Add atomic_cond_read_*_timeout() Ankur Arora
2026-03-16  1:36 ` [PATCH v10 08/12] locking/atomic: scripts: build atomic_long_cond_read_*_timeout() Ankur Arora
2026-03-16  1:36 ` [PATCH v10 09/12] bpf/rqspinlock: switch check_timeout() to a clock interface Ankur Arora
2026-03-24  1:43   ` Kumar Kartikeya Dwivedi
2026-03-25  5:57     ` Ankur Arora
2026-03-16  1:36 ` [PATCH v10 10/12] bpf/rqspinlock: Use smp_cond_load_acquire_timeout() Ankur Arora
2026-03-24  1:46   ` Kumar Kartikeya Dwivedi
2026-03-16  1:36 ` [PATCH v10 11/12] sched: add need-resched timed wait interface Ankur Arora
2026-03-16  1:36 ` [PATCH v10 12/12] cpuidle/poll_state: Wait for need-resched via tif_need_resched_relaxed_wait() Ankur Arora
2026-03-16  1:49 ` [PATCH v10 00/12] barrier: Add smp_cond_load_{relaxed,acquire}_timeout() Andrew Morton
2026-03-16 22:08   ` Ankur Arora
2026-03-16 23:37     ` David Laight
2026-03-17  6:53       ` Ankur Arora
2026-03-17  9:17         ` David Laight
2026-03-25 13:53           ` Catalin Marinas
2026-03-25 15:42             ` David Laight
2026-03-25 16:32               ` Catalin Marinas
2026-03-25 20:23                 ` David Laight
2026-03-26 15:39                   ` Catalin Marinas
2026-03-25 15:55       ` Catalin Marinas
2026-03-25 19:36         ` David Laight

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox