linux-arch.vger.kernel.org archive mirror
* [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout()
@ 2025-02-03 21:49 Ankur Arora
  2025-02-03 21:49 ` [PATCH 1/4] asm-generic: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Ankur Arora @ 2025-02-03 21:49 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel
  Cc: arnd, catalin.marinas, will, peterz, mark.rutland, harisokn, cl,
	memxor, zhenglifeng1, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

Hi,

This series adds waited variants of the smp_cond_load() primitives:
smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().

There are two known users for these interfaces:

 - poll_idle() [1]
 - resilient queued spinlocks [2]

For both of these cases we want to wait on a condition but also want
to terminate the wait at some point.

Now, in theory, that can be worked around by making the time check a
part of the conditional expression provided to smp_cond_load_*():

   smp_cond_load_relaxed(&cvar, !VAL || time_check());

That approach, however, runs into two problems:
 
  - smp_cond_load_*() only allow waiting on a condition: this might
    be okay when we are synchronously spin-waiting on the condition,
    but not on architectures where we are actually waiting for a store
    to a cacheline.

  - this semantic problem becomes a real problem on arm64 if the
    event-stream is disabled: there will then be no asynchronous
    event (the event-stream) that periodically wakes the waiter,
    which might lead to an interminable wait if VAL is never
    written to (see the sketch below).
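
For illustration, a sketch of the failure mode (time_check() here is
a stand-in for whatever deadline check the caller would use):

   /*
    * On an event-based implementation, the conditional expression,
    * and so time_check(), is only re-evaluated when a store to
    * &cvar wakes the waiter. If nothing ever writes to cvar, the
    * time check never runs and the wait never terminates.
    */
   smp_cond_load_relaxed(&cvar, !VAL || time_check());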

This series extends the smp_cond_load_*() interfaces by adding two
arguments: a time-check expression and its associated time limit.
This is sufficient to allow for either a synchronously waited
implementation (like the generic cpu_relax() based loop) or one
where the CPU waits for a store to a cacheline with an out-of-band
timer.
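
As a usage sketch (hedged: the flags/_TIF_NEED_RESCHED condition and
local_clock() are illustrative, modelled on the poll_idle() user, and
max_poll_ns is an assumed limit):

   u64 time_limit = local_clock() + max_poll_ns;

   /* Wait for need-resched, but give up once time_limit passes. */
   flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
                                          VAL & _TIF_NEED_RESCHED,
                                          local_clock(), time_limit);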

Any comments appreciated!


Ankur

[1] https://lore.kernel.org/lkml/20241107190818.522639-3-ankur.a.arora@oracle.com/
[2] https://lore.kernel.org/lkml/20250107140004.2732830-9-memxor@gmail.com/

--
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: linux-arch@vger.kernel.org

Ankur Arora (4):
  asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  asm-generic: barrier: Add smp_cond_load_acquire_timewait()
  arm64: barrier: Add smp_cond_load_relaxed_timewait()
  arm64: barrier: Add smp_cond_load_acquire_timewait()

 arch/arm64/include/asm/barrier.h | 74 ++++++++++++++++++++++++++++++++
 include/asm-generic/barrier.h    | 71 ++++++++++++++++++++++++++++++
 2 files changed, 145 insertions(+)

-- 
2.43.5



* [PATCH 1/4] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-02-03 21:49 [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout() Ankur Arora
@ 2025-02-03 21:49 ` Ankur Arora
  2025-03-04 19:15   ` Catalin Marinas
  2025-02-03 21:49 ` [PATCH 2/4] asm-generic: barrier: Add smp_cond_load_acquire_timewait() Ankur Arora
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Ankur Arora @ 2025-02-03 21:49 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel
  Cc: arnd, catalin.marinas, will, peterz, mark.rutland, harisokn, cl,
	memxor, zhenglifeng1, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

Add smp_cond_load_relaxed_timewait(), a timed variant of
smp_cond_load_relaxed(). This is useful for cases where we want to
wait on a condition variable but don't want to wait indefinitely.

The timeout is supported by extending the smp_cond_load_relaxed()
interface to specify a time-check expression and an associated
time-limit.

The generic version is implemented as the usual cpu_relax() spin-wait
loop with a periodic evaluation of the time-check expression. To
minimize the number of instructions executed in each iteration of
the loop, limit the frequency of the time-check to once every
smp_cond_time_check_count iterations.
(The inner loop in poll_idle() has substantially the same structure
and constraints as smp_cond_load_relaxed_timewait(), so we reuse the
same value for smp_cond_time_check_count.)

Architecture specific versions can implement this by waiting for a
cacheline to change, with an out-of-band mechanism allowing for the
time check.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/asm-generic/barrier.h | 48 +++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..31de8ed2a05e 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,54 @@ do {									\
 })
 #endif
 
+#ifndef smp_cond_time_check_count
+/*
+ * Limit how often smp_cond_load_relaxed_timewait() evaluates time_expr_ns.
+ * This helps reduce the number of instructions executed while spin-waiting.
+ */
+#define smp_cond_time_check_count	200
+#endif
+
+/**
+ * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
+ * guarantees until a timeout expires.
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ * @time_expr_ns: evaluates to the current time
+ * @time_limit_ns: compared against time_expr_ns
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ *
+ * Note that the time check in time_expr_ns can be synchronous or
+ * asynchronous.
+ * In the generic version the check is synchronous but kept coarse
+ * to minimize instructions executed while spin-waiting.
+ */
+#ifndef __smp_cond_load_relaxed_spinwait
+#define __smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns,	\
+					 time_limit_ns) ({		\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	unsigned int __count = 0;					\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		cpu_relax();						\
+		if (__count++ < smp_cond_time_check_count)		\
+			continue;					\
+		if ((time_expr_ns) >= (time_limit_ns))			\
+			break;						\
+		__count = 0;						\
+	}								\
+	(typeof(*ptr))VAL;						\
+})
+#endif
+
+#ifndef smp_cond_load_relaxed_timewait
+#define smp_cond_load_relaxed_timewait  __smp_cond_load_relaxed_spinwait
+#endif
+
 /*
  * pmem_wmb() ensures that all stores for which the modification
  * are written to persistent storage by preceding instructions have
-- 
2.43.5



* [PATCH 2/4] asm-generic: barrier: Add smp_cond_load_acquire_timewait()
  2025-02-03 21:49 [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout() Ankur Arora
  2025-02-03 21:49 ` [PATCH 1/4] asm-generic: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
@ 2025-02-03 21:49 ` Ankur Arora
  2025-02-03 21:49 ` [PATCH 3/4] arm64: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Ankur Arora @ 2025-02-03 21:49 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel
  Cc: arnd, catalin.marinas, will, peterz, mark.rutland, harisokn, cl,
	memxor, zhenglifeng1, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

Add smp_cond_load_acquire_timewait(), a timed variant of
smp_cond_load_acquire(). This is useful for cases where we want
to wait on a condition variable but want to ensure termination.

smp_cond_load_acquire_timewait() is implemented via the relaxed
variant, with the additional LOAD->LOAD ordering via
smp_acquire__after_ctrl_dep().
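
As a hedged usage sketch (lock_word and deadline_ns are assumed for
illustration; the resilient spinlock series is the intended user):

   /*
    * Spin until the lock word clears, with ACQUIRE ordering on the
    * final load, bailing out once the deadline passes.
    */
   val = smp_cond_load_acquire_timewait(&lock_word, !VAL,
                                        ktime_get_mono_fast_ns(),
                                        deadline_ns);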

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/asm-generic/barrier.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 31de8ed2a05e..62673ad37db2 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -321,6 +321,29 @@ do {									\
 #define smp_cond_load_relaxed_timewait  __smp_cond_load_relaxed_spinwait
 #endif
 
+/**
+ * smp_cond_load_acquire_timewait() - (Spin) wait for cond with ACQUIRE ordering
+ * until a timeout expires.
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ * @time_expr_ns: evaluates to the current time
+ * @time_limit_ns: compared against time_expr_ns
+ *
+ * Equivalent to using smp_cond_load_acquire() on the condition variable with
+ * a timeout.
+ */
+#ifndef smp_cond_load_acquire_timewait
+#define smp_cond_load_acquire_timewait(ptr, cond_expr, time_expr_ns,	\
+				       time_limit_ns) ({		\
+	__unqual_scalar_typeof(*ptr) _val;				\
+	_val = smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
+					      time_expr_ns,		\
+					      time_limit_ns);		\
+	smp_acquire__after_ctrl_dep();					\
+	(typeof(*ptr))_val;						\
+})
+#endif
+
 /*
  * pmem_wmb() ensures that all stores for which the modification
  * are written to persistent storage by preceding instructions have
-- 
2.43.5



* [PATCH 3/4] arm64: barrier: Add smp_cond_load_relaxed_timewait()
  2025-02-03 21:49 [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout() Ankur Arora
  2025-02-03 21:49 ` [PATCH 1/4] asm-generic: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
  2025-02-03 21:49 ` [PATCH 2/4] asm-generic: barrier: Add smp_cond_load_acquire_timewait() Ankur Arora
@ 2025-02-03 21:49 ` Ankur Arora
  2025-03-04 19:29   ` Catalin Marinas
  2025-02-03 21:49 ` [PATCH 4/4] arm64: barrier: Add smp_cond_load_acquire_timewait() Ankur Arora
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Ankur Arora @ 2025-02-03 21:49 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel
  Cc: arnd, catalin.marinas, will, peterz, mark.rutland, harisokn, cl,
	memxor, zhenglifeng1, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

Add smp_cond_load_relaxed_timewait(), a timed variant of
smp_cond_load_relaxed().

This uses __cmpwait_relaxed() to do the actual waiting, with the
event-stream guaranteeing that we wake up from WFE periodically
and not block forever in case there are no stores to the cacheline.

For cases where the event-stream is unavailable, fall back to the
generic spin-wait implementation.

Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/barrier.h | 38 ++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 1ca947d5c939..25721275a5a2 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -216,6 +216,44 @@ do {									\
 	(typeof(*ptr))VAL;						\
 })
 
+#define __smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
+					 time_expr_ns, time_limit_ns)	\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		__cmpwait_relaxed(__PTR, VAL);				\
+		if ((time_expr_ns) >= (time_limit_ns))			\
+			break;						\
+	}								\
+	(typeof(*ptr))VAL;						\
+})
+
+/*
+ * For the unlikely case that the event-stream is unavailable,
+ * ward off the possibility of waiting forever by falling back
+ * to the generic spin-wait.
+ */
+#define smp_cond_load_relaxed_timewait(ptr, cond_expr,			\
+				       time_expr_ns, time_limit_ns)	\
+({									\
+	__unqual_scalar_typeof(*ptr) _val;				\
+	int __wfe = arch_timer_evtstrm_available();			\
+									\
+	if (likely(__wfe))						\
+		_val = __smp_cond_load_relaxed_timewait(ptr, cond_expr,	\
+							time_expr_ns,	\
+							time_limit_ns);	\
+	else								\
+		_val = __smp_cond_load_relaxed_spinwait(ptr, cond_expr,	\
+							time_expr_ns,	\
+							time_limit_ns);	\
+	(typeof(*ptr))_val;						\
+})
+
 #include <asm-generic/barrier.h>
 
 #endif	/* __ASSEMBLY__ */
-- 
2.43.5



* [PATCH 4/4] arm64: barrier: Add smp_cond_load_acquire_timewait()
  2025-02-03 21:49 [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout() Ankur Arora
                   ` (2 preceding siblings ...)
  2025-02-03 21:49 ` [PATCH 3/4] arm64: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
@ 2025-02-03 21:49 ` Ankur Arora
  2025-02-14 22:42   ` Okanovic, Haris
  2025-02-06 10:57 ` [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout() Kumar Kartikeya Dwivedi
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Ankur Arora @ 2025-02-03 21:49 UTC (permalink / raw)
  To: linux-kernel, linux-arch, linux-arm-kernel
  Cc: arnd, catalin.marinas, will, peterz, mark.rutland, harisokn, cl,
	memxor, zhenglifeng1, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Ankur Arora

Add smp_cond_load_acquire_timewait(). This is substantially similar
to smp_cond_load_acquire() where we use a load-acquire in the loop
and avoid an smp_rmb() later.

To handle the unlikely case of the event-stream being unavailable,
keep the implementation simple by falling back to the generic
__smp_cond_load_relaxed_spinwait() with an smp_rmb() to follow
(via smp_acquire__after_ctrl_dep().)

Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/barrier.h | 36 ++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 25721275a5a2..22d9291aee8d 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -232,6 +232,22 @@ do {									\
 	(typeof(*ptr))VAL;						\
 })
 
+#define __smp_cond_load_acquire_timewait(ptr, cond_expr,		\
+					 time_expr_ns, time_limit_ns)	\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	for (;;) {							\
+		VAL = smp_load_acquire(__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		__cmpwait_relaxed(__PTR, VAL);				\
+		if ((time_expr_ns) >= (time_limit_ns))			\
+			break;						\
+	}								\
+	(typeof(*ptr))VAL;						\
+})
+
 /*
  * For the unlikely case that the event-stream is unavailable,
  * ward off the possibility of waiting forever by falling back
@@ -254,6 +270,26 @@ do {									\
 	(typeof(*ptr))_val;						\
 })
 
+#define smp_cond_load_acquire_timewait(ptr, cond_expr,			\
+				      time_expr_ns, time_limit_ns)	\
+({									\
+	__unqual_scalar_typeof(*ptr) _val;				\
+	int __wfe = arch_timer_evtstrm_available();			\
+									\
+	if (likely(__wfe)) {						\
+		_val = __smp_cond_load_acquire_timewait(ptr, cond_expr,	\
+							time_expr_ns,	\
+							time_limit_ns);	\
+	} else {							\
+		_val = __smp_cond_load_relaxed_spinwait(ptr, cond_expr,	\
+							time_expr_ns,	\
+							time_limit_ns);	\
+		smp_acquire__after_ctrl_dep();				\
+	}								\
+	(typeof(*ptr))_val;						\
+})
+
+
 #include <asm-generic/barrier.h>
 
 #endif	/* __ASSEMBLY__ */
-- 
2.43.5



* Re: [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout()
  2025-02-03 21:49 [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout() Ankur Arora
                   ` (3 preceding siblings ...)
  2025-02-03 21:49 ` [PATCH 4/4] arm64: barrier: Add smp_cond_load_acquire_timewait() Ankur Arora
@ 2025-02-06 10:57 ` Kumar Kartikeya Dwivedi
  2025-02-18 21:48 ` Ankur Arora
  2025-03-03 21:28 ` Ankur Arora
  6 siblings, 0 replies; 17+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-02-06 10:57 UTC (permalink / raw)
  To: Ankur Arora
  Cc: bpf, Alexei Starovoitov, linux-kernel, linux-arch,
	linux-arm-kernel, arnd, catalin.marinas, will, peterz,
	mark.rutland, harisokn, cl, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

On Mon, 3 Feb 2025 at 22:49, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> Hi,
>
> This series adds waited variants of the smp_cond_load() primitives:
> smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
>
> [...]
>
> Any comments appreciated!
>

Thanks for splitting this and sending it out.
+cc bpf@ for visibility (please keep it in cc for subsequent versions).

>
>  [...]


* Re: [PATCH 4/4] arm64: barrier: Add smp_cond_load_acquire_timewait()
  2025-02-03 21:49 ` [PATCH 4/4] arm64: barrier: Add smp_cond_load_acquire_timewait() Ankur Arora
@ 2025-02-14 22:42   ` Okanovic, Haris
  2025-02-18 21:44     ` Ankur Arora
  0 siblings, 1 reply; 17+ messages in thread
From: Okanovic, Haris @ 2025-02-14 22:42 UTC (permalink / raw)
  To: linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	ankur.a.arora@oracle.com
  Cc: Okanovic, Haris, cl@gentwo.org, joao.m.martins@oracle.com,
	peterz@infradead.org, memxor@gmail.com, catalin.marinas@arm.com,
	arnd@arndb.de, will@kernel.org, mark.rutland@arm.com,
	konrad.wilk@oracle.com, zhenglifeng1@huawei.com,
	boris.ostrovsky@oracle.com

On Mon, 2025-02-03 at 13:49 -0800, Ankur Arora wrote:
>
> Add smp_cond_load_acquire_timewait(). This is substantially similar
> to smp_cond_load_acquire() where we use a load-acquire in the loop
> and avoid an smp_rmb() later.
> 
> To handle the unlikely case of the event-stream being unavailable,
> keep the implementation simple by falling back to the generic
> __smp_cond_load_relaxed_spinwait() with an smp_rmb() to follow
> (via smp_acquire__after_ctrl_dep().)
> 
> Cc: Will Deacon <will@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>
> [...]

Tested both relaxed and acquire variants on AWS Graviton (ARM64
Neoverse V1) with your V9 haltpoll changes, atop master 128c8f96eb.

Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>



* Re: [PATCH 4/4] arm64: barrier: Add smp_cond_load_acquire_timewait()
  2025-02-14 22:42   ` Okanovic, Haris
@ 2025-02-18 21:44     ` Ankur Arora
  0 siblings, 0 replies; 17+ messages in thread
From: Ankur Arora @ 2025-02-18 21:44 UTC (permalink / raw)
  To: Okanovic, Haris
  Cc: linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	ankur.a.arora@oracle.com, cl@gentwo.org,
	joao.m.martins@oracle.com, peterz@infradead.org, memxor@gmail.com,
	catalin.marinas@arm.com, arnd@arndb.de, will@kernel.org,
	mark.rutland@arm.com, konrad.wilk@oracle.com,
	zhenglifeng1@huawei.com, boris.ostrovsky@oracle.com


Okanovic, Haris <harisokn@amazon.com> writes:

> On Mon, 2025-02-03 at 13:49 -0800, Ankur Arora wrote:
>> Add smp_cond_load_acquire_timewait(). This is substantially similar
>> to smp_cond_load_acquire() where we use a load-acquire in the loop
>> and avoid an smp_rmb() later.
>>
>> [...]
>
> Tested both relaxed and acquire variants on AWS Graviton (ARM64
> Neoverse V1) with your V9 haltpoll changes, atop master 128c8f96eb.
>
> Reviewed-by: Haris Okanovic <harisokn@amazon.com>
> Tested-by: Haris Okanovic <harisokn@amazon.com>

That's great. Thanks Haris.

--
ankur


* Re: [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout()
  2025-02-03 21:49 [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout() Ankur Arora
                   ` (4 preceding siblings ...)
  2025-02-06 10:57 ` [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout() Kumar Kartikeya Dwivedi
@ 2025-02-18 21:48 ` Ankur Arora
  2025-03-03 21:28 ` Ankur Arora
  6 siblings, 0 replies; 17+ messages in thread
From: Ankur Arora @ 2025-02-18 21:48 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, arnd, catalin.marinas,
	will, peterz, mark.rutland, harisokn, cl, memxor, zhenglifeng1,
	joao.m.martins, boris.ostrovsky, konrad.wilk


Ankur Arora <ankur.a.arora@oracle.com> writes:

> Hi,
>
> This series adds waited variants of the smp_cond_load() primitives:
> smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
>
> There are two known users for these interfaces:
>
>  - poll_idle() [1]

The next version of the poll_idle() series (goes on top of this one):
 https://lore.kernel.org/lkml/20250218213337.377987-1-ankur.a.arora@oracle.com/

Ankur

>  - resilient queued spinlocks [2]
>
> [...]
>
> [2] https://lore.kernel.org/lkml/20250107140004.2732830-9-memxor@gmail.com/


--
ankur


* Re: [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout()
  2025-02-03 21:49 [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout() Ankur Arora
                   ` (5 preceding siblings ...)
  2025-02-18 21:48 ` Ankur Arora
@ 2025-03-03 21:28 ` Ankur Arora
  6 siblings, 0 replies; 17+ messages in thread
From: Ankur Arora @ 2025-03-03 21:28 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, arnd, catalin.marinas,
	will, peterz, mark.rutland, harisokn, cl, memxor, zhenglifeng1,
	joao.m.martins, boris.ostrovsky, konrad.wilk


Hi folks,

Gentle ping for reviews.

Next versions of users of this series:

Resilient Queued Spin Lock:
  https://lore.kernel.org/lkml/20250303152305.3195648-1-memxor@gmail.com/

poll_idle():
  https://lore.kernel.org/lkml/20250218213337.377987-1-ankur.a.arora@oracle.com/


Thanks
Ankur

Ankur Arora <ankur.a.arora@oracle.com> writes:

> Hi,
>
> This series adds waited variants of the smp_cond_load() primitives:
> smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
>
> [...]


* Re: [PATCH 1/4] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-02-03 21:49 ` [PATCH 1/4] asm-generic: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
@ 2025-03-04 19:15   ` Catalin Marinas
  2025-03-06  7:53     ` Ankur Arora
  2025-03-09  3:26     ` Ankur Arora
  0 siblings, 2 replies; 17+ messages in thread
From: Catalin Marinas @ 2025-03-04 19:15 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, arnd, will, peterz,
	mark.rutland, harisokn, cl, memxor, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

On Mon, Feb 03, 2025 at 01:49:08PM -0800, Ankur Arora wrote:
> Add smp_cond_load_relaxed_timewait(), a timed variant of
> smp_cond_load_relaxed(). This is useful for cases where we want to
> wait on a condition variable but don't want to wait indefinitely.

Bikeshedding: why not "timeout" rather than "timewait"?

> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..31de8ed2a05e 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -273,6 +273,54 @@ do {									\
>  })
>  #endif
>  
> +#ifndef smp_cond_time_check_count
> +/*
> + * Limit how often smp_cond_load_relaxed_timewait() evaluates time_expr_ns.
> + * This helps reduce the number of instructions executed while spin-waiting.
> + */
> +#define smp_cond_time_check_count	200
> +#endif

While this was indeed added to the poll_idle() loop, it feels completely
random in a generic implementation. It's highly dependent on the
time_expr_ns passed. Can the caller not move the loop in time_expr_ns
before invoking this macro?

> +
> +/**
> + * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
> + * guarantees until a timeout expires.
> + * @ptr: pointer to the variable to wait on
> + * @cond: boolean expression to wait for
> + * @time_expr_ns: evaluates to the current time
> + * @time_limit_ns: compared against time_expr_ns
> + *
> + * Equivalent to using READ_ONCE() on the condition variable.
> + *
> + * Note that the time check in time_expr_ns can be synchronous or
> + * asynchronous.
> + * In the generic version the check is synchronous but kept coarse
> + * to minimize instructions executed while spin-waiting.
> + */

Not sure exactly what synchronous vs asynchronous here mean. I see the
latter more like an interrupt. I guess what you have in mind is the WFE
wakeup events on arm64, though they don't interrupt the instruction
flow. I'd not bother specifying this at all.

> +#ifndef __smp_cond_load_relaxed_spinwait
> +#define __smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns,	\
> +					 time_limit_ns) ({		\
> +	typeof(ptr) __PTR = (ptr);					\
> +	__unqual_scalar_typeof(*ptr) VAL;				\
> +	unsigned int __count = 0;					\
> +	for (;;) {							\
> +		VAL = READ_ONCE(*__PTR);				\
> +		if (cond_expr)						\
> +			break;						\
> +		cpu_relax();						\
> +		if (__count++ < smp_cond_time_check_count)		\
> +			continue;					\
> +		if ((time_expr_ns) >= (time_limit_ns))			\
> +			break;						\
> +		__count = 0;						\
> +	}								\
> +	(typeof(*ptr))VAL;						\
> +})
> +#endif
> +
> +#ifndef smp_cond_load_relaxed_timewait
> +#define smp_cond_load_relaxed_timewait  __smp_cond_load_relaxed_spinwait
> +#endif

What I don't particularly like about this interface is (1) no idea of
what time granularity it offers, how much it can slip past the deadline,
even though there's some nanoseconds implied and (2) time_expr_ns leaves
the caller to figure out which time function to use for tracking the time.
Well, I can be ok with (2) if we make it a bit more generic.

The way it is written, I guess the type of the time expression and limit
no longer matters as long as you can compare them. The naming implies
nanoseconds but we don't have such precision, especially with the WFE
implementation for arm64. We could add a slack range argument like the
delta_ns for some of the hrtimer API and let the arch code decide
whether to honour it.

What about we drop the time_limit_ns and build it into the time_expr_ns
as a 'time_cond' argument? The latter would return the result of some
comparison and the loop bails out if true. An additional argument would
be the minimum granularity for checking the time_cond and the arch code
may decide to fall back to busy loops if the granularity is larger than
what the caller required.
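
For illustration only, that shape might look something like this
(lock_word, deadline_ns and the granularity value are assumed):

  /* Bail out once time_cond is true; the arch code checks it at
   * least every 100us, or falls back to a busy loop if it cannot
   * honour that granularity. */
  val = smp_cond_load_relaxed_timewait(&lock_word, !VAL,
                                       local_clock() >= deadline_ns,
                                       100 * NSEC_PER_USEC);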

-- 
Catalin


* Re: [PATCH 3/4] arm64: barrier: Add smp_cond_load_relaxed_timewait()
  2025-02-03 21:49 ` [PATCH 3/4] arm64: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
@ 2025-03-04 19:29   ` Catalin Marinas
  2025-03-06  7:58     ` Ankur Arora
  0 siblings, 1 reply; 17+ messages in thread
From: Catalin Marinas @ 2025-03-04 19:29 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-arch, linux-arm-kernel, arnd, will, peterz,
	mark.rutland, harisokn, cl, memxor, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

On Mon, Feb 03, 2025 at 01:49:10PM -0800, Ankur Arora wrote:
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 1ca947d5c939..25721275a5a2 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -216,6 +216,44 @@ do {									\
>  	(typeof(*ptr))VAL;						\
>  })
>  
> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
> +					 time_expr_ns, time_limit_ns)	\
> +({									\
> +	typeof(ptr) __PTR = (ptr);					\
> +	__unqual_scalar_typeof(*ptr) VAL;				\
> +	for (;;) {							\
> +		VAL = READ_ONCE(*__PTR);				\
> +		if (cond_expr)						\
> +			break;						\
> +		__cmpwait_relaxed(__PTR, VAL);				\
> +		if ((time_expr_ns) >= (time_limit_ns))			\
> +			break;						\
> +	}								\
> +	(typeof(*ptr))VAL;						\
> +})

Rename this to something like *_evstrm as this doesn't really work
unless we have the event stream. Another one would be *_wfet.

> +
> +/*
> + * For the unlikely case that the event-stream is unavailable,
> + * ward off the possibility of waiting forever by falling back
> + * to the generic spin-wait.
> + */
> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr,			\
> +				       time_expr_ns, time_limit_ns)	\
> +({									\
> +	__unqual_scalar_typeof(*ptr) _val;				\
> +	int __wfe = arch_timer_evtstrm_available();			\

This should be a bool.

> +									\
> +	if (likely(__wfe))						\
> +		_val = __smp_cond_load_relaxed_timewait(ptr, cond_expr,	\
> +							time_expr_ns,	\
> +							time_limit_ns);	\
> +	else								\
> +		_val = __smp_cond_load_relaxed_spinwait(ptr, cond_expr,	\
> +							time_expr_ns,	\
> +							time_limit_ns);	\
> +	(typeof(*ptr))_val;						\
> +})

Not sure there's much to say here, this depends on the actual interface
introduced by patch 1. If we make some statements about granularity of
some time_cond_expr check, we'll have to take that into account.

-- 
Catalin


* Re: [PATCH 1/4] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-03-04 19:15   ` Catalin Marinas
@ 2025-03-06  7:53     ` Ankur Arora
  2025-03-11  8:48       ` Kumar Kartikeya Dwivedi
  2025-03-09  3:26     ` Ankur Arora
  1 sibling, 1 reply; 17+ messages in thread
From: Ankur Arora @ 2025-03-06  7:53 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, arnd,
	will, peterz, mark.rutland, harisokn, cl, memxor, zhenglifeng1,
	joao.m.martins, boris.ostrovsky, konrad.wilk


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Mon, Feb 03, 2025 at 01:49:08PM -0800, Ankur Arora wrote:
>> Add smp_cond_load_relaxed_timewait(), a timed variant of
>> smp_cond_load_relaxed(). This is useful for cases where we want to
>> wait on a condition variable but don't want to wait indefinitely.
>
> Bikeshedding: why not "timeout" rather than "timewait"?

Well my reasons, such as they are, also involved a fair amount of bikeshedding:

 - timewait and spinwait have same-length names, which just minimizes
   all the indentation issues.
 - timeout seems to suggest a timing mechanism of some kind.

>> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
>> index d4f581c1e21d..31de8ed2a05e 100644
>> --- a/include/asm-generic/barrier.h
>> +++ b/include/asm-generic/barrier.h
>> @@ -273,6 +273,54 @@ do {									\
>>  })
>>  #endif
>>
>> +#ifndef smp_cond_time_check_count
>> +/*
>> + * Limit how often smp_cond_load_relaxed_timewait() evaluates time_expr_ns.
>> + * This helps reduce the number of instructions executed while spin-waiting.
>> + */
>> +#define smp_cond_time_check_count	200
>> +#endif
>
> While this was indeed added to the poll_idle() loop, it feels completely
> random in a generic implementation. It's highly dependent on the
> time_expr_ns passed.

I agree about this feeling quite out of place. For one thing, this is
exposed unnecessarily to the arch code.

> Can the caller not move the loop in time_expr_ns
> before invoking this macro?

You mean add a loop count conditional of some kind in the time_cond_expr?

Might need to change the direction of some of the checks, but let me
play with that. (The one below won't work.)

Even if we could do that though, it seems that how often the time-check
is done should be an internal detail of the interface. Seems ugly to expose
this to the caller:

    flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
                                           (VAL & _TIF_NEED_RESCHED),
                                           (count++ % POLL_IDLE_RELAX_COUNT) ||
                                           (local_clock_noinstr() >= time_limit));

(Anyway more on this below.)

>> +
>> +/**
>> + * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
>> + * guarantees until a timeout expires.
>> + * @ptr: pointer to the variable to wait on
>> + * @cond: boolean expression to wait for
>> + * @time_expr_ns: evaluates to the current time
>> + * @time_limit_ns: compared against time_expr_ns
>> + *
>> + * Equivalent to using READ_ONCE() on the condition variable.
>> + *
>> + * Note that the time check in time_expr_ns can be synchronous or
>> + * asynchronous.
>> + * In the generic version the check is synchronous but kept coarse
>> + * to minimize instructions executed while spin-waiting.
>> + */
>
> Not sure exactly what synchronous vs asynchronous here mean. I see the
> latter more like an interrupt. I guess what you have in mind is the WFE
> wakeup events on arm64, though they don't interrupt the instruction
> flow. I'd not bother specifying this at all.

Yeah, with fresh eyes this feels quite unnecessary.

 In the generic version the time check is done in the loop but kept coarse
 to minimize instructions executed while spin-waiting. Architecture code
 might implement this without spin-waiting.

>> +#ifndef __smp_cond_load_relaxed_spinwait
>> +#define __smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns,	\
>> +					 time_limit_ns) ({		\
>> +	typeof(ptr) __PTR = (ptr);					\
>> +	__unqual_scalar_typeof(*ptr) VAL;				\
>> +	unsigned int __count = 0;					\
>> +	for (;;) {							\
>> +		VAL = READ_ONCE(*__PTR);				\
>> +		if (cond_expr)						\
>> +			break;						\
>> +		cpu_relax();						\
>> +		if (__count++ < smp_cond_time_check_count)		\
>> +			continue;					\
>> +		if ((time_expr_ns) >= (time_limit_ns))			\
>> +			break;						\
>> +		__count = 0;						\
>> +	}								\
>> +	(typeof(*ptr))VAL;						\
>> +})
>> +#endif
>> +
>> +#ifndef smp_cond_load_relaxed_timewait
>> +#define smp_cond_load_relaxed_timewait  __smp_cond_load_relaxed_spinwait
>> +#endif
>
> What I don't particularly like about this interface is (1) no idea of
> what time granularity it offers, how much it can slip past the deadline,
> even though there's some nanoseconds implied and

So, one problem with that is that the time granularity probably varies
wildly with the implementation and some don't have a clear idea
of the granularity at all:

- generic version (x86): does not have a well defined granularity. The
  assumption is that calling cpu_relax() is "fast" but at least for how
  it is used via poll_idle(), each iteration is at least around 2k-4k
  cycles. (Also depends on the particular CPU model.)

- arm64 WFE version: well defined, potentially query-able granularity
  (say 100us)

- a future arm64 WFET version: similar to the WFE version with a
  smaller granularity.

- generic code (arm64 fallback): granularity not well defined. For
  implementations that implement YIELD as a NOP, it's probably
  extremely fast and hot.

> (2) time_expr_ns leaves
> the caller to figure out which time function to use for tracking the time.
> Well, I can be ok with (2) if we make it a bit more generic.

> The way it is written, I guess the type of the time expression and limit
> no longer matters as long as you can compare them. The naming implies
> nanoseconds but we don't have such precision, especially with the WFE
> implementation for arm64. We could add a slack range argument like the
> delta_ns for some of the hrtimer API and let the arch code decide
> whether to honour it.

Yeah and as you say, none of the code actually cares for the unit so
that _ns stuff is unnecessary.

> What about we drop the time_limit_ns and build it into the time_expr_ns
> as a 'time_cond' argument? The latter would return the result of some
> comparison and the loop bails out if true.

Yeah the interface can't actually do anything with the time_expr_ns and
time_limit_ns anyway so just using a combined time_cond seems like a good
idea.

But more below.

> An additional argument would
> be the minimum granularity for checking the time_cond and the arch code
> may decide to fall back to busy loops if the granularity is larger than
> what the caller required.

As you mention, there are two problems with the time-check:
constraints on how much it can overflow and how much tardiness we can
bring to bear on the actual evaluation of the presumed expensive
time-check (the smp_cond_time_check_count).

Both of these are conceptually related but the first one only really
applies to WFE like implementations, and the second only to spinning
implementations.

I think some kind of a granularity parameter might help with both of
these.

Taking a step back, we have the following time related parameters:

  1. time_expr
  2. time_limit

Or (1) and (2) combined via time_cond.

And, these are some different ways to put constraints on the time-check:

  3. time_slack: int, representing some kind of limiting value for the
     overflow (a percentage, order of magnitude, etc.). This needs a
     visible value for (2).

  4. time_check_coarse: bool, representing coarse or fine.
  5. time_granularity: int, specifying the granularity (say
     1/10/100/1000 us) the caller wants.

     This allows the caller to dynamically choose a particular value
     based on (2). poll_idle(), for instance, would evaluate
     cpuidle_poll_time() to decide what value to use.

IMO (3) and (5) both have more precision than either of the known
users seems to need:
 - poll_idle(): the only thing waiting for it on the other side of an
   expired timeout is that it goes to a deeper idle state. So, it should
   be fine to be a bit late.

 - resilient spinlock timeout: current timeout value is 0.25s
   (RES_DEF_TIMEOUT [1]) which is pretty large.

So, instead it might be enough for the users to just specify what kind
of timeout or time-check they want (specifying coarse/fine via (4)) and
let the interface use that to handle its overflow and the tardiness
constraints.

With that the generic version could be:

#define __smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_cond,     \
                                         time_check_coarse) ({          \
        typeof(ptr) __PTR = (ptr);                                      \
        __unqual_scalar_typeof(*ptr) VAL;                               \
        unsigned int __count = 0;                                       \
        unsigned int __time_check_count = 200;                          \
                                                                        \
        if (!time_check_coarse)                                         \
                __time_check_count /= 4;                                \
                                                                        \
        for (;;) {                                                      \
                VAL = READ_ONCE(*__PTR);                                \
                if (cond_expr)                                          \
                        break;                                          \
                cpu_relax();                                            \
                if (__count++ < __time_check_count)                     \
                        continue;                                       \
                if (time_cond)                                          \
                        break;                                          \
                __count = 0;                                            \
        }                                                               \
        (typeof(*ptr))VAL;                                              \
})

(The constant is still arbitrary, but it seems to fit in the broad
coarse/fine outlines.)

And the arm64 version:

#define smp_cond_load_relaxed_timewait(ptr, cond_expr,                  \
                                       time_cond, time_check_coarse)    \
({                                                                      \
        __unqual_scalar_typeof(*ptr) _val;                              \
        bool __wfe = arch_timer_evtstrm_available();                    \
                                                                        \
        if (time_check_coarse && __wfe)                                 \
                _val = __smp_cond_load_relaxed_timewait(ptr, cond_expr, \
                                                        time_cond);     \
        else                                                            \
                _val = __smp_cond_load_relaxed_spinwait(ptr, cond_expr, \
                                                        time_cond,      \
                                                        time_check_coarse); \
        (typeof(*ptr))_val;                                             \
})


What do you think?

Oh and thanks for the very helpful review!

[1] https://lore.kernel.org/lkml/20250303152305.3195648-8-memxor@gmail.com/

--
ankur


* Re: [PATCH 3/4] arm64: barrier: Add smp_cond_load_relaxed_timewait()
  2025-03-04 19:29   ` Catalin Marinas
@ 2025-03-06  7:58     ` Ankur Arora
  0 siblings, 0 replies; 17+ messages in thread
From: Ankur Arora @ 2025-03-06  7:58 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, arnd,
	will, peterz, mark.rutland, harisokn, cl, memxor, zhenglifeng1,
	joao.m.martins, boris.ostrovsky, konrad.wilk


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Mon, Feb 03, 2025 at 01:49:10PM -0800, Ankur Arora wrote:
>> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
>> index 1ca947d5c939..25721275a5a2 100644
>> --- a/arch/arm64/include/asm/barrier.h
>> +++ b/arch/arm64/include/asm/barrier.h
>> @@ -216,6 +216,44 @@ do {									\
>>  	(typeof(*ptr))VAL;						\
>>  })
>>
>> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
>> +					 time_expr_ns, time_limit_ns)	\
>> +({									\
>> +	typeof(ptr) __PTR = (ptr);					\
>> +	__unqual_scalar_typeof(*ptr) VAL;				\
>> +	for (;;) {							\
>> +		VAL = READ_ONCE(*__PTR);				\
>> +		if (cond_expr)						\
>> +			break;						\
>> +		__cmpwait_relaxed(__PTR, VAL);				\
>> +		if ((time_expr_ns) >= (time_limit_ns))			\
>> +			break;						\
>> +	}								\
>> +	(typeof(*ptr))VAL;						\
>> +})
>
> Rename this to something like *_evstrm as this doesn't really work
> unless we have the event stream.

Ack.

> Another one would be *_wfet.

Hadn't sent out the WFET version yet.

Did you mean that this should be *_evtstrm or *_wfet?

>> +
>> +/*
>> + * For the unlikely case that the event-stream is unavailable,
>> + * ward off the possibility of waiting forever by falling back
>> + * to the generic spin-wait.
>> + */
>> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr,			\
>> +				       time_expr_ns, time_limit_ns)	\
>> +({									\
>> +	__unqual_scalar_typeof(*ptr) _val;				\
>> +	int __wfe = arch_timer_evtstrm_available();			\
>
> This should be a bool.

Yeah. Will fix.

>> +									\
>> +	if (likely(__wfe))						\
>> +		_val = __smp_cond_load_relaxed_timewait(ptr, cond_expr,	\
>> +							time_expr_ns,	\
>> +							time_limit_ns);	\
>> +	else								\
>> +		_val = __smp_cond_load_relaxed_spinwait(ptr, cond_expr,	\
>> +							time_expr_ns,	\
>> +							time_limit_ns);	\
>> +	(typeof(*ptr))_val;						\
>> +})
>
> Not sure there's much to say here, this depends on the actual interface
> introduced by patch 1. If we make some statements about granularity of
> some time_cond_expr check, we'll have to take that into account.

Agreed.

Thanks for the review!

--
ankur

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-03-04 19:15   ` Catalin Marinas
  2025-03-06  7:53     ` Ankur Arora
@ 2025-03-09  3:26     ` Ankur Arora
  1 sibling, 0 replies; 17+ messages in thread
From: Ankur Arora @ 2025-03-09  3:26 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-kernel, linux-arch, linux-arm-kernel, arnd,
	will, peterz, mark.rutland, harisokn, cl, memxor, zhenglifeng1,
	joao.m.martins, boris.ostrovsky, konrad.wilk


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Mon, Feb 03, 2025 at 01:49:08PM -0800, Ankur Arora wrote:
>> Add smp_cond_load_relaxed_timewait(), a timed variant of
>> smp_cond_load_relaxed(). This is useful for cases where we want to
>> wait on a conditional variable but don't want to wait indefinitely.

[ ... ]

> What I don't particularly like about this interface is (1) no idea of
> what time granularity it offers, how much it can slip past the deadline,
> even though there's some nanoseconds implied and (2) time_expr_ns leaves
> the caller to figure out why time function to use for tracking the time.
> Well, I can be ok with (2) if we make it a bit more generic.

So I thought about it some more, and on (1), I still feel that an
overly specific granularity value might not be ideal:

Consider the case where cpuidle_poll_time() gives 110us, and we derive a
minimum granularity argument of 100us from that.

On arm64, with an event stream period of 100us, that might mean that we
spend part of the time taking the WFE path, and the remainder in the
cpu_relax() portion.

Now, depending on how this call aligns with the event-stream, this
might mean we spend anywhere from 100us down to 10us in WFE and the
complementary 10us to 100us in the cpu_relax() loop.

At least for poll_idle() power consumption, this seems bad.

If, however, the granularity were not an actual time value but just a
representation of what the caller can tolerate (as I argue in the
other mail), then poll_idle() could simply specify granularity=coarse
and we would always take the WFE path.

But maybe you were thinking of other cases where an integer granularity
value might be useful?
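
Concretely, something like the below is what I have in mind (the enum
and the names in it are entirely made up for illustration):

enum smp_timewait_gran {
	SMP_TIMEWAIT_COARSE,	/* tolerate event-stream (~100us) overshoot */
	SMP_TIMEWAIT_FINE,	/* need a tighter bound on the overshoot */
};

#define smp_cond_load_relaxed_timewait(ptr, cond_expr, time_cond, gran) \
({									\
	__unqual_scalar_typeof(*ptr) _val;				\
	bool __wfe = arch_timer_evtstrm_available();			\
									\
	if ((gran) == SMP_TIMEWAIT_COARSE && __wfe)			\
		_val = __smp_cond_load_relaxed_timewait(ptr,		\
					cond_expr, time_cond);		\
	else								\
		_val = __smp_cond_load_relaxed_spinwait(ptr,		\
					cond_expr, time_cond, gran);	\
	(typeof(*ptr))_val;						\
})

poll_idle() would pass SMP_TIMEWAIT_COARSE and, with the event stream
available, always sit in WFE; a caller that can't tolerate that much
slack would pass SMP_TIMEWAIT_FINE and spin.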

[...]
> We could add a slack range argument like the
> delta_ns for some of the hrtimer API and let the arch code decide
> whether to honour it.

If there are cases like that, then a slack range could serve both
purposes, with slack-tolerant users like idle/resilient spinlocks
specifying a large value and more constrained users specifying a
lower one.

However, I don't see how arch code can choose not to honour this.
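
For the generic spin-wait at least, honouring the slack could be as
simple as scaling how often we evaluate the (potentially expensive)
time check. A rough sketch, with made-up thresholds:

#define __smp_cond_load_relaxed_spinwait(ptr, cond_expr,		\
					 time_expr_ns, time_limit_ns,	\
					 slack_ns)			\
({									\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	unsigned int __count = 0;					\
	/* a larger slack budget lets us poll the clock less often */	\
	unsigned int __poll_mask = (slack_ns) > 10000 ? 1023 : 15;	\
									\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
		if (++__count & __poll_mask)				\
			continue;					\
		if ((time_expr_ns) >= (time_limit_ns))			\
			break;						\
	}								\
	(typeof(*ptr))VAL;						\
})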


Thanks

--
ankur

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-03-06  7:53     ` Ankur Arora
@ 2025-03-11  8:48       ` Kumar Kartikeya Dwivedi
  2025-03-12  6:34         ` Ankur Arora
  0 siblings, 1 reply; 17+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-03-11  8:48 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Catalin Marinas, linux-kernel, linux-arch, linux-arm-kernel, arnd,
	will, peterz, mark.rutland, harisokn, cl, zhenglifeng1,
	joao.m.martins, boris.ostrovsky, konrad.wilk, Alexei Starovoitov

On Thu, 6 Mar 2025 at 08:53, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>
> Catalin Marinas <catalin.marinas@arm.com> writes:
>
> > On Mon, Feb 03, 2025 at 01:49:08PM -0800, Ankur Arora wrote:
> >> Add smp_cond_load_relaxed_timewait(), a timed variant of
> >> smp_cond_load_relaxed(). This is useful for cases where we want to
> >> wait on a conditional variable but don't want to wait indefinitely.
> >
> > Bikeshedding: why not "timeout" rather than "timewait"?
>
> Well my reasons, such as they are, also involved a fair amount of bikeshedding:
>
>  - timewait and spinwait have same length names which just minimized all
>    the indentation issues.
>  - timeout seems to suggest a timing mechanism of some kind.

I would also be in favor of the timewait naming: since this is not a
generic timeout primitive, the distinct name helps set it apart.
On arm64 the wait can overshoot by 100us or so when we need to break
out, but that's tolerable for some use cases.
I've also carried a copy of the primitive in [0] so that it can switch
to the in-tree implementation once this is merged.

  [0]: https://lore.kernel.org/bpf/20250303152305.3195648-9-memxor@gmail.com

>
> [...]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  2025-03-11  8:48       ` Kumar Kartikeya Dwivedi
@ 2025-03-12  6:34         ` Ankur Arora
  0 siblings, 0 replies; 17+ messages in thread
From: Ankur Arora @ 2025-03-12  6:34 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Ankur Arora, Catalin Marinas, linux-kernel, linux-arch,
	linux-arm-kernel, arnd, will, peterz, mark.rutland, harisokn, cl,
	zhenglifeng1, joao.m.martins, boris.ostrovsky, konrad.wilk,
	Alexei Starovoitov


Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Thu, 6 Mar 2025 at 08:53, Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>
>> Catalin Marinas <catalin.marinas@arm.com> writes:
>>
>> > On Mon, Feb 03, 2025 at 01:49:08PM -0800, Ankur Arora wrote:
>> >> Add smp_cond_load_relaxed_timewait(), a timed variant of
>> >> smp_cond_load_relaxed(). This is useful for cases where we want to
>> >> wait on a conditional variable but don't want to wait indefinitely.
>> >
>> > Bikeshedding: why not "timeout" rather than "timewait"?
>>
>> Well my reasons, such as they are, also involved a fair amount of bikeshedding:
>>
>>  - timewait and spinwait have same length names which just minimized all
>>    the indentation issues.
>>  - timeout seems to suggest a timing mechanism of some kind.
>
> I would also be in favor of the timewait naming: since this is not a
> generic timeout primitive, the distinct name helps set it apart.
> On arm64 the wait can overshoot by 100us or so when we need to break
> out, but that's tolerable for some use cases.
> I've also carried a copy of the primitive in [0] so that it can switch
> to the in-tree implementation once this is merged.

Sounds good. I'll keep the timewait naming for the next version.
I'm on vacation until the 23rd so will send it out the week after.

The current plan is to address Catalin's comments and to make sure
that any fuzziness around the timeout is visible to the caller.

Ankur

>   [0]: https://lore.kernel.org/bpf/20250303152305.3195648-9-memxor@gmail.com
>
>>
>> [...]


--
ankur

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-03-12  6:35 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-03 21:49 [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout() Ankur Arora
2025-02-03 21:49 ` [PATCH 1/4] asm-generic: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
2025-03-04 19:15   ` Catalin Marinas
2025-03-06  7:53     ` Ankur Arora
2025-03-11  8:48       ` Kumar Kartikeya Dwivedi
2025-03-12  6:34         ` Ankur Arora
2025-03-09  3:26     ` Ankur Arora
2025-02-03 21:49 ` [PATCH 2/4] asm-generic: barrier: Add smp_cond_load_acquire_timewait() Ankur Arora
2025-02-03 21:49 ` [PATCH 3/4] arm64: barrier: Add smp_cond_load_relaxed_timewait() Ankur Arora
2025-03-04 19:29   ` Catalin Marinas
2025-03-06  7:58     ` Ankur Arora
2025-02-03 21:49 ` [PATCH 4/4] arm64: barrier: Add smp_cond_load_acquire_timewait() Ankur Arora
2025-02-14 22:42   ` Okanovic, Haris
2025-02-18 21:44     ` Ankur Arora
2025-02-06 10:57 ` [PATCH 0/4] barrier: Introduce smp_cond_load_*_timeout() Kumar Kartikeya Dwivedi
2025-02-18 21:48 ` Ankur Arora
2025-03-03 21:28 ` Ankur Arora
