Linux-ARM-Kernel Archive on lore.kernel.org
* Re: [PATCH v8 00/11] Enable haltpoll on arm64
       [not found] <20240925232425.2763385-1-ankur.a.arora@oracle.com>
@ 2024-11-05 18:30 ` Haris Okanovic
  2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
                     ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Haris Okanovic @ 2024-11-05 18:30 UTC (permalink / raw)
  To: ankur.a.arora, catalin.marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

Hi Ankur, Catalin,

How about the following series based on a refactor of arm64's delay()?
Does it address your earlier concerns?

delay() already implements wfet() and falls back to wfe() with the event
stream, or to a cpu_relax() loop. I refactored it to poll an address, and
wrapped it in a new platform-agnostic smp_vcond_load_relaxed() macro. More
details in the commit logs below.
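
For reference, a condensed sketch of the resulting structure (illustrative
only; see patch 3 for the real code):

```c
/* Illustrative sketch of the refactored loop, not the exact patch code. */
static unsigned long poll_until(u64 cycles, unsigned long *addr,
				unsigned long mask, unsigned long val)
{
	u64 start = get_cycles();
	unsigned long cur;

	do {
		cur = __READ_ONCE_EX(*addr);	/* load, arm exclusive monitor */
		if ((cur & mask) == val)
			break;
		if (alternative_has_cap_unlikely(ARM64_HAS_WFXT))
			wfet(start + cycles);	/* hardware deadline */
		else if (arch_timer_evtstrm_available())
			wfe();			/* periodic event-stream wakeup */
		else
			cpu_relax();		/* plain busy loop */
	} while ((get_cycles() - start) < cycles);

	return cur;
}
```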

Regards,
Haris Okanovic
AWS Graviton Software



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-05 18:30 ` [PATCH v8 00/11] Enable haltpoll on arm64 Haris Okanovic
@ 2024-11-05 18:30   ` Haris Okanovic
  2024-11-05 19:36     ` Christoph Lameter (Ampere)
                       ` (2 more replies)
  2024-11-05 18:30   ` [PATCH 2/5] arm64: add __READ_ONCE_EX() Haris Okanovic
                     ` (4 subsequent siblings)
  5 siblings, 3 replies; 25+ messages in thread
From: Haris Okanovic @ 2024-11-05 18:30 UTC (permalink / raw)
  To: ankur.a.arora, catalin.marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

Relaxed poll until the desired mask/value is observed at the specified
address or the timeout expires.

This macro is a specialization of the generic smp_cond_load_relaxed(),
which takes a simple mask/value condition (vcond) instead of an
arbitrary expression. It allows architectures to better specialize the
implementation, e.g. to enable wfe() polling of the address on arm.
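
For illustration, a hypothetical call site (READY, *flags and wait_ready()
are made-up names, not part of this patch):

```c
#define READY	BIT(0)	/* hypothetical flag bit */

static bool wait_ready(unsigned long *flags)
{
	/* Wait up to 100us for another CPU to set READY in *flags. */
	unsigned long seen = smp_vcond_load_relaxed(100 * NSEC_PER_USEC,
						    flags, READY, READY);
	return (seen & READY) == READY;	/* false: timed out */
}
```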

Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
 include/asm-generic/barrier.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..112027eabbfc 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -256,6 +256,31 @@ do {									\
 })
 #endif
 
+/**
+ * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
+ * with no ordering guarantees. Spins until `(*addr & mask) == val` or
+ * `nsecs` elapse, and returns the last observed `*addr` value.
+ *
+ * @nsecs: timeout in nanoseconds
+ * @addr: pointer to an integer
+ * @mask: a bit mask applied to read values
+ * @val: Expected value with mask
+ */
+#ifndef smp_vcond_load_relaxed
+#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({	\
+	const u64 __start = local_clock_noinstr();		\
+	u64 __nsecs = (nsecs);					\
+	typeof(addr) __addr = (addr);				\
+	typeof(*__addr) __mask = (mask);			\
+	typeof(*__addr) __val = (val);				\
+	typeof(*__addr) __cur;					\
+	smp_cond_load_relaxed(__addr, (				\
+		(VAL & __mask) == __val ||			\
+		local_clock_noinstr() - __start > __nsecs	\
+	));							\
+})
+#endif
+
 /**
  * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
  * @ptr: pointer to the variable to wait on
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/5] arm64: add __READ_ONCE_EX()
  2024-11-05 18:30 ` [PATCH v8 00/11] Enable haltpoll on arm64 Haris Okanovic
  2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
@ 2024-11-05 18:30   ` Haris Okanovic
  2024-11-05 19:39     ` Christoph Lameter (Ampere)
                       ` (2 more replies)
  2024-11-05 18:30   ` [PATCH 3/5] arm64: refactor delay() to enable polling for value Haris Okanovic
                     ` (3 subsequent siblings)
  5 siblings, 3 replies; 25+ messages in thread
From: Haris Okanovic @ 2024-11-05 18:30 UTC (permalink / raw)
  To: ankur.a.arora, catalin.marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

Perform an exclusive load, which atomically loads a word and arms the
exclusive monitor to enable wfet()/wfe() accelerated polling.

https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-accesses/exclusive-monitors
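
For illustration, the intended pairing looks roughly like this (a sketch
only; wait_for_update() is a hypothetical name, and patch 3 applies this
pattern inside delay()):

```c
/* Sketch: poll one iteration, sleeping until *addr is written. */
static inline unsigned long wait_for_update(unsigned long *addr,
					    unsigned long mask,
					    unsigned long val)
{
	unsigned long cur;

	cur = __READ_ONCE_EX(*addr);	/* LDAXR: load and arm the monitor */
	if ((cur & mask) != val)
		wfe();	/* wakes when the monitored cache line is written */
	return cur;
}
```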

Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
 arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)
 create mode 100644 arch/arm64/include/asm/readex.h

diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
new file mode 100644
index 000000000000..51963c3107e1
--- /dev/null
+++ b/arch/arm64/include/asm/readex.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Based on arch/arm64/include/asm/rwonce.h
+ *
+ * Copyright (C) 2020 Google LLC.
+ * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#ifndef __ASM_READEX_H
+#define __ASM_READEX_H
+
+#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
+
+#define __READ_ONCE_EX(x)						\
+({									\
+	typeof(&(x)) __x = &(x);					\
+	int atomic = 1;							\
+	union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u;	\
+	switch (sizeof(x)) {						\
+	case 1:								\
+		asm volatile(__LOAD_EX(b, %w0, %1)			\
+			: "=r" (*(__u8 *)__u.__c)			\
+			: "Q" (*__x) : "memory");			\
+		break;							\
+	case 2:								\
+		asm volatile(__LOAD_EX(h, %w0, %1)			\
+			: "=r" (*(__u16 *)__u.__c)			\
+			: "Q" (*__x) : "memory");			\
+		break;							\
+	case 4:								\
+		asm volatile(__LOAD_EX(, %w0, %1)			\
+			: "=r" (*(__u32 *)__u.__c)			\
+			: "Q" (*__x) : "memory");			\
+		break;							\
+	case 8:								\
+		asm volatile(__LOAD_EX(, %0, %1)			\
+			: "=r" (*(__u64 *)__u.__c)			\
+			: "Q" (*__x) : "memory");			\
+		break;							\
+	default:							\
+		atomic = 0;						\
+	}								\
+	atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\
+})
+
+#endif
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/5] arm64: refactor delay() to enable polling for value
  2024-11-05 18:30 ` [PATCH v8 00/11] Enable haltpoll on arm64 Haris Okanovic
  2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
  2024-11-05 18:30   ` [PATCH 2/5] arm64: add __READ_ONCE_EX() Haris Okanovic
@ 2024-11-05 18:30   ` Haris Okanovic
  2024-11-05 19:42     ` Christoph Lameter (Ampere)
  2024-11-06  9:18     ` Catalin Marinas
  2024-11-05 18:30   ` [PATCH 4/5] arm64: add smp_vcond_load_relaxed() Haris Okanovic
                     ` (2 subsequent siblings)
  5 siblings, 2 replies; 25+ messages in thread
From: Haris Okanovic @ 2024-11-05 18:30 UTC (permalink / raw)
  To: ankur.a.arora, catalin.marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

Refactor arm64's delay() to poll for a mask/value condition (vcond) in
its wfet(), wfe(), and relaxed polling loops.

Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
 arch/arm64/lib/delay.c | 70 ++++++++++++++++++++++++++++++------------
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index cb2062e7e234..a7c3040af316 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -14,43 +14,73 @@
 #include <linux/timex.h>
 
 #include <clocksource/arm_arch_timer.h>
+#include <asm/readex.h>
 
-#define USECS_TO_CYCLES(time_usecs)			\
-	xloops_to_cycles((time_usecs) * 0x10C7UL)
-
-static inline unsigned long xloops_to_cycles(unsigned long xloops)
+static inline u64 xloops_to_cycles(u64 xloops)
 {
 	return (xloops * loops_per_jiffy * HZ) >> 32;
 }
 
-void __delay(unsigned long cycles)
+#define USECS_TO_XLOOPS(time_usecs) \
+	((time_usecs) * 0x10C7UL)
+
+#define USECS_TO_CYCLES(time_usecs) \
+	xloops_to_cycles(USECS_TO_XLOOPS(time_usecs))
+
+#define NSECS_TO_XLOOPS(time_nsecs) \
+	((time_nsecs) * 0x10C7UL)
+
+#define NSECS_TO_CYCLES(time_nsecs) \
+	xloops_to_cycles(NSECS_TO_XLOOPS(time_nsecs))
+
+static unsigned long __delay_until_ul(u64 cycles, unsigned long* addr, unsigned long mask, unsigned long val)
 {
-	cycles_t start = get_cycles();
+	u64 start = get_cycles();
+	unsigned long cur;
 
 	if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
 		u64 end = start + cycles;
 
-		/*
-		 * Start with WFIT. If an interrupt makes us resume
-		 * early, use a WFET loop to complete the delay.
-		 */
-		wfit(end);
-		while ((get_cycles() - start) < cycles)
+		do {
+			cur = __READ_ONCE_EX(*addr);
+			if ((cur & mask) == val) {
+				break;
+			}
 			wfet(end);
-	} else 	if (arch_timer_evtstrm_available()) {
-		const cycles_t timer_evt_period =
+		} while ((get_cycles() - start) < cycles);
+	} else if (arch_timer_evtstrm_available()) {
+		const u64 timer_evt_period =
 			USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);
 
-		while ((get_cycles() - start + timer_evt_period) < cycles)
+		do {
+			cur = __READ_ONCE_EX(*addr);
+			if ((cur & mask) == val) {
+				break;
+			}
 			wfe();
+		} while ((get_cycles() - start + timer_evt_period) < cycles);
+	} else {
+		do {
+			cur = __READ_ONCE_EX(*addr);
+			if ((cur & mask) == val) {
+				break;
+			}
+			cpu_relax();
+		} while ((get_cycles() - start) < cycles);
 	}
 
-	while ((get_cycles() - start) < cycles)
-		cpu_relax();
+	return cur;
+}
+
+void __delay(unsigned long cycles)
+{
+	/* constant word for wfet()/wfe() to poll */
+	unsigned long dummy ____cacheline_aligned = 0;
+	__delay_until_ul(cycles, &dummy, 0, 1);
 }
 EXPORT_SYMBOL(__delay);
 
-inline void __const_udelay(unsigned long xloops)
+void __const_udelay(unsigned long xloops)
 {
 	__delay(xloops_to_cycles(xloops));
 }
@@ -58,12 +88,12 @@ EXPORT_SYMBOL(__const_udelay);
 
 void __udelay(unsigned long usecs)
 {
-	__const_udelay(usecs * 0x10C7UL); /* 2**32 / 1000000 (rounded up) */
+	__delay(USECS_TO_CYCLES(usecs));
 }
 EXPORT_SYMBOL(__udelay);
 
 void __ndelay(unsigned long nsecs)
 {
-	__const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
+	__delay(NSECS_TO_CYCLES(nsecs));
 }
 EXPORT_SYMBOL(__ndelay);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/5] arm64: add smp_vcond_load_relaxed()
  2024-11-05 18:30 ` [PATCH v8 00/11] Enable haltpoll on arm64 Haris Okanovic
                     ` (2 preceding siblings ...)
  2024-11-05 18:30   ` [PATCH 3/5] arm64: refactor delay() to enable polling for value Haris Okanovic
@ 2024-11-05 18:30   ` Haris Okanovic
  2024-11-05 18:30   ` [PATCH 5/5] cpuidle: implement poll_idle() using smp_vcond_load_relaxed() Haris Okanovic
  2024-11-05 18:49   ` [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
  5 siblings, 0 replies; 25+ messages in thread
From: Haris Okanovic @ 2024-11-05 18:30 UTC (permalink / raw)
  To: ankur.a.arora, catalin.marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

Implement smp_vcond_load_relaxed() atop __delay_until_ul() on arm64, to
reduce the number of busy loops while waiting for a value condition.

This implementation only supports unsigned long words. It can be extended
via the switch statement in barrier.h as needed.
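
For example, 32-bit support could be added roughly like this (hypothetical
fragment for the switch statement; __smp_vcond_load_relaxed_u32() does not
exist in this series):

```c
	case sizeof(u32):
		__cur = __smp_vcond_load_relaxed_u32(
			__nsecs, (u32 *)__addr, __mask, __val);
		break;
```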

Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
 arch/arm64/include/asm/barrier.h | 18 ++++++++++++++++++
 arch/arm64/lib/delay.c           | 16 ++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 1ca947d5c939..188327e3ce72 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -203,6 +203,24 @@ do {									\
 	(typeof(*ptr))VAL;						\
 })
 
+extern unsigned long __smp_vcond_load_relaxed_ul(
+	u64 nsecs, unsigned long* addr, unsigned long mask, unsigned long val);
+
+#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({		\
+	u64 __nsecs = (nsecs);						\
+	typeof(addr) __addr = (addr);					\
+	typeof(*__addr) __mask = (mask);				\
+	typeof(*__addr) __val = (val);					\
+	typeof(*__addr) __cur;						\
+	switch (sizeof(*__addr)) {					\
+	case sizeof(unsigned long):					\
+		__cur = __smp_vcond_load_relaxed_ul(			\
+			__nsecs, __addr, __mask, __val);		\
+		break;							\
+	}								\
+	(__cur);							\
+})
+
 #define smp_cond_load_acquire(ptr, cond_expr)				\
 ({									\
 	typeof(ptr) __PTR = (ptr);					\
diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index a7c3040af316..a61a13b04439 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -12,6 +12,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/timex.h>
+#include <linux/sched/clock.h>
 
 #include <clocksource/arm_arch_timer.h>
 #include <asm/readex.h>
@@ -97,3 +98,18 @@ void __ndelay(unsigned long nsecs)
 	__delay(NSECS_TO_CYCLES(nsecs));
 }
 EXPORT_SYMBOL(__ndelay);
+
+unsigned long __smp_vcond_load_relaxed_ul(
+	u64 nsecs, unsigned long* addr, unsigned long mask, unsigned long val)
+{
+	const u64 start = local_clock_noinstr();
+	const u64 cycles = NSECS_TO_CYCLES(nsecs);
+	unsigned long cur;
+
+	do {
+		cur = __delay_until_ul(cycles, addr, mask, val);
+	} while((cur & mask) != val && local_clock_noinstr() - start < nsecs);
+
+	return cur;
+}
+EXPORT_SYMBOL(__smp_vcond_load_relaxed_ul);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 5/5] cpuidle: implement poll_idle() using smp_vcond_load_relaxed()
  2024-11-05 18:30 ` [PATCH v8 00/11] Enable haltpoll on arm64 Haris Okanovic
                     ` (3 preceding siblings ...)
  2024-11-05 18:30   ` [PATCH 4/5] arm64: add smp_vcond_load_relaxed() Haris Okanovic
@ 2024-11-05 18:30   ` Haris Okanovic
  2024-11-05 19:45     ` Christoph Lameter (Ampere)
  2024-11-05 18:49   ` [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
  5 siblings, 1 reply; 25+ messages in thread
From: Haris Okanovic @ 2024-11-05 18:30 UTC (permalink / raw)
  To: ankur.a.arora, catalin.marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

Implement poll_idle() using the smp_vcond_load_relaxed() function.

Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
 drivers/cpuidle/poll_state.c | 36 +++++-------------------------------
 1 file changed, 5 insertions(+), 31 deletions(-)

diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index 61df2395585e..5553e6f31702 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -7,46 +7,20 @@
 #include <linux/sched.h>
 #include <linux/sched/clock.h>
 #include <linux/sched/idle.h>
-
-#ifdef CONFIG_ARM64
-/*
- * POLL_IDLE_RELAX_COUNT determines how often we check for timeout
- * while polling for TIF_NEED_RESCHED in thread_info->flags.
- *
- * Set this to a low value since arm64, instead of polling, uses a
- * event based mechanism.
- */
-#define POLL_IDLE_RELAX_COUNT	1
-#else
-#define POLL_IDLE_RELAX_COUNT	200
-#endif
+#include <asm/barrier.h>
 
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
-	u64 time_start;
-
-	time_start = local_clock_noinstr();
+	unsigned long flags;
 
 	dev->poll_time_limit = false;
 
 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		u64 limit;
-
-		limit = cpuidle_poll_time(drv, dev);
-
-		while (!need_resched()) {
-			unsigned int loop_count = 0;
-			if (local_clock_noinstr() - time_start > limit) {
-				dev->poll_time_limit = true;
-				break;
-			}
-
-			smp_cond_load_relaxed(&current_thread_info()->flags,
-					      VAL & _TIF_NEED_RESCHED ||
-					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
-		}
+		u64 limit = cpuidle_poll_time(drv, dev);
+		flags = smp_vcond_load_relaxed(limit, &current_thread_info()->flags, _TIF_NEED_RESCHED, _TIF_NEED_RESCHED);
+		dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
 	}
 	raw_local_irq_disable();
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v8 00/11] Enable haltpoll on arm64
  2024-11-05 18:30 ` [PATCH v8 00/11] Enable haltpoll on arm64 Haris Okanovic
                     ` (4 preceding siblings ...)
  2024-11-05 18:30   ` [PATCH 5/5] cpuidle: implement poll_idle() using smp_vcond_load_relaxed() Haris Okanovic
@ 2024-11-05 18:49   ` Ankur Arora
  5 siblings, 0 replies; 25+ messages in thread
From: Ankur Arora @ 2024-11-05 18:49 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk


Haris Okanovic <harisokn@amazon.com> writes:

> Hi Ankur, Catalin,
>
> How about the following series based on a refactor of arm64's delay()?
> Does it address your earlier concerns?
>
> delay() already implements wfet() and falls back to wfe() w/ evstream
> or a cpu_relax loop. I refactored it to poll an address, and wrapped in
> a new platform-agnostic smp_vcond_load_relaxed() macro. More details in
> the following commit log.

Haven't looked at your series too closely but it looks quite a bit
different from the version I was working on.

Let me send out my version as well in the next few days.

--
ankur


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
@ 2024-11-05 19:36     ` Christoph Lameter (Ampere)
  2024-11-06 17:06       ` Okanovic, Haris
  2024-11-06 11:08     ` Catalin Marinas
  2024-11-06 11:39     ` Will Deacon
  2 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-05 19:36 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, 5 Nov 2024, Haris Okanovic wrote:

> +/**
> + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> + * `nsecs` elapse, and returns the last observed `*addr` value.
> + *
> + * @nsecs: timeout in nanoseconds

Please use an absolute time in nsecs instead of a timeout. You do not know
what will happen to your execution thread until the local_clock_noinstr()
is run.
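
A deadline-based variant might look like this (a sketch of the suggestion;
smp_vcond_load_relaxed_until() and wait_deadline() are hypothetical names):

```c
/* Sketch of the suggested deadline form, not an existing API. */
static unsigned long wait_deadline(unsigned long *addr, unsigned long mask,
				   unsigned long val, u64 budget_ns)
{
	/* Sample the clock once, up front; any later preemption is
	 * charged against the budget instead of silently extending it. */
	u64 deadline = local_clock_noinstr() + budget_ns;

	return smp_vcond_load_relaxed_until(deadline, addr, mask, val);
}
```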




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/5] arm64: add __READ_ONCE_EX()
  2024-11-05 18:30   ` [PATCH 2/5] arm64: add __READ_ONCE_EX() Haris Okanovic
@ 2024-11-05 19:39     ` Christoph Lameter (Ampere)
  2024-11-06 17:37       ` Okanovic, Haris
  2024-11-06 11:43     ` Will Deacon
  2024-11-09  9:49     ` David Laight
  2 siblings, 1 reply; 25+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-05 19:39 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, 5 Nov 2024, Haris Okanovic wrote:

> +#define __READ_ONCE_EX(x)						\

This is derived from READ_ONCE() and named similarly, so maybe it would
be better placed in rwonce.h?




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] arm64: refactor delay() to enable polling for value
  2024-11-05 18:30   ` [PATCH 3/5] arm64: refactor delay() to enable polling for value Haris Okanovic
@ 2024-11-05 19:42     ` Christoph Lameter (Ampere)
  2024-11-06 17:42       ` Okanovic, Haris
  2024-11-06  9:18     ` Catalin Marinas
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-05 19:42 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, 5 Nov 2024, Haris Okanovic wrote:

> -#define USECS_TO_CYCLES(time_usecs)			\
> -	xloops_to_cycles((time_usecs) * 0x10C7UL)
> -
> -static inline unsigned long xloops_to_cycles(unsigned long xloops)
> +static inline u64 xloops_to_cycles(u64 xloops)
>  {
>  	return (xloops * loops_per_jiffy * HZ) >> 32;
>  }
>
> -void __delay(unsigned long cycles)
> +#define USECS_TO_XLOOPS(time_usecs) \
> +	((time_usecs) * 0x10C7UL)
> +
> +#define USECS_TO_CYCLES(time_usecs) \
> +	xloops_to_cycles(USECS_TO_XLOOPS(time_usecs))
> +


> +#define NSECS_TO_XLOOPS(time_nsecs) \
> +	((time_nsecs) * 0x10C7UL)

The constant here is the same value as for microseconds. If I remember
correctly it's 5UL for nanoseconds.
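
For reference, the arithmetic behind both constants: 2**32 / 1000000 is
about 4294.97, which rounds up to 4295 = 0x10C7, while 2**32 / 1000000000
is about 4.29, which rounds up to 5, hence the 0x5UL in the old __ndelay()
comment that this patch removes.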



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/5] cpuidle: implement poll_idle() using smp_vcond_load_relaxed()
  2024-11-05 18:30   ` [PATCH 5/5] cpuidle: implement poll_idle() using smp_vcond_load_relaxed() Haris Okanovic
@ 2024-11-05 19:45     ` Christoph Lameter (Ampere)
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-05 19:45 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, 5 Nov 2024, Haris Okanovic wrote:

>  {
> -	u64 time_start;
> -
> -	time_start = local_clock_noinstr();
> +	unsigned long flags;

Let's keep recording that start value here. Otherwise the timeout could be
later than expected.
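
A sketch of the suggested ordering, using the names from patch 5 (the
limit adjustment is illustrative):

```c
	u64 time_start = local_clock_noinstr();	/* sample before any setup */
	u64 limit = cpuidle_poll_time(drv, dev);
	u64 elapsed = local_clock_noinstr() - time_start;

	/* charge setup time against the poll budget */
	if (elapsed < limit)
		flags = smp_vcond_load_relaxed(limit - elapsed,
					       &current_thread_info()->flags,
					       _TIF_NEED_RESCHED,
					       _TIF_NEED_RESCHED);
```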


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] arm64: refactor delay() to enable polling for value
  2024-11-05 18:30   ` [PATCH 3/5] arm64: refactor delay() to enable polling for value Haris Okanovic
  2024-11-05 19:42     ` Christoph Lameter (Ampere)
@ 2024-11-06  9:18     ` Catalin Marinas
  2024-11-06 17:38       ` Okanovic, Haris
  1 sibling, 1 reply; 25+ messages in thread
From: Catalin Marinas @ 2024-11-06  9:18 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	will, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, mtosatti, sudeep.holla, cl, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Nov 05, 2024 at 12:30:39PM -0600, Haris Okanovic wrote:
> +		do {
> +			cur = __READ_ONCE_EX(*addr);
> +			if ((cur & mask) == val) {
> +				break;
> +			}
>  			wfet(end);

Constructs like this need to be entirely in assembly. The compiler may
spill 'cur' onto the stack and the write could clear the exclusive
monitor which makes the wfet return immediately. That's highly CPU
implementation specific but it's the reason we have functions like
__cmpwait() in assembly (or whatever else deals with exclusives). IOW,
we can't have other memory accesses between the LDXR and whatever is
consuming the exclusive monitor armed state (typically STXR but WFE/WFET
can be another).
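
A hypothetical sketch of what that means in practice, modeled on
__cmpwait(): one poll iteration kept entirely inside a single asm block,
so no compiler-generated store can clear the exclusive monitor between
the LDXR and the WFET. The msr below is the same system-register alias
the kernel's wfet() macro uses while assemblers lack the mnemonic.

```c
static inline unsigned long __vcond_wait(unsigned long *addr,
					 unsigned long mask,
					 unsigned long val, u64 end)
{
	unsigned long cur, tmp;

	asm volatile(
	"	ldxr	%[cur], %[v]\n"			/* load, arm monitor */
	"	and	%[tmp], %[cur], %[mask]\n"
	"	cmp	%[tmp], %[val]\n"
	"	b.eq	1f\n"				/* condition met */
	"	msr	s0_3_c1_c0_0, %[end]\n"		/* wfet end */
	"1:"
	: [cur] "=&r" (cur), [tmp] "=&r" (tmp), [v] "+Q" (*addr)
	: [mask] "r" (mask), [val] "r" (val), [end] "r" (end)
	: "memory");

	return cur;
}
```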

-- 
Catalin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
  2024-11-05 19:36     ` Christoph Lameter (Ampere)
@ 2024-11-06 11:08     ` Catalin Marinas
  2024-11-06 18:13       ` Okanovic, Haris
  2024-11-06 11:39     ` Will Deacon
  2 siblings, 1 reply; 25+ messages in thread
From: Catalin Marinas @ 2024-11-06 11:08 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	will, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, mtosatti, sudeep.holla, cl, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..112027eabbfc 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -256,6 +256,31 @@ do {									\
>  })
>  #endif
>  
> +/**
> + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> + * `nsecs` elapse, and returns the last observed `*addr` value.
> + *
> + * @nsecs: timeout in nanoseconds

FWIW, I don't mind the relative timeout, it makes the API easier to use.
Yes, it may take longer in absolute time if the thread is scheduled out
before local_clock_noinstr() is read but the same can happen in the
caller anyway. It's similar to udelay(), it can take longer if the
thread is scheduled out.

> + * @addr: pointer to an integer
> + * @mask: a bit mask applied to read values
> + * @val: Expected value with mask
> + */
> +#ifndef smp_vcond_load_relaxed
> +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({	\
> +	const u64 __start = local_clock_noinstr();		\
> +	u64 __nsecs = (nsecs);					\
> +	typeof(addr) __addr = (addr);				\
> +	typeof(*__addr) __mask = (mask);			\
> +	typeof(*__addr) __val = (val);				\
> +	typeof(*__addr) __cur;					\
> +	smp_cond_load_relaxed(__addr, (				\
> +		(VAL & __mask) == __val ||			\
> +		local_clock_noinstr() - __start > __nsecs	\
> +	));							\
> +})

The generic implementation has the same problem as Ankur's current
series. smp_cond_load_relaxed() can't wait on anything other than the
variable at __addr. If it goes into a WFE, there's nothing executed to
read the timer and check for progress. Any generic implementation of
such function would have to use cpu_relax() and polling.

-- 
Catalin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
  2024-11-05 19:36     ` Christoph Lameter (Ampere)
  2024-11-06 11:08     ` Catalin Marinas
@ 2024-11-06 11:39     ` Will Deacon
  2024-11-06 17:18       ` Okanovic, Haris
  2 siblings, 1 reply; 25+ messages in thread
From: Will Deacon @ 2024-11-06 11:39 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini,
	wanpengli, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, mtosatti, sudeep.holla, cl, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> Relaxed poll until desired mask/value is observed at the specified
> address or timeout.
> 
> This macro is a specialization of the generic smp_cond_load_relaxed(),
> which takes a simple mask/value condition (vcond) instead of an
> arbitrary expression. It allows architectures to better specialize the
> implementation, e.g. to enable wfe() polling of the address on arm.

This doesn't make sense to me. The existing smp_cond_load() functions
already use wfe on arm64 and I don't see why we need a special helper
just to do a mask.

> Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> ---
>  include/asm-generic/barrier.h | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..112027eabbfc 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -256,6 +256,31 @@ do {									\
>  })
>  #endif
>  
> +/**
> + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> + * `nsecs` elapse, and returns the last observed `*addr` value.
> + *
> + * @nsecs: timeout in nanoseconds
> + * @addr: pointer to an integer
> + * @mask: a bit mask applied to read values
> + * @val: Expected value with mask
> + */
> +#ifndef smp_vcond_load_relaxed

I know naming is hard, but "vcond" is especially terrible.
Perhaps smp_cond_load_timeout()?

Will


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/5] arm64: add __READ_ONCE_EX()
  2024-11-05 18:30   ` [PATCH 2/5] arm64: add __READ_ONCE_EX() Haris Okanovic
  2024-11-05 19:39     ` Christoph Lameter (Ampere)
@ 2024-11-06 11:43     ` Will Deacon
  2024-11-06 17:09       ` Okanovic, Haris
  2024-11-09  9:49     ` David Laight
  2 siblings, 1 reply; 25+ messages in thread
From: Will Deacon @ 2024-11-06 11:43 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini,
	wanpengli, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, mtosatti, sudeep.holla, cl, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Nov 05, 2024 at 12:30:38PM -0600, Haris Okanovic wrote:
> Perform an exclusive load, which atomically loads a word and arms the
> exclusive monitor to enable wfet()/wfe() accelerated polling.
> 
> https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-accesses/exclusive-monitors
> 
> Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> ---
>  arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
>  create mode 100644 arch/arm64/include/asm/readex.h
> 
> diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
> new file mode 100644
> index 000000000000..51963c3107e1
> --- /dev/null
> +++ b/arch/arm64/include/asm/readex.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Based on arch/arm64/include/asm/rwonce.h
> + *
> + * Copyright (C) 2020 Google LLC.
> + * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
> + */
> +
> +#ifndef __ASM_READEX_H
> +#define __ASM_READEX_H
> +
> +#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
> +
> +#define __READ_ONCE_EX(x)						\
> +({									\
> +	typeof(&(x)) __x = &(x);					\
> +	int atomic = 1;							\
> +	union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u;	\
> +	switch (sizeof(x)) {						\
> +	case 1:								\
> +		asm volatile(__LOAD_EX(b, %w0, %1)			\
> +			: "=r" (*(__u8 *)__u.__c)			\
> +			: "Q" (*__x) : "memory");			\
> +		break;							\
> +	case 2:								\
> +		asm volatile(__LOAD_EX(h, %w0, %1)			\
> +			: "=r" (*(__u16 *)__u.__c)			\
> +			: "Q" (*__x) : "memory");			\
> +		break;							\
> +	case 4:								\
> +		asm volatile(__LOAD_EX(, %w0, %1)			\
> +			: "=r" (*(__u32 *)__u.__c)			\
> +			: "Q" (*__x) : "memory");			\
> +		break;							\
> +	case 8:								\
> +		asm volatile(__LOAD_EX(, %0, %1)			\
> +			: "=r" (*(__u64 *)__u.__c)			\
> +			: "Q" (*__x) : "memory");			\
> +		break;							\
> +	default:							\
> +		atomic = 0;						\
> +	}								\
> +	atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\
> +})

I think this is a bad idea. Load-exclusive needs to be used very carefully,
preferably when you're able to see exactly what instructions it's
interacting with. By making this into a macro, we're at the mercy of the
compiler and we give the wrong impression that you could e.g. build atomic
critical sections out of this macro.

So I'm fairly strongly against this interface.

Will


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-05 19:36     ` Christoph Lameter (Ampere)
@ 2024-11-06 17:06       ` Okanovic, Haris
  0 siblings, 0 replies; 25+ messages in thread
From: Okanovic, Haris @ 2024-11-06 17:06 UTC (permalink / raw)
  To: cl@gentwo.org
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	boris.ostrovsky@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, joao.m.martins@oracle.com,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, maobibo@loongson.cn, vkuznets@redhat.com,
	linux-arm-kernel@lists.infradead.org, Okanovic, Haris,
	linux-pm@vger.kernel.org, bp@alien8.de, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com

On Tue, 2024-11-05 at 11:36 -0800, Christoph Lameter (Ampere) wrote:
> 
> On Tue, 5 Nov 2024, Haris Okanovic wrote:
> 
> > +/**
> > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > + *
> > + * @nsecs: timeout in nanoseconds
> 
> Please use an absolute time in nsecs instead of a timeout.

I went with relative time because it's clock agnostic. I agree a deadline
is nicer because it can propagate down layers of functions, but it pins
the caller to a single time base.

> You do not know
> what will happen to your execution thread until the local_clock_noinstr()
> is run.


Not sure what you mean. Could you perhaps give an example where it
would break?

> 
> 

One alternative is to do timekeeping with delay() in all cases, to
decouple from sched/clock.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/5] arm64: add __READ_ONCE_EX()
  2024-11-06 11:43     ` Will Deacon
@ 2024-11-06 17:09       ` Okanovic, Haris
  0 siblings, 0 replies; 25+ messages in thread
From: Okanovic, Haris @ 2024-11-06 17:09 UTC (permalink / raw)
  To: will@kernel.org
  Cc: kvm@vger.kernel.org, rafael@kernel.org,
	boris.ostrovsky@oracle.com, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, mtosatti@redhat.com,
	hpa@zytor.com, peterz@infradead.org, maobibo@loongson.cn,
	vkuznets@redhat.com, linux-arm-kernel@lists.infradead.org,
	Okanovic, Haris, linux-pm@vger.kernel.org, bp@alien8.de,
	x86@kernel.org, mark.rutland@arm.com

On Wed, 2024-11-06 at 11:43 +0000, Will Deacon wrote:
> 
> On Tue, Nov 05, 2024 at 12:30:38PM -0600, Haris Okanovic wrote:
> > Perform an exclusive load, which atomically loads a word and arms the
> > exclusive monitor to enable wfet()/wfe() accelerated polling.
> > 
> > https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-accesses/exclusive-monitors
> > 
> > Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> > ---
> >  arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
> >  1 file changed, 46 insertions(+)
> >  create mode 100644 arch/arm64/include/asm/readex.h
> > 
> > diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
> > new file mode 100644
> > index 000000000000..51963c3107e1
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/readex.h
> > @@ -0,0 +1,46 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Based on arch/arm64/include/asm/rwonce.h
> > + *
> > + * Copyright (C) 2020 Google LLC.
> > + * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
> > + */
> > +
> > +#ifndef __ASM_READEX_H
> > +#define __ASM_READEX_H
> > +
> > +#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
> > +
> > +#define __READ_ONCE_EX(x)                                            \
> > +({                                                                   \
> > +     typeof(&(x)) __x = &(x);                                        \
> > +     int atomic = 1;                                                 \
> > +     union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \
> > +     switch (sizeof(x)) {                                            \
> > +     case 1:                                                         \
> > +             asm volatile(__LOAD_EX(b, %w0, %1)                      \
> > +                     : "=r" (*(__u8 *)__u.__c)                       \
> > +                     : "Q" (*__x) : "memory");                       \
> > +             break;                                                  \
> > +     case 2:                                                         \
> > +             asm volatile(__LOAD_EX(h, %w0, %1)                      \
> > +                     : "=r" (*(__u16 *)__u.__c)                      \
> > +                     : "Q" (*__x) : "memory");                       \
> > +             break;                                                  \
> > +     case 4:                                                         \
> > +             asm volatile(__LOAD_EX(, %w0, %1)                       \
> > +                     : "=r" (*(__u32 *)__u.__c)                      \
> > +                     : "Q" (*__x) : "memory");                       \
> > +             break;                                                  \
> > +     case 8:                                                         \
> > +             asm volatile(__LOAD_EX(, %0, %1)                        \
> > +                     : "=r" (*(__u64 *)__u.__c)                      \
> > +                     : "Q" (*__x) : "memory");                       \
> > +             break;                                                  \
> > +     default:                                                        \
> > +             atomic = 0;                                             \
> > +     }                                                               \
> > +     atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\
> > +})
> 
> I think this is a bad idea. Load-exclusive needs to be used very carefully,
> preferably when you're able to see exactly what instructions it's
> interacting with. By making this into a macro, we're at the mercy of the
> compiler and we give the wrong impression that you could e.g. build atomic
> critical sections out of this macro.
> 
> So I'm fairly strongly against this interface.

Fair point. I'll post an alternate delay() implementation in asm. It's
a simple routine.

> 
> Will


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-06 11:39     ` Will Deacon
@ 2024-11-06 17:18       ` Okanovic, Haris
  0 siblings, 0 replies; 25+ messages in thread
From: Okanovic, Haris @ 2024-11-06 17:18 UTC (permalink / raw)
  To: will@kernel.org
  Cc: kvm@vger.kernel.org, rafael@kernel.org,
	boris.ostrovsky@oracle.com, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, mtosatti@redhat.com,
	hpa@zytor.com, peterz@infradead.org, maobibo@loongson.cn,
	vkuznets@redhat.com, linux-arm-kernel@lists.infradead.org,
	Okanovic, Haris, linux-pm@vger.kernel.org, bp@alien8.de,
	x86@kernel.org, mark.rutland@arm.com

On Wed, 2024-11-06 at 11:39 +0000, Will Deacon wrote:
> 
> On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > Relaxed poll until desired mask/value is observed at the specified
> > address or timeout.
> > 
> > This macro is a specialization of the generic smp_cond_load_relaxed(),
> > which takes a simple mask/value condition (vcond) instead of an
> > arbitrary expression. It allows architectures to better specialize the
> > implementation, e.g. to enable wfe() polling of the address on arm.
> 
> This doesn't make sense to me. The existing smp_cond_load() functions
> already use wfe on arm64 and I don't see why we need a special helper
> just to do a mask.

We can't turn an arbitrary C expression, which is what the existing
smp_cond_load() takes as its condition, into a wfe()/wfet() exit
condition. This API is therefore more amenable to hardware acceleration.
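
To illustrate with the poll_idle() condition that patch 5 replaces: the
expression form has a side effect (loop_count++) that nothing can evaluate
while the CPU is stopped in wfe()/wfet(), whereas a mask/value pair is a
fixed test the implementation can re-issue around each monitored load:

```c
	unsigned int loop_count = 0;

	/* Existing interface: an arbitrary expression, with side effects. */
	smp_cond_load_relaxed(&current_thread_info()->flags,
			      VAL & _TIF_NEED_RESCHED ||
			      loop_count++ >= POLL_IDLE_RELAX_COUNT);

	/* Proposed interface: a fixed mask/value test. */
	smp_vcond_load_relaxed(nsecs, &current_thread_info()->flags,
			       _TIF_NEED_RESCHED, _TIF_NEED_RESCHED);
```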

> 
> > Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> > ---
> >  include/asm-generic/barrier.h | 25 +++++++++++++++++++++++++
> >  1 file changed, 25 insertions(+)
> > 
> > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > index d4f581c1e21d..112027eabbfc 100644
> > --- a/include/asm-generic/barrier.h
> > +++ b/include/asm-generic/barrier.h
> > @@ -256,6 +256,31 @@ do {                                                                     \
> >  })
> >  #endif
> > 
> > +/**
> > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > + *
> > + * @nsecs: timeout in nanoseconds
> > + * @addr: pointer to an integer
> > + * @mask: a bit mask applied to read values
> > + * @val: Expected value with mask
> > + */
> > +#ifndef smp_vcond_load_relaxed
> 
> I know naming is hard, but "vcond" is especially terrible.
> Perhaps smp_cond_load_timeout()?

I agree, naming is hard! I was trying to differentiate it from
smp_cond_load() in some meaningful way - that one takes an "expression"
condition, this one a "value" condition.

I'll think it over a bit more.

> 
> Will


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/5] arm64: add __READ_ONCE_EX()
  2024-11-05 19:39     ` Christoph Lameter (Ampere)
@ 2024-11-06 17:37       ` Okanovic, Haris
  0 siblings, 0 replies; 25+ messages in thread
From: Okanovic, Haris @ 2024-11-06 17:37 UTC (permalink / raw)
  To: cl@gentwo.org
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	boris.ostrovsky@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, joao.m.martins@oracle.com,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, maobibo@loongson.cn, vkuznets@redhat.com,
	linux-arm-kernel@lists.infradead.org, Okanovic, Haris,
	linux-pm@vger.kernel.org, bp@alien8.de, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com

On Tue, 2024-11-05 at 11:39 -0800, Christoph Lameter (Ampere) wrote:
> 
> On Tue, 5 Nov 2024, Haris Okanovic wrote:
> 
> > +#define __READ_ONCE_EX(x)                                            \
> 
> This is derived from READ_ONCE and named similarly so maybe it would
> better be placed into rwonce.h?
> 
> 

I plan to remove this macro per other feedback in this thread.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] arm64: refactor delay() to enable polling for value
  2024-11-06  9:18     ` Catalin Marinas
@ 2024-11-06 17:38       ` Okanovic, Haris
  0 siblings, 0 replies; 25+ messages in thread
From: Okanovic, Haris @ 2024-11-06 17:38 UTC (permalink / raw)
  To: catalin.marinas@arm.com
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, bp@alien8.de, Okanovic, Haris,
	linux-pm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	mtosatti@redhat.com, x86@kernel.org, mark.rutland@arm.com

On Wed, 2024-11-06 at 09:18 +0000, Catalin Marinas wrote:
> 
> On Tue, Nov 05, 2024 at 12:30:39PM -0600, Haris Okanovic wrote:
> > +             do {
> > +                     cur = __READ_ONCE_EX(*addr);
> > +                     if ((cur & mask) == val) {
> > +                             break;
> > +                     }
> >                       wfet(end);
> 
> Constructs like this need to be entirely in assembly. The compiler may
> spill 'cur' onto the stack and the write could clear the exclusive
> monitor which makes the wfet return immediately. That's highly CPU
> implementation specific but it's the reason we have functions like
> __cmpwait() in assembly (or whatever else deals with exclusives). IOW,
> we can't have other memory accesses between the LDXR and whatever is
> consuming the exclusive monitor armed state (typically STXR but WFE/WFET
> can be another).

Agreed, will rewrite parts of delay() in asm.

> 
> --
> Catalin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] arm64: refactor delay() to enable polling for value
  2024-11-05 19:42     ` Christoph Lameter (Ampere)
@ 2024-11-06 17:42       ` Okanovic, Haris
  0 siblings, 0 replies; 25+ messages in thread
From: Okanovic, Haris @ 2024-11-06 17:42 UTC (permalink / raw)
  To: cl@gentwo.org
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	boris.ostrovsky@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, joao.m.martins@oracle.com,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, maobibo@loongson.cn, vkuznets@redhat.com,
	linux-arm-kernel@lists.infradead.org, Okanovic, Haris,
	linux-pm@vger.kernel.org, bp@alien8.de, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com

On Tue, 2024-11-05 at 11:42 -0800, Christoph Lameter (Ampere) wrote:
> 
> On Tue, 5 Nov 2024, Haris Okanovic wrote:
> 
> > -#define USECS_TO_CYCLES(time_usecs)                  \
> > -     xloops_to_cycles((time_usecs) * 0x10C7UL)
> > -
> > -static inline unsigned long xloops_to_cycles(unsigned long xloops)
> > +static inline u64 xloops_to_cycles(u64 xloops)
> >  {
> >       return (xloops * loops_per_jiffy * HZ) >> 32;
> >  }
> > 
> > -void __delay(unsigned long cycles)
> > +#define USECS_TO_XLOOPS(time_usecs) \
> > +     ((time_usecs) * 0x10C7UL)
> > +
> > +#define USECS_TO_CYCLES(time_usecs) \
> > +     xloops_to_cycles(USECS_TO_XLOOPS(time_usecs))
> > +
> 
> 
> > +#define NSECS_TO_XLOOPS(time_nsecs) \
> > +     ((time_nsecs) * 0x10C7UL)
> 
> The constant here is the same value as for microseconds. If I remember
> correctly its 5UL for nanoseconds.
> 

You're right, good catch. Should be `nsecs * 0x5UL` per old code.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-06 11:08     ` Catalin Marinas
@ 2024-11-06 18:13       ` Okanovic, Haris
  2024-11-06 19:55         ` Catalin Marinas
  0 siblings, 1 reply; 25+ messages in thread
From: Okanovic, Haris @ 2024-11-06 18:13 UTC (permalink / raw)
  To: catalin.marinas@arm.com
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, bp@alien8.de, Okanovic, Haris,
	linux-pm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	mtosatti@redhat.com, x86@kernel.org, mark.rutland@arm.com

On Wed, 2024-11-06 at 11:08 +0000, Catalin Marinas wrote:
> 
> On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > index d4f581c1e21d..112027eabbfc 100644
> > --- a/include/asm-generic/barrier.h
> > +++ b/include/asm-generic/barrier.h
> > @@ -256,6 +256,31 @@ do {                                                                     \
> >  })
> >  #endif
> > 
> > +/**
> > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > + *
> > + * @nsecs: timeout in nanoseconds
> 
> FWIW, I don't mind the relative timeout, it makes the API easier to use.
> Yes, it may take longer in absolute time if the thread is scheduled out
> before local_clock_noinstr() is read but the same can happen in the
> caller anyway. It's similar to udelay(), it can take longer if the
> thread is scheduled out.
> 
> > + * @addr: pointer to an integer
> > + * @mask: a bit mask applied to read values
> > + * @val: Expected value with mask
> > + */
> > +#ifndef smp_vcond_load_relaxed
> > +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({    \
> > +     const u64 __start = local_clock_noinstr();              \
> > +     u64 __nsecs = (nsecs);                                  \
> > +     typeof(addr) __addr = (addr);                           \
> > +     typeof(*__addr) __mask = (mask);                        \
> > +     typeof(*__addr) __val = (val);                          \
> > +     typeof(*__addr) __cur;                                  \
> > +     smp_cond_load_relaxed(__addr, (                         \
> > +             (VAL & __mask) == __val ||                      \
> > +             local_clock_noinstr() - __start > __nsecs       \
> > +     ));                                                     \
> > +})
> 
> The generic implementation has the same problem as Ankur's current
> series. smp_cond_load_relaxed() can't wait on anything other than the
> variable at __addr. If it goes into a WFE, there's nothing executed to
> read the timer and check for progress. Any generic implementation of
> such function would have to use cpu_relax() and polling.

How would the caller enter wfe()? Can you give a specific scenario that
you're concerned about?

This code already reduces to a relaxed poll, something like this:

```
start = clock();
while((READ_ONCE(*addr) & mask) != val && (clock() - start) < nsecs) {
  cpu_relax();
}
```

> 
> --
> Catalin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-06 18:13       ` Okanovic, Haris
@ 2024-11-06 19:55         ` Catalin Marinas
  2024-11-06 20:31           ` Okanovic, Haris
  0 siblings, 1 reply; 25+ messages in thread
From: Catalin Marinas @ 2024-11-06 19:55 UTC (permalink / raw)
  To: Okanovic, Haris
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, bp@alien8.de, linux-pm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com

On Wed, Nov 06, 2024 at 06:13:35PM +0000, Okanovic, Haris wrote:
> On Wed, 2024-11-06 at 11:08 +0000, Catalin Marinas wrote:
> > On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > > index d4f581c1e21d..112027eabbfc 100644
> > > --- a/include/asm-generic/barrier.h
> > > +++ b/include/asm-generic/barrier.h
> > > @@ -256,6 +256,31 @@ do {                                                                     \
> > >  })
> > >  #endif
> > > 
> > > +/**
> > > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > > + *
> > > + * @nsecs: timeout in nanoseconds
> > 
> > FWIW, I don't mind the relative timeout, it makes the API easier to use.
> > Yes, it may take longer in absolute time if the thread is scheduled out
> > before local_clock_noinstr() is read but the same can happen in the
> > caller anyway. It's similar to udelay(), it can take longer if the
> > thread is scheduled out.
> > 
> > > + * @addr: pointer to an integer
> > > + * @mask: a bit mask applied to read values
> > > + * @val: Expected value with mask
> > > + */
> > > +#ifndef smp_vcond_load_relaxed
> > > +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({    \
> > > +     const u64 __start = local_clock_noinstr();              \
> > > +     u64 __nsecs = (nsecs);                                  \
> > > +     typeof(addr) __addr = (addr);                           \
> > > +     typeof(*__addr) __mask = (mask);                        \
> > > +     typeof(*__addr) __val = (val);                          \
> > > +     typeof(*__addr) __cur;                                  \
> > > +     smp_cond_load_relaxed(__addr, (                         \
> > > +             (VAL & __mask) == __val ||                      \
> > > +             local_clock_noinstr() - __start > __nsecs       \
> > > +     ));                                                     \
> > > +})
> > 
> > The generic implementation has the same problem as Ankur's current
> > series. smp_cond_load_relaxed() can't wait on anything other than the
> > variable at __addr. If it goes into a WFE, there's nothing executed to
> > read the timer and check for progress. Any generic implementation of
> > such function would have to use cpu_relax() and polling.
> 
> How would the caller enter wfe()? Can you give a specific scenario that
> you're concerned about?

Let's take the arm64 example with the event stream disabled. Without the
subsequent patches implementing smp_vcond_load_relaxed(), just expand
the arm64 smp_cond_load_relaxed() implementation in the above macro. If
the timer check doesn't trigger an exit from the loop,
__cmpwait_relaxed() only waits on the variable to change its value,
nothing to do with the timer.
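
Concretely, the arm64 expansion is roughly the following (a paraphrased
sketch of arch/arm64/include/asm/barrier.h, not the verbatim macro):

```
/* Paraphrased sketch, not the verbatim kernel macro. */
#define smp_cond_load_relaxed(ptr, cond_expr)				\
({									\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		__cmpwait_relaxed(__PTR, VAL);	/* may stall in WFE */	\
	}								\
	(typeof(*ptr))VAL;						\
})
```

Once __cmpwait_relaxed() has entered WFE, cond_expr - and with it the
local_clock_noinstr() check - is not re-evaluated until *__PTR changes.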

> This code already reduces to a relaxed poll, something like this:
> 
> ```
> start = clock();
> while((READ_ONCE(*addr) & mask) != val && (clock() - start) < nsecs) {
>   cpu_relax();
> }
> ```

Well, that's only if you also use the generic implementation of
smp_cond_load_relaxed(), but have you checked that all the other
architectures don't do something similar to the arm64 wfe (riscv comes
close)? Even if all the other architectures just use cpu_relax(), that's
still abusing the smp_cond_load_relaxed() semantics. And what if one
places another loop in their __cmpwait()? That's allowed, because you
are only supposed to wait on a single variable to change, not on
multiple states.
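
For reference, the arm64 wait primitive has roughly this shape (a
paraphrased sketch of the 64-bit case, not the verbatim kernel code):

```
/*
 * Arm the exclusive monitor on ptr, then WFE until the monitor is
 * cleared, i.e. until something writes that location (or an event
 * stream tick arrives, if enabled). No timer is consulted here.
 */
static inline void __cmpwait_relaxed(volatile u64 *ptr, u64 val)
{
	u64 tmp;

	asm volatile(
	"	sevl\n"			/* make the first wfe fall through */
	"	wfe\n"
	"	ldxr	%0, %1\n"	/* load and arm the exclusive monitor */
	"	eor	%0, %0, %2\n"
	"	cbnz	%0, 1f\n"	/* value already changed: done */
	"	wfe\n"			/* otherwise wait for a write event */
	"1:"
	: "=&r" (tmp), "+Q" (*(u64 *)ptr)
	: "r" (val));
}
```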

-- 
Catalin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-06 19:55         ` Catalin Marinas
@ 2024-11-06 20:31           ` Okanovic, Haris
  0 siblings, 0 replies; 25+ messages in thread
From: Okanovic, Haris @ 2024-11-06 20:31 UTC (permalink / raw)
  To: catalin.marinas@arm.com
  Cc: joao.m.martins@oracle.com, kvm@vger.kernel.org,
	mtosatti@redhat.com, boris.ostrovsky@oracle.com,
	mark.rutland@arm.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	cl@gentwo.org, wanpengli@tencent.com,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, vkuznets@redhat.com, sudeep.holla@arm.com,
	Okanovic, Haris, rafael@kernel.org, bp@alien8.de,
	linux-pm@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org

On Wed, 2024-11-06 at 19:55 +0000, Catalin Marinas wrote:
> On Wed, Nov 06, 2024 at 06:13:35PM +0000, Okanovic, Haris wrote:
> > On Wed, 2024-11-06 at 11:08 +0000, Catalin Marinas wrote:
> > > On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > > > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > > > index d4f581c1e21d..112027eabbfc 100644
> > > > --- a/include/asm-generic/barrier.h
> > > > +++ b/include/asm-generic/barrier.h
> > > > @@ -256,6 +256,31 @@ do {                                                                     \
> > > >  })
> > > >  #endif
> > > > 
> > > > +/**
> > > > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > > > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > > > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > > > + *
> > > > + * @nsecs: timeout in nanoseconds
> > > 
> > > FWIW, I don't mind the relative timeout, it makes the API easier to use.
> > > Yes, it may take longer in absolute time if the thread is scheduled out
> > > before local_clock_noinstr() is read but the same can happen in the
> > > caller anyway. It's similar to udelay(), it can take longer if the
> > > thread is scheduled out.
> > > 
> > > > + * @addr: pointer to an integer
> > > > + * @mask: a bit mask applied to read values
> > > > + * @val: Expected value with mask
> > > > + */
> > > > +#ifndef smp_vcond_load_relaxed
> > > > +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({    \
> > > > +     const u64 __start = local_clock_noinstr();              \
> > > > +     u64 __nsecs = (nsecs);                                  \
> > > > +     typeof(addr) __addr = (addr);                           \
> > > > +     typeof(*__addr) __mask = (mask);                        \
> > > > +     typeof(*__addr) __val = (val);                          \
> > > > +     typeof(*__addr) __cur;                                  \
> > > > +     smp_cond_load_relaxed(__addr, (                         \
> > > > +             (VAL & __mask) == __val ||                      \
> > > > +             local_clock_noinstr() - __start > __nsecs       \
> > > > +     ));                                                     \
> > > > +})
> > > 
> > > The generic implementation has the same problem as Ankur's current
> > > series. smp_cond_load_relaxed() can't wait on anything other than the
> > > variable at __addr. If it goes into a WFE, there's nothing executed to
> > > read the timer and check for progress. Any generic implementation of
> > > such function would have to use cpu_relax() and polling.
> > 
> > How would the caller enter wfe()? Can you give a specific scenario that
> > you're concerned about?
> 
> Let's take the arm64 example with the event stream disabled. Without the
> subsequent patches implementing smp_vcond_load_relaxed(), just expand
> the arm64 smp_cond_load_relaxed() implementation in the above macro. If
> the timer check doesn't trigger an exit from the loop,
> __cmpwait_relaxed() only waits on the variable to change its value,
> nothing to do with the timer.
> 
> > This code already reduces to a relaxed poll, something like this:
> > 
> > ```
> > start = clock();
> > while((READ_ONCE(*addr) & mask) != val && (clock() - start) < nsecs) {
> >   cpu_relax();
> > }
> > ```
> 
> Well, that's only if you also use the generic implementation of
> smp_cond_load_relaxed(), but have you checked that all the other
> architectures don't do something similar to the arm64 wfe (riscv comes
> close)? Even if all the other architectures just use cpu_relax(), that's
> still abusing the smp_cond_load_relaxed() semantics. And what if one
> places another loop in their __cmpwait()? That's allowed, because you
> are only supposed to wait on a single variable to change, not on
> multiple states.

I see what you mean now - I glossed over the use of __cmpwait_relaxed()
in smp_cond_load_relaxed(). I'll post another rev with the fix, similar
to the above "reduced" code.
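
Roughly along these lines (a sketch of what I intend to post, not the
final patch):

```
/*
 * Sketch only: poll with cpu_relax() so the timeout is re-checked on
 * every iteration and no architecture's wait primitive (e.g. arm64
 * WFE) is entered from the generic code.
 */
#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({	\
	const u64 __start = local_clock_noinstr();		\
	u64 __nsecs = (nsecs);					\
	typeof(addr) __addr = (addr);				\
	typeof(*__addr) __mask = (mask);			\
	typeof(*__addr) __val = (val);				\
	typeof(*__addr) __cur;					\
	for (;;) {						\
		__cur = READ_ONCE(*__addr);			\
		if ((__cur & __mask) == __val ||		\
		    local_clock_noinstr() - __start > __nsecs)	\
			break;					\
		cpu_relax();					\
	}							\
	__cur;							\
})
```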

> 
> --
> Catalin



^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 2/5] arm64: add __READ_ONCE_EX()
  2024-11-05 18:30   ` [PATCH 2/5] arm64: add __READ_ONCE_EX() Haris Okanovic
  2024-11-05 19:39     ` Christoph Lameter (Ampere)
  2024-11-06 11:43     ` Will Deacon
@ 2024-11-09  9:49     ` David Laight
  2 siblings, 0 replies; 25+ messages in thread
From: David Laight @ 2024-11-09  9:49 UTC (permalink / raw)
  To: 'Haris Okanovic', ankur.a.arora@oracle.com,
	catalin.marinas@arm.com
  Cc: linux-pm@vger.kernel.org, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, pbonzini@redhat.com,
	wanpengli@tencent.com, vkuznets@redhat.com, rafael@kernel.org,
	daniel.lezcano@linaro.org, peterz@infradead.org, arnd@arndb.de,
	lenb@kernel.org, mark.rutland@arm.com, mtosatti@redhat.com,
	sudeep.holla@arm.com, cl@gentwo.org, misono.tomohiro@fujitsu.com,
	maobibo@loongson.cn, joao.m.martins@oracle.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com

From: Haris Okanovic
> Sent: 05 November 2024 18:31
> 
> Perform an exclusive load, which atomically loads a word and arms the
> exclusive monitor to enable wfet()/wfe() accelerated polling.
> 
...
> +	atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\

That doesn't do what you want it to do.
(It is wrong in READ_ONCE() as well.)

?: applies the usual arithmetic conversions to its operands, so any
result narrower than 'int' gets promoted to 'int'.
Moving the first cast outside the ?: probably works:
	(typeof(*__x))(atomic ? __u.__val : (*(volatile typeof(__x))__x));
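
A minimal userspace demo of that promotion (illustrative only, not
kernel code):

```
#include <stdio.h>

int main(void)
{
	unsigned char a = 200, b = 100;
	int cond = 1;

	/* The ?: operands undergo the usual arithmetic conversions,
	 * so the result has type int, not unsigned char. */
	printf("%zu\n", sizeof(a));            /* 1 */
	printf("%zu\n", sizeof(cond ? a : b)); /* sizeof(int), e.g. 4 */
	return 0;
}
```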

   David




^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2024-11-09  9:55 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20240925232425.2763385-1-ankur.a.arora@oracle.com>
2024-11-05 18:30 ` [PATCH v8 00/11] Enable haltpoll on arm64 Haris Okanovic
2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
2024-11-05 19:36     ` Christoph Lameter (Ampere)
2024-11-06 17:06       ` Okanovic, Haris
2024-11-06 11:08     ` Catalin Marinas
2024-11-06 18:13       ` Okanovic, Haris
2024-11-06 19:55         ` Catalin Marinas
2024-11-06 20:31           ` Okanovic, Haris
2024-11-06 11:39     ` Will Deacon
2024-11-06 17:18       ` Okanovic, Haris
2024-11-05 18:30   ` [PATCH 2/5] arm64: add __READ_ONCE_EX() Haris Okanovic
2024-11-05 19:39     ` Christoph Lameter (Ampere)
2024-11-06 17:37       ` Okanovic, Haris
2024-11-06 11:43     ` Will Deacon
2024-11-06 17:09       ` Okanovic, Haris
2024-11-09  9:49     ` David Laight
2024-11-05 18:30   ` [PATCH 3/5] arm64: refactor delay() to enable polling for value Haris Okanovic
2024-11-05 19:42     ` Christoph Lameter (Ampere)
2024-11-06 17:42       ` Okanovic, Haris
2024-11-06  9:18     ` Catalin Marinas
2024-11-06 17:38       ` Okanovic, Haris
2024-11-05 18:30   ` [PATCH 4/5] arm64: add smp_vcond_load_relaxed() Haris Okanovic
2024-11-05 18:30   ` [PATCH 5/5] cpuidle: implement poll_idle() using smp_vcond_load_relaxed() Haris Okanovic
2024-11-05 19:45     ` Christoph Lameter (Ampere)
2024-11-05 18:49   ` [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
