[PATCH v8 00/11] Enable haltpoll on arm64

linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v8 00/11] Enable haltpoll on arm64
@ 2024-09-25 23:24 Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed() Ankur Arora
                   ` (14 more replies)
  0 siblings, 15 replies; 74+ messages in thread
From: Ankur Arora @ 2024-09-25 23:24 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

This patchset enables the cpuidle-haltpoll driver and its namesake
governor on arm64. This is specifically interesting for KVM guests by
reducing IPC latencies.

Comparing idle switching latencies on an arm64 KVM guest with 
perf bench sched pipe:

                                     usecs/op       %stdev   

  no haltpoll (baseline)               13.48       +-  5.19%
  with haltpoll                         6.84       +- 22.07%


No change in performance for a similar test on x86:

                                     usecs/op        %stdev   

  haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
  haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%

Both sets of tests were on otherwise idle systems with guest VCPUs
pinned to specific PCPUs. One reason for the higher stdev on arm64
is that trapping of the WFE instruction by the host KVM is contingent
on the number of tasks on the runqueue.

Tomohiro Misono and Haris Okanovic also report similar latency
improvements on Grace and Graviton systems (for v7) [1] [2].

The patch series is organized in three parts: 

 - patch 1, reorganizes the poll_idle() loop, switching to
   smp_cond_load_relaxed() in the polling loop.
   Relatedly patches 2, 3 mangle the config option ARCH_HAS_CPU_RELAX,
   renaming it to ARCH_HAS_OPTIMIZED_POLL.

 - patches 4-6 reorganize the haltpoll selection and init logic
   to allow architecture code to select it. 

 - and finally, patches 7-11 add the bits for arm64 support.

What is still missing: this series largely completes the haltpoll side
of functionality for arm64. There are, however, a few related areas
that still need to be threshed out:

 - WFET support: WFE on arm64 does not guarantee that poll_idle()
   would terminate in halt_poll_ns. Using WFET would address this.
 - KVM_NO_POLL support on arm64
 - KVM TWED support on arm64: allow the host to limit time spent in
   WFE.


Changelog:

v8: No logic changes. Largely respin of v7, with changes
noted below:

 - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
   own patch.
   (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
   
 - address comments simplifying arm64 support (Will Deacon)
   (patch-11 "arm64: support cpuidle-haltpoll")

v7: No significant logic changes. Mostly a respin of v6.

 - minor cleanup in poll_idle() (Christoph Lameter)
 - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
   (Tomohiro Misono)

v6:

 - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
   changes together (comment from Christoph Lameter)
 - threshes out the commit messages a bit more (comments from Christoph
   Lameter, Sudeep Holla)
 - also rework selection of cpuidle-haltpoll. Now selected based
   on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
 - moved back to arch_haltpoll_want() (comment from Joao Martins)
   Also, arch_haltpoll_want() now takes the force parameter and is
   now responsible for the complete selection (or not) of haltpoll.
 - fixes the build breakage on i386
 - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
   Tomohiro Misono, Haris Okanovic)


v5:
 - rework the poll_idle() loop around smp_cond_load_relaxed() (review
   comment from Tomohiro Misono.)
 - also rework selection of cpuidle-haltpoll. Now selected based
   on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
 - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
   arm64 now depends on the event-stream being enabled.
 - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
 - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.

v4 changes from v3:
 - change 7/8 per Rafael input: drop the parens and use ret for the final check
 - add 8/8 which renames the guard for building poll_state

v3 changes from v2:
 - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
 - add Ack-by from Rafael Wysocki on 2/7

v2 changes from v1:
 - added patch 7 where we change cpu_relax with smp_cond_load_relaxed per PeterZ
   (this improves by 50% at least the CPU cycles consumed in the tests above:
   10,716,881,137 now vs 14,503,014,257 before)
 - removed the ifdef from patch 1 per RafaelW

Please review.

[1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
[2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/

Ankur Arora (6):
  cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
  cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
  arm64: idle: export arch_cpu_idle
  arm64: select ARCH_HAS_OPTIMIZED_POLL
  cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
  arm64: support cpuidle-haltpoll

Joao Martins (4):
  Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
  cpuidle-haltpoll: define arch_haltpoll_want()
  governors/haltpoll: drop kvm_para_available() check
  arm64: define TIF_POLLING_NRFLAG

Mihai Carabas (1):
  cpuidle/poll_state: poll via smp_cond_load_relaxed()

 arch/Kconfig                              |  3 +++
 arch/arm64/Kconfig                        |  7 +++++++
 arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
 arch/arm64/include/asm/thread_info.h      |  2 ++
 arch/arm64/kernel/idle.c                  |  1 +
 arch/x86/Kconfig                          |  5 ++---
 arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
 arch/x86/kernel/kvm.c                     | 13 ++++++++++++
 drivers/acpi/processor_idle.c             |  4 ++--
 drivers/cpuidle/Kconfig                   |  5 ++---
 drivers/cpuidle/Makefile                  |  2 +-
 drivers/cpuidle/cpuidle-haltpoll.c        | 12 +-----------
 drivers/cpuidle/governors/haltpoll.c      |  6 +-----
 drivers/cpuidle/poll_state.c              | 22 +++++++++++++++------
 drivers/idle/Kconfig                      |  1 +
 include/linux/cpuidle.h                   |  2 +-
 include/linux/cpuidle_haltpoll.h          |  5 +++++
 17 files changed, 83 insertions(+), 32 deletions(-)
 create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h

-- 
2.43.5


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
@ 2024-09-25 23:24 ` Ankur Arora
  2024-10-15 12:04   ` Catalin Marinas
  2024-09-25 23:24 ` [PATCH v8 02/11] cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL Ankur Arora
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Ankur Arora @ 2024-09-25 23:24 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

From: Mihai Carabas <mihai.carabas@oracle.com>

The inner loop in poll_idle() polls up to POLL_IDLE_RELAX_COUNT times,
checking to see if the thread has the TIF_NEED_RESCHED bit set. The
loop exits once the condition is met, or if the poll time limit has
been exceeded.

To minimize the number of instructions executed each iteration, the
time check is done only infrequently (once every POLL_IDLE_RELAX_COUNT
iterations). In addition, each loop iteration executes cpu_relax()
which on certain platforms provides a hint to the pipeline that the
loop is busy-waiting, thus allowing the processor to reduce power
consumption.

However, cpu_relax() is defined optimally only on x86. On arm64, for
instance, it is implemented as a YIELD which only serves a hint to the
CPU that it prioritize a different hardware thread if one is available.
arm64, however, does expose a more optimal polling mechanism via
smp_cond_load_relaxed() which uses LDXR, WFE to wait until a store
to a specified region.

So restructure the loop, folding both checks in smp_cond_load_relaxed().
Also, move the time check to the head of the loop allowing it to exit
straight-away once TIF_NEED_RESCHED is set.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/cpuidle/poll_state.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index 9b6d90a72601..fc1204426158 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,

 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		unsigned int loop_count = 0;
 		u64 limit;

 		limit = cpuidle_poll_time(drv, dev);

 		while (!need_resched()) {
-			cpu_relax();
-			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
-				continue;
-
-			loop_count = 0;
+			unsigned int loop_count = 0;
 			if (local_clock_noinstr() - time_start > limit) {
 				dev->poll_time_limit = true;
 				break;
 			}
+
+			smp_cond_load_relaxed(&current_thread_info()->flags,
+					      VAL & _TIF_NEED_RESCHED ||
+					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
 		}
 	}
 	raw_local_irq_disable();
-- 
2.43.5

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-09-25 23:24 ` [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed() Ankur Arora
@ 2024-10-15 12:04   ` Catalin Marinas
  2024-10-15 16:42     ` Christoph Lameter (Ampere)
                       ` (2 more replies)
  0 siblings, 3 replies; 74+ messages in thread
From: Catalin Marinas @ 2024-10-15 12:04 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> index 9b6d90a72601..fc1204426158 100644
> --- a/drivers/cpuidle/poll_state.c
> +++ b/drivers/cpuidle/poll_state.c
> @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
>  
>  	raw_local_irq_enable();
>  	if (!current_set_polling_and_test()) {
> -		unsigned int loop_count = 0;
>  		u64 limit;
>  
>  		limit = cpuidle_poll_time(drv, dev);
>  
>  		while (!need_resched()) {
> -			cpu_relax();
> -			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> -				continue;
> -
> -			loop_count = 0;
> +			unsigned int loop_count = 0;
>  			if (local_clock_noinstr() - time_start > limit) {
>  				dev->poll_time_limit = true;
>  				break;
>  			}
> +
> +			smp_cond_load_relaxed(&current_thread_info()->flags,
> +					      VAL & _TIF_NEED_RESCHED ||
> +					      loop_count++ >= POLL_IDLE_RELAX_COUNT);

The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
never set. With the event stream enabled on arm64, the WFE will
eventually be woken up, loop_count incremented and the condition would
become true. However, the smp_cond_load_relaxed() semantics require that
a different agent updates the variable being waited on, not the waiting
CPU updating it itself. Also note that the event stream can be disabled
on arm64 on the kernel command line.

Does the code above break any other architecture? I'd say if you want
something like this, better introduce a new smp_cond_load_timeout()
API. The above looks like a hack that may only work on arm64 when the
event stream is enabled.

A generic option is udelay() (on arm64 it would use WFE/WFET by
default). Not sure how important it is for poll_idle() but the downside
of udelay() that it won't be able to also poll need_resched() while
waiting for the timeout. If this matters, you could instead make smaller
udelay() calls. Yet another problem, I don't know how energy efficient
udelay() is on x86 vs cpu_relax().

So maybe an smp_cond_load_timeout() would be better, implemented with
cpu_relax() generically and the arm64 would use LDXR, WFE and rely on
the event stream (or fall back to cpu_relax() if the event stream is
disabled).

-- 
Catalin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 12:04   ` Catalin Marinas
@ 2024-10-15 16:42     ` Christoph Lameter (Ampere)
  2024-10-15 16:50       ` Catalin Marinas
  2024-10-15 21:32     ` Ankur Arora
  2024-10-16 15:13     ` Okanovic, Haris
  2 siblings, 1 reply; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-10-15 16:42 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel, will,
	tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, 15 Oct 2024, Catalin Marinas wrote:

> > +			unsigned int loop_count = 0;
> >  			if (local_clock_noinstr() - time_start > limit) {
> >  				dev->poll_time_limit = true;
> >  				break;
> >  			}
> > +
> > +			smp_cond_load_relaxed(&current_thread_info()->flags,
> > +					      VAL & _TIF_NEED_RESCHED ||
> > +					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
>
> The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> never set. With the event stream enabled on arm64, the WFE will
> eventually be woken up, loop_count incremented and the condition would
> become true. However, the smp_cond_load_relaxed() semantics require that
> a different agent updates the variable being waited on, not the waiting
> CPU updating it itself. Also note that the event stream can be disabled
> on arm64 on the kernel command line.


Setting of need_resched() from another processor involves sending an IPI
after that was set. I dont think we need to smp_cond_load_relaxed since
the IPI will cause an event. For ARM a WFE would be sufficient.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 16:42     ` Christoph Lameter (Ampere)
@ 2024-10-15 16:50       ` Catalin Marinas
  2024-10-15 17:17         ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 74+ messages in thread
From: Catalin Marinas @ 2024-10-15 16:50 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel, will,
	tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Oct 15, 2024 at 09:42:56AM -0700, Christoph Lameter (Ampere) wrote:
> On Tue, 15 Oct 2024, Catalin Marinas wrote:
> > > +			unsigned int loop_count = 0;
> > >  			if (local_clock_noinstr() - time_start > limit) {
> > >  				dev->poll_time_limit = true;
> > >  				break;
> > >  			}
> > > +
> > > +			smp_cond_load_relaxed(&current_thread_info()->flags,
> > > +					      VAL & _TIF_NEED_RESCHED ||
> > > +					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
> >
> > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> > never set. With the event stream enabled on arm64, the WFE will
> > eventually be woken up, loop_count incremented and the condition would
> > become true. However, the smp_cond_load_relaxed() semantics require that
> > a different agent updates the variable being waited on, not the waiting
> > CPU updating it itself. Also note that the event stream can be disabled
> > on arm64 on the kernel command line.
> 
> Setting of need_resched() from another processor involves sending an IPI
> after that was set. I dont think we need to smp_cond_load_relaxed since
> the IPI will cause an event. For ARM a WFE would be sufficient.

I'm not worried about the need_resched() case, even without an IPI it
would still work.

The loop_count++ side of the condition is supposed to timeout in the
absence of a need_resched() event. You can't do an smp_cond_load_*() on
a variable that's only updated by the waiting CPU. Nothing guarantees to
wake it up to update the variable (the event stream on arm64, yes, but
that's generic code).

-- 
Catalin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 16:50       ` Catalin Marinas
@ 2024-10-15 17:17         ` Christoph Lameter (Ampere)
  2024-10-15 17:40           ` Catalin Marinas
  0 siblings, 1 reply; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-10-15 17:17 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel, will,
	tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

[-- Attachment #1: Type: text/plain, Size: 3786 bytes --]

On Tue, 15 Oct 2024, Catalin Marinas wrote:

> > Setting of need_resched() from another processor involves sending an IPI
> > after that was set. I dont think we need to smp_cond_load_relaxed since
> > the IPI will cause an event. For ARM a WFE would be sufficient.
>
> I'm not worried about the need_resched() case, even without an IPI it
> would still work.
>
> The loop_count++ side of the condition is supposed to timeout in the
> absence of a need_resched() event. You can't do an smp_cond_load_*() on
> a variable that's only updated by the waiting CPU. Nothing guarantees to
> wake it up to update the variable (the event stream on arm64, yes, but
> that's generic code).

Hmm... I have WFET implementation here without smp_cond modelled after
the delay() implementation ARM64 (but its not generic and there is
an additional patch required to make this work. Intermediate patch
attached)


From: Christoph Lameter (Ampere) <cl@gentwo.org>
Subject: [Haltpoll: Implement waiting using WFET

Use WFET if the hardware supports it to implement
a wait until something happens to wake up the cpu.

If WFET is not available then use the stream event
source to periodically wake up until an event happens
or the timeout expires.

The smp_cond_wait() is not necessary because the scheduler
will create an event on the targeted cpu by sending an IPI.

Without cond_wait we can simply take the basic approach
from the delay() function and customize it a bit.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 drivers/cpuidle/poll_state.c | 43 +++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

Index: linux/drivers/cpuidle/poll_state.c
===================================================================
--- linux.orig/drivers/cpuidle/poll_state.c
+++ linux/drivers/cpuidle/poll_state.c
@@ -5,48 +5,41 @@

 #include <linux/cpuidle.h>
 #include <linux/sched.h>
-#include <linux/sched/clock.h>
 #include <linux/sched/idle.h>
-
-#ifdef CONFIG_ARM64
-/*
- * POLL_IDLE_RELAX_COUNT determines how often we check for timeout
- * while polling for TIF_NEED_RESCHED in thread_info->flags.
- *
- * Set this to a low value since arm64, instead of polling, uses a
- * event based mechanism.
- */
-#define POLL_IDLE_RELAX_COUNT	1
-#else
-#define POLL_IDLE_RELAX_COUNT	200
-#endif
+#include <clocksource/arm_arch_timer.h>

 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
-	u64 time_start;
-
-	time_start = local_clock_noinstr();
+	const cycles_t start = get_cycles();

 	dev->poll_time_limit = false;

 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		u64 limit;

-		limit = cpuidle_poll_time(drv, dev);
+		const cycles_t end = start + ARCH_TIMER_NSECS_TO_CYCLES(cpuidle_poll_time(drv, dev));

 		while (!need_resched()) {
-			unsigned int loop_count = 0;
-			if (local_clock_noinstr() - time_start > limit) {
-				dev->poll_time_limit = true;
-				break;
-			}

-			smp_cond_load_relaxed(&current_thread_info()->flags,
-					      VAL & _TIF_NEED_RESCHED ||
-					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
+			if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
+
+				/* We can power down for a configurable interval while waiting */
+				while (!need_resched() && get_cycles() < end)
+						wfet(end);
+
+			} else if (arch_timer_evtstrm_available()) {
+				const cycles_t timer_period = ARCH_TIMER_USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);
+
+				/* Wake up periodically during evstream events */
+				while (!need_resched() && get_cycles() + timer_period <= end)
+						wfe();
+			}
 		}
+
+		/* In case time is not up yet due to coarse time intervals above */
+		while (!need_resched() && get_cycles() < end)
+					cpu_relax();
 	}
 	raw_local_irq_disable();

[-- Attachment #2: arch timr mods --]
[-- Type: text/plain, Size: 2205 bytes --]

From: Christoph Lameter (Ampere) <cl@linux.com>

Move the conversion from time to cycles of arch_timer
into arch_timer.h. Add nsec conversion since we will need that soon.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/arch/arm64/lib/delay.c
===================================================================
--- linux.orig/arch/arm64/lib/delay.c
+++ linux/arch/arm64/lib/delay.c
@@ -15,14 +15,6 @@
 
 #include <clocksource/arm_arch_timer.h>
 
-#define USECS_TO_CYCLES(time_usecs)			\
-	xloops_to_cycles((time_usecs) * 0x10C7UL)
-
-static inline unsigned long xloops_to_cycles(unsigned long xloops)
-{
-	return (xloops * loops_per_jiffy * HZ) >> 32;
-}
-
 void __delay(unsigned long cycles)
 {
 	cycles_t start = get_cycles();
@@ -39,7 +31,7 @@ void __delay(unsigned long cycles)
 			wfet(end);
 	} else 	if (arch_timer_evtstrm_available()) {
 		const cycles_t timer_evt_period =
-			USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);
+			ARCH_TIMER_USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);
 
 		while ((get_cycles() - start + timer_evt_period) < cycles)
 			wfe();
@@ -52,7 +44,7 @@ EXPORT_SYMBOL(__delay);
 
 inline void __const_udelay(unsigned long xloops)
 {
-	__delay(xloops_to_cycles(xloops));
+	__delay(arch_timer_xloops_to_cycles(xloops));
 }
 EXPORT_SYMBOL(__const_udelay);
 
Index: linux/include/clocksource/arm_arch_timer.h
===================================================================
--- linux.orig/include/clocksource/arm_arch_timer.h
+++ linux/include/clocksource/arm_arch_timer.h
@@ -90,6 +90,19 @@ extern u64 (*arch_timer_read_counter)(vo
 extern struct arch_timer_kvm_info *arch_timer_get_kvm_info(void);
 extern bool arch_timer_evtstrm_available(void);
 
+#include <linux/delay.h>
+
+static inline unsigned long arch_timer_xloops_to_cycles(unsigned long xloops)
+{
+	return (xloops * loops_per_jiffy * HZ) >> 32;
+}
+
+#define ARCH_TIMER_USECS_TO_CYCLES(time_usecs)			\
+	arch_timer_xloops_to_cycles((time_usecs) * 0x10C7UL)
+
+#define ARCH_TIMER_NSECS_TO_CYCLES(time_nsecs)			\
+	arch_timer_xloops_to_cycles((time_nsecs) * 0x5UL)
+
 #else
 
 static inline u32 arch_timer_get_rate(void)

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 17:17         ` Christoph Lameter (Ampere)
@ 2024-10-15 17:40           ` Catalin Marinas
  2024-10-15 21:53             ` Ankur Arora
  0 siblings, 1 reply; 74+ messages in thread
From: Catalin Marinas @ 2024-10-15 17:40 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel, will,
	tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Oct 15, 2024 at 10:17:13AM -0700, Christoph Lameter (Ampere) wrote:
> On Tue, 15 Oct 2024, Catalin Marinas wrote:
> > > Setting of need_resched() from another processor involves sending an IPI
> > > after that was set. I dont think we need to smp_cond_load_relaxed since
> > > the IPI will cause an event. For ARM a WFE would be sufficient.
> >
> > I'm not worried about the need_resched() case, even without an IPI it
> > would still work.
> >
> > The loop_count++ side of the condition is supposed to timeout in the
> > absence of a need_resched() event. You can't do an smp_cond_load_*() on
> > a variable that's only updated by the waiting CPU. Nothing guarantees to
> > wake it up to update the variable (the event stream on arm64, yes, but
> > that's generic code).
> 
> Hmm... I have WFET implementation here without smp_cond modelled after
> the delay() implementation ARM64 (but its not generic and there is
> an additional patch required to make this work. Intermediate patch
> attached)

At least one additional patch ;). But yeah, I suggested hiding all this
behind something like smp_cond_load_timeout() which would wait on
current_thread_info()->flags but with a timeout. The arm64
implementation would follow some of the logic in __delay(). Others may
simply poll with cpu_relax().

Alternatively, if we get an IPI anyway, we can avoid smp_cond_load() and
rely on need_resched() and some new delay/cpu_relax() API that waits for
a timeout or an IPI, whichever comes first. E.g. cpu_relax_timeout()
which on arm64 it's just a simplified version of __delay() without the
'while' loops.

-- 
Catalin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 17:40           ` Catalin Marinas
@ 2024-10-15 21:53             ` Ankur Arora
  2024-10-15 22:28               ` Christoph Lameter (Ampere)
  2024-10-15 22:40               ` Christoph Lameter (Ampere)
  0 siblings, 2 replies; 74+ messages in thread
From: Ankur Arora @ 2024-10-15 21:53 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Christoph Lameter (Ampere), Ankur Arora, linux-pm, kvm,
	linux-arm-kernel, linux-kernel, will, tglx, mingo, bp,
	dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, misono.tomohiro, maobibo, joao.m.martins,
	boris.ostrovsky, konrad.wilk


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Tue, Oct 15, 2024 at 10:17:13AM -0700, Christoph Lameter (Ampere) wrote:
>> On Tue, 15 Oct 2024, Catalin Marinas wrote:
>> > > Setting of need_resched() from another processor involves sending an IPI
>> > > after that was set. I dont think we need to smp_cond_load_relaxed since
>> > > the IPI will cause an event. For ARM a WFE would be sufficient.
>> >
>> > I'm not worried about the need_resched() case, even without an IPI it
>> > would still work.
>> >
>> > The loop_count++ side of the condition is supposed to timeout in the
>> > absence of a need_resched() event. You can't do an smp_cond_load_*() on
>> > a variable that's only updated by the waiting CPU. Nothing guarantees to
>> > wake it up to update the variable (the event stream on arm64, yes, but
>> > that's generic code).
>>
>> Hmm... I have WFET implementation here without smp_cond modelled after
>> the delay() implementation ARM64 (but its not generic and there is
>> an additional patch required to make this work. Intermediate patch
>> attached)
>
> At least one additional patch ;). But yeah, I suggested hiding all this
> behind something like smp_cond_load_timeout() which would wait on
> current_thread_info()->flags but with a timeout. The arm64
> implementation would follow some of the logic in __delay(). Others may
> simply poll with cpu_relax().
>
> Alternatively, if we get an IPI anyway, we can avoid smp_cond_load() and
> rely on need_resched() and some new delay/cpu_relax() API that waits for
> a timeout or an IPI, whichever comes first. E.g. cpu_relax_timeout()
> which on arm64 it's just a simplified version of __delay() without the
> 'while' loops.

AFAICT when polling (which we are since poll_idle() calls
current_set_polling_and_test()), the scheduler will elide the IPI
by remotely setting the need-resched bit via set_nr_if_polling().

Once we stop polling then the scheduler should take the IPI path
because call_function_single_prep_ipi() will fail.

--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 21:53             ` Ankur Arora
@ 2024-10-15 22:28               ` Christoph Lameter (Ampere)
  2024-10-16  7:06                 ` Ankur Arora
  2024-10-15 22:40               ` Christoph Lameter (Ampere)
  1 sibling, 1 reply; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-10-15 22:28 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Catalin Marinas, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	will, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, 15 Oct 2024, Ankur Arora wrote:

> > Alternatively, if we get an IPI anyway, we can avoid smp_cond_load() and
> > rely on need_resched() and some new delay/cpu_relax() API that waits for
> > a timeout or an IPI, whichever comes first. E.g. cpu_relax_timeout()
> > which on arm64 it's just a simplified version of __delay() without the
> > 'while' loops.
>
> AFAICT when polling (which we are since poll_idle() calls
> current_set_polling_and_test()), the scheduler will elide the IPI
> by remotely setting the need-resched bit via set_nr_if_polling().

The scheduler runs on multiple cores. The core on which we are
running this code puts the core into a wait state so the scheduler does
not run on this core at all during the wait period.

The other cores may run scheduler functions and set the need_resched bit
for the core where we are currently waiting.

The other core will wake our core up by sending an IPI. The IPI will
invoke a scheduler function on our core and the WFE will continue.

> Once we stop polling then the scheduler should take the IPI path
> because call_function_single_prep_ipi() will fail.

The IPI stops the polling. IPI is an interrupt.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 22:28               ` Christoph Lameter (Ampere)
@ 2024-10-16  7:06                 ` Ankur Arora
  2024-10-17 16:54                   ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 74+ messages in thread
From: Ankur Arora @ 2024-10-16  7:06 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Ankur Arora, Catalin Marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk


Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Tue, 15 Oct 2024, Ankur Arora wrote:
>
>> > Alternatively, if we get an IPI anyway, we can avoid smp_cond_load() and
>> > rely on need_resched() and some new delay/cpu_relax() API that waits for
>> > a timeout or an IPI, whichever comes first. E.g. cpu_relax_timeout()
>> > which on arm64 it's just a simplified version of __delay() without the
>> > 'while' loops.
>>
>> AFAICT when polling (which we are since poll_idle() calls
>> current_set_polling_and_test()), the scheduler will elide the IPI
>> by remotely setting the need-resched bit via set_nr_if_polling().
>
> The scheduler runs on multiple cores. The core on which we are
> running this code puts the core into a wait state so the scheduler does
> not run on this core at all during the wait period.

Yes.

> The other cores may run scheduler functions and set the need_resched bit
> for the core where we are currently waiting.

Yes.

> The other core will wake our core up by sending an IPI. The IPI will
> invoke a scheduler function on our core and the WFE will continue.

Why? The target core is not sleeping. It is *polling* on a memory
address (on arm64, via LDXR; WFE). Ergo an IPI is not needed to tell
it that a need-resched bit is set.

>> Once we stop polling then the scheduler should take the IPI path
>> because call_function_single_prep_ipi() will fail.
>
> The IPI stops the polling. IPI is an interrupt.

Yes an IPI is an interrupt. And, since the target is polling there's
no need for an interrupt to inform it that a memory address on which
it is polling has changed.

resched_curr() is a good example. It only sends the resched IPI if the
target is not polling.

resched_curr() {
         ...
        if (set_nr_and_not_polling(curr))
                smp_send_reschedule(cpu);
        else
                trace_sched_wake_idle_without_ipi(cpu);
}

--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-16  7:06                 ` Ankur Arora
@ 2024-10-17 16:54                   ` Christoph Lameter (Ampere)
  2024-10-17 18:36                     ` Ankur Arora
  0 siblings, 1 reply; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-10-17 16:54 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Catalin Marinas, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	will, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Wed, 16 Oct 2024, Ankur Arora wrote:

> > The other core will wake our core up by sending an IPI. The IPI will
> > invoke a scheduler function on our core and the WFE will continue.
>
> Why? The target core is not sleeping. It is *polling* on a memory
> address (on arm64, via LDXR; WFE). Ergo an IPI is not needed to tell
> it that a need-resched bit is set.

The IPI is sent to interrupt the process that is not sleeping. This is
done so the busy processor can reschedule the currently running process
and respond to the event.

It does not matter if the core is "sleeping" or not.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-17 16:54                   ` Christoph Lameter (Ampere)
@ 2024-10-17 18:36                     ` Ankur Arora
  0 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-10-17 18:36 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Ankur Arora, Catalin Marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk


Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Wed, 16 Oct 2024, Ankur Arora wrote:
>
>> > The other core will wake our core up by sending an IPI. The IPI will
>> > invoke a scheduler function on our core and the WFE will continue.
>>
>> Why? The target core is not sleeping. It is *polling* on a memory
>> address (on arm64, via LDXR; WFE). Ergo an IPI is not needed to tell
>> it that a need-resched bit is set.
>
> The IPI is sent to interrupt the process that is not sleeping. This is
> done so the busy processor can reschedule the currently running process
> and respond to the event.

The scheduler treats idle specially (if the architecture defines
TIF_POLLING_NRFLAG). There's also the sched_wake_idle_without_ipi
tracepoint for this path.

$ sudo perf stat -e sched:sched_wake_idle_without_ipi perf bench sched pipe
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes

     Total time: 5.173 [sec]

       5.173613 usecs/op
         193288 ops/sec

 Performance counter stats for 'perf bench sched pipe':

         1,992,368      sched:sched_wake_idle_without_ipi

       5.178976487 seconds time elapsed

       0.396076000 seconds user
       6.999566000 seconds sys

--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 21:53             ` Ankur Arora
  2024-10-15 22:28               ` Christoph Lameter (Ampere)
@ 2024-10-15 22:40               ` Christoph Lameter (Ampere)
  2024-10-16  9:54                 ` Catalin Marinas
  1 sibling, 1 reply; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-10-15 22:40 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Catalin Marinas, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	will, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

Here is a patch that keeps the cpuidle stuiff generic but allows an
override by arm64..


From: Christoph Lameter (Ampere) <cl@linux.com>
Subject: Revise cpu poll idle to make full use of wfet() and wfe()

ARM64 has instructions that can wait for an event and timeouts.

Clean up the code in drivers/cpuidle/ to wait until the end
of a period and allow the override of the handling of the
waiting by an architecture.

Provide an optimized wait function for arm64.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/arch/arm64/lib/delay.c
===================================================================
--- linux.orig/arch/arm64/lib/delay.c
+++ linux/arch/arm64/lib/delay.c
@@ -12,6 +12,8 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/timex.h>
+#include <linux/sched/clock.h>
+#include <linux/cpuidle.h>

 #include <clocksource/arm_arch_timer.h>

@@ -67,3 +69,27 @@ void __ndelay(unsigned long nsecs)
 	__const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
 }
 EXPORT_SYMBOL(__ndelay);
+
+void cpuidle_wait_for_resched_with_timeout(u64 end)
+{
+	u64 start;
+
+	while (!need_resched() && (start = local_clock_noinstr()) < end) {
+
+		if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
+
+			/* Processor supports waiting for a specified period */
+			wfet(xloops_to_cycles((end - start) * 0x5UL));
+
+		} else
+		if (arch_timer_evtstrm_available() && start + ARCH_TIMER_EVT_STREAM_PERIOD_US * 1000 < end) {
+
+			/* We can wait until a periodic event occurs */
+			wfe();
+
+		} else
+			/* Need to spin until the end */
+			cpu_relax();
+	}
+}
+
Index: linux/drivers/cpuidle/poll_state.c
===================================================================
--- linux.orig/drivers/cpuidle/poll_state.c
+++ linux/drivers/cpuidle/poll_state.c
@@ -8,35 +8,29 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/idle.h>

-#define POLL_IDLE_RELAX_COUNT	200
+__weak void cpuidle_wait_for_resched_with_timeout(u64 end)
+{
+	while (!need_resched() && local_clock_noinstr() < end) {
+		cpu_relax();
+	}
+}

 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
-	u64 time_start;
-
-	time_start = local_clock_noinstr();
+	u64 time_start = local_clock_noinstr();
+	u64 time_end = time_start + cpuidle_poll_time(drv, dev);

 	dev->poll_time_limit = false;

 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		unsigned int loop_count = 0;
-		u64 limit;

-		limit = cpuidle_poll_time(drv, dev);
+		cpuidle_wait_for_resched_with_timeout(time_end);
+
+		if (!need_resched())
+			dev->poll_time_limit = true;

-		while (!need_resched()) {
-			cpu_relax();
-			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
-				continue;
-
-			loop_count = 0;
-			if (local_clock_noinstr() - time_start > limit) {
-				dev->poll_time_limit = true;
-				break;
-			}
-		}
 	}
 	raw_local_irq_disable();

Index: linux/include/linux/cpuidle.h
===================================================================
--- linux.orig/include/linux/cpuidle.h
+++ linux/include/linux/cpuidle.h
@@ -202,6 +202,9 @@ extern int cpuidle_play_dead(void);
 extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev);
 static inline struct cpuidle_device *cpuidle_get_device(void)
 {return __this_cpu_read(cpuidle_devices); }
+
+extern __weak void cpuidle_wait_for_resched_with_timeout(u64);
+
 #else
 static inline void disable_cpuidle(void) { }
 static inline bool cpuidle_not_available(struct cpuidle_driver *drv,

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 22:40               ` Christoph Lameter (Ampere)
@ 2024-10-16  9:54                 ` Catalin Marinas
  2024-10-17 16:56                   ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 74+ messages in thread
From: Catalin Marinas @ 2024-10-16  9:54 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel, will,
	tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Oct 15, 2024 at 03:40:33PM -0700, Christoph Lameter (Ampere) wrote:
> Index: linux/arch/arm64/lib/delay.c
> ===================================================================
> --- linux.orig/arch/arm64/lib/delay.c
> +++ linux/arch/arm64/lib/delay.c
> @@ -12,6 +12,8 @@
>  #include <linux/kernel.h>
>  #include <linux/module.h>
>  #include <linux/timex.h>
> +#include <linux/sched/clock.h>
> +#include <linux/cpuidle.h>
> 
>  #include <clocksource/arm_arch_timer.h>
> 
> @@ -67,3 +69,27 @@ void __ndelay(unsigned long nsecs)
>  	__const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
>  }
>  EXPORT_SYMBOL(__ndelay);
> +
> +void cpuidle_wait_for_resched_with_timeout(u64 end)
> +{
> +	u64 start;
> +
> +	while (!need_resched() && (start = local_clock_noinstr()) < end) {
> +
> +		if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
> +
> +			/* Processor supports waiting for a specified period */
> +			wfet(xloops_to_cycles((end - start) * 0x5UL));
> +
> +		} else
> +		if (arch_timer_evtstrm_available() && start + ARCH_TIMER_EVT_STREAM_PERIOD_US * 1000 < end) {
> +
> +			/* We can wait until a periodic event occurs */
> +			wfe();
> +
> +		} else
> +			/* Need to spin until the end */
> +			cpu_relax();
> +	}
> +}

The behaviour above is slightly different from the current poll_idle()
implementation. The above is more like poll every timeout period rather
than continuously poll until either the need_resched() condition is true
_or_ the timeout expired. From Ankur's email, an IPI may not happen so
we don't have any guarantee that WFET will wake up before the timeout.
The only way for WFE/WFET to wake up on need_resched() is to use LDXR to
arm the exclusive monitor. That's what smp_cond_load_relaxed() does.

If you only need the behaviour proposed above, you might as well go for
udelay() directly. Otherwise I think we need to revisit Ankur's
smp_cond_load_timeout() proposal from earlier this year.

-- 
Catalin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-16  9:54                 ` Catalin Marinas
@ 2024-10-17 16:56                   ` Christoph Lameter (Ampere)
  2024-10-17 18:15                     ` Catalin Marinas
  0 siblings, 1 reply; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-10-17 16:56 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel, will,
	tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Wed, 16 Oct 2024, Catalin Marinas wrote:

> The behaviour above is slightly different from the current poll_idle()
> implementation. The above is more like poll every timeout period rather
> than continuously poll until either the need_resched() condition is true
> _or_ the timeout expired. From Ankur's email, an IPI may not happen so
> we don't have any guarantee that WFET will wake up before the timeout.
> The only way for WFE/WFET to wake up on need_resched() is to use LDXR to
> arm the exclusive monitor. That's what smp_cond_load_relaxed() does.

Sorry no. The IPI will cause the WFE to continue immediately and not wait
till the end of the timeout period.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-17 16:56                   ` Christoph Lameter (Ampere)
@ 2024-10-17 18:15                     ` Catalin Marinas
  2024-10-17 19:34                       ` Ankur Arora
  0 siblings, 1 reply; 74+ messages in thread
From: Catalin Marinas @ 2024-10-17 18:15 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel, will,
	tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Thu, Oct 17, 2024 at 09:56:13AM -0700, Christoph Lameter (Ampere) wrote:
> On Wed, 16 Oct 2024, Catalin Marinas wrote:
> > The behaviour above is slightly different from the current poll_idle()
> > implementation. The above is more like poll every timeout period rather
> > than continuously poll until either the need_resched() condition is true
> > _or_ the timeout expired. From Ankur's email, an IPI may not happen so
> > we don't have any guarantee that WFET will wake up before the timeout.
> > The only way for WFE/WFET to wake up on need_resched() is to use LDXR to
> > arm the exclusive monitor. That's what smp_cond_load_relaxed() does.
> 
> Sorry no. The IPI will cause the WFE to continue immediately and not wait
> till the end of the timeout period.

*If* there is an IPI. The scheduler is not really my area but some
functions like wake_up_idle_cpu() seem to elide the IPI if
TIF_NR_POLLING is set.

But even if we had an IPI, it still feels like abusing the semantics of
smp_cond_load_relaxed() when relying on it to increment a variable in
the condition check as a result of some unrelated wake-up event. This
API is meant to wait for a condition on a single variable. It cannot
wait on multiple variables and especially not one it updates itself
(even if it happens to work on arm64 under certain conditions).

My strong preference would be to revive the smp_cond_load_timeout()
proposal from Ankur earlier in the year.

-- 
Catalin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-17 18:15                     ` Catalin Marinas
@ 2024-10-17 19:34                       ` Ankur Arora
  0 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-10-17 19:34 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Christoph Lameter (Ampere), Ankur Arora, linux-pm, kvm,
	linux-arm-kernel, linux-kernel, will, tglx, mingo, bp,
	dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, misono.tomohiro, maobibo, joao.m.martins,
	boris.ostrovsky, konrad.wilk


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Thu, Oct 17, 2024 at 09:56:13AM -0700, Christoph Lameter (Ampere) wrote:
>> On Wed, 16 Oct 2024, Catalin Marinas wrote:
>> > The behaviour above is slightly different from the current poll_idle()
>> > implementation. The above is more like poll every timeout period rather
>> > than continuously poll until either the need_resched() condition is true
>> > _or_ the timeout expired. From Ankur's email, an IPI may not happen so
>> > we don't have any guarantee that WFET will wake up before the timeout.
>> > The only way for WFE/WFET to wake up on need_resched() is to use LDXR to
>> > arm the exclusive monitor. That's what smp_cond_load_relaxed() does.
>>
>> Sorry no. The IPI will cause the WFE to continue immediately and not wait
>> till the end of the timeout period.
>
> *If* there is an IPI. The scheduler is not really my area but some
> functions like wake_up_idle_cpu() seem to elide the IPI if
> TIF_NR_POLLING is set.
>
> But even if we had an IPI, it still feels like abusing the semantics of
> smp_cond_load_relaxed() when relying on it to increment a variable in
> the condition check as a result of some unrelated wake-up event. This
> API is meant to wait for a condition on a single variable. It cannot
> wait on multiple variables and especially not one it updates itself
> (even if it happens to work on arm64 under certain conditions).

Yeah that makes sense. smp_cond_load_relaxed() uses two separate
side-effects to make sure things work: the event-stream and the
increment in the conditional.

I do want to thresh out smp_cond_load_timeout() a bit more but let
me reply to your other mail for that.

> My strong preference would be to revive the smp_cond_load_timeout()
> proposal from Ankur earlier in the year.

Ack that.

--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 12:04   ` Catalin Marinas
  2024-10-15 16:42     ` Christoph Lameter (Ampere)
@ 2024-10-15 21:32     ` Ankur Arora
  2024-10-16  6:20       ` maobibo
  2024-10-16 10:06       ` Catalin Marinas
  2024-10-16 15:13     ` Okanovic, Haris
  2 siblings, 2 replies; 74+ messages in thread
From: Ankur Arora @ 2024-10-15 21:32 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel, will,
	tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
>> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> index 9b6d90a72601..fc1204426158 100644
>> --- a/drivers/cpuidle/poll_state.c
>> +++ b/drivers/cpuidle/poll_state.c
>> @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
>>
>>  	raw_local_irq_enable();
>>  	if (!current_set_polling_and_test()) {
>> -		unsigned int loop_count = 0;
>>  		u64 limit;
>>
>>  		limit = cpuidle_poll_time(drv, dev);
>>
>>  		while (!need_resched()) {
>> -			cpu_relax();
>> -			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> -				continue;
>> -
>> -			loop_count = 0;
>> +			unsigned int loop_count = 0;
>>  			if (local_clock_noinstr() - time_start > limit) {
>>  				dev->poll_time_limit = true;
>>  				break;
>>  			}
>> +
>> +			smp_cond_load_relaxed(&current_thread_info()->flags,
>> +					      VAL & _TIF_NEED_RESCHED ||
>> +					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
>
> The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> never set. With the event stream enabled on arm64, the WFE will
> eventually be woken up, loop_count incremented and the condition would
> become true.

That makes sense.

> However, the smp_cond_load_relaxed() semantics require that
> a different agent updates the variable being waited on, not the waiting
> CPU updating it itself.

Right. And, that seems to work well with the semantics of WFE. And,
the event stream (if enabled) has a side effect that allows the exit
from the loop.

> Also note that the event stream can be disabled
> on arm64 on the kernel command line.

Yes, that's a good point. In patch-11 I tried to address that aspect
by only allowing haltpoll to be force loaded.

But, I guess your point is that its not just haltpoll that has a problem,
but also regular polling -- and maybe the right thing to do would be to
disable polling if the event stream is disabled.

> Does the code above break any other architecture?

Me (and others) have so far tested x86, ARM64 (with/without the
event stream), and I believe riscv. I haven't seen any obvious
breakage. But, that's probably because most of the time somebody would
be set TIF_NEED_RESCHED.

> I'd say if you want
> something like this, better introduce a new smp_cond_load_timeout()
> API. The above looks like a hack that may only work on arm64 when the
> event stream is enabled.

I had a preliminary version of smp_cond_load_relaxed_timeout() here:
 https://lore.kernel.org/lkml/87edae3a1x.fsf@oracle.com/

Even with an smp_cond_load_timeout(), we would need to fallback to
something like the above for uarchs without WFxT.

> A generic option is udelay() (on arm64 it would use WFE/WFET by
> default). Not sure how important it is for poll_idle() but the downside
> of udelay() that it won't be able to also poll need_resched() while
> waiting for the timeout. If this matters, you could instead make smaller
> udelay() calls. Yet another problem, I don't know how energy efficient
> udelay() is on x86 vs cpu_relax().
>
> So maybe an smp_cond_load_timeout() would be better, implemented with
> cpu_relax() generically and the arm64 would use LDXR, WFE and rely on
> the event stream (or fall back to cpu_relax() if the event stream is
> disabled).

Yeah, something like that might work.

--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 21:32     ` Ankur Arora
@ 2024-10-16  6:20       ` maobibo
  2024-10-16 10:06       ` Catalin Marinas
  1 sibling, 0 replies; 74+ messages in thread
From: maobibo @ 2024-10-16  6:20 UTC (permalink / raw)
  To: Ankur Arora, Catalin Marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, joao.m.martins,
	boris.ostrovsky, konrad.wilk



On 2024/10/16 上午5:32, Ankur Arora wrote:
> 
> Catalin Marinas <catalin.marinas@arm.com> writes:
> 
>> On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
>>> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>>> index 9b6d90a72601..fc1204426158 100644
>>> --- a/drivers/cpuidle/poll_state.c
>>> +++ b/drivers/cpuidle/poll_state.c
>>> @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
>>>
>>>   	raw_local_irq_enable();
>>>   	if (!current_set_polling_and_test()) {
>>> -		unsigned int loop_count = 0;
>>>   		u64 limit;
>>>
>>>   		limit = cpuidle_poll_time(drv, dev);
>>>
>>>   		while (!need_resched()) {
>>> -			cpu_relax();
>>> -			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>>> -				continue;
>>> -
>>> -			loop_count = 0;
>>> +			unsigned int loop_count = 0;
>>>   			if (local_clock_noinstr() - time_start > limit) {
>>>   				dev->poll_time_limit = true;
>>>   				break;
>>>   			}
>>> +
>>> +			smp_cond_load_relaxed(&current_thread_info()->flags,
>>> +					      VAL & _TIF_NEED_RESCHED ||
>>> +					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
>>
>> The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
>> never set. With the event stream enabled on arm64, the WFE will
>> eventually be woken up, loop_count incremented and the condition would
>> become true.
> 
> That makes sense.
> 
>> However, the smp_cond_load_relaxed() semantics require that
>> a different agent updates the variable being waited on, not the waiting
>> CPU updating it itself.
> 
> Right. And, that seems to work well with the semantics of WFE. And,
> the event stream (if enabled) has a side effect that allows the exit
> from the loop.
> 
>> Also note that the event stream can be disabled
>> on arm64 on the kernel command line.
> 
> Yes, that's a good point. In patch-11 I tried to address that aspect
> by only allowing haltpoll to be force loaded.
> 
> But, I guess your point is that its not just haltpoll that has a problem,
> but also regular polling -- and maybe the right thing to do would be to
> disable polling if the event stream is disabled.
> 
>> Does the code above break any other architecture?
> 
> Me (and others) have so far tested x86, ARM64 (with/without the
> event stream), and I believe riscv. I haven't seen any obvious
> breakage. But, that's probably because most of the time somebody would
> be set TIF_NEED_RESCHED.
> 
>> I'd say if you want
>> something like this, better introduce a new smp_cond_load_timeout()
>> API. The above looks like a hack that may only work on arm64 when the
>> event stream is enabled.
> 
> I had a preliminary version of smp_cond_load_relaxed_timeout() here:
>   https://lore.kernel.org/lkml/87edae3a1x.fsf@oracle.com/
> 
> Even with an smp_cond_load_timeout(), we would need to fallback to
> something like the above for uarchs without WFxT.
> 
>> A generic option is udelay() (on arm64 it would use WFE/WFET by
>> default). Not sure how important it is for poll_idle() but the downside
>> of udelay() that it won't be able to also poll need_resched() while
>> waiting for the timeout. If this matters, you could instead make smaller
>> udelay() calls. Yet another problem, I don't know how energy efficient
>> udelay() is on x86 vs cpu_relax().
>>
>> So maybe an smp_cond_load_timeout() would be better, implemented with
>> cpu_relax() generically and the arm64 would use LDXR, WFE and rely on
>> the event stream (or fall back to cpu_relax() if the event stream is
>> disabled).
> 
> Yeah, something like that might work.
Yeah, I like idea about smp_cond_load_timeout method. If generic 
smp_cond_load_timeout method does not meet the requirement, the 
architecture can define itself one. And this keeps less modification 
about original code logic.

Regards
Bibo Mao
> 
> --
> ankur
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 21:32     ` Ankur Arora
  2024-10-16  6:20       ` maobibo
@ 2024-10-16 10:06       ` Catalin Marinas
  1 sibling, 0 replies; 74+ messages in thread
From: Catalin Marinas @ 2024-10-16 10:06 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Oct 15, 2024 at 02:32:00PM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
> > I'd say if you want
> > something like this, better introduce a new smp_cond_load_timeout()
> > API. The above looks like a hack that may only work on arm64 when the
> > event stream is enabled.
> 
> I had a preliminary version of smp_cond_load_relaxed_timeout() here:
>  https://lore.kernel.org/lkml/87edae3a1x.fsf@oracle.com/
> 
> Even with an smp_cond_load_timeout(), we would need to fallback to
> something like the above for uarchs without WFxT.

Yes, the default/generic implementation would use cpu_relax(). For the
arm64 one, some combination of __cmpwait() and __delay() with the
fallback to cpu_relax().

-- 
Catalin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-15 12:04   ` Catalin Marinas
  2024-10-15 16:42     ` Christoph Lameter (Ampere)
  2024-10-15 21:32     ` Ankur Arora
@ 2024-10-16 15:13     ` Okanovic, Haris
  2024-10-16 17:04       ` Ankur Arora
  2024-10-17 14:01       ` Catalin Marinas
  2 siblings, 2 replies; 74+ messages in thread
From: Okanovic, Haris @ 2024-10-16 15:13 UTC (permalink / raw)
  To: catalin.marinas@arm.com, ankur.a.arora@oracle.com
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, dave.hansen@linux.intel.com,
	konrad.wilk@oracle.com, wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, linux-arm-kernel@lists.infradead.org,
	Okanovic, Haris, linux-pm@vger.kernel.org, bp@alien8.de,
	mtosatti@redhat.com, x86@kernel.org, mark.rutland@arm.com

On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
> > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> > index 9b6d90a72601..fc1204426158 100644
> > --- a/drivers/cpuidle/poll_state.c
> > +++ b/drivers/cpuidle/poll_state.c
> > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
> > 
> >       raw_local_irq_enable();
> >       if (!current_set_polling_and_test()) {
> > -             unsigned int loop_count = 0;
> >               u64 limit;
> > 
> >               limit = cpuidle_poll_time(drv, dev);
> > 
> >               while (!need_resched()) {
> > -                     cpu_relax();
> > -                     if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> > -                             continue;
> > -
> > -                     loop_count = 0;
> > +                     unsigned int loop_count = 0;
> >                       if (local_clock_noinstr() - time_start > limit) {
> >                               dev->poll_time_limit = true;
> >                               break;
> >                       }
> > +
> > +                     smp_cond_load_relaxed(&current_thread_info()->flags,
> > +                                           VAL & _TIF_NEED_RESCHED ||
> > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
> 
> The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> never set. With the event stream enabled on arm64, the WFE will
> eventually be woken up, loop_count incremented and the condition would
> become true. However, the smp_cond_load_relaxed() semantics require that
> a different agent updates the variable being waited on, not the waiting
> CPU updating it itself. Also note that the event stream can be disabled
> on arm64 on the kernel command line.

Alternately could we condition arch_haltpoll_want() on
arch_timer_evtstrm_available(), like v7?

> 
> Does the code above break any other architecture? I'd say if you want
> something like this, better introduce a new smp_cond_load_timeout()
> API. The above looks like a hack that may only work on arm64 when the
> event stream is enabled.
> 
> A generic option is udelay() (on arm64 it would use WFE/WFET by
> default). Not sure how important it is for poll_idle() but the downside
> of udelay() that it won't be able to also poll need_resched() while
> waiting for the timeout. If this matters, you could instead make smaller
> udelay() calls. Yet another problem, I don't know how energy efficient
> udelay() is on x86 vs cpu_relax().
> 
> So maybe an smp_cond_load_timeout() would be better, implemented with
> cpu_relax() generically and the arm64 would use LDXR, WFE and rely on
> the event stream (or fall back to cpu_relax() if the event stream is
> disabled).
> 
> --
> Catalin


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-16 15:13     ` Okanovic, Haris
@ 2024-10-16 17:04       ` Ankur Arora
  2024-10-16 18:04         ` Okanovic, Haris
  2024-10-17 14:01       ` Catalin Marinas
  1 sibling, 1 reply; 74+ messages in thread
From: Ankur Arora @ 2024-10-16 17:04 UTC (permalink / raw)
  To: Okanovic, Haris
  Cc: catalin.marinas@arm.com, ankur.a.arora@oracle.com,
	kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, dave.hansen@linux.intel.com,
	konrad.wilk@oracle.com, wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, linux-arm-kernel@lists.infradead.org,
	linux-pm@vger.kernel.org, bp@alien8.de, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com


Okanovic, Haris <harisokn@amazon.com> writes:

> On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote:
>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>>
>>
>>
>> On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
>> > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> > index 9b6d90a72601..fc1204426158 100644
>> > --- a/drivers/cpuidle/poll_state.c
>> > +++ b/drivers/cpuidle/poll_state.c
>> > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> >
>> >       raw_local_irq_enable();
>> >       if (!current_set_polling_and_test()) {
>> > -             unsigned int loop_count = 0;
>> >               u64 limit;
>> >
>> >               limit = cpuidle_poll_time(drv, dev);
>> >
>> >               while (!need_resched()) {
>> > -                     cpu_relax();
>> > -                     if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> > -                             continue;
>> > -
>> > -                     loop_count = 0;
>> > +                     unsigned int loop_count = 0;
>> >                       if (local_clock_noinstr() - time_start > limit) {
>> >                               dev->poll_time_limit = true;
>> >                               break;
>> >                       }
>> > +
>> > +                     smp_cond_load_relaxed(&current_thread_info()->flags,
>> > +                                           VAL & _TIF_NEED_RESCHED ||
>> > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
>>
>> The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
>> never set. With the event stream enabled on arm64, the WFE will
>> eventually be woken up, loop_count incremented and the condition would
>> become true. However, the smp_cond_load_relaxed() semantics require that
>> a different agent updates the variable being waited on, not the waiting
>> CPU updating it itself. Also note that the event stream can be disabled
>> on arm64 on the kernel command line.
>
> Alternately could we condition arch_haltpoll_want() on
> arch_timer_evtstrm_available(), like v7?

Yes, I'm thinking of staging it somewhat like that. First an
smp_cond_load_relaxed() which gets rid of this issue, followed by
one based on smp_cond_load_relaxed_timeout().

That said, conditioning just arch_haltpoll_want() won't suffice since
what Catalin pointed out affects all users of poll_idle(), not just
haltpoll.

Right now there's only haltpoll but there are future users like
zhenglifeng with a patch for acpi-idle here:

  https://lore.kernel.org/all/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/

>> Does the code above break any other architecture? I'd say if you want
>> something like this, better introduce a new smp_cond_load_timeout()
>> API. The above looks like a hack that may only work on arm64 when the
>> event stream is enabled.
>>
>> A generic option is udelay() (on arm64 it would use WFE/WFET by
>> default). Not sure how important it is for poll_idle() but the downside
>> of udelay() that it won't be able to also poll need_resched() while
>> waiting for the timeout. If this matters, you could instead make smaller
>> udelay() calls. Yet another problem, I don't know how energy efficient
>> udelay() is on x86 vs cpu_relax().
>>
>> So maybe an smp_cond_load_timeout() would be better, implemented with
>> cpu_relax() generically and the arm64 would use LDXR, WFE and rely on
>> the event stream (or fall back to cpu_relax() if the event stream is
>> disabled).
>>
>> --
>> Catalin


--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-16 17:04       ` Ankur Arora
@ 2024-10-16 18:04         ` Okanovic, Haris
  0 siblings, 0 replies; 74+ messages in thread
From: Okanovic, Haris @ 2024-10-16 18:04 UTC (permalink / raw)
  To: ankur.a.arora@oracle.com
  Cc: joao.m.martins@oracle.com, kvm@vger.kernel.org,
	mtosatti@redhat.com, boris.ostrovsky@oracle.com,
	mark.rutland@arm.com, dave.hansen@linux.intel.com,
	konrad.wilk@oracle.com, cl@gentwo.org, wanpengli@tencent.com,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	catalin.marinas@arm.com, pbonzini@redhat.com, tglx@linutronix.de,
	maobibo@loongson.cn, daniel.lezcano@linaro.org,
	misono.tomohiro@fujitsu.com, arnd@arndb.de, will@kernel.org,
	lenb@kernel.org, hpa@zytor.com, peterz@infradead.org,
	vkuznets@redhat.com, sudeep.holla@arm.com, Okanovic, Haris,
	rafael@kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-pm@vger.kernel.org, x86@kernel.org, bp@alien8.de

On Wed, 2024-10-16 at 10:04 -0700, Ankur Arora wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> Okanovic, Haris <harisokn@amazon.com> writes:
> 
> > On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote:
> > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> > > 
> > > 
> > > 
> > > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
> > > > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> > > > index 9b6d90a72601..fc1204426158 100644
> > > > --- a/drivers/cpuidle/poll_state.c
> > > > +++ b/drivers/cpuidle/poll_state.c
> > > > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
> > > > 
> > > >       raw_local_irq_enable();
> > > >       if (!current_set_polling_and_test()) {
> > > > -             unsigned int loop_count = 0;
> > > >               u64 limit;
> > > > 
> > > >               limit = cpuidle_poll_time(drv, dev);
> > > > 
> > > >               while (!need_resched()) {
> > > > -                     cpu_relax();
> > > > -                     if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> > > > -                             continue;
> > > > -
> > > > -                     loop_count = 0;
> > > > +                     unsigned int loop_count = 0;
> > > >                       if (local_clock_noinstr() - time_start > limit) {
> > > >                               dev->poll_time_limit = true;
> > > >                               break;
> > > >                       }
> > > > +
> > > > +                     smp_cond_load_relaxed(&current_thread_info()->flags,
> > > > +                                           VAL & _TIF_NEED_RESCHED ||
> > > > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
> > > 
> > > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> > > never set. With the event stream enabled on arm64, the WFE will
> > > eventually be woken up, loop_count incremented and the condition would
> > > become true. However, the smp_cond_load_relaxed() semantics require that
> > > a different agent updates the variable being waited on, not the waiting
> > > CPU updating it itself. Also note that the event stream can be disabled
> > > on arm64 on the kernel command line.
> > 
> > Alternately could we condition arch_haltpoll_want() on
> > arch_timer_evtstrm_available(), like v7?
> 
> Yes, I'm thinking of staging it somewhat like that. First an
> smp_cond_load_relaxed() which gets rid of this issue, followed by
> one based on smp_cond_load_relaxed_timeout().
> 
> That said, conditioning just arch_haltpoll_want() won't suffice since
> what Catalin pointed out affects all users of poll_idle(), not just
> haltpoll.

The only other users I see today are apm_init() and
acpi_processor_setup_cstates(), both in x86 path. Perhaps not ideal,
but should be sufficient.

> 
> Right now there's only haltpoll but there are future users like
> zhenglifeng with a patch for acpi-idle here:
> 
>   https://lore.kernel.org/all/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/
> 
> > > Does the code above break any other architecture? I'd say if you want
> > > something like this, better introduce a new smp_cond_load_timeout()
> > > API. The above looks like a hack that may only work on arm64 when the
> > > event stream is enabled.
> > > 
> > > A generic option is udelay() (on arm64 it would use WFE/WFET by
> > > default). Not sure how important it is for poll_idle() but the downside
> > > of udelay() that it won't be able to also poll need_resched() while
> > > waiting for the timeout. If this matters, you could instead make smaller
> > > udelay() calls. Yet another problem, I don't know how energy efficient
> > > udelay() is on x86 vs cpu_relax().
> > > 
> > > So maybe an smp_cond_load_timeout() would be better, implemented with
> > > cpu_relax() generically and the arm64 would use LDXR, WFE and rely on
> > > the event stream (or fall back to cpu_relax() if the event stream is
> > > disabled).
> > > 
> > > --
> > > Catalin
> 
> 
> --
> ankur


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-16 15:13     ` Okanovic, Haris
  2024-10-16 17:04       ` Ankur Arora
@ 2024-10-17 14:01       ` Catalin Marinas
  2024-10-17 22:47         ` Ankur Arora
  1 sibling, 1 reply; 74+ messages in thread
From: Catalin Marinas @ 2024-10-17 14:01 UTC (permalink / raw)
  To: Okanovic, Haris
  Cc: ankur.a.arora@oracle.com, kvm@vger.kernel.org, rafael@kernel.org,
	sudeep.holla@arm.com, joao.m.martins@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, linux-arm-kernel@lists.infradead.org,
	linux-pm@vger.kernel.org, bp@alien8.de, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com

On Wed, Oct 16, 2024 at 03:13:33PM +0000, Okanovic, Haris wrote:
> On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote:
> > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
> > > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> > > index 9b6d90a72601..fc1204426158 100644
> > > --- a/drivers/cpuidle/poll_state.c
> > > +++ b/drivers/cpuidle/poll_state.c
> > > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
> > > 
> > >       raw_local_irq_enable();
> > >       if (!current_set_polling_and_test()) {
> > > -             unsigned int loop_count = 0;
> > >               u64 limit;
> > > 
> > >               limit = cpuidle_poll_time(drv, dev);
> > > 
> > >               while (!need_resched()) {
> > > -                     cpu_relax();
> > > -                     if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> > > -                             continue;
> > > -
> > > -                     loop_count = 0;
> > > +                     unsigned int loop_count = 0;
> > >                       if (local_clock_noinstr() - time_start > limit) {
> > >                               dev->poll_time_limit = true;
> > >                               break;
> > >                       }
> > > +
> > > +                     smp_cond_load_relaxed(&current_thread_info()->flags,
> > > +                                           VAL & _TIF_NEED_RESCHED ||
> > > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
> > 
> > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> > never set. With the event stream enabled on arm64, the WFE will
> > eventually be woken up, loop_count incremented and the condition would
> > become true. However, the smp_cond_load_relaxed() semantics require that
> > a different agent updates the variable being waited on, not the waiting
> > CPU updating it itself. Also note that the event stream can be disabled
> > on arm64 on the kernel command line.
> 
> Alternately could we condition arch_haltpoll_want() on
> arch_timer_evtstrm_available(), like v7?

No. The problem is about the smp_cond_load_relaxed() semantics - it
can't wait on a variable that's only updated in its exit condition. We
need a new API for this, especially since we are changing generic code
here (even it was arm64 code only, I'd still object to such
smp_cond_load_*() constructs).

-- 
Catalin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-17 14:01       ` Catalin Marinas
@ 2024-10-17 22:47         ` Ankur Arora
  2024-10-18 11:05           ` Catalin Marinas
  0 siblings, 1 reply; 74+ messages in thread
From: Ankur Arora @ 2024-10-17 22:47 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Okanovic, Haris, ankur.a.arora@oracle.com, kvm@vger.kernel.org,
	rafael@kernel.org, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, dave.hansen@linux.intel.com,
	konrad.wilk@oracle.com, wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, linux-arm-kernel@lists.infradead.org,
	linux-pm@vger.kernel.org, bp@alien8.de, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Wed, Oct 16, 2024 at 03:13:33PM +0000, Okanovic, Haris wrote:
>> On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote:
>> > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
>> > > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> > > index 9b6d90a72601..fc1204426158 100644
>> > > --- a/drivers/cpuidle/poll_state.c
>> > > +++ b/drivers/cpuidle/poll_state.c
>> > > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> > >
>> > >       raw_local_irq_enable();
>> > >       if (!current_set_polling_and_test()) {
>> > > -             unsigned int loop_count = 0;
>> > >               u64 limit;
>> > >
>> > >               limit = cpuidle_poll_time(drv, dev);
>> > >
>> > >               while (!need_resched()) {
>> > > -                     cpu_relax();
>> > > -                     if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> > > -                             continue;
>> > > -
>> > > -                     loop_count = 0;
>> > > +                     unsigned int loop_count = 0;
>> > >                       if (local_clock_noinstr() - time_start > limit) {
>> > >                               dev->poll_time_limit = true;
>> > >                               break;
>> > >                       }
>> > > +
>> > > +                     smp_cond_load_relaxed(&current_thread_info()->flags,
>> > > +                                           VAL & _TIF_NEED_RESCHED ||
>> > > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
>> >
>> > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
>> > never set. With the event stream enabled on arm64, the WFE will
>> > eventually be woken up, loop_count incremented and the condition would
>> > become true. However, the smp_cond_load_relaxed() semantics require that
>> > a different agent updates the variable being waited on, not the waiting
>> > CPU updating it itself. Also note that the event stream can be disabled
>> > on arm64 on the kernel command line.
>>
>> Alternately could we condition arch_haltpoll_want() on
>> arch_timer_evtstrm_available(), like v7?
>
> No. The problem is about the smp_cond_load_relaxed() semantics - it
> can't wait on a variable that's only updated in its exit condition. We
> need a new API for this, especially since we are changing generic code
> here (even it was arm64 code only, I'd still object to such
> smp_cond_load_*() constructs).

Right. The problem is that smp_cond_load_relaxed() used in this context
depends on the event-stream side effect when the interface does not
encode those semantics anywhere.

So, a smp_cond_load_timeout() like in [1] that continues to depend on
the event-stream is better because it explicitly accounts for the side
effect from the timeout.

This would cover both the WFxT and the event-stream case.

The part I'm a little less sure about is the case where WFxT and the
event-stream are absent.

As you said earlier, for that case on arm64, we use either short
__delay() calls or spin in cpu_relax(), both of which are essentially
the same thing.

Now on x86 cpu_relax() is quite optimal. The spec explicitly recommends
it and from my measurement a loop doing "while (!cond) cpu_relax()" gets
an IPC of something like 0.1 or similar.

On my arm64 systems however the same loop gets an IPC of 2.  Now this
likely varies greatly but seems like it would run pretty hot some of
the time.

So maybe the right thing to do would be to keep smp_cond_load_timeout()
but only allow polling if WFxT or event-stream is enabled. And enhance
cpuidle_poll_state_init() to fail if the above condition is not met.

Does that make sense?

Thanks
Ankur

[1] https://lore.kernel.org/lkml/87edae3a1x.fsf@oracle.com/

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-17 22:47         ` Ankur Arora
@ 2024-10-18 11:05           ` Catalin Marinas
  2024-10-18 19:00             ` Ankur Arora
  0 siblings, 1 reply; 74+ messages in thread
From: Catalin Marinas @ 2024-10-18 11:05 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Okanovic, Haris, kvm@vger.kernel.org, rafael@kernel.org,
	sudeep.holla@arm.com, joao.m.martins@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, linux-arm-kernel@lists.infradead.org,
	linux-pm@vger.kernel.org, bp@alien8.de, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com

On Thu, Oct 17, 2024 at 03:47:31PM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
> > On Wed, Oct 16, 2024 at 03:13:33PM +0000, Okanovic, Haris wrote:
> >> On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote:
> >> > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
> >> > > +                     smp_cond_load_relaxed(&current_thread_info()->flags,
> >> > > +                                           VAL & _TIF_NEED_RESCHED ||
> >> > > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
> >> >
> >> > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
> >> > never set. With the event stream enabled on arm64, the WFE will
> >> > eventually be woken up, loop_count incremented and the condition would
> >> > become true. However, the smp_cond_load_relaxed() semantics require that
> >> > a different agent updates the variable being waited on, not the waiting
> >> > CPU updating it itself. Also note that the event stream can be disabled
> >> > on arm64 on the kernel command line.
> >>
> >> Alternately could we condition arch_haltpoll_want() on
> >> arch_timer_evtstrm_available(), like v7?
> >
> > No. The problem is about the smp_cond_load_relaxed() semantics - it
> > can't wait on a variable that's only updated in its exit condition. We
> > need a new API for this, especially since we are changing generic code
> > here (even it was arm64 code only, I'd still object to such
> > smp_cond_load_*() constructs).
> 
> Right. The problem is that smp_cond_load_relaxed() used in this context
> depends on the event-stream side effect when the interface does not
> encode those semantics anywhere.
> 
> So, a smp_cond_load_timeout() like in [1] that continues to depend on
> the event-stream is better because it explicitly accounts for the side
> effect from the timeout.
> 
> This would cover both the WFxT and the event-stream case.

Indeed.

> The part I'm a little less sure about is the case where WFxT and the
> event-stream are absent.
> 
> As you said earlier, for that case on arm64, we use either short
> __delay() calls or spin in cpu_relax(), both of which are essentially
> the same thing.

Something derived from __delay(), not exactly this function. We can't
use it directly as we also want it to wake up if an event is generated
as a result of a memory write (like the current smp_cond_load().

> Now on x86 cpu_relax() is quite optimal. The spec explicitly recommends
> it and from my measurement a loop doing "while (!cond) cpu_relax()" gets
> an IPC of something like 0.1 or similar.
> 
> On my arm64 systems however the same loop gets an IPC of 2.  Now this
> likely varies greatly but seems like it would run pretty hot some of
> the time.

For the cpu_relax() fall-back, it wouldn't be any worse than the current
poll_idle() code, though I guess in this instance we'd not enable idle
polling.

I expect the event stream to be on in all production deployments. The
reason we have a way to disable it is for testing. We've had hardware
errata in the past where the event on spin_unlock doesn't cross the
cluster boundary. We'd not notice because of the event stream.

> So maybe the right thing to do would be to keep smp_cond_load_timeout()
> but only allow polling if WFxT or event-stream is enabled. And enhance
> cpuidle_poll_state_init() to fail if the above condition is not met.

We could do this as well. Maybe hide this behind another function like
arch_has_efficient_smp_cond_load_timeout() (well, some shorter name),
checked somewhere in or on the path to cpuidle_poll_state_init(). Well,
it might be simpler to do this in haltpoll_want(), backed by an
arch_haltpoll_want() function.

I assume we want poll_idle() to wake up as soon as a task becomes
available. Otherwise we could have just used udelay() for some fraction
of cpuidle_poll_time() instead of cpu_relax().

-- 
Catalin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-18 11:05           ` Catalin Marinas
@ 2024-10-18 19:00             ` Ankur Arora
  2024-10-21 12:02               ` Catalin Marinas
  0 siblings, 1 reply; 74+ messages in thread
From: Ankur Arora @ 2024-10-18 19:00 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ankur Arora, Okanovic, Haris, kvm@vger.kernel.org,
	rafael@kernel.org, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, dave.hansen@linux.intel.com,
	konrad.wilk@oracle.com, wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, linux-arm-kernel@lists.infradead.org,
	linux-pm@vger.kernel.org, bp@alien8.de, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Thu, Oct 17, 2024 at 03:47:31PM -0700, Ankur Arora wrote:
>> Catalin Marinas <catalin.marinas@arm.com> writes:
>> > On Wed, Oct 16, 2024 at 03:13:33PM +0000, Okanovic, Haris wrote:
>> >> On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote:
>> >> > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote:
>> >> > > +                     smp_cond_load_relaxed(&current_thread_info()->flags,
>> >> > > +                                           VAL & _TIF_NEED_RESCHED ||
>> >> > > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
>> >> >
>> >> > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is
>> >> > never set. With the event stream enabled on arm64, the WFE will
>> >> > eventually be woken up, loop_count incremented and the condition would
>> >> > become true. However, the smp_cond_load_relaxed() semantics require that
>> >> > a different agent updates the variable being waited on, not the waiting
>> >> > CPU updating it itself. Also note that the event stream can be disabled
>> >> > on arm64 on the kernel command line.
>> >>
>> >> Alternately could we condition arch_haltpoll_want() on
>> >> arch_timer_evtstrm_available(), like v7?
>> >
>> > No. The problem is about the smp_cond_load_relaxed() semantics - it
>> > can't wait on a variable that's only updated in its exit condition. We
>> > need a new API for this, especially since we are changing generic code
>> > here (even it was arm64 code only, I'd still object to such
>> > smp_cond_load_*() constructs).
>>
>> Right. The problem is that smp_cond_load_relaxed() used in this context
>> depends on the event-stream side effect when the interface does not
>> encode those semantics anywhere.
>>
>> So, a smp_cond_load_timeout() like in [1] that continues to depend on
>> the event-stream is better because it explicitly accounts for the side
>> effect from the timeout.
>>
>> This would cover both the WFxT and the event-stream case.
>
> Indeed.
>
>> The part I'm a little less sure about is the case where WFxT and the
>> event-stream are absent.
>>
>> As you said earlier, for that case on arm64, we use either short
>> __delay() calls or spin in cpu_relax(), both of which are essentially
>> the same thing.

> Something derived from __delay(), not exactly this function. We can't
> use it directly as we also want it to wake up if an event is generated
> as a result of a memory write (like the current smp_cond_load().
>
>> Now on x86 cpu_relax() is quite optimal. The spec explicitly recommends
>> it and from my measurement a loop doing "while (!cond) cpu_relax()" gets
>> an IPC of something like 0.1 or similar.
>>
>> On my arm64 systems however the same loop gets an IPC of 2.  Now this
>> likely varies greatly but seems like it would run pretty hot some of
>> the time.
>
> For the cpu_relax() fall-back, it wouldn't be any worse than the current
> poll_idle() code, though I guess in this instance we'd not enable idle
> polling.
>
> I expect the event stream to be on in all production deployments. The
> reason we have a way to disable it is for testing. We've had hardware
> errata in the past where the event on spin_unlock doesn't cross the
> cluster boundary. We'd not notice because of the event stream.

Ah, interesting. Thanks, that helps.

>> So maybe the right thing to do would be to keep smp_cond_load_timeout()
>> but only allow polling if WFxT or event-stream is enabled. And enhance
>> cpuidle_poll_state_init() to fail if the above condition is not met.
>
> We could do this as well. Maybe hide this behind another function like
> arch_has_efficient_smp_cond_load_timeout() (well, some shorter name),
> checked somewhere in or on the path to cpuidle_poll_state_init(). Well,
> it might be simpler to do this in haltpoll_want(), backed by an
> arch_haltpoll_want() function.

Yeah, checking in arch_haltpoll_want() would mean that we can leave all
the cpuidle_poll_state_init() call sites unchanged.

However, I suspect that even acpi-idle on arm64 might end up using
poll_idle() (as this patch tries to do:
https://lore.kernel.org/lkml/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/).

So, let me try doing it both ways to see which one is simpler.
Given that the event-stream can be assumed to be always-on it might just
be more straight-forward to fallback to cpu_relax() in that edge case.

> I assume we want poll_idle() to wake up as soon as a task becomes
> available. Otherwise we could have just used udelay() for some fraction
> of cpuidle_poll_time() instead of cpu_relax().

Yeah, agreed.

Thanks

--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed()
  2024-10-18 19:00             ` Ankur Arora
@ 2024-10-21 12:02               ` Catalin Marinas
  0 siblings, 0 replies; 74+ messages in thread
From: Catalin Marinas @ 2024-10-21 12:02 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Okanovic, Haris, kvm@vger.kernel.org, rafael@kernel.org,
	sudeep.holla@arm.com, joao.m.martins@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, linux-arm-kernel@lists.infradead.org,
	linux-pm@vger.kernel.org, bp@alien8.de, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com

On Fri, Oct 18, 2024 at 12:00:34PM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
> > On Thu, Oct 17, 2024 at 03:47:31PM -0700, Ankur Arora wrote:
> >> So maybe the right thing to do would be to keep smp_cond_load_timeout()
> >> but only allow polling if WFxT or event-stream is enabled. And enhance
> >> cpuidle_poll_state_init() to fail if the above condition is not met.
> >
> > We could do this as well. Maybe hide this behind another function like
> > arch_has_efficient_smp_cond_load_timeout() (well, some shorter name),
> > checked somewhere in or on the path to cpuidle_poll_state_init(). Well,
> > it might be simpler to do this in haltpoll_want(), backed by an
> > arch_haltpoll_want() function.
> 
> Yeah, checking in arch_haltpoll_want() would mean that we can leave all
> the cpuidle_poll_state_init() call sites unchanged.
> 
> However, I suspect that even acpi-idle on arm64 might end up using
> poll_idle() (as this patch tries to do:
> https://lore.kernel.org/lkml/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/).
> 
> So, let me try doing it both ways to see which one is simpler.
> Given that the event-stream can be assumed to be always-on it might just
> be more straight-forward to fallback to cpu_relax() in that edge case.

I agree, let's go with the simplest. If one has some strong case for
running with the event stream disabled and idle polling becomes too
energy inefficient, we can revisit and add some run-time checks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v8 02/11] cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed() Ankur Arora
@ 2024-09-25 23:24 ` Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 03/11] Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig Ankur Arora
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-09-25 23:24 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

ARCH_HAS_CPU_RELAX is defined on architectures that provide an
primitive (via cpu_relax()) that can be used as part of a polling
mechanism -- one that would be cheaper than spinning in a tight
loop.

However, recent changes in poll_idle() mean that a higher level
primitive -- smp_cond_load_relaxed() is used for polling. This would
in-turn use cpu_relax() or an architecture specific implementation.
On ARM64 in particular this turns into a WFE which waits on a store
to a cacheline instead of a busy poll.

Accordingly condition the polling drivers on ARCH_HAS_OPTIMIZED_POLL
instead of ARCH_HAS_CPU_RELAX. While at it, make both intel-idle
and cpuidle-haltpoll explicitly depend on ARCH_HAS_CPU_RELAX.

Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/Kconfig              | 2 +-
 drivers/acpi/processor_idle.c | 4 ++--
 drivers/cpuidle/Kconfig       | 2 +-
 drivers/cpuidle/Makefile      | 2 +-
 drivers/idle/Kconfig          | 1 +
 include/linux/cpuidle.h       | 2 +-
 6 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2852fcd82cbd..555871e7e3b2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -378,7 +378,7 @@ config ARCH_MAY_HAVE_PC_FDC
 config GENERIC_CALIBRATE_DELAY
 	def_bool y
 
-config ARCH_HAS_CPU_RELAX
+config ARCH_HAS_OPTIMIZED_POLL
 	def_bool y
 
 config ARCH_HIBERNATION_POSSIBLE
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 831fa4a12159..44096406d65d 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -35,7 +35,7 @@
 #include <asm/cpu.h>
 #endif
 
-#define ACPI_IDLE_STATE_START	(IS_ENABLED(CONFIG_ARCH_HAS_CPU_RELAX) ? 1 : 0)
+#define ACPI_IDLE_STATE_START	(IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL) ? 1 : 0)
 
 static unsigned int max_cstate __read_mostly = ACPI_PROCESSOR_MAX_POWER;
 module_param(max_cstate, uint, 0400);
@@ -782,7 +782,7 @@ static int acpi_processor_setup_cstates(struct acpi_processor *pr)
 	if (max_cstate == 0)
 		max_cstate = 1;
 
-	if (IS_ENABLED(CONFIG_ARCH_HAS_CPU_RELAX)) {
+	if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
 		cpuidle_poll_state_init(drv);
 		count = 1;
 	} else {
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index cac5997dca50..75f6e176bbc8 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -73,7 +73,7 @@ endmenu
 
 config HALTPOLL_CPUIDLE
 	tristate "Halt poll cpuidle driver"
-	depends on X86 && KVM_GUEST
+	depends on X86 && KVM_GUEST && ARCH_HAS_OPTIMIZED_POLL
 	select CPU_IDLE_GOV_HALTPOLL
 	default y
 	help
diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index d103342b7cfc..f29dfd1525b0 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -7,7 +7,7 @@ obj-y += cpuidle.o driver.o governor.o sysfs.o governors/
 obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
 obj-$(CONFIG_DT_IDLE_STATES)		  += dt_idle_states.o
 obj-$(CONFIG_DT_IDLE_GENPD)		  += dt_idle_genpd.o
-obj-$(CONFIG_ARCH_HAS_CPU_RELAX)	  += poll_state.o
+obj-$(CONFIG_ARCH_HAS_OPTIMIZED_POLL)	  += poll_state.o
 obj-$(CONFIG_HALTPOLL_CPUIDLE)		  += cpuidle-haltpoll.o
 
 ##################################################################################
diff --git a/drivers/idle/Kconfig b/drivers/idle/Kconfig
index 6707d2539fc4..6f9b1d48fede 100644
--- a/drivers/idle/Kconfig
+++ b/drivers/idle/Kconfig
@@ -4,6 +4,7 @@ config INTEL_IDLE
 	depends on CPU_IDLE
 	depends on X86
 	depends on CPU_SUP_INTEL
+	depends on ARCH_HAS_OPTIMIZED_POLL
 	help
 	  Enable intel_idle, a cpuidle driver that includes knowledge of
 	  native Intel hardware idle features.  The acpi_idle driver
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 3183aeb7f5b4..7e7e58a17b07 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -275,7 +275,7 @@ static inline void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev,
 }
 #endif
 
-#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
+#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_OPTIMIZED_POLL)
 void cpuidle_poll_state_init(struct cpuidle_driver *drv);
 #else
 static inline void cpuidle_poll_state_init(struct cpuidle_driver *drv) {}
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v8 03/11] Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed() Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 02/11] cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL Ankur Arora
@ 2024-09-25 23:24 ` Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 04/11] cpuidle-haltpoll: define arch_haltpoll_want() Ankur Arora
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-09-25 23:24 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

From: Joao Martins <joao.m.martins@oracle.com>

ARCH_HAS_OPTIMIZED_POLL gates selection of polling while idle in
poll_idle(). Move the configuration option to arch/Kconfig to allow
non-x86 architectures to select it.

Note that ARCH_HAS_OPTIMIZED_POLL should probably be exclusive with
GENERIC_IDLE_POLL_SETUP (which controls the generic polling logic in
cpu_idle_poll()). However, that would remove boot options
(hlt=, nohlt=). So, leave it untouched for now.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/Kconfig     | 3 +++
 arch/x86/Kconfig | 4 +---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 405c85ab86f2..cee60ddee5cf 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -273,6 +273,9 @@ config HAVE_ARCH_TRACEHOOK
 config HAVE_DMA_CONTIGUOUS
 	bool
 
+config ARCH_HAS_OPTIMIZED_POLL
+	bool
+
 config GENERIC_SMP_IDLE_THREAD
 	bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 555871e7e3b2..272ec653a8cd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -138,6 +138,7 @@ config X86
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_WANT_GENERAL_HUGETLB
 	select ARCH_WANT_HUGE_PMD_SHARE
+	select ARCH_HAS_OPTIMIZED_POLL
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP	if X86_64
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
@@ -378,9 +379,6 @@ config ARCH_MAY_HAVE_PC_FDC
 config GENERIC_CALIBRATE_DELAY
 	def_bool y
 
-config ARCH_HAS_OPTIMIZED_POLL
-	def_bool y
-
 config ARCH_HIBERNATION_POSSIBLE
 	def_bool y
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v8 04/11] cpuidle-haltpoll: define arch_haltpoll_want()
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
                   ` (2 preceding siblings ...)
  2024-09-25 23:24 ` [PATCH v8 03/11] Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig Ankur Arora
@ 2024-09-25 23:24 ` Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 05/11] governors/haltpoll: drop kvm_para_available() check Ankur Arora
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-09-25 23:24 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

From: Joao Martins <joao.m.martins@oracle.com>

While initializing haltpoll we check if KVM supports the
realtime hint and if idle is overridden at boot.

Both of these checks are x86 specific. So, in pursuit of
making cpuidle-haltpoll architecture independent, move these
checks out of common code.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/cpuidle_haltpoll.h |  1 +
 arch/x86/kernel/kvm.c                   | 13 +++++++++++++
 drivers/cpuidle/cpuidle-haltpoll.c      | 12 +-----------
 include/linux/cpuidle_haltpoll.h        |  5 +++++
 4 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/cpuidle_haltpoll.h b/arch/x86/include/asm/cpuidle_haltpoll.h
index c8b39c6716ff..8a0a12769c2e 100644
--- a/arch/x86/include/asm/cpuidle_haltpoll.h
+++ b/arch/x86/include/asm/cpuidle_haltpoll.h
@@ -4,5 +4,6 @@
 
 void arch_haltpoll_enable(unsigned int cpu);
 void arch_haltpoll_disable(unsigned int cpu);
+bool arch_haltpoll_want(bool force);
 
 #endif
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 263f8aed4e2c..63710cb1aa63 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -1151,4 +1151,17 @@ void arch_haltpoll_disable(unsigned int cpu)
 	smp_call_function_single(cpu, kvm_enable_host_haltpoll, NULL, 1);
 }
 EXPORT_SYMBOL_GPL(arch_haltpoll_disable);
+
+bool arch_haltpoll_want(bool force)
+{
+	/* Do not load haltpoll if idle= is passed */
+	if (boot_option_idle_override != IDLE_NO_OVERRIDE)
+		return false;
+
+	if (!kvm_para_available())
+		return false;
+
+	return kvm_para_has_hint(KVM_HINTS_REALTIME) || force;
+}
+EXPORT_SYMBOL_GPL(arch_haltpoll_want);
 #endif
diff --git a/drivers/cpuidle/cpuidle-haltpoll.c b/drivers/cpuidle/cpuidle-haltpoll.c
index bcd03e893a0a..e532aa2bf608 100644
--- a/drivers/cpuidle/cpuidle-haltpoll.c
+++ b/drivers/cpuidle/cpuidle-haltpoll.c
@@ -15,7 +15,6 @@
 #include <linux/cpuidle.h>
 #include <linux/module.h>
 #include <linux/sched/idle.h>
-#include <linux/kvm_para.h>
 #include <linux/cpuidle_haltpoll.h>
 
 static bool force __read_mostly;
@@ -93,21 +92,12 @@ static void haltpoll_uninit(void)
 	haltpoll_cpuidle_devices = NULL;
 }
 
-static bool haltpoll_want(void)
-{
-	return kvm_para_has_hint(KVM_HINTS_REALTIME) || force;
-}
-
 static int __init haltpoll_init(void)
 {
 	int ret;
 	struct cpuidle_driver *drv = &haltpoll_driver;
 
-	/* Do not load haltpoll if idle= is passed */
-	if (boot_option_idle_override != IDLE_NO_OVERRIDE)
-		return -ENODEV;
-
-	if (!kvm_para_available() || !haltpoll_want())
+	if (!arch_haltpoll_want(force))
 		return -ENODEV;
 
 	cpuidle_poll_state_init(drv);
diff --git a/include/linux/cpuidle_haltpoll.h b/include/linux/cpuidle_haltpoll.h
index d50c1e0411a2..68eb7a757120 100644
--- a/include/linux/cpuidle_haltpoll.h
+++ b/include/linux/cpuidle_haltpoll.h
@@ -12,5 +12,10 @@ static inline void arch_haltpoll_enable(unsigned int cpu)
 static inline void arch_haltpoll_disable(unsigned int cpu)
 {
 }
+
+static inline bool arch_haltpoll_want(bool force)
+{
+	return false;
+}
 #endif
 #endif
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v8 05/11] governors/haltpoll: drop kvm_para_available() check
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
                   ` (3 preceding siblings ...)
  2024-09-25 23:24 ` [PATCH v8 04/11] cpuidle-haltpoll: define arch_haltpoll_want() Ankur Arora
@ 2024-09-25 23:24 ` Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 06/11] cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL Ankur Arora
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-09-25 23:24 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

From: Joao Martins <joao.m.martins@oracle.com>

The haltpoll governor is selected either by the cpuidle-haltpoll
driver, or explicitly by the user.
In particular, it is never selected by default since it has the lowest
rating of all governors (menu=20, teo=19, ladder=10/25, haltpoll=9).

So, we can safely forgo the kvm_para_available() check. This also
allows cpuidle-haltpoll to be tested on baremetal.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/cpuidle/governors/haltpoll.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/cpuidle/governors/haltpoll.c b/drivers/cpuidle/governors/haltpoll.c
index 663b7f164d20..c8752f793e61 100644
--- a/drivers/cpuidle/governors/haltpoll.c
+++ b/drivers/cpuidle/governors/haltpoll.c
@@ -18,7 +18,6 @@
 #include <linux/tick.h>
 #include <linux/sched.h>
 #include <linux/module.h>
-#include <linux/kvm_para.h>
 #include <trace/events/power.h>
 
 static unsigned int guest_halt_poll_ns __read_mostly = 200000;
@@ -148,10 +147,7 @@ static struct cpuidle_governor haltpoll_governor = {
 
 static int __init init_haltpoll(void)
 {
-	if (kvm_para_available())
-		return cpuidle_register_governor(&haltpoll_governor);
-
-	return 0;
+	return cpuidle_register_governor(&haltpoll_governor);
 }
 
 postcore_initcall(init_haltpoll);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v8 06/11] cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
                   ` (4 preceding siblings ...)
  2024-09-25 23:24 ` [PATCH v8 05/11] governors/haltpoll: drop kvm_para_available() check Ankur Arora
@ 2024-09-25 23:24 ` Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 07/11] arm64: define TIF_POLLING_NRFLAG Ankur Arora
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-09-25 23:24 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

The cpuidle-haltpoll driver and its namesake governor are selected
under KVM_GUEST on X86. KVM_GUEST in-turn selects ARCH_CPUIDLE_HALTPOLL
and defines the requisite arch_haltpoll_{enable,disable}() functions.

So remove the explicit dependence of HALTPOLL_CPUIDLE on KVM_GUEST,
and instead use ARCH_CPUIDLE_HALTPOLL as proxy for architectural
support for haltpoll.

Also change "halt poll" to "haltpoll" in one of the summary clauses,
since the second form is used everywhere else.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/Kconfig        | 1 +
 drivers/cpuidle/Kconfig | 5 ++---
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 272ec653a8cd..cd457400eaf6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -844,6 +844,7 @@ config KVM_GUEST
 
 config ARCH_CPUIDLE_HALTPOLL
 	def_bool n
+	depends on KVM_GUEST
 	prompt "Disable host haltpoll when loading haltpoll driver"
 	help
 	  If virtualized under KVM, disable host haltpoll.
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 75f6e176bbc8..c1bebadf22bc 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -35,7 +35,6 @@ config CPU_IDLE_GOV_TEO
 
 config CPU_IDLE_GOV_HALTPOLL
 	bool "Haltpoll governor (for virtualized systems)"
-	depends on KVM_GUEST
 	help
 	  This governor implements haltpoll idle state selection, to be
 	  used in conjunction with the haltpoll cpuidle driver, allowing
@@ -72,8 +71,8 @@ source "drivers/cpuidle/Kconfig.riscv"
 endmenu
 
 config HALTPOLL_CPUIDLE
-	tristate "Halt poll cpuidle driver"
-	depends on X86 && KVM_GUEST && ARCH_HAS_OPTIMIZED_POLL
+	tristate "Haltpoll cpuidle driver"
+	depends on ARCH_CPUIDLE_HALTPOLL && ARCH_HAS_OPTIMIZED_POLL
 	select CPU_IDLE_GOV_HALTPOLL
 	default y
 	help
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v8 07/11] arm64: define TIF_POLLING_NRFLAG
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
                   ` (5 preceding siblings ...)
  2024-09-25 23:24 ` [PATCH v8 06/11] cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL Ankur Arora
@ 2024-09-25 23:24 ` Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 08/11] arm64: idle: export arch_cpu_idle Ankur Arora
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-09-25 23:24 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

From: Joao Martins <joao.m.martins@oracle.com>

Commit 842514849a61 ("arm64: Remove TIF_POLLING_NRFLAG") had removed
TIF_POLLING_NRFLAG because arm64 only supported non-polled idling via
cpu_do_idle().

To add support for polling via cpuidle-haltpoll, we want to use the
standard poll_idle() interface, which sets TIF_POLLING_NRFLAG while
polling.

Reuse the same bit to define TIF_POLLING_NRFLAG.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/thread_info.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 1114c1c3300a..5326cd583b01 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -69,6 +69,7 @@ void arch_setup_new_exec(void);
 #define TIF_SYSCALL_TRACEPOINT	10	/* syscall tracepoint for ftrace */
 #define TIF_SECCOMP		11	/* syscall secure computing */
 #define TIF_SYSCALL_EMU		12	/* syscall emulation active */
+#define TIF_POLLING_NRFLAG	16	/* set while polling in poll_idle() */
 #define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_FREEZE		19
 #define TIF_RESTORE_SIGMASK	20
@@ -92,6 +93,7 @@ void arch_setup_new_exec(void);
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_SYSCALL_EMU	(1 << TIF_SYSCALL_EMU)
+#define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_32BIT		(1 << TIF_32BIT)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v8 08/11] arm64: idle: export arch_cpu_idle
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
                   ` (6 preceding siblings ...)
  2024-09-25 23:24 ` [PATCH v8 07/11] arm64: define TIF_POLLING_NRFLAG Ankur Arora
@ 2024-09-25 23:24 ` Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 09/11] arm64: select ARCH_HAS_OPTIMIZED_POLL Ankur Arora
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-09-25 23:24 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

Needed for cpuidle-haltpoll.

Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/kernel/idle.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
index 05cfb347ec26..b85ba0df9b02 100644
--- a/arch/arm64/kernel/idle.c
+++ b/arch/arm64/kernel/idle.c
@@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
 	 */
 	cpu_do_idle();
 }
+EXPORT_SYMBOL_GPL(arch_cpu_idle);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v8 09/11] arm64: select ARCH_HAS_OPTIMIZED_POLL
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
                   ` (7 preceding siblings ...)
  2024-09-25 23:24 ` [PATCH v8 08/11] arm64: idle: export arch_cpu_idle Ankur Arora
@ 2024-09-25 23:24 ` Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 10/11] cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64 Ankur Arora
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-09-25 23:24 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

poll_idle() uses smp_cond_load*() as its polling mechanism.
arm64 supports an optimized version of this via LDXR, WFE, with
LDXR loading a memory region in exclusive state and the WFE
waiting for any stores to it.

Select ARCH_HAS_OPTIMIZED_POLL so poll_idle() can be used.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 49f054dcd4de..ef9c22c3cff2 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -38,6 +38,7 @@ config ARM64
 	select ARCH_HAS_MEM_ENCRYPT
 	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
+	select ARCH_HAS_OPTIMIZED_POLL
 	select ARCH_HAS_PTE_DEVMAP
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_HW_PTE_YOUNG
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v8 10/11] cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
                   ` (8 preceding siblings ...)
  2024-09-25 23:24 ` [PATCH v8 09/11] arm64: select ARCH_HAS_OPTIMIZED_POLL Ankur Arora
@ 2024-09-25 23:24 ` Ankur Arora
  2024-09-25 23:24 ` [PATCH v8 11/11] arm64: support cpuidle-haltpoll Ankur Arora
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-09-25 23:24 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

smp_cond_load_relaxed(), in its generic polling variant, polls on
the loop condition waiting for it to change, eventually exiting
the loop if the time limit has been exceeded.

To limit the frequency of the relatively expensive time check it
is limited to once every POLL_IDLE_RELAX_COUNT iterations.

arm64, however, uses an event based mechanism where instead of
polling, we wait for a store to a region.

Limit the POLL_IDLE_RELAX_COUNT to 1 for that case.

Suggested-by: Haris Okanovic <harisokn@amazon.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/cpuidle/poll_state.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index fc1204426158..61df2395585e 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -8,7 +8,18 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/idle.h>
 
+#ifdef CONFIG_ARM64
+/*
+ * POLL_IDLE_RELAX_COUNT determines how often we check for timeout
+ * while polling for TIF_NEED_RESCHED in thread_info->flags.
+ *
+ * Set this to a low value since arm64, instead of polling, uses a
+ * event based mechanism.
+ */
+#define POLL_IDLE_RELAX_COUNT	1
+#else
 #define POLL_IDLE_RELAX_COUNT	200
+#endif
 
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v8 11/11] arm64: support cpuidle-haltpoll
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
                   ` (9 preceding siblings ...)
  2024-09-25 23:24 ` [PATCH v8 10/11] cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64 Ankur Arora
@ 2024-09-25 23:24 ` Ankur Arora
  2024-10-02 22:42   ` Okanovic, Haris
  2024-10-16 15:13   ` Okanovic, Haris
  2024-10-09  2:37 ` [PATCH v8 00/11] Enable haltpoll on arm64 zhenglifeng (A)
                   ` (3 subsequent siblings)
  14 siblings, 2 replies; 74+ messages in thread
From: Ankur Arora @ 2024-09-25 23:24 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

Add architectural support for the cpuidle-haltpoll driver by defining
arch_haltpoll_*(). Also define ARCH_CPUIDLE_HALTPOLL to allow
cpuidle-haltpoll to be selected.

Haltpoll uses poll_idle() to do the actual polling. This in turn
uses smp_cond_load*() to wait until there's a specific store to
a cacheline. 
In the edge case -- no stores to the cacheline and no interrupt --
the event-stream provides the terminating condition ensuring we
don't wait forever. But because the event-stream runs at a fixed
frequency (configured at 10kHz) haltpoll might spend more time in
the polling stage than specified by cpuidle_poll_time().

This would only happen in the last iteration, since overshooting the
poll_limit means the governor will move out of the polling stage.

Tested-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/Kconfig                        |  6 ++++++
 arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
 2 files changed, 30 insertions(+)
 create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ef9c22c3cff2..5fc99eba22b2 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2415,6 +2415,12 @@ config ARCH_HIBERNATION_HEADER
 config ARCH_SUSPEND_POSSIBLE
 	def_bool y
 
+config ARCH_CPUIDLE_HALTPOLL
+	bool "Enable selection of the cpuidle-haltpoll driver"
+	help
+	  cpuidle-haltpoll allows for adaptive polling based on
+	  current load before entering the idle state.
+
 endmenu # "Power management options"
 
 menu "CPU Power Management"
diff --git a/arch/arm64/include/asm/cpuidle_haltpoll.h b/arch/arm64/include/asm/cpuidle_haltpoll.h
new file mode 100644
index 000000000000..91f0be707629
--- /dev/null
+++ b/arch/arm64/include/asm/cpuidle_haltpoll.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _ARCH_HALTPOLL_H
+#define _ARCH_HALTPOLL_H
+
+static inline void arch_haltpoll_enable(unsigned int cpu) { }
+static inline void arch_haltpoll_disable(unsigned int cpu) { }
+
+static inline bool arch_haltpoll_want(bool force)
+{
+	/*
+	 * Enabling haltpoll requires two things:
+	 *
+	 * - Event stream support to provide a terminating condition to the
+	 *   WFE in the poll loop.
+	 *
+	 * - KVM support for arch_haltpoll_enable(), arch_haltpoll_disable().
+	 *
+	 * Given that the second is missing, only allow force loading for
+	 * haltpoll.
+	 */
+	return force;
+}
+#endif
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 11/11] arm64: support cpuidle-haltpoll
  2024-09-25 23:24 ` [PATCH v8 11/11] arm64: support cpuidle-haltpoll Ankur Arora
@ 2024-10-02 22:42   ` Okanovic, Haris
  2024-10-03  3:29     ` Ankur Arora
  2024-10-16 15:13   ` Okanovic, Haris
  1 sibling, 1 reply; 74+ messages in thread
From: Okanovic, Haris @ 2024-10-02 22:42 UTC (permalink / raw)
  To: linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
	ankur.a.arora@oracle.com
  Cc: joao.m.martins@oracle.com, boris.ostrovsky@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org, mingo@redhat.com,
	catalin.marinas@arm.com, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, maobibo@loongson.cn, vkuznets@redhat.com,
	bp@alien8.de, Okanovic, Haris, rafael@kernel.org,
	sudeep.holla@arm.com, mtosatti@redhat.com, x86@kernel.org,
	mark.rutland@arm.com

On Wed, 2024-09-25 at 16:24 -0700, Ankur Arora wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> Add architectural support for the cpuidle-haltpoll driver by defining
> arch_haltpoll_*(). Also define ARCH_CPUIDLE_HALTPOLL to allow
> cpuidle-haltpoll to be selected.
> 
> Haltpoll uses poll_idle() to do the actual polling. This in turn
> uses smp_cond_load*() to wait until there's a specific store to
> a cacheline.
> In the edge case -- no stores to the cacheline and no interrupt --
> the event-stream provides the terminating condition ensuring we
> don't wait forever. But because the event-stream runs at a fixed
> frequency (configured at 10kHz) haltpoll might spend more time in
> the polling stage than specified by cpuidle_poll_time().
> 
> This would only happen in the last iteration, since overshooting the
> poll_limit means the governor will move out of the polling stage.
> 
> Tested-by: Haris Okanovic <harisokn@amazon.com>
> Tested-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/arm64/Kconfig                        |  6 ++++++
>  arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
>  2 files changed, 30 insertions(+)
>  create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index ef9c22c3cff2..5fc99eba22b2 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2415,6 +2415,12 @@ config ARCH_HIBERNATION_HEADER
>  config ARCH_SUSPEND_POSSIBLE
>         def_bool y
> 
> +config ARCH_CPUIDLE_HALTPOLL
> +       bool "Enable selection of the cpuidle-haltpoll driver"
> +       help
> +         cpuidle-haltpoll allows for adaptive polling based on
> +         current load before entering the idle state.
> +
>  endmenu # "Power management options"
> 
>  menu "CPU Power Management"
> diff --git a/arch/arm64/include/asm/cpuidle_haltpoll.h b/arch/arm64/include/asm/cpuidle_haltpoll.h
> new file mode 100644
> index 000000000000..91f0be707629
> --- /dev/null
> +++ b/arch/arm64/include/asm/cpuidle_haltpoll.h
> @@ -0,0 +1,24 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _ARCH_HALTPOLL_H
> +#define _ARCH_HALTPOLL_H
> +
> +static inline void arch_haltpoll_enable(unsigned int cpu) { }
> +static inline void arch_haltpoll_disable(unsigned int cpu) { }
> +
> +static inline bool arch_haltpoll_want(bool force)
> +{
> +       /*
> +        * Enabling haltpoll requires two things:
> +        *
> +        * - Event stream support to provide a terminating condition to the
> +        *   WFE in the poll loop.
> +        *
> +        * - KVM support for arch_haltpoll_enable(), arch_haltpoll_disable().
> +        *
> +        * Given that the second is missing, only allow force loading for
> +        * haltpoll.
> +        */
> +       return force;
> +}
> +#endif
> --
> 2.43.5
> 

I applied your patches to master e32cde8d2bd7 and verified same
performance gains on AWS Graviton.

Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 11/11] arm64: support cpuidle-haltpoll
  2024-10-02 22:42   ` Okanovic, Haris
@ 2024-10-03  3:29     ` Ankur Arora
  0 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-10-03  3:29 UTC (permalink / raw)
  To: Okanovic, Haris
  Cc: linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
	ankur.a.arora@oracle.com, joao.m.martins@oracle.com,
	boris.ostrovsky@oracle.com, dave.hansen@linux.intel.com,
	konrad.wilk@oracle.com, wanpengli@tencent.com, cl@gentwo.org,
	mingo@redhat.com, catalin.marinas@arm.com, pbonzini@redhat.com,
	tglx@linutronix.de, misono.tomohiro@fujitsu.com,
	daniel.lezcano@linaro.org, arnd@arndb.de, lenb@kernel.org,
	will@kernel.org, hpa@zytor.com, peterz@infradead.org,
	maobibo@loongson.cn, vkuznets@redhat.com, bp@alien8.de,
	rafael@kernel.org, sudeep.holla@arm.com, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com


Okanovic, Haris <harisokn@amazon.com> writes:

> On Wed, 2024-09-25 at 16:24 -0700, Ankur Arora wrote:
>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>>
>>
>>
>> Add architectural support for the cpuidle-haltpoll driver by defining
>> arch_haltpoll_*(). Also define ARCH_CPUIDLE_HALTPOLL to allow
>> cpuidle-haltpoll to be selected.
>>
>> Haltpoll uses poll_idle() to do the actual polling. This in turn
>> uses smp_cond_load*() to wait until there's a specific store to
>> a cacheline.
>> In the edge case -- no stores to the cacheline and no interrupt --
>> the event-stream provides the terminating condition ensuring we
>> don't wait forever. But because the event-stream runs at a fixed
>> frequency (configured at 10kHz) haltpoll might spend more time in
>> the polling stage than specified by cpuidle_poll_time().
>>
>> This would only happen in the last iteration, since overshooting the
>> poll_limit means the governor will move out of the polling stage.
>>
>> Tested-by: Haris Okanovic <harisokn@amazon.com>
>> Tested-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  arch/arm64/Kconfig                        |  6 ++++++
>>  arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
>>  2 files changed, 30 insertions(+)
>>  create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index ef9c22c3cff2..5fc99eba22b2 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -2415,6 +2415,12 @@ config ARCH_HIBERNATION_HEADER
>>  config ARCH_SUSPEND_POSSIBLE
>>         def_bool y
>>
>> +config ARCH_CPUIDLE_HALTPOLL
>> +       bool "Enable selection of the cpuidle-haltpoll driver"
>> +       help
>> +         cpuidle-haltpoll allows for adaptive polling based on
>> +         current load before entering the idle state.
>> +
>>  endmenu # "Power management options"
>>
>>  menu "CPU Power Management"
>> diff --git a/arch/arm64/include/asm/cpuidle_haltpoll.h b/arch/arm64/include/asm/cpuidle_haltpoll.h
>> new file mode 100644
>> index 000000000000..91f0be707629
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/cpuidle_haltpoll.h
>> @@ -0,0 +1,24 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifndef _ARCH_HALTPOLL_H
>> +#define _ARCH_HALTPOLL_H
>> +
>> +static inline void arch_haltpoll_enable(unsigned int cpu) { }
>> +static inline void arch_haltpoll_disable(unsigned int cpu) { }
>> +
>> +static inline bool arch_haltpoll_want(bool force)
>> +{
>> +       /*
>> +        * Enabling haltpoll requires two things:
>> +        *
>> +        * - Event stream support to provide a terminating condition to the
>> +        *   WFE in the poll loop.
>> +        *
>> +        * - KVM support for arch_haltpoll_enable(), arch_haltpoll_disable().
>> +        *
>> +        * Given that the second is missing, only allow force loading for
>> +        * haltpoll.
>> +        */
>> +       return force;
>> +}
>> +#endif
>> --
>> 2.43.5
>>
>
> I applied your patches to master e32cde8d2bd7 and verified same
> performance gains on AWS Graviton.

Great.

> Reviewed-by: Haris Okanovic <harisokn@amazon.com>
> Tested-by: Haris Okanovic <harisokn@amazon.com>

Thanks!

--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 11/11] arm64: support cpuidle-haltpoll
  2024-09-25 23:24 ` [PATCH v8 11/11] arm64: support cpuidle-haltpoll Ankur Arora
  2024-10-02 22:42   ` Okanovic, Haris
@ 2024-10-16 15:13   ` Okanovic, Haris
  1 sibling, 0 replies; 74+ messages in thread
From: Okanovic, Haris @ 2024-10-16 15:13 UTC (permalink / raw)
  To: linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
	ankur.a.arora@oracle.com
  Cc: joao.m.martins@oracle.com, boris.ostrovsky@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org, mingo@redhat.com,
	catalin.marinas@arm.com, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, maobibo@loongson.cn, vkuznets@redhat.com,
	bp@alien8.de, Okanovic, Haris, rafael@kernel.org,
	sudeep.holla@arm.com, mtosatti@redhat.com, x86@kernel.org,
	mark.rutland@arm.com

On Wed, 2024-09-25 at 16:24 -0700, Ankur Arora wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> Add architectural support for the cpuidle-haltpoll driver by defining
> arch_haltpoll_*(). Also define ARCH_CPUIDLE_HALTPOLL to allow
> cpuidle-haltpoll to be selected.
> 
> Haltpoll uses poll_idle() to do the actual polling. This in turn
> uses smp_cond_load*() to wait until there's a specific store to
> a cacheline.
> In the edge case -- no stores to the cacheline and no interrupt --
> the event-stream provides the terminating condition ensuring we
> don't wait forever. But because the event-stream runs at a fixed
> frequency (configured at 10kHz) haltpoll might spend more time in
> the polling stage than specified by cpuidle_poll_time().
> 
> This would only happen in the last iteration, since overshooting the
> poll_limit means the governor will move out of the polling stage.
> 
> Tested-by: Haris Okanovic <harisokn@amazon.com>
> Tested-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/arm64/Kconfig                        |  6 ++++++
>  arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
>  2 files changed, 30 insertions(+)
>  create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index ef9c22c3cff2..5fc99eba22b2 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2415,6 +2415,12 @@ config ARCH_HIBERNATION_HEADER
>  config ARCH_SUSPEND_POSSIBLE
>         def_bool y
> 
> +config ARCH_CPUIDLE_HALTPOLL
> +       bool "Enable selection of the cpuidle-haltpoll driver"
> +       help
> +         cpuidle-haltpoll allows for adaptive polling based on
> +         current load before entering the idle state.
> +
>  endmenu # "Power management options"
> 
>  menu "CPU Power Management"
> diff --git a/arch/arm64/include/asm/cpuidle_haltpoll.h b/arch/arm64/include/asm/cpuidle_haltpoll.h
> new file mode 100644
> index 000000000000..91f0be707629
> --- /dev/null
> +++ b/arch/arm64/include/asm/cpuidle_haltpoll.h
> @@ -0,0 +1,24 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _ARCH_HALTPOLL_H
> +#define _ARCH_HALTPOLL_H
> +
> +static inline void arch_haltpoll_enable(unsigned int cpu) { }
> +static inline void arch_haltpoll_disable(unsigned int cpu) { }
> +
> +static inline bool arch_haltpoll_want(bool force)
> +{
> +       /*
> +        * Enabling haltpoll requires two things:
> +        *
> +        * - Event stream support to provide a terminating condition to the
> +        *   WFE in the poll loop.

I missed this earlier: Why did you drop arch_timer_evtstrm_available()?

> +        *
> +        * - KVM support for arch_haltpoll_enable(), arch_haltpoll_disable().
> +        *
> +        * Given that the second is missing, only allow force loading for
> +        * haltpoll.
> +        */
> +       return force;
> +}
> +#endif
> --
> 2.43.5
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 00/11] Enable haltpoll on arm64
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
                   ` (10 preceding siblings ...)
  2024-09-25 23:24 ` [PATCH v8 11/11] arm64: support cpuidle-haltpoll Ankur Arora
@ 2024-10-09  2:37 ` zhenglifeng (A)
  2024-10-15  1:53   ` Ankur Arora
  2024-10-14 22:54 ` Christoph Lameter (Ampere)
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: zhenglifeng (A) @ 2024-10-09  2:37 UTC (permalink / raw)
  To: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Jie Zhan, lihuisong



On 2024/9/26 7:24, Ankur Arora wrote:
> This patchset enables the cpuidle-haltpoll driver and its namesake
> governor on arm64. This is specifically interesting for KVM guests by
> reducing IPC latencies.
> 
> Comparing idle switching latencies on an arm64 KVM guest with 
> perf bench sched pipe:
> 
>                                      usecs/op       %stdev   
> 
>   no haltpoll (baseline)               13.48       +-  5.19%
>   with haltpoll                         6.84       +- 22.07%
> 
> 
> No change in performance for a similar test on x86:
> 
>                                      usecs/op        %stdev   
> 
>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
> 
> Both sets of tests were on otherwise idle systems with guest VCPUs
> pinned to specific PCPUs. One reason for the higher stdev on arm64
> is that trapping of the WFE instruction by the host KVM is contingent
> on the number of tasks on the runqueue.
> 
> Tomohiro Misono and Haris Okanovic also report similar latency
> improvements on Grace and Graviton systems (for v7) [1] [2].
> 
> The patch series is organized in three parts: 
> 
>  - patch 1, reorganizes the poll_idle() loop, switching to
>    smp_cond_load_relaxed() in the polling loop.
>    Relatedly patches 2, 3 mangle the config option ARCH_HAS_CPU_RELAX,
>    renaming it to ARCH_HAS_OPTIMIZED_POLL.
> 
>  - patches 4-6 reorganize the haltpoll selection and init logic
>    to allow architecture code to select it. 
> 
>  - and finally, patches 7-11 add the bits for arm64 support.
> 
> What is still missing: this series largely completes the haltpoll side
> of functionality for arm64. There are, however, a few related areas
> that still need to be threshed out:
> 
>  - WFET support: WFE on arm64 does not guarantee that poll_idle()
>    would terminate in halt_poll_ns. Using WFET would address this.
>  - KVM_NO_POLL support on arm64
>  - KVM TWED support on arm64: allow the host to limit time spent in
>    WFE.
> 
> 
> Changelog:
> 
> v8: No logic changes. Largely respin of v7, with changes
> noted below:
> 
>  - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
>    own patch.
>    (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
>    
>  - address comments simplifying arm64 support (Will Deacon)
>    (patch-11 "arm64: support cpuidle-haltpoll")
> 
> v7: No significant logic changes. Mostly a respin of v6.
> 
>  - minor cleanup in poll_idle() (Christoph Lameter)
>  - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
>    (Tomohiro Misono)
> 
> v6:
> 
>  - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
>    changes together (comment from Christoph Lameter)
>  - threshes out the commit messages a bit more (comments from Christoph
>    Lameter, Sudeep Holla)
>  - also rework selection of cpuidle-haltpoll. Now selected based
>    on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>  - moved back to arch_haltpoll_want() (comment from Joao Martins)
>    Also, arch_haltpoll_want() now takes the force parameter and is
>    now responsible for the complete selection (or not) of haltpoll.
>  - fixes the build breakage on i386
>  - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
>    Tomohiro Misono, Haris Okanovic)
> 
> 
> v5:
>  - rework the poll_idle() loop around smp_cond_load_relaxed() (review
>    comment from Tomohiro Misono.)
>  - also rework selection of cpuidle-haltpoll. Now selected based
>    on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>  - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
>    arm64 now depends on the event-stream being enabled.
>  - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
>  - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
> 
> v4 changes from v3:
>  - change 7/8 per Rafael input: drop the parens and use ret for the final check
>  - add 8/8 which renames the guard for building poll_state
> 
> v3 changes from v2:
>  - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
>  - add Ack-by from Rafael Wysocki on 2/7
> 
> v2 changes from v1:
>  - added patch 7 where we change cpu_relax with smp_cond_load_relaxed per PeterZ
>    (this improves by 50% at least the CPU cycles consumed in the tests above:
>    10,716,881,137 now vs 14,503,014,257 before)
>  - removed the ifdef from patch 1 per RafaelW
> 
> Please review.
> 
> [1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
> [2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
> 
> Ankur Arora (6):
>   cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
>   cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
>   arm64: idle: export arch_cpu_idle
>   arm64: select ARCH_HAS_OPTIMIZED_POLL
>   cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
>   arm64: support cpuidle-haltpoll
> 
> Joao Martins (4):
>   Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
>   cpuidle-haltpoll: define arch_haltpoll_want()
>   governors/haltpoll: drop kvm_para_available() check
>   arm64: define TIF_POLLING_NRFLAG
> 
> Mihai Carabas (1):
>   cpuidle/poll_state: poll via smp_cond_load_relaxed()
> 
>  arch/Kconfig                              |  3 +++
>  arch/arm64/Kconfig                        |  7 +++++++
>  arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
>  arch/arm64/include/asm/thread_info.h      |  2 ++
>  arch/arm64/kernel/idle.c                  |  1 +
>  arch/x86/Kconfig                          |  5 ++---
>  arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
>  arch/x86/kernel/kvm.c                     | 13 ++++++++++++
>  drivers/acpi/processor_idle.c             |  4 ++--
>  drivers/cpuidle/Kconfig                   |  5 ++---
>  drivers/cpuidle/Makefile                  |  2 +-
>  drivers/cpuidle/cpuidle-haltpoll.c        | 12 +-----------
>  drivers/cpuidle/governors/haltpoll.c      |  6 +-----
>  drivers/cpuidle/poll_state.c              | 22 +++++++++++++++------
>  drivers/idle/Kconfig                      |  1 +
>  include/linux/cpuidle.h                   |  2 +-
>  include/linux/cpuidle_haltpoll.h          |  5 +++++
>  17 files changed, 83 insertions(+), 32 deletions(-)
>  create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
> 

Hi Ankur,

Thanks for the patches!

We have tested these patches on our machine, with an adaptation of ACPI LPI
states rather than c-states.

Include polling state, there would be three states to get in. Comparing idle
switching latencies of different state with perf bench sched pipe:

                                     usecs/op       %stdev   

  state0(polling state)                7.36       +-  0.35%
  state1                               8.78       +-  0.46%
  state2                              77.32       +-  5.50%
  
It turns out that it works on our machine.

Tested-by: Lifeng Zheng <zhenglifeng1@huawei.com>

The adaptation of ACPI LPI states is shown below as a patch. Feel free to
include this patch as part of your series, or I can also send it out after
your series being merged.

From: Lifeng Zheng <zhenglifeng1@huawei.com>

ACPI: processor_idle: Support polling state for LPI

Initialize an optional polling state besides LPI states.

Wrap up a new enter method to correctly reflect the actual entered state
when the polling state is enabled.

Signed-off-by: Lifeng Zheng <zhenglifeng1@huawei.com>
Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com>
---
 drivers/acpi/processor_idle.c | 39 ++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 44096406d65d..d154b5d77328 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -1194,20 +1194,46 @@ static int acpi_idle_lpi_enter(struct cpuidle_device *dev,
 	return -EINVAL;
 }
 
+/* To correctly reflect the entered state if the poll state is enabled. */
+static int acpi_idle_lpi_enter_with_poll_state(struct cpuidle_device *dev,
+			       struct cpuidle_driver *drv, int index)
+{
+	int entered_state;
+
+	if (unlikely(index < 1))
+		return -EINVAL;
+
+	entered_state = acpi_idle_lpi_enter(dev, drv, index - 1);
+	if (entered_state < 0)
+		return entered_state;
+
+	return entered_state + 1;
+}
+
 static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
 {
-	int i;
+	int i, count;
 	struct acpi_lpi_state *lpi;
 	struct cpuidle_state *state;
 	struct cpuidle_driver *drv = &acpi_idle_driver;
+	typeof(state->enter) enter_method;
 
 	if (!pr->flags.has_lpi)
 		return -EOPNOTSUPP;
 
+	if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
+		cpuidle_poll_state_init(drv);
+		count = 1;
+		enter_method = acpi_idle_lpi_enter_with_poll_state;
+	} else {
+		count = 0;
+		enter_method = acpi_idle_lpi_enter;
+	}
+
 	for (i = 0; i < pr->power.count && i < CPUIDLE_STATE_MAX; i++) {
 		lpi = &pr->power.lpi_states[i];
 
-		state = &drv->states[i];
+		state = &drv->states[count];
 		snprintf(state->name, CPUIDLE_NAME_LEN, "LPI-%d", i);
 		strscpy(state->desc, lpi->desc, CPUIDLE_DESC_LEN);
 		state->exit_latency = lpi->wake_latency;
@@ -1215,11 +1241,14 @@ static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
 		state->flags |= arch_get_idle_state_flags(lpi->arch_flags);
 		if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
 			state->flags |= CPUIDLE_FLAG_RCU_IDLE;
-		state->enter = acpi_idle_lpi_enter;
-		drv->safe_state_index = i;
+		state->enter = enter_method;
+		drv->safe_state_index = count;
+		count++;
+		if (count == CPUIDLE_STATE_MAX)
+			break;
 	}
 
-	drv->state_count = i;
+	drv->state_count = count;
 
 	return 0;
 }
-- 
2.33.0




^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 00/11] Enable haltpoll on arm64
  2024-10-09  2:37 ` [PATCH v8 00/11] Enable haltpoll on arm64 zhenglifeng (A)
@ 2024-10-15  1:53   ` Ankur Arora
  0 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-10-15  1:53 UTC (permalink / raw)
  To: zhenglifeng (A)
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk, Jie Zhan, lihuisong


zhenglifeng (A) <zhenglifeng1@huawei.com> writes:

> On 2024/9/26 7:24, Ankur Arora wrote:
>> This patchset enables the cpuidle-haltpoll driver and its namesake
>> governor on arm64. This is specifically interesting for KVM guests by
>> reducing IPC latencies.
>>
>> Comparing idle switching latencies on an arm64 KVM guest with
>> perf bench sched pipe:
>>
>>                                      usecs/op       %stdev
>>
>>   no haltpoll (baseline)               13.48       +-  5.19%
>>   with haltpoll                         6.84       +- 22.07%
>>
>>
>> No change in performance for a similar test on x86:
>>
>>                                      usecs/op        %stdev
>>
>>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
>>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
>>
>> Both sets of tests were on otherwise idle systems with guest VCPUs
>> pinned to specific PCPUs. One reason for the higher stdev on arm64
>> is that trapping of the WFE instruction by the host KVM is contingent
>> on the number of tasks on the runqueue.
>>
>> Tomohiro Misono and Haris Okanovic also report similar latency
>> improvements on Grace and Graviton systems (for v7) [1] [2].
>>
>> The patch series is organized in three parts:
>>
>>  - patch 1, reorganizes the poll_idle() loop, switching to
>>    smp_cond_load_relaxed() in the polling loop.
>>    Relatedly patches 2, 3 mangle the config option ARCH_HAS_CPU_RELAX,
>>    renaming it to ARCH_HAS_OPTIMIZED_POLL.
>>
>>  - patches 4-6 reorganize the haltpoll selection and init logic
>>    to allow architecture code to select it.
>>
>>  - and finally, patches 7-11 add the bits for arm64 support.
>>
>> What is still missing: this series largely completes the haltpoll side
>> of functionality for arm64. There are, however, a few related areas
>> that still need to be threshed out:
>>
>>  - WFET support: WFE on arm64 does not guarantee that poll_idle()
>>    would terminate in halt_poll_ns. Using WFET would address this.
>>  - KVM_NO_POLL support on arm64
>>  - KVM TWED support on arm64: allow the host to limit time spent in
>>    WFE.
>>
>>
>> Changelog:
>>
>> v8: No logic changes. Largely respin of v7, with changes
>> noted below:
>>
>>  - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
>>    own patch.
>>    (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
>>
>>  - address comments simplifying arm64 support (Will Deacon)
>>    (patch-11 "arm64: support cpuidle-haltpoll")
>>
>> v7: No significant logic changes. Mostly a respin of v6.
>>
>>  - minor cleanup in poll_idle() (Christoph Lameter)
>>  - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
>>    (Tomohiro Misono)
>>
>> v6:
>>
>>  - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
>>    changes together (comment from Christoph Lameter)
>>  - threshes out the commit messages a bit more (comments from Christoph
>>    Lameter, Sudeep Holla)
>>  - also rework selection of cpuidle-haltpoll. Now selected based
>>    on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>>  - moved back to arch_haltpoll_want() (comment from Joao Martins)
>>    Also, arch_haltpoll_want() now takes the force parameter and is
>>    now responsible for the complete selection (or not) of haltpoll.
>>  - fixes the build breakage on i386
>>  - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
>>    Tomohiro Misono, Haris Okanovic)
>>
>>
>> v5:
>>  - rework the poll_idle() loop around smp_cond_load_relaxed() (review
>>    comment from Tomohiro Misono.)
>>  - also rework selection of cpuidle-haltpoll. Now selected based
>>    on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>>  - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
>>    arm64 now depends on the event-stream being enabled.
>>  - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
>>  - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
>>
>> v4 changes from v3:
>>  - change 7/8 per Rafael input: drop the parens and use ret for the final check
>>  - add 8/8 which renames the guard for building poll_state
>>
>> v3 changes from v2:
>>  - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
>>  - add Ack-by from Rafael Wysocki on 2/7
>>
>> v2 changes from v1:
>>  - added patch 7 where we change cpu_relax with smp_cond_load_relaxed per PeterZ
>>    (this improves by 50% at least the CPU cycles consumed in the tests above:
>>    10,716,881,137 now vs 14,503,014,257 before)
>>  - removed the ifdef from patch 1 per RafaelW
>>
>> Please review.
>>
>> [1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
>> [2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
>>
>> Ankur Arora (6):
>>   cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
>>   cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
>>   arm64: idle: export arch_cpu_idle
>>   arm64: select ARCH_HAS_OPTIMIZED_POLL
>>   cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
>>   arm64: support cpuidle-haltpoll
>>
>> Joao Martins (4):
>>   Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
>>   cpuidle-haltpoll: define arch_haltpoll_want()
>>   governors/haltpoll: drop kvm_para_available() check
>>   arm64: define TIF_POLLING_NRFLAG
>>
>> Mihai Carabas (1):
>>   cpuidle/poll_state: poll via smp_cond_load_relaxed()
>>
>>  arch/Kconfig                              |  3 +++
>>  arch/arm64/Kconfig                        |  7 +++++++
>>  arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
>>  arch/arm64/include/asm/thread_info.h      |  2 ++
>>  arch/arm64/kernel/idle.c                  |  1 +
>>  arch/x86/Kconfig                          |  5 ++---
>>  arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
>>  arch/x86/kernel/kvm.c                     | 13 ++++++++++++
>>  drivers/acpi/processor_idle.c             |  4 ++--
>>  drivers/cpuidle/Kconfig                   |  5 ++---
>>  drivers/cpuidle/Makefile                  |  2 +-
>>  drivers/cpuidle/cpuidle-haltpoll.c        | 12 +-----------
>>  drivers/cpuidle/governors/haltpoll.c      |  6 +-----
>>  drivers/cpuidle/poll_state.c              | 22 +++++++++++++++------
>>  drivers/idle/Kconfig                      |  1 +
>>  include/linux/cpuidle.h                   |  2 +-
>>  include/linux/cpuidle_haltpoll.h          |  5 +++++
>>  17 files changed, 83 insertions(+), 32 deletions(-)
>>  create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
>>
>
> Hi Ankur,
>
> Thanks for the patches!
>
> We have tested these patches on our machine, with an adaptation of ACPI LPI
> states rather than c-states.
>
> Include polling state, there would be three states to get in. Comparing idle
> switching latencies of different state with perf bench sched pipe:
>
>                                      usecs/op       %stdev
>
>   state0(polling state)                7.36       +-  0.35%
>   state1                               8.78       +-  0.46%
>   state2                              77.32       +-  5.50%
>
> It turns out that it works on our machine.
>
> Tested-by: Lifeng Zheng <zhenglifeng1@huawei.com>

Great. Thanks Lifeng.

> The adaptation of ACPI LPI states is shown below as a patch. Feel free to
> include this patch as part of your series, or I can also send it out after
> your series being merged.

Ah, so polling for the regular ACPI driver. From a quick look the
patch looks good but this series is mostly focused on haltpoll so I
think this patch can go in after.

Please Cc me when you send it.

Thanks
Ankur

> From: Lifeng Zheng <zhenglifeng1@huawei.com>
>
> ACPI: processor_idle: Support polling state for LPI
>
> Initialize an optional polling state besides LPI states.
>
> Wrap up a new enter method to correctly reflect the actual entered state
> when the polling state is enabled.
>
> Signed-off-by: Lifeng Zheng <zhenglifeng1@huawei.com>
> Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com>
> ---
>  drivers/acpi/processor_idle.c | 39 ++++++++++++++++++++++++++++++-----
>  1 file changed, 34 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 44096406d65d..d154b5d77328 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -1194,20 +1194,46 @@ static int acpi_idle_lpi_enter(struct cpuidle_device *dev,
>  	return -EINVAL;
>  }
>
> +/* To correctly reflect the entered state if the poll state is enabled. */
> +static int acpi_idle_lpi_enter_with_poll_state(struct cpuidle_device *dev,
> +			       struct cpuidle_driver *drv, int index)
> +{
> +	int entered_state;
> +
> +	if (unlikely(index < 1))
> +		return -EINVAL;
> +
> +	entered_state = acpi_idle_lpi_enter(dev, drv, index - 1);
> +	if (entered_state < 0)
> +		return entered_state;
> +
> +	return entered_state + 1;
> +}
> +
>  static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
>  {
> -	int i;
> +	int i, count;
>  	struct acpi_lpi_state *lpi;
>  	struct cpuidle_state *state;
>  	struct cpuidle_driver *drv = &acpi_idle_driver;
> +	typeof(state->enter) enter_method;
>
>  	if (!pr->flags.has_lpi)
>  		return -EOPNOTSUPP;
>
> +	if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
> +		cpuidle_poll_state_init(drv);
> +		count = 1;
> +		enter_method = acpi_idle_lpi_enter_with_poll_state;
> +	} else {
> +		count = 0;
> +		enter_method = acpi_idle_lpi_enter;
> +	}
> +
>  	for (i = 0; i < pr->power.count && i < CPUIDLE_STATE_MAX; i++) {
>  		lpi = &pr->power.lpi_states[i];
>
> -		state = &drv->states[i];
> +		state = &drv->states[count];
>  		snprintf(state->name, CPUIDLE_NAME_LEN, "LPI-%d", i);
>  		strscpy(state->desc, lpi->desc, CPUIDLE_DESC_LEN);
>  		state->exit_latency = lpi->wake_latency;
> @@ -1215,11 +1241,14 @@ static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
>  		state->flags |= arch_get_idle_state_flags(lpi->arch_flags);
>  		if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
>  			state->flags |= CPUIDLE_FLAG_RCU_IDLE;
> -		state->enter = acpi_idle_lpi_enter;
> -		drv->safe_state_index = i;
> +		state->enter = enter_method;
> +		drv->safe_state_index = count;
> +		count++;
> +		if (count == CPUIDLE_STATE_MAX)
> +			break;
>  	}
>
> -	drv->state_count = i;
> +	drv->state_count = count;
>
>  	return 0;
>  }


--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 00/11] Enable haltpoll on arm64
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
                   ` (11 preceding siblings ...)
  2024-10-09  2:37 ` [PATCH v8 00/11] Enable haltpoll on arm64 zhenglifeng (A)
@ 2024-10-14 22:54 ` Christoph Lameter (Ampere)
  2024-10-15 12:36 ` Marc Zyngier
  2024-11-05 18:30 ` Haris Okanovic
  14 siblings, 0 replies; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-10-14 22:54 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, catalin.marinas,
	will, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

Various versions of this patchset have been extensively tested at Ampere
with numerous benchmarks by our performance testing groups.

Please merge this. We will continue to contribute to this patchset.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 00/11] Enable haltpoll on arm64
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
                   ` (12 preceding siblings ...)
  2024-10-14 22:54 ` Christoph Lameter (Ampere)
@ 2024-10-15 12:36 ` Marc Zyngier
  2024-10-16 21:55   ` Ankur Arora
  2024-11-05 18:30 ` Haris Okanovic
  14 siblings, 1 reply; 74+ messages in thread
From: Marc Zyngier @ 2024-10-15 12:36 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, catalin.marinas,
	will, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

On Thu, 26 Sep 2024 00:24:14 +0100,
Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
> This patchset enables the cpuidle-haltpoll driver and its namesake
> governor on arm64. This is specifically interesting for KVM guests by
> reducing IPC latencies.
> 
> Comparing idle switching latencies on an arm64 KVM guest with 
> perf bench sched pipe:
> 
>                                      usecs/op       %stdev   
> 
>   no haltpoll (baseline)               13.48       +-  5.19%
>   with haltpoll                         6.84       +- 22.07%
> 
> 
> No change in performance for a similar test on x86:
> 
>                                      usecs/op        %stdev   
> 
>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
> 
> Both sets of tests were on otherwise idle systems with guest VCPUs
> pinned to specific PCPUs. One reason for the higher stdev on arm64
> is that trapping of the WFE instruction by the host KVM is contingent
> on the number of tasks on the runqueue.

Sorry to state the obvious, but if that's the variable trapping of
WFI/WFE is the cause of your trouble, why don't you simply turn it off
(see 0b5afe05377d for the details)? Given that you pin your vcpus to
physical CPUs, there is no need for any trapping.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 00/11] Enable haltpoll on arm64
  2024-10-15 12:36 ` Marc Zyngier
@ 2024-10-16 21:55   ` Ankur Arora
  2024-10-17  8:19     ` Marc Zyngier
  0 siblings, 1 reply; 74+ messages in thread
From: Ankur Arora @ 2024-10-16 21:55 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk


Marc Zyngier <maz@kernel.org> writes:

> On Thu, 26 Sep 2024 00:24:14 +0100,
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> This patchset enables the cpuidle-haltpoll driver and its namesake
>> governor on arm64. This is specifically interesting for KVM guests by
>> reducing IPC latencies.
>>
>> Comparing idle switching latencies on an arm64 KVM guest with
>> perf bench sched pipe:
>>
>>                                      usecs/op       %stdev
>>
>>   no haltpoll (baseline)               13.48       +-  5.19%
>>   with haltpoll                         6.84       +- 22.07%
>>
>>
>> No change in performance for a similar test on x86:
>>
>>                                      usecs/op        %stdev
>>
>>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
>>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
>>
>> Both sets of tests were on otherwise idle systems with guest VCPUs
>> pinned to specific PCPUs. One reason for the higher stdev on arm64
>> is that trapping of the WFE instruction by the host KVM is contingent
>> on the number of tasks on the runqueue.
>
> Sorry to state the obvious, but if that's the variable trapping of
> WFI/WFE is the cause of your trouble, why don't you simply turn it off
> (see 0b5afe05377d for the details)? Given that you pin your vcpus to
> physical CPUs, there is no need for any trapping.

Good point. Thanks. That should help reduce the guessing games around
the variance in these tests.

--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 00/11] Enable haltpoll on arm64
  2024-10-16 21:55   ` Ankur Arora
@ 2024-10-17  8:19     ` Marc Zyngier
  2024-10-17 18:35       ` Ankur Arora
  0 siblings, 1 reply; 74+ messages in thread
From: Marc Zyngier @ 2024-10-17  8:19 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, catalin.marinas,
	will, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk

On Wed, 16 Oct 2024 22:55:09 +0100,
Ankur Arora <ankur.a.arora@oracle.com> wrote:
> 
> 
> Marc Zyngier <maz@kernel.org> writes:
> 
> > On Thu, 26 Sep 2024 00:24:14 +0100,
> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >>
> >> This patchset enables the cpuidle-haltpoll driver and its namesake
> >> governor on arm64. This is specifically interesting for KVM guests by
> >> reducing IPC latencies.
> >>
> >> Comparing idle switching latencies on an arm64 KVM guest with
> >> perf bench sched pipe:
> >>
> >>                                      usecs/op       %stdev
> >>
> >>   no haltpoll (baseline)               13.48       +-  5.19%
> >>   with haltpoll                         6.84       +- 22.07%
> >>
> >>
> >> No change in performance for a similar test on x86:
> >>
> >>                                      usecs/op        %stdev
> >>
> >>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
> >>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
> >>
> >> Both sets of tests were on otherwise idle systems with guest VCPUs
> >> pinned to specific PCPUs. One reason for the higher stdev on arm64
> >> is that trapping of the WFE instruction by the host KVM is contingent
> >> on the number of tasks on the runqueue.
> >
> > Sorry to state the obvious, but if that's the variable trapping of
> > WFI/WFE is the cause of your trouble, why don't you simply turn it off
> > (see 0b5afe05377d for the details)? Given that you pin your vcpus to
> > physical CPUs, there is no need for any trapping.
> 
> Good point. Thanks. That should help reduce the guessing games around
> the variance in these tests.

I'd be interested to find out whether there is still some benefit in
this series once you disable the WFx trapping heuristics.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 00/11] Enable haltpoll on arm64
  2024-10-17  8:19     ` Marc Zyngier
@ 2024-10-17 18:35       ` Ankur Arora
  2024-10-22 22:01         ` Ankur Arora
  0 siblings, 1 reply; 74+ messages in thread
From: Ankur Arora @ 2024-10-17 18:35 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk


Marc Zyngier <maz@kernel.org> writes:

> On Wed, 16 Oct 2024 22:55:09 +0100,
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>
>> Marc Zyngier <maz@kernel.org> writes:
>>
>> > On Thu, 26 Sep 2024 00:24:14 +0100,
>> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> >>
>> >> This patchset enables the cpuidle-haltpoll driver and its namesake
>> >> governor on arm64. This is specifically interesting for KVM guests by
>> >> reducing IPC latencies.
>> >>
>> >> Comparing idle switching latencies on an arm64 KVM guest with
>> >> perf bench sched pipe:
>> >>
>> >>                                      usecs/op       %stdev
>> >>
>> >>   no haltpoll (baseline)               13.48       +-  5.19%
>> >>   with haltpoll                         6.84       +- 22.07%
>> >>
>> >>
>> >> No change in performance for a similar test on x86:
>> >>
>> >>                                      usecs/op        %stdev
>> >>
>> >>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
>> >>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
>> >>
>> >> Both sets of tests were on otherwise idle systems with guest VCPUs
>> >> pinned to specific PCPUs. One reason for the higher stdev on arm64
>> >> is that trapping of the WFE instruction by the host KVM is contingent
>> >> on the number of tasks on the runqueue.
>> >
>> > Sorry to state the obvious, but if that's the variable trapping of
>> > WFI/WFE is the cause of your trouble, why don't you simply turn it off
>> > (see 0b5afe05377d for the details)? Given that you pin your vcpus to
>> > physical CPUs, there is no need for any trapping.
>>
>> Good point. Thanks. That should help reduce the guessing games around
>> the variance in these tests.
>
> I'd be interested to find out whether there is still some benefit in
> this series once you disable the WFx trapping heuristics.

The benefit of polling in idle is more than just avoiding the cost of
trapping and re-entering. The other benefit is that remote wakeups
can now be done just by setting need-resched, instead of sending an
IPI, and incurring the cost of handling the interrupt on the receiver
side.

But let me get you some numbers with that.

--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 00/11] Enable haltpoll on arm64
  2024-10-17 18:35       ` Ankur Arora
@ 2024-10-22 22:01         ` Ankur Arora
  0 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-10-22 22:01 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Marc Zyngier, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk


Ankur Arora <ankur.a.arora@oracle.com> writes:

> Marc Zyngier <maz@kernel.org> writes:
>
>> On Wed, 16 Oct 2024 22:55:09 +0100,
>> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>>
>>>
>>> Marc Zyngier <maz@kernel.org> writes:
>>>
>>> > On Thu, 26 Sep 2024 00:24:14 +0100,
>>> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>> >>
>>> >> This patchset enables the cpuidle-haltpoll driver and its namesake
>>> >> governor on arm64. This is specifically interesting for KVM guests by
>>> >> reducing IPC latencies.
>>> >>
>>> >> Comparing idle switching latencies on an arm64 KVM guest with
>>> >> perf bench sched pipe:
>>> >>
>>> >>                                      usecs/op       %stdev
>>> >>
>>> >>   no haltpoll (baseline)               13.48       +-  5.19%
>>> >>   with haltpoll                         6.84       +- 22.07%
>>> >>
>>> >>
>>> >> No change in performance for a similar test on x86:
>>> >>
>>> >>                                      usecs/op        %stdev
>>> >>
>>> >>   haltpoll w/ cpu_relax() (baseline)     4.75      +-  1.76%
>>> >>   haltpoll w/ smp_cond_load_relaxed()    4.78      +-  2.31%
>>> >>
>>> >> Both sets of tests were on otherwise idle systems with guest VCPUs
>>> >> pinned to specific PCPUs. One reason for the higher stdev on arm64
>>> >> is that trapping of the WFE instruction by the host KVM is contingent
>>> >> on the number of tasks on the runqueue.
>>> >
>>> > Sorry to state the obvious, but if that's the variable trapping of
>>> > WFI/WFE is the cause of your trouble, why don't you simply turn it off
>>> > (see 0b5afe05377d for the details)? Given that you pin your vcpus to
>>> > physical CPUs, there is no need for any trapping.
>>>
>>> Good point. Thanks. That should help reduce the guessing games around
>>> the variance in these tests.
>>
>> I'd be interested to find out whether there is still some benefit in
>> this series once you disable the WFx trapping heuristics.
>
> The benefit of polling in idle is more than just avoiding the cost of
> trapping and re-entering. The other benefit is that remote wakeups
> can now be done just by setting need-resched, instead of sending an
> IPI, and incurring the cost of handling the interrupt on the receiver
> side.
>
> But let me get you some numbers with that.

So, I ran the sched-pipe test with processes on VCPUs 4 and 5 with
kvm-arm.wfi_trap_policy=notrap.

# perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
  perf bench sched pipe -l 1000000 -c 4

# No haltpoll (and, no TIF_POLLING_NRFLAG):

 Performance counter stats for 'CPU(s) 4,5' (5 runs):

         25,229.57 msec task-clock                       #    2.000 CPUs utilized               ( +-  7.75% )
    45,821,250,284      cycles                           #    1.816 GHz                         ( +- 10.07% )
    26,557,496,665      instructions                     #    0.58  insn per cycle              ( +-  0.21% )
                 0      sched:sched_wake_idle_without_ipi #    0.000 /sec

            12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )


# Haltpoll:

 Performance counter stats for 'CPU(s) 4,5' (5 runs):

         15,131.58 msec task-clock                       #    2.000 CPUs utilized               ( +- 10.00% )
    34,158,188,839      cycles                           #    2.257 GHz                         ( +-  6.91% )
    20,824,950,916      instructions                     #    0.61  insn per cycle              ( +-  0.09% )
         1,983,822      sched:sched_wake_idle_without_ipi #  131.105 K/sec                       ( +-  0.78% )

             7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )

We get a decent boost just because we are executing ~20% fewer
instructions. Not sure how the cpu frequency scaling works in a
VM but we also run at a higher frequency.

--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 00/11] Enable haltpoll on arm64
  2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
                   ` (13 preceding siblings ...)
  2024-10-15 12:36 ` Marc Zyngier
@ 2024-11-05 18:30 ` Haris Okanovic
  2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
                     ` (5 more replies)
  14 siblings, 6 replies; 74+ messages in thread
From: Haris Okanovic @ 2024-11-05 18:30 UTC (permalink / raw)
  To: ankur.a.arora, catalin.marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

Hi Ankur, Catalin,

How about the following series based on a refactor of arm64's delay()?
Does it address your earlier concerns?

delay() already implements wfet() and falls back to wfe() w/ evstream
or a cpu_relax loop. I refactored it to poll an address, and wrapped in
a new platform-agnostic smp_vcond_load_relaxed() macro. More details in
the following commit log.

Regards,
Haris Okanovic
AWS Graviton Software


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-05 18:30 ` Haris Okanovic
@ 2024-11-05 18:30   ` Haris Okanovic
  2024-11-05 19:36     ` Christoph Lameter (Ampere)
                       ` (2 more replies)
  2024-11-05 18:30   ` [PATCH 2/5] arm64: add __READ_ONCE_EX() Haris Okanovic
                     ` (4 subsequent siblings)
  5 siblings, 3 replies; 74+ messages in thread
From: Haris Okanovic @ 2024-11-05 18:30 UTC (permalink / raw)
  To: ankur.a.arora, catalin.marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

Relaxed poll until desired mask/value is observed at the specified
address or timeout.

This macro is a specialization of the generic smp_cond_load_relaxed(),
which takes a simple mask/value condition (vcond) instead of an
arbitrary expression. It allows architectures to better specialize the
implementation, e.g. to enable wfe() polling of the address on arm.

Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
 include/asm-generic/barrier.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..112027eabbfc 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -256,6 +256,31 @@ do {									\
 })
 #endif
 
+/**
+ * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
+ * with no ordering guarantees. Spins until `(*addr & mask) == val` or
+ * `nsecs` elapse, and returns the last observed `*addr` value.
+ *
+ * @nsecs: timeout in nanoseconds
+ * @addr: pointer to an integer
+ * @mask: a bit mask applied to read values
+ * @val: Expected value with mask
+ */
+#ifndef smp_vcond_load_relaxed
+#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({	\
+	const u64 __start = local_clock_noinstr();		\
+	u64 __nsecs = (nsecs);					\
+	typeof(addr) __addr = (addr);				\
+	typeof(*__addr) __mask = (mask);			\
+	typeof(*__addr) __val = (val);				\
+	typeof(*__addr) __cur;					\
+	smp_cond_load_relaxed(__addr, (				\
+		(VAL & __mask) == __val ||			\
+		local_clock_noinstr() - __start > __nsecs	\
+	));							\
+})
+#endif
+
 /**
  * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
  * @ptr: pointer to the variable to wait on
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
@ 2024-11-05 19:36     ` Christoph Lameter (Ampere)
  2024-11-06 17:06       ` Okanovic, Haris
  2024-11-06 11:08     ` Catalin Marinas
  2024-11-06 11:39     ` Will Deacon
  2 siblings, 1 reply; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-05 19:36 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, 5 Nov 2024, Haris Okanovic wrote:

> +/**
> + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> + * `nsecs` elapse, and returns the last observed `*addr` value.
> + *
> + * @nsecs: timeout in nanoseconds

Please use an absolute time in nsecs instead of a timeout. You do not know
what will happen to your execution thread until the local_clock_noinstr()
is run.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-05 19:36     ` Christoph Lameter (Ampere)
@ 2024-11-06 17:06       ` Okanovic, Haris
  0 siblings, 0 replies; 74+ messages in thread
From: Okanovic, Haris @ 2024-11-06 17:06 UTC (permalink / raw)
  To: cl@gentwo.org
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	boris.ostrovsky@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, joao.m.martins@oracle.com,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, maobibo@loongson.cn, vkuznets@redhat.com,
	linux-arm-kernel@lists.infradead.org, Okanovic, Haris,
	linux-pm@vger.kernel.org, bp@alien8.de, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com

On Tue, 2024-11-05 at 11:36 -0800, Christoph Lameter (Ampere) wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Tue, 5 Nov 2024, Haris Okanovic wrote:
> 
> > +/**
> > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > + *
> > + * @nsecs: timeout in nanoseconds
> 
> Please use an absolute time in nsecs instead of a timeout.

I went with relative time because it clock agnostic. I agree deadline
is nicer because it can propagate down layers of functions, but it pins
the caller to single time base.

> You do not know
> what will happen to your execution thread until the local_clock_noinstr()
> is run.


Not sure what you mean. Could you perhaps give an example where it
would break?

> 
> 

One alternative is to do timekeeping with delay() in all cases, to
decouple from sched/clock.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
  2024-11-05 19:36     ` Christoph Lameter (Ampere)
@ 2024-11-06 11:08     ` Catalin Marinas
  2024-11-06 18:13       ` Okanovic, Haris
  2024-11-06 11:39     ` Will Deacon
  2 siblings, 1 reply; 74+ messages in thread
From: Catalin Marinas @ 2024-11-06 11:08 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	will, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, mtosatti, sudeep.holla, cl, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..112027eabbfc 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -256,6 +256,31 @@ do {									\
>  })
>  #endif
>  
> +/**
> + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> + * `nsecs` elapse, and returns the last observed `*addr` value.
> + *
> + * @nsecs: timeout in nanoseconds

FWIW, I don't mind the relative timeout, it makes the API easier to use.
Yes, it may take longer in absolute time if the thread is scheduled out
before local_clock_noinstr() is read but the same can happen in the
caller anyway. It's similar to udelay(), it can take longer if the
thread is scheduled out.

> + * @addr: pointer to an integer
> + * @mask: a bit mask applied to read values
> + * @val: Expected value with mask
> + */
> +#ifndef smp_vcond_load_relaxed
> +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({	\
> +	const u64 __start = local_clock_noinstr();		\
> +	u64 __nsecs = (nsecs);					\
> +	typeof(addr) __addr = (addr);				\
> +	typeof(*__addr) __mask = (mask);			\
> +	typeof(*__addr) __val = (val);				\
> +	typeof(*__addr) __cur;					\
> +	smp_cond_load_relaxed(__addr, (				\
> +		(VAL & __mask) == __val ||			\
> +		local_clock_noinstr() - __start > __nsecs	\
> +	));							\
> +})

The generic implementation has the same problem as Ankur's current
series. smp_cond_load_relaxed() can't wait on anything other than the
variable at __addr. If it goes into a WFE, there's nothing executed to
read the timer and check for progress. Any generic implementation of
such function would have to use cpu_relax() and polling.

-- 
Catalin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-06 11:08     ` Catalin Marinas
@ 2024-11-06 18:13       ` Okanovic, Haris
  2024-11-06 19:55         ` Catalin Marinas
  0 siblings, 1 reply; 74+ messages in thread
From: Okanovic, Haris @ 2024-11-06 18:13 UTC (permalink / raw)
  To: catalin.marinas@arm.com
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, bp@alien8.de, Okanovic, Haris,
	linux-pm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	mtosatti@redhat.com, x86@kernel.org, mark.rutland@arm.com

On Wed, 2024-11-06 at 11:08 +0000, Catalin Marinas wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > index d4f581c1e21d..112027eabbfc 100644
> > --- a/include/asm-generic/barrier.h
> > +++ b/include/asm-generic/barrier.h
> > @@ -256,6 +256,31 @@ do {                                                                     \
> >  })
> >  #endif
> > 
> > +/**
> > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > + *
> > + * @nsecs: timeout in nanoseconds
> 
> FWIW, I don't mind the relative timeout, it makes the API easier to use.
> Yes, it may take longer in absolute time if the thread is scheduled out
> before local_clock_noinstr() is read but the same can happen in the
> caller anyway. It's similar to udelay(), it can take longer if the
> thread is scheduled out.
> 
> > + * @addr: pointer to an integer
> > + * @mask: a bit mask applied to read values
> > + * @val: Expected value with mask
> > + */
> > +#ifndef smp_vcond_load_relaxed
> > +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({    \
> > +     const u64 __start = local_clock_noinstr();              \
> > +     u64 __nsecs = (nsecs);                                  \
> > +     typeof(addr) __addr = (addr);                           \
> > +     typeof(*__addr) __mask = (mask);                        \
> > +     typeof(*__addr) __val = (val);                          \
> > +     typeof(*__addr) __cur;                                  \
> > +     smp_cond_load_relaxed(__addr, (                         \
> > +             (VAL & __mask) == __val ||                      \
> > +             local_clock_noinstr() - __start > __nsecs       \
> > +     ));                                                     \
> > +})
> 
> The generic implementation has the same problem as Ankur's current
> series. smp_cond_load_relaxed() can't wait on anything other than the
> variable at __addr. If it goes into a WFE, there's nothing executed to
> read the timer and check for progress. Any generic implementation of
> such function would have to use cpu_relax() and polling.

How would the caller enter wfe()? Can you give a specific scenario that
you're concerned about?

This code already reduces to a relaxed poll, something like this:

```
start = clock();
while((READ_ONCE(*addr) & mask) != val && (clock() - start) < nsecs) {
  cpu_relax();
}
```

> 
> --
> Catalin


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-06 18:13       ` Okanovic, Haris
@ 2024-11-06 19:55         ` Catalin Marinas
  2024-11-06 20:31           ` Okanovic, Haris
  0 siblings, 1 reply; 74+ messages in thread
From: Catalin Marinas @ 2024-11-06 19:55 UTC (permalink / raw)
  To: Okanovic, Haris
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, bp@alien8.de, linux-pm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com

On Wed, Nov 06, 2024 at 06:13:35PM +0000, Okanovic, Haris wrote:
> On Wed, 2024-11-06 at 11:08 +0000, Catalin Marinas wrote:
> > On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > > index d4f581c1e21d..112027eabbfc 100644
> > > --- a/include/asm-generic/barrier.h
> > > +++ b/include/asm-generic/barrier.h
> > > @@ -256,6 +256,31 @@ do {                                                                     \
> > >  })
> > >  #endif
> > > 
> > > +/**
> > > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > > + *
> > > + * @nsecs: timeout in nanoseconds
> > 
> > FWIW, I don't mind the relative timeout, it makes the API easier to use.
> > Yes, it may take longer in absolute time if the thread is scheduled out
> > before local_clock_noinstr() is read but the same can happen in the
> > caller anyway. It's similar to udelay(), it can take longer if the
> > thread is scheduled out.
> > 
> > > + * @addr: pointer to an integer
> > > + * @mask: a bit mask applied to read values
> > > + * @val: Expected value with mask
> > > + */
> > > +#ifndef smp_vcond_load_relaxed
> > > +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({    \
> > > +     const u64 __start = local_clock_noinstr();              \
> > > +     u64 __nsecs = (nsecs);                                  \
> > > +     typeof(addr) __addr = (addr);                           \
> > > +     typeof(*__addr) __mask = (mask);                        \
> > > +     typeof(*__addr) __val = (val);                          \
> > > +     typeof(*__addr) __cur;                                  \
> > > +     smp_cond_load_relaxed(__addr, (                         \
> > > +             (VAL & __mask) == __val ||                      \
> > > +             local_clock_noinstr() - __start > __nsecs       \
> > > +     ));                                                     \
> > > +})
> > 
> > The generic implementation has the same problem as Ankur's current
> > series. smp_cond_load_relaxed() can't wait on anything other than the
> > variable at __addr. If it goes into a WFE, there's nothing executed to
> > read the timer and check for progress. Any generic implementation of
> > such function would have to use cpu_relax() and polling.
> 
> How would the caller enter wfe()? Can you give a specific scenario that
> you're concerned about?

Let's take the arm64 example with the event stream disabled. Without the
subsequent patches implementing smp_vcond_load_relaxed(), just expand
the arm64 smp_cond_load_relaxed() implementation in the above macro. If
the timer check doesn't trigger an exit from the loop,
__cmpwait_relaxed() only waits on the variable to change its value,
nothing to do with the timer.

> This code already reduces to a relaxed poll, something like this:
> 
> ```
> start = clock();
> while((READ_ONCE(*addr) & mask) != val && (clock() - start) < nsecs) {
>   cpu_relax();
> }
> ```

Well, that's if you also use the generic implementation of
smp_cond_load_relaxed() but have you checked all the other architectures
that don't do something similar to the arm64 wfe (riscv comes close)?
Even if all other architectures just use a cpu_relax(), that's still
abusing the smp_cond_load_relaxed() semantics. And what if one places
another loop in their __cmpwait()? That's allowed because you are
supposed to wait on a single variable to change not on multiple states.

-- 
Catalin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-06 19:55         ` Catalin Marinas
@ 2024-11-06 20:31           ` Okanovic, Haris
  0 siblings, 0 replies; 74+ messages in thread
From: Okanovic, Haris @ 2024-11-06 20:31 UTC (permalink / raw)
  To: catalin.marinas@arm.com
  Cc: joao.m.martins@oracle.com, kvm@vger.kernel.org,
	mtosatti@redhat.com, boris.ostrovsky@oracle.com,
	mark.rutland@arm.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	cl@gentwo.org, wanpengli@tencent.com,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, vkuznets@redhat.com, sudeep.holla@arm.com,
	Okanovic, Haris, rafael@kernel.org, bp@alien8.de,
	linux-pm@vger.kernel.org, x86@kernel.org,
	linux-arm-kernel@lists.infradead.org

On Wed, 2024-11-06 at 19:55 +0000, Catalin Marinas wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Wed, Nov 06, 2024 at 06:13:35PM +0000, Okanovic, Haris wrote:
> > On Wed, 2024-11-06 at 11:08 +0000, Catalin Marinas wrote:
> > > On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > > > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > > > index d4f581c1e21d..112027eabbfc 100644
> > > > --- a/include/asm-generic/barrier.h
> > > > +++ b/include/asm-generic/barrier.h
> > > > @@ -256,6 +256,31 @@ do {                                                                     \
> > > >  })
> > > >  #endif
> > > > 
> > > > +/**
> > > > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > > > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > > > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > > > + *
> > > > + * @nsecs: timeout in nanoseconds
> > > 
> > > FWIW, I don't mind the relative timeout, it makes the API easier to use.
> > > Yes, it may take longer in absolute time if the thread is scheduled out
> > > before local_clock_noinstr() is read but the same can happen in the
> > > caller anyway. It's similar to udelay(), it can take longer if the
> > > thread is scheduled out.
> > > 
> > > > + * @addr: pointer to an integer
> > > > + * @mask: a bit mask applied to read values
> > > > + * @val: Expected value with mask
> > > > + */
> > > > +#ifndef smp_vcond_load_relaxed
> > > > +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({    \
> > > > +     const u64 __start = local_clock_noinstr();              \
> > > > +     u64 __nsecs = (nsecs);                                  \
> > > > +     typeof(addr) __addr = (addr);                           \
> > > > +     typeof(*__addr) __mask = (mask);                        \
> > > > +     typeof(*__addr) __val = (val);                          \
> > > > +     typeof(*__addr) __cur;                                  \
> > > > +     smp_cond_load_relaxed(__addr, (                         \
> > > > +             (VAL & __mask) == __val ||                      \
> > > > +             local_clock_noinstr() - __start > __nsecs       \
> > > > +     ));                                                     \
> > > > +})
> > > 
> > > The generic implementation has the same problem as Ankur's current
> > > series. smp_cond_load_relaxed() can't wait on anything other than the
> > > variable at __addr. If it goes into a WFE, there's nothing executed to
> > > read the timer and check for progress. Any generic implementation of
> > > such function would have to use cpu_relax() and polling.
> > 
> > How would the caller enter wfe()? Can you give a specific scenario that
> > you're concerned about?
> 
> Let's take the arm64 example with the event stream disabled. Without the
> subsequent patches implementing smp_vcond_load_relaxed(), just expand
> the arm64 smp_cond_load_relaxed() implementation in the above macro. If
> the timer check doesn't trigger an exit from the loop,
> __cmpwait_relaxed() only waits on the variable to change its value,
> nothing to do with the timer.
> 
> > This code already reduces to a relaxed poll, something like this:
> > 
> > ```
> > start = clock();
> > while((READ_ONCE(*addr) & mask) != val && (clock() - start) < nsecs) {
> >   cpu_relax();
> > }
> > ```
> 
> Well, that's if you also use the generic implementation of
> smp_cond_load_relaxed() but have you checked all the other architectures
> that don't do something similar to the arm64 wfe (riscv comes close)?
> Even if all other architectures just use a cpu_relax(), that's still
> abusing the smp_cond_load_relaxed() semantics. And what if one places
> another loop in their __cmpwait()? That's allowed because you are
> supposed to wait on a single variable to change not on multiple states.

I see what you mean now - I glossed over the use of __cmpwait_relaxed()
in smp_cond_load_relaxed(). I'll post another rev with the fix, similar
to the above "reduced" code.

> 
> --
> Catalin



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
  2024-11-05 19:36     ` Christoph Lameter (Ampere)
  2024-11-06 11:08     ` Catalin Marinas
@ 2024-11-06 11:39     ` Will Deacon
  2024-11-06 17:18       ` Okanovic, Haris
  2 siblings, 1 reply; 74+ messages in thread
From: Will Deacon @ 2024-11-06 11:39 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini,
	wanpengli, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, mtosatti, sudeep.holla, cl, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> Relaxed poll until desired mask/value is observed at the specified
> address or timeout.
> 
> This macro is a specialization of the generic smp_cond_load_relaxed(),
> which takes a simple mask/value condition (vcond) instead of an
> arbitrary expression. It allows architectures to better specialize the
> implementation, e.g. to enable wfe() polling of the address on arm.

This doesn't make sense to me. The existing smp_cond_load() functions
already use wfe on arm64 and I don't see why we need a special helper
just to do a mask.

> Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> ---
>  include/asm-generic/barrier.h | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..112027eabbfc 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -256,6 +256,31 @@ do {									\
>  })
>  #endif
>  
> +/**
> + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> + * `nsecs` elapse, and returns the last observed `*addr` value.
> + *
> + * @nsecs: timeout in nanoseconds
> + * @addr: pointer to an integer
> + * @mask: a bit mask applied to read values
> + * @val: Expected value with mask
> + */
> +#ifndef smp_vcond_load_relaxed

I know naming is hard, but "vcond" is especially terrible.
Perhaps smp_cond_load_timeout()?

Will

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed()
  2024-11-06 11:39     ` Will Deacon
@ 2024-11-06 17:18       ` Okanovic, Haris
  0 siblings, 0 replies; 74+ messages in thread
From: Okanovic, Haris @ 2024-11-06 17:18 UTC (permalink / raw)
  To: will@kernel.org
  Cc: kvm@vger.kernel.org, rafael@kernel.org,
	boris.ostrovsky@oracle.com, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, mtosatti@redhat.com,
	hpa@zytor.com, peterz@infradead.org, maobibo@loongson.cn,
	vkuznets@redhat.com, linux-arm-kernel@lists.infradead.org,
	Okanovic, Haris, linux-pm@vger.kernel.org, bp@alien8.de,
	x86@kernel.org, mark.rutland@arm.com

On Wed, 2024-11-06 at 11:39 +0000, Will Deacon wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > Relaxed poll until desired mask/value is observed at the specified
> > address or timeout.
> > 
> > This macro is a specialization of the generic smp_cond_load_relaxed(),
> > which takes a simple mask/value condition (vcond) instead of an
> > arbitrary expression. It allows architectures to better specialize the
> > implementation, e.g. to enable wfe() polling of the address on arm.
> 
> This doesn't make sense to me. The existing smp_cond_load() functions
> already use wfe on arm64 and I don't see why we need a special helper
> just to do a mask.

We can't turn an arbitrary C expression into a wfe()/wfet() exit
condition, which is one of the inputs to the existing smp_cond_load().
This API is therefore more amenable to hardware acceleration.

> 
> > Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> > ---
> >  include/asm-generic/barrier.h | 25 +++++++++++++++++++++++++
> >  1 file changed, 25 insertions(+)
> > 
> > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > index d4f581c1e21d..112027eabbfc 100644
> > --- a/include/asm-generic/barrier.h
> > +++ b/include/asm-generic/barrier.h
> > @@ -256,6 +256,31 @@ do {                                                                     \
> >  })
> >  #endif
> > 
> > +/**
> > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > + *
> > + * @nsecs: timeout in nanoseconds
> > + * @addr: pointer to an integer
> > + * @mask: a bit mask applied to read values
> > + * @val: Expected value with mask
> > + */
> > +#ifndef smp_vcond_load_relaxed
> 
> I know naming is hard, but "vcond" is especially terrible.
> Perhaps smp_cond_load_timeout()?

I agree, naming is hard! I was trying to differentiate it from
smp_cond_load() in some meaningful way - that one is an "expression"
condition this one is a "value" condition.

I'll think it over a bit more.

> 
> Will


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 2/5] arm64: add __READ_ONCE_EX()
  2024-11-05 18:30 ` Haris Okanovic
  2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
@ 2024-11-05 18:30   ` Haris Okanovic
  2024-11-05 19:39     ` Christoph Lameter (Ampere)
                       ` (2 more replies)
  2024-11-05 18:30   ` [PATCH 3/5] arm64: refactor delay() to enable polling for value Haris Okanovic
                     ` (3 subsequent siblings)
  5 siblings, 3 replies; 74+ messages in thread
From: Haris Okanovic @ 2024-11-05 18:30 UTC (permalink / raw)
  To: ankur.a.arora, catalin.marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

Perform an exclusive load, which atomically loads a word and arms the
exclusive monitor to enable wfet()/wfe() accelerated polling.

https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-accesses/exclusive-monitors

Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
 arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)
 create mode 100644 arch/arm64/include/asm/readex.h

diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
new file mode 100644
index 000000000000..51963c3107e1
--- /dev/null
+++ b/arch/arm64/include/asm/readex.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Based on arch/arm64/include/asm/rwonce.h
+ *
+ * Copyright (C) 2020 Google LLC.
+ * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#ifndef __ASM_READEX_H
+#define __ASM_READEX_H
+
+#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
+
+#define __READ_ONCE_EX(x)						\
+({									\
+	typeof(&(x)) __x = &(x);					\
+	int atomic = 1;							\
+	union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u;	\
+	switch (sizeof(x)) {						\
+	case 1:								\
+		asm volatile(__LOAD_EX(b, %w0, %1)			\
+			: "=r" (*(__u8 *)__u.__c)			\
+			: "Q" (*__x) : "memory");			\
+		break;							\
+	case 2:								\
+		asm volatile(__LOAD_EX(h, %w0, %1)			\
+			: "=r" (*(__u16 *)__u.__c)			\
+			: "Q" (*__x) : "memory");			\
+		break;							\
+	case 4:								\
+		asm volatile(__LOAD_EX(, %w0, %1)			\
+			: "=r" (*(__u32 *)__u.__c)			\
+			: "Q" (*__x) : "memory");			\
+		break;							\
+	case 8:								\
+		asm volatile(__LOAD_EX(, %0, %1)			\
+			: "=r" (*(__u64 *)__u.__c)			\
+			: "Q" (*__x) : "memory");			\
+		break;							\
+	default:							\
+		atomic = 0;						\
+	}								\
+	atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\
+})
+
+#endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] arm64: add __READ_ONCE_EX()
  2024-11-05 18:30   ` [PATCH 2/5] arm64: add __READ_ONCE_EX() Haris Okanovic
@ 2024-11-05 19:39     ` Christoph Lameter (Ampere)
  2024-11-06 17:37       ` Okanovic, Haris
  2024-11-06 11:43     ` Will Deacon
  2024-11-09  9:49     ` David Laight
  2 siblings, 1 reply; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-05 19:39 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, 5 Nov 2024, Haris Okanovic wrote:

> +#define __READ_ONCE_EX(x)						\

This is derived from READ_ONCE and named similarly so maybe it would
better be placed into rwonce.h?



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] arm64: add __READ_ONCE_EX()
  2024-11-05 19:39     ` Christoph Lameter (Ampere)
@ 2024-11-06 17:37       ` Okanovic, Haris
  0 siblings, 0 replies; 74+ messages in thread
From: Okanovic, Haris @ 2024-11-06 17:37 UTC (permalink / raw)
  To: cl@gentwo.org
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	boris.ostrovsky@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, joao.m.martins@oracle.com,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, maobibo@loongson.cn, vkuznets@redhat.com,
	linux-arm-kernel@lists.infradead.org, Okanovic, Haris,
	linux-pm@vger.kernel.org, bp@alien8.de, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com

On Tue, 2024-11-05 at 11:39 -0800, Christoph Lameter (Ampere) wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Tue, 5 Nov 2024, Haris Okanovic wrote:
> 
> > +#define __READ_ONCE_EX(x)                                            \
> 
> This is derived from READ_ONCE and named similarly so maybe it would
> better be placed into rwonce.h?
> 
> 

I plan to remove this macro per other feedback in this thread.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] arm64: add __READ_ONCE_EX()
  2024-11-05 18:30   ` [PATCH 2/5] arm64: add __READ_ONCE_EX() Haris Okanovic
  2024-11-05 19:39     ` Christoph Lameter (Ampere)
@ 2024-11-06 11:43     ` Will Deacon
  2024-11-06 17:09       ` Okanovic, Haris
  2024-11-09  9:49     ` David Laight
  2 siblings, 1 reply; 74+ messages in thread
From: Will Deacon @ 2024-11-06 11:43 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini,
	wanpengli, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, mtosatti, sudeep.holla, cl, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Nov 05, 2024 at 12:30:38PM -0600, Haris Okanovic wrote:
> Perform an exclusive load, which atomically loads a word and arms the
> exclusive monitor to enable wfet()/wfe() accelerated polling.
> 
> https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-accesses/exclusive-monitors
> 
> Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> ---
>  arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
>  create mode 100644 arch/arm64/include/asm/readex.h
> 
> diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
> new file mode 100644
> index 000000000000..51963c3107e1
> --- /dev/null
> +++ b/arch/arm64/include/asm/readex.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Based on arch/arm64/include/asm/rwonce.h
> + *
> + * Copyright (C) 2020 Google LLC.
> + * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
> + */
> +
> +#ifndef __ASM_READEX_H
> +#define __ASM_READEX_H
> +
> +#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
> +
> +#define __READ_ONCE_EX(x)						\
> +({									\
> +	typeof(&(x)) __x = &(x);					\
> +	int atomic = 1;							\
> +	union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u;	\
> +	switch (sizeof(x)) {						\
> +	case 1:								\
> +		asm volatile(__LOAD_EX(b, %w0, %1)			\
> +			: "=r" (*(__u8 *)__u.__c)			\
> +			: "Q" (*__x) : "memory");			\
> +		break;							\
> +	case 2:								\
> +		asm volatile(__LOAD_EX(h, %w0, %1)			\
> +			: "=r" (*(__u16 *)__u.__c)			\
> +			: "Q" (*__x) : "memory");			\
> +		break;							\
> +	case 4:								\
> +		asm volatile(__LOAD_EX(, %w0, %1)			\
> +			: "=r" (*(__u32 *)__u.__c)			\
> +			: "Q" (*__x) : "memory");			\
> +		break;							\
> +	case 8:								\
> +		asm volatile(__LOAD_EX(, %0, %1)			\
> +			: "=r" (*(__u64 *)__u.__c)			\
> +			: "Q" (*__x) : "memory");			\
> +		break;							\
> +	default:							\
> +		atomic = 0;						\
> +	}								\
> +	atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\
> +})

I think this is a bad idea. Load-exclusive needs to be used very carefully,
preferably when you're able to see exactly what instructions it's
interacting with. By making this into a macro, we're at the mercy of the
compiler and we give the wrong impression that you could e.g. build atomic
critical sections out of this macro.

So I'm fairly strongly against this interface.

Will

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/5] arm64: add __READ_ONCE_EX()
  2024-11-06 11:43     ` Will Deacon
@ 2024-11-06 17:09       ` Okanovic, Haris
  0 siblings, 0 replies; 74+ messages in thread
From: Okanovic, Haris @ 2024-11-06 17:09 UTC (permalink / raw)
  To: will@kernel.org
  Cc: kvm@vger.kernel.org, rafael@kernel.org,
	boris.ostrovsky@oracle.com, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, mtosatti@redhat.com,
	hpa@zytor.com, peterz@infradead.org, maobibo@loongson.cn,
	vkuznets@redhat.com, linux-arm-kernel@lists.infradead.org,
	Okanovic, Haris, linux-pm@vger.kernel.org, bp@alien8.de,
	x86@kernel.org, mark.rutland@arm.com

On Wed, 2024-11-06 at 11:43 +0000, Will Deacon wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Tue, Nov 05, 2024 at 12:30:38PM -0600, Haris Okanovic wrote:
> > Perform an exclusive load, which atomically loads a word and arms the
> > exclusive monitor to enable wfet()/wfe() accelerated polling.
> > 
> > https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-accesses/exclusive-monitors
> > 
> > Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> > ---
> >  arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
> >  1 file changed, 46 insertions(+)
> >  create mode 100644 arch/arm64/include/asm/readex.h
> > 
> > diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
> > new file mode 100644
> > index 000000000000..51963c3107e1
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/readex.h
> > @@ -0,0 +1,46 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Based on arch/arm64/include/asm/rwonce.h
> > + *
> > + * Copyright (C) 2020 Google LLC.
> > + * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
> > + */
> > +
> > +#ifndef __ASM_READEX_H
> > +#define __ASM_READEX_H
> > +
> > +#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
> > +
> > +#define __READ_ONCE_EX(x)                                            \
> > +({                                                                   \
> > +     typeof(&(x)) __x = &(x);                                        \
> > +     int atomic = 1;                                                 \
> > +     union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \
> > +     switch (sizeof(x)) {                                            \
> > +     case 1:                                                         \
> > +             asm volatile(__LOAD_EX(b, %w0, %1)                      \
> > +                     : "=r" (*(__u8 *)__u.__c)                       \
> > +                     : "Q" (*__x) : "memory");                       \
> > +             break;                                                  \
> > +     case 2:                                                         \
> > +             asm volatile(__LOAD_EX(h, %w0, %1)                      \
> > +                     : "=r" (*(__u16 *)__u.__c)                      \
> > +                     : "Q" (*__x) : "memory");                       \
> > +             break;                                                  \
> > +     case 4:                                                         \
> > +             asm volatile(__LOAD_EX(, %w0, %1)                       \
> > +                     : "=r" (*(__u32 *)__u.__c)                      \
> > +                     : "Q" (*__x) : "memory");                       \
> > +             break;                                                  \
> > +     case 8:                                                         \
> > +             asm volatile(__LOAD_EX(, %0, %1)                        \
> > +                     : "=r" (*(__u64 *)__u.__c)                      \
> > +                     : "Q" (*__x) : "memory");                       \
> > +             break;                                                  \
> > +     default:                                                        \
> > +             atomic = 0;                                             \
> > +     }                                                               \
> > +     atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\
> > +})
> 
> I think this is a bad idea. Load-exclusive needs to be used very carefully,
> preferably when you're able to see exactly what instructions it's
> interacting with. By making this into a macro, we're at the mercy of the
> compiler and we give the wrong impression that you could e.g. build atomic
> critical sections out of this macro.
> 
> So I'm fairly strongly against this interface.

Fair point. I'll post an alternate delay() implementation in asm. It's
a simple routine.

> 
> Will


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH 2/5] arm64: add __READ_ONCE_EX()
  2024-11-05 18:30   ` [PATCH 2/5] arm64: add __READ_ONCE_EX() Haris Okanovic
  2024-11-05 19:39     ` Christoph Lameter (Ampere)
  2024-11-06 11:43     ` Will Deacon
@ 2024-11-09  9:49     ` David Laight
  2 siblings, 0 replies; 74+ messages in thread
From: David Laight @ 2024-11-09  9:49 UTC (permalink / raw)
  To: 'Haris Okanovic', ankur.a.arora@oracle.com,
	catalin.marinas@arm.com
  Cc: linux-pm@vger.kernel.org, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, pbonzini@redhat.com,
	wanpengli@tencent.com, vkuznets@redhat.com, rafael@kernel.org,
	daniel.lezcano@linaro.org, peterz@infradead.org, arnd@arndb.de,
	lenb@kernel.org, mark.rutland@arm.com, mtosatti@redhat.com,
	sudeep.holla@arm.com, cl@gentwo.org, misono.tomohiro@fujitsu.com,
	maobibo@loongson.cn, joao.m.martins@oracle.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com

From: Haris Okanovic
> Sent: 05 November 2024 18:31
> 
> Perform an exclusive load, which atomically loads a word and arms the
> exclusive monitor to enable wfet()/wfe() accelerated polling.
> 
...
> +	atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\

That doesn't do what you want it to do.
(It is wrong in READ_ONCE() as well.)

?: is treated like an arithmetic operator and the result will get
promoted to 'int'.
Moving the first cast outside the ?: probably works:
	(typeof(*__x))(atomic ? __u.__val : (*(volatile typeof(__x))__x));

   David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 3/5] arm64: refactor delay() to enable polling for value
  2024-11-05 18:30 ` Haris Okanovic
  2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
  2024-11-05 18:30   ` [PATCH 2/5] arm64: add __READ_ONCE_EX() Haris Okanovic
@ 2024-11-05 18:30   ` Haris Okanovic
  2024-11-05 19:42     ` Christoph Lameter (Ampere)
  2024-11-06  9:18     ` Catalin Marinas
  2024-11-05 18:30   ` [PATCH 4/5] arm64: add smp_vcond_load_relaxed() Haris Okanovic
                     ` (2 subsequent siblings)
  5 siblings, 2 replies; 74+ messages in thread
From: Haris Okanovic @ 2024-11-05 18:30 UTC (permalink / raw)
  To: ankur.a.arora, catalin.marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

Refactor arm64's delay() to poll for a mask/value condition (vcond) in
it's wfet(), wfe(), and relaxed polling loops.

Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
 arch/arm64/lib/delay.c | 70 ++++++++++++++++++++++++++++++------------
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index cb2062e7e234..a7c3040af316 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -14,43 +14,73 @@
 #include <linux/timex.h>
 
 #include <clocksource/arm_arch_timer.h>
+#include <asm/readex.h>
 
-#define USECS_TO_CYCLES(time_usecs)			\
-	xloops_to_cycles((time_usecs) * 0x10C7UL)
-
-static inline unsigned long xloops_to_cycles(unsigned long xloops)
+static inline u64 xloops_to_cycles(u64 xloops)
 {
 	return (xloops * loops_per_jiffy * HZ) >> 32;
 }
 
-void __delay(unsigned long cycles)
+#define USECS_TO_XLOOPS(time_usecs) \
+	((time_usecs) * 0x10C7UL)
+
+#define USECS_TO_CYCLES(time_usecs) \
+	xloops_to_cycles(USECS_TO_XLOOPS(time_usecs))
+
+#define NSECS_TO_XLOOPS(time_nsecs) \
+	((time_nsecs) * 0x10C7UL)
+
+#define NSECS_TO_CYCLES(time_nsecs) \
+	xloops_to_cycles(NSECS_TO_XLOOPS(time_nsecs))
+
+static unsigned long __delay_until_ul(u64 cycles, unsigned long* addr, unsigned long mask, unsigned long val)
 {
-	cycles_t start = get_cycles();
+	u64 start = get_cycles();
+	unsigned long cur;
 
 	if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
 		u64 end = start + cycles;
 
-		/*
-		 * Start with WFIT. If an interrupt makes us resume
-		 * early, use a WFET loop to complete the delay.
-		 */
-		wfit(end);
-		while ((get_cycles() - start) < cycles)
+		do {
+			cur = __READ_ONCE_EX(*addr);
+			if ((cur & mask) == val) {
+				break;
+			}
 			wfet(end);
-	} else 	if (arch_timer_evtstrm_available()) {
-		const cycles_t timer_evt_period =
+		} while ((get_cycles() - start) < cycles);
+	} else if (arch_timer_evtstrm_available()) {
+		const u64 timer_evt_period =
 			USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);
 
-		while ((get_cycles() - start + timer_evt_period) < cycles)
+		do {
+			cur = __READ_ONCE_EX(*addr);
+			if ((cur & mask) == val) {
+				break;
+			}
 			wfe();
+		} while ((get_cycles() - start + timer_evt_period) < cycles);
+	} else {
+		do {
+			cur = __READ_ONCE_EX(*addr);
+			if ((cur & mask) == val) {
+				break;
+			}
+			cpu_relax();
+		} while ((get_cycles() - start) < cycles);
 	}
 
-	while ((get_cycles() - start) < cycles)
-		cpu_relax();
+	return cur;
+}
+
+void __delay(unsigned long cycles)
+{
+	/* constant word for wfet()/wfe() to poll */
+	unsigned long dummy ____cacheline_aligned = 0;
+	__delay_until_ul(cycles, &dummy, 0, 1);
 }
 EXPORT_SYMBOL(__delay);
 
-inline void __const_udelay(unsigned long xloops)
+void __const_udelay(unsigned long xloops)
 {
 	__delay(xloops_to_cycles(xloops));
 }
@@ -58,12 +88,12 @@ EXPORT_SYMBOL(__const_udelay);
 
 void __udelay(unsigned long usecs)
 {
-	__const_udelay(usecs * 0x10C7UL); /* 2**32 / 1000000 (rounded up) */
+	__delay(USECS_TO_CYCLES(usecs));
 }
 EXPORT_SYMBOL(__udelay);
 
 void __ndelay(unsigned long nsecs)
 {
-	__const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
+	__delay(NSECS_TO_CYCLES(nsecs));
 }
 EXPORT_SYMBOL(__ndelay);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/5] arm64: refactor delay() to enable polling for value
  2024-11-05 18:30   ` [PATCH 3/5] arm64: refactor delay() to enable polling for value Haris Okanovic
@ 2024-11-05 19:42     ` Christoph Lameter (Ampere)
  2024-11-06 17:42       ` Okanovic, Haris
  2024-11-06  9:18     ` Catalin Marinas
  1 sibling, 1 reply; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-05 19:42 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, 5 Nov 2024, Haris Okanovic wrote:

> -#define USECS_TO_CYCLES(time_usecs)			\
> -	xloops_to_cycles((time_usecs) * 0x10C7UL)
> -
> -static inline unsigned long xloops_to_cycles(unsigned long xloops)
> +static inline u64 xloops_to_cycles(u64 xloops)
>  {
>  	return (xloops * loops_per_jiffy * HZ) >> 32;
>  }
>
> -void __delay(unsigned long cycles)
> +#define USECS_TO_XLOOPS(time_usecs) \
> +	((time_usecs) * 0x10C7UL)
> +
> +#define USECS_TO_CYCLES(time_usecs) \
> +	xloops_to_cycles(USECS_TO_XLOOPS(time_usecs))
> +


> +#define NSECS_TO_XLOOPS(time_nsecs) \
> +	((time_nsecs) * 0x10C7UL)

The constant here is the same value as for microseconds. If I remember
correctly its 5UL for nanoseconds.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/5] arm64: refactor delay() to enable polling for value
  2024-11-05 19:42     ` Christoph Lameter (Ampere)
@ 2024-11-06 17:42       ` Okanovic, Haris
  0 siblings, 0 replies; 74+ messages in thread
From: Okanovic, Haris @ 2024-11-06 17:42 UTC (permalink / raw)
  To: cl@gentwo.org
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	boris.ostrovsky@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, joao.m.martins@oracle.com,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, maobibo@loongson.cn, vkuznets@redhat.com,
	linux-arm-kernel@lists.infradead.org, Okanovic, Haris,
	linux-pm@vger.kernel.org, bp@alien8.de, mtosatti@redhat.com,
	x86@kernel.org, mark.rutland@arm.com

On Tue, 2024-11-05 at 11:42 -0800, Christoph Lameter (Ampere) wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Tue, 5 Nov 2024, Haris Okanovic wrote:
> 
> > -#define USECS_TO_CYCLES(time_usecs)                  \
> > -     xloops_to_cycles((time_usecs) * 0x10C7UL)
> > -
> > -static inline unsigned long xloops_to_cycles(unsigned long xloops)
> > +static inline u64 xloops_to_cycles(u64 xloops)
> >  {
> >       return (xloops * loops_per_jiffy * HZ) >> 32;
> >  }
> > 
> > -void __delay(unsigned long cycles)
> > +#define USECS_TO_XLOOPS(time_usecs) \
> > +     ((time_usecs) * 0x10C7UL)
> > +
> > +#define USECS_TO_CYCLES(time_usecs) \
> > +     xloops_to_cycles(USECS_TO_XLOOPS(time_usecs))
> > +
> 
> 
> > +#define NSECS_TO_XLOOPS(time_nsecs) \
> > +     ((time_nsecs) * 0x10C7UL)
> 
> The constant here is the same value as for microseconds. If I remember
> correctly its 5UL for nanoseconds.
> 

You're right, good catch. Should be `nsecs * 0x5UL` per old code.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/5] arm64: refactor delay() to enable polling for value
  2024-11-05 18:30   ` [PATCH 3/5] arm64: refactor delay() to enable polling for value Haris Okanovic
  2024-11-05 19:42     ` Christoph Lameter (Ampere)
@ 2024-11-06  9:18     ` Catalin Marinas
  2024-11-06 17:38       ` Okanovic, Haris
  1 sibling, 1 reply; 74+ messages in thread
From: Catalin Marinas @ 2024-11-06  9:18 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	will, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, wanpengli,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, mtosatti, sudeep.holla, cl, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, Nov 05, 2024 at 12:30:39PM -0600, Haris Okanovic wrote:
> +		do {
> +			cur = __READ_ONCE_EX(*addr);
> +			if ((cur & mask) == val) {
> +				break;
> +			}
>  			wfet(end);

Constructs like this need to be entirely in assembly. The compiler may
spill 'cur' onto the stack and the write could clear the exclusive
monitor which makes the wfet return immediately. That's highly CPU
implementation specific but it's the reason we have functions like
__cmpwait() in assembly (or whatever else deals with exclusives). IOW,
we can't have other memory accesses between the LDXR and whatever is
consuming the exclusive monitor armed state (typically STXR but WFE/WFET
can be another).

-- 
Catalin

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/5] arm64: refactor delay() to enable polling for value
  2024-11-06  9:18     ` Catalin Marinas
@ 2024-11-06 17:38       ` Okanovic, Haris
  0 siblings, 0 replies; 74+ messages in thread
From: Okanovic, Haris @ 2024-11-06 17:38 UTC (permalink / raw)
  To: catalin.marinas@arm.com
  Cc: kvm@vger.kernel.org, rafael@kernel.org, sudeep.holla@arm.com,
	joao.m.martins@oracle.com, ankur.a.arora@oracle.com,
	dave.hansen@linux.intel.com, konrad.wilk@oracle.com,
	wanpengli@tencent.com, cl@gentwo.org,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	maobibo@loongson.cn, pbonzini@redhat.com, tglx@linutronix.de,
	misono.tomohiro@fujitsu.com, daniel.lezcano@linaro.org,
	arnd@arndb.de, lenb@kernel.org, will@kernel.org, hpa@zytor.com,
	peterz@infradead.org, boris.ostrovsky@oracle.com,
	vkuznets@redhat.com, bp@alien8.de, Okanovic, Haris,
	linux-pm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	mtosatti@redhat.com, x86@kernel.org, mark.rutland@arm.com

On Wed, 2024-11-06 at 09:18 +0000, Catalin Marinas wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On Tue, Nov 05, 2024 at 12:30:39PM -0600, Haris Okanovic wrote:
> > +             do {
> > +                     cur = __READ_ONCE_EX(*addr);
> > +                     if ((cur & mask) == val) {
> > +                             break;
> > +                     }
> >                       wfet(end);
> 
> Constructs like this need to be entirely in assembly. The compiler may
> spill 'cur' onto the stack and the write could clear the exclusive
> monitor which makes the wfet return immediately. That's highly CPU
> implementation specific but it's the reason we have functions like
> __cmpwait() in assembly (or whatever else deals with exclusives). IOW,
> we can't have other memory accesses between the LDXR and whatever is
> consuming the exclusive monitor armed state (typically STXR but WFE/WFET
> can be another).

Agreed, will rewrite parts of delay() in asm.

> 
> --
> Catalin


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 4/5] arm64: add smp_vcond_load_relaxed()
  2024-11-05 18:30 ` Haris Okanovic
                     ` (2 preceding siblings ...)
  2024-11-05 18:30   ` [PATCH 3/5] arm64: refactor delay() to enable polling for value Haris Okanovic
@ 2024-11-05 18:30   ` Haris Okanovic
  2024-11-05 18:30   ` [PATCH 5/5] cpuidle: implement poll_idle() using smp_vcond_load_relaxed() Haris Okanovic
  2024-11-05 18:49   ` [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
  5 siblings, 0 replies; 74+ messages in thread
From: Haris Okanovic @ 2024-11-05 18:30 UTC (permalink / raw)
  To: ankur.a.arora, catalin.marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

Implement smp_vcond_load_relaxed() atop __delay_until_ul() on arm64,
to reduce number of busy loops while waiting for a value condition.

This implementation only support unsigned long words. It can be extended
via the enclosed case structure in barrier.h as needed.

Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
 arch/arm64/include/asm/barrier.h | 18 ++++++++++++++++++
 arch/arm64/lib/delay.c           | 16 ++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 1ca947d5c939..188327e3ce72 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -203,6 +203,24 @@ do {									\
 	(typeof(*ptr))VAL;						\
 })
 
+extern unsigned long __smp_vcond_load_relaxed_ul(
+	u64 nsecs, unsigned long* addr, unsigned long mask, unsigned long val);
+
+#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({		\
+	u64 __nsecs = (nsecs);						\
+	typeof(addr) __addr = (addr);					\
+	typeof(*__addr) __mask = (mask);				\
+	typeof(*__addr) __val = (val);					\
+	typeof(*__addr) __cur;						\
+	switch (sizeof(*__addr)) {					\
+	case sizeof(unsigned long):					\
+		__cur = __smp_vcond_load_relaxed_ul(			\
+			__nsecs, __addr, __mask, __val);		\
+		break;							\
+	}								\
+	(__cur);							\
+})
+
 #define smp_cond_load_acquire(ptr, cond_expr)				\
 ({									\
 	typeof(ptr) __PTR = (ptr);					\
diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index a7c3040af316..a61a13b04439 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -12,6 +12,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/timex.h>
+#include <linux/sched/clock.h>
 
 #include <clocksource/arm_arch_timer.h>
 #include <asm/readex.h>
@@ -97,3 +98,18 @@ void __ndelay(unsigned long nsecs)
 	__delay(NSECS_TO_CYCLES(nsecs));
 }
 EXPORT_SYMBOL(__ndelay);
+
+unsigned long __smp_vcond_load_relaxed_ul(
+	u64 nsecs, unsigned long* addr, unsigned long mask, unsigned long val)
+{
+	const u64 start = local_clock_noinstr();
+	const u64 cycles = NSECS_TO_CYCLES(nsecs);
+	unsigned long cur;
+
+	do {
+		cur = __delay_until_ul(cycles, addr, mask, val);
+	} while((cur & mask) != val && local_clock_noinstr() - start < nsecs);
+
+	return cur;
+}
+EXPORT_SYMBOL(__smp_vcond_load_relaxed_ul);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 5/5] cpuidle: implement poll_idle() using smp_vcond_load_relaxed()
  2024-11-05 18:30 ` Haris Okanovic
                     ` (3 preceding siblings ...)
  2024-11-05 18:30   ` [PATCH 4/5] arm64: add smp_vcond_load_relaxed() Haris Okanovic
@ 2024-11-05 18:30   ` Haris Okanovic
  2024-11-05 19:45     ` Christoph Lameter (Ampere)
  2024-11-05 18:49   ` [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
  5 siblings, 1 reply; 74+ messages in thread
From: Haris Okanovic @ 2024-11-05 18:30 UTC (permalink / raw)
  To: ankur.a.arora, catalin.marinas
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, will, tglx, mingo,
	bp, dave.hansen, x86, hpa, pbonzini, wanpengli, vkuznets, rafael,
	daniel.lezcano, peterz, arnd, lenb, mark.rutland, harisokn,
	mtosatti, sudeep.holla, cl, misono.tomohiro, maobibo,
	joao.m.martins, boris.ostrovsky, konrad.wilk

Implement poll_idle() using smp_vcond_load_relaxed() function.

Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
 drivers/cpuidle/poll_state.c | 36 +++++-------------------------------
 1 file changed, 5 insertions(+), 31 deletions(-)

diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index 61df2395585e..5553e6f31702 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -7,46 +7,20 @@
 #include <linux/sched.h>
 #include <linux/sched/clock.h>
 #include <linux/sched/idle.h>
-
-#ifdef CONFIG_ARM64
-/*
- * POLL_IDLE_RELAX_COUNT determines how often we check for timeout
- * while polling for TIF_NEED_RESCHED in thread_info->flags.
- *
- * Set this to a low value since arm64, instead of polling, uses a
- * event based mechanism.
- */
-#define POLL_IDLE_RELAX_COUNT	1
-#else
-#define POLL_IDLE_RELAX_COUNT	200
-#endif
+#include <asm/barrier.h>
 
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
-	u64 time_start;
-
-	time_start = local_clock_noinstr();
+	unsigned long flags;
 
 	dev->poll_time_limit = false;
 
 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		u64 limit;
-
-		limit = cpuidle_poll_time(drv, dev);
-
-		while (!need_resched()) {
-			unsigned int loop_count = 0;
-			if (local_clock_noinstr() - time_start > limit) {
-				dev->poll_time_limit = true;
-				break;
-			}
-
-			smp_cond_load_relaxed(&current_thread_info()->flags,
-					      VAL & _TIF_NEED_RESCHED ||
-					      loop_count++ >= POLL_IDLE_RELAX_COUNT);
-		}
+		u64 limit = cpuidle_poll_time(drv, dev);
+		flags = smp_vcond_load_relaxed(limit, &current_thread_info()->flags, _TIF_NEED_RESCHED, _TIF_NEED_RESCHED);
+		dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
 	}
 	raw_local_irq_disable();
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 5/5] cpuidle: implement poll_idle() using smp_vcond_load_relaxed()
  2024-11-05 18:30   ` [PATCH 5/5] cpuidle: implement poll_idle() using smp_vcond_load_relaxed() Haris Okanovic
@ 2024-11-05 19:45     ` Christoph Lameter (Ampere)
  0 siblings, 0 replies; 74+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-05 19:45 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, mtosatti, sudeep.holla, misono.tomohiro,
	maobibo, joao.m.martins, boris.ostrovsky, konrad.wilk

On Tue, 5 Nov 2024, Haris Okanovic wrote:

>  {
> -	u64 time_start;
> -
> -	time_start = local_clock_noinstr();
> +	unsigned long flags;

Lets keep recording that start value here. Otherwise the timeout could me
later than expected.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v8 00/11] Enable haltpoll on arm64
  2024-11-05 18:30 ` Haris Okanovic
                     ` (4 preceding siblings ...)
  2024-11-05 18:30   ` [PATCH 5/5] cpuidle: implement poll_idle() using smp_vcond_load_relaxed() Haris Okanovic
@ 2024-11-05 18:49   ` Ankur Arora
  5 siblings, 0 replies; 74+ messages in thread
From: Ankur Arora @ 2024-11-05 18:49 UTC (permalink / raw)
  To: Haris Okanovic
  Cc: ankur.a.arora, catalin.marinas, linux-pm, kvm, linux-arm-kernel,
	linux-kernel, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, wanpengli, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, mtosatti, sudeep.holla, cl,
	misono.tomohiro, maobibo, joao.m.martins, boris.ostrovsky,
	konrad.wilk


Haris Okanovic <harisokn@amazon.com> writes:

> Hi Ankur, Catalin,
>
> How about the following series based on a refactor of arm64's delay()?
> Does it address your earlier concerns?
>
> delay() already implements wfet() and falls back to wfe() w/ evstream
> or a cpu_relax loop. I refactored it to poll an address, and wrapped in
> a new platform-agnostic smp_vcond_load_relaxed() macro. More details in
> the following commit log.

Haven't looked at your series too closely but it looks quite a bit
different from the version I was working on.

Let me send out my version as well in the next few days.

--
ankur

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2024-11-09  9:49 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-25 23:24 [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora
2024-09-25 23:24 ` [PATCH v8 01/11] cpuidle/poll_state: poll via smp_cond_load_relaxed() Ankur Arora
2024-10-15 12:04   ` Catalin Marinas
2024-10-15 16:42     ` Christoph Lameter (Ampere)
2024-10-15 16:50       ` Catalin Marinas
2024-10-15 17:17         ` Christoph Lameter (Ampere)
2024-10-15 17:40           ` Catalin Marinas
2024-10-15 21:53             ` Ankur Arora
2024-10-15 22:28               ` Christoph Lameter (Ampere)
2024-10-16  7:06                 ` Ankur Arora
2024-10-17 16:54                   ` Christoph Lameter (Ampere)
2024-10-17 18:36                     ` Ankur Arora
2024-10-15 22:40               ` Christoph Lameter (Ampere)
2024-10-16  9:54                 ` Catalin Marinas
2024-10-17 16:56                   ` Christoph Lameter (Ampere)
2024-10-17 18:15                     ` Catalin Marinas
2024-10-17 19:34                       ` Ankur Arora
2024-10-15 21:32     ` Ankur Arora
2024-10-16  6:20       ` maobibo
2024-10-16 10:06       ` Catalin Marinas
2024-10-16 15:13     ` Okanovic, Haris
2024-10-16 17:04       ` Ankur Arora
2024-10-16 18:04         ` Okanovic, Haris
2024-10-17 14:01       ` Catalin Marinas
2024-10-17 22:47         ` Ankur Arora
2024-10-18 11:05           ` Catalin Marinas
2024-10-18 19:00             ` Ankur Arora
2024-10-21 12:02               ` Catalin Marinas
2024-09-25 23:24 ` [PATCH v8 02/11] cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL Ankur Arora
2024-09-25 23:24 ` [PATCH v8 03/11] Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig Ankur Arora
2024-09-25 23:24 ` [PATCH v8 04/11] cpuidle-haltpoll: define arch_haltpoll_want() Ankur Arora
2024-09-25 23:24 ` [PATCH v8 05/11] governors/haltpoll: drop kvm_para_available() check Ankur Arora
2024-09-25 23:24 ` [PATCH v8 06/11] cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL Ankur Arora
2024-09-25 23:24 ` [PATCH v8 07/11] arm64: define TIF_POLLING_NRFLAG Ankur Arora
2024-09-25 23:24 ` [PATCH v8 08/11] arm64: idle: export arch_cpu_idle Ankur Arora
2024-09-25 23:24 ` [PATCH v8 09/11] arm64: select ARCH_HAS_OPTIMIZED_POLL Ankur Arora
2024-09-25 23:24 ` [PATCH v8 10/11] cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64 Ankur Arora
2024-09-25 23:24 ` [PATCH v8 11/11] arm64: support cpuidle-haltpoll Ankur Arora
2024-10-02 22:42   ` Okanovic, Haris
2024-10-03  3:29     ` Ankur Arora
2024-10-16 15:13   ` Okanovic, Haris
2024-10-09  2:37 ` [PATCH v8 00/11] Enable haltpoll on arm64 zhenglifeng (A)
2024-10-15  1:53   ` Ankur Arora
2024-10-14 22:54 ` Christoph Lameter (Ampere)
2024-10-15 12:36 ` Marc Zyngier
2024-10-16 21:55   ` Ankur Arora
2024-10-17  8:19     ` Marc Zyngier
2024-10-17 18:35       ` Ankur Arora
2024-10-22 22:01         ` Ankur Arora
2024-11-05 18:30 ` Haris Okanovic
2024-11-05 18:30   ` [PATCH 1/5] asm-generic: add smp_vcond_load_relaxed() Haris Okanovic
2024-11-05 19:36     ` Christoph Lameter (Ampere)
2024-11-06 17:06       ` Okanovic, Haris
2024-11-06 11:08     ` Catalin Marinas
2024-11-06 18:13       ` Okanovic, Haris
2024-11-06 19:55         ` Catalin Marinas
2024-11-06 20:31           ` Okanovic, Haris
2024-11-06 11:39     ` Will Deacon
2024-11-06 17:18       ` Okanovic, Haris
2024-11-05 18:30   ` [PATCH 2/5] arm64: add __READ_ONCE_EX() Haris Okanovic
2024-11-05 19:39     ` Christoph Lameter (Ampere)
2024-11-06 17:37       ` Okanovic, Haris
2024-11-06 11:43     ` Will Deacon
2024-11-06 17:09       ` Okanovic, Haris
2024-11-09  9:49     ` David Laight
2024-11-05 18:30   ` [PATCH 3/5] arm64: refactor delay() to enable polling for value Haris Okanovic
2024-11-05 19:42     ` Christoph Lameter (Ampere)
2024-11-06 17:42       ` Okanovic, Haris
2024-11-06  9:18     ` Catalin Marinas
2024-11-06 17:38       ` Okanovic, Haris
2024-11-05 18:30   ` [PATCH 4/5] arm64: add smp_vcond_load_relaxed() Haris Okanovic
2024-11-05 18:30   ` [PATCH 5/5] cpuidle: implement poll_idle() using smp_vcond_load_relaxed() Haris Okanovic
2024-11-05 19:45     ` Christoph Lameter (Ampere)
2024-11-05 18:49   ` [PATCH v8 00/11] Enable haltpoll on arm64 Ankur Arora

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).