linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v9 00/15] arm64: support poll_idle()
@ 2024-11-07 19:08 Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout() Ankur Arora
                   ` (16 more replies)
  0 siblings, 17 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

This patchset adds support for polling in idle via poll_idle() on
arm64.

There are two main changes in this version:

1. rework the series to take into account Catalin Marinas' comments on
   the semantics of smp_cond_load_relaxed() (and how earlier versions
   of this series were abusing them).

   This also allows dropping the somewhat strained connections
   between haltpoll and the event-stream.

2. earlier versions of this series added support for poll_idle() but
   only used it in the haltpoll driver. Add Lifeng's patch to broaden
   this out by also polling in acpi-idle.

Polling in idle reduces the cost of remote wakeups. When the target is
polling, these can be done just by setting the need-resched bit, instead
of sending an IPI and incurring the cost of handling the interrupt on
the receiver side. When running in a VM it also saves the cost of WFE
trapping (where that is enabled).
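
As a simplified illustration (not from this series; the helper below is
hypothetical and the real scheduler logic is more involved), the
wakeup-side decision looks roughly like this:

    /* Wake a remote idle CPU: sketch of the IPI-avoidance path. */
    static void wake_idle_cpu_sketch(struct thread_info *ti, int cpu)
    {
            if (test_bit(TIF_POLLING_NRFLAG, &ti->flags)) {
                    /* Target is polling in idle: flagging it is enough. */
                    set_bit(TIF_NEED_RESCHED, &ti->flags);
            } else {
                    /* Target may be in a low-power state: kick it with an IPI. */
                    smp_send_reschedule(cpu);
            }
    }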

Comparing sched-pipe performance on a guest VM:

# perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
  perf bench sched pipe -l 1000000 -c 4

# no polling in idle

 Performance counter stats for 'CPU(s) 4,5' (5 runs):

         25,229.57 msec task-clock                       #    2.000 CPUs utilized               ( +-  7.75% )
    45,821,250,284      cycles                           #    1.816 GHz                         ( +- 10.07% )
    26,557,496,665      instructions                     #    0.58  insn per cycle              ( +-  0.21% )
                 0      sched:sched_wake_idle_without_ipi #    0.000 /sec

            12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )


# polling in idle (with haltpoll):

 Performance counter stats for 'CPU(s) 4,5' (5 runs):

         15,131.58 msec task-clock                       #    2.000 CPUs utilized               ( +- 10.00% )
    34,158,188,839      cycles                           #    2.257 GHz                         ( +-  6.91% )
    20,824,950,916      instructions                     #    0.61  insn per cycle              ( +-  0.09% )
         1,983,822      sched:sched_wake_idle_without_ipi #  131.105 K/sec                       ( +-  0.78% )

             7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )

Tomohiro Misono and Haris Okanovic also report similar latency
improvements on Grace and Graviton systems (for v7) [1] [2].
Lifeng also reports improved context switch latency on a bare-metal
machine with acpi-idle [3].

The series is in four parts:

 - patches 1-4,

    "asm-generic: add barrier smp_cond_load_relaxed_timeout()"
    "cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout()"
    "cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL"
    "Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig"

   add smp_cond_load_relaxed_timeout() and switch poll_idle() to
   using it. Also, do some munging of related kconfig options.

 - patches 5-7,

    "arm64: barrier: add support for smp_cond_relaxed_timeout()"
    "arm64: define TIF_POLLING_NRFLAG"
    "arm64: add support for polling in idle"

   add support for the new barrier, the polling flag and enable
   poll_idle() support.

 - patches 8, 9-13,

    "ACPI: processor_idle: Support polling state for LPI"

    "cpuidle-haltpoll: define arch_haltpoll_want()"
    "governors/haltpoll: drop kvm_para_available() check"
    "cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL"
    "arm64: idle: export arch_cpu_idle"
    "arm64: support cpuidle-haltpoll"

    add support for polling via acpi-idle, and cpuidle-haltpoll.

  - patches 14, 15,
     "arm64/delay: move some constants out to a separate header"
     "arm64: support WFET in smp_cond_relaxed_timeout()"

    are RFC patches to enable WFET support.

Changelog:

v9:

 - reworked the series to address a comment from Catalin Marinas
   about how v8 was abusing the semantics of smp_cond_load_relaxed().
 - add poll_idle() support in acpi-idle (Lifeng Zheng)
 - dropped some earlier "Tested-by", "Reviewed-by" due to the
   above rework.

v8: No logic changes. Largely respin of v7, with changes
noted below:

 - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
   own patch.
   (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
   
 - address comments simplifying arm64 support (Will Deacon)
   (patch-11 "arm64: support cpuidle-haltpoll")

v7: No significant logic changes. Mostly a respin of v6.

 - minor cleanup in poll_idle() (Christoph Lameter)
 - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
   (Tomohiro Misono)

v6:

 - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
   changes together (comment from Christoph Lameter)
 - fleshes out the commit messages a bit more (comments from Christoph
   Lameter, Sudeep Holla)
 - also rework selection of cpuidle-haltpoll. Now selected based
   on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
 - moved back to arch_haltpoll_want() (comment from Joao Martins)
   Also, arch_haltpoll_want() now takes the force parameter and is
   now responsible for the complete selection (or not) of haltpoll.
 - fixes the build breakage on i386
 - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
   Tomohiro Misono, Haris Okanovic)

v5:
 - rework the poll_idle() loop around smp_cond_load_relaxed() (review
   comment from Tomohiro Misono.)
 - also rework selection of cpuidle-haltpoll. Now selected based
   on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
 - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
   arm64 now depends on the event-stream being enabled.
 - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
 - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.

v4 changes from v3:
 - change 7/8 per Rafael input: drop the parens and use ret for the final check
 - add 8/8 which renames the guard for building poll_state

v3 changes from v2:
 - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
 - add Ack-by from Rafael Wysocki on 2/7

v2 changes from v1:
 - added patch 7 where we replace cpu_relax() with smp_cond_load_relaxed() per PeterZ
   (this improves the CPU cycles consumed in the tests above by at least 50%:
   10,716,881,137 now vs 14,503,014,257 before)
 - removed the ifdef from patch 1 per RafaelW

Please review.

[1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
[2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
[3] https://lore.kernel.org/lkml/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/

Ankur Arora (10):
  asm-generic: add barrier smp_cond_load_relaxed_timeout()
  cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout()
  cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
  arm64: barrier: add support for smp_cond_relaxed_timeout()
  arm64: add support for polling in idle
  cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
  arm64: idle: export arch_cpu_idle
  arm64: support cpuidle-haltpoll
  arm64/delay: move some constants out to a separate header
  arm64: support WFET in smp_cond_relaxed_timeout()

Joao Martins (4):
  Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
  arm64: define TIF_POLLING_NRFLAG
  cpuidle-haltpoll: define arch_haltpoll_want()
  governors/haltpoll: drop kvm_para_available() check

Lifeng Zheng (1):
  ACPI: processor_idle: Support polling state for LPI

 arch/Kconfig                              |  3 ++
 arch/arm64/Kconfig                        |  7 +++
 arch/arm64/include/asm/barrier.h          | 62 ++++++++++++++++++++++-
 arch/arm64/include/asm/cmpxchg.h          | 26 ++++++----
 arch/arm64/include/asm/cpuidle_haltpoll.h | 20 ++++++++
 arch/arm64/include/asm/delay-const.h      | 25 +++++++++
 arch/arm64/include/asm/thread_info.h      |  2 +
 arch/arm64/kernel/idle.c                  |  1 +
 arch/arm64/lib/delay.c                    | 13 ++---
 arch/x86/Kconfig                          |  5 +-
 arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
 arch/x86/kernel/kvm.c                     | 13 +++++
 drivers/acpi/processor_idle.c             | 43 +++++++++++++---
 drivers/cpuidle/Kconfig                   |  5 +-
 drivers/cpuidle/Makefile                  |  2 +-
 drivers/cpuidle/cpuidle-haltpoll.c        | 12 +----
 drivers/cpuidle/governors/haltpoll.c      |  6 +--
 drivers/cpuidle/poll_state.c              | 27 +++-------
 drivers/idle/Kconfig                      |  1 +
 include/asm-generic/barrier.h             | 42 +++++++++++++++
 include/linux/cpuidle.h                   |  2 +-
 include/linux/cpuidle_haltpoll.h          |  5 ++
 22 files changed, 252 insertions(+), 71 deletions(-)
 create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
 create mode 100644 arch/arm64/include/asm/delay-const.h

-- 
2.43.5



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout()
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-08  2:33   ` Christoph Lameter (Ampere)
  2024-11-26  5:01   ` Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 02/15] cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout() Ankur Arora
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

Add a timed variant of smp_cond_load_relaxed().

This is useful because arm64 supports polling on a condition variable
by waiting directly on the cacheline instead of spin-waiting for the
condition to change.

However, such an implementation has the problem that it can block
forever -- unless there's an explicit timeout or another out-of-band
mechanism that allows it to come out of the wait state periodically.

smp_cond_load_relaxed_timeout() supports these semantics by specifying
a time-check expression and an associated time-limit.

However, note that for the generic spin-wait implementation we want to
minimize the number of instructions executed in each iteration. So,
limit how often we evaluate the time-check expression by doing it only
once every smp_cond_time_check_count iterations.

The inner loop in poll_idle() has a structure and constraints
substantially similar to those of smp_cond_load_relaxed_timeout(), so
define smp_cond_time_check_count to the same value used in poll_idle().
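
For illustration, a hypothetical caller (not part of this patch) might
use it to wait for a flag with a bounded spin:

    /*
     * Wait for *flagp to become non-zero, giving up after roughly 1ms.
     * The macro returns the last value loaded, so the caller can
     * distinguish success from a timeout by re-testing the condition.
     */
    u64 start = local_clock();
    unsigned long seen;

    seen = smp_cond_load_relaxed_timeout(flagp, VAL != 0,
                                         local_clock(),
                                         start + NSEC_PER_MSEC);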

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/asm-generic/barrier.h | 42 +++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..77726ef807e4 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,48 @@ do {									\
 })
 #endif
 
+#ifndef smp_cond_time_check_count
+/*
+ * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns.
+ * This helps reduce the number of instructions executed while spin-waiting.
+ */
+#define smp_cond_time_check_count	200
+#endif
+
+/**
+ * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
+ * guarantees until a timeout expires.
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ * @time_expr_ns: evaluates to the current time
+ * @time_limit_ns: compared against time_expr_ns
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ *
+ * Due to C lacking lambda expressions we load the value of *ptr into a
+ * pre-named variable @VAL to be used in @cond.
+ */
+#ifndef smp_cond_load_relaxed_timeout
+#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
+				      time_limit_ns) ({			\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	unsigned int __count = 0;					\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		cpu_relax();						\
+		if (__count++ < smp_cond_time_check_count)		\
+			continue;					\
+		if ((time_expr_ns) >= time_limit_ns)			\
+			break;						\
+		__count = 0;						\
+	}								\
+	(typeof(*ptr))VAL;						\
+})
+#endif
+
 /*
  * pmem_wmb() ensures that all stores for which the modification
  * are written to persistent storage by preceding instructions have
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v9 02/15] cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout()
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout() Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 03/15] cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL Ankur Arora
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

The inner loop in poll_idle() polls to see if the thread's
TIF_NEED_RESCHED bit is set. The loop exits once the condition is met,
or if the poll time limit has been exceeded.

To minimize the number of instructions executed in each iteration, the
time check is rate-limited. In addition, each loop iteration executes
cpu_relax(), which on certain platforms hints to the pipeline that the
loop is busy-waiting, allowing the processor to reduce power
consumption.

However, cpu_relax() is defined optimally only on x86. On arm64, for
instance, it is implemented as a YIELD, which only serves as a hint to
the CPU that it should prioritize a different hardware thread if one is
available. arm64 does, however, expose a more optimal polling mechanism
via smp_cond_load_relaxed_timeout(), which uses LDXR/WFE to wait until
there is a store to a specified region, or until a timeout expires.

These semantics are essentially identical to what we want
from poll_idle(). So, restructure the loop to use
smp_cond_load_relaxed_timeout() instead.

The generated code remains close to the original version.

Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/cpuidle/poll_state.c | 27 ++++++++-------------------
 1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index 9b6d90a72601..0b42971393c9 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -8,35 +8,24 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/idle.h>
 
-#define POLL_IDLE_RELAX_COUNT	200
-
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
-	u64 time_start;
-
-	time_start = local_clock_noinstr();
 
 	dev->poll_time_limit = false;
 
 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		unsigned int loop_count = 0;
-		u64 limit;
+		unsigned long flags;
+		u64 time_start = local_clock_noinstr();
+		u64 limit = cpuidle_poll_time(drv, dev);
 
-		limit = cpuidle_poll_time(drv, dev);
+		flags = smp_cond_load_relaxed_timeout(&current_thread_info()->flags,
+						      VAL & _TIF_NEED_RESCHED,
+						      local_clock_noinstr(),
+						      time_start + limit);
 
-		while (!need_resched()) {
-			cpu_relax();
-			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
-				continue;
-
-			loop_count = 0;
-			if (local_clock_noinstr() - time_start > limit) {
-				dev->poll_time_limit = true;
-				break;
-			}
-		}
+		dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
 	}
 	raw_local_irq_disable();
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v9 03/15] cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout() Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 02/15] cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout() Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 04/15] Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig Ankur Arora
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

ARCH_HAS_CPU_RELAX is defined on architectures that provide a
primitive (via cpu_relax()) that can be used as part of a polling
mechanism -- one that would be cheaper than spinning in a tight
loop.

However, recent changes in poll_idle() mean that a higher level
primitive -- smp_cond_load_relaxed_timeout() -- is now used for
polling. This in turn uses cpu_relax() or an architecture-specific
implementation. On arm64 in particular this turns into a WFE, which
waits for a store to the cacheline instead of busy-polling.

Accordingly, condition the polling drivers on ARCH_HAS_OPTIMIZED_POLL
instead of ARCH_HAS_CPU_RELAX. While at it, make both intel-idle
and cpuidle-haltpoll explicitly depend on ARCH_HAS_OPTIMIZED_POLL.

Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/Kconfig              | 2 +-
 drivers/acpi/processor_idle.c | 4 ++--
 drivers/cpuidle/Kconfig       | 2 +-
 drivers/cpuidle/Makefile      | 2 +-
 drivers/idle/Kconfig          | 1 +
 include/linux/cpuidle.h       | 2 +-
 6 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 16354dfa6d96..3fa741dc0445 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -378,7 +378,7 @@ config ARCH_MAY_HAVE_PC_FDC
 config GENERIC_CALIBRATE_DELAY
 	def_bool y
 
-config ARCH_HAS_CPU_RELAX
+config ARCH_HAS_OPTIMIZED_POLL
 	def_bool y
 
 config ARCH_HIBERNATION_POSSIBLE
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 831fa4a12159..44096406d65d 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -35,7 +35,7 @@
 #include <asm/cpu.h>
 #endif
 
-#define ACPI_IDLE_STATE_START	(IS_ENABLED(CONFIG_ARCH_HAS_CPU_RELAX) ? 1 : 0)
+#define ACPI_IDLE_STATE_START	(IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL) ? 1 : 0)
 
 static unsigned int max_cstate __read_mostly = ACPI_PROCESSOR_MAX_POWER;
 module_param(max_cstate, uint, 0400);
@@ -782,7 +782,7 @@ static int acpi_processor_setup_cstates(struct acpi_processor *pr)
 	if (max_cstate == 0)
 		max_cstate = 1;
 
-	if (IS_ENABLED(CONFIG_ARCH_HAS_CPU_RELAX)) {
+	if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
 		cpuidle_poll_state_init(drv);
 		count = 1;
 	} else {
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index cac5997dca50..75f6e176bbc8 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -73,7 +73,7 @@ endmenu
 
 config HALTPOLL_CPUIDLE
 	tristate "Halt poll cpuidle driver"
-	depends on X86 && KVM_GUEST
+	depends on X86 && KVM_GUEST && ARCH_HAS_OPTIMIZED_POLL
 	select CPU_IDLE_GOV_HALTPOLL
 	default y
 	help
diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index d103342b7cfc..f29dfd1525b0 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -7,7 +7,7 @@ obj-y += cpuidle.o driver.o governor.o sysfs.o governors/
 obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
 obj-$(CONFIG_DT_IDLE_STATES)		  += dt_idle_states.o
 obj-$(CONFIG_DT_IDLE_GENPD)		  += dt_idle_genpd.o
-obj-$(CONFIG_ARCH_HAS_CPU_RELAX)	  += poll_state.o
+obj-$(CONFIG_ARCH_HAS_OPTIMIZED_POLL)	  += poll_state.o
 obj-$(CONFIG_HALTPOLL_CPUIDLE)		  += cpuidle-haltpoll.o
 
 ##################################################################################
diff --git a/drivers/idle/Kconfig b/drivers/idle/Kconfig
index 6707d2539fc4..6f9b1d48fede 100644
--- a/drivers/idle/Kconfig
+++ b/drivers/idle/Kconfig
@@ -4,6 +4,7 @@ config INTEL_IDLE
 	depends on CPU_IDLE
 	depends on X86
 	depends on CPU_SUP_INTEL
+	depends on ARCH_HAS_OPTIMIZED_POLL
 	help
 	  Enable intel_idle, a cpuidle driver that includes knowledge of
 	  native Intel hardware idle features.  The acpi_idle driver
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 3183aeb7f5b4..7e7e58a17b07 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -275,7 +275,7 @@ static inline void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev,
 }
 #endif
 
-#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
+#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_OPTIMIZED_POLL)
 void cpuidle_poll_state_init(struct cpuidle_driver *drv);
 #else
 static inline void cpuidle_poll_state_init(struct cpuidle_driver *drv) {}
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v9 04/15] Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (2 preceding siblings ...)
  2024-11-07 19:08 ` [PATCH v9 03/15] cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 05/15] arm64: barrier: add support for smp_cond_relaxed_timeout() Ankur Arora
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

From: Joao Martins <joao.m.martins@oracle.com>

ARCH_HAS_OPTIMIZED_POLL gates selection of polling while idle in
poll_idle(). Move the configuration option to arch/Kconfig to allow
non-x86 architectures to select it.

Note that ARCH_HAS_OPTIMIZED_POLL should probably be mutually exclusive
with GENERIC_IDLE_POLL_SETUP (which controls the generic polling logic
in cpu_idle_poll()). However, making that change would remove the
hlt=/nohlt= boot options. So, leave it untouched for now.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/Kconfig     | 3 +++
 arch/x86/Kconfig | 4 +---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index bd9f095d69fa..c3a9de71c09f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -273,6 +273,9 @@ config HAVE_ARCH_TRACEHOOK
 config HAVE_DMA_CONTIGUOUS
 	bool
 
+config ARCH_HAS_OPTIMIZED_POLL
+	bool
+
 config GENERIC_SMP_IDLE_THREAD
 	bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3fa741dc0445..df75df8467d1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -138,6 +138,7 @@ config X86
 	select ARCH_WANTS_NO_INSTR
 	select ARCH_WANT_GENERAL_HUGETLB
 	select ARCH_WANT_HUGE_PMD_SHARE
+	select ARCH_HAS_OPTIMIZED_POLL
 	select ARCH_WANT_LD_ORPHAN_WARN
 	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP	if X86_64
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
@@ -378,9 +379,6 @@ config ARCH_MAY_HAVE_PC_FDC
 config GENERIC_CALIBRATE_DELAY
 	def_bool y
 
-config ARCH_HAS_OPTIMIZED_POLL
-	def_bool y
-
 config ARCH_HIBERNATION_POSSIBLE
 	def_bool y
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v9 05/15] arm64: barrier: add support for smp_cond_relaxed_timeout()
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (3 preceding siblings ...)
  2024-11-07 19:08 ` [PATCH v9 04/15] Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-12-10 13:50   ` Will Deacon
  2024-11-07 19:08 ` [PATCH v9 06/15] arm64: define TIF_POLLING_NRFLAG Ankur Arora
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

Support a waited variant of polling on a condition variable
via smp_cond_load_relaxed_timeout().

This uses the __cmpwait_relaxed() primitive to do the actual
waiting, when the wait can be guaranteed not to block forever
(in case there are no stores to the waited-for cacheline).
For this we depend on the availability of the event-stream.

For cases where the event-stream is unavailable, we fall back to
a spin-wait implementation which is identical to the generic
variant.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/barrier.h | 54 ++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 1ca947d5c939..ab2515ecd6ca 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -216,6 +216,60 @@ do {									\
 	(typeof(*ptr))VAL;						\
 })
 
+#define __smp_cond_load_timeout_spin(ptr, cond_expr,			\
+				     time_expr_ns, time_limit_ns)	\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	unsigned int __count = 0;					\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		cpu_relax();						\
+		if (__count++ < smp_cond_time_check_count)		\
+			continue;					\
+		if ((time_expr_ns) >= time_limit_ns)			\
+			break;						\
+		__count = 0;						\
+	}								\
+	(typeof(*ptr))VAL;						\
+})
+
+#define __smp_cond_load_timeout_wait(ptr, cond_expr,			\
+				     time_expr_ns, time_limit_ns)	\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		__cmpwait_relaxed(__PTR, VAL);				\
+		if ((time_expr_ns) >= time_limit_ns)			\
+			break;						\
+	}								\
+	(typeof(*ptr))VAL;						\
+})
+
+#define smp_cond_load_relaxed_timeout(ptr, cond_expr,			\
+				      time_expr_ns, time_limit_ns)	\
+({									\
+	__unqual_scalar_typeof(*ptr) _val;				\
+									\
+	int __wfe = arch_timer_evtstrm_available();			\
+	if (likely(__wfe))						\
+		_val = __smp_cond_load_timeout_wait(ptr, cond_expr,	\
+						   time_expr_ns,	\
+						   time_limit_ns);	\
+	else								\
+		_val = __smp_cond_load_timeout_spin(ptr, cond_expr,	\
+						   time_expr_ns,	\
+						   time_limit_ns);	\
+	(typeof(*ptr))_val;						\
+})
+
+
 #include <asm-generic/barrier.h>
 
 #endif	/* __ASSEMBLY__ */
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v9 06/15] arm64: define TIF_POLLING_NRFLAG
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (4 preceding siblings ...)
  2024-11-07 19:08 ` [PATCH v9 05/15] arm64: barrier: add support for smp_cond_relaxed_timeout() Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 07/15] arm64: add support for polling in idle Ankur Arora
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

From: Joao Martins <joao.m.martins@oracle.com>

Commit 842514849a61 ("arm64: Remove TIF_POLLING_NRFLAG") had removed
TIF_POLLING_NRFLAG because arm64 only supported non-polled idling via
cpu_do_idle().

To add support for polling via cpuidle-haltpoll, we want to use the
standard poll_idle() interface, which sets TIF_POLLING_NRFLAG while
polling.

Reuse the same bit to define TIF_POLLING_NRFLAG.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/thread_info.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 1114c1c3300a..5326cd583b01 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -69,6 +69,7 @@ void arch_setup_new_exec(void);
 #define TIF_SYSCALL_TRACEPOINT	10	/* syscall tracepoint for ftrace */
 #define TIF_SECCOMP		11	/* syscall secure computing */
 #define TIF_SYSCALL_EMU		12	/* syscall emulation active */
+#define TIF_POLLING_NRFLAG	16	/* set while polling in poll_idle() */
 #define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_FREEZE		19
 #define TIF_RESTORE_SIGMASK	20
@@ -92,6 +93,7 @@ void arch_setup_new_exec(void);
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_SYSCALL_EMU	(1 << TIF_SYSCALL_EMU)
+#define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_32BIT		(1 << TIF_32BIT)
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v9 07/15] arm64: add support for polling in idle
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (5 preceding siblings ...)
  2024-11-07 19:08 ` [PATCH v9 06/15] arm64: define TIF_POLLING_NRFLAG Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 08/15] ACPI: processor_idle: Support polling state for LPI Ankur Arora
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

Polling in idle with poll_idle() needs TIF_POLLING_NRFLAG support,
and a cheap mechanism to do the actual polling via
smp_cond_load_relaxed_timeout().

Both of these are present on arm64. So, select ARCH_HAS_OPTIMIZED_POLL
to enable it.

Enabling this should help reduce the cost of remote wakeups: if the
target has set TIF_POLLING_NRFLAG (as it does while polling in idle),
the scheduler can wake it just by setting the need-resched bit. This
contrasts with sending an IPI and incurring the cost of handling the
interrupt on the receiver.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fd9df6dcc593..43762c68e357 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -38,6 +38,7 @@ config ARM64
 	select ARCH_HAS_MEM_ENCRYPT
 	select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
+	select ARCH_HAS_OPTIMIZED_POLL
 	select ARCH_HAS_PTE_DEVMAP
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_HW_PTE_YOUNG
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v9 08/15] ACPI: processor_idle: Support polling state for LPI
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (6 preceding siblings ...)
  2024-11-07 19:08 ` [PATCH v9 07/15] arm64: add support for polling in idle Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 09/15] cpuidle-haltpoll: define arch_haltpoll_want() Ankur Arora
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

From: Lifeng Zheng <zhenglifeng1@huawei.com>

Initialize an optional polling state alongside the LPI states.

Wrap the LPI enter method in a new one that correctly reflects the
actual entered state when the polling state is enabled.

Signed-off-by: Lifeng Zheng <zhenglifeng1@huawei.com>
Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com>
---
 drivers/acpi/processor_idle.c | 39 ++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 44096406d65d..d154b5d77328 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -1194,20 +1194,46 @@ static int acpi_idle_lpi_enter(struct cpuidle_device *dev,
 	return -EINVAL;
 }
 
+/* To correctly reflect the entered state if the poll state is enabled. */
+static int acpi_idle_lpi_enter_with_poll_state(struct cpuidle_device *dev,
+			       struct cpuidle_driver *drv, int index)
+{
+	int entered_state;
+
+	if (unlikely(index < 1))
+		return -EINVAL;
+
+	entered_state = acpi_idle_lpi_enter(dev, drv, index - 1);
+	if (entered_state < 0)
+		return entered_state;
+
+	return entered_state + 1;
+}
+
 static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
 {
-	int i;
+	int i, count;
 	struct acpi_lpi_state *lpi;
 	struct cpuidle_state *state;
 	struct cpuidle_driver *drv = &acpi_idle_driver;
+	typeof(state->enter) enter_method;
 
 	if (!pr->flags.has_lpi)
 		return -EOPNOTSUPP;
 
+	if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
+		cpuidle_poll_state_init(drv);
+		count = 1;
+		enter_method = acpi_idle_lpi_enter_with_poll_state;
+	} else {
+		count = 0;
+		enter_method = acpi_idle_lpi_enter;
+	}
+
 	for (i = 0; i < pr->power.count && i < CPUIDLE_STATE_MAX; i++) {
 		lpi = &pr->power.lpi_states[i];
 
-		state = &drv->states[i];
+		state = &drv->states[count];
 		snprintf(state->name, CPUIDLE_NAME_LEN, "LPI-%d", i);
 		strscpy(state->desc, lpi->desc, CPUIDLE_DESC_LEN);
 		state->exit_latency = lpi->wake_latency;
@@ -1215,11 +1241,14 @@ static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
 		state->flags |= arch_get_idle_state_flags(lpi->arch_flags);
 		if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
 			state->flags |= CPUIDLE_FLAG_RCU_IDLE;
-		state->enter = acpi_idle_lpi_enter;
-		drv->safe_state_index = i;
+		state->enter = enter_method;
+		drv->safe_state_index = count;
+		count++;
+		if (count == CPUIDLE_STATE_MAX)
+			break;
 	}
 
-	drv->state_count = i;
+	drv->state_count = count;
 
 	return 0;
 }
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v9 09/15] cpuidle-haltpoll: define arch_haltpoll_want()
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (7 preceding siblings ...)
  2024-11-07 19:08 ` [PATCH v9 08/15] ACPI: processor_idle: Support polling state for LPI Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 10/15] governors/haltpoll: drop kvm_para_available() check Ankur Arora
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

From: Joao Martins <joao.m.martins@oracle.com>

While initializing haltpoll we check whether KVM supports the
realtime hint and whether idle is overridden at boot.

Both of these checks are x86-specific. So, in pursuit of
making cpuidle-haltpoll architecture-independent, move these
checks out of common code.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/cpuidle_haltpoll.h |  1 +
 arch/x86/kernel/kvm.c                   | 13 +++++++++++++
 drivers/cpuidle/cpuidle-haltpoll.c      | 12 +-----------
 include/linux/cpuidle_haltpoll.h        |  5 +++++
 4 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/cpuidle_haltpoll.h b/arch/x86/include/asm/cpuidle_haltpoll.h
index c8b39c6716ff..8a0a12769c2e 100644
--- a/arch/x86/include/asm/cpuidle_haltpoll.h
+++ b/arch/x86/include/asm/cpuidle_haltpoll.h
@@ -4,5 +4,6 @@
 
 void arch_haltpoll_enable(unsigned int cpu);
 void arch_haltpoll_disable(unsigned int cpu);
+bool arch_haltpoll_want(bool force);
 
 #endif
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 21e9e4845354..6d717819eb4e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -1155,4 +1155,17 @@ void arch_haltpoll_disable(unsigned int cpu)
 	smp_call_function_single(cpu, kvm_enable_host_haltpoll, NULL, 1);
 }
 EXPORT_SYMBOL_GPL(arch_haltpoll_disable);
+
+bool arch_haltpoll_want(bool force)
+{
+	/* Do not load haltpoll if idle= is passed */
+	if (boot_option_idle_override != IDLE_NO_OVERRIDE)
+		return false;
+
+	if (!kvm_para_available())
+		return false;
+
+	return kvm_para_has_hint(KVM_HINTS_REALTIME) || force;
+}
+EXPORT_SYMBOL_GPL(arch_haltpoll_want);
 #endif
diff --git a/drivers/cpuidle/cpuidle-haltpoll.c b/drivers/cpuidle/cpuidle-haltpoll.c
index bcd03e893a0a..e532aa2bf608 100644
--- a/drivers/cpuidle/cpuidle-haltpoll.c
+++ b/drivers/cpuidle/cpuidle-haltpoll.c
@@ -15,7 +15,6 @@
 #include <linux/cpuidle.h>
 #include <linux/module.h>
 #include <linux/sched/idle.h>
-#include <linux/kvm_para.h>
 #include <linux/cpuidle_haltpoll.h>
 
 static bool force __read_mostly;
@@ -93,21 +92,12 @@ static void haltpoll_uninit(void)
 	haltpoll_cpuidle_devices = NULL;
 }
 
-static bool haltpoll_want(void)
-{
-	return kvm_para_has_hint(KVM_HINTS_REALTIME) || force;
-}
-
 static int __init haltpoll_init(void)
 {
 	int ret;
 	struct cpuidle_driver *drv = &haltpoll_driver;
 
-	/* Do not load haltpoll if idle= is passed */
-	if (boot_option_idle_override != IDLE_NO_OVERRIDE)
-		return -ENODEV;
-
-	if (!kvm_para_available() || !haltpoll_want())
+	if (!arch_haltpoll_want(force))
 		return -ENODEV;
 
 	cpuidle_poll_state_init(drv);
diff --git a/include/linux/cpuidle_haltpoll.h b/include/linux/cpuidle_haltpoll.h
index d50c1e0411a2..68eb7a757120 100644
--- a/include/linux/cpuidle_haltpoll.h
+++ b/include/linux/cpuidle_haltpoll.h
@@ -12,5 +12,10 @@ static inline void arch_haltpoll_enable(unsigned int cpu)
 static inline void arch_haltpoll_disable(unsigned int cpu)
 {
 }
+
+static inline bool arch_haltpoll_want(bool force)
+{
+	return false;
+}
 #endif
 #endif
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v9 10/15] governors/haltpoll: drop kvm_para_available() check
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (8 preceding siblings ...)
  2024-11-07 19:08 ` [PATCH v9 09/15] cpuidle-haltpoll: define arch_haltpoll_want() Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 11/15] cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL Ankur Arora
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

From: Joao Martins <joao.m.martins@oracle.com>

The haltpoll governor is selected either by the cpuidle-haltpoll
driver, or explicitly by the user.
In particular, it is never selected by default since it has the lowest
rating of all governors (menu=20, teo=19, ladder=10/25, haltpoll=9).

So, we can safely forgo the kvm_para_available() check. This also
allows cpuidle-haltpoll to be tested on baremetal.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 drivers/cpuidle/governors/haltpoll.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/cpuidle/governors/haltpoll.c b/drivers/cpuidle/governors/haltpoll.c
index 663b7f164d20..c8752f793e61 100644
--- a/drivers/cpuidle/governors/haltpoll.c
+++ b/drivers/cpuidle/governors/haltpoll.c
@@ -18,7 +18,6 @@
 #include <linux/tick.h>
 #include <linux/sched.h>
 #include <linux/module.h>
-#include <linux/kvm_para.h>
 #include <trace/events/power.h>
 
 static unsigned int guest_halt_poll_ns __read_mostly = 200000;
@@ -148,10 +147,7 @@ static struct cpuidle_governor haltpoll_governor = {
 
 static int __init init_haltpoll(void)
 {
-	if (kvm_para_available())
-		return cpuidle_register_governor(&haltpoll_governor);
-
-	return 0;
+	return cpuidle_register_governor(&haltpoll_governor);
 }
 
 postcore_initcall(init_haltpoll);
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v9 11/15] cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (9 preceding siblings ...)
  2024-11-07 19:08 ` [PATCH v9 10/15] governors/haltpoll: drop kvm_para_available() check Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 12/15] arm64: idle: export arch_cpu_idle Ankur Arora
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

The cpuidle-haltpoll driver and its namesake governor are selected
under KVM_GUEST on X86. KVM_GUEST in turn selects ARCH_CPUIDLE_HALTPOLL
and defines the requisite arch_haltpoll_{enable,disable}() functions.

So remove the explicit dependence of HALTPOLL_CPUIDLE on KVM_GUEST,
and instead use ARCH_CPUIDLE_HALTPOLL as a proxy for architectural
support for haltpoll.

Also change "halt poll" to "haltpoll" in one of the summary clauses,
since the second form is used everywhere else.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/Kconfig        | 1 +
 drivers/cpuidle/Kconfig | 5 ++---
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index df75df8467d1..fd0ff83a84f0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -844,6 +844,7 @@ config KVM_GUEST
 
 config ARCH_CPUIDLE_HALTPOLL
 	def_bool n
+	depends on KVM_GUEST
 	prompt "Disable host haltpoll when loading haltpoll driver"
 	help
 	  If virtualized under KVM, disable host haltpoll.
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 75f6e176bbc8..c1bebadf22bc 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -35,7 +35,6 @@ config CPU_IDLE_GOV_TEO
 
 config CPU_IDLE_GOV_HALTPOLL
 	bool "Haltpoll governor (for virtualized systems)"
-	depends on KVM_GUEST
 	help
 	  This governor implements haltpoll idle state selection, to be
 	  used in conjunction with the haltpoll cpuidle driver, allowing
@@ -72,8 +71,8 @@ source "drivers/cpuidle/Kconfig.riscv"
 endmenu
 
 config HALTPOLL_CPUIDLE
-	tristate "Halt poll cpuidle driver"
-	depends on X86 && KVM_GUEST && ARCH_HAS_OPTIMIZED_POLL
+	tristate "Haltpoll cpuidle driver"
+	depends on ARCH_CPUIDLE_HALTPOLL && ARCH_HAS_OPTIMIZED_POLL
 	select CPU_IDLE_GOV_HALTPOLL
 	default y
 	help
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v9 12/15] arm64: idle: export arch_cpu_idle
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (10 preceding siblings ...)
  2024-11-07 19:08 ` [PATCH v9 11/15] cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-07 19:08 ` [PATCH v9 13/15] arm64: support cpuidle-haltpoll Ankur Arora
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

Needed for cpuidle-haltpoll.

Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/kernel/idle.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
index 05cfb347ec26..b85ba0df9b02 100644
--- a/arch/arm64/kernel/idle.c
+++ b/arch/arm64/kernel/idle.c
@@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
 	 */
 	cpu_do_idle();
 }
+EXPORT_SYMBOL_GPL(arch_cpu_idle);
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v9 13/15] arm64: support cpuidle-haltpoll
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (11 preceding siblings ...)
  2024-11-07 19:08 ` [PATCH v9 12/15] arm64: idle: export arch_cpu_idle Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-07 19:08 ` [RFC PATCH v9 14/15] arm64/delay: move some constants out to a separate header Ankur Arora
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

Add architectural support for the cpuidle-haltpoll driver by defining
arch_haltpoll_*(). Also define ARCH_CPUIDLE_HALTPOLL to allow
cpuidle-haltpoll to be selected.
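
For reference, with arch_haltpoll_want() only allowing force loading on
arm64, the driver can be exercised for testing like so (illustrative
invocation, relying on the driver's existing 'force' module parameter):

    # modprobe cpuidle-haltpoll force=1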

Tested-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Misono Tomohiro <misono.tomohiro@fujitsu.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/Kconfig                        |  6 ++++++
 arch/arm64/include/asm/cpuidle_haltpoll.h | 20 ++++++++++++++++++++
 2 files changed, 26 insertions(+)
 create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 43762c68e357..bd00647f6013 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2428,6 +2428,12 @@ config ARCH_HIBERNATION_HEADER
 config ARCH_SUSPEND_POSSIBLE
 	def_bool y
 
+config ARCH_CPUIDLE_HALTPOLL
+	bool "Enable selection of the cpuidle-haltpoll driver"
+	help
+	  cpuidle-haltpoll allows for adaptive polling based on
+	  current load before entering the idle state.
+
 endmenu # "Power management options"
 
 menu "CPU Power Management"
diff --git a/arch/arm64/include/asm/cpuidle_haltpoll.h b/arch/arm64/include/asm/cpuidle_haltpoll.h
new file mode 100644
index 000000000000..aa01ae9ad5dd
--- /dev/null
+++ b/arch/arm64/include/asm/cpuidle_haltpoll.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _ARCH_HALTPOLL_H
+#define _ARCH_HALTPOLL_H
+
+static inline void arch_haltpoll_enable(unsigned int cpu) { }
+static inline void arch_haltpoll_disable(unsigned int cpu) { }
+
+static inline bool arch_haltpoll_want(bool force)
+{
+	/*
+	 * Enabling haltpoll requires KVM support for arch_haltpoll_enable(),
+	 * arch_haltpoll_disable().
+	 *
+	 * Given that that's missing right now, only allow force loading for
+	 * haltpoll.
+	 */
+	return force;
+}
+#endif
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v9 14/15] arm64/delay: move some constants out to a separate header
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (12 preceding siblings ...)
  2024-11-07 19:08 ` [PATCH v9 13/15] arm64: support cpuidle-haltpoll Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2024-11-08  2:25   ` Christoph Lameter (Ampere)
  2024-11-07 19:08 ` [RFC PATCH v9 15/15] arm64: support WFET in smp_cond_relaxed_timeout() Ankur Arora
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

Move some constants and functions related to xloops and cycles
computation out to a new header.

No functional change.
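
As a worked example (assuming a 100 MHz arch timer, so that
loops_per_jiffy * HZ == 100,000,000):

    NSECS_TO_CYCLES(1000) = ((1000 * 0x5) * 100000000) >> 32
                          ~= 116 cycles

slightly more than the exact 100 cycles, because the 0x5 multiplier
(2^32 / 10^9, rounded up) errs on the side of delaying at least as
long as requested.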

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/delay-const.h | 25 +++++++++++++++++++++++++
 arch/arm64/lib/delay.c               | 13 +++----------
 2 files changed, 28 insertions(+), 10 deletions(-)
 create mode 100644 arch/arm64/include/asm/delay-const.h

diff --git a/arch/arm64/include/asm/delay-const.h b/arch/arm64/include/asm/delay-const.h
new file mode 100644
index 000000000000..63fb5fc24a90
--- /dev/null
+++ b/arch/arm64/include/asm/delay-const.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_DELAY_CONST_H
+#define _ASM_DELAY_CONST_H
+
+#include <asm/param.h>	/* For HZ */
+
+/* 2**32 / 1000000 (rounded up) */
+#define __usecs_to_xloops_mult	0x10C7UL
+
+/* 2**32 / 1000000000 (rounded up) */
+#define __nsecs_to_xloops_mult	0x5UL
+
+extern unsigned long loops_per_jiffy;
+static inline unsigned long xloops_to_cycles(unsigned long xloops)
+{
+	return (xloops * loops_per_jiffy * HZ) >> 32;
+}
+
+#define USECS_TO_CYCLES(time_usecs) \
+	xloops_to_cycles((time_usecs) * __usecs_to_xloops_mult)
+
+#define NSECS_TO_CYCLES(time_nsecs) \
+	xloops_to_cycles((time_nsecs) * __nsecs_to_xloops_mult)
+
+#endif	/* _ASM_DELAY_CONST_H */
diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index cb2062e7e234..511b5597e2a5 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -12,17 +12,10 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/timex.h>
+#include <asm/delay-const.h>
 
 #include <clocksource/arm_arch_timer.h>
 
-#define USECS_TO_CYCLES(time_usecs)			\
-	xloops_to_cycles((time_usecs) * 0x10C7UL)
-
-static inline unsigned long xloops_to_cycles(unsigned long xloops)
-{
-	return (xloops * loops_per_jiffy * HZ) >> 32;
-}
-
 void __delay(unsigned long cycles)
 {
 	cycles_t start = get_cycles();
@@ -58,12 +51,12 @@ EXPORT_SYMBOL(__const_udelay);
 
 void __udelay(unsigned long usecs)
 {
-	__const_udelay(usecs * 0x10C7UL); /* 2**32 / 1000000 (rounded up) */
+	__const_udelay(usecs * __usecs_to_xloops_mult);
 }
 EXPORT_SYMBOL(__udelay);
 
 void __ndelay(unsigned long nsecs)
 {
-	__const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
+	__const_udelay(nsecs * __nsecs_to_xloops_mult);
 }
 EXPORT_SYMBOL(__ndelay);
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC PATCH v9 15/15] arm64: support WFET in smp_cond_relaxed_timeout()
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (13 preceding siblings ...)
  2024-11-07 19:08 ` [RFC PATCH v9 14/15] arm64/delay: move some constants out to a separate header Ankur Arora
@ 2024-11-07 19:08 ` Ankur Arora
  2025-01-07  5:23 ` [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
  2025-01-20 21:13 ` Ankur Arora
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-07 19:08 UTC (permalink / raw)
  To: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch
  Cc: catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

Support a WFET based implementation of the waited variant of
smp_cond_load_relaxed_timeout().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/barrier.h | 12 ++++++++----
 arch/arm64/include/asm/cmpxchg.h | 26 +++++++++++++++++---------
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index ab2515ecd6ca..6fcec5c12c4d 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -12,6 +12,7 @@
 #include <linux/kasan-checks.h>
 
 #include <asm/alternative-macros.h>
+#include <asm/delay-const.h>
 
 #define __nops(n)	".rept	" #n "\nnop\n.endr\n"
 #define nops(n)		asm volatile(__nops(n))
@@ -198,7 +199,7 @@ do {									\
 		VAL = READ_ONCE(*__PTR);				\
 		if (cond_expr)						\
 			break;						\
-		__cmpwait_relaxed(__PTR, VAL);				\
+		__cmpwait_relaxed(__PTR, VAL, ~0UL);			\
 	}								\
 	(typeof(*ptr))VAL;						\
 })
@@ -211,7 +212,7 @@ do {									\
 		VAL = smp_load_acquire(__PTR);				\
 		if (cond_expr)						\
 			break;						\
-		__cmpwait_relaxed(__PTR, VAL);				\
+		__cmpwait_relaxed(__PTR, VAL, ~0UL);			\
 	}								\
 	(typeof(*ptr))VAL;						\
 })
@@ -241,11 +242,13 @@ do {									\
 ({									\
 	typeof(ptr) __PTR = (ptr);					\
 	__unqual_scalar_typeof(*ptr) VAL;				\
+	const unsigned long __time_limit_cycles =			\
+					NSECS_TO_CYCLES(time_limit_ns);	\
 	for (;;) {							\
 		VAL = READ_ONCE(*__PTR);				\
 		if (cond_expr)						\
 			break;						\
-		__cmpwait_relaxed(__PTR, VAL);				\
+		__cmpwait_relaxed(__PTR, VAL, __time_limit_cycles);	\
 		if ((time_expr_ns) >= time_limit_ns)			\
 			break;						\
 	}								\
@@ -257,7 +260,8 @@ do {									\
 ({									\
 	__unqual_scalar_typeof(*ptr) _val;				\
 									\
-	int __wfe = arch_timer_evtstrm_available();			\
+	int __wfe = arch_timer_evtstrm_available() ||			\
+	           alternative_has_cap_unlikely(ARM64_HAS_WFXT);	\
 	if (likely(__wfe))						\
 		_val = __smp_cond_load_timeout_wait(ptr, cond_expr,	\
 						   time_expr_ns,	\
diff --git a/arch/arm64/include/asm/cmpxchg.h b/arch/arm64/include/asm/cmpxchg.h
index d7a540736741..bb842dab5d0e 100644
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -210,7 +210,8 @@ __CMPXCHG_GEN(_mb)
 
 #define __CMPWAIT_CASE(w, sfx, sz)					\
 static inline void __cmpwait_case_##sz(volatile void *ptr,		\
-				       unsigned long val)		\
+				       unsigned long val,		\
+				       unsigned long time_limit_cycles)	\
 {									\
 	unsigned long tmp;						\
 									\
@@ -220,10 +221,12 @@ static inline void __cmpwait_case_##sz(volatile void *ptr,		\
 	"	ldxr" #sfx "\t%" #w "[tmp], %[v]\n"			\
 	"	eor	%" #w "[tmp], %" #w "[tmp], %" #w "[val]\n"	\
 	"	cbnz	%" #w "[tmp], 1f\n"				\
-	"	wfe\n"							\
+	ALTERNATIVE("wfe\n",						\
+		    "msr s0_3_c1_c0_0, %[time_limit_cycles]\n",		\
+		    ARM64_HAS_WFXT)					\
 	"1:"								\
 	: [tmp] "=&r" (tmp), [v] "+Q" (*(u##sz *)ptr)			\
-	: [val] "r" (val));						\
+	: [val] "r" (val), [time_limit_cycles] "r" (time_limit_cycles));\
 }
 
 __CMPWAIT_CASE(w, b, 8);
@@ -236,17 +239,22 @@ __CMPWAIT_CASE( ,  , 64);
 #define __CMPWAIT_GEN(sfx)						\
 static __always_inline void __cmpwait##sfx(volatile void *ptr,		\
 				  unsigned long val,			\
+				  unsigned long time_limit_cycles,	\
 				  int size)				\
 {									\
 	switch (size) {							\
 	case 1:								\
-		return __cmpwait_case##sfx##_8(ptr, (u8)val);		\
+		return __cmpwait_case##sfx##_8(ptr, (u8)val,		\
+					       time_limit_cycles);	\
 	case 2:								\
-		return __cmpwait_case##sfx##_16(ptr, (u16)val);		\
+		return __cmpwait_case##sfx##_16(ptr, (u16)val,		\
+						time_limit_cycles);	\
 	case 4:								\
-		return __cmpwait_case##sfx##_32(ptr, val);		\
+		return __cmpwait_case##sfx##_32(ptr, val,		\
+						time_limit_cycles);	\
 	case 8:								\
-		return __cmpwait_case##sfx##_64(ptr, val);		\
+		return __cmpwait_case##sfx##_64(ptr, val,		\
+						time_limit_cycles);	\
 	default:							\
 		BUILD_BUG();						\
 	}								\
@@ -258,7 +266,7 @@ __CMPWAIT_GEN()
 
 #undef __CMPWAIT_GEN
 
-#define __cmpwait_relaxed(ptr, val) \
-	__cmpwait((ptr), (unsigned long)(val), sizeof(*(ptr)))
+#define __cmpwait_relaxed(ptr, val, time_limit_cycles) \
+	__cmpwait((ptr), (unsigned long)(val), time_limit_cycles, sizeof(*(ptr)))
 
 #endif	/* __ASM_CMPXCHG_H */
-- 
2.43.5




* Re: [RFC PATCH v9 14/15] arm64/delay: move some constants out to a separate header
  2024-11-07 19:08 ` [RFC PATCH v9 14/15] arm64/delay: move some constants out to a separate header Ankur Arora
@ 2024-11-08  2:25   ` Christoph Lameter (Ampere)
  2024-11-08  7:49     ` Ankur Arora
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-08  2:25 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

On Thu, 7 Nov 2024, Ankur Arora wrote:

> Moves some constants and functions related to xloops, cycles computation
> out to a new header.

Constants are correct...

Reviewed-by: Christoph Lameter <cl@linux.com>




* Re: [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout()
  2024-11-07 19:08 ` [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout() Ankur Arora
@ 2024-11-08  2:33   ` Christoph Lameter (Ampere)
  2024-11-08  7:53     ` Ankur Arora
  2024-11-26  5:01   ` Ankur Arora
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-08  2:33 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

On Thu, 7 Nov 2024, Ankur Arora wrote:

> +#ifndef smp_cond_time_check_count
> +/*
> + * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns.
> + * This helps reduce the number of instructions executed while spin-waiting.
> + */
> +#define smp_cond_time_check_count	200
> +#endif

I don't like these loops that execute differently depending on the
hardware. Can we use cycles and ns instead to have defined periods of
time? Later patches establish the infrastructure to convert cycles to
nanoseconds and microseconds. Use that?

> +#ifndef smp_cond_load_relaxed_timeout
> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
> +				      time_limit_ns) ({			\
> +	typeof(ptr) __PTR = (ptr);					\
> +	__unqual_scalar_typeof(*ptr) VAL;				\
> +	unsigned int __count = 0;					\
> +	for (;;) {							\
> +		VAL = READ_ONCE(*__PTR);				\
> +		if (cond_expr)						\
> +			break;						\
> +		cpu_relax();						\
> +		if (__count++ < smp_cond_time_check_count)		\
> +			continue;					\
> +		if ((time_expr_ns) >= time_limit_ns)			\
> +			break;						\

Calling the clock retrieval function repeatedly should be fine and is
typically done in user space as well as in kernel space for functions that
need to wait short time periods.




* Re: [RFC PATCH v9 14/15] arm64/delay: move some constants out to a separate header
  2024-11-08  2:25   ` Christoph Lameter (Ampere)
@ 2024-11-08  7:49     ` Ankur Arora
  0 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-08  7:49 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	linux-arch, catalin.marinas, will, tglx, mingo, bp, dave.hansen,
	x86, hpa, pbonzini, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk


Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Thu, 7 Nov 2024, Ankur Arora wrote:
>
>> Moves some constants and functions related to xloops, cycles computation
>> out to a new header.
>
> Constants are correct...
>
> Reviewed-by: Christoph Lameter <cl@linux.com>

Thanks!

--
ankur



* Re: [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout()
  2024-11-08  2:33   ` Christoph Lameter (Ampere)
@ 2024-11-08  7:53     ` Ankur Arora
  2024-11-08 19:41       ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2024-11-08  7:53 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	linux-arch, catalin.marinas, will, tglx, mingo, bp, dave.hansen,
	x86, hpa, pbonzini, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk


Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Thu, 7 Nov 2024, Ankur Arora wrote:
>
>> +#ifndef smp_cond_time_check_count
>> +/*
>> + * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns.
>> + * This helps reduce the number of instructions executed while spin-waiting.
>> + */
>> +#define smp_cond_time_check_count	200
>> +#endif
>
> I dont like these loops that execute differently depending on the
> hardware. Can we use cycles and ns instead to have defined periods of
> time? Later patches establish the infrastructure to convert cycles to
> nanoseconds and microseconds. Use that?
>
>> +#ifndef smp_cond_load_relaxed_timeout
>> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
>> +				      time_limit_ns) ({			\
>> +	typeof(ptr) __PTR = (ptr);					\
>> +	__unqual_scalar_typeof(*ptr) VAL;				\
>> +	unsigned int __count = 0;					\
>> +	for (;;) {							\
>> +		VAL = READ_ONCE(*__PTR);				\
>> +		if (cond_expr)						\
>> +			break;						\
>> +		cpu_relax();						\
>> +		if (__count++ < smp_cond_time_check_count)		\
>> +			continue;					\
>> +		if ((time_expr_ns) >= time_limit_ns)			\
>> +			break;						\
>
> Calling the clock retrieval function repeatedly should be fine and is
> typically done in user space as well as in kernel space for functions that
> need to wait short time periods.

The problem is that you might have multiple CPUs polling in idle
for prolonged periods of time, so you want to minimize
your power/thermal envelope.

For instance see commit 4dc2375c1a4e "cpuidle: poll_state: Avoid
invoking local_clock() too often" which originally added a similar
rate limit to poll_idle() where they saw exactly that issue.
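
For reference, the loop shape after that commit is roughly the following
(a simplified sketch, not the exact current code; limit here stands in
for the poll time limit):

	/*
	 * Check the clock only once every POLL_IDLE_RELAX_COUNT (200)
	 * iterations so that the polling loop itself stays cheap.
	 */
	while (!need_resched()) {
		cpu_relax();
		if (loop_count++ < POLL_IDLE_RELAX_COUNT)
			continue;

		loop_count = 0;
		if (local_clock() - time_start > limit) {
			dev->poll_time_limit = true;
			break;
		}
	}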

--
ankur



* Re: [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout()
  2024-11-08  7:53     ` Ankur Arora
@ 2024-11-08 19:41       ` Christoph Lameter (Ampere)
  2024-11-08 22:15         ` Ankur Arora
  2024-11-14 17:22         ` Catalin Marinas
  0 siblings, 2 replies; 32+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-08 19:41 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

On Thu, 7 Nov 2024, Ankur Arora wrote:

> > Calling the clock retrieval function repeatedly should be fine and is
> > typically done in user space as well as in kernel space for functions that
> > need to wait short time periods.
>
> The problem is that you might have multiple CPUs polling in idle
> for prolonged periods of time. And, so you want to minimize
> your power/thermal envelope.

On ARM that maps to YIELD which does not do anything for the power
envelope AFAICT. It switches to the other hyperthread.

> For instance see commit 4dc2375c1a4e "cpuidle: poll_state: Avoid
> invoking local_clock() too often" which originally added a similar
> rate limit to poll_idle() where they saw exactly that issue.

Looping w/o calling local_clock may increase the wait period etc.

For power saving, most arches have special instructions like ARM's
WFE/WFET. Wouldn't these give more accurate wait times than the looping
approach?





* Re: [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout()
  2024-11-08 19:41       ` Christoph Lameter (Ampere)
@ 2024-11-08 22:15         ` Ankur Arora
  2024-11-12 16:50           ` Christoph Lameter (Ampere)
  2024-11-14 17:22         ` Catalin Marinas
  1 sibling, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2024-11-08 22:15 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	linux-arch, catalin.marinas, will, tglx, mingo, bp, dave.hansen,
	x86, hpa, pbonzini, vkuznets, rafael, daniel.lezcano, peterz,
	arnd, lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk


Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Thu, 7 Nov 2024, Ankur Arora wrote:
>
>> > Calling the clock retrieval function repeatedly should be fine and is
>> > typically done in user space as well as in kernel space for functions that
>> > need to wait short time periods.
>>
>> The problem is that you might have multiple CPUs polling in idle
>> for prolonged periods of time. And, so you want to minimize
>> your power/thermal envelope.
>
> On ARM that maps to YIELD which does not do anything for the power
> envelope AFAICT. It switches to the other hyperthread.

Agreed. For arm64 patch-5 adds a specialized version.

For the fallback case when we don't have an event stream, the
arm64 version does use the same cpu_relax() loop, but that's
not a production configuration.

>> For instance see commit 4dc2375c1a4e "cpuidle: poll_state: Avoid
>> invoking local_clock() too often" which originally added a similar
>> rate limit to poll_idle() where they saw exactly that issue.
>
> Looping w/o calling local_clock may increase the wait period etc.

Yeah. I don't think that's a real problem for the poll_idle()
case as the only thing waiting on the other side of the possibly
delayed timer is a deeper idle state.

But for any other potential users, the looping duration might be
too long (the generated code for x86 will execute around 200 * 7
instructions before checking the timer, so a worst-case delay of
around 1-2us.)
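
(Spelling that out: 200 iterations of a ~7 instruction loop is roughly
1400 instructions per time check; assuming, very roughly, an instruction
per nanosecond, that's about 1.4us, which is where the 1-2us figure
comes from.)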

I'll note that in the comment around smp_cond_time_check_count
just to warn any future users.

> For power saving most arches have special instructions like ARMS
> WFE/WFET. These are then causing more accurate wait times than the looping
> thing?

Definitely true for WFET. The WFE can still overshoot because the
event stream has a period of 100us.

--
ankur



* Re: [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout()
  2024-11-08 22:15         ` Ankur Arora
@ 2024-11-12 16:50           ` Christoph Lameter (Ampere)
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-12 16:50 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

On Fri, 8 Nov 2024, Ankur Arora wrote:

> > For power saving most arches have special instructions like ARMS
> > WFE/WFET. These are then causing more accurate wait times than the looping
> > thing?
>
> Definitely true for WFET. The WFE can still overshoot because the
> eventstream has a period of 100us.

We can only use the event stream if we need to wait more than 100us.

The rest of the wait period can be covered by a busy loop. Thus we are
accurate.
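
Roughly something like the following sketch (done, deadline_ns, now_ns()
and EVTSTRM_PERIOD_NS are just illustrative placeholders here, not
existing interfaces):

	/*
	 * Sketch only: while the remaining budget exceeds the
	 * event-stream period, wait in WFE (the event stream wakes
	 * us at worst every ~100us); spin with cpu_relax() for the
	 * tail so the overall wait stays accurate.
	 */
	while (!done) {
		s64 remaining = deadline_ns - now_ns();

		if (remaining <= 0)
			break;
		if (remaining > EVTSTRM_PERIOD_NS)
			wfe();
		else
			cpu_relax();
	}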




* Re: [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout()
  2024-11-08 19:41       ` Christoph Lameter (Ampere)
  2024-11-08 22:15         ` Ankur Arora
@ 2024-11-14 17:22         ` Catalin Marinas
  2024-11-15  0:28           ` Ankur Arora
  1 sibling, 1 reply; 32+ messages in thread
From: Catalin Marinas @ 2024-11-14 17:22 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	linux-arch, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

On Fri, Nov 08, 2024 at 11:41:08AM -0800, Christoph Lameter (Ampere) wrote:
> On Thu, 7 Nov 2024, Ankur Arora wrote:
> > > Calling the clock retrieval function repeatedly should be fine and is
> > > typically done in user space as well as in kernel space for functions that
> > > need to wait short time periods.
> >
> > The problem is that you might have multiple CPUs polling in idle
> > for prolonged periods of time. And, so you want to minimize
> > your power/thermal envelope.
> 
> On ARM that maps to YIELD which does not do anything for the power
> envelope AFAICT. It switches to the other hyperthread.

The issue is not necessarily arm64 but poll_idle() on other
architectures like x86 where, at the end of this series, they still call
cpu_relax() in a loop and check local_clock() every 200 iterations or
so. So I wouldn't want to revert the improvement in 4dc2375c1a4e
("cpuidle: poll_state: Avoid invoking local_clock() too often").

I agree that the 200 iterations here are pretty random; it was
something made up for poll_idle() specifically and it could increase the
wait period in other situations (or on other architectures).

OTOH, I'm not sure we want to make this API too complex if the only
user for a while would be poll_idle(). We could add a comment that the
timeout granularity can be pretty coarse and architecture dependent (200
cpu_relax() calls in one deployment, 100us on arm64 with WFE).

-- 
Catalin



* Re: [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout()
  2024-11-14 17:22         ` Catalin Marinas
@ 2024-11-15  0:28           ` Ankur Arora
  0 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-11-15  0:28 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Christoph Lameter (Ampere), Ankur Arora, linux-pm, kvm,
	linux-arm-kernel, linux-kernel, linux-arch, will, tglx, mingo, bp,
	dave.hansen, x86, hpa, pbonzini, vkuznets, rafael, daniel.lezcano,
	peterz, arnd, lenb, mark.rutland, harisokn, mtosatti,
	sudeep.holla, maz, misono.tomohiro, maobibo, zhenglifeng1,
	joao.m.martins, boris.ostrovsky, konrad.wilk


Catalin Marinas <catalin.marinas@arm.com> writes:

> On Fri, Nov 08, 2024 at 11:41:08AM -0800, Christoph Lameter (Ampere) wrote:
>> On Thu, 7 Nov 2024, Ankur Arora wrote:
>> > > Calling the clock retrieval function repeatedly should be fine and is
>> > > typically done in user space as well as in kernel space for functions that
>> > > need to wait short time periods.
>> >
>> > The problem is that you might have multiple CPUs polling in idle
>> > for prolonged periods of time. And, so you want to minimize
>> > your power/thermal envelope.
>>
>> On ARM that maps to YIELD which does not do anything for the power
>> envelope AFAICT. It switches to the other hyperthread.
>
> The issue is not necessarily arm64 but poll_idle() on other
> architectures like x86 where, at the end of this series, they still call
> cpu_relax() in a loop and check local_clock() every 200 times or so
> iterations. So I wouldn't want to revert the improvement in 4dc2375c1a4e
> ("cpuidle: poll_state: Avoid invoking local_clock() too often").
>
> I agree that the 200 iterations here it's pretty random and it was
> something made up for poll_idle() specifically and it could increase the
> wait period in other situations (or other architectures).
>
> OTOH, I'm not sure we want to make this API too complex if the only
> user for a while would be poll_idle(). We could add a comment that the
> timeout granularity can be pretty coarse and architecture dependent (200
> cpu_relax() calls in one deployment, 100us on arm64 with WFE).

Yeah, agreed. Not worth over-engineering this interface, at least not
until there are other users. For now I'll just add a comment mentioning
that the time check is only coarse-grained and architecture dependent.

--
ankur



* Re: [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout()
  2024-11-07 19:08 ` [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout() Ankur Arora
  2024-11-08  2:33   ` Christoph Lameter (Ampere)
@ 2024-11-26  5:01   ` Ankur Arora
  2024-11-26 10:36     ` Catalin Marinas
  1 sibling, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2024-11-26  5:01 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk


Ankur Arora <ankur.a.arora@oracle.com> writes:

> Add a timed variant of smp_cond_load_relaxed().
>
> This is useful because arm64 supports polling on a conditional variable
> by directly waiting on the cacheline instead of spin waiting for the
> condition to change.
>
> However, an implementation such as this has a problem that it can block
> forever -- unless there's an explicit timeout or another out-of-band
> mechanism which allows it to come out of the wait state periodically.
>
> smp_cond_load_relaxed_timeout() supports these semantics by specifying
> a time-check expression and an associated time-limit.
>
> However, note that for the generic spin-wait implementation we want to
> minimize the numbers of instructions executed in each iteration. So,
> limit how often we evaluate the time-check expression by doing it once
> every smp_cond_time_check_count.
>
> The inner loop in poll_idle() has a substantially similar structure
> and constraints as smp_cond_load_relaxed_timeout(), so define
> smp_cond_time_check_count to the same value used in poll_idle().
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  include/asm-generic/barrier.h | 42 +++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
>
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..77726ef807e4 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -273,6 +273,48 @@ do {									\
>  })
>  #endif
>
> +#ifndef smp_cond_time_check_count
> +/*
> + * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns.
> + * This helps reduce the number of instructions executed while spin-waiting.
> + */
> +#define smp_cond_time_check_count	200
> +#endif
> +
> +/**
> + * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
> + * guarantees until a timeout expires.
> + * @ptr: pointer to the variable to wait on
> + * @cond: boolean expression to wait for
> + * @time_expr_ns: evaluates to the current time
> + * @time_limit_ns: compared against time_expr_ns
> + *
> + * Equivalent to using READ_ONCE() on the condition variable.
> + *
> + * Due to C lacking lambda expressions we load the value of *ptr into a
> + * pre-named variable @VAL to be used in @cond.

Based on the review comments so far I'm planning to add the following
text to this comment:

  Note that in the generic version the time check is done only coarsely
  to minimize instructions executed while spin-waiting.

  Architecture specific variations might also have their own timeout
  granularity.
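
For context, a caller ends up looking something like this (ti, MASK and
deadline_ns are just illustrative names, not the poll_idle() conversion
itself):

	/*
	 * Illustrative use only: load ti->flags until a bit in MASK
	 * shows up, or until local_clock() reaches deadline_ns. VAL
	 * names the loaded value inside the condition, as documented
	 * above.
	 */
	flags = smp_cond_load_relaxed_timeout(&ti->flags,
					      VAL & MASK,
					      local_clock(),
					      deadline_ns);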

Meanwhile, would appreciate more reviews.

Thanks
Ankur

> + */
> +#ifndef smp_cond_load_relaxed_timeout
> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
> +				      time_limit_ns) ({			\
> +	typeof(ptr) __PTR = (ptr);					\
> +	__unqual_scalar_typeof(*ptr) VAL;				\
> +	unsigned int __count = 0;					\
> +	for (;;) {							\
> +		VAL = READ_ONCE(*__PTR);				\
> +		if (cond_expr)						\
> +			break;						\
> +		cpu_relax();						\
> +		if (__count++ < smp_cond_time_check_count)		\
> +			continue;					\
> +		if ((time_expr_ns) >= time_limit_ns)			\
> +			break;						\
> +		__count = 0;						\
> +	}								\
> +	(typeof(*ptr))VAL;						\
> +})
> +#endif
> +
>  /*
>   * pmem_wmb() ensures that all stores for which the modification
>   * are written to persistent storage by preceding instructions have



* Re: [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout()
  2024-11-26  5:01   ` Ankur Arora
@ 2024-11-26 10:36     ` Catalin Marinas
  0 siblings, 0 replies; 32+ messages in thread
From: Catalin Marinas @ 2024-11-26 10:36 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch, will,
	tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini, vkuznets,
	rafael, daniel.lezcano, peterz, arnd, lenb, mark.rutland,
	harisokn, mtosatti, sudeep.holla, cl, maz, misono.tomohiro,
	maobibo, zhenglifeng1, joao.m.martins, boris.ostrovsky,
	konrad.wilk

On Mon, Nov 25, 2024 at 09:01:56PM -0800, Ankur Arora wrote:
> Ankur Arora <ankur.a.arora@oracle.com> writes:
> > +/**
> > + * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
> > + * guarantees until a timeout expires.
> > + * @ptr: pointer to the variable to wait on
> > + * @cond: boolean expression to wait for
> > + * @time_expr_ns: evaluates to the current time
> > + * @time_limit_ns: compared against time_expr_ns
> > + *
> > + * Equivalent to using READ_ONCE() on the condition variable.
> > + *
> > + * Due to C lacking lambda expressions we load the value of *ptr into a
> > + * pre-named variable @VAL to be used in @cond.
> 
> Based on the review comments so far I'm planning to add the following
> text to this comment:
> 
>   Note that in the generic version the time check is done only coarsely
>   to minimize instructions executed while spin-waiting.
> 
>   Architecture specific variations might also have their own timeout
>   granularity.

Looks good.

> Meanwhile, would appreciate more reviews.

It's the middle of the merge window; usually not much review happens
unless the patches are fixes/regressions.

-- 
Catalin



* Re: [PATCH v9 05/15] arm64: barrier: add support for smp_cond_relaxed_timeout()
  2024-11-07 19:08 ` [PATCH v9 05/15] arm64: barrier: add support for smp_cond_relaxed_timeout() Ankur Arora
@ 2024-12-10 13:50   ` Will Deacon
  2024-12-10 20:14     ` Ankur Arora
  0 siblings, 1 reply; 32+ messages in thread
From: Will Deacon @ 2024-12-10 13:50 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch,
	catalin.marinas, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

On Thu, Nov 07, 2024 at 11:08:08AM -0800, Ankur Arora wrote:
> Support a waited variant of polling on a conditional variable
> via smp_cond_relaxed_timeout().
> 
> This uses the __cmpwait_relaxed() primitive to do the actual
> waiting, when the wait can be guaranteed to not block forever
> (in case there are no stores to the waited for cacheline.)
> For this we depend on the availability of the event-stream.
> 
> For cases when the event-stream is unavailable, we fallback to
> a spin-waited implementation which is identical to the generic
> variant.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/arm64/include/asm/barrier.h | 54 ++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 1ca947d5c939..ab2515ecd6ca 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -216,6 +216,60 @@ do {									\
>  	(typeof(*ptr))VAL;						\
>  })
>  
> +#define __smp_cond_load_timeout_spin(ptr, cond_expr,			\
> +				     time_expr_ns, time_limit_ns)	\
> +({									\
> +	typeof(ptr) __PTR = (ptr);					\
> +	__unqual_scalar_typeof(*ptr) VAL;				\
> +	unsigned int __count = 0;					\
> +	for (;;) {							\
> +		VAL = READ_ONCE(*__PTR);				\
> +		if (cond_expr)						\
> +			break;						\
> +		cpu_relax();						\
> +		if (__count++ < smp_cond_time_check_count)		\
> +			continue;					\
> +		if ((time_expr_ns) >= time_limit_ns)			\
> +			break;						\
> +		__count = 0;						\
> +	}								\
> +	(typeof(*ptr))VAL;						\
> +})

This is a carbon-copy of the asm-generic timeout implementation. Please
can you avoid duplicating that in the arch code?

Will



* Re: [PATCH v9 05/15] arm64: barrier: add support for smp_cond_relaxed_timeout()
  2024-12-10 13:50   ` Will Deacon
@ 2024-12-10 20:14     ` Ankur Arora
  0 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2024-12-10 20:14 UTC (permalink / raw)
  To: Will Deacon
  Cc: Ankur Arora, linux-pm, kvm, linux-arm-kernel, linux-kernel,
	linux-arch, catalin.marinas, tglx, mingo, bp, dave.hansen, x86,
	hpa, pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd,
	lenb, mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk


Will Deacon <will@kernel.org> writes:

> On Thu, Nov 07, 2024 at 11:08:08AM -0800, Ankur Arora wrote:
>> Support a waited variant of polling on a conditional variable
>> via smp_cond_relaxed_timeout().
>>
>> This uses the __cmpwait_relaxed() primitive to do the actual
>> waiting, when the wait can be guaranteed to not block forever
>> (in case there are no stores to the waited for cacheline.)
>> For this we depend on the availability of the event-stream.
>>
>> For cases when the event-stream is unavailable, we fallback to
>> a spin-waited implementation which is identical to the generic
>> variant.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  arch/arm64/include/asm/barrier.h | 54 ++++++++++++++++++++++++++++++++
>>  1 file changed, 54 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
>> index 1ca947d5c939..ab2515ecd6ca 100644
>> --- a/arch/arm64/include/asm/barrier.h
>> +++ b/arch/arm64/include/asm/barrier.h
>> @@ -216,6 +216,60 @@ do {									\
>>  	(typeof(*ptr))VAL;						\
>>  })
>>
>> +#define __smp_cond_load_timeout_spin(ptr, cond_expr,			\
>> +				     time_expr_ns, time_limit_ns)	\
>> +({									\
>> +	typeof(ptr) __PTR = (ptr);					\
>> +	__unqual_scalar_typeof(*ptr) VAL;				\
>> +	unsigned int __count = 0;					\
>> +	for (;;) {							\
>> +		VAL = READ_ONCE(*__PTR);				\
>> +		if (cond_expr)						\
>> +			break;						\
>> +		cpu_relax();						\
>> +		if (__count++ < smp_cond_time_check_count)		\
>> +			continue;					\
>> +		if ((time_expr_ns) >= time_limit_ns)			\
>> +			break;						\
>> +		__count = 0;						\
>> +	}								\
>> +	(typeof(*ptr))VAL;						\
>> +})
>
> This is a carbon-copy of the asm-generic timeout implementation. Please
> can you avoid duplicating that in the arch code?

Yeah I realized a bit late that I could avoid the duplication quite
simply. Will fix.

Thanks

--
ankur



* Re: [PATCH v9 00/15] arm64: support poll_idle()
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (14 preceding siblings ...)
  2024-11-07 19:08 ` [RFC PATCH v9 15/15] arm64: support WFET in smp_cond_relaxed_timeout() Ankur Arora
@ 2025-01-07  5:23 ` Ankur Arora
  2025-01-20 21:13 ` Ankur Arora
  16 siblings, 0 replies; 32+ messages in thread
From: Ankur Arora @ 2025-01-07  5:23 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk


Just a ping on this to see if there are any further comments.

Thanks
Ankur

Ankur Arora <ankur.a.arora@oracle.com> writes:

> This patchset adds support for polling in idle via poll_idle() on
> arm64.
>
> There are two main changes in this version:
>
> 1. rework the series to take Catalin Marinas' comments on the semantics
>    of smp_cond_load_relaxed() (and how earlier versions of this
>    series were abusing them) into account.
>
>    This also allows dropping of the somewhat strained connections
>    between haltpoll and the event-stream.
>
> 2. earlier versions of this series were adding support for poll_idle()
>    but only using it in the haltpoll driver. Add Lifeng's patch to
>    broaden it out by also polling in acpi-idle.
>
> The benefit of polling in idle is to reduce the cost of remote wakeups.
> When enabled, these can be done just by setting the need-resched bit,
> instead of sending an IPI, and incurring the cost of handling the
> interrupt on the receiver side. When running on a VM it also saves
> the cost of WFE trapping (when enabled.)
>
> Comparing sched-pipe performance on a guest VM:
>
> # perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
>   perf bench sched pipe -l 1000000 -c 4
>
> # no polling in idle
>
>  Performance counter stats for 'CPU(s) 4,5' (5 runs):
>
>          25,229.57 msec task-clock                       #    2.000 CPUs utilized               ( +-  7.75% )
>     45,821,250,284      cycles                           #    1.816 GHz                         ( +- 10.07% )
>     26,557,496,665      instructions                     #    0.58  insn per cycle              ( +-  0.21% )
>                  0      sched:sched_wake_idle_without_ipi #    0.000 /sec
>
>             12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )
>
>
> # polling in idle (with haltpoll):
>
>  Performance counter stats for 'CPU(s) 4,5' (5 runs):
>
>          15,131.58 msec task-clock                       #    2.000 CPUs utilized               ( +- 10.00% )
>     34,158,188,839      cycles                           #    2.257 GHz                         ( +-  6.91% )
>     20,824,950,916      instructions                     #    0.61  insn per cycle              ( +-  0.09% )
>          1,983,822      sched:sched_wake_idle_without_ipi #  131.105 K/sec                       ( +-  0.78% )
>
>              7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )
>
> Tomohiro Misono and Haris Okanovic also report similar latency
> improvements on Grace and Graviton systems (for v7) [1] [2].
> Lifeng also reports improved context switch latency on a bare-metal
> machine with acpi-idle [3].
>
> The series is in four parts:
>
>  - patches 1-4,
>
>     "asm-generic: add barrier smp_cond_load_relaxed_timeout()"
>     "cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout()"
>     "cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL"
>     "Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig"
>
>    add smp_cond_load_relaxed_timeout() and switch poll_idle() to
>    using it. Also, do some munging of related kconfig options.
>
>  - patches 5-7,
>
>     "arm64: barrier: add support for smp_cond_relaxed_timeout()"
>     "arm64: define TIF_POLLING_NRFLAG"
>     "arm64: add support for polling in idle"
>
>    add support for the new barrier, the polling flag and enable
>    poll_idle() support.
>
>  - patches 8, 9-13,
>
>     "ACPI: processor_idle: Support polling state for LPI"
>
>     "cpuidle-haltpoll: define arch_haltpoll_want()"
>     "governors/haltpoll: drop kvm_para_available() check"
>     "cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL"
>     "arm64: idle: export arch_cpu_idle"
>     "arm64: support cpuidle-haltpoll"
>
>     add support for polling via acpi-idle, and cpuidle-haltpoll.
>
>   - patches 14, 15,
>      "arm64/delay: move some constants out to a separate header"
>      "arm64: support WFET in smp_cond_relaxed_timeout()"
>
>     are RFC patches to enable WFET support.
>
> Changelog:
>
> v9:
>
>  - reworked the series to address a comment from Catalin Marinas
>    about how v8 was abusing semantics of smp_cond_load_relaxed().
>  - add poll_idle() support in acpi-idle (Lifeng Zheng)
>  - dropped some earlier "Tested-by", "Reviewed-by" due to the
>    above rework.
>
> v8: No logic changes. Largely respin of v7, with changes
> noted below:
>
>  - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
>    own patch.
>    (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
>
>  - address comments simplifying arm64 support (Will Deacon)
>    (patch-11 "arm64: support cpuidle-haltpoll")
>
> v7: No significant logic changes. Mostly a respin of v6.
>
>  - minor cleanup in poll_idle() (Christoph Lameter)
>  - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
>    (Tomohiro Misono)
>
> v6:
>
>  - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
>    changes together (comment from Christoph Lameter)
>  - threshes out the commit messages a bit more (comments from Christoph
>    Lameter, Sudeep Holla)
>  - also rework selection of cpuidle-haltpoll. Now selected based
>    on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>  - moved back to arch_haltpoll_want() (comment from Joao Martins)
>    Also, arch_haltpoll_want() now takes the force parameter and is
>    now responsible for the complete selection (or not) of haltpoll.
>  - fixes the build breakage on i386
>  - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
>    Tomohiro Misono, Haris Okanovic)
>
> v5:
>  - rework the poll_idle() loop around smp_cond_load_relaxed() (review
>    comment from Tomohiro Misono.)
>  - also rework selection of cpuidle-haltpoll. Now selected based
>    on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>  - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
>    arm64 now depends on the event-stream being enabled.
>  - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
>  - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
>
> v4 changes from v3:
>  - change 7/8 per Rafael input: drop the parens and use ret for the final check
>  - add 8/8 which renames the guard for building poll_state
>
> v3 changes from v2:
>  - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
>  - add Ack-by from Rafael Wysocki on 2/7
>
> v2 changes from v1:
>  - added patch 7 where we change cpu_relax with smp_cond_load_relaxed per PeterZ
>    (this improves by 50% at least the CPU cycles consumed in the tests above:
>    10,716,881,137 now vs 14,503,014,257 before)
>  - removed the ifdef from patch 1 per RafaelW
>
> Please review.
>
> [1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
> [2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
> [3] https://lore.kernel.org/lkml/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/
>
> Ankur Arora (10):
>   asm-generic: add barrier smp_cond_load_relaxed_timeout()
>   cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout()
>   cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
>   arm64: barrier: add support for smp_cond_relaxed_timeout()
>   arm64: add support for polling in idle
>   cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
>   arm64: idle: export arch_cpu_idle
>   arm64: support cpuidle-haltpoll
>   arm64/delay: move some constants out to a separate header
>   arm64: support WFET in smp_cond_relaxed_timeout()
>
> Joao Martins (4):
>   Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
>   arm64: define TIF_POLLING_NRFLAG
>   cpuidle-haltpoll: define arch_haltpoll_want()
>   governors/haltpoll: drop kvm_para_available() check
>
> Lifeng Zheng (1):
>   ACPI: processor_idle: Support polling state for LPI
>
>  arch/Kconfig                              |  3 ++
>  arch/arm64/Kconfig                        |  7 +++
>  arch/arm64/include/asm/barrier.h          | 62 ++++++++++++++++++++++-
>  arch/arm64/include/asm/cmpxchg.h          | 26 ++++++----
>  arch/arm64/include/asm/cpuidle_haltpoll.h | 20 ++++++++
>  arch/arm64/include/asm/delay-const.h      | 25 +++++++++
>  arch/arm64/include/asm/thread_info.h      |  2 +
>  arch/arm64/kernel/idle.c                  |  1 +
>  arch/arm64/lib/delay.c                    | 13 ++---
>  arch/x86/Kconfig                          |  5 +-
>  arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
>  arch/x86/kernel/kvm.c                     | 13 +++++
>  drivers/acpi/processor_idle.c             | 43 +++++++++++++---
>  drivers/cpuidle/Kconfig                   |  5 +-
>  drivers/cpuidle/Makefile                  |  2 +-
>  drivers/cpuidle/cpuidle-haltpoll.c        | 12 +----
>  drivers/cpuidle/governors/haltpoll.c      |  6 +--
>  drivers/cpuidle/poll_state.c              | 27 +++-------
>  drivers/idle/Kconfig                      |  1 +
>  include/asm-generic/barrier.h             | 42 +++++++++++++++
>  include/linux/cpuidle.h                   |  2 +-
>  include/linux/cpuidle_haltpoll.h          |  5 ++
>  22 files changed, 252 insertions(+), 71 deletions(-)
>  create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
>  create mode 100644 arch/arm64/include/asm/delay-const.h


--
ankur



* Re: [PATCH v9 00/15] arm64: support poll_idle()
  2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
                   ` (15 preceding siblings ...)
  2025-01-07  5:23 ` [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
@ 2025-01-20 21:13 ` Ankur Arora
  2025-01-21  9:55   ` Will Deacon
  16 siblings, 1 reply; 32+ messages in thread
From: Ankur Arora @ 2025-01-20 21:13 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch,
	catalin.marinas, will, tglx, mingo, bp, dave.hansen, x86, hpa,
	pbonzini, vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk


Ankur Arora <ankur.a.arora@oracle.com> writes:

> This patchset adds support for polling in idle via poll_idle() on
> arm64.
>
> There are two main changes in this version:
>
> 1. rework the series to take Catalin Marinas' comments on the semantics
>    of smp_cond_load_relaxed() (and how earlier versions of this
>    series were abusing them) into account.

There was a recent series adding resilient spinlocks which might also
have a use for an smp_cond_load_{acquire,release}_timeout() interface.
(https://lore.kernel.org/lkml/20250107192202.GA36003@noisy.programming.kicks-ass.net/)

So, unless anybody has any objection I'm planning to split out this
series into two parts:
  - adding smp_cond_load_*_timeout() (with an arm64 implementation)
  - arm64 support for poll_idle() and haltpoll


Ankur

>    This also allows dropping of the somewhat strained connections
>    between haltpoll and the event-stream.
>
> 2. earlier versions of this series were adding support for poll_idle()
>    but only using it in the haltpoll driver. Add Lifeng's patch to
>    broaden it out by also polling in acpi-idle.
>
> The benefit of polling in idle is to reduce the cost of remote wakeups.
> When enabled, these can be done just by setting the need-resched bit,
> instead of sending an IPI, and incurring the cost of handling the
> interrupt on the receiver side. When running on a VM it also saves
> the cost of WFE trapping (when enabled.)
>
> Comparing sched-pipe performance on a guest VM:
>
> # perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
>   perf bench sched pipe -l 1000000 -c 4
>
> # no polling in idle
>
>  Performance counter stats for 'CPU(s) 4,5' (5 runs):
>
>          25,229.57 msec task-clock                       #    2.000 CPUs utilized               ( +-  7.75% )
>     45,821,250,284      cycles                           #    1.816 GHz                         ( +- 10.07% )
>     26,557,496,665      instructions                     #    0.58  insn per cycle              ( +-  0.21% )
>                  0      sched:sched_wake_idle_without_ipi #    0.000 /sec
>
>             12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )
>
>
> # polling in idle (with haltpoll):
>
>  Performance counter stats for 'CPU(s) 4,5' (5 runs):
>
>          15,131.58 msec task-clock                       #    2.000 CPUs utilized               ( +- 10.00% )
>     34,158,188,839      cycles                           #    2.257 GHz                         ( +-  6.91% )
>     20,824,950,916      instructions                     #    0.61  insn per cycle              ( +-  0.09% )
>          1,983,822      sched:sched_wake_idle_without_ipi #  131.105 K/sec                       ( +-  0.78% )
>
>              7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )
>
> Tomohiro Misono and Haris Okanovic also report similar latency
> improvements on Grace and Graviton systems (for v7) [1] [2].
> Lifeng also reports improved context switch latency on a bare-metal
> machine with acpi-idle [3].
>
> The series is in four parts:
>
>  - patches 1-4,
>
>     "asm-generic: add barrier smp_cond_load_relaxed_timeout()"
>     "cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout()"
>     "cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL"
>     "Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig"
>
>    add smp_cond_load_relaxed_timeout() and switch poll_idle() to
>    using it. Also, do some munging of related kconfig options.
>
>  - patches 5-7,
>
>     "arm64: barrier: add support for smp_cond_relaxed_timeout()"
>     "arm64: define TIF_POLLING_NRFLAG"
>     "arm64: add support for polling in idle"
>
>    add support for the new barrier, the polling flag and enable
>    poll_idle() support.
>
>  - patches 8, 9-13,
>
>     "ACPI: processor_idle: Support polling state for LPI"
>
>     "cpuidle-haltpoll: define arch_haltpoll_want()"
>     "governors/haltpoll: drop kvm_para_available() check"
>     "cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL"
>     "arm64: idle: export arch_cpu_idle"
>     "arm64: support cpuidle-haltpoll"
>
>     add support for polling via acpi-idle, and cpuidle-haltpoll.
>
>   - patches 14, 15,
>      "arm64/delay: move some constants out to a separate header"
>      "arm64: support WFET in smp_cond_relaxed_timeout()"
>
>     are RFC patches to enable WFET support.
>
> Changelog:
>
> v9:
>
>  - reworked the series to address a comment from Catalin Marinas
>    about how v8 was abusing semantics of smp_cond_load_relaxed().
>  - add poll_idle() support in acpi-idle (Lifeng Zheng)
>  - dropped some earlier "Tested-by", "Reviewed-by" due to the
>    above rework.
>
> v8: No logic changes. Largely respin of v7, with changes
> noted below:
>
>  - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
>    own patch.
>    (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
>
>  - address comments simplifying arm64 support (Will Deacon)
>    (patch-11 "arm64: support cpuidle-haltpoll")
>
> v7: No significant logic changes. Mostly a respin of v6.
>
>  - minor cleanup in poll_idle() (Christoph Lameter)
>  - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
>    (Tomohiro Misono)
>
> v6:
>
>  - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
>    changes together (comment from Christoph Lameter)
>  - threshes out the commit messages a bit more (comments from Christoph
>    Lameter, Sudeep Holla)
>  - also rework selection of cpuidle-haltpoll. Now selected based
>    on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>  - moved back to arch_haltpoll_want() (comment from Joao Martins)
>    Also, arch_haltpoll_want() now takes the force parameter and is
>    now responsible for the complete selection (or not) of haltpoll.
>  - fixes the build breakage on i386
>  - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
>    Tomohiro Misono, Haris Okanovic)
>
> v5:
>  - rework the poll_idle() loop around smp_cond_load_relaxed() (review
>    comment from Tomohiro Misono.)
>  - also rework selection of cpuidle-haltpoll. Now selected based
>    on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>  - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
>    arm64 now depends on the event-stream being enabled.
>  - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
>  - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
>
> v4 changes from v3:
>  - change 7/8 per Rafael input: drop the parens and use ret for the final check
>  - add 8/8 which renames the guard for building poll_state
>
> v3 changes from v2:
>  - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
>  - add Ack-by from Rafael Wysocki on 2/7
>
> v2 changes from v1:
>  - added patch 7 where we change cpu_relax with smp_cond_load_relaxed per PeterZ
>    (this improves by 50% at least the CPU cycles consumed in the tests above:
>    10,716,881,137 now vs 14,503,014,257 before)
>  - removed the ifdef from patch 1 per RafaelW
>
> Please review.
>
> [1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
> [2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
> [3] https://lore.kernel.org/lkml/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/
>
> Ankur Arora (10):
>   asm-generic: add barrier smp_cond_load_relaxed_timeout()
>   cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout()
>   cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
>   arm64: barrier: add support for smp_cond_relaxed_timeout()
>   arm64: add support for polling in idle
>   cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
>   arm64: idle: export arch_cpu_idle
>   arm64: support cpuidle-haltpoll
>   arm64/delay: move some constants out to a separate header
>   arm64: support WFET in smp_cond_relaxed_timeout()
>
> Joao Martins (4):
>   Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
>   arm64: define TIF_POLLING_NRFLAG
>   cpuidle-haltpoll: define arch_haltpoll_want()
>   governors/haltpoll: drop kvm_para_available() check
>
> Lifeng Zheng (1):
>   ACPI: processor_idle: Support polling state for LPI
>
>  arch/Kconfig                              |  3 ++
>  arch/arm64/Kconfig                        |  7 +++
>  arch/arm64/include/asm/barrier.h          | 62 ++++++++++++++++++++++-
>  arch/arm64/include/asm/cmpxchg.h          | 26 ++++++----
>  arch/arm64/include/asm/cpuidle_haltpoll.h | 20 ++++++++
>  arch/arm64/include/asm/delay-const.h      | 25 +++++++++
>  arch/arm64/include/asm/thread_info.h      |  2 +
>  arch/arm64/kernel/idle.c                  |  1 +
>  arch/arm64/lib/delay.c                    | 13 ++---
>  arch/x86/Kconfig                          |  5 +-
>  arch/x86/include/asm/cpuidle_haltpoll.h   |  1 +
>  arch/x86/kernel/kvm.c                     | 13 +++++
>  drivers/acpi/processor_idle.c             | 43 +++++++++++++---
>  drivers/cpuidle/Kconfig                   |  5 +-
>  drivers/cpuidle/Makefile                  |  2 +-
>  drivers/cpuidle/cpuidle-haltpoll.c        | 12 +----
>  drivers/cpuidle/governors/haltpoll.c      |  6 +--
>  drivers/cpuidle/poll_state.c              | 27 +++-------
>  drivers/idle/Kconfig                      |  1 +
>  include/asm-generic/barrier.h             | 42 +++++++++++++++
>  include/linux/cpuidle.h                   |  2 +-
>  include/linux/cpuidle_haltpoll.h          |  5 ++
>  22 files changed, 252 insertions(+), 71 deletions(-)
>  create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
>  create mode 100644 arch/arm64/include/asm/delay-const.h


--
ankur



* Re: [PATCH v9 00/15] arm64: support poll_idle()
  2025-01-20 21:13 ` Ankur Arora
@ 2025-01-21  9:55   ` Will Deacon
  0 siblings, 0 replies; 32+ messages in thread
From: Will Deacon @ 2025-01-21  9:55 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-pm, kvm, linux-arm-kernel, linux-kernel, linux-arch,
	catalin.marinas, tglx, mingo, bp, dave.hansen, x86, hpa, pbonzini,
	vkuznets, rafael, daniel.lezcano, peterz, arnd, lenb,
	mark.rutland, harisokn, mtosatti, sudeep.holla, cl, maz,
	misono.tomohiro, maobibo, zhenglifeng1, joao.m.martins,
	boris.ostrovsky, konrad.wilk

On Mon, Jan 20, 2025 at 01:13:25PM -0800, Ankur Arora wrote:
> 
> Ankur Arora <ankur.a.arora@oracle.com> writes:
> 
> > This patchset adds support for polling in idle via poll_idle() on
> > arm64.
> >
> > There are two main changes in this version:
> >
> > 1. rework the series to take Catalin Marinas' comments on the semantics
> >    of smp_cond_load_relaxed() (and how earlier versions of this
> >    series were abusing them) into account.
> 
> There was a recent series adding resilient spinlocks which might also have
> use for an smp_cond_load_{acquire,release}_timeout() interface.
> (https://lore.kernel.org/lkml/20250107192202.GA36003@noisy.programming.kicks-ass.net/)

Urgh, that reminds me that I need to go look at that...

Will



end of thread

Thread overview: 32+ messages
2024-11-07 19:08 [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
2024-11-07 19:08 ` [PATCH v9 01/15] asm-generic: add barrier smp_cond_load_relaxed_timeout() Ankur Arora
2024-11-08  2:33   ` Christoph Lameter (Ampere)
2024-11-08  7:53     ` Ankur Arora
2024-11-08 19:41       ` Christoph Lameter (Ampere)
2024-11-08 22:15         ` Ankur Arora
2024-11-12 16:50           ` Christoph Lameter (Ampere)
2024-11-14 17:22         ` Catalin Marinas
2024-11-15  0:28           ` Ankur Arora
2024-11-26  5:01   ` Ankur Arora
2024-11-26 10:36     ` Catalin Marinas
2024-11-07 19:08 ` [PATCH v9 02/15] cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout() Ankur Arora
2024-11-07 19:08 ` [PATCH v9 03/15] cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL Ankur Arora
2024-11-07 19:08 ` [PATCH v9 04/15] Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig Ankur Arora
2024-11-07 19:08 ` [PATCH v9 05/15] arm64: barrier: add support for smp_cond_relaxed_timeout() Ankur Arora
2024-12-10 13:50   ` Will Deacon
2024-12-10 20:14     ` Ankur Arora
2024-11-07 19:08 ` [PATCH v9 06/15] arm64: define TIF_POLLING_NRFLAG Ankur Arora
2024-11-07 19:08 ` [PATCH v9 07/15] arm64: add support for polling in idle Ankur Arora
2024-11-07 19:08 ` [PATCH v9 08/15] ACPI: processor_idle: Support polling state for LPI Ankur Arora
2024-11-07 19:08 ` [PATCH v9 09/15] cpuidle-haltpoll: define arch_haltpoll_want() Ankur Arora
2024-11-07 19:08 ` [PATCH v9 10/15] governors/haltpoll: drop kvm_para_available() check Ankur Arora
2024-11-07 19:08 ` [PATCH v9 11/15] cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL Ankur Arora
2024-11-07 19:08 ` [PATCH v9 12/15] arm64: idle: export arch_cpu_idle Ankur Arora
2024-11-07 19:08 ` [PATCH v9 13/15] arm64: support cpuidle-haltpoll Ankur Arora
2024-11-07 19:08 ` [RFC PATCH v9 14/15] arm64/delay: move some constants out to a separate header Ankur Arora
2024-11-08  2:25   ` Christoph Lameter (Ampere)
2024-11-08  7:49     ` Ankur Arora
2024-11-07 19:08 ` [RFC PATCH v9 15/15] arm64: support WFET in smp_cond_relaxed_timeout() Ankur Arora
2025-01-07  5:23 ` [PATCH v9 00/15] arm64: support poll_idle() Ankur Arora
2025-01-20 21:13 ` Ankur Arora
2025-01-21  9:55   ` Will Deacon
