* [PATCH v2 0/6] x86/idle: Multiple MWAIT fixes
@ 2025-07-04 16:34 Andrew Cooper
2025-07-04 16:34 ` [PATCH v2 1/6] x86/idle: Remove broken MWAIT implementation Andrew Cooper
` (5 more replies)
0 siblings, 6 replies; 10+ messages in thread
From: Andrew Cooper @ 2025-07-04 16:34 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
The two main bugs were identified in Linux first, and I've modelled Xen's fix
similarly.
Patches 1-4 want committing together. They do bisect and operate correctly,
but the range takes out an optimisation in order to reimplement it correctly.
https://gitlab.com/xen-project/hardware/xen-staging/-/pipelines/1905583317
Andrew Cooper (6):
x86/idle: Remove broken MWAIT implementation
x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR
xen/softirq: Rework arch_skip_send_event_check() into arch_set_softirq()
x86/idle: Implement a new MWAIT IPI-elision algorithm
x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons"
xen/arch/x86/acpi/cpu_idle.c | 92 +++++++++++---------------
xen/arch/x86/cpu/intel.c | 2 +-
xen/arch/x86/cpu/mwait-idle.c | 8 +--
xen/arch/x86/hpet.c | 2 -
xen/arch/x86/include/asm/cpufeatures.h | 1 +
xen/arch/x86/include/asm/hardirq.h | 21 ++++--
xen/arch/x86/include/asm/mwait.h | 3 -
xen/arch/x86/include/asm/softirq.h | 48 +++++++++++++-
xen/common/softirq.c | 8 +--
xen/include/xen/cpuidle.h | 2 -
xen/include/xen/irq_cpustat.h | 1 -
xen/include/xen/softirq.h | 16 +++++
12 files changed, 124 insertions(+), 80 deletions(-)
--
2.39.5
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v2 1/6] x86/idle: Remove broken MWAIT implementation
2025-07-04 16:34 [PATCH v2 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
@ 2025-07-04 16:34 ` Andrew Cooper
2025-07-04 16:34 ` [PATCH v2 2/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints() Andrew Cooper
` (4 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Andrew Cooper @ 2025-07-04 16:34 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Roger Pau Monné, Jan Beulich, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
cpuidle_wakeup_mwait() is a TOCTOU race. The cpumask_and() sampling
cpuidle_mwait_flags can take a arbitrary period of time, and there's no
guarantee that the target CPUs are still in MWAIT when writing into
mwait_wakeup(cpu).
The consequence of the race is that we'll fail to IPI certain targets. Also,
there's no guarantee that mwait_idle_with_hints() will raise a TIMER_SOFTIRQ
on it's way out.
The fundamental bug is that the "in_mwait" variable needs to be in the
monitored line, and not in a separate cpuidle_mwait_flags variable, in order
to do this in a race-free way.
Arranging to fix this while keeping the old implementation is prohibitive, so
strip the current one out in order to implement the new one cleanly. In the
interim, this causes IPIs to be used unconditionally which is safe albeit
suboptimal.
Fixes: 3d521e933e1b ("cpuidle: mwait on softirq_pending & remove wakeup ipis")
Fixes: 1adb34ea846d ("CPUIDLE: re-implement mwait wakeup process")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
---
xen/arch/x86/acpi/cpu_idle.c | 48 ++++--------------------------
xen/arch/x86/hpet.c | 2 --
xen/arch/x86/include/asm/hardirq.h | 9 +++---
xen/include/xen/cpuidle.h | 2 --
xen/include/xen/irq_cpustat.h | 1 -
5 files changed, 9 insertions(+), 53 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 6c3a10e6fb4e..141f0f0facf6 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -436,27 +436,6 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-/*
- * The bit is set iff cpu use monitor/mwait to enter C state
- * with this flag set, CPU can be waken up from C state
- * by writing to specific memory address, instead of sending an IPI.
- */
-static cpumask_t cpuidle_mwait_flags;
-
-void cpuidle_wakeup_mwait(cpumask_t *mask)
-{
- cpumask_t target;
- unsigned int cpu;
-
- cpumask_and(&target, mask, &cpuidle_mwait_flags);
-
- /* CPU is MWAITing on the cpuidle_mwait_wakeup flag. */
- for_each_cpu(cpu, &target)
- mwait_wakeup(cpu) = 0;
-
- cpumask_andnot(mask, mask, &target);
-}
-
/* Force sending of a wakeup IPI regardless of mwait usage. */
bool __ro_after_init force_mwait_ipi_wakeup;
@@ -465,42 +444,25 @@ bool arch_skip_send_event_check(unsigned int cpu)
if ( force_mwait_ipi_wakeup )
return false;
- /*
- * This relies on softirq_pending() and mwait_wakeup() to access data
- * on the same cache line.
- */
- smp_mb();
- return !!cpumask_test_cpu(cpu, &cpuidle_mwait_flags);
+ return false;
}
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
- s_time_t expires = per_cpu(timer_deadline, cpu);
- const void *monitor_addr = &mwait_wakeup(cpu);
+ const unsigned int *this_softirq_pending = &softirq_pending(cpu);
- monitor(monitor_addr, 0, 0);
+ monitor(this_softirq_pending, 0, 0);
smp_mb();
- /*
- * Timer deadline passing is the event on which we will be woken via
- * cpuidle_mwait_wakeup. So check it now that the location is armed.
- */
- if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
+ if ( !*this_softirq_pending )
{
struct cpu_info *info = get_cpu_info();
- cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
-
spec_ctrl_enter_idle(info);
mwait(eax, ecx);
spec_ctrl_exit_idle(info);
-
- cpumask_clear_cpu(cpu, &cpuidle_mwait_flags);
}
-
- if ( expires <= NOW() && expires > 0 )
- raise_softirq(TIMER_SOFTIRQ);
}
static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
@@ -901,7 +863,7 @@ void cf_check acpi_dead_idle(void)
if ( cx->entry_method == ACPI_CSTATE_EM_FFH )
{
- void *mwait_ptr = &mwait_wakeup(smp_processor_id());
+ void *mwait_ptr = &softirq_pending(smp_processor_id());
/*
* Cache must be flushed as the last operation before sleeping.
diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c
index 1bca8c8b670d..192de433cfd1 100644
--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -188,8 +188,6 @@ static void evt_do_broadcast(cpumask_t *mask)
if ( __cpumask_test_and_clear_cpu(cpu, mask) )
raise_softirq(TIMER_SOFTIRQ);
- cpuidle_wakeup_mwait(mask);
-
if ( !cpumask_empty(mask) )
cpumask_raise_softirq(mask, TIMER_SOFTIRQ);
}
diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
index 342361cb6fdd..f3e93cc9b507 100644
--- a/xen/arch/x86/include/asm/hardirq.h
+++ b/xen/arch/x86/include/asm/hardirq.h
@@ -5,11 +5,10 @@
#include <xen/types.h>
typedef struct {
- unsigned int __softirq_pending;
- unsigned int __local_irq_count;
- unsigned int nmi_count;
- unsigned int mce_count;
- bool __mwait_wakeup;
+ unsigned int __softirq_pending;
+ unsigned int __local_irq_count;
+ unsigned int nmi_count;
+ unsigned int mce_count;
} __cacheline_aligned irq_cpustat_t;
#include <xen/irq_cpustat.h> /* Standard mappings for irq_cpustat_t above */
diff --git a/xen/include/xen/cpuidle.h b/xen/include/xen/cpuidle.h
index 705d0c1135f0..120e354fe340 100644
--- a/xen/include/xen/cpuidle.h
+++ b/xen/include/xen/cpuidle.h
@@ -92,8 +92,6 @@ extern struct cpuidle_governor *cpuidle_current_governor;
bool cpuidle_using_deep_cstate(void);
void cpuidle_disable_deep_cstate(void);
-extern void cpuidle_wakeup_mwait(cpumask_t *mask);
-
#define CPUIDLE_DRIVER_STATE_START 1
extern void menu_get_trace_data(u32 *expected, u32 *pred);
diff --git a/xen/include/xen/irq_cpustat.h b/xen/include/xen/irq_cpustat.h
index b9629f25c266..5f039b4b9a76 100644
--- a/xen/include/xen/irq_cpustat.h
+++ b/xen/include/xen/irq_cpustat.h
@@ -24,6 +24,5 @@ extern irq_cpustat_t irq_stat[];
/* arch independent irq_stat fields */
#define softirq_pending(cpu) __IRQ_STAT((cpu), __softirq_pending)
#define local_irq_count(cpu) __IRQ_STAT((cpu), __local_irq_count)
-#define mwait_wakeup(cpu) __IRQ_STAT((cpu), __mwait_wakeup)
#endif /* __irq_cpustat_h */
--
2.39.5
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v2 2/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
2025-07-04 16:34 [PATCH v2 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
2025-07-04 16:34 ` [PATCH v2 1/6] x86/idle: Remove broken MWAIT implementation Andrew Cooper
@ 2025-07-04 16:34 ` Andrew Cooper
2025-07-04 17:51 ` Roger Pau Monné
2025-07-04 16:34 ` [PATCH v2 3/6] x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR Andrew Cooper
` (3 subsequent siblings)
5 siblings, 1 reply; 10+ messages in thread
From: Andrew Cooper @ 2025-07-04 16:34 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
With the recent simplifications, it becomes obvious that smp_mb() isn't the
right barrier. Strictly speaking, MONITOR is ordered as a load, but smp_rmb()
isn't correct either, as this only pertains to local ordering. All we need is
a compiler barrier().
Merge the barier() into the monitor() itself, along with an explantion.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
v2:
* Move earlier in the series, not that it matters IMO.
* Expand the commit message.
---
xen/arch/x86/acpi/cpu_idle.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 141f0f0facf6..68dd44be5bb0 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -66,8 +66,12 @@ static always_inline void monitor(
alternative_input("", "clflush (%[addr])", X86_BUG_CLFLUSH_MONITOR,
[addr] "a" (addr));
+ /*
+ * The memory clobber is a compiler barrier. Subseqeunt reads from the
+ * monitored cacheline must not be reordered over MONITOR.
+ */
asm volatile ( "monitor"
- :: "a" (addr), "c" (ecx), "d" (edx) );
+ :: "a" (addr), "c" (ecx), "d" (edx) : "memory" );
}
static always_inline void mwait(unsigned int eax, unsigned int ecx)
@@ -453,7 +457,6 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
const unsigned int *this_softirq_pending = &softirq_pending(cpu);
monitor(this_softirq_pending, 0, 0);
- smp_mb();
if ( !*this_softirq_pending )
{
--
2.39.5
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v2 3/6] x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR
2025-07-04 16:34 [PATCH v2 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
2025-07-04 16:34 ` [PATCH v2 1/6] x86/idle: Remove broken MWAIT implementation Andrew Cooper
2025-07-04 16:34 ` [PATCH v2 2/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints() Andrew Cooper
@ 2025-07-04 16:34 ` Andrew Cooper
2025-07-04 17:52 ` Roger Pau Monné
2025-07-04 16:34 ` [PATCH v2 4/6] xen/softirq: Rework arch_skip_send_event_check() into arch_set_softirq() Andrew Cooper
` (2 subsequent siblings)
5 siblings, 1 reply; 10+ messages in thread
From: Andrew Cooper @ 2025-07-04 16:34 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
We're going to want alternative-patch based on it.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
---
xen/arch/x86/acpi/cpu_idle.c | 6 ------
xen/arch/x86/cpu/intel.c | 2 +-
xen/arch/x86/include/asm/cpufeatures.h | 1 +
xen/arch/x86/include/asm/mwait.h | 3 ---
4 files changed, 2 insertions(+), 10 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 68dd44be5bb0..07056a91a29e 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -440,14 +440,8 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-/* Force sending of a wakeup IPI regardless of mwait usage. */
-bool __ro_after_init force_mwait_ipi_wakeup;
-
bool arch_skip_send_event_check(unsigned int cpu)
{
- if ( force_mwait_ipi_wakeup )
- return false;
-
return false;
}
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index 5215b5405c76..f7bd0d777289 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -436,7 +436,7 @@ static void __init probe_mwait_errata(void)
{
printk(XENLOG_WARNING
"Forcing IPI MWAIT wakeup due to CPU erratum\n");
- force_mwait_ipi_wakeup = true;
+ setup_force_cpu_cap(X86_BUG_MONITOR);
}
}
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 84c93292c80c..56231b00f15d 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -53,6 +53,7 @@ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCAL
#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */
#define X86_BUG_IBPB_NO_RET X86_BUG( 3) /* IBPB doesn't flush the RSB/RAS */
#define X86_BUG_CLFLUSH_MONITOR X86_BUG( 4) /* MONITOR requires CLFLUSH */
+#define X86_BUG_MONITOR X86_BUG( 5) /* MONITOR doesn't always notice writes (force IPIs) */
#define X86_SPEC_NO_LFENCE_ENTRY_PV X86_BUG(16) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_PV. */
#define X86_SPEC_NO_LFENCE_ENTRY_INTR X86_BUG(17) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_INTR. */
diff --git a/xen/arch/x86/include/asm/mwait.h b/xen/arch/x86/include/asm/mwait.h
index c52cd3f51011..000a692f6d19 100644
--- a/xen/arch/x86/include/asm/mwait.h
+++ b/xen/arch/x86/include/asm/mwait.h
@@ -13,9 +13,6 @@
#define MWAIT_ECX_INTERRUPT_BREAK 0x1
-/* Force sending of a wakeup IPI regardless of mwait usage. */
-extern bool force_mwait_ipi_wakeup;
-
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx);
#ifdef CONFIG_INTEL
bool mwait_pc10_supported(void);
--
2.39.5
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v2 4/6] xen/softirq: Rework arch_skip_send_event_check() into arch_set_softirq()
2025-07-04 16:34 [PATCH v2 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
` (2 preceding siblings ...)
2025-07-04 16:34 ` [PATCH v2 3/6] x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR Andrew Cooper
@ 2025-07-04 16:34 ` Andrew Cooper
2025-07-04 16:34 ` [PATCH v2 5/6] x86/idle: Implement a new MWAIT IPI-elision algorithm Andrew Cooper
2025-07-04 16:34 ` [PATCH v2 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons" Andrew Cooper
5 siblings, 0 replies; 10+ messages in thread
From: Andrew Cooper @ 2025-07-04 16:34 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Roger Pau Monné, Jan Beulich, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
x86 is the only architecture wanting an optimisation here, but the
test_and_set_bit() is a store into the monitored line (i.e. will wake up the
target) and, prior to the removal of the broken IPI-elision algorithm, was
racy, causing unnecessary IPIs to be sent.
To do this in a race-free way, the store to the monited line needs to also
sample the status of the target in one atomic action. Implement a new arch
helper with different semantics; to make the softirq pending and decide about
IPIs together. For now, implement the default helper. It will be overridden
by x86 in a subsequent change.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
v2:
* Rename new helper to arch_set_softirq()
* Expand commit message
---
xen/arch/x86/acpi/cpu_idle.c | 5 -----
xen/arch/x86/include/asm/softirq.h | 2 --
xen/common/softirq.c | 8 ++------
xen/include/xen/softirq.h | 16 ++++++++++++++++
4 files changed, 18 insertions(+), 13 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 07056a91a29e..4f69df5a7438 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -440,11 +440,6 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-bool arch_skip_send_event_check(unsigned int cpu)
-{
- return false;
-}
-
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
index 415ee866c79d..e4b194f069fb 100644
--- a/xen/arch/x86/include/asm/softirq.h
+++ b/xen/arch/x86/include/asm/softirq.h
@@ -9,6 +9,4 @@
#define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
#define NR_ARCH_SOFTIRQS 5
-bool arch_skip_send_event_check(unsigned int cpu);
-
#endif /* __ASM_SOFTIRQ_H__ */
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 60f344e8425e..dc3aabce3330 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -94,9 +94,7 @@ void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
raise_mask = &per_cpu(batch_mask, this_cpu);
for_each_cpu(cpu, mask)
- if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
- cpu != this_cpu &&
- !arch_skip_send_event_check(cpu) )
+ if ( !arch_set_softirq(nr, cpu) && cpu != this_cpu )
__cpumask_set_cpu(cpu, raise_mask);
if ( raise_mask == &send_mask )
@@ -107,9 +105,7 @@ void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
{
unsigned int this_cpu = smp_processor_id();
- if ( test_and_set_bit(nr, &softirq_pending(cpu))
- || (cpu == this_cpu)
- || arch_skip_send_event_check(cpu) )
+ if ( arch_set_softirq(nr, cpu) || cpu == this_cpu )
return;
if ( !per_cpu(batching, this_cpu) || in_irq() )
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index e9f79ec0ce3e..48f17e49efa1 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -23,6 +23,22 @@ enum {
#define NR_SOFTIRQS (NR_COMMON_SOFTIRQS + NR_ARCH_SOFTIRQS)
+/*
+ * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
+ * skipped, false if the IPI cannot be skipped.
+ */
+#ifndef arch_set_softirq
+static always_inline bool arch_set_softirq(unsigned int nr, unsigned int cpu)
+{
+ /*
+ * Try to set the softirq pending. If we set the bit (i.e. the old bit
+ * was 0), we're responsible to send the IPI. If the softirq was already
+ * pending (i.e. the old bit was 1), no IPI is needed.
+ */
+ return test_and_set_bit(nr, &softirq_pending(cpu));
+}
+#endif
+
typedef void (*softirq_handler)(void);
void do_softirq(void);
--
2.39.5
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v2 5/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-04 16:34 [PATCH v2 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
` (3 preceding siblings ...)
2025-07-04 16:34 ` [PATCH v2 4/6] xen/softirq: Rework arch_skip_send_event_check() into arch_set_softirq() Andrew Cooper
@ 2025-07-04 16:34 ` Andrew Cooper
2025-07-04 17:53 ` Roger Pau Monné
2025-07-04 16:34 ` [PATCH v2 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons" Andrew Cooper
5 siblings, 1 reply; 10+ messages in thread
From: Andrew Cooper @ 2025-07-04 16:34 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
In order elide IPIs, we must be able to identify whether a target CPU is in
MWAIT at the point it is woken up. i.e. the store to wake it up must also
identify the state.
Create a new in_mwait variable beside __softirq_pending, so we can use a
CMPXCHG to set the softirq while also observing the status safely. Implement
an x86 version of arch_pend_softirq() which does this.
In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
advertising in_mwait.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
This is modelled after Linux's TIF_NEED_RESCHED (single bit equivelent of all
of __softirq_pending), and TIF_POLLING_NRFLAG (arch-neutral "in_mwait").
In Linux, they're both in the same flags field, which adds complexity. In
Xen, __softirq_pending is already unsigned long for everything other than x86,
so adding an arch-neutral "in_mwait" would need wider changes.
v2:
* Fix cmpxchg() expression.
* Use BUILD_BUG_ON()s to check opencoding of in_mwait.
TODO: We want to introduce try_cmpxchg() which is better at the C and code-gen
level.
---
xen/arch/x86/acpi/cpu_idle.c | 20 ++++++++++++-
xen/arch/x86/include/asm/hardirq.h | 14 ++++++++-
xen/arch/x86/include/asm/softirq.h | 48 ++++++++++++++++++++++++++++++
3 files changed, 80 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 4f69df5a7438..c5d7a6c6fe2a 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -443,7 +443,21 @@ __initcall(cpu_idle_key_init);
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
- const unsigned int *this_softirq_pending = &softirq_pending(cpu);
+ irq_cpustat_t *stat = &irq_stat[cpu];
+ const unsigned int *this_softirq_pending = &stat->__softirq_pending;
+
+ /*
+ * By setting in_mwait, we promise to other CPUs that we'll notice changes
+ * to __softirq_pending without being sent an IPI. We achieve this by
+ * either not going to sleep, or by having hardware notice on our behalf.
+ *
+ * Some errata exist where MONITOR doesn't work properly, and the
+ * workaround is to force the use of an IPI. Cause this to happen by
+ * simply not advertising ourselves as being in_mwait.
+ */
+ alternative_io("movb $1, %[in_mwait]",
+ "", X86_BUG_MONITOR,
+ [in_mwait] "=m" (stat->in_mwait));
monitor(this_softirq_pending, 0, 0);
@@ -455,6 +469,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
mwait(eax, ecx);
spec_ctrl_exit_idle(info);
}
+
+ alternative_io("movb $0, %[in_mwait]",
+ "", X86_BUG_MONITOR,
+ [in_mwait] "=m" (stat->in_mwait));
}
static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
index f3e93cc9b507..1647cff04dc8 100644
--- a/xen/arch/x86/include/asm/hardirq.h
+++ b/xen/arch/x86/include/asm/hardirq.h
@@ -5,7 +5,19 @@
#include <xen/types.h>
typedef struct {
- unsigned int __softirq_pending;
+ /*
+ * The layout is important. Any CPU can set bits in __softirq_pending,
+ * but in_mwait is a status bit owned by the CPU. softirq_mwait_raw must
+ * cover both, and must be in a single cacheline.
+ */
+ union {
+ struct {
+ unsigned int __softirq_pending;
+ bool in_mwait;
+ };
+ uint64_t softirq_mwait_raw;
+ };
+
unsigned int __local_irq_count;
unsigned int nmi_count;
unsigned int mce_count;
diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
index e4b194f069fb..55b65c9747b1 100644
--- a/xen/arch/x86/include/asm/softirq.h
+++ b/xen/arch/x86/include/asm/softirq.h
@@ -1,6 +1,8 @@
#ifndef __ASM_SOFTIRQ_H__
#define __ASM_SOFTIRQ_H__
+#include <asm/system.h>
+
#define NMI_SOFTIRQ (NR_COMMON_SOFTIRQS + 0)
#define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
#define VCPU_KICK_SOFTIRQ (NR_COMMON_SOFTIRQS + 2)
@@ -9,4 +11,50 @@
#define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
#define NR_ARCH_SOFTIRQS 5
+/*
+ * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
+ * skipped, false if the IPI cannot be skipped.
+ *
+ * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
+ * set softirq @nr while also observing in_mwait in a race-free way.
+ */
+static always_inline bool arch_set_softirq(unsigned int nr, unsigned int cpu)
+{
+ uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
+ uint64_t prev, old, new;
+ unsigned int softirq = 1U << nr;
+
+ old = ACCESS_ONCE(*ptr);
+
+ for ( ;; )
+ {
+ if ( old & softirq )
+ /* Softirq already pending, nothing to do. */
+ return true;
+
+ new = old | softirq;
+
+ prev = cmpxchg(ptr, old, new);
+ if ( prev == old )
+ break;
+
+ old = prev;
+ }
+
+ /*
+ * We have caused the softirq to become pending. If in_mwait was set, the
+ * target CPU will notice the modification and act on it.
+ *
+ * We can't access the in_mwait field nicely, so use some BUILD_BUG_ON()'s
+ * to cross-check the (1UL << 32) opencoding.
+ */
+ BUILD_BUG_ON(sizeof(irq_stat[0].softirq_mwait_raw) != 8);
+ BUILD_BUG_ON((offsetof(irq_cpustat_t, in_mwait) -
+ offsetof(irq_cpustat_t, softirq_mwait_raw)) != 4);
+
+ return new & (1UL << 32) /* in_mwait */;
+
+}
+#define arch_set_softirq arch_set_softirq
+
#endif /* __ASM_SOFTIRQ_H__ */
--
2.39.5
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH v2 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons"
2025-07-04 16:34 [PATCH v2 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
` (4 preceding siblings ...)
2025-07-04 16:34 ` [PATCH v2 5/6] x86/idle: Implement a new MWAIT IPI-elision algorithm Andrew Cooper
@ 2025-07-04 16:34 ` Andrew Cooper
5 siblings, 0 replies; 10+ messages in thread
From: Andrew Cooper @ 2025-07-04 16:34 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Jan Beulich, Roger Pau Monné,
Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini
The check of this_softirq_pending must be performed with irqs disabled, but
this property was broken by an attempt to optimise entry/exit latency.
Commit c227233ad64c in Linux (which we copied into Xen) was fixed up by
edc8fc01f608 in Linux, which we have so far missed.
Going to sleep without waking on interrupts is nonsensical outside of
play_dead(), so overload this to select between two possible MWAITs, the
second using the STI shadow to cover MWAIT for exactly the same reason as we
do in safe_halt().
Fixes: b17e0ec72ede ("x86/mwait-idle: enable interrupts before C1 on Xeons")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
---
xen/arch/x86/acpi/cpu_idle.c | 16 +++++++++++++++-
xen/arch/x86/cpu/mwait-idle.c | 8 ++------
2 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index c5d7a6c6fe2a..e50a9234a0d4 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -80,6 +80,13 @@ static always_inline void mwait(unsigned int eax, unsigned int ecx)
:: "a" (eax), "c" (ecx) );
}
+static always_inline void sti_mwait_cli(unsigned int eax, unsigned int ecx)
+{
+ /* STI shadow covers MWAIT. */
+ asm volatile ( "sti; mwait; cli"
+ :: "a" (eax), "c" (ecx) );
+}
+
#define GET_HW_RES_IN_NS(msr, val) \
do { rdmsrl(msr, val); val = tsc_ticks2ns(val); } while( 0 )
#define GET_MC6_RES(val) GET_HW_RES_IN_NS(0x664, val)
@@ -461,12 +468,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
monitor(this_softirq_pending, 0, 0);
+ ASSERT(!local_irq_is_enabled());
+
if ( !*this_softirq_pending )
{
struct cpu_info *info = get_cpu_info();
spec_ctrl_enter_idle(info);
- mwait(eax, ecx);
+
+ if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
+ mwait(eax, ecx);
+ else
+ sti_mwait_cli(eax, ecx);
+
spec_ctrl_exit_idle(info);
}
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index 5c16f5ad3a82..5e98011bfd0c 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -946,12 +946,8 @@ static void cf_check mwait_idle(void)
update_last_cx_stat(power, cx, before);
- if (cx->irq_enable_early)
- local_irq_enable();
-
- mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
-
- local_irq_disable();
+ mwait_idle_with_hints(cx->address,
+ cx->irq_enable_early ? 0 : MWAIT_ECX_INTERRUPT_BREAK);
after = alternative_call(cpuidle_get_tick);
--
2.39.5
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v2 2/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
2025-07-04 16:34 ` [PATCH v2 2/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints() Andrew Cooper
@ 2025-07-04 17:51 ` Roger Pau Monné
0 siblings, 0 replies; 10+ messages in thread
From: Roger Pau Monné @ 2025-07-04 17:51 UTC (permalink / raw)
To: Andrew Cooper
Cc: Xen-devel, Jan Beulich, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On Fri, Jul 04, 2025 at 05:34:06PM +0100, Andrew Cooper wrote:
> With the recent simplifications, it becomes obvious that smp_mb() isn't the
> right barrier. Strictly speaking, MONITOR is ordered as a load, but smp_rmb()
> isn't correct either, as this only pertains to local ordering. All we need is
> a compiler barrier().
>
> Merge the barier() into the monitor() itself, along with an explantion.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Thanks, Roger.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 3/6] x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR
2025-07-04 16:34 ` [PATCH v2 3/6] x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR Andrew Cooper
@ 2025-07-04 17:52 ` Roger Pau Monné
0 siblings, 0 replies; 10+ messages in thread
From: Roger Pau Monné @ 2025-07-04 17:52 UTC (permalink / raw)
To: Andrew Cooper
Cc: Xen-devel, Jan Beulich, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On Fri, Jul 04, 2025 at 05:34:07PM +0100, Andrew Cooper wrote:
> We're going to want alternative-patch based on it.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Thanks, Roger.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 5/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-04 16:34 ` [PATCH v2 5/6] x86/idle: Implement a new MWAIT IPI-elision algorithm Andrew Cooper
@ 2025-07-04 17:53 ` Roger Pau Monné
0 siblings, 0 replies; 10+ messages in thread
From: Roger Pau Monné @ 2025-07-04 17:53 UTC (permalink / raw)
To: Andrew Cooper
Cc: Xen-devel, Jan Beulich, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On Fri, Jul 04, 2025 at 05:34:09PM +0100, Andrew Cooper wrote:
> In order elide IPIs, we must be able to identify whether a target CPU is in
> MWAIT at the point it is woken up. i.e. the store to wake it up must also
> identify the state.
>
> Create a new in_mwait variable beside __softirq_pending, so we can use a
> CMPXCHG to set the softirq while also observing the status safely. Implement
> an x86 version of arch_pend_softirq() which does this.
>
> In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
> precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
> advertising in_mwait.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Thanks, Roger.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-07-04 17:53 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-04 16:34 [PATCH v2 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
2025-07-04 16:34 ` [PATCH v2 1/6] x86/idle: Remove broken MWAIT implementation Andrew Cooper
2025-07-04 16:34 ` [PATCH v2 2/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints() Andrew Cooper
2025-07-04 17:51 ` Roger Pau Monné
2025-07-04 16:34 ` [PATCH v2 3/6] x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR Andrew Cooper
2025-07-04 17:52 ` Roger Pau Monné
2025-07-04 16:34 ` [PATCH v2 4/6] xen/softirq: Rework arch_skip_send_event_check() into arch_set_softirq() Andrew Cooper
2025-07-04 16:34 ` [PATCH v2 5/6] x86/idle: Implement a new MWAIT IPI-elision algorithm Andrew Cooper
2025-07-04 17:53 ` Roger Pau Monné
2025-07-04 16:34 ` [PATCH v2 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons" Andrew Cooper
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.