* [PATCH 0/6] x86/idle: Multiple MWAIT fixes
@ 2025-07-02 14:41 Andrew Cooper
2025-07-02 14:41 ` [PATCH 1/6] x86/idle: Remove broken MWAIT implementation Andrew Cooper
` (5 more replies)
0 siblings, 6 replies; 35+ messages in thread
From: Andrew Cooper @ 2025-07-02 14:41 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
The two main bugs were identified in Linux first, and I've modelled Xen's fix
similarly.
Patches 1-4 want committing together. They do bisect and operate correctly,
but the range takes out an optimisation in order to reimplement it correctly.
https://gitlab.com/xen-project/hardware/xen-staging/-/pipelines/1902584433
Andrew Cooper (6):
x86/idle: Remove broken MWAIT implementation
x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR
xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq()
x86/idle: Implement a new MWAIT IPI-elision algorithm
x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons"
xen/arch/x86/acpi/cpu_idle.c | 92 +++++++++++---------------
xen/arch/x86/cpu/intel.c | 2 +-
xen/arch/x86/cpu/mwait-idle.c | 8 +--
xen/arch/x86/hpet.c | 2 -
xen/arch/x86/include/asm/cpufeatures.h | 1 +
xen/arch/x86/include/asm/hardirq.h | 21 ++++--
xen/arch/x86/include/asm/mwait.h | 3 -
xen/arch/x86/include/asm/softirq.h | 34 +++++++++-
xen/common/softirq.c | 8 +--
xen/include/xen/cpuidle.h | 2 -
xen/include/xen/irq_cpustat.h | 1 -
xen/include/xen/softirq.h | 16 +++++
12 files changed, 110 insertions(+), 80 deletions(-)
--
2.39.5
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 1/6] x86/idle: Remove broken MWAIT implementation
2025-07-02 14:41 [PATCH 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
@ 2025-07-02 14:41 ` Andrew Cooper
2025-07-03 16:01 ` Roger Pau Monné
2025-07-02 14:41 ` [PATCH 2/6] x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR Andrew Cooper
` (4 subsequent siblings)
5 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2025-07-02 14:41 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
cpuidle_wakeup_mwait() is a TOCTOU race. The cpumask_and() sampling
cpuidle_mwait_flags can take a arbitrary period of time, and there's no
guarantee that the target CPUs are still in MWAIT when writing into
mwait_wakeup(cpu).
The consequence of the race is that we'll fail to IPI targets. Also, there's
no guarantee that mwait_idle_with_hints() will raise a TIMER_SOFTIRQ on it's
way out.
The fundamental bug is that the "in_mwait" variable needs to be in the
monitored line in order to do this in a race-free way.
Arranging for this while keeping the old implementation is prohibitive, so
strip it out in order to implement the new one cleanly. In the interim, this
causes IPIs to be used unconditionally which is safe albeit suboptimal.
Fixes: 3d521e933e1b ("cpuidle: mwait on softirq_pending & remove wakeup ipis")
Fixes: 1adb34ea846d ("CPUIDLE: re-implement mwait wakeup process")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
---
xen/arch/x86/acpi/cpu_idle.c | 48 ++++--------------------------
xen/arch/x86/hpet.c | 2 --
xen/arch/x86/include/asm/hardirq.h | 9 +++---
xen/include/xen/cpuidle.h | 2 --
xen/include/xen/irq_cpustat.h | 1 -
5 files changed, 9 insertions(+), 53 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 6c3a10e6fb4e..141f0f0facf6 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -436,27 +436,6 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-/*
- * The bit is set iff cpu use monitor/mwait to enter C state
- * with this flag set, CPU can be waken up from C state
- * by writing to specific memory address, instead of sending an IPI.
- */
-static cpumask_t cpuidle_mwait_flags;
-
-void cpuidle_wakeup_mwait(cpumask_t *mask)
-{
- cpumask_t target;
- unsigned int cpu;
-
- cpumask_and(&target, mask, &cpuidle_mwait_flags);
-
- /* CPU is MWAITing on the cpuidle_mwait_wakeup flag. */
- for_each_cpu(cpu, &target)
- mwait_wakeup(cpu) = 0;
-
- cpumask_andnot(mask, mask, &target);
-}
-
/* Force sending of a wakeup IPI regardless of mwait usage. */
bool __ro_after_init force_mwait_ipi_wakeup;
@@ -465,42 +444,25 @@ bool arch_skip_send_event_check(unsigned int cpu)
if ( force_mwait_ipi_wakeup )
return false;
- /*
- * This relies on softirq_pending() and mwait_wakeup() to access data
- * on the same cache line.
- */
- smp_mb();
- return !!cpumask_test_cpu(cpu, &cpuidle_mwait_flags);
+ return false;
}
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
- s_time_t expires = per_cpu(timer_deadline, cpu);
- const void *monitor_addr = &mwait_wakeup(cpu);
+ const unsigned int *this_softirq_pending = &softirq_pending(cpu);
- monitor(monitor_addr, 0, 0);
+ monitor(this_softirq_pending, 0, 0);
smp_mb();
- /*
- * Timer deadline passing is the event on which we will be woken via
- * cpuidle_mwait_wakeup. So check it now that the location is armed.
- */
- if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
+ if ( !*this_softirq_pending )
{
struct cpu_info *info = get_cpu_info();
- cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
-
spec_ctrl_enter_idle(info);
mwait(eax, ecx);
spec_ctrl_exit_idle(info);
-
- cpumask_clear_cpu(cpu, &cpuidle_mwait_flags);
}
-
- if ( expires <= NOW() && expires > 0 )
- raise_softirq(TIMER_SOFTIRQ);
}
static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
@@ -901,7 +863,7 @@ void cf_check acpi_dead_idle(void)
if ( cx->entry_method == ACPI_CSTATE_EM_FFH )
{
- void *mwait_ptr = &mwait_wakeup(smp_processor_id());
+ void *mwait_ptr = &softirq_pending(smp_processor_id());
/*
* Cache must be flushed as the last operation before sleeping.
diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c
index 1bca8c8b670d..192de433cfd1 100644
--- a/xen/arch/x86/hpet.c
+++ b/xen/arch/x86/hpet.c
@@ -188,8 +188,6 @@ static void evt_do_broadcast(cpumask_t *mask)
if ( __cpumask_test_and_clear_cpu(cpu, mask) )
raise_softirq(TIMER_SOFTIRQ);
- cpuidle_wakeup_mwait(mask);
-
if ( !cpumask_empty(mask) )
cpumask_raise_softirq(mask, TIMER_SOFTIRQ);
}
diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
index 342361cb6fdd..f3e93cc9b507 100644
--- a/xen/arch/x86/include/asm/hardirq.h
+++ b/xen/arch/x86/include/asm/hardirq.h
@@ -5,11 +5,10 @@
#include <xen/types.h>
typedef struct {
- unsigned int __softirq_pending;
- unsigned int __local_irq_count;
- unsigned int nmi_count;
- unsigned int mce_count;
- bool __mwait_wakeup;
+ unsigned int __softirq_pending;
+ unsigned int __local_irq_count;
+ unsigned int nmi_count;
+ unsigned int mce_count;
} __cacheline_aligned irq_cpustat_t;
#include <xen/irq_cpustat.h> /* Standard mappings for irq_cpustat_t above */
diff --git a/xen/include/xen/cpuidle.h b/xen/include/xen/cpuidle.h
index 705d0c1135f0..120e354fe340 100644
--- a/xen/include/xen/cpuidle.h
+++ b/xen/include/xen/cpuidle.h
@@ -92,8 +92,6 @@ extern struct cpuidle_governor *cpuidle_current_governor;
bool cpuidle_using_deep_cstate(void);
void cpuidle_disable_deep_cstate(void);
-extern void cpuidle_wakeup_mwait(cpumask_t *mask);
-
#define CPUIDLE_DRIVER_STATE_START 1
extern void menu_get_trace_data(u32 *expected, u32 *pred);
diff --git a/xen/include/xen/irq_cpustat.h b/xen/include/xen/irq_cpustat.h
index b9629f25c266..5f039b4b9a76 100644
--- a/xen/include/xen/irq_cpustat.h
+++ b/xen/include/xen/irq_cpustat.h
@@ -24,6 +24,5 @@ extern irq_cpustat_t irq_stat[];
/* arch independent irq_stat fields */
#define softirq_pending(cpu) __IRQ_STAT((cpu), __softirq_pending)
#define local_irq_count(cpu) __IRQ_STAT((cpu), __local_irq_count)
-#define mwait_wakeup(cpu) __IRQ_STAT((cpu), __mwait_wakeup)
#endif /* __irq_cpustat_h */
--
2.39.5
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 2/6] x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR
2025-07-02 14:41 [PATCH 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
2025-07-02 14:41 ` [PATCH 1/6] x86/idle: Remove broken MWAIT implementation Andrew Cooper
@ 2025-07-02 14:41 ` Andrew Cooper
2025-07-02 14:41 ` [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq() Andrew Cooper
` (3 subsequent siblings)
5 siblings, 0 replies; 35+ messages in thread
From: Andrew Cooper @ 2025-07-02 14:41 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
We're going to want alternative-patch based on it.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
---
xen/arch/x86/acpi/cpu_idle.c | 6 ------
xen/arch/x86/cpu/intel.c | 2 +-
xen/arch/x86/include/asm/cpufeatures.h | 1 +
xen/arch/x86/include/asm/mwait.h | 3 ---
4 files changed, 2 insertions(+), 10 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 141f0f0facf6..dbcb80d81db9 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -436,14 +436,8 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-/* Force sending of a wakeup IPI regardless of mwait usage. */
-bool __ro_after_init force_mwait_ipi_wakeup;
-
bool arch_skip_send_event_check(unsigned int cpu)
{
- if ( force_mwait_ipi_wakeup )
- return false;
-
return false;
}
diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c
index 5215b5405c76..f7bd0d777289 100644
--- a/xen/arch/x86/cpu/intel.c
+++ b/xen/arch/x86/cpu/intel.c
@@ -436,7 +436,7 @@ static void __init probe_mwait_errata(void)
{
printk(XENLOG_WARNING
"Forcing IPI MWAIT wakeup due to CPU erratum\n");
- force_mwait_ipi_wakeup = true;
+ setup_force_cpu_cap(X86_BUG_MONITOR);
}
}
diff --git a/xen/arch/x86/include/asm/cpufeatures.h b/xen/arch/x86/include/asm/cpufeatures.h
index 84c93292c80c..56231b00f15d 100644
--- a/xen/arch/x86/include/asm/cpufeatures.h
+++ b/xen/arch/x86/include/asm/cpufeatures.h
@@ -53,6 +53,7 @@ XEN_CPUFEATURE(USE_VMCALL, X86_SYNTH(30)) /* Use VMCALL instead of VMMCAL
#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */
#define X86_BUG_IBPB_NO_RET X86_BUG( 3) /* IBPB doesn't flush the RSB/RAS */
#define X86_BUG_CLFLUSH_MONITOR X86_BUG( 4) /* MONITOR requires CLFLUSH */
+#define X86_BUG_MONITOR X86_BUG( 5) /* MONITOR doesn't always notice writes (force IPIs) */
#define X86_SPEC_NO_LFENCE_ENTRY_PV X86_BUG(16) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_PV. */
#define X86_SPEC_NO_LFENCE_ENTRY_INTR X86_BUG(17) /* (No) safety LFENCE for SPEC_CTRL_ENTRY_INTR. */
diff --git a/xen/arch/x86/include/asm/mwait.h b/xen/arch/x86/include/asm/mwait.h
index c52cd3f51011..000a692f6d19 100644
--- a/xen/arch/x86/include/asm/mwait.h
+++ b/xen/arch/x86/include/asm/mwait.h
@@ -13,9 +13,6 @@
#define MWAIT_ECX_INTERRUPT_BREAK 0x1
-/* Force sending of a wakeup IPI regardless of mwait usage. */
-extern bool force_mwait_ipi_wakeup;
-
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx);
#ifdef CONFIG_INTEL
bool mwait_pc10_supported(void);
--
2.39.5
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq()
2025-07-02 14:41 [PATCH 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
2025-07-02 14:41 ` [PATCH 1/6] x86/idle: Remove broken MWAIT implementation Andrew Cooper
2025-07-02 14:41 ` [PATCH 2/6] x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR Andrew Cooper
@ 2025-07-02 14:41 ` Andrew Cooper
2025-07-03 8:11 ` Jan Beulich
2025-07-03 16:21 ` Roger Pau Monné
2025-07-02 14:41 ` [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm Andrew Cooper
` (2 subsequent siblings)
5 siblings, 2 replies; 35+ messages in thread
From: Andrew Cooper @ 2025-07-02 14:41 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
x86 is the only architecture wanting an optimisation here, but the
test_and_set_bit() is a store into the monitored line and is racy with
determining whether an IPI can be skipped.
Instead, implement a new arch helper with different semantics; to make the
softirq pending and decide about IPIs together. For now, implement the
default helper. It will be overridden by x86 in a subsequent change.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
---
xen/arch/x86/acpi/cpu_idle.c | 5 -----
xen/arch/x86/include/asm/softirq.h | 2 --
xen/common/softirq.c | 8 ++------
xen/include/xen/softirq.h | 16 ++++++++++++++++
4 files changed, 18 insertions(+), 13 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index dbcb80d81db9..caa0fef0b3b1 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -436,11 +436,6 @@ static int __init cf_check cpu_idle_key_init(void)
}
__initcall(cpu_idle_key_init);
-bool arch_skip_send_event_check(unsigned int cpu)
-{
- return false;
-}
-
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
index 415ee866c79d..e4b194f069fb 100644
--- a/xen/arch/x86/include/asm/softirq.h
+++ b/xen/arch/x86/include/asm/softirq.h
@@ -9,6 +9,4 @@
#define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
#define NR_ARCH_SOFTIRQS 5
-bool arch_skip_send_event_check(unsigned int cpu);
-
#endif /* __ASM_SOFTIRQ_H__ */
diff --git a/xen/common/softirq.c b/xen/common/softirq.c
index 60f344e8425e..0368011f95d1 100644
--- a/xen/common/softirq.c
+++ b/xen/common/softirq.c
@@ -94,9 +94,7 @@ void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
raise_mask = &per_cpu(batch_mask, this_cpu);
for_each_cpu(cpu, mask)
- if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
- cpu != this_cpu &&
- !arch_skip_send_event_check(cpu) )
+ if ( !arch_pend_softirq(nr, cpu) && cpu != this_cpu )
__cpumask_set_cpu(cpu, raise_mask);
if ( raise_mask == &send_mask )
@@ -107,9 +105,7 @@ void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
{
unsigned int this_cpu = smp_processor_id();
- if ( test_and_set_bit(nr, &softirq_pending(cpu))
- || (cpu == this_cpu)
- || arch_skip_send_event_check(cpu) )
+ if ( arch_pend_softirq(nr, cpu) || cpu == this_cpu )
return;
if ( !per_cpu(batching, this_cpu) || in_irq() )
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index e9f79ec0ce3e..0814c8e5cd41 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -23,6 +23,22 @@ enum {
#define NR_SOFTIRQS (NR_COMMON_SOFTIRQS + NR_ARCH_SOFTIRQS)
+/*
+ * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
+ * skipped, false if the IPI cannot be skipped.
+ */
+#ifndef arch_pend_softirq
+static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
+{
+ /*
+ * Try to set the softirq pending. If we set the bit (i.e. the old bit
+ * was 0), we're responsible to send the IPI. If the softirq was already
+ * pending (i.e. the old bit was 1), no IPI is needed.
+ */
+ return test_and_set_bit(nr, &softirq_pending(cpu));
+}
+#endif
+
typedef void (*softirq_handler)(void);
void do_softirq(void);
--
2.39.5
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-02 14:41 [PATCH 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
` (2 preceding siblings ...)
2025-07-02 14:41 ` [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq() Andrew Cooper
@ 2025-07-02 14:41 ` Andrew Cooper
2025-07-03 9:01 ` Jan Beulich
` (2 more replies)
2025-07-02 14:41 ` [PATCH 5/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints() Andrew Cooper
2025-07-02 14:41 ` [PATCH 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons" Andrew Cooper
5 siblings, 3 replies; 35+ messages in thread
From: Andrew Cooper @ 2025-07-02 14:41 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
In order elide IPIs, we must be able to identify whether a target CPU is in
MWAIT at the point it is woken up. i.e. the store to wake it up must also
identify the state.
Create a new in_mwait variable beside __softirq_pending, so we can use a
CMPXCHG to set the softirq while also observing the status safely. Implement
an x86 version of arch_pend_softirq() which does this.
In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
advertising in_mwait.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
This is modelled after Linux's TIF_NEED_RESCHED (single bit equivelent of all
of __softirq_pending), and TIF_POLLING_NRFLAG (arch-neutral "in_mwait").
In Linux, they're both in the same flags field, which adds complexity. In
Xen, __softirq_pending is already unsigned long for everything other than x86,
so adding an arch-neutral "in_mwait" would need wider changes.
---
xen/arch/x86/acpi/cpu_idle.c | 20 +++++++++++++++++-
xen/arch/x86/include/asm/hardirq.h | 14 +++++++++++-
xen/arch/x86/include/asm/softirq.h | 34 ++++++++++++++++++++++++++++++
3 files changed, 66 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index caa0fef0b3b1..13040ef467a0 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -439,7 +439,21 @@ __initcall(cpu_idle_key_init);
void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
{
unsigned int cpu = smp_processor_id();
- const unsigned int *this_softirq_pending = &softirq_pending(cpu);
+ irq_cpustat_t *stat = &irq_stat[cpu];
+ const unsigned int *this_softirq_pending = &stat->__softirq_pending;
+
+ /*
+ * By setting in_mwait, we promise to other CPUs that we'll notice changes
+ * to __softirq_pending without being sent an IPI. We achieve this by
+ * either not going to sleep, or by having hardware notice on our behalf.
+ *
+ * Some errata exist where MONITOR doesn't work properly, and the
+ * workaround is to force the use of an IPI. Cause this to happen by
+ * simply not advertising outselves as being in_mwait.
+ */
+ alternative_io("movb $1, %[in_mwait]",
+ "", X86_BUG_MONITOR,
+ [in_mwait] "=m" (stat->in_mwait));
monitor(this_softirq_pending, 0, 0);
smp_mb();
@@ -452,6 +466,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
mwait(eax, ecx);
spec_ctrl_exit_idle(info);
}
+
+ alternative_io("movb $0, %[in_mwait]",
+ "", X86_BUG_MONITOR,
+ [in_mwait] "=m" (stat->in_mwait));
}
static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
index f3e93cc9b507..1647cff04dc8 100644
--- a/xen/arch/x86/include/asm/hardirq.h
+++ b/xen/arch/x86/include/asm/hardirq.h
@@ -5,7 +5,19 @@
#include <xen/types.h>
typedef struct {
- unsigned int __softirq_pending;
+ /*
+ * The layout is important. Any CPU can set bits in __softirq_pending,
+ * but in_mwait is a status bit owned by the CPU. softirq_mwait_raw must
+ * cover both, and must be in a single cacheline.
+ */
+ union {
+ struct {
+ unsigned int __softirq_pending;
+ bool in_mwait;
+ };
+ uint64_t softirq_mwait_raw;
+ };
+
unsigned int __local_irq_count;
unsigned int nmi_count;
unsigned int mce_count;
diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
index e4b194f069fb..069e5716a68d 100644
--- a/xen/arch/x86/include/asm/softirq.h
+++ b/xen/arch/x86/include/asm/softirq.h
@@ -1,6 +1,8 @@
#ifndef __ASM_SOFTIRQ_H__
#define __ASM_SOFTIRQ_H__
+#include <asm/system.h>
+
#define NMI_SOFTIRQ (NR_COMMON_SOFTIRQS + 0)
#define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
#define VCPU_KICK_SOFTIRQ (NR_COMMON_SOFTIRQS + 2)
@@ -9,4 +11,36 @@
#define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
#define NR_ARCH_SOFTIRQS 5
+/*
+ * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
+ * skipped, false if the IPI cannot be skipped.
+ *
+ * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
+ * set softirq @nr while also observing in_mwait in a race-free way.
+ */
+static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
+{
+ uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
+ uint64_t old, new;
+ unsigned int softirq = 1U << nr;
+
+ old = ACCESS_ONCE(*ptr);
+
+ do {
+ if ( old & softirq )
+ /* Softirq already pending, nothing to do. */
+ return true;
+
+ new = old | softirq;
+
+ } while ( (old = cmpxchg(ptr, old, new)) != new );
+
+ /*
+ * We have caused the softirq to become pending. If in_mwait was set, the
+ * target CPU will notice the modification and act on it.
+ */
+ return new & (1UL << 32);
+}
+#define arch_pend_softirq arch_pend_softirq
+
#endif /* __ASM_SOFTIRQ_H__ */
--
2.39.5
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 5/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
2025-07-02 14:41 [PATCH 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
` (3 preceding siblings ...)
2025-07-02 14:41 ` [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm Andrew Cooper
@ 2025-07-02 14:41 ` Andrew Cooper
2025-07-03 9:24 ` Jan Beulich
2025-07-02 14:41 ` [PATCH 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons" Andrew Cooper
5 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2025-07-02 14:41 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
With the recent simplifications, it becomes obvious that smp_mb() isn't the
right barrier; all we need is a compiler barrier.
Include this in monitor() itself, along with an explantion.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
---
xen/arch/x86/acpi/cpu_idle.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 13040ef467a0..de020dfee87f 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -66,8 +66,12 @@ static always_inline void monitor(
alternative_input("", "clflush (%[addr])", X86_BUG_CLFLUSH_MONITOR,
[addr] "a" (addr));
+ /*
+ * The memory clobber is a compiler barrier. Subseqeunt reads from the
+ * monitored cacheline must not be hoisted over MONITOR.
+ */
asm volatile ( "monitor"
- :: "a" (addr), "c" (ecx), "d" (edx) );
+ :: "a" (addr), "c" (ecx), "d" (edx) : "memory" );
}
static always_inline void mwait(unsigned int eax, unsigned int ecx)
@@ -456,7 +460,6 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
[in_mwait] "=m" (stat->in_mwait));
monitor(this_softirq_pending, 0, 0);
- smp_mb();
if ( !*this_softirq_pending )
{
--
2.39.5
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons"
2025-07-02 14:41 [PATCH 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
` (4 preceding siblings ...)
2025-07-02 14:41 ` [PATCH 5/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints() Andrew Cooper
@ 2025-07-02 14:41 ` Andrew Cooper
2025-07-03 9:35 ` Jan Beulich
2025-07-03 9:43 ` Jan Beulich
5 siblings, 2 replies; 35+ messages in thread
From: Andrew Cooper @ 2025-07-02 14:41 UTC (permalink / raw)
To: Xen-devel
Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné, Anthony PERARD,
Michal Orzel, Julien Grall, Stefano Stabellini
The check of this_softirq_pending must be performed with irqs disabled, but
this property was broken by an attempt to optimise entry/exit latency.
Commit c227233ad64c in Linux (which we copied into Xen) was fixed up by
edc8fc01f608 in Linux, which we have so far missed.
Going to sleep without waking on interrupts is nonsensical outside of
play_dead(), so overload this to select between two possible MWAITs, the
second using the STI shadow to cover MWAIT for exactly the same reason as we
do in safe_halt().
Fixes: b17e0ec72ede ("x86/mwait-idle: enable interrupts before C1 on Xeons")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Julien Grall <julien@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
---
xen/arch/x86/acpi/cpu_idle.c | 16 +++++++++++++++-
xen/arch/x86/cpu/mwait-idle.c | 8 ++------
2 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index de020dfee87f..dc8b7111a181 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -80,6 +80,13 @@ static always_inline void mwait(unsigned int eax, unsigned int ecx)
:: "a" (eax), "c" (ecx) );
}
+static always_inline void sti_mwait_cli(unsigned int eax, unsigned int ecx)
+{
+ /* STI shadow covers MWAIT. */
+ asm volatile ( "sti; mwait; cli"
+ :: "a" (eax), "c" (ecx) );
+}
+
#define GET_HW_RES_IN_NS(msr, val) \
do { rdmsrl(msr, val); val = tsc_ticks2ns(val); } while( 0 )
#define GET_MC6_RES(val) GET_HW_RES_IN_NS(0x664, val)
@@ -461,12 +468,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
monitor(this_softirq_pending, 0, 0);
+ ASSERT(!local_irq_is_enabled());
+
if ( !*this_softirq_pending )
{
struct cpu_info *info = get_cpu_info();
spec_ctrl_enter_idle(info);
- mwait(eax, ecx);
+
+ if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
+ mwait(eax, ecx);
+ else
+ sti_mwait_cli(eax, ecx);
+
spec_ctrl_exit_idle(info);
}
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index 5c16f5ad3a82..b65d6ae9ddc5 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -946,12 +946,8 @@ static void cf_check mwait_idle(void)
update_last_cx_stat(power, cx, before);
- if (cx->irq_enable_early)
- local_irq_enable();
-
- mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
-
- local_irq_disable();
+ mwait_idle_with_hints(cx->address,
+ cx->irq_enable_early ? 0 : MWAIT_ECX_INTERRUPT_BREAK);
after = alternative_call(cpuidle_get_tick);
--
2.39.5
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq()
2025-07-02 14:41 ` [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq() Andrew Cooper
@ 2025-07-03 8:11 ` Jan Beulich
2025-07-03 10:36 ` Andrew Cooper
2025-07-03 16:21 ` Roger Pau Monné
1 sibling, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2025-07-03 8:11 UTC (permalink / raw)
To: Andrew Cooper
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 02.07.2025 16:41, Andrew Cooper wrote:
> x86 is the only architecture wanting an optimisation here, but the
> test_and_set_bit() is a store into the monitored line
Which is intentional aiui, while this reads as if this was part of the issue.
> and is racy with determining whether an IPI can be skipped.
Racy here as in limiting the effect of the optimization, but not affecting
correctness aiui: If the woken CPU managed to clear the bit already, we'd
needlessly IPI it. This could also do with saying.
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-02 14:41 ` [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm Andrew Cooper
@ 2025-07-03 9:01 ` Jan Beulich
2025-07-03 11:59 ` Andrew Cooper
2025-07-03 16:36 ` Roger Pau Monné
2025-07-04 7:52 ` Roger Pau Monné
2 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2025-07-03 9:01 UTC (permalink / raw)
To: Andrew Cooper
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 02.07.2025 16:41, Andrew Cooper wrote:
> In order elide IPIs, we must be able to identify whether a target CPU is in
> MWAIT at the point it is woken up. i.e. the store to wake it up must also
> identify the state.
>
> Create a new in_mwait variable beside __softirq_pending, so we can use a
> CMPXCHG to set the softirq while also observing the status safely. Implement
> an x86 version of arch_pend_softirq() which does this.
>
> In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
> precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
> advertising in_mwait.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Anthony PERARD <anthony.perard@vates.tech>
> CC: Michal Orzel <michal.orzel@amd.com>
> CC: Julien Grall <julien@xen.org>
> CC: Stefano Stabellini <sstabellini@kernel.org>
>
> This is modelled after Linux's TIF_NEED_RESCHED (single bit equivelent of all
> of __softirq_pending), and TIF_POLLING_NRFLAG (arch-neutral "in_mwait").
>
> In Linux, they're both in the same flags field, which adds complexity. In
> Xen, __softirq_pending is already unsigned long for everything other than x86,
> so adding an arch-neutral "in_mwait" would need wider changes.
Why would the flag need to be arch-neutral? A pretty straightforward way to
achieve what you want would seem to be to define an x86-only MWAIT_SOFTIRQ,
which we'd never raise or open, but which instead we'd advertise to common
code as an always-ignore mask (to be OR-ed in by __do_softirq()). However,
setting and clearing such a bit would of course require LOCK-ed insns,
which clearly is less desirable than the simple MOVB you're using.
> --- a/xen/arch/x86/acpi/cpu_idle.c
> +++ b/xen/arch/x86/acpi/cpu_idle.c
> @@ -439,7 +439,21 @@ __initcall(cpu_idle_key_init);
> void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
> {
> unsigned int cpu = smp_processor_id();
> - const unsigned int *this_softirq_pending = &softirq_pending(cpu);
> + irq_cpustat_t *stat = &irq_stat[cpu];
> + const unsigned int *this_softirq_pending = &stat->__softirq_pending;
> +
> + /*
> + * By setting in_mwait, we promise to other CPUs that we'll notice changes
> + * to __softirq_pending without being sent an IPI. We achieve this by
> + * either not going to sleep, or by having hardware notice on our behalf.
> + *
> + * Some errata exist where MONITOR doesn't work properly, and the
> + * workaround is to force the use of an IPI. Cause this to happen by
> + * simply not advertising outselves as being in_mwait.
Nit: ourselves
> + */
> + alternative_io("movb $1, %[in_mwait]",
> + "", X86_BUG_MONITOR,
> + [in_mwait] "=m" (stat->in_mwait));
>
> monitor(this_softirq_pending, 0, 0);
> smp_mb();
Unlike alternative(), alternative_io() has no memory clobber. To the compiler
the two resulting asm() therefore have no dependency operand-wise[1]. The
sole ordering criteria then is that they're both volatile asm()-s. It not
being explicitly said (anywhere that I could find) that volatile guarantees
such ordering, I wonder if we wouldn't better have an operand-based
dependency between the two. Otoh it may well be that we rely on such ordering
elsewhere already, in which case having one more instance probably would be
okay(ish).
[1] It's actually worse than that, I think: monitor() also doesn't specify
the location as a (memory) input.
> @@ -452,6 +466,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
> mwait(eax, ecx);
> spec_ctrl_exit_idle(info);
> }
> +
> + alternative_io("movb $0, %[in_mwait]",
> + "", X86_BUG_MONITOR,
> + [in_mwait] "=m" (stat->in_mwait));
This doesn't strictly need to be an alternative, does it? We can save a
memory write in the buggy case, but that likely makes at most a marginal
difference.
> --- a/xen/arch/x86/include/asm/hardirq.h
> +++ b/xen/arch/x86/include/asm/hardirq.h
> @@ -5,7 +5,19 @@
> #include <xen/types.h>
>
> typedef struct {
> - unsigned int __softirq_pending;
> + /*
> + * The layout is important. Any CPU can set bits in __softirq_pending,
> + * but in_mwait is a status bit owned by the CPU. softirq_mwait_raw must
> + * cover both, and must be in a single cacheline.
> + */
> + union {
> + struct {
> + unsigned int __softirq_pending;
> + bool in_mwait;
> + };
> + uint64_t softirq_mwait_raw;
> + };
To guard against someone changing e.g. __softirq_pending to unsigned long
while ignoring the comment, how about adding a BUILD_BUG_ON() checking both
parts of the union are the same size (which of course would require naming
either the union field within the containing struct or its struct member)?
> @@ -9,4 +11,36 @@
> #define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
> #define NR_ARCH_SOFTIRQS 5
>
> +/*
> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
> + * skipped, false if the IPI cannot be skipped.
> + *
> + * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
> + * set softirq @nr while also observing in_mwait in a race-free way.
> + */
> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
> +{
> + uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
> + uint64_t old, new;
> + unsigned int softirq = 1U << nr;
> +
> + old = ACCESS_ONCE(*ptr);
> +
> + do {
> + if ( old & softirq )
> + /* Softirq already pending, nothing to do. */
> + return true;
> +
> + new = old | softirq;
> +
> + } while ( (old = cmpxchg(ptr, old, new)) != new );
Don't you mean
} while ( (new = cmpxchg(ptr, old, new)) != old );
here (with new latched ahead of the loop and old set from new inside it),
like we have it elsewhere when we avoid the use of yet another variable,
e.g. in inc_linear_{entries,uses}()?
> + /*
> + * We have caused the softirq to become pending. If in_mwait was set, the
> + * target CPU will notice the modification and act on it.
> + */
> + return new & (1UL << 32);
Hmm, this requires the layout to allow for even less re-arrangement than the
comment ahead of the union says: You strictly require in_mwait to follow
__softirq_pending immediately, and the (assembly) write to strictly write 1
into the field. Could I talk you into at least
return new & (1UL << (offsetof(..., in_mwait) * 8));
? This way the field being inspected would also be mentioned in the access
itself (i.e. become grep-able beyond the mentioning in the comment). And I
sincerely dislike hard-coded literal numbers when they (relatively) easily
can be expressed another way.
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 5/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
2025-07-02 14:41 ` [PATCH 5/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints() Andrew Cooper
@ 2025-07-03 9:24 ` Jan Beulich
2025-07-03 12:37 ` Andrew Cooper
0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2025-07-03 9:24 UTC (permalink / raw)
To: Andrew Cooper
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 02.07.2025 16:41, Andrew Cooper wrote:
> With the recent simplifications, it becomes obvious that smp_mb() isn't the
> right barrier; all we need is a compiler barrier.
>
> Include this in monitor() itself, along with an explantion.
Ah, here we go. As per my comment on patch 4, would this perhaps better move
ahead (which however would require a bit of an adjustment to the description)?
(Nit: explanation)
> --- a/xen/arch/x86/acpi/cpu_idle.c
> +++ b/xen/arch/x86/acpi/cpu_idle.c
> @@ -66,8 +66,12 @@ static always_inline void monitor(
> alternative_input("", "clflush (%[addr])", X86_BUG_CLFLUSH_MONITOR,
> [addr] "a" (addr));
>
> + /*
> + * The memory clobber is a compiler barrier. Subseqeunt reads from the
Nit: Subsequent
> + * monitored cacheline must not be hoisted over MONITOR.
> + */
> asm volatile ( "monitor"
> - :: "a" (addr), "c" (ecx), "d" (edx) );
> + :: "a" (addr), "c" (ecx), "d" (edx) : "memory" );
> }
That's heavier than we need, though. Can't we simply have a fake output
"+m" (irq_stat[cpu])? Downside being that the compiler may then set up
addressing of that operand, when the operand isn't really referenced. (As
long as __softirq_pending is the first field there, there may not be any
extra overhead, though, as %rax then would also address the unused operand.)
Yet then, is it really only reads from that cacheline that are of concern?
Isn't it - strictly speaking - also necessary that any (hypothetical) reads
done by the NOW() at the end of the function have to occur only afterwards
(and independent of there being a LOCK-ed access in cpumask_clear_cpu())?
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons"
2025-07-02 14:41 ` [PATCH 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons" Andrew Cooper
@ 2025-07-03 9:35 ` Jan Beulich
2025-07-03 9:43 ` Jan Beulich
1 sibling, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2025-07-03 9:35 UTC (permalink / raw)
To: Andrew Cooper
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 02.07.2025 16:41, Andrew Cooper wrote:
> The check of this_softirq_pending must be performed with irqs disabled, but
> this property was broken by an attempt to optimise entry/exit latency.
>
> Commit c227233ad64c in Linux (which we copied into Xen) was fixed up by
> edc8fc01f608 in Linux, which we have so far missed.
>
> Going to sleep without waking on interrupts is nonsensical outside of
> play_dead(), so overload this to select between two possible MWAITs, the
> second using the STI shadow to cover MWAIT for exactly the same reason as we
> do in safe_halt().
>
> Fixes: b17e0ec72ede ("x86/mwait-idle: enable interrupts before C1 on Xeons")
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
with ...
> --- a/xen/arch/x86/cpu/mwait-idle.c
> +++ b/xen/arch/x86/cpu/mwait-idle.c
> @@ -946,12 +946,8 @@ static void cf_check mwait_idle(void)
>
> update_last_cx_stat(power, cx, before);
>
> - if (cx->irq_enable_early)
> - local_irq_enable();
> -
> - mwait_idle_with_hints(cx->address, MWAIT_ECX_INTERRUPT_BREAK);
> -
> - local_irq_disable();
> + mwait_idle_with_hints(cx->address,
> + cx->irq_enable_early ? 0 : MWAIT_ECX_INTERRUPT_BREAK);
... indentation here switched to Linux style (to match the rest of the file).
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons"
2025-07-02 14:41 ` [PATCH 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons" Andrew Cooper
2025-07-03 9:35 ` Jan Beulich
@ 2025-07-03 9:43 ` Jan Beulich
2025-07-03 12:10 ` Andrew Cooper
1 sibling, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2025-07-03 9:43 UTC (permalink / raw)
To: Andrew Cooper
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 02.07.2025 16:41, Andrew Cooper wrote:
> @@ -461,12 +468,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>
> monitor(this_softirq_pending, 0, 0);
>
> + ASSERT(!local_irq_is_enabled());
> +
> if ( !*this_softirq_pending )
> {
> struct cpu_info *info = get_cpu_info();
>
> spec_ctrl_enter_idle(info);
> - mwait(eax, ecx);
> +
> + if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
> + mwait(eax, ecx);
> + else
> + sti_mwait_cli(eax, ecx);
Actually, I'm curious: It seems quite likely that you did consider an
alternative resulting in assembly code like this:
test $MWAIT_ECX_INTERRUPT_BREAK, %cl
jz 0f
sti
0:
monitor
cli
CLI being a relatively cheap operation aiui, is there anything wrong or
undesirable with this (smaller overall) alternative?
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq()
2025-07-03 8:11 ` Jan Beulich
@ 2025-07-03 10:36 ` Andrew Cooper
2025-07-03 11:24 ` Jan Beulich
0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2025-07-03 10:36 UTC (permalink / raw)
To: Jan Beulich
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 03/07/2025 9:11 am, Jan Beulich wrote:
> On 02.07.2025 16:41, Andrew Cooper wrote:
>> x86 is the only architecture wanting an optimisation here, but the
>> test_and_set_bit() is a store into the monitored line
> Which is intentional aiui, while this reads as if this was part of the issue.
I don't understand what you're trying to say.
It is racy, and that's why we're changing it.
>> and is racy with determining whether an IPI can be skipped.
> Racy here as in limiting the effect of the optimization, but not affecting
> correctness aiui: If the woken CPU managed to clear the bit already, we'd
> needlessly IPI it.
Correct.
> This could also do with saying.
What about this?
x86 is the only architecture wanting an optimisation here, but the
test_and_set_bit() is a store into the monitored line (i.e. will wake up
the target) and, prior to the removal of the broken IPI-elision
algorithm, was racy, causing unnecessary IPIs to be sent.
To do this in a race-free way, the store to the monited line needs to
also sample the status of the target in one atomic action. Implement a
new arch helper with different semantics; to make the softirq pending
and decide about IPIs together. For now, implement the default helper.
It will be overridden by x86 in a subsequent change.
~Andrew
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq()
2025-07-03 10:36 ` Andrew Cooper
@ 2025-07-03 11:24 ` Jan Beulich
0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2025-07-03 11:24 UTC (permalink / raw)
To: Andrew Cooper
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 03.07.2025 12:36, Andrew Cooper wrote:
> On 03/07/2025 9:11 am, Jan Beulich wrote:
>> On 02.07.2025 16:41, Andrew Cooper wrote:
>>> x86 is the only architecture wanting an optimisation here, but the
>>> test_and_set_bit() is a store into the monitored line
>> Which is intentional aiui, while this reads as if this was part of the issue.
>
> I don't understand what you're trying to say.
What I was trying to say is the way you wrote it to me it read as if you're
describing two issues: There wrongly being a store into the monitored line,
and ...
> It is racy, and that's why we're changing it.
>
>>> and is racy with determining whether an IPI can be skipped.
... there being a race.
>> Racy here as in limiting the effect of the optimization, but not affecting
>> correctness aiui: If the woken CPU managed to clear the bit already, we'd
>> needlessly IPI it.
>
> Correct.
>
>> This could also do with saying.
>
> What about this?
>
> x86 is the only architecture wanting an optimisation here, but the
> test_and_set_bit() is a store into the monitored line (i.e. will wake up
> the target) and, prior to the removal of the broken IPI-elision
> algorithm, was racy, causing unnecessary IPIs to be sent.
>
> To do this in a race-free way, the store to the monited line needs to
> also sample the status of the target in one atomic action. Implement a
> new arch helper with different semantics; to make the softirq pending
> and decide about IPIs together. For now, implement the default helper.
> It will be overridden by x86 in a subsequent change.
Better, yes. Thanks.
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-03 9:01 ` Jan Beulich
@ 2025-07-03 11:59 ` Andrew Cooper
2025-07-03 14:07 ` Jan Beulich
0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2025-07-03 11:59 UTC (permalink / raw)
To: Jan Beulich
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 03/07/2025 10:01 am, Jan Beulich wrote:
> On 02.07.2025 16:41, Andrew Cooper wrote:
>> In order elide IPIs, we must be able to identify whether a target CPU is in
>> MWAIT at the point it is woken up. i.e. the store to wake it up must also
>> identify the state.
>>
>> Create a new in_mwait variable beside __softirq_pending, so we can use a
>> CMPXCHG to set the softirq while also observing the status safely. Implement
>> an x86 version of arch_pend_softirq() which does this.
>>
>> In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
>> precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
>> advertising in_mwait.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> CC: Anthony PERARD <anthony.perard@vates.tech>
>> CC: Michal Orzel <michal.orzel@amd.com>
>> CC: Julien Grall <julien@xen.org>
>> CC: Stefano Stabellini <sstabellini@kernel.org>
>>
>> This is modelled after Linux's TIF_NEED_RESCHED (single bit equivelent of all
>> of __softirq_pending), and TIF_POLLING_NRFLAG (arch-neutral "in_mwait").
>>
>> In Linux, they're both in the same flags field, which adds complexity. In
>> Xen, __softirq_pending is already unsigned long for everything other than x86,
>> so adding an arch-neutral "in_mwait" would need wider changes.
> Why would the flag need to be arch-neutral?
Because it's not about mwait; it's about the ability to notice the
rising edge of TIF_NEED_RESCHED, and is implemented by more than just
x86 in Linux.
>> + */
>> + alternative_io("movb $1, %[in_mwait]",
>> + "", X86_BUG_MONITOR,
>> + [in_mwait] "=m" (stat->in_mwait));
>>
>> monitor(this_softirq_pending, 0, 0);
>> smp_mb();
> Unlike alternative(), alternative_io() has no memory clobber. To the compiler
> the two resulting asm() therefore have no dependency operand-wise[1]. The
> sole ordering criteria then is that they're both volatile asm()-s. It not
> being explicitly said (anywhere that I could find) that volatile guarantees
> such ordering, I wonder if we wouldn't better have an operand-based
> dependency between the two. Otoh it may well be that we rely on such ordering
> elsewhere already, in which case having one more instance probably would be
> okay(ish).
>
> [1] It's actually worse than that, I think: monitor() also doesn't specify
> the location as a (memory) input.
The GCC bugzilla has plenty of statements that volatiles (which have
survived optimisation passes) can't be reordered.
>
>> @@ -452,6 +466,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>> mwait(eax, ecx);
>> spec_ctrl_exit_idle(info);
>> }
>> +
>> + alternative_io("movb $0, %[in_mwait]",
>> + "", X86_BUG_MONITOR,
>> + [in_mwait] "=m" (stat->in_mwait));
> This doesn't strictly need to be an alternative, does it? We can save a
> memory write in the buggy case, but that likely makes at most a marginal
> difference.
It doesn't *need* to be an alternative. It could be:
if ( !cpu_bug_monitor )
ACCESS_ONCE(stat->in_mwait) = true;
but getting rid of the branch is an advantage too.
>> --- a/xen/arch/x86/include/asm/hardirq.h
>> +++ b/xen/arch/x86/include/asm/hardirq.h
>> @@ -5,7 +5,19 @@
>> #include <xen/types.h>
>>
>> typedef struct {
>> - unsigned int __softirq_pending;
>> + /*
>> + * The layout is important. Any CPU can set bits in __softirq_pending,
>> + * but in_mwait is a status bit owned by the CPU. softirq_mwait_raw must
>> + * cover both, and must be in a single cacheline.
>> + */
>> + union {
>> + struct {
>> + unsigned int __softirq_pending;
>> + bool in_mwait;
>> + };
>> + uint64_t softirq_mwait_raw;
>> + };
> To guard against someone changing e.g. __softirq_pending to unsigned long
> while ignoring the comment, how about adding a BUILD_BUG_ON() checking both
> parts of the union are the same size (which of course would require naming
> either the union field within the containing struct or its struct member)?
That turns out to be very hard because of the definition of
softirq_pending() being common. More below.
>
>> @@ -9,4 +11,36 @@
>> #define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
>> #define NR_ARCH_SOFTIRQS 5
>>
>> +/*
>> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
>> + * skipped, false if the IPI cannot be skipped.
>> + *
>> + * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
>> + * set softirq @nr while also observing in_mwait in a race-free way.
>> + */
>> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
>> +{
>> + uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
>> + uint64_t old, new;
>> + unsigned int softirq = 1U << nr;
>> +
>> + old = ACCESS_ONCE(*ptr);
>> +
>> + do {
>> + if ( old & softirq )
>> + /* Softirq already pending, nothing to do. */
>> + return true;
>> +
>> + new = old | softirq;
>> +
>> + } while ( (old = cmpxchg(ptr, old, new)) != new );
> Don't you mean
>
> } while ( (new = cmpxchg(ptr, old, new)) != old );
>
> here
No. I'm pretty sure I don't.
> (with new latched ahead of the loop and old set from new inside it),
> like we have it elsewhere when we avoid the use of yet another variable,
> e.g. in inc_linear_{entries,uses}()?
Eww, no. Having new and old backwards like that is borderline
obfucation, and is unnecessary cognitive complexity for the reader.
>
>> + /*
>> + * We have caused the softirq to become pending. If in_mwait was set, the
>> + * target CPU will notice the modification and act on it.
>> + */
>> + return new & (1UL << 32);
> Hmm, this requires the layout to allow for even less re-arrangement than the
> comment ahead of the union says: You strictly require in_mwait to follow
> __softirq_pending immediately, and the (assembly) write to strictly write 1
> into the field. Could I talk you into at least
>
> return new & (1UL << (offsetof(..., in_mwait) * 8));
>
> ? This way the field being inspected would also be mentioned in the access
> itself (i.e. become grep-able beyond the mentioning in the comment). And I
> sincerely dislike hard-coded literal numbers when they (relatively) easily
> can be expressed another way.
The nice way to do this would be a named union and a proper field, but
that doesn't work because of how softirq_pending() is declared.
I suppose I can use an offsetof() expression.
~Andrew
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons"
2025-07-03 9:43 ` Jan Beulich
@ 2025-07-03 12:10 ` Andrew Cooper
2025-07-03 13:11 ` Jan Beulich
0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2025-07-03 12:10 UTC (permalink / raw)
To: Jan Beulich
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 03/07/2025 10:43 am, Jan Beulich wrote:
> On 02.07.2025 16:41, Andrew Cooper wrote:
>> @@ -461,12 +468,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>>
>> monitor(this_softirq_pending, 0, 0);
>>
>> + ASSERT(!local_irq_is_enabled());
>> +
>> if ( !*this_softirq_pending )
>> {
>> struct cpu_info *info = get_cpu_info();
>>
>> spec_ctrl_enter_idle(info);
>> - mwait(eax, ecx);
>> +
>> + if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
>> + mwait(eax, ecx);
>> + else
>> + sti_mwait_cli(eax, ecx);
> Actually, I'm curious: It seems quite likely that you did consider an
> alternative resulting in assembly code like this:
>
> test $MWAIT_ECX_INTERRUPT_BREAK, %cl
> jz 0f
> sti
> 0:
> monitor
> cli
>
> CLI being a relatively cheap operation aiui, is there anything wrong or
> undesirable with this (smaller overall) alternative?
Other than it needing to be mwait? The overheads aren't interesting;
they're nothing compared to going idle.
What does matter is that such a transformation cannot exist in mwait()
itself, as that breaks acpi_dead_idle(), and if we turn this mwait()
into inline asm, there's only a single caller of mwait() left.
~Andrew
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 5/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
2025-07-03 9:24 ` Jan Beulich
@ 2025-07-03 12:37 ` Andrew Cooper
2025-07-03 13:30 ` Jan Beulich
0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2025-07-03 12:37 UTC (permalink / raw)
To: Jan Beulich
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 03/07/2025 10:24 am, Jan Beulich wrote:
> On 02.07.2025 16:41, Andrew Cooper wrote:
>> With the recent simplifications, it becomes obvious that smp_mb() isn't the
>> right barrier; all we need is a compiler barrier.
>>
>> Include this in monitor() itself, along with an explantion.
> Ah, here we go. As per my comment on patch 4, would this perhaps better move
> ahead (which however would require a bit of an adjustment to the description)?
As said, it's not necessary in practice.
>
>> + * monitored cacheline must not be hoisted over MONITOR.
>> + */
>> asm volatile ( "monitor"
>> - :: "a" (addr), "c" (ecx), "d" (edx) );
>> + :: "a" (addr), "c" (ecx), "d" (edx) : "memory" );
>> }
> That's heavier than we need, though. Can't we simply have a fake output
> "+m" (irq_stat[cpu])?
No. That would be wrong for one of the two callers. Also we don't have
cpu available.
The correct value would be a round-down on addr and a cacheline-sized
sized type, but we can't do that because of -Wvla.
Nothing good can come of anything crossing the MONITOR, and ...
> Downside being that the compiler may then set up
> addressing of that operand, when the operand isn't really referenced. (As
> long as __softirq_pending is the first field there, there may not be any
> extra overhead, though, as %rax then would also address the unused operand.)
... nothing good is going to come from trying to get clever at
optimising a constraint that doesn't actually improve code generation in
the first place.
>
> Yet then, is it really only reads from that cacheline that are of concern?
> Isn't it - strictly speaking - also necessary that any (hypothetical) reads
> done by the NOW() at the end of the function have to occur only afterwards
> (and independent of there being a LOCK-ed access in cpumask_clear_cpu())?
The NOW() and cpumask_clear_cpu() are gone, and not going to be returning.
I did put a compiler barrier in mwait() originally too, but dropped it
because I couldn't reason about it easily.
Nothing good can come of having any loads hoisted above MWAIT, but from
a programmers point of view, it's indistinguishable from e.g. taking an
SMI. If there's a correctness issue, it's not MWAIT's fault.
~Andrew
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons"
2025-07-03 12:10 ` Andrew Cooper
@ 2025-07-03 13:11 ` Jan Beulich
0 siblings, 0 replies; 35+ messages in thread
From: Jan Beulich @ 2025-07-03 13:11 UTC (permalink / raw)
To: Andrew Cooper
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 03.07.2025 14:10, Andrew Cooper wrote:
> On 03/07/2025 10:43 am, Jan Beulich wrote:
>> On 02.07.2025 16:41, Andrew Cooper wrote:
>>> @@ -461,12 +468,19 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>>>
>>> monitor(this_softirq_pending, 0, 0);
>>>
>>> + ASSERT(!local_irq_is_enabled());
>>> +
>>> if ( !*this_softirq_pending )
>>> {
>>> struct cpu_info *info = get_cpu_info();
>>>
>>> spec_ctrl_enter_idle(info);
>>> - mwait(eax, ecx);
>>> +
>>> + if ( ecx & MWAIT_ECX_INTERRUPT_BREAK )
>>> + mwait(eax, ecx);
>>> + else
>>> + sti_mwait_cli(eax, ecx);
>> Actually, I'm curious: It seems quite likely that you did consider an
>> alternative resulting in assembly code like this:
>>
>> test $MWAIT_ECX_INTERRUPT_BREAK, %cl
>> jz 0f
>> sti
>> 0:
>> monitor
>> cli
>>
>> CLI being a relatively cheap operation aiui, is there anything wrong or
>> undesirable with this (smaller overall) alternative?
>
> Other than it needing to be mwait?
Oops.
> The overheads aren't interesting;
> they're nothing compared to going idle.
>
> What does matter is that such a transformation cannot exist in mwait()
> itself, as that breaks acpi_dead_idle(), and if we turn this mwait()
> into inline asm, there's only a single caller of mwait() left.
I was rather think of it living in something derived from sti_mwait_cli(),
which could then be called unconditionally from mwait_idle_with_hints().
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 5/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
2025-07-03 12:37 ` Andrew Cooper
@ 2025-07-03 13:30 ` Jan Beulich
2025-07-04 15:45 ` Andrew Cooper
0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2025-07-03 13:30 UTC (permalink / raw)
To: Andrew Cooper
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 03.07.2025 14:37, Andrew Cooper wrote:
> On 03/07/2025 10:24 am, Jan Beulich wrote:
>> On 02.07.2025 16:41, Andrew Cooper wrote:
>>> With the recent simplifications, it becomes obvious that smp_mb() isn't the
>>> right barrier; all we need is a compiler barrier.
>>>
>>> Include this in monitor() itself, along with an explantion.
>> Ah, here we go. As per my comment on patch 4, would this perhaps better move
>> ahead (which however would require a bit of an adjustment to the description)?
>
> As said, it's not necessary in practice.
As said where? All you say here is that a memory barrier is needed. Perhaps
my use of "ahead" was ambiguous? I meant "move ahead in the series", not
"move ahead in code". Apart from this as a possibility I fear I don't
understand.
>>> + * monitored cacheline must not be hoisted over MONITOR.
>>> + */
>>> asm volatile ( "monitor"
>>> - :: "a" (addr), "c" (ecx), "d" (edx) );
>>> + :: "a" (addr), "c" (ecx), "d" (edx) : "memory" );
>>> }
>> That's heavier than we need, though. Can't we simply have a fake output
>> "+m" (irq_stat[cpu])?
>
> No. That would be wrong for one of the two callers.
How that? MONITOR behaves the same in all cases, doesn't it? And it's
the same piece of data in both cases the address of which is passed in
(&softirq_pending(smp_processor_id())).
> Also we don't have cpu available.
smp_processor_id()? Or else have a pointer to the full array entry passed
in? We could also specify the entire array, just that that's not easily
expressable afaict.
> The correct value would be a round-down on addr and a cacheline-sized
> sized type, but we can't do that because of -Wvla.
But that's what irq_stat[cpu] is (or at least claims to be, by the element
type having the __cacheline_aligned attribute).
> Nothing good can come of anything crossing the MONITOR, and ...
>
>> Downside being that the compiler may then set up
>> addressing of that operand, when the operand isn't really referenced. (As
>> long as __softirq_pending is the first field there, there may not be any
>> extra overhead, though, as %rax then would also address the unused operand.)
>
> ... nothing good is going to come from trying to get clever at
> optimising a constraint that doesn't actually improve code generation in
> the first place.
>
>> Yet then, is it really only reads from that cacheline that are of concern?
>> Isn't it - strictly speaking - also necessary that any (hypothetical) reads
>> done by the NOW() at the end of the function have to occur only afterwards
>> (and independent of there being a LOCK-ed access in cpumask_clear_cpu())?
>
> The NOW() and cpumask_clear_cpu() are gone, and not going to be returning.
I just used the former as example and mentioned the latter because it would
serialize memory accesses irrespective of any barriers we add.
> I did put a compiler barrier in mwait() originally too, but dropped it
> because I couldn't reason about it easily.
Which I understand.
> Nothing good can come of having any loads hoisted above MWAIT, but from
> a programmers point of view, it's indistinguishable from e.g. taking an
> SMI. If there's a correctness issue, it's not MWAIT's fault.
Well, yes, but what in the code is it that tells the compiler not to? Up
to and including LTO, should we ever get that to work again. This
specifically may be why mwait() may need to gain one, despite not itself
dealing with any memory (operands).
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-03 11:59 ` Andrew Cooper
@ 2025-07-03 14:07 ` Jan Beulich
2025-07-03 17:29 ` Andrew Cooper
0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2025-07-03 14:07 UTC (permalink / raw)
To: Andrew Cooper
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 03.07.2025 13:59, Andrew Cooper wrote:
> On 03/07/2025 10:01 am, Jan Beulich wrote:
>> On 02.07.2025 16:41, Andrew Cooper wrote:
>>> @@ -452,6 +466,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>>> mwait(eax, ecx);
>>> spec_ctrl_exit_idle(info);
>>> }
>>> +
>>> + alternative_io("movb $0, %[in_mwait]",
>>> + "", X86_BUG_MONITOR,
>>> + [in_mwait] "=m" (stat->in_mwait));
>> This doesn't strictly need to be an alternative, does it? We can save a
>> memory write in the buggy case, but that likely makes at most a marginal
>> difference.
>
> It doesn't *need* to be an alternative. It could be:
>
> if ( !cpu_bug_monitor )
> ACCESS_ONCE(stat->in_mwait) = true;
>
> but getting rid of the branch is an advantage too.
That's the other alternative blob. What I mean that here it could simply
be
ACCESS_ONCE(stat->in_mwait) = false;
without any conditional.
>>> --- a/xen/arch/x86/include/asm/hardirq.h
>>> +++ b/xen/arch/x86/include/asm/hardirq.h
>>> @@ -5,7 +5,19 @@
>>> #include <xen/types.h>
>>>
>>> typedef struct {
>>> - unsigned int __softirq_pending;
>>> + /*
>>> + * The layout is important. Any CPU can set bits in __softirq_pending,
>>> + * but in_mwait is a status bit owned by the CPU. softirq_mwait_raw must
>>> + * cover both, and must be in a single cacheline.
>>> + */
>>> + union {
>>> + struct {
>>> + unsigned int __softirq_pending;
>>> + bool in_mwait;
>>> + };
>>> + uint64_t softirq_mwait_raw;
>>> + };
>> To guard against someone changing e.g. __softirq_pending to unsigned long
>> while ignoring the comment, how about adding a BUILD_BUG_ON() checking both
>> parts of the union are the same size (which of course would require naming
>> either the union field within the containing struct or its struct member)?
>
> That turns out to be very hard because of the definition of
> softirq_pending() being common. More below.
Hmm, yes, I see. Then maybe
BUILD_BUG_ON(offsetof(irq_cpustat_t, in_mwait) +
sizeof(irq_stat[0].in_mwait) >
offsetof(irq_cpustat_t, softirq_mwait_raw) +
sizeof(irq_stat[0].softirq_mwait_raw));
(or something substantially similar using e.g. typeof())?
>>> @@ -9,4 +11,36 @@
>>> #define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
>>> #define NR_ARCH_SOFTIRQS 5
>>>
>>> +/*
>>> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
>>> + * skipped, false if the IPI cannot be skipped.
>>> + *
>>> + * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
>>> + * set softirq @nr while also observing in_mwait in a race-free way.
>>> + */
>>> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
>>> +{
>>> + uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
>>> + uint64_t old, new;
>>> + unsigned int softirq = 1U << nr;
>>> +
>>> + old = ACCESS_ONCE(*ptr);
>>> +
>>> + do {
>>> + if ( old & softirq )
>>> + /* Softirq already pending, nothing to do. */
>>> + return true;
>>> +
>>> + new = old | softirq;
>>> +
>>> + } while ( (old = cmpxchg(ptr, old, new)) != new );
>> Don't you mean
>>
>> } while ( (new = cmpxchg(ptr, old, new)) != old );
>>
>> here
>
> No. I'm pretty sure I don't.
>
>> (with new latched ahead of the loop and old set from new inside it),
>> like we have it elsewhere when we avoid the use of yet another variable,
>> e.g. in inc_linear_{entries,uses}()?
>
> Eww, no. Having new and old backwards like that is borderline
> obfucation, and is unnecessary cognitive complexity for the reader.
Why backwards? You want to compare the (original) value read against the
expected old value. The way you wrote it you'll do at least one extra
loop iteration, as you wait for the fetched (original) value to equal
"new".
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/6] x86/idle: Remove broken MWAIT implementation
2025-07-02 14:41 ` [PATCH 1/6] x86/idle: Remove broken MWAIT implementation Andrew Cooper
@ 2025-07-03 16:01 ` Roger Pau Monné
2025-07-03 16:19 ` Andrew Cooper
0 siblings, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2025-07-03 16:01 UTC (permalink / raw)
To: Andrew Cooper
Cc: Xen-devel, Jan Beulich, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On Wed, Jul 02, 2025 at 03:41:16PM +0100, Andrew Cooper wrote:
> cpuidle_wakeup_mwait() is a TOCTOU race. The cpumask_and() sampling
> cpuidle_mwait_flags can take a arbitrary period of time, and there's no
> guarantee that the target CPUs are still in MWAIT when writing into
> mwait_wakeup(cpu).
>
> The consequence of the race is that we'll fail to IPI targets. Also, there's
> no guarantee that mwait_idle_with_hints() will raise a TIMER_SOFTIRQ on it's
> way out.
>
> The fundamental bug is that the "in_mwait" variable needs to be in the
> monitored line in order to do this in a race-free way.
I assume in_mwait being in the same monitored line is required so that
you can do an atomic exchange and ensure the remote CPU was either in
the monitor state, or just getting out of it but before processing
softirqs when the softirq is set?
That means that a CPU getting out of mwait would also need to do an
atomic exchange to clear in_mwait and fetch the pending softirqs?
> Arranging for this while keeping the old implementation is prohibitive, so
> strip it out in order to implement the new one cleanly. In the interim, this
> causes IPIs to be used unconditionally which is safe albeit suboptimal.
>
> Fixes: 3d521e933e1b ("cpuidle: mwait on softirq_pending & remove wakeup ipis")
> Fixes: 1adb34ea846d ("CPUIDLE: re-implement mwait wakeup process")
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Thanks, Roger.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/6] x86/idle: Remove broken MWAIT implementation
2025-07-03 16:01 ` Roger Pau Monné
@ 2025-07-03 16:19 ` Andrew Cooper
0 siblings, 0 replies; 35+ messages in thread
From: Andrew Cooper @ 2025-07-03 16:19 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Xen-devel, Jan Beulich, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On 03/07/2025 5:01 pm, Roger Pau Monné wrote:
> On Wed, Jul 02, 2025 at 03:41:16PM +0100, Andrew Cooper wrote:
>> cpuidle_wakeup_mwait() is a TOCTOU race. The cpumask_and() sampling
>> cpuidle_mwait_flags can take a arbitrary period of time, and there's no
>> guarantee that the target CPUs are still in MWAIT when writing into
>> mwait_wakeup(cpu).
>>
>> The consequence of the race is that we'll fail to IPI targets. Also, there's
>> no guarantee that mwait_idle_with_hints() will raise a TIMER_SOFTIRQ on it's
>> way out.
>>
>> The fundamental bug is that the "in_mwait" variable needs to be in the
>> monitored line in order to do this in a race-free way.
> I assume in_mwait being in the same monitored line is required so that
> you can do an atomic exchange and ensure the remote CPU was either in
> the monitor state, or just getting out of it but before processing
> softirqs when the softirq is set?
Yes
> That means that a CPU getting out of mwait would also need to do an
> atomic exchange to clear in_mwait and fetch the pending softirqs?
Not quite. It's probably better to see patch 4, than discuss it here.
>
>> Arranging for this while keeping the old implementation is prohibitive, so
>> strip it out in order to implement the new one cleanly. In the interim, this
>> causes IPIs to be used unconditionally which is safe albeit suboptimal.
>>
>> Fixes: 3d521e933e1b ("cpuidle: mwait on softirq_pending & remove wakeup ipis")
>> Fixes: 1adb34ea846d ("CPUIDLE: re-implement mwait wakeup process")
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Thanks
~Andrew
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq()
2025-07-02 14:41 ` [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq() Andrew Cooper
2025-07-03 8:11 ` Jan Beulich
@ 2025-07-03 16:21 ` Roger Pau Monné
2025-07-04 7:23 ` Jan Beulich
1 sibling, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2025-07-03 16:21 UTC (permalink / raw)
To: Andrew Cooper
Cc: Xen-devel, Jan Beulich, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On Wed, Jul 02, 2025 at 03:41:18PM +0100, Andrew Cooper wrote:
> x86 is the only architecture wanting an optimisation here, but the
> test_and_set_bit() is a store into the monitored line and is racy with
> determining whether an IPI can be skipped.
>
> Instead, implement a new arch helper with different semantics; to make the
> softirq pending and decide about IPIs together. For now, implement the
> default helper. It will be overridden by x86 in a subsequent change.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
With the adjusted commit message you proposed to Jan:
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Just one nit below to comment something.
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Anthony PERARD <anthony.perard@vates.tech>
> CC: Michal Orzel <michal.orzel@amd.com>
> CC: Julien Grall <julien@xen.org>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> ---
> xen/arch/x86/acpi/cpu_idle.c | 5 -----
> xen/arch/x86/include/asm/softirq.h | 2 --
> xen/common/softirq.c | 8 ++------
> xen/include/xen/softirq.h | 16 ++++++++++++++++
> 4 files changed, 18 insertions(+), 13 deletions(-)
>
> diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
> index dbcb80d81db9..caa0fef0b3b1 100644
> --- a/xen/arch/x86/acpi/cpu_idle.c
> +++ b/xen/arch/x86/acpi/cpu_idle.c
> @@ -436,11 +436,6 @@ static int __init cf_check cpu_idle_key_init(void)
> }
> __initcall(cpu_idle_key_init);
>
> -bool arch_skip_send_event_check(unsigned int cpu)
> -{
> - return false;
> -}
> -
> void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
> {
> unsigned int cpu = smp_processor_id();
> diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
> index 415ee866c79d..e4b194f069fb 100644
> --- a/xen/arch/x86/include/asm/softirq.h
> +++ b/xen/arch/x86/include/asm/softirq.h
> @@ -9,6 +9,4 @@
> #define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
> #define NR_ARCH_SOFTIRQS 5
>
> -bool arch_skip_send_event_check(unsigned int cpu);
> -
> #endif /* __ASM_SOFTIRQ_H__ */
> diff --git a/xen/common/softirq.c b/xen/common/softirq.c
> index 60f344e8425e..0368011f95d1 100644
> --- a/xen/common/softirq.c
> +++ b/xen/common/softirq.c
> @@ -94,9 +94,7 @@ void cpumask_raise_softirq(const cpumask_t *mask, unsigned int nr)
> raise_mask = &per_cpu(batch_mask, this_cpu);
>
> for_each_cpu(cpu, mask)
> - if ( !test_and_set_bit(nr, &softirq_pending(cpu)) &&
> - cpu != this_cpu &&
> - !arch_skip_send_event_check(cpu) )
> + if ( !arch_pend_softirq(nr, cpu) && cpu != this_cpu )
> __cpumask_set_cpu(cpu, raise_mask);
>
> if ( raise_mask == &send_mask )
> @@ -107,9 +105,7 @@ void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
> {
> unsigned int this_cpu = smp_processor_id();
>
> - if ( test_and_set_bit(nr, &softirq_pending(cpu))
> - || (cpu == this_cpu)
> - || arch_skip_send_event_check(cpu) )
> + if ( arch_pend_softirq(nr, cpu) || cpu == this_cpu )
> return;
>
> if ( !per_cpu(batching, this_cpu) || in_irq() )
> diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
> index e9f79ec0ce3e..0814c8e5cd41 100644
> --- a/xen/include/xen/softirq.h
> +++ b/xen/include/xen/softirq.h
> @@ -23,6 +23,22 @@ enum {
>
> #define NR_SOFTIRQS (NR_COMMON_SOFTIRQS + NR_ARCH_SOFTIRQS)
>
> +/*
> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
> + * skipped, false if the IPI cannot be skipped.
> + */
> +#ifndef arch_pend_softirq
> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
Nit: I would maybe it arch_set_softirq(), I find `pend` not that clear
(I would rather fully spell `pending` instead).
Thanks, Roger.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-02 14:41 ` [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm Andrew Cooper
2025-07-03 9:01 ` Jan Beulich
@ 2025-07-03 16:36 ` Roger Pau Monné
2025-07-03 17:48 ` Andrew Cooper
2025-07-04 7:52 ` Roger Pau Monné
2 siblings, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2025-07-03 16:36 UTC (permalink / raw)
To: Andrew Cooper
Cc: Xen-devel, Jan Beulich, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On Wed, Jul 02, 2025 at 03:41:19PM +0100, Andrew Cooper wrote:
> In order elide IPIs, we must be able to identify whether a target CPU is in
> MWAIT at the point it is woken up. i.e. the store to wake it up must also
> identify the state.
>
> Create a new in_mwait variable beside __softirq_pending, so we can use a
> CMPXCHG to set the softirq while also observing the status safely. Implement
> an x86 version of arch_pend_softirq() which does this.
>
> In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
> precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
> advertising in_mwait.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Anthony PERARD <anthony.perard@vates.tech>
> CC: Michal Orzel <michal.orzel@amd.com>
> CC: Julien Grall <julien@xen.org>
> CC: Stefano Stabellini <sstabellini@kernel.org>
>
> This is modelled after Linux's TIF_NEED_RESCHED (single bit equivelent of all
> of __softirq_pending), and TIF_POLLING_NRFLAG (arch-neutral "in_mwait").
>
> In Linux, they're both in the same flags field, which adds complexity. In
> Xen, __softirq_pending is already unsigned long for everything other than x86,
> so adding an arch-neutral "in_mwait" would need wider changes.
> ---
> xen/arch/x86/acpi/cpu_idle.c | 20 +++++++++++++++++-
> xen/arch/x86/include/asm/hardirq.h | 14 +++++++++++-
> xen/arch/x86/include/asm/softirq.h | 34 ++++++++++++++++++++++++++++++
> 3 files changed, 66 insertions(+), 2 deletions(-)
>
> diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
> index caa0fef0b3b1..13040ef467a0 100644
> --- a/xen/arch/x86/acpi/cpu_idle.c
> +++ b/xen/arch/x86/acpi/cpu_idle.c
> @@ -439,7 +439,21 @@ __initcall(cpu_idle_key_init);
> void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
> {
> unsigned int cpu = smp_processor_id();
> - const unsigned int *this_softirq_pending = &softirq_pending(cpu);
> + irq_cpustat_t *stat = &irq_stat[cpu];
> + const unsigned int *this_softirq_pending = &stat->__softirq_pending;
> +
> + /*
> + * By setting in_mwait, we promise to other CPUs that we'll notice changes
> + * to __softirq_pending without being sent an IPI. We achieve this by
> + * either not going to sleep, or by having hardware notice on our behalf.
> + *
> + * Some errata exist where MONITOR doesn't work properly, and the
> + * workaround is to force the use of an IPI. Cause this to happen by
> + * simply not advertising outselves as being in_mwait.
> + */
> + alternative_io("movb $1, %[in_mwait]",
> + "", X86_BUG_MONITOR,
> + [in_mwait] "=m" (stat->in_mwait));
>
> monitor(this_softirq_pending, 0, 0);
> smp_mb();
> @@ -452,6 +466,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
> mwait(eax, ecx);
> spec_ctrl_exit_idle(info);
> }
> +
> + alternative_io("movb $0, %[in_mwait]",
> + "", X86_BUG_MONITOR,
> + [in_mwait] "=m" (stat->in_mwait));
Isn't it a bit overkill to use alternatives here when you could have a
conditional based on a global variable to decide whether to set/clear
in_mwait?
> }
>
> static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
> diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
> index f3e93cc9b507..1647cff04dc8 100644
> --- a/xen/arch/x86/include/asm/hardirq.h
> +++ b/xen/arch/x86/include/asm/hardirq.h
> @@ -5,7 +5,19 @@
> #include <xen/types.h>
>
> typedef struct {
> - unsigned int __softirq_pending;
> + /*
> + * The layout is important. Any CPU can set bits in __softirq_pending,
> + * but in_mwait is a status bit owned by the CPU. softirq_mwait_raw must
> + * cover both, and must be in a single cacheline.
> + */
> + union {
> + struct {
> + unsigned int __softirq_pending;
> + bool in_mwait;
Given the usage in assembly where you store 0 and 1, this might better
be a uint8_t then?
> + };
> + uint64_t softirq_mwait_raw;
> + };
This could be a named union type ...
> +
> unsigned int __local_irq_count;
> unsigned int nmi_count;
> unsigned int mce_count;
> diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
> index e4b194f069fb..069e5716a68d 100644
> --- a/xen/arch/x86/include/asm/softirq.h
> +++ b/xen/arch/x86/include/asm/softirq.h
> @@ -1,6 +1,8 @@
> #ifndef __ASM_SOFTIRQ_H__
> #define __ASM_SOFTIRQ_H__
>
> +#include <asm/system.h>
> +
> #define NMI_SOFTIRQ (NR_COMMON_SOFTIRQS + 0)
> #define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
> #define VCPU_KICK_SOFTIRQ (NR_COMMON_SOFTIRQS + 2)
> @@ -9,4 +11,36 @@
> #define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
> #define NR_ARCH_SOFTIRQS 5
>
> +/*
> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
> + * skipped, false if the IPI cannot be skipped.
> + *
> + * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
> + * set softirq @nr while also observing in_mwait in a race-free way.
> + */
> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
> +{
> + uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
> + uint64_t old, new;
... so that you also use it here? And the return check below could become
new.in_mwait instead of having to use a bitmask?
Thanks, Roger.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-03 14:07 ` Jan Beulich
@ 2025-07-03 17:29 ` Andrew Cooper
0 siblings, 0 replies; 35+ messages in thread
From: Andrew Cooper @ 2025-07-03 17:29 UTC (permalink / raw)
To: Jan Beulich
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 03/07/2025 3:07 pm, Jan Beulich wrote:
> On 03.07.2025 13:59, Andrew Cooper wrote:
>> On 03/07/2025 10:01 am, Jan Beulich wrote:
>>> On 02.07.2025 16:41, Andrew Cooper wrote:
>>>> @@ -452,6 +466,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>>>> mwait(eax, ecx);
>>>> spec_ctrl_exit_idle(info);
>>>> }
>>>> +
>>>> + alternative_io("movb $0, %[in_mwait]",
>>>> + "", X86_BUG_MONITOR,
>>>> + [in_mwait] "=m" (stat->in_mwait));
>>> This doesn't strictly need to be an alternative, does it? We can save a
>>> memory write in the buggy case, but that likely makes at most a marginal
>>> difference.
>> It doesn't *need* to be an alternative. It could be:
>>
>> if ( !cpu_bug_monitor )
>> ACCESS_ONCE(stat->in_mwait) = true;
>>
>> but getting rid of the branch is an advantage too.
> That's the other alternative blob. What I mean that here it could simply
> be
>
> ACCESS_ONCE(stat->in_mwait) = false;
>
> without any conditional.
It could. Or it could be consistent with it's other half.
>
>>>> --- a/xen/arch/x86/include/asm/hardirq.h
>>>> +++ b/xen/arch/x86/include/asm/hardirq.h
>>>> @@ -5,7 +5,19 @@
>>>> #include <xen/types.h>
>>>>
>>>> typedef struct {
>>>> - unsigned int __softirq_pending;
>>>> + /*
>>>> + * The layout is important. Any CPU can set bits in __softirq_pending,
>>>> + * but in_mwait is a status bit owned by the CPU. softirq_mwait_raw must
>>>> + * cover both, and must be in a single cacheline.
>>>> + */
>>>> + union {
>>>> + struct {
>>>> + unsigned int __softirq_pending;
>>>> + bool in_mwait;
>>>> + };
>>>> + uint64_t softirq_mwait_raw;
>>>> + };
>>> To guard against someone changing e.g. __softirq_pending to unsigned long
>>> while ignoring the comment, how about adding a BUILD_BUG_ON() checking both
>>> parts of the union are the same size (which of course would require naming
>>> either the union field within the containing struct or its struct member)?
>> That turns out to be very hard because of the definition of
>> softirq_pending() being common. More below.
> Hmm, yes, I see. Then maybe
>
> BUILD_BUG_ON(offsetof(irq_cpustat_t, in_mwait) +
> sizeof(irq_stat[0].in_mwait) >
> offsetof(irq_cpustat_t, softirq_mwait_raw) +
> sizeof(irq_stat[0].softirq_mwait_raw));
>
> (or something substantially similar using e.g. typeof())?
>
>>>> @@ -9,4 +11,36 @@
>>>> #define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
>>>> #define NR_ARCH_SOFTIRQS 5
>>>>
>>>> +/*
>>>> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
>>>> + * skipped, false if the IPI cannot be skipped.
>>>> + *
>>>> + * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
>>>> + * set softirq @nr while also observing in_mwait in a race-free way.
>>>> + */
>>>> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
>>>> +{
>>>> + uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
>>>> + uint64_t old, new;
>>>> + unsigned int softirq = 1U << nr;
>>>> +
>>>> + old = ACCESS_ONCE(*ptr);
>>>> +
>>>> + do {
>>>> + if ( old & softirq )
>>>> + /* Softirq already pending, nothing to do. */
>>>> + return true;
>>>> +
>>>> + new = old | softirq;
>>>> +
>>>> + } while ( (old = cmpxchg(ptr, old, new)) != new );
>>> Don't you mean
>>>
>>> } while ( (new = cmpxchg(ptr, old, new)) != old );
>>>
>>> here
>> No. I'm pretty sure I don't.
>>
>>> (with new latched ahead of the loop and old set from new inside it),
>>> like we have it elsewhere when we avoid the use of yet another variable,
>>> e.g. in inc_linear_{entries,uses}()?
>> Eww, no. Having new and old backwards like that is borderline
>> obfucation, and is unnecessary cognitive complexity for the reader.
> Why backwards? You want to compare the (original) value read against the
> expected old value. The way you wrote it you'll do at least one extra
> loop iteration, as you wait for the fetched (original) value to equal
> "new".
Hmm, yes, that's not quite what I intended, but I'm also not happy
writing anything of the form "new = cmpxchg()". It's plain wrong for
the semantics of cmpxchg.
~Andrew
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-03 16:36 ` Roger Pau Monné
@ 2025-07-03 17:48 ` Andrew Cooper
2025-07-04 7:24 ` Roger Pau Monné
0 siblings, 1 reply; 35+ messages in thread
From: Andrew Cooper @ 2025-07-03 17:48 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Xen-devel, Jan Beulich, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On 03/07/2025 5:36 pm, Roger Pau Monné wrote:
> On Wed, Jul 02, 2025 at 03:41:19PM +0100, Andrew Cooper wrote:
>> In order elide IPIs, we must be able to identify whether a target CPU is in
>> MWAIT at the point it is woken up. i.e. the store to wake it up must also
>> identify the state.
>>
>> Create a new in_mwait variable beside __softirq_pending, so we can use a
>> CMPXCHG to set the softirq while also observing the status safely. Implement
>> an x86 version of arch_pend_softirq() which does this.
>>
>> In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
>> precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
>> advertising in_mwait.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> ---
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Roger Pau Monné <roger.pau@citrix.com>
>> CC: Anthony PERARD <anthony.perard@vates.tech>
>> CC: Michal Orzel <michal.orzel@amd.com>
>> CC: Julien Grall <julien@xen.org>
>> CC: Stefano Stabellini <sstabellini@kernel.org>
>>
>> This is modelled after Linux's TIF_NEED_RESCHED (single bit equivelent of all
>> of __softirq_pending), and TIF_POLLING_NRFLAG (arch-neutral "in_mwait").
>>
>> In Linux, they're both in the same flags field, which adds complexity. In
>> Xen, __softirq_pending is already unsigned long for everything other than x86,
>> so adding an arch-neutral "in_mwait" would need wider changes.
>> ---
>> xen/arch/x86/acpi/cpu_idle.c | 20 +++++++++++++++++-
>> xen/arch/x86/include/asm/hardirq.h | 14 +++++++++++-
>> xen/arch/x86/include/asm/softirq.h | 34 ++++++++++++++++++++++++++++++
>> 3 files changed, 66 insertions(+), 2 deletions(-)
>>
>> diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
>> index caa0fef0b3b1..13040ef467a0 100644
>> --- a/xen/arch/x86/acpi/cpu_idle.c
>> +++ b/xen/arch/x86/acpi/cpu_idle.c
>> @@ -439,7 +439,21 @@ __initcall(cpu_idle_key_init);
>> void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>> {
>> unsigned int cpu = smp_processor_id();
>> - const unsigned int *this_softirq_pending = &softirq_pending(cpu);
>> + irq_cpustat_t *stat = &irq_stat[cpu];
>> + const unsigned int *this_softirq_pending = &stat->__softirq_pending;
>> +
>> + /*
>> + * By setting in_mwait, we promise to other CPUs that we'll notice changes
>> + * to __softirq_pending without being sent an IPI. We achieve this by
>> + * either not going to sleep, or by having hardware notice on our behalf.
>> + *
>> + * Some errata exist where MONITOR doesn't work properly, and the
>> + * workaround is to force the use of an IPI. Cause this to happen by
>> + * simply not advertising outselves as being in_mwait.
>> + */
>> + alternative_io("movb $1, %[in_mwait]",
>> + "", X86_BUG_MONITOR,
>> + [in_mwait] "=m" (stat->in_mwait));
>>
>> monitor(this_softirq_pending, 0, 0);
>> smp_mb();
>> @@ -452,6 +466,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>> mwait(eax, ecx);
>> spec_ctrl_exit_idle(info);
>> }
>> +
>> + alternative_io("movb $0, %[in_mwait]",
>> + "", X86_BUG_MONITOR,
>> + [in_mwait] "=m" (stat->in_mwait));
> Isn't it a bit overkill to use alternatives here when you could have a
> conditional based on a global variable to decide whether to set/clear
> in_mwait?
I view it differently. Why should the common case suffer overhead
(extra memory read and conditional branch) to work around 3 buggy pieces
of hardware in a common path?
>> }
>>
>> static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
>> diff --git a/xen/arch/x86/include/asm/hardirq.h b/xen/arch/x86/include/asm/hardirq.h
>> index f3e93cc9b507..1647cff04dc8 100644
>> --- a/xen/arch/x86/include/asm/hardirq.h
>> +++ b/xen/arch/x86/include/asm/hardirq.h
>> @@ -5,7 +5,19 @@
>> #include <xen/types.h>
>>
>> typedef struct {
>> - unsigned int __softirq_pending;
>> + /*
>> + * The layout is important. Any CPU can set bits in __softirq_pending,
>> + * but in_mwait is a status bit owned by the CPU. softirq_mwait_raw must
>> + * cover both, and must be in a single cacheline.
>> + */
>> + union {
>> + struct {
>> + unsigned int __softirq_pending;
>> + bool in_mwait;
> Given the usage in assembly where you store 0 and 1, this might better
> be a uint8_t then?
We have loads of asm code which accesses bool's like this. C guarantees
that sizeof(bool) >= sizeof(char).
>
>> + };
>> + uint64_t softirq_mwait_raw;
>> + };
> This could be a named union type ...
>
>> +
>> unsigned int __local_irq_count;
>> unsigned int nmi_count;
>> unsigned int mce_count;
>> diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
>> index e4b194f069fb..069e5716a68d 100644
>> --- a/xen/arch/x86/include/asm/softirq.h
>> +++ b/xen/arch/x86/include/asm/softirq.h
>> @@ -1,6 +1,8 @@
>> #ifndef __ASM_SOFTIRQ_H__
>> #define __ASM_SOFTIRQ_H__
>>
>> +#include <asm/system.h>
>> +
>> #define NMI_SOFTIRQ (NR_COMMON_SOFTIRQS + 0)
>> #define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
>> #define VCPU_KICK_SOFTIRQ (NR_COMMON_SOFTIRQS + 2)
>> @@ -9,4 +11,36 @@
>> #define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
>> #define NR_ARCH_SOFTIRQS 5
>>
>> +/*
>> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
>> + * skipped, false if the IPI cannot be skipped.
>> + *
>> + * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
>> + * set softirq @nr while also observing in_mwait in a race-free way.
>> + */
>> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
>> +{
>> + uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
>> + uint64_t old, new;
> ... so that you also use it here?
No, it cant. The of softirq_pending() in common code requires no
intermediate field names, and I'm not untangling that mess in a series
wanting backporting.
~Andrew
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq()
2025-07-03 16:21 ` Roger Pau Monné
@ 2025-07-04 7:23 ` Jan Beulich
2025-07-04 7:55 ` Roger Pau Monné
0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2025-07-04 7:23 UTC (permalink / raw)
To: Roger Pau Monné, Andrew Cooper
Cc: Xen-devel, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini
On 03.07.2025 18:21, Roger Pau Monné wrote:
> On Wed, Jul 02, 2025 at 03:41:18PM +0100, Andrew Cooper wrote:
>> --- a/xen/include/xen/softirq.h
>> +++ b/xen/include/xen/softirq.h
>> @@ -23,6 +23,22 @@ enum {
>>
>> #define NR_SOFTIRQS (NR_COMMON_SOFTIRQS + NR_ARCH_SOFTIRQS)
>>
>> +/*
>> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
>> + * skipped, false if the IPI cannot be skipped.
>> + */
>> +#ifndef arch_pend_softirq
>> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
>
> Nit: I would maybe it arch_set_softirq(), I find `pend` not that clear
> (I would rather fully spell `pending` instead).
I, too, did wonder about the naming here. But using "pending" as you suggest
has the effect of giving the function a name we would normally associate with
a predicate ("Is it pending?"), whereas here the function is used to _mark_ a
softirq as pending. Hence in the end I didn't comment at all; I'd be fine
with "set", but I'm also okay with "pend".
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-03 17:48 ` Andrew Cooper
@ 2025-07-04 7:24 ` Roger Pau Monné
2025-07-04 16:13 ` Andrew Cooper
0 siblings, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2025-07-04 7:24 UTC (permalink / raw)
To: Andrew Cooper
Cc: Xen-devel, Jan Beulich, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On Thu, Jul 03, 2025 at 06:48:23PM +0100, Andrew Cooper wrote:
> On 03/07/2025 5:36 pm, Roger Pau Monné wrote:
> > On Wed, Jul 02, 2025 at 03:41:19PM +0100, Andrew Cooper wrote:
> >> In order elide IPIs, we must be able to identify whether a target CPU is in
> >> MWAIT at the point it is woken up. i.e. the store to wake it up must also
> >> identify the state.
> >>
> >> Create a new in_mwait variable beside __softirq_pending, so we can use a
> >> CMPXCHG to set the softirq while also observing the status safely. Implement
> >> an x86 version of arch_pend_softirq() which does this.
> >>
> >> In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
> >> precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
> >> advertising in_mwait.
> >>
> >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> >> ---
> >> CC: Jan Beulich <JBeulich@suse.com>
> >> CC: Roger Pau Monné <roger.pau@citrix.com>
> >> CC: Anthony PERARD <anthony.perard@vates.tech>
> >> CC: Michal Orzel <michal.orzel@amd.com>
> >> CC: Julien Grall <julien@xen.org>
> >> CC: Stefano Stabellini <sstabellini@kernel.org>
> >>
> >> This is modelled after Linux's TIF_NEED_RESCHED (single bit equivelent of all
> >> of __softirq_pending), and TIF_POLLING_NRFLAG (arch-neutral "in_mwait").
> >>
> >> In Linux, they're both in the same flags field, which adds complexity. In
> >> Xen, __softirq_pending is already unsigned long for everything other than x86,
> >> so adding an arch-neutral "in_mwait" would need wider changes.
> >> ---
> >> xen/arch/x86/acpi/cpu_idle.c | 20 +++++++++++++++++-
> >> xen/arch/x86/include/asm/hardirq.h | 14 +++++++++++-
> >> xen/arch/x86/include/asm/softirq.h | 34 ++++++++++++++++++++++++++++++
> >> 3 files changed, 66 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
> >> index caa0fef0b3b1..13040ef467a0 100644
> >> --- a/xen/arch/x86/acpi/cpu_idle.c
> >> +++ b/xen/arch/x86/acpi/cpu_idle.c
> >> @@ -439,7 +439,21 @@ __initcall(cpu_idle_key_init);
> >> void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
> >> {
> >> unsigned int cpu = smp_processor_id();
> >> - const unsigned int *this_softirq_pending = &softirq_pending(cpu);
> >> + irq_cpustat_t *stat = &irq_stat[cpu];
> >> + const unsigned int *this_softirq_pending = &stat->__softirq_pending;
> >> +
> >> + /*
> >> + * By setting in_mwait, we promise to other CPUs that we'll notice changes
> >> + * to __softirq_pending without being sent an IPI. We achieve this by
> >> + * either not going to sleep, or by having hardware notice on our behalf.
> >> + *
> >> + * Some errata exist where MONITOR doesn't work properly, and the
> >> + * workaround is to force the use of an IPI. Cause this to happen by
> >> + * simply not advertising outselves as being in_mwait.
> >> + */
> >> + alternative_io("movb $1, %[in_mwait]",
> >> + "", X86_BUG_MONITOR,
> >> + [in_mwait] "=m" (stat->in_mwait));
> >>
> >> monitor(this_softirq_pending, 0, 0);
> >> smp_mb();
> >> @@ -452,6 +466,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
> >> mwait(eax, ecx);
> >> spec_ctrl_exit_idle(info);
> >> }
> >> +
> >> + alternative_io("movb $0, %[in_mwait]",
> >> + "", X86_BUG_MONITOR,
> >> + [in_mwait] "=m" (stat->in_mwait));
> > Isn't it a bit overkill to use alternatives here when you could have a
> > conditional based on a global variable to decide whether to set/clear
> > in_mwait?
>
> I view it differently. Why should the common case suffer overhead
> (extra memory read and conditional branch) to work around 3 buggy pieces
> of hardware in a common path?
TBH I don't think the extra read and conditional would make any
difference in this specific path performance wise. Either the CPU is
going to sleep and has nothing to do, or the cost of getting back from
idle will shadow the overhead of an extra read and conditional.
It's all a question of taste I guess. If it was me I would set/clear
stat->in_mwait unconditionally in mwait_idle_with_hints(), and then in
arch_pend_softirq() I would return:
return new & (1UL << 32) || force_mwait_ipi_wakeup;
Or AFAICT you could possibly skip the cmpxchg() in the
force_mwait_ipi_wakeup case in arch_pend_softirq(), and just do:
if ( force_mwait_ipi_wakeup )
return test_and_set_bit(nr, &softirq_pending(cpu));
Because in that case Xen doesn't care about the in_mwait status. It
would be an optimization for the buggy hardware that already has to
deal with the extra cost of always sending an IPI.
> >> + };
> >> + uint64_t softirq_mwait_raw;
> >> + };
> > This could be a named union type ...
> >
> >> +
> >> unsigned int __local_irq_count;
> >> unsigned int nmi_count;
> >> unsigned int mce_count;
> >> diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
> >> index e4b194f069fb..069e5716a68d 100644
> >> --- a/xen/arch/x86/include/asm/softirq.h
> >> +++ b/xen/arch/x86/include/asm/softirq.h
> >> @@ -1,6 +1,8 @@
> >> #ifndef __ASM_SOFTIRQ_H__
> >> #define __ASM_SOFTIRQ_H__
> >>
> >> +#include <asm/system.h>
> >> +
> >> #define NMI_SOFTIRQ (NR_COMMON_SOFTIRQS + 0)
> >> #define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
> >> #define VCPU_KICK_SOFTIRQ (NR_COMMON_SOFTIRQS + 2)
> >> @@ -9,4 +11,36 @@
> >> #define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
> >> #define NR_ARCH_SOFTIRQS 5
> >>
> >> +/*
> >> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
> >> + * skipped, false if the IPI cannot be skipped.
> >> + *
> >> + * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
> >> + * set softirq @nr while also observing in_mwait in a race-free way.
> >> + */
> >> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
> >> +{
> >> + uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
> >> + uint64_t old, new;
> > ... so that you also use it here?
>
> No, it cant. The of softirq_pending() in common code requires no
> intermediate field names, and I'm not untangling that mess in a series
> wanting backporting.
Oh, I see. Never mind then, something for a later cleanup if
anything.
Thanks, Roger.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-02 14:41 ` [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm Andrew Cooper
2025-07-03 9:01 ` Jan Beulich
2025-07-03 16:36 ` Roger Pau Monné
@ 2025-07-04 7:52 ` Roger Pau Monné
2025-07-04 16:14 ` Andrew Cooper
2 siblings, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2025-07-04 7:52 UTC (permalink / raw)
To: Andrew Cooper
Cc: Xen-devel, Jan Beulich, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On Wed, Jul 02, 2025 at 03:41:19PM +0100, Andrew Cooper wrote:
> diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
> index e4b194f069fb..069e5716a68d 100644
> --- a/xen/arch/x86/include/asm/softirq.h
> +++ b/xen/arch/x86/include/asm/softirq.h
> @@ -1,6 +1,8 @@
> #ifndef __ASM_SOFTIRQ_H__
> #define __ASM_SOFTIRQ_H__
>
> +#include <asm/system.h>
> +
> #define NMI_SOFTIRQ (NR_COMMON_SOFTIRQS + 0)
> #define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
> #define VCPU_KICK_SOFTIRQ (NR_COMMON_SOFTIRQS + 2)
> @@ -9,4 +11,36 @@
> #define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
> #define NR_ARCH_SOFTIRQS 5
>
> +/*
> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
> + * skipped, false if the IPI cannot be skipped.
> + *
> + * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
> + * set softirq @nr while also observing in_mwait in a race-free way.
> + */
> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
> +{
> + uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
> + uint64_t old, new;
> + unsigned int softirq = 1U << nr;
> +
> + old = ACCESS_ONCE(*ptr);
> +
> + do {
> + if ( old & softirq )
> + /* Softirq already pending, nothing to do. */
> + return true;
> +
> + new = old | softirq;
> +
> + } while ( (old = cmpxchg(ptr, old, new)) != new );
> +
> + /*
> + * We have caused the softirq to become pending. If in_mwait was set, the
> + * target CPU will notice the modification and act on it.
> + */
> + return new & (1UL << 32);
Maybe I haven't got enough coffee yet, but if we do the cmpxchg()
won't it be enough to send the IPI when softirq is first set, but not
necessarily for each extra softirq that's set if there's already one
enabled? IOW:
uint64_t new, softirq = 1U << nr;
uint64_t old = ACCESS_ONCE(*ptr);
do {
if ( old & softirq )
/* Softirq already pending, nothing to do. */
return true;
new = old | softirq;
} while ( (old = cmpxchg(ptr, old, new)) != (new & ~softirq) );
/*
* We have caused the softirq to become pending when it was
* previously unset. If in_mwait was set, the target CPU will
* notice the modification and act on it.
*
* Reduce the logic to simply check whether the old value was
* different than 0: it will either be the in_mwait field or any
* already pending softirqs.
*/
return old;
Thanks, Roger.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq()
2025-07-04 7:23 ` Jan Beulich
@ 2025-07-04 7:55 ` Roger Pau Monné
2025-07-04 8:25 ` Jan Beulich
0 siblings, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2025-07-04 7:55 UTC (permalink / raw)
To: Jan Beulich
Cc: Andrew Cooper, Xen-devel, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On Fri, Jul 04, 2025 at 09:23:29AM +0200, Jan Beulich wrote:
> On 03.07.2025 18:21, Roger Pau Monné wrote:
> > On Wed, Jul 02, 2025 at 03:41:18PM +0100, Andrew Cooper wrote:
> >> --- a/xen/include/xen/softirq.h
> >> +++ b/xen/include/xen/softirq.h
> >> @@ -23,6 +23,22 @@ enum {
> >>
> >> #define NR_SOFTIRQS (NR_COMMON_SOFTIRQS + NR_ARCH_SOFTIRQS)
> >>
> >> +/*
> >> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
> >> + * skipped, false if the IPI cannot be skipped.
> >> + */
> >> +#ifndef arch_pend_softirq
> >> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
> >
> > Nit: I would maybe it arch_set_softirq(), I find `pend` not that clear
> > (I would rather fully spell `pending` instead).
>
> I, too, did wonder about the naming here. But using "pending" as you suggest
> has the effect of giving the function a name we would normally associate with
> a predicate ("Is it pending?"), whereas here the function is used to _mark_ a
> softirq as pending. Hence in the end I didn't comment at all; I'd be fine
> with "set", but I'm also okay with "pend".
It's a set and check kind of function, so I don't care much. Just
found the pend a bit too short, I don't think we usually abbreviate
pending to pend. In fact we use pend in more than one variable name
to store the end of a physical memory region.
Thanks, Roger.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq()
2025-07-04 7:55 ` Roger Pau Monné
@ 2025-07-04 8:25 ` Jan Beulich
2025-07-04 16:00 ` Andrew Cooper
0 siblings, 1 reply; 35+ messages in thread
From: Jan Beulich @ 2025-07-04 8:25 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Andrew Cooper, Xen-devel, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On 04.07.2025 09:55, Roger Pau Monné wrote:
> On Fri, Jul 04, 2025 at 09:23:29AM +0200, Jan Beulich wrote:
>> On 03.07.2025 18:21, Roger Pau Monné wrote:
>>> On Wed, Jul 02, 2025 at 03:41:18PM +0100, Andrew Cooper wrote:
>>>> --- a/xen/include/xen/softirq.h
>>>> +++ b/xen/include/xen/softirq.h
>>>> @@ -23,6 +23,22 @@ enum {
>>>>
>>>> #define NR_SOFTIRQS (NR_COMMON_SOFTIRQS + NR_ARCH_SOFTIRQS)
>>>>
>>>> +/*
>>>> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
>>>> + * skipped, false if the IPI cannot be skipped.
>>>> + */
>>>> +#ifndef arch_pend_softirq
>>>> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
>>>
>>> Nit: I would maybe it arch_set_softirq(), I find `pend` not that clear
>>> (I would rather fully spell `pending` instead).
>>
>> I, too, did wonder about the naming here. But using "pending" as you suggest
>> has the effect of giving the function a name we would normally associate with
>> a predicate ("Is it pending?"), whereas here the function is used to _mark_ a
>> softirq as pending. Hence in the end I didn't comment at all; I'd be fine
>> with "set", but I'm also okay with "pend".
>
> It's a set and check kind of function, so I don't care much. Just
> found the pend a bit too short, I don't think we usually abbreviate
> pending to pend.
Aiui it's not an abbreviation, but kind of a verb (inverse-)derived from pending.
I don't know whether that's "official"; my dictionary doesn't have it.
Jan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 5/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints()
2025-07-03 13:30 ` Jan Beulich
@ 2025-07-04 15:45 ` Andrew Cooper
0 siblings, 0 replies; 35+ messages in thread
From: Andrew Cooper @ 2025-07-04 15:45 UTC (permalink / raw)
To: Jan Beulich
Cc: Roger Pau Monné, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini, Xen-devel
On 03/07/2025 2:30 pm, Jan Beulich wrote:
> On 03.07.2025 14:37, Andrew Cooper wrote:
>> On 03/07/2025 10:24 am, Jan Beulich wrote:
>>> On 02.07.2025 16:41, Andrew Cooper wrote:
>>>> With the recent simplifications, it becomes obvious that smp_mb() isn't the
>>>> right barrier; all we need is a compiler barrier.
>>>>
>>>> Include this in monitor() itself, along with an explantion.
>>> Ah, here we go. As per my comment on patch 4, would this perhaps better move
>>> ahead (which however would require a bit of an adjustment to the description)?
>> As said, it's not necessary in practice.
> As said where? All you say here is that a memory barrier is needed. Perhaps
> my use of "ahead" was ambiguous? I meant "move ahead in the series", not
> "move ahead in code". Apart from this as a possibility I fear I don't
> understand.
I did take it to mean "ahead in the series".
Your comment in patch 4 talks about alternative(), alternative_io() and
barriers. I stated (admittedly without reference) that the barrier
between two alternatives() doesn't matter because of their volatileness.
It can move in the series, because it is genuinely independent and
unrelated to patch 4 AFAICT.
>> Nothing good can come of having any loads hoisted above MWAIT, but from
>> a programmers point of view, it's indistinguishable from e.g. taking an
>> SMI. If there's a correctness issue, it's not MWAIT's fault.
> Well, yes, but what in the code is it that tells the compiler not to? Up
> to and including LTO, should we ever get that to work again. This
> specifically may be why mwait() may need to gain one, despite not itself
> dealing with any memory (operands).
In practice, mwait() is surrounded by spec_ctrl_{enter,exit}_idle() and
nothing is crossing those. I'm going to leave the mwait side of things
alone for now.
But even with LTO, if hoisting a read causes a problem, that's a bug in
whatever got hoisted, not in MWAIT.
~Andrew
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq()
2025-07-04 8:25 ` Jan Beulich
@ 2025-07-04 16:00 ` Andrew Cooper
0 siblings, 0 replies; 35+ messages in thread
From: Andrew Cooper @ 2025-07-04 16:00 UTC (permalink / raw)
To: Jan Beulich, Roger Pau Monné
Cc: Xen-devel, Anthony PERARD, Michal Orzel, Julien Grall,
Stefano Stabellini
On 04/07/2025 9:25 am, Jan Beulich wrote:
> On 04.07.2025 09:55, Roger Pau Monné wrote:
>> On Fri, Jul 04, 2025 at 09:23:29AM +0200, Jan Beulich wrote:
>>> On 03.07.2025 18:21, Roger Pau Monné wrote:
>>>> On Wed, Jul 02, 2025 at 03:41:18PM +0100, Andrew Cooper wrote:
>>>>> --- a/xen/include/xen/softirq.h
>>>>> +++ b/xen/include/xen/softirq.h
>>>>> @@ -23,6 +23,22 @@ enum {
>>>>>
>>>>> #define NR_SOFTIRQS (NR_COMMON_SOFTIRQS + NR_ARCH_SOFTIRQS)
>>>>>
>>>>> +/*
>>>>> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
>>>>> + * skipped, false if the IPI cannot be skipped.
>>>>> + */
>>>>> +#ifndef arch_pend_softirq
>>>>> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
>>>> Nit: I would maybe it arch_set_softirq(), I find `pend` not that clear
>>>> (I would rather fully spell `pending` instead).
>>> I, too, did wonder about the naming here. But using "pending" as you suggest
>>> has the effect of giving the function a name we would normally associate with
>>> a predicate ("Is it pending?"), whereas here the function is used to _mark_ a
>>> softirq as pending. Hence in the end I didn't comment at all; I'd be fine
>>> with "set", but I'm also okay with "pend".
>> It's a set and check kind of function, so I don't care much. Just
>> found the pend a bit too short, I don't think we usually abbreviate
>> pending to pend.
> Aiui it's not an abbreviation, but kind of a verb (inverse-)derived from pending.
> I don't know whether that's "official"; my dictionary doesn't have it.
It's used frequently in some circles, meaning "to make pending".
But I've changed the name because I don't care enough to argue.
~Andrew
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-04 7:24 ` Roger Pau Monné
@ 2025-07-04 16:13 ` Andrew Cooper
0 siblings, 0 replies; 35+ messages in thread
From: Andrew Cooper @ 2025-07-04 16:13 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Xen-devel, Jan Beulich, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On 04/07/2025 8:24 am, Roger Pau Monné wrote:
> On Thu, Jul 03, 2025 at 06:48:23PM +0100, Andrew Cooper wrote:
>> On 03/07/2025 5:36 pm, Roger Pau Monné wrote:
>>> On Wed, Jul 02, 2025 at 03:41:19PM +0100, Andrew Cooper wrote:
>>>> In order elide IPIs, we must be able to identify whether a target CPU is in
>>>> MWAIT at the point it is woken up. i.e. the store to wake it up must also
>>>> identify the state.
>>>>
>>>> Create a new in_mwait variable beside __softirq_pending, so we can use a
>>>> CMPXCHG to set the softirq while also observing the status safely. Implement
>>>> an x86 version of arch_pend_softirq() which does this.
>>>>
>>>> In mwait_idle_with_hints(), advertise in_mwait, with an explanation of
>>>> precisely what it means. X86_BUG_MONITOR can be accounted for simply by not
>>>> advertising in_mwait.
>>>>
>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> ---
>>>> CC: Jan Beulich <JBeulich@suse.com>
>>>> CC: Roger Pau Monné <roger.pau@citrix.com>
>>>> CC: Anthony PERARD <anthony.perard@vates.tech>
>>>> CC: Michal Orzel <michal.orzel@amd.com>
>>>> CC: Julien Grall <julien@xen.org>
>>>> CC: Stefano Stabellini <sstabellini@kernel.org>
>>>>
>>>> This is modelled after Linux's TIF_NEED_RESCHED (single bit equivelent of all
>>>> of __softirq_pending), and TIF_POLLING_NRFLAG (arch-neutral "in_mwait").
>>>>
>>>> In Linux, they're both in the same flags field, which adds complexity. In
>>>> Xen, __softirq_pending is already unsigned long for everything other than x86,
>>>> so adding an arch-neutral "in_mwait" would need wider changes.
>>>> ---
>>>> xen/arch/x86/acpi/cpu_idle.c | 20 +++++++++++++++++-
>>>> xen/arch/x86/include/asm/hardirq.h | 14 +++++++++++-
>>>> xen/arch/x86/include/asm/softirq.h | 34 ++++++++++++++++++++++++++++++
>>>> 3 files changed, 66 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
>>>> index caa0fef0b3b1..13040ef467a0 100644
>>>> --- a/xen/arch/x86/acpi/cpu_idle.c
>>>> +++ b/xen/arch/x86/acpi/cpu_idle.c
>>>> @@ -439,7 +439,21 @@ __initcall(cpu_idle_key_init);
>>>> void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>>>> {
>>>> unsigned int cpu = smp_processor_id();
>>>> - const unsigned int *this_softirq_pending = &softirq_pending(cpu);
>>>> + irq_cpustat_t *stat = &irq_stat[cpu];
>>>> + const unsigned int *this_softirq_pending = &stat->__softirq_pending;
>>>> +
>>>> + /*
>>>> + * By setting in_mwait, we promise to other CPUs that we'll notice changes
>>>> + * to __softirq_pending without being sent an IPI. We achieve this by
>>>> + * either not going to sleep, or by having hardware notice on our behalf.
>>>> + *
>>>> + * Some errata exist where MONITOR doesn't work properly, and the
>>>> + * workaround is to force the use of an IPI. Cause this to happen by
>>>> + * simply not advertising outselves as being in_mwait.
>>>> + */
>>>> + alternative_io("movb $1, %[in_mwait]",
>>>> + "", X86_BUG_MONITOR,
>>>> + [in_mwait] "=m" (stat->in_mwait));
>>>>
>>>> monitor(this_softirq_pending, 0, 0);
>>>> smp_mb();
>>>> @@ -452,6 +466,10 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
>>>> mwait(eax, ecx);
>>>> spec_ctrl_exit_idle(info);
>>>> }
>>>> +
>>>> + alternative_io("movb $0, %[in_mwait]",
>>>> + "", X86_BUG_MONITOR,
>>>> + [in_mwait] "=m" (stat->in_mwait));
>>> Isn't it a bit overkill to use alternatives here when you could have a
>>> conditional based on a global variable to decide whether to set/clear
>>> in_mwait?
>> I view it differently. Why should the common case suffer overhead
>> (extra memory read and conditional branch) to work around 3 buggy pieces
>> of hardware in a common path?
> TBH I don't think the extra read and conditional would make any
> difference in this specific path performance wise. Either the CPU is
> going to sleep and has nothing to do, or the cost of getting back from
> idle will shadow the overhead of an extra read and conditional.
>
> It's all a question of taste I guess. If it was me I would set/clear
> stat->in_mwait unconditionally in mwait_idle_with_hints(), and then in
> arch_pend_softirq() I would return:
>
> return new & (1UL << 32) || force_mwait_ipi_wakeup;
>
> Or AFAICT you could possibly skip the cmpxchg() in the
> force_mwait_ipi_wakeup case in arch_pend_softirq(), and just do:
>
> if ( force_mwait_ipi_wakeup )
> return test_and_set_bit(nr, &softirq_pending(cpu));
>
> Because in that case Xen doesn't care about the in_mwait status. It
> would be an optimization for the buggy hardware that already has to
> deal with the extra cost of always sending an IPI.
These will both function, but they're both poor options.
They're in a loop over a cpumask, and because of the full barriers in
the atomic options, the read cannot be hoisted or the loop split,
because the compiler has been told "the value may change on each loop
iteration".
This was a code-gen/perf note I had on your original errata fix, which I
said "lets fix later". Later is now.
The other part of the fix is arch_pend_softirq() is static inline, and
not out-of-line in a separate TU.
>>>> + };
>>>> + uint64_t softirq_mwait_raw;
>>>> + };
>>> This could be a named union type ...
>>>
>>>> +
>>>> unsigned int __local_irq_count;
>>>> unsigned int nmi_count;
>>>> unsigned int mce_count;
>>>> diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
>>>> index e4b194f069fb..069e5716a68d 100644
>>>> --- a/xen/arch/x86/include/asm/softirq.h
>>>> +++ b/xen/arch/x86/include/asm/softirq.h
>>>> @@ -1,6 +1,8 @@
>>>> #ifndef __ASM_SOFTIRQ_H__
>>>> #define __ASM_SOFTIRQ_H__
>>>>
>>>> +#include <asm/system.h>
>>>> +
>>>> #define NMI_SOFTIRQ (NR_COMMON_SOFTIRQS + 0)
>>>> #define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
>>>> #define VCPU_KICK_SOFTIRQ (NR_COMMON_SOFTIRQS + 2)
>>>> @@ -9,4 +11,36 @@
>>>> #define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
>>>> #define NR_ARCH_SOFTIRQS 5
>>>>
>>>> +/*
>>>> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
>>>> + * skipped, false if the IPI cannot be skipped.
>>>> + *
>>>> + * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
>>>> + * set softirq @nr while also observing in_mwait in a race-free way.
>>>> + */
>>>> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
>>>> +{
>>>> + uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
>>>> + uint64_t old, new;
>>> ... so that you also use it here?
>> No, it cant. The of softirq_pending() in common code requires no
>> intermediate field names, and I'm not untangling that mess in a series
>> wanting backporting.
> Oh, I see. Never mind then, something for a later cleanup if
> anything.
Yeah, I've got further cleanup pending, although I don't have a good fix
for this specifically.
~Andrew
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm
2025-07-04 7:52 ` Roger Pau Monné
@ 2025-07-04 16:14 ` Andrew Cooper
0 siblings, 0 replies; 35+ messages in thread
From: Andrew Cooper @ 2025-07-04 16:14 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Xen-devel, Jan Beulich, Anthony PERARD, Michal Orzel,
Julien Grall, Stefano Stabellini
On 04/07/2025 8:52 am, Roger Pau Monné wrote:
> On Wed, Jul 02, 2025 at 03:41:19PM +0100, Andrew Cooper wrote:
>> diff --git a/xen/arch/x86/include/asm/softirq.h b/xen/arch/x86/include/asm/softirq.h
>> index e4b194f069fb..069e5716a68d 100644
>> --- a/xen/arch/x86/include/asm/softirq.h
>> +++ b/xen/arch/x86/include/asm/softirq.h
>> @@ -1,6 +1,8 @@
>> #ifndef __ASM_SOFTIRQ_H__
>> #define __ASM_SOFTIRQ_H__
>>
>> +#include <asm/system.h>
>> +
>> #define NMI_SOFTIRQ (NR_COMMON_SOFTIRQS + 0)
>> #define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
>> #define VCPU_KICK_SOFTIRQ (NR_COMMON_SOFTIRQS + 2)
>> @@ -9,4 +11,36 @@
>> #define HVM_DPCI_SOFTIRQ (NR_COMMON_SOFTIRQS + 4)
>> #define NR_ARCH_SOFTIRQS 5
>>
>> +/*
>> + * Ensure softirq @nr is pending on @cpu. Return true if an IPI can be
>> + * skipped, false if the IPI cannot be skipped.
>> + *
>> + * We use a CMPXCHG covering both __softirq_pending and in_mwait, in order to
>> + * set softirq @nr while also observing in_mwait in a race-free way.
>> + */
>> +static always_inline bool arch_pend_softirq(unsigned int nr, unsigned int cpu)
>> +{
>> + uint64_t *ptr = &irq_stat[cpu].softirq_mwait_raw;
>> + uint64_t old, new;
>> + unsigned int softirq = 1U << nr;
>> +
>> + old = ACCESS_ONCE(*ptr);
>> +
>> + do {
>> + if ( old & softirq )
>> + /* Softirq already pending, nothing to do. */
>> + return true;
>> +
>> + new = old | softirq;
>> +
>> + } while ( (old = cmpxchg(ptr, old, new)) != new );
>> +
>> + /*
>> + * We have caused the softirq to become pending. If in_mwait was set, the
>> + * target CPU will notice the modification and act on it.
>> + */
>> + return new & (1UL << 32);
> Maybe I haven't got enough coffee yet, but if we do the cmpxchg()
> won't it be enough to send the IPI when softirq is first set, but not
> necessarily for each extra softirq that's set if there's already one
> enabled?
It was me who didn't have enough coffee when writing the code. Already
fixed locally.
~Andrew
^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2025-07-04 16:14 UTC | newest]
Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-02 14:41 [PATCH 0/6] x86/idle: Multiple MWAIT fixes Andrew Cooper
2025-07-02 14:41 ` [PATCH 1/6] x86/idle: Remove broken MWAIT implementation Andrew Cooper
2025-07-03 16:01 ` Roger Pau Monné
2025-07-03 16:19 ` Andrew Cooper
2025-07-02 14:41 ` [PATCH 2/6] x86/idle: Convert force_mwait_ipi_wakeup to X86_BUG_MONITOR Andrew Cooper
2025-07-02 14:41 ` [PATCH 3/6] xen/softirq: Rework arch_skip_send_event_check() into arch_pend_softirq() Andrew Cooper
2025-07-03 8:11 ` Jan Beulich
2025-07-03 10:36 ` Andrew Cooper
2025-07-03 11:24 ` Jan Beulich
2025-07-03 16:21 ` Roger Pau Monné
2025-07-04 7:23 ` Jan Beulich
2025-07-04 7:55 ` Roger Pau Monné
2025-07-04 8:25 ` Jan Beulich
2025-07-04 16:00 ` Andrew Cooper
2025-07-02 14:41 ` [PATCH 4/6] x86/idle: Implement a new MWAIT IPI-elision algorithm Andrew Cooper
2025-07-03 9:01 ` Jan Beulich
2025-07-03 11:59 ` Andrew Cooper
2025-07-03 14:07 ` Jan Beulich
2025-07-03 17:29 ` Andrew Cooper
2025-07-03 16:36 ` Roger Pau Monné
2025-07-03 17:48 ` Andrew Cooper
2025-07-04 7:24 ` Roger Pau Monné
2025-07-04 16:13 ` Andrew Cooper
2025-07-04 7:52 ` Roger Pau Monné
2025-07-04 16:14 ` Andrew Cooper
2025-07-02 14:41 ` [PATCH 5/6] x86/idle: Drop incorrect smp_mb() in mwait_idle_with_hints() Andrew Cooper
2025-07-03 9:24 ` Jan Beulich
2025-07-03 12:37 ` Andrew Cooper
2025-07-03 13:30 ` Jan Beulich
2025-07-04 15:45 ` Andrew Cooper
2025-07-02 14:41 ` [PATCH 6/6] x86/idle: Fix buggy "x86/mwait-idle: enable interrupts before C1 on Xeons" Andrew Cooper
2025-07-03 9:35 ` Jan Beulich
2025-07-03 9:43 ` Jan Beulich
2025-07-03 12:10 ` Andrew Cooper
2025-07-03 13:11 ` Jan Beulich
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.