* [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry
@ 2024-11-08 12:29 Patryk Wlazlyn
2024-11-08 12:29 ` [PATCH v3 1/3] x86/smp: Allow calling mwait_play_dead with arbitrary hint Patryk Wlazlyn
` (4 more replies)
0 siblings, 5 replies; 51+ messages in thread
From: Patryk Wlazlyn @ 2024-11-08 12:29 UTC (permalink / raw)
To: x86
Cc: linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen, patryk.wlazlyn
Applied suggestions from Dave and Rafael.
Because we now call a smp function, we have to have ifdefs for
CONFIG_SMP or rely on mwait_play_dead_with_hint to return immediatelly
with an error on non-smp builds. I decided to do the later, but maybe we
should return -ENODEV (or other error constant) instead of 1. I am open
for suggestions.
Removing the existing "kexec hack" by bringing all offlined CPUs back
online before proceeding with the kexec would make it even simpler, but
I am not sure we can do that. It looks kind of obvious to me, but for
some reason the hack exist.
Changes since v2:
The whole approach changed, but there main things are below.
* Split mwait_play_dead (old code) into two parts:
1. Computing mwait hint based on cpuid leaf 0x5
2. Entering while(1) mwait loop with "kexec hack" handling
* Prefer cpuidle_play_dead over mwait_play_dead by reordering calls in
native_play_dead.
* Add implementation for enter_dead() handler in intel_idle that calls
executes old mwait_play_dead code, but with the mwait hint from the
cpuidle state table.
Patryk Wlazlyn (3):
x86/smp: Allow calling mwait_play_dead with arbitrary hint
x86/smp native_play_dead: Prefer cpuidle_play_dead() over
mwait_play_dead()
intel_idle: Provide enter_dead() handler for SRF
arch/x86/include/asm/smp.h | 6 ++++++
arch/x86/kernel/smpboot.c | 29 ++++++++++++++++++++---------
drivers/idle/intel_idle.c | 16 ++++++++++++++++
3 files changed, 42 insertions(+), 9 deletions(-)
--
2.47.0
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v3 1/3] x86/smp: Allow calling mwait_play_dead with arbitrary hint
2024-11-08 12:29 [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry Patryk Wlazlyn
@ 2024-11-08 12:29 ` Patryk Wlazlyn
2024-11-08 16:03 ` Dave Hansen
2024-11-08 12:29 ` [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead() Patryk Wlazlyn
` (3 subsequent siblings)
4 siblings, 1 reply; 51+ messages in thread
From: Patryk Wlazlyn @ 2024-11-08 12:29 UTC (permalink / raw)
To: x86
Cc: linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen, patryk.wlazlyn
The current implementation for looking up the mwait hint for the deepest
cstate depends on them to be continuous in range [0, NUM_SUBSTATES-1].
While that is correct on most Intel x86 platforms, it is not documented
and may not result in reaching the most optimized idle state on some of
them.
For example Intel's Sierra Forest report two C6 substates in cpuid leaf 5:
C6S (hint 0x22)
C6SP (hint 0x23)
Hints 0x20 and 0x21 are skipped entirely, causing the current
implementation to compute the wrong hint, when looking for the deepest
cstate for offlined CPU to enter. As a result, package with an offlined
CPU can never reach PC6.
Allow the idle driver to call mwait_play_dead() code with the forced
mwait hint, skipping the cpuid based computation.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
---
arch/x86/include/asm/smp.h | 6 ++++++
arch/x86/kernel/smpboot.c | 25 ++++++++++++++++++-------
2 files changed, 24 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index ca073f40698f..fbd275d6661a 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -114,6 +114,7 @@ void wbinvd_on_cpu(int cpu);
int wbinvd_on_all_cpus(void);
void smp_kick_mwait_play_dead(void);
+int mwait_play_dead_with_hint(unsigned long hint);
void native_smp_send_reschedule(int cpu);
void native_send_call_func_ipi(const struct cpumask *mask);
@@ -164,6 +165,11 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
{
return (struct cpumask *)cpumask_of(0);
}
+
+static inline int mwait_play_dead_with_hint(unsigned long eax_hint)
+{
+ return 1;
+}
#endif /* CONFIG_SMP */
#ifdef CONFIG_DEBUG_NMI_SELFTEST
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 0c35207320cb..44c40781bad6 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1270,13 +1270,14 @@ void play_dead_common(void)
local_irq_disable();
}
+int mwait_play_dead_with_hint(unsigned long eax_hint);
+
/*
* We need to flush the caches before going to sleep, lest we have
* dirty data in our caches when we come back up.
*/
-static inline void mwait_play_dead(void)
+static inline int mwait_play_dead(void)
{
- struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
unsigned int eax, ebx, ecx, edx;
unsigned int highest_cstate = 0;
unsigned int highest_subcstate = 0;
@@ -1284,13 +1285,13 @@ static inline void mwait_play_dead(void)
if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
- return;
+ return 1;
if (!this_cpu_has(X86_FEATURE_MWAIT))
- return;
+ return 1;
if (!this_cpu_has(X86_FEATURE_CLFLUSH))
- return;
+ return 1;
if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF)
- return;
+ return 1;
eax = CPUID_MWAIT_LEAF;
ecx = 0;
@@ -1314,6 +1315,13 @@ static inline void mwait_play_dead(void)
(highest_subcstate - 1);
}
+ return mwait_play_dead_with_hint(eax);
+}
+
+int mwait_play_dead_with_hint(unsigned long eax_hint)
+{
+ struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
+
/* Set up state for the kexec() hack below */
md->status = CPUDEAD_MWAIT_WAIT;
md->control = CPUDEAD_MWAIT_WAIT;
@@ -1333,7 +1341,7 @@ static inline void mwait_play_dead(void)
mb();
__monitor(md, 0, 0);
mb();
- __mwait(eax, 0);
+ __mwait(eax_hint, 0);
if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
/*
@@ -1353,6 +1361,9 @@ static inline void mwait_play_dead(void)
native_halt();
}
}
+
+ /* Never reached */
+ return 0;
}
/*
--
2.47.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-08 12:29 [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry Patryk Wlazlyn
2024-11-08 12:29 ` [PATCH v3 1/3] x86/smp: Allow calling mwait_play_dead with arbitrary hint Patryk Wlazlyn
@ 2024-11-08 12:29 ` Patryk Wlazlyn
2024-11-08 16:14 ` Dave Hansen
2024-11-12 11:47 ` Peter Zijlstra
2024-11-08 12:29 ` [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF Patryk Wlazlyn
` (2 subsequent siblings)
4 siblings, 2 replies; 51+ messages in thread
From: Patryk Wlazlyn @ 2024-11-08 12:29 UTC (permalink / raw)
To: x86
Cc: linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen, patryk.wlazlyn
The generic implementation, based on cpuid leaf 0x5, for looking up the
mwait hint for the deepest cstate, depends on them to be continuous in
range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
platforms, it is not architectural and may not result in reaching the
most optimized idle state on some of them.
Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
fallback to the later in case of missing enter_dead() handler.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
---
arch/x86/kernel/smpboot.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 44c40781bad6..721bb931181c 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1416,9 +1416,9 @@ void native_play_dead(void)
play_dead_common();
tboot_shutdown(TB_SHUTDOWN_WFS);
- mwait_play_dead();
if (cpuidle_play_dead())
- hlt_play_dead();
+ mwait_play_dead();
+ hlt_play_dead();
}
#else /* ... !CONFIG_HOTPLUG_CPU */
--
2.47.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF
2024-11-08 12:29 [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry Patryk Wlazlyn
2024-11-08 12:29 ` [PATCH v3 1/3] x86/smp: Allow calling mwait_play_dead with arbitrary hint Patryk Wlazlyn
2024-11-08 12:29 ` [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead() Patryk Wlazlyn
@ 2024-11-08 12:29 ` Patryk Wlazlyn
2024-11-08 16:21 ` Dave Hansen
2024-11-08 22:12 ` kernel test robot
2024-11-08 16:22 ` [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry Dave Hansen
2024-11-12 11:45 ` Peter Zijlstra
4 siblings, 2 replies; 51+ messages in thread
From: Patryk Wlazlyn @ 2024-11-08 12:29 UTC (permalink / raw)
To: x86
Cc: linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen, patryk.wlazlyn
Intel's Sierra Forest report two C6 substates in cpuid leaf 5:
C6S (hint 0x22)
C6SP (hint 0x23)
Hints 0x20 and 0x21 are skipped entirely, causing the generic
implementation in mwait_play_dead() to compute the wrong hint, when
looking for the deepest cstate. As a result, package with an offlined
CPU can never reach PC6.
Define the enter_dead() handler for SRF.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
---
drivers/idle/intel_idle.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 9aab7abc2ae9..bd67959e5e8b 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -56,6 +56,7 @@
#include <asm/mwait.h>
#include <asm/spec-ctrl.h>
#include <asm/fpu/api.h>
+#include <asm/smp.h>
#define INTEL_IDLE_VERSION "0.5.1"
@@ -221,6 +222,17 @@ static __cpuidle int intel_idle_s2idle(struct cpuidle_device *dev,
return 0;
}
+static __cpuidle int intel_idle_enter_dead(struct cpuidle_device *dev,
+ int index)
+{
+ struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
+ struct cpuidle_state *state = &drv->states[index];
+ unsigned long eax = flg2MWAIT(state->flags);
+
+ /* Retruns only in case of an error. */
+ return mwait_play_dead_with_hint(eax);
+}
+
/*
* States are indexed by the cstate number,
* which is also the index into the MWAIT hint array.
@@ -1303,6 +1315,7 @@ static struct cpuidle_state srf_cstates[] __initdata = {
.exit_latency = 1,
.target_residency = 1,
.enter = &intel_idle,
+ .enter_dead = &intel_idle_enter_dead,
.enter_s2idle = intel_idle_s2idle, },
{
.name = "C1E",
@@ -1311,6 +1324,7 @@ static struct cpuidle_state srf_cstates[] __initdata = {
.exit_latency = 2,
.target_residency = 10,
.enter = &intel_idle,
+ .enter_dead = &intel_idle_enter_dead,
.enter_s2idle = intel_idle_s2idle, },
{
.name = "C6S",
@@ -1319,6 +1333,7 @@ static struct cpuidle_state srf_cstates[] __initdata = {
.exit_latency = 270,
.target_residency = 700,
.enter = &intel_idle,
+ .enter_dead = &intel_idle_enter_dead,
.enter_s2idle = intel_idle_s2idle, },
{
.name = "C6SP",
@@ -1327,6 +1342,7 @@ static struct cpuidle_state srf_cstates[] __initdata = {
.exit_latency = 310,
.target_residency = 900,
.enter = &intel_idle,
+ .enter_dead = &intel_idle_enter_dead,
.enter_s2idle = intel_idle_s2idle, },
{
.enter = NULL }
--
2.47.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/3] x86/smp: Allow calling mwait_play_dead with arbitrary hint
2024-11-08 12:29 ` [PATCH v3 1/3] x86/smp: Allow calling mwait_play_dead with arbitrary hint Patryk Wlazlyn
@ 2024-11-08 16:03 ` Dave Hansen
2024-11-12 10:54 ` Patryk Wlazlyn
0 siblings, 1 reply; 51+ messages in thread
From: Dave Hansen @ 2024-11-08 16:03 UTC (permalink / raw)
To: Patryk Wlazlyn, x86
Cc: linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
On 11/8/24 04:29, Patryk Wlazlyn wrote:
> The current implementation for looking up the mwait hint for the deepest
> cstate depends on them to be continuous in range [0, NUM_SUBSTATES-1].
> While that is correct on most Intel x86 platforms, it is not documented
> and may not result in reaching the most optimized idle state on some of
> them.
First, it is not clear what code this refers to. It needs to be more clear.
Second, this is not clear about the bug. Start with being crystal clear
about the problem and the impact:
MWAIT needs different hints on different CPUs to reach the most
efficient idle states. The hint calculation* works in practice
on current hardware, but it fails on newer ones. Those newer
CPUs' battery life really suffers when the system is suspended
like when a laptop lid is closed.
* The current calculation is the for loop inspecting edx in
mwait_play_dead()
Then you can go into detail about what the bad implementation does.
> For example Intel's Sierra Forest report two C6 substates in cpuid leaf 5:
> C6S (hint 0x22)
> C6SP (hint 0x23)
>
> Hints 0x20 and 0x21 are skipped entirely,
Why does it start at 0x20? Giving the example on new, broken hardware
is also a bit weak without an example from old, fully functional hardware.
> causing the current
> implementation to compute the wrong hint, when looking for the deepest
> cstate for offlined CPU to enter. As a result, package with an offlined
> CPU can never reach PC6.
I don't think the PC6 detail matters enough to mention here.
> Allow the idle driver to call mwait_play_dead() code with the forced
> mwait hint, skipping the cpuid based computation.
>
> Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
> ---
> arch/x86/include/asm/smp.h | 6 ++++++
> arch/x86/kernel/smpboot.c | 25 ++++++++++++++++++-------
> 2 files changed, 24 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index ca073f40698f..fbd275d6661a 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -114,6 +114,7 @@ void wbinvd_on_cpu(int cpu);
> int wbinvd_on_all_cpus(void);
>
> void smp_kick_mwait_play_dead(void);
> +int mwait_play_dead_with_hint(unsigned long hint);
>
> void native_smp_send_reschedule(int cpu);
> void native_send_call_func_ipi(const struct cpumask *mask);
> @@ -164,6 +165,11 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
> {
> return (struct cpumask *)cpumask_of(0);
> }
> +
> +static inline int mwait_play_dead_with_hint(unsigned long eax_hint)
> +{
> + return 1;
> +}
> #endif /* CONFIG_SMP */
>
> #ifdef CONFIG_DEBUG_NMI_SELFTEST
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 0c35207320cb..44c40781bad6 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1270,13 +1270,14 @@ void play_dead_common(void)
> local_irq_disable();
> }
>
> +int mwait_play_dead_with_hint(unsigned long eax_hint);
We generally prefer these get a declaration in a header or get moved
around in the file, not declared like this.
> /*
> * We need to flush the caches before going to sleep, lest we have
> * dirty data in our caches when we come back up.
> */
> -static inline void mwait_play_dead(void)
> +static inline int mwait_play_dead(void)
> {
> - struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
> unsigned int eax, ebx, ecx, edx;
> unsigned int highest_cstate = 0;
> unsigned int highest_subcstate = 0;
> @@ -1284,13 +1285,13 @@ static inline void mwait_play_dead(void)
>
> if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> - return;
> + return 1;
> if (!this_cpu_has(X86_FEATURE_MWAIT))
> - return;
> + return 1;
> if (!this_cpu_has(X86_FEATURE_CLFLUSH))
> - return;
> + return 1;
> if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF)
> - return;
> + return 1;
If you're going to use an 'int' for a fail/nofail return type, please
just use -ERRNO's. It makes it *MUCH* more clear that these are error
returns and not something else.
That's what cpuidle_play_dead() does, for instance, Please follow its lead.
> eax = CPUID_MWAIT_LEAF;
> ecx = 0;
> @@ -1314,6 +1315,13 @@ static inline void mwait_play_dead(void)
> (highest_subcstate - 1);
> }
>
> + return mwait_play_dead_with_hint(eax);
> +}
> +
> +int mwait_play_dead_with_hint(unsigned long eax_hint)
> +{
> + struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
> +
> /* Set up state for the kexec() hack below */
> md->status = CPUDEAD_MWAIT_WAIT;
> md->control = CPUDEAD_MWAIT_WAIT;
> @@ -1333,7 +1341,7 @@ static inline void mwait_play_dead(void)
> mb();
> __monitor(md, 0, 0);
> mb();
> - __mwait(eax, 0);
> + __mwait(eax_hint, 0);
>
> if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
> /*
> @@ -1353,6 +1361,9 @@ static inline void mwait_play_dead(void)
> native_halt();
> }
> }
> +
> + /* Never reached */
> + return 0;
> }
I think the preferred way to do this is to add a __noreturn annotation.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-08 12:29 ` [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead() Patryk Wlazlyn
@ 2024-11-08 16:14 ` Dave Hansen
2024-11-12 10:55 ` Patryk Wlazlyn
2024-11-12 11:47 ` Peter Zijlstra
1 sibling, 1 reply; 51+ messages in thread
From: Dave Hansen @ 2024-11-08 16:14 UTC (permalink / raw)
To: Patryk Wlazlyn, x86
Cc: linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
On 11/8/24 04:29, Patryk Wlazlyn wrote:
> The generic implementation, based on cpuid leaf 0x5, for looking up the
> mwait hint for the deepest cstate, depends on them to be continuous in
> range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
> platforms, it is not architectural and may not result in reaching the
> most optimized idle state on some of them.
>
> Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
> fallback to the later in case of missing enter_dead() handler.
I don't think the bug has anything to do with this patch, really.
There's no need to rehash it here.
The issue here is that the only way to call mwait today is via
mwait_play_dead() directly, using its internally-calculated hint.
What you want is for a cpuidle-driver-calculated hint to be used. So,
you're using the new hint function via the cpuidle driver. But you just
need the cpuidle driver to be called first, not the old
mwait_play_dead()-calculated hint. The new code will still do mwait,
just via a different path and with a different hint.
The thing this doesn't mention is what the impact on everyone else is.
I _think_ the ACPI cpuidle driver is the only worry. Today, if there's
a system that supports mwait and ACPI cpuidle, it'll use mwait. After
this patch, it'll use ACPI cpuidle.
The changelog doesn't mention that behavior change or why that's OK.
Also, looking at this:
> - mwait_play_dead();
> if (cpuidle_play_dead())
> - hlt_play_dead();
> + mwait_play_dead();
> + hlt_play_dead();
None of those return on success, right?
Is there any reason this couldn't just be:
/* The first successful play_dead() will not return: */
cpuidle_play_dead();
mwait_play_dead();
hlt_play_dead();
That has the added bonus of not needing to return anything from
mwait_play_dead() and the resulting churn from the last patch.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF
2024-11-08 12:29 ` [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF Patryk Wlazlyn
@ 2024-11-08 16:21 ` Dave Hansen
2024-11-12 10:57 ` Patryk Wlazlyn
2024-11-08 22:12 ` kernel test robot
1 sibling, 1 reply; 51+ messages in thread
From: Dave Hansen @ 2024-11-08 16:21 UTC (permalink / raw)
To: Patryk Wlazlyn, x86
Cc: linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
On 11/8/24 04:29, Patryk Wlazlyn wrote:
> Intel's Sierra Forest report two C6 substates in cpuid leaf 5:
> C6S (hint 0x22)
> C6SP (hint 0x23)
>
> Hints 0x20 and 0x21 are skipped entirely, causing the generic
> implementation in mwait_play_dead() to compute the wrong hint, when
> looking for the deepest cstate. As a result, package with an offlined
> CPU can never reach PC6.
This series has said multiple times how the old algorithm is wrong. But
it never actually _fixed_ the bad algorithm, only worked around it.
Does mwait_play_dead() itself need to get fixed?
> Define the enter_dead() handler for SRF.
This effectively gets the mwait hints from ______ instead of using the
calculation in mwait_play_dead().
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 9aab7abc2ae9..bd67959e5e8b 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -56,6 +56,7 @@
> #include <asm/mwait.h>
> #include <asm/spec-ctrl.h>
> #include <asm/fpu/api.h>
> +#include <asm/smp.h>
>
> #define INTEL_IDLE_VERSION "0.5.1"
>
> @@ -221,6 +222,17 @@ static __cpuidle int intel_idle_s2idle(struct cpuidle_device *dev,
> return 0;
> }
>
> +static __cpuidle int intel_idle_enter_dead(struct cpuidle_device *dev,
> + int index)
> +{
> + struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
> + struct cpuidle_state *state = &drv->states[index];
> + unsigned long eax = flg2MWAIT(state->flags);
> +
> + /* Retruns only in case of an error. */
^ returns?
> + return mwait_play_dead_with_hint(eax);
> +}
> +
> /*
> * States are indexed by the cstate number,
> * which is also the index into the MWAIT hint array.
> @@ -1303,6 +1315,7 @@ static struct cpuidle_state srf_cstates[] __initdata = {
> .exit_latency = 1,
> .target_residency = 1,
> .enter = &intel_idle,
> + .enter_dead = &intel_idle_enter_dead,
> .enter_s2idle = intel_idle_s2idle, },
> {
> .name = "C1E",
> @@ -1311,6 +1324,7 @@ static struct cpuidle_state srf_cstates[] __initdata = {
> .exit_latency = 2,
> .target_residency = 10,
> .enter = &intel_idle,
> + .enter_dead = &intel_idle_enter_dead,
> .enter_s2idle = intel_idle_s2idle, },
> {
> .name = "C6S",
> @@ -1319,6 +1333,7 @@ static struct cpuidle_state srf_cstates[] __initdata = {
> .exit_latency = 270,
> .target_residency = 700,
> .enter = &intel_idle,
> + .enter_dead = &intel_idle_enter_dead,
> .enter_s2idle = intel_idle_s2idle, },
> {
> .name = "C6SP",
> @@ -1327,6 +1342,7 @@ static struct cpuidle_state srf_cstates[] __initdata = {
> .exit_latency = 310,
> .target_residency = 900,
> .enter = &intel_idle,
> + .enter_dead = &intel_idle_enter_dead,
> .enter_s2idle = intel_idle_s2idle, },
> {
> .enter = NULL }
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry
2024-11-08 12:29 [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry Patryk Wlazlyn
` (2 preceding siblings ...)
2024-11-08 12:29 ` [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF Patryk Wlazlyn
@ 2024-11-08 16:22 ` Dave Hansen
2024-11-12 11:45 ` Peter Zijlstra
4 siblings, 0 replies; 51+ messages in thread
From: Dave Hansen @ 2024-11-08 16:22 UTC (permalink / raw)
To: Patryk Wlazlyn, x86
Cc: linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
On 11/8/24 04:29, Patryk Wlazlyn wrote:
> Applied suggestions from Dave and Rafael.
The basic approach here is looking pretty sound, so thanks for that
Patryk. I mostly have mechanical nits left.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF
2024-11-08 12:29 ` [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF Patryk Wlazlyn
2024-11-08 16:21 ` Dave Hansen
@ 2024-11-08 22:12 ` kernel test robot
1 sibling, 0 replies; 51+ messages in thread
From: kernel test robot @ 2024-11-08 22:12 UTC (permalink / raw)
To: Patryk Wlazlyn, x86
Cc: oe-kbuild-all, linux-kernel, linux-pm, rafael.j.wysocki,
len.brown, artem.bityutskiy, dave.hansen, patryk.wlazlyn
Hi Patryk,
kernel test robot noticed the following build warnings:
[auto build test WARNING on acpi/next]
[also build test WARNING on tip/master tip/x86/core linus/master v6.12-rc6 next-20241108]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Patryk-Wlazlyn/x86-smp-Allow-calling-mwait_play_dead-with-arbitrary-hint/20241108-203137
base: https://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git next
patch link: https://lore.kernel.org/r/20241108122909.763663-4-patryk.wlazlyn%40linux.intel.com
patch subject: [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF
config: x86_64-randconfig-161-20241109 (https://download.01.org/0day-ci/archive/20241109/202411090651.FNnbtLe2-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241109/202411090651.FNnbtLe2-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411090651.FNnbtLe2-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> vmlinux.o: warning: objtool: intel_idle_enter_dead+0x9: call to cpuidle_get_cpu_driver() leaves .noinstr.text section
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/3] x86/smp: Allow calling mwait_play_dead with arbitrary hint
2024-11-08 16:03 ` Dave Hansen
@ 2024-11-12 10:54 ` Patryk Wlazlyn
0 siblings, 0 replies; 51+ messages in thread
From: Patryk Wlazlyn @ 2024-11-12 10:54 UTC (permalink / raw)
To: Dave Hansen, x86
Cc: linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
> First, it is not clear what code this refers to. It needs to be more clear.
ACK. I thought that code changes serve that purpose, but I can explicitly quote
the code I am reffering to.
> Second, this is not clear about the bug. Start with being crystal clear
> about the problem and the impact:
>
> MWAIT needs different hints on different CPUs to reach the most
> efficient idle states. The hint calculation* works in practice
> on current hardware, but it fails on newer ones. Those newer
> CPUs' battery life really suffers when the system is suspended
> like when a laptop lid is closed.
>
> * The current calculation is the for loop inspecting edx in
> mwait_play_dead()
>
> Then you can go into detail about what the bad implementation does.
It would kind of get into the bussiness of documenting the mwait instruction
itself. I am modifying behavior of already existing code.
Should I be more clear in the commit message or you also meant adding more code
comments?
>> For example Intel's Sierra Forest report two C6 substates in cpuid leaf 5:
>> C6S (hint 0x22)
>> C6SP (hint 0x23)
>>
>> Hints 0x20 and 0x21 are skipped entirely,
>
> Why does it start at 0x20? Giving the example on new, broken hardware
> is also a bit weak without an example from old, fully functional hardware.
It's just how mwait hints work on Intel hardware:
EAX[3:0] # Sub C-state
EAX[7:4] # C-state
CPUID leaf 0x5 report how many sub c-states are there for any c-state.
Old code assumes that you can derive the actual hints from that, but you can
only do that if available hints are continuous, which is not guaranteed and is
not true on SRF and possibly future platforms.
Hint 0x20 (C6) was given as an example, because it's where the problem is.
Mwait hints themselves are processor-specific and so the same hint may mean
different thing on different hardware.
>> causing the current
>> implementation to compute the wrong hint, when looking for the deepest
>> cstate for offlined CPU to enter. As a result, package with an offlined
>> CPU can never reach PC6.
>
> I don't think the PC6 detail matters enough to mention here.
I can simplify the commit message, but the PC6 problem is the reason this change
was proposed.
>> Allow the idle driver to call mwait_play_dead() code with the forced
>> mwait hint, skipping the cpuid based computation.
>>
>> Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
>> ---
>> arch/x86/include/asm/smp.h | 6 ++++++
>> arch/x86/kernel/smpboot.c | 25 ++++++++++++++++++-------
>> 2 files changed, 24 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
>> index ca073f40698f..fbd275d6661a 100644
>> --- a/arch/x86/include/asm/smp.h
>> +++ b/arch/x86/include/asm/smp.h
>> @@ -114,6 +114,7 @@ void wbinvd_on_cpu(int cpu);
>> int wbinvd_on_all_cpus(void);
>>
>> void smp_kick_mwait_play_dead(void);
>> +int mwait_play_dead_with_hint(unsigned long hint);
>>
>> void native_smp_send_reschedule(int cpu);
>> void native_send_call_func_ipi(const struct cpumask *mask);
>> @@ -164,6 +165,11 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
>> {
>> return (struct cpumask *)cpumask_of(0);
>> }
>> +
>> +static inline int mwait_play_dead_with_hint(unsigned long eax_hint)
>> +{
>> + return 1;
>> +}
>> #endif /* CONFIG_SMP */
>>
>> #ifdef CONFIG_DEBUG_NMI_SELFTEST
>> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
>> index 0c35207320cb..44c40781bad6 100644
>> --- a/arch/x86/kernel/smpboot.c
>> +++ b/arch/x86/kernel/smpboot.c
>> @@ -1270,13 +1270,14 @@ void play_dead_common(void)
>> local_irq_disable();
>> }
>>
>> +int mwait_play_dead_with_hint(unsigned long eax_hint);
>
> We generally prefer these get a declaration in a header or get moved
> around in the file, not declared like this.
If forward declared and defined after mwait_play_dead, it makes the git
diff/blame simpler. As an alternative I could define it above, but the resulting
diff is harder to read and I didn't want to include whole smp header just for
that, but perhaps I should.
>> /*
>> * We need to flush the caches before going to sleep, lest we have
>> * dirty data in our caches when we come back up.
>> */
>> -static inline void mwait_play_dead(void)
>> +static inline int mwait_play_dead(void)
>> {
>> - struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
>> unsigned int eax, ebx, ecx, edx;
>> unsigned int highest_cstate = 0;
>> unsigned int highest_subcstate = 0;
>> @@ -1284,13 +1285,13 @@ static inline void mwait_play_dead(void)
>>
>> if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
>> boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
>> - return;
>> + return 1;
>> if (!this_cpu_has(X86_FEATURE_MWAIT))
>> - return;
>> + return 1;
>> if (!this_cpu_has(X86_FEATURE_CLFLUSH))
>> - return;
>> + return 1;
>> if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF)
>> - return;
>> + return 1;
>
> If you're going to use an 'int' for a fail/nofail return type, please
> just use -ERRNO's. It makes it *MUCH* more clear that these are error
> returns and not something else.
>
> That's what cpuidle_play_dead() does, for instance, Please follow its lead.
ACK. Makes sense.
>> eax = CPUID_MWAIT_LEAF;
>> ecx = 0;
>> @@ -1314,6 +1315,13 @@ static inline void mwait_play_dead(void)
>> (highest_subcstate - 1);
>> }
>>
>> + return mwait_play_dead_with_hint(eax);
>> +}
>> +
>> +int mwait_play_dead_with_hint(unsigned long eax_hint)
>> +{
>> + struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
>> +
>> /* Set up state for the kexec() hack below */
>> md->status = CPUDEAD_MWAIT_WAIT;
>> md->control = CPUDEAD_MWAIT_WAIT;
>> @@ -1333,7 +1341,7 @@ static inline void mwait_play_dead(void)
>> mb();
>> __monitor(md, 0, 0);
>> mb();
>> - __mwait(eax, 0);
>> + __mwait(eax_hint, 0);
>>
>> if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
>> /*
>> @@ -1353,6 +1361,9 @@ static inline void mwait_play_dead(void)
>> native_halt();
>> }
>> }
>> +
>> + /* Never reached */
>> + return 0;
>> }
>
> I think the preferred way to do this is to add a __noreturn annotation.
OK.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-08 16:14 ` Dave Hansen
@ 2024-11-12 10:55 ` Patryk Wlazlyn
0 siblings, 0 replies; 51+ messages in thread
From: Patryk Wlazlyn @ 2024-11-12 10:55 UTC (permalink / raw)
To: Dave Hansen, x86
Cc: linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
> I don't think the bug has anything to do with this patch, really.
> There's no need to rehash it here.
>
> The issue here is that the only way to call mwait today is via
> mwait_play_dead() directly, using its internally-calculated hint.
>
> What you want is for a cpuidle-driver-calculated hint to be used. So,
> you're using the new hint function via the cpuidle driver. But you just
> need the cpuidle driver to be called first, not the old
> mwait_play_dead()-calculated hint. The new code will still do mwait,
> just via a different path and with a different hint.
Ok. I'll just say that we change the order because idle driver may know better.
> The thing this doesn't mention is what the impact on everyone else is.
> I _think_ the ACPI cpuidle driver is the only worry. Today, if there's
> a system that supports mwait and ACPI cpuidle, it'll use mwait. After
> this patch, it'll use ACPI cpuidle.
>
> The changelog doesn't mention that behavior change or why that's OK.
True, but I think the mwait_play_dead() is exclusively for Intel. Other target
platforms get an early return. I'll include that in the commit message.
> Also, looking at this:
>
>> - mwait_play_dead();
>> if (cpuidle_play_dead())
>> - hlt_play_dead();
>> + mwait_play_dead();
>> + hlt_play_dead();
>
> None of those return on success, right?
>
> Is there any reason this couldn't just be:
>
> /* The first successful play_dead() will not return: */
> cpuidle_play_dead();
> mwait_play_dead();
> hlt_play_dead();
>
> That has the added bonus of not needing to return anything from
> mwait_play_dead() and the resulting churn from the last patch.
mwait_play_dead may return if mwait_play_dead_with_hint returns and it only does
on non-smp builds. That being said, we do ignore the return value right now,
because in either case we want to enter hlt_play_dead() as a fallback, so I
guess we can make mwait_play_dead return void, but leave
mwait_play_dead_with_hint returning int or add ifdef CONFIG_SMP guards in
intel_idle.
When going with the return types proposed in this patch set, on non-smp builds
intel_idle would call mwait_play_dead_with_hint() which would "return 1"; and
propagate through cpuidle_play_dead().
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF
2024-11-08 16:21 ` Dave Hansen
@ 2024-11-12 10:57 ` Patryk Wlazlyn
2024-11-12 11:28 ` Rafael J. Wysocki
0 siblings, 1 reply; 51+ messages in thread
From: Patryk Wlazlyn @ 2024-11-12 10:57 UTC (permalink / raw)
To: Dave Hansen, x86
Cc: linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
> This series has said multiple times how the old algorithm is wrong. But
> it never actually _fixed_ the bad algorithm, only worked around it.
>
> Does mwait_play_dead() itself need to get fixed?
I don't think so. The old algorithm gives fairly good heuristic for computing
the mwait hint for the deepest cstate. Even though it's not guaranteed to work,
it does work on most of the platforms that don't early return. I think we should
leave it, but prefer idle_driver.
>> Define the enter_dead() handler for SRF.
>
> This effectively gets the mwait hints from ______ instead of using the
> calculation in mwait_play_dead().
Ok.
>> +static __cpuidle int intel_idle_enter_dead(struct cpuidle_device *dev,
>> + int index)
>> +{
>> + struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
>> + struct cpuidle_state *state = &drv->states[index];
>> + unsigned long eax = flg2MWAIT(state->flags);
>> +
>> + /* Retruns only in case of an error. */
>
> ^ returns?
Yup. Will fix.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF
2024-11-12 10:57 ` Patryk Wlazlyn
@ 2024-11-12 11:28 ` Rafael J. Wysocki
2024-11-12 16:07 ` Dave Hansen
2024-11-12 19:17 ` Thomas Gleixner
0 siblings, 2 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2024-11-12 11:28 UTC (permalink / raw)
To: Patryk Wlazlyn, Dave Hansen
Cc: x86, linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
On Tue, Nov 12, 2024 at 12:18 PM Patryk Wlazlyn
<patryk.wlazlyn@linux.intel.com> wrote:
>
> > This series has said multiple times how the old algorithm is wrong. But
> > it never actually _fixed_ the bad algorithm, only worked around it.
> >
> > Does mwait_play_dead() itself need to get fixed?
>
> I don't think so. The old algorithm gives fairly good heuristic for computing
> the mwait hint for the deepest cstate. Even though it's not guaranteed to work,
> it does work on most of the platforms that don't early return. I think we should
> leave it, but prefer idle_driver.
IOW, as a fallback mechanism, it is as good as it gets.
As the primary source of information though, not quite.
> >> Define the enter_dead() handler for SRF.
> >
> > This effectively gets the mwait hints from ______ instead of using the
> > calculation in mwait_play_dead().
>
> Ok.
>
> >> +static __cpuidle int intel_idle_enter_dead(struct cpuidle_device *dev,
> >> + int index)
> >> +{
> >> + struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
> >> + struct cpuidle_state *state = &drv->states[index];
> >> + unsigned long eax = flg2MWAIT(state->flags);
> >> +
> >> + /* Retruns only in case of an error. */
> >
> > ^ returns?
>
> Yup. Will fix.
>
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry
2024-11-08 12:29 [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry Patryk Wlazlyn
` (3 preceding siblings ...)
2024-11-08 16:22 ` [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry Dave Hansen
@ 2024-11-12 11:45 ` Peter Zijlstra
2024-11-12 15:43 ` Patryk Wlazlyn
4 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-12 11:45 UTC (permalink / raw)
To: Patryk Wlazlyn
Cc: x86, linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
On Fri, Nov 08, 2024 at 01:29:06PM +0100, Patryk Wlazlyn wrote:
> Removing the existing "kexec hack" by bringing all offlined CPUs back
> online before proceeding with the kexec would make it even simpler, but
> I am not sure we can do that. It looks kind of obvious to me, but for
> some reason the hack exist.
There's a comment there that explains why this is done. If you don't
understand this, then please don't touch this code.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-08 12:29 ` [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead() Patryk Wlazlyn
2024-11-08 16:14 ` Dave Hansen
@ 2024-11-12 11:47 ` Peter Zijlstra
2024-11-12 12:03 ` Rafael J. Wysocki
2024-11-12 13:23 ` Rafael J. Wysocki
1 sibling, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-12 11:47 UTC (permalink / raw)
To: Patryk Wlazlyn
Cc: x86, linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote:
> The generic implementation, based on cpuid leaf 0x5, for looking up the
> mwait hint for the deepest cstate, depends on them to be continuous in
> range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
> platforms, it is not architectural and may not result in reaching the
> most optimized idle state on some of them.
>
> Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
> fallback to the later in case of missing enter_dead() handler.
>
> Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
> ---
> arch/x86/kernel/smpboot.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 44c40781bad6..721bb931181c 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1416,9 +1416,9 @@ void native_play_dead(void)
> play_dead_common();
> tboot_shutdown(TB_SHUTDOWN_WFS);
>
> - mwait_play_dead();
> if (cpuidle_play_dead())
> - hlt_play_dead();
> + mwait_play_dead();
> + hlt_play_dead();
> }
Yeah, I don't think so. we don't want to accidentally hit
acpi_idle_play_dead().
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 11:47 ` Peter Zijlstra
@ 2024-11-12 12:03 ` Rafael J. Wysocki
2024-11-12 12:18 ` Peter Zijlstra
2024-11-12 13:23 ` Rafael J. Wysocki
1 sibling, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2024-11-12 12:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Patryk Wlazlyn, x86, linux-kernel, linux-pm, rafael.j.wysocki,
len.brown, artem.bityutskiy, dave.hansen
On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote:
> > The generic implementation, based on cpuid leaf 0x5, for looking up the
> > mwait hint for the deepest cstate, depends on them to be continuous in
> > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
> > platforms, it is not architectural and may not result in reaching the
> > most optimized idle state on some of them.
> >
> > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
> > fallback to the later in case of missing enter_dead() handler.
> >
> > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
> > ---
> > arch/x86/kernel/smpboot.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 44c40781bad6..721bb931181c 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -1416,9 +1416,9 @@ void native_play_dead(void)
> > play_dead_common();
> > tboot_shutdown(TB_SHUTDOWN_WFS);
> >
> > - mwait_play_dead();
> > if (cpuidle_play_dead())
> > - hlt_play_dead();
> > + mwait_play_dead();
> > + hlt_play_dead();
> > }
>
> Yeah, I don't think so. we don't want to accidentally hit
> acpi_idle_play_dead().
Fair enough.
Then we are back to the original approach though:
https://lore.kernel.org/linux-pm/20241029101507.7188-3-patryk.wlazlyn@linux.intel.com/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 12:03 ` Rafael J. Wysocki
@ 2024-11-12 12:18 ` Peter Zijlstra
2024-11-12 12:30 ` Rafael J. Wysocki
2024-11-12 12:44 ` Artem Bityutskiy
0 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-12 12:18 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Patryk Wlazlyn, x86, linux-kernel, linux-pm, rafael.j.wysocki,
len.brown, artem.bityutskiy, dave.hansen
On Tue, Nov 12, 2024 at 01:03:49PM +0100, Rafael J. Wysocki wrote:
> On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote:
> > > The generic implementation, based on cpuid leaf 0x5, for looking up the
> > > mwait hint for the deepest cstate, depends on them to be continuous in
> > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
> > > platforms, it is not architectural and may not result in reaching the
> > > most optimized idle state on some of them.
> > >
> > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
> > > fallback to the later in case of missing enter_dead() handler.
> > >
> > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
> > > ---
> > > arch/x86/kernel/smpboot.c | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > > index 44c40781bad6..721bb931181c 100644
> > > --- a/arch/x86/kernel/smpboot.c
> > > +++ b/arch/x86/kernel/smpboot.c
> > > @@ -1416,9 +1416,9 @@ void native_play_dead(void)
> > > play_dead_common();
> > > tboot_shutdown(TB_SHUTDOWN_WFS);
> > >
> > > - mwait_play_dead();
> > > if (cpuidle_play_dead())
> > > - hlt_play_dead();
> > > + mwait_play_dead();
> > > + hlt_play_dead();
> > > }
> >
> > Yeah, I don't think so. we don't want to accidentally hit
> > acpi_idle_play_dead().
>
> Fair enough.
>
> Then we are back to the original approach though:
>
> https://lore.kernel.org/linux-pm/20241029101507.7188-3-patryk.wlazlyn@linux.intel.com/
Well, that won't be brilliant for hybrid systems where the available
states are different per CPU.
Also, all of this is a bit of a trainwreck... AFAICT AMD wants IO based
idle (per the 2018 commit). So they want the ACPI thing.
But on Intel we really don't want HLT, and had that MWAIT, but that has
real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y.
The ACPI thing doesn't support FFh states for it's enter_dead(), should
it?
Anyway, ideally x86 would grow a new instruction to offline a CPU, both
MWAIT and HLT have problems vs non-maskable interrupts.
I really don't know what is best here, maybe moving that whole CPUID
loop to boot, store the value in a per-cpu mwait_play_dead_hint. Have
AMD explicitly clear the value, and avoid mwait when 0 -- hint 0 is
equal to HLT anyway.
But as said, we need a new instruction.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 12:18 ` Peter Zijlstra
@ 2024-11-12 12:30 ` Rafael J. Wysocki
2024-11-12 12:38 ` Rafael J. Wysocki
2024-11-12 13:49 ` Peter Zijlstra
2024-11-12 12:44 ` Artem Bityutskiy
1 sibling, 2 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2024-11-12 12:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, artem.bityutskiy, dave.hansen
On Tue, Nov 12, 2024 at 1:18 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 12, 2024 at 01:03:49PM +0100, Rafael J. Wysocki wrote:
> > On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote:
> > > > The generic implementation, based on cpuid leaf 0x5, for looking up the
> > > > mwait hint for the deepest cstate, depends on them to be continuous in
> > > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
> > > > platforms, it is not architectural and may not result in reaching the
> > > > most optimized idle state on some of them.
> > > >
> > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
> > > > fallback to the later in case of missing enter_dead() handler.
> > > >
> > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
> > > > ---
> > > > arch/x86/kernel/smpboot.c | 4 ++--
> > > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > > > index 44c40781bad6..721bb931181c 100644
> > > > --- a/arch/x86/kernel/smpboot.c
> > > > +++ b/arch/x86/kernel/smpboot.c
> > > > @@ -1416,9 +1416,9 @@ void native_play_dead(void)
> > > > play_dead_common();
> > > > tboot_shutdown(TB_SHUTDOWN_WFS);
> > > >
> > > > - mwait_play_dead();
> > > > if (cpuidle_play_dead())
> > > > - hlt_play_dead();
> > > > + mwait_play_dead();
> > > > + hlt_play_dead();
> > > > }
> > >
> > > Yeah, I don't think so. we don't want to accidentally hit
> > > acpi_idle_play_dead().
> >
> > Fair enough.
> >
> > Then we are back to the original approach though:
> >
> > https://lore.kernel.org/linux-pm/20241029101507.7188-3-patryk.wlazlyn@linux.intel.com/
>
> Well, that won't be brilliant for hybrid systems where the available
> states are different per CPU.
But they aren't.
At least so far that has not been the case on any platform known to me
and I'm not aware of any plans to make that happen (guess what, some
other OSes may be unhappy).
> Also, all of this is a bit of a trainwreck... AFAICT AMD wants IO based
> idle (per the 2018 commit). So they want the ACPI thing.
Yes.
> But on Intel we really don't want HLT, and had that MWAIT, but that has
> real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y.
We could because it handles ACPI now and ACPI idle doesn't add any
value on top of it except for the IO-based idle case.
> The ACPI thing doesn't support FFh states for it's enter_dead(), should it?
It does AFAICS, but the FFH is still MWAIT.
> Anyway, ideally x86 would grow a new instruction to offline a CPU, both
> MWAIT and HLT have problems vs non-maskable interrupts.
>
> I really don't know what is best here, maybe moving that whole CPUID
> loop to boot, store the value in a per-cpu mwait_play_dead_hint. Have
> AMD explicitly clear the value, and avoid mwait when 0 -- hint 0 is
> equal to HLT anyway.
>
> But as said, we need a new instruction.
Before that, there is the problem with the MWAIT hint computation in
mwait_play_dead() and in fact intel_idle does know what hint to use in
there.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 12:30 ` Rafael J. Wysocki
@ 2024-11-12 12:38 ` Rafael J. Wysocki
2024-11-12 13:49 ` Peter Zijlstra
1 sibling, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2024-11-12 12:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Patryk Wlazlyn, x86, linux-kernel, linux-pm, rafael.j.wysocki,
len.brown, artem.bityutskiy, dave.hansen
On Tue, Nov 12, 2024 at 1:30 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, Nov 12, 2024 at 1:18 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Nov 12, 2024 at 01:03:49PM +0100, Rafael J. Wysocki wrote:
> > > On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote:
> > > > > The generic implementation, based on cpuid leaf 0x5, for looking up the
> > > > > mwait hint for the deepest cstate, depends on them to be continuous in
> > > > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
> > > > > platforms, it is not architectural and may not result in reaching the
> > > > > most optimized idle state on some of them.
> > > > >
> > > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
> > > > > fallback to the later in case of missing enter_dead() handler.
> > > > >
> > > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
> > > > > ---
> > > > > arch/x86/kernel/smpboot.c | 4 ++--
> > > > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > > > > index 44c40781bad6..721bb931181c 100644
> > > > > --- a/arch/x86/kernel/smpboot.c
> > > > > +++ b/arch/x86/kernel/smpboot.c
> > > > > @@ -1416,9 +1416,9 @@ void native_play_dead(void)
> > > > > play_dead_common();
> > > > > tboot_shutdown(TB_SHUTDOWN_WFS);
> > > > >
> > > > > - mwait_play_dead();
> > > > > if (cpuidle_play_dead())
> > > > > - hlt_play_dead();
> > > > > + mwait_play_dead();
> > > > > + hlt_play_dead();
> > > > > }
> > > >
> > > > Yeah, I don't think so. we don't want to accidentally hit
> > > > acpi_idle_play_dead().
> > >
> > > Fair enough.
> > >
> > > Then we are back to the original approach though:
> > >
> > > https://lore.kernel.org/linux-pm/20241029101507.7188-3-patryk.wlazlyn@linux.intel.com/
> >
> > Well, that won't be brilliant for hybrid systems where the available
> > states are different per CPU.
>
> But they aren't.
>
> At least so far that has not been the case on any platform known to me
> and I'm not aware of any plans to make that happen (guess what, some
> other OSes may be unhappy).
>
> > Also, all of this is a bit of a trainwreck... AFAICT AMD wants IO based
> > idle (per the 2018 commit). So they want the ACPI thing.
>
> Yes.
>
> > But on Intel we really don't want HLT, and had that MWAIT, but that has
> > real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y.
>
> We could because it handles ACPI now and ACPI idle doesn't add any
> value on top of it except for the IO-based idle case.
>
> > The ACPI thing doesn't support FFh states for it's enter_dead(), should it?
>
> It does AFAICS, but the FFH is still MWAIT.
Sorry, no, it doesn't.
It might I guess, but it would have been dead code in the vast
majority of configurations in the field.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 12:18 ` Peter Zijlstra
2024-11-12 12:30 ` Rafael J. Wysocki
@ 2024-11-12 12:44 ` Artem Bityutskiy
2024-11-12 14:01 ` Peter Zijlstra
1 sibling, 1 reply; 51+ messages in thread
From: Artem Bityutskiy @ 2024-11-12 12:44 UTC (permalink / raw)
To: Peter Zijlstra, Rafael J. Wysocki
Cc: Patryk Wlazlyn, x86, linux-kernel, linux-pm, rafael.j.wysocki,
len.brown, dave.hansen
On Tue, 2024-11-12 at 13:18 +0100, Peter Zijlstra wrote:
> But on Intel we really don't want HLT, and had that MWAIT, but that has
> real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y.
If INTEL_IDLE is not set, then we'll just use existing mwait creation algorithm
in 'mwait_play_dead()', which works too, just not ideal.
> Anyway, ideally x86 would grow a new instruction to offline a CPU, both
> MWAIT and HLT have problems vs non-maskable interrupts.
... snip ...
> But as said, we need a new instruction.
FYI, I already started discussing a special "gimme the deepest C-state" mwait
hint - just a constant like 0xFF. CPUID leaf 5 has many reserved bits, one could
be used for enumeration of this feature.
But this is just a quick idea so far, and informal discussions so far.
Artem.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 11:47 ` Peter Zijlstra
2024-11-12 12:03 ` Rafael J. Wysocki
@ 2024-11-12 13:23 ` Rafael J. Wysocki
2024-11-12 14:56 ` Peter Zijlstra
1 sibling, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2024-11-12 13:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Patryk Wlazlyn, x86, linux-kernel, linux-pm, rafael.j.wysocki,
len.brown, artem.bityutskiy, dave.hansen
On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote:
> > The generic implementation, based on cpuid leaf 0x5, for looking up the
> > mwait hint for the deepest cstate, depends on them to be continuous in
> > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
> > platforms, it is not architectural and may not result in reaching the
> > most optimized idle state on some of them.
> >
> > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
> > fallback to the later in case of missing enter_dead() handler.
> >
> > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
> > ---
> > arch/x86/kernel/smpboot.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 44c40781bad6..721bb931181c 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -1416,9 +1416,9 @@ void native_play_dead(void)
> > play_dead_common();
> > tboot_shutdown(TB_SHUTDOWN_WFS);
> >
> > - mwait_play_dead();
> > if (cpuidle_play_dead())
> > - hlt_play_dead();
> > + mwait_play_dead();
> > + hlt_play_dead();
> > }
>
> Yeah, I don't think so. we don't want to accidentally hit
> acpi_idle_play_dead().
Having inspected the code once again, I'm not sure what your concern is.
:enter.dead() is set to acpi_idle_play_dead() for all states in ACPI
idle - see acpi_processor_setup_cstates() and the role of the type
check is to filter out bogus table entries (the "type" must be 1, 2,
or 3 as per the spec).
Then cpuidle_play_dead() calls drv->states[i].enter_dead() for the
deepest state where it is set and if this is FFH,
acpi_idle_play_dead() will return an error. So after the change, the
code above will fall back to mwait_play_dead() then.
Or am I missing anything?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 12:30 ` Rafael J. Wysocki
2024-11-12 12:38 ` Rafael J. Wysocki
@ 2024-11-12 13:49 ` Peter Zijlstra
2024-11-12 14:56 ` Rafael J. Wysocki
1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-12 13:49 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Patryk Wlazlyn, x86, linux-kernel, linux-pm, rafael.j.wysocki,
len.brown, artem.bityutskiy, dave.hansen
On Tue, Nov 12, 2024 at 01:30:29PM +0100, Rafael J. Wysocki wrote:
> > > Then we are back to the original approach though:
> > >
> > > https://lore.kernel.org/linux-pm/20241029101507.7188-3-patryk.wlazlyn@linux.intel.com/
> >
> > Well, that won't be brilliant for hybrid systems where the available
> > states are different per CPU.
>
> But they aren't.
>
> At least so far that has not been the case on any platform known to me
> and I'm not aware of any plans to make that happen (guess what, some
> other OSes may be unhappy).
Well, that's something at least.
> > Also, all of this is a bit of a trainwreck... AFAICT AMD wants IO based
> > idle (per the 2018 commit). So they want the ACPI thing.
>
> Yes.
>
> > But on Intel we really don't want HLT, and had that MWAIT, but that has
> > real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y.
>
> We could because it handles ACPI now and ACPI idle doesn't add any
> value on top of it except for the IO-based idle case.
You're saying we can mandate INTEL_IDLE=y? Because currently defconfig
doesn't even have it on.
> > The ACPI thing doesn't support FFh states for it's enter_dead(), should it?
>
> It does AFAICS, but the FFH is still MWAIT.
What I'm trying to say is that acpi_idle_play_dead() doesn't seem to
support FFh and as such won't ever use MWAIT.
> > Anyway, ideally x86 would grow a new instruction to offline a CPU, both
> > MWAIT and HLT have problems vs non-maskable interrupts.
> >
> > I really don't know what is best here, maybe moving that whole CPUID
> > loop to boot, store the value in a per-cpu mwait_play_dead_hint. Have
> > AMD explicitly clear the value, and avoid mwait when 0 -- hint 0 is
> > equal to HLT anyway.
> >
> > But as said, we need a new instruction.
>
> Before that, there is the problem with the MWAIT hint computation in
> mwait_play_dead() and in fact intel_idle does know what hint to use in
> there.
But we need to deal witn INTEL_IDLE=n. Also, I don't see any MWAIT_LEAF
parsing in intel_idle.c. Yes, it requests the information, but then it
mostly ignores it -- it only consumes two ECX bits or so.
I don't see it finding a max-cstate from mwait_substates anywhere.
So given we don't have any such code, why can't we simply fix the cstate
parsing we have in mwait_play_dead() and call it a day?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 12:44 ` Artem Bityutskiy
@ 2024-11-12 14:01 ` Peter Zijlstra
2024-11-14 12:03 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-12 14:01 UTC (permalink / raw)
To: Artem Bityutskiy
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, dave.hansen
On Tue, Nov 12, 2024 at 02:44:49PM +0200, Artem Bityutskiy wrote:
> On Tue, 2024-11-12 at 13:18 +0100, Peter Zijlstra wrote:
> > But on Intel we really don't want HLT, and had that MWAIT, but that has
> > real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y.
>
> If INTEL_IDLE is not set, then we'll just use existing mwait creation algorithm
> in 'mwait_play_dead()', which works too, just not ideal.
So why not fix the substate detectoring function and ignore everything?
> > Anyway, ideally x86 would grow a new instruction to offline a CPU, both
> > MWAIT and HLT have problems vs non-maskable interrupts.
> ... snip ...
> > But as said, we need a new instruction.
>
> FYI, I already started discussing a special "gimme the deepest C-state" mwait
> hint - just a constant like 0xFF. CPUID leaf 5 has many reserved bits, one could
> be used for enumeration of this feature.
>
> But this is just a quick idea so far, and informal discussions so far.
No, not mwait hint. We need an instruction that:
- goes to deepest C state
- drops into WAIT-for-Start-IPI (SIPI)
Notably, it should not wake from:
- random memory writes
- NMI, MCE, SMI and other such non-maskable thingies
- anything else -- the memory pointed to by RIP might no longer exist
Lets call the instruction: DEAD.
With the mwait 'hack', kexec still goes belly up if it gets a spurious
NMI (and them others) at an inopportune time, and this does happen
afaik. Just not enough to worry the data center guys like the mwait
thing.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 13:23 ` Rafael J. Wysocki
@ 2024-11-12 14:56 ` Peter Zijlstra
2024-11-12 15:00 ` Rafael J. Wysocki
2024-11-13 11:41 ` Gautham R. Shenoy
0 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-12 14:56 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Patryk Wlazlyn, x86, linux-kernel, linux-pm, rafael.j.wysocki,
len.brown, artem.bityutskiy, dave.hansen, gautham.shenoy
On Tue, Nov 12, 2024 at 02:23:14PM +0100, Rafael J. Wysocki wrote:
> On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote:
> > > The generic implementation, based on cpuid leaf 0x5, for looking up the
> > > mwait hint for the deepest cstate, depends on them to be continuous in
> > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
> > > platforms, it is not architectural and may not result in reaching the
> > > most optimized idle state on some of them.
> > >
> > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
> > > fallback to the later in case of missing enter_dead() handler.
> > >
> > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
> > > ---
> > > arch/x86/kernel/smpboot.c | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > > index 44c40781bad6..721bb931181c 100644
> > > --- a/arch/x86/kernel/smpboot.c
> > > +++ b/arch/x86/kernel/smpboot.c
> > > @@ -1416,9 +1416,9 @@ void native_play_dead(void)
> > > play_dead_common();
> > > tboot_shutdown(TB_SHUTDOWN_WFS);
> > >
> > > - mwait_play_dead();
> > > if (cpuidle_play_dead())
> > > - hlt_play_dead();
> > > + mwait_play_dead();
> > > + hlt_play_dead();
> > > }
> >
> > Yeah, I don't think so. we don't want to accidentally hit
> > acpi_idle_play_dead().
>
> Having inspected the code once again, I'm not sure what your concern is.
>
> :enter.dead() is set to acpi_idle_play_dead() for all states in ACPI
> idle - see acpi_processor_setup_cstates() and the role of the type
> check is to filter out bogus table entries (the "type" must be 1, 2,
> or 3 as per the spec).
>
> Then cpuidle_play_dead() calls drv->states[i].enter_dead() for the
> deepest state where it is set and if this is FFH,
> acpi_idle_play_dead() will return an error. So after the change, the
> code above will fall back to mwait_play_dead() then.
>
> Or am I missing anything?
So it relies on there being a C2/C3 state enumerated and that being FFh.
Otherwise it will find a 'working' state and we're up a creek.
Typically I expect C2/C3 FFh states will be there on Intel stuff, but it
seems awefully random to rely on this hole. AMD might unwittinly change
the ACPI driver (they're the main user) and then we'd be up a creek.
Robustly we'd teach the ACPI driver about FFh and set enter_dead on
every state -- but we'd have to double check that with AMD.
At the same time, intel_idle should then also set enter_dead on all
states.
And then the mwait case is only ever reached if CPUIDLE=n.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 13:49 ` Peter Zijlstra
@ 2024-11-12 14:56 ` Rafael J. Wysocki
2024-11-12 15:08 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2024-11-12 14:56 UTC (permalink / raw)
To: Peter Zijlstra, artem.bityutskiy
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, dave.hansen
On Tue, Nov 12, 2024 at 2:50 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 12, 2024 at 01:30:29PM +0100, Rafael J. Wysocki wrote:
>
> > > > Then we are back to the original approach though:
> > > >
> > > > https://lore.kernel.org/linux-pm/20241029101507.7188-3-patryk.wlazlyn@linux.intel.com/
> > >
> > > Well, that won't be brilliant for hybrid systems where the available
> > > states are different per CPU.
> >
> > But they aren't.
> >
> > At least so far that has not been the case on any platform known to me
> > and I'm not aware of any plans to make that happen (guess what, some
> > other OSes may be unhappy).
>
> Well, that's something at least.
>
> > > Also, all of this is a bit of a trainwreck... AFAICT AMD wants IO based
> > > idle (per the 2018 commit). So they want the ACPI thing.
> >
> > Yes.
> >
> > > But on Intel we really don't want HLT, and had that MWAIT, but that has
> > > real problems with KEXEC. And I don't think we can rely on INTEL_IDLE=y.
> >
> > We could because it handles ACPI now and ACPI idle doesn't add any
> > value on top of it except for the IO-based idle case.
>
> You're saying we can mandate INTEL_IDLE=y? Because currently defconfig
> doesn't even have it on.
It is conceivable.
> > > The ACPI thing doesn't support FFh states for it's enter_dead(), should it?
> >
> > It does AFAICS, but the FFH is still MWAIT.
>
> What I'm trying to say is that acpi_idle_play_dead() doesn't seem to
> support FFh and as such won't ever use MWAIT.
Right, but if it finds an FFH state deeper than C1, it will fall back
to the next play_dead method.
> > > Anyway, ideally x86 would grow a new instruction to offline a CPU, both
> > > MWAIT and HLT have problems vs non-maskable interrupts.
> > >
> > > I really don't know what is best here, maybe moving that whole CPUID
> > > loop to boot, store the value in a per-cpu mwait_play_dead_hint. Have
> > > AMD explicitly clear the value, and avoid mwait when 0 -- hint 0 is
> > > equal to HLT anyway.
> > >
> > > But as said, we need a new instruction.
> >
> > Before that, there is the problem with the MWAIT hint computation in
> > mwait_play_dead() and in fact intel_idle does know what hint to use in
> > there.
>
> But we need to deal witn INTEL_IDLE=n.
Then the code would do what it is doing today, as a matter of fallback.
> Also, I don't see any MWAIT_LEAF
> parsing in intel_idle.c. Yes, it requests the information, but then it
> mostly ignores it -- it only consumes two ECX bits or so.
>
> I don't see it finding a max-cstate from mwait_substates anywhere.
No, it gets this either from _CST or from a built-in table for the
given processor model.
> So given we don't have any such code, why can't we simply fix the cstate
> parsing we have in mwait_play_dead() and call it a day?
I'll leave this one to Artem, but there is at least one reason to
avoid doing that I know about: There is no guarantee that whatever has
been found was actually validated.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 14:56 ` Peter Zijlstra
@ 2024-11-12 15:00 ` Rafael J. Wysocki
2024-11-13 11:41 ` Gautham R. Shenoy
1 sibling, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2024-11-12 15:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, artem.bityutskiy, dave.hansen,
gautham.shenoy
On Tue, Nov 12, 2024 at 3:56 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 12, 2024 at 02:23:14PM +0100, Rafael J. Wysocki wrote:
> > On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote:
> > > > The generic implementation, based on cpuid leaf 0x5, for looking up the
> > > > mwait hint for the deepest cstate, depends on them to be continuous in
> > > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
> > > > platforms, it is not architectural and may not result in reaching the
> > > > most optimized idle state on some of them.
> > > >
> > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
> > > > fallback to the later in case of missing enter_dead() handler.
> > > >
> > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
> > > > ---
> > > > arch/x86/kernel/smpboot.c | 4 ++--
> > > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > > > index 44c40781bad6..721bb931181c 100644
> > > > --- a/arch/x86/kernel/smpboot.c
> > > > +++ b/arch/x86/kernel/smpboot.c
> > > > @@ -1416,9 +1416,9 @@ void native_play_dead(void)
> > > > play_dead_common();
> > > > tboot_shutdown(TB_SHUTDOWN_WFS);
> > > >
> > > > - mwait_play_dead();
> > > > if (cpuidle_play_dead())
> > > > - hlt_play_dead();
> > > > + mwait_play_dead();
> > > > + hlt_play_dead();
> > > > }
> > >
> > > Yeah, I don't think so. we don't want to accidentally hit
> > > acpi_idle_play_dead().
> >
> > Having inspected the code once again, I'm not sure what your concern is.
> >
> > :enter.dead() is set to acpi_idle_play_dead() for all states in ACPI
> > idle - see acpi_processor_setup_cstates() and the role of the type
> > check is to filter out bogus table entries (the "type" must be 1, 2,
> > or 3 as per the spec).
> >
> > Then cpuidle_play_dead() calls drv->states[i].enter_dead() for the
> > deepest state where it is set and if this is FFH,
> > acpi_idle_play_dead() will return an error. So after the change, the
> > code above will fall back to mwait_play_dead() then.
> >
> > Or am I missing anything?
>
> So it relies on there being a C2/C3 state enumerated and that being FFh.
> Otherwise it will find a 'working' state and we're up a creek.
>
> Typically I expect C2/C3 FFh states will be there on Intel stuff, but it
> seems awefully random to rely on this hole. AMD might unwittinly change
> the ACPI driver (they're the main user) and then we'd be up a creek.
>
> Robustly we'd teach the ACPI driver about FFh and set enter_dead on
> every state -- but we'd have to double check that with AMD.
>
> At the same time, intel_idle should then also set enter_dead on all
> states.
>
> And then the mwait case is only ever reached if CPUIDLE=n.
So that's why I would prefer intel_idle, if configured, to give
mwait_play_dead() a hint on the MWAIT hint to use. Otherwise the
latter would just fall back to the current method.
This would not be bullet-proof, but it would take the opportunity to
work better if it could.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 14:56 ` Rafael J. Wysocki
@ 2024-11-12 15:08 ` Peter Zijlstra
2024-11-12 16:24 ` Rafael J. Wysocki
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-12 15:08 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: artem.bityutskiy, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, dave.hansen
On Tue, Nov 12, 2024 at 03:56:22PM +0100, Rafael J. Wysocki wrote:
> > So given we don't have any such code, why can't we simply fix the cstate
> > parsing we have in mwait_play_dead() and call it a day?
>
> I'll leave this one to Artem, but there is at least one reason to
> avoid doing that I know about: There is no guarantee that whatever has
> been found was actually validated.
It's a bit daft to explicitly advertise a state in CPUID that's not
validated. I realize that MSFT will likely only ever use the ACPI table,
but at the same time, the CPUID bits and ACPI tables both come from the
same BIOS image, no?
.... and we've never seen the BIOS be self contradictory before ... /me
runs
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry
2024-11-12 11:45 ` Peter Zijlstra
@ 2024-11-12 15:43 ` Patryk Wlazlyn
2024-11-13 1:19 ` Thomas Gleixner
0 siblings, 1 reply; 51+ messages in thread
From: Patryk Wlazlyn @ 2024-11-12 15:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
> There's a comment there that explains why this is done. If you don't
> understand this, then please don't touch this code.
/*
* Kexec is about to happen. Don't go back into mwait() as
* the kexec kernel might overwrite text and data including
* page tables and stack. So mwait() would resume when the
* monitor cache line is written to and then the CPU goes
* south due to overwritten text, page tables and stack.
*
* Note: This does _NOT_ protect against a stray MCE, NMI,
* SMI. They will resume execution at the instruction
* following the HLT instruction and run into the problem
* which this is trying to prevent.
*/
If you are referring to this comment above, I do understand the need to
enter hlt loop before the kexec happens. I thought that I could bring
all of the offlined CPUs back online, effectively getting them out of
the mwait loop.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF
2024-11-12 11:28 ` Rafael J. Wysocki
@ 2024-11-12 16:07 ` Dave Hansen
2024-11-12 19:17 ` Thomas Gleixner
1 sibling, 0 replies; 51+ messages in thread
From: Dave Hansen @ 2024-11-12 16:07 UTC (permalink / raw)
To: Rafael J. Wysocki, Patryk Wlazlyn
Cc: x86, linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
On 11/12/24 03:28, Rafael J. Wysocki wrote:
> On Tue, Nov 12, 2024 at 12:18 PM Patryk Wlazlyn
> <patryk.wlazlyn@linux.intel.com> wrote:
>>> This series has said multiple times how the old algorithm is wrong. But
>>> it never actually _fixed_ the bad algorithm, only worked around it.
>>>
>>> Does mwait_play_dead() itself need to get fixed?
>> I don't think so. The old algorithm gives fairly good heuristic for computing
>> the mwait hint for the deepest cstate. Even though it's not guaranteed to work,
>> it does work on most of the platforms that don't early return. I think we should
>> leave it, but prefer idle_driver.
> IOW, as a fallback mechanism, it is as good as it gets.
>
> As the primary source of information though, not quite.
Could we also work to make this a bit more clear when it should be used
and where the primary sources of information are?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 15:08 ` Peter Zijlstra
@ 2024-11-12 16:24 ` Rafael J. Wysocki
0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2024-11-12 16:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, artem.bityutskiy, Patryk Wlazlyn, x86,
linux-kernel, linux-pm, rafael.j.wysocki, len.brown, dave.hansen
On Tue, Nov 12, 2024 at 4:08 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 12, 2024 at 03:56:22PM +0100, Rafael J. Wysocki wrote:
>
> > > So given we don't have any such code, why can't we simply fix the cstate
> > > parsing we have in mwait_play_dead() and call it a day?
> >
> > I'll leave this one to Artem, but there is at least one reason to
> > avoid doing that I know about: There is no guarantee that whatever has
> > been found was actually validated.
>
> It's a bit daft to explicitly advertise a state in CPUID that's not
> validated. I realize that MSFT will likely only ever use the ACPI table,
Right.
> but at the same time, the CPUID bits and ACPI tables both come from the
> same BIOS image, no?
Yes, but the list of C-states advertised as supported in CPUID is
usually longer than the _CST list size (at most 3) ...
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF
2024-11-12 11:28 ` Rafael J. Wysocki
2024-11-12 16:07 ` Dave Hansen
@ 2024-11-12 19:17 ` Thomas Gleixner
2024-11-12 19:43 ` Rafael J. Wysocki
1 sibling, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2024-11-12 19:17 UTC (permalink / raw)
To: Rafael J. Wysocki, Patryk Wlazlyn, Dave Hansen
Cc: x86, linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
On Tue, Nov 12 2024 at 12:28, Rafael J. Wysocki wrote:
> On Tue, Nov 12, 2024 at 12:18 PM Patryk Wlazlyn
> <patryk.wlazlyn@linux.intel.com> wrote:
>> I don't think so. The old algorithm gives fairly good heuristic for computing
>> the mwait hint for the deepest cstate. Even though it's not guaranteed to work,
>> it does work on most of the platforms that don't early return. I think we should
>> leave it, but prefer idle_driver.
>
> IOW, as a fallback mechanism, it is as good as it gets.
>
> As the primary source of information though, not quite.
So we have at least 5 places in the kernel which evaluate CPUID leaf 0x5
in different ways.
Can we please have _ONE_ function which evaluates the leaf correctly
once and provides the required information for all places ready to use?
Thanks,
tglx
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF
2024-11-12 19:17 ` Thomas Gleixner
@ 2024-11-12 19:43 ` Rafael J. Wysocki
0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2024-11-12 19:43 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Rafael J. Wysocki, Patryk Wlazlyn, Dave Hansen, x86, linux-kernel,
linux-pm, rafael.j.wysocki, len.brown, artem.bityutskiy,
dave.hansen
On Tue, Nov 12, 2024 at 8:18 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, Nov 12 2024 at 12:28, Rafael J. Wysocki wrote:
> > On Tue, Nov 12, 2024 at 12:18 PM Patryk Wlazlyn
> > <patryk.wlazlyn@linux.intel.com> wrote:
> >> I don't think so. The old algorithm gives fairly good heuristic for computing
> >> the mwait hint for the deepest cstate. Even though it's not guaranteed to work,
> >> it does work on most of the platforms that don't early return. I think we should
> >> leave it, but prefer idle_driver.
> >
> > IOW, as a fallback mechanism, it is as good as it gets.
> >
> > As the primary source of information though, not quite.
>
> So we have at least 5 places in the kernel which evaluate CPUID leaf 0x5
> in different ways.
>
> Can we please have _ONE_ function which evaluates the leaf correctly
> once and provides the required information for all places ready to use?
Yup, that's what needs to be done.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry
2024-11-12 15:43 ` Patryk Wlazlyn
@ 2024-11-13 1:19 ` Thomas Gleixner
2024-11-14 17:13 ` Patryk Wlazlyn
0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2024-11-13 1:19 UTC (permalink / raw)
To: Patryk Wlazlyn, Peter Zijlstra
Cc: x86, linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
On Tue, Nov 12 2024 at 16:43, Patryk Wlazlyn wrote:
>> There's a comment there that explains why this is done. If you don't
>> understand this, then please don't touch this code.
>
> /*
> * Kexec is about to happen. Don't go back into mwait() as
> * the kexec kernel might overwrite text and data including
> * page tables and stack. So mwait() would resume when the
> * monitor cache line is written to and then the CPU goes
> * south due to overwritten text, page tables and stack.
> *
> * Note: This does _NOT_ protect against a stray MCE, NMI,
> * SMI. They will resume execution at the instruction
> * following the HLT instruction and run into the problem
> * which this is trying to prevent.
> */
>
> If you are referring to this comment above, I do understand the need to
> enter hlt loop before the kexec happens. I thought that I could bring
> all of the offlined CPUs back online, effectively getting them out of
> the mwait loop.
That's not really working:
1) Regular kexec offlines them again.
2) Kexec in panic can't do any of that.
Thanks
tglx
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 14:56 ` Peter Zijlstra
2024-11-12 15:00 ` Rafael J. Wysocki
@ 2024-11-13 11:41 ` Gautham R. Shenoy
2024-11-13 16:14 ` Dave Hansen
` (2 more replies)
1 sibling, 3 replies; 51+ messages in thread
From: Gautham R. Shenoy @ 2024-11-13 11:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, artem.bityutskiy, dave.hansen
On Tue, Nov 12, 2024 at 03:56:18PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 12, 2024 at 02:23:14PM +0100, Rafael J. Wysocki wrote:
> > On Tue, Nov 12, 2024 at 12:47 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Nov 08, 2024 at 01:29:08PM +0100, Patryk Wlazlyn wrote:
> > > > The generic implementation, based on cpuid leaf 0x5, for looking up the
> > > > mwait hint for the deepest cstate, depends on them to be continuous in
> > > > range [0, NUM_SUBSTATES-1]. While that is correct on most Intel x86
> > > > platforms, it is not architectural and may not result in reaching the
> > > > most optimized idle state on some of them.
> > > >
> > > > Prefer cpuidle_play_dead() over the generic mwait_play_dead() loop and
> > > > fallback to the later in case of missing enter_dead() handler.
> > > >
> > > > Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
> > > > ---
> > > > arch/x86/kernel/smpboot.c | 4 ++--
> > > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > > > index 44c40781bad6..721bb931181c 100644
> > > > --- a/arch/x86/kernel/smpboot.c
> > > > +++ b/arch/x86/kernel/smpboot.c
> > > > @@ -1416,9 +1416,9 @@ void native_play_dead(void)
> > > > play_dead_common();
> > > > tboot_shutdown(TB_SHUTDOWN_WFS);
> > > >
> > > > - mwait_play_dead();
> > > > if (cpuidle_play_dead())
> > > > - hlt_play_dead();
> > > > + mwait_play_dead();
> > > > + hlt_play_dead();
> > > > }
> > >
> > > Yeah, I don't think so. we don't want to accidentally hit
> > > acpi_idle_play_dead().
> >
> > Having inspected the code once again, I'm not sure what your concern is.
> >
> > :enter.dead() is set to acpi_idle_play_dead() for all states in ACPI
> > idle - see acpi_processor_setup_cstates() and the role of the type
> > check is to filter out bogus table entries (the "type" must be 1, 2,
> > or 3 as per the spec).
> >
> > Then cpuidle_play_dead() calls drv->states[i].enter_dead() for the
> > deepest state where it is set and if this is FFH,
> > acpi_idle_play_dead() will return an error. So after the change, the
> > code above will fall back to mwait_play_dead() then.
> >
> > Or am I missing anything?
>
> So it relies on there being a C2/C3 state enumerated and that being FFh.
> Otherwise it will find a 'working' state and we're up a creek.
>
> Typically I expect C2/C3 FFh states will be there on Intel stuff, but it
> seems awefully random to rely on this hole. AMD might unwittinly change
> the ACPI driver (they're the main user) and then we'd be up a creek.
AMD platforms won't be using FFH based states for offlined CPUs. We
prefer IO based states when available, and HLT otherwise.
>
> Robustly we'd teach the ACPI driver about FFh and set enter_dead on
> every state -- but we'd have to double check that with AMD.
Works for us as long as those FFh states aren't used for play_dead on
AMD platforms.
How about something like this (completely untested)
---------------------x8----------------------------------------------------
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index f3ffd0a3a012..bd611771fa6c 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
}
EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
+static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
+{
+ unsigned int cpu = smp_processor_id();
+ struct cstate_entry *percpu_entry;
+
+ /*
+ * This is ugly. But AMD processors don't prefer MWAIT based
+ * C-states when processors are offlined.
+ */
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
+ boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
+ return -ENODEV;
+
+ percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
+ return mwait_play_dead_with_hints(percpu_entry->states[cx->index].eax);
+}
+EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
+
static int __init ffh_cstate_init(void)
{
struct cpuinfo_x86 *c = &boot_cpu_data;
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 831fa4a12159..c535f5df9081 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -590,6 +590,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
raw_safe_halt();
else if (cx->entry_method == ACPI_CSTATE_SYSTEMIO) {
io_idle(cx->address);
+ } else if (cx->entry_method == ACPI_CSTATE_FFH) {
+ return acpi_procesor_ffh_play_dead(cx);
} else
return -ENODEV;
}
diff --git a/include/acpi/processor.h b/include/acpi/processor.h
index e6f6074eadbf..38329dcdd2b9 100644
--- a/include/acpi/processor.h
+++ b/include/acpi/processor.h
@@ -280,6 +280,7 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
struct acpi_processor_cx *cx,
struct acpi_power_register *reg);
void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cstate);
+int acpi_processor_ffh_dead(struct acpi_processor_cx *cstate);
#else
static inline void acpi_processor_power_init_bm_check(struct
acpi_processor_flags
@@ -300,6 +301,11 @@ static inline void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx
{
return;
}
+
+static inline acpi_processor_ffh_dead(struct acpi_processor_cx *cstate)
+{
+ return -ENODEV;
+}
#endif
static inline int call_on_cpu(int cpu, long (*fn)(void *), void *arg,
---------------------x8--------------------------------------------------------
--
Thanks and Regards
gautham.
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-13 11:41 ` Gautham R. Shenoy
@ 2024-11-13 16:14 ` Dave Hansen
2024-11-14 5:06 ` Gautham R. Shenoy
2024-11-13 16:22 ` Peter Zijlstra
2024-11-14 11:58 ` Peter Zijlstra
2 siblings, 1 reply; 51+ messages in thread
From: Dave Hansen @ 2024-11-13 16:14 UTC (permalink / raw)
To: Gautham R. Shenoy, Peter Zijlstra
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, artem.bityutskiy, dave.hansen
On 11/13/24 03:41, Gautham R. Shenoy wrote:
> + /*
> + * This is ugly. But AMD processors don't prefer MWAIT based
> + * C-states when processors are offlined.
> + */
> + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> + return -ENODEV;
Can we get an X86_FEATURE for this, please? Either a positive one:
X86_FEATURE_MWAIT_OK_FOR_OFFLINE
or a negative one:
X86_FEATURE_MWAIT_BUSTED_FOR_OFFLINE
... with better names.
Or even a helper. Because if you add this AMD||HYGON check, it'll be at
_least_ the second one of these for the same logical reason.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-13 11:41 ` Gautham R. Shenoy
2024-11-13 16:14 ` Dave Hansen
@ 2024-11-13 16:22 ` Peter Zijlstra
2024-11-13 16:27 ` Wysocki, Rafael J
2024-11-14 17:36 ` Gautham R. Shenoy
2024-11-14 11:58 ` Peter Zijlstra
2 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-13 16:22 UTC (permalink / raw)
To: Gautham R. Shenoy
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, artem.bityutskiy, dave.hansen
On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
> How about something like this (completely untested)
>
> ---------------------x8----------------------------------------------------
>
> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> index f3ffd0a3a012..bd611771fa6c 100644
> --- a/arch/x86/kernel/acpi/cstate.c
> +++ b/arch/x86/kernel/acpi/cstate.c
> @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
> }
> EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
>
> +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
> +{
> + unsigned int cpu = smp_processor_id();
> + struct cstate_entry *percpu_entry;
> +
> + /*
> + * This is ugly. But AMD processors don't prefer MWAIT based
> + * C-states when processors are offlined.
> + */
> + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> + return -ENODEV;
Are there AMD systems with FFh idle states at all?
Also, I don't think this works right, the loop in cpuidle_play_dead()
will terminate on this, and not try a shallower state which might have
IO/HLT on.
> +
> + percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
> + return mwait_play_dead_with_hints(percpu_entry->states[cx->index].eax);
> +}
> +EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-13 16:22 ` Peter Zijlstra
@ 2024-11-13 16:27 ` Wysocki, Rafael J
2024-11-14 11:58 ` Rafael J. Wysocki
2024-11-14 17:36 ` Gautham R. Shenoy
1 sibling, 1 reply; 51+ messages in thread
From: Wysocki, Rafael J @ 2024-11-13 16:27 UTC (permalink / raw)
To: Peter Zijlstra, Gautham R. Shenoy
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
len.brown, artem.bityutskiy, dave.hansen
On 11/13/2024 5:22 PM, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
>
>> How about something like this (completely untested)
>>
>> ---------------------x8----------------------------------------------------
>>
>> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
>> index f3ffd0a3a012..bd611771fa6c 100644
>> --- a/arch/x86/kernel/acpi/cstate.c
>> +++ b/arch/x86/kernel/acpi/cstate.c
>> @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
>> }
>> EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
>>
>> +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
>> +{
>> + unsigned int cpu = smp_processor_id();
>> + struct cstate_entry *percpu_entry;
>> +
>> + /*
>> + * This is ugly. But AMD processors don't prefer MWAIT based
>> + * C-states when processors are offlined.
>> + */
>> + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
>> + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
>> + return -ENODEV;
> Are there AMD systems with FFh idle states at all?
I don't know.
> Also, I don't think this works right, the loop in cpuidle_play_dead()
> will terminate on this, and not try a shallower state which might have
> IO/HLT on.
I think you are right.
>> +
>> + percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
>> + return mwait_play_dead_with_hints(percpu_entry->states[cx->index].eax);
>> +}
>> +EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-13 16:14 ` Dave Hansen
@ 2024-11-14 5:06 ` Gautham R. Shenoy
0 siblings, 0 replies; 51+ messages in thread
From: Gautham R. Shenoy @ 2024-11-14 5:06 UTC (permalink / raw)
To: Dave Hansen
Cc: Peter Zijlstra, Rafael J. Wysocki, Patryk Wlazlyn, x86,
linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen
Hello Dave,
On Wed, Nov 13, 2024 at 08:14:08AM -0800, Dave Hansen wrote:
> On 11/13/24 03:41, Gautham R. Shenoy wrote:
> > + /*
> > + * This is ugly. But AMD processors don't prefer MWAIT based
> > + * C-states when processors are offlined.
> > + */
> > + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> > + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> > + return -ENODEV;
>
> Can we get an X86_FEATURE for this, please? Either a positive one:
>
> X86_FEATURE_MWAIT_OK_FOR_OFFLINE
>
> or a negative one:
>
> X86_FEATURE_MWAIT_BUSTED_FOR_OFFLINE
>
Sure. That makes sense.
> ... with better names.
>
> Or even a helper. Because if you add this AMD||HYGON check, it'll be at
> _least_ the second one of these for the same logical reason.
Fair enough. Will add that.
--
Thanks and Regards
gautham.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-13 11:41 ` Gautham R. Shenoy
2024-11-13 16:14 ` Dave Hansen
2024-11-13 16:22 ` Peter Zijlstra
@ 2024-11-14 11:58 ` Peter Zijlstra
2024-11-14 17:24 ` Gautham R. Shenoy
2 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-14 11:58 UTC (permalink / raw)
To: Gautham R. Shenoy
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, artem.bityutskiy, dave.hansen
On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
> AMD platforms won't be using FFH based states for offlined CPUs. We
> prefer IO based states when available, and HLT otherwise.
>
> >
> > Robustly we'd teach the ACPI driver about FFh and set enter_dead on
> > every state -- but we'd have to double check that with AMD.
>
> Works for us as long as those FFh states aren't used for play_dead on
> AMD platforms.
AFAIU AMD doesn't want to use MWAIT -- ever, not only for offline.
Confirm?
But if it were to use MWAIT for regular idle, then surely it's OK for
offline too, right?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-13 16:27 ` Wysocki, Rafael J
@ 2024-11-14 11:58 ` Rafael J. Wysocki
2024-11-14 12:17 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2024-11-14 11:58 UTC (permalink / raw)
To: Peter Zijlstra, Gautham R. Shenoy
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
len.brown, artem.bityutskiy, dave.hansen
On Wed, Nov 13, 2024 at 5:27 PM Wysocki, Rafael J
<rafael.j.wysocki@intel.com> wrote:
>
>
> On 11/13/2024 5:22 PM, Peter Zijlstra wrote:
> > On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
> >
> >> How about something like this (completely untested)
> >>
> >> ---------------------x8----------------------------------------------------
> >>
> >> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> >> index f3ffd0a3a012..bd611771fa6c 100644
> >> --- a/arch/x86/kernel/acpi/cstate.c
> >> +++ b/arch/x86/kernel/acpi/cstate.c
> >> @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
> >> }
> >> EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
> >>
> >> +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
> >> +{
> >> + unsigned int cpu = smp_processor_id();
> >> + struct cstate_entry *percpu_entry;
> >> +
> >> + /*
> >> + * This is ugly. But AMD processors don't prefer MWAIT based
> >> + * C-states when processors are offlined.
> >> + */
> >> + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> >> + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> >> + return -ENODEV;
> > Are there AMD systems with FFh idle states at all?
>
> I don't know.
>
>
> > Also, I don't think this works right, the loop in cpuidle_play_dead()
> > will terminate on this, and not try a shallower state which might have
> > IO/HLT on.
>
> I think you are right.
AFAICS, cpuidle_play_dead() needs to be reworked to not bail out after
receiving an error from :play_dead() for one state, but only after all
of them have failed.
> >> +
> >> + percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
> >> + return mwait_play_dead_with_hints(percpu_entry->states[cx->index].eax);
> >> +}
> >> +EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-12 14:01 ` Peter Zijlstra
@ 2024-11-14 12:03 ` Peter Zijlstra
2024-11-15 1:21 ` Thomas Gleixner
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-14 12:03 UTC (permalink / raw)
To: Artem Bityutskiy
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, dave.hansen
On Tue, Nov 12, 2024 at 03:01:27PM +0100, Peter Zijlstra wrote:
> No, not mwait hint. We need an instruction that:
>
> - goes to deepest C state
> - drops into WAIT-for-Start-IPI (SIPI)
>
> Notably, it should not wake from:
>
> - random memory writes
> - NMI, MCE, SMI and other such non-maskable thingies
> - anything else -- the memory pointed to by RIP might no longer exist
>
> Lets call the instruction: DEAD.
So, turns out that when you send INIT to an AP it does the whole drop
into Wait-for-SIPI and ignore non-maskable crap.
The reason we don't do that is because INIT to CPU0 (BP) is somewhat
fatal, but since Thomas killed all that CPU0 hotplug crap, I think we
can actually go do that.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-14 11:58 ` Rafael J. Wysocki
@ 2024-11-14 12:17 ` Peter Zijlstra
0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-14 12:17 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Gautham R. Shenoy, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
len.brown, artem.bityutskiy, dave.hansen
On Thu, Nov 14, 2024 at 12:58:49PM +0100, Rafael J. Wysocki wrote:
> On Wed, Nov 13, 2024 at 5:27 PM Wysocki, Rafael J
> <rafael.j.wysocki@intel.com> wrote:
> >
> >
> > On 11/13/2024 5:22 PM, Peter Zijlstra wrote:
> > > On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
> > >
> > >> How about something like this (completely untested)
> > >>
> > >> ---------------------x8----------------------------------------------------
> > >>
> > >> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> > >> index f3ffd0a3a012..bd611771fa6c 100644
> > >> --- a/arch/x86/kernel/acpi/cstate.c
> > >> +++ b/arch/x86/kernel/acpi/cstate.c
> > >> @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
> > >> }
> > >> EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
> > >>
> > >> +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
> > >> +{
> > >> + unsigned int cpu = smp_processor_id();
> > >> + struct cstate_entry *percpu_entry;
> > >> +
> > >> + /*
> > >> + * This is ugly. But AMD processors don't prefer MWAIT based
> > >> + * C-states when processors are offlined.
> > >> + */
> > >> + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> > >> + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> > >> + return -ENODEV;
> > > Are there AMD systems with FFh idle states at all?
> >
> > I don't know.
> >
> >
> > > Also, I don't think this works right, the loop in cpuidle_play_dead()
> > > will terminate on this, and not try a shallower state which might have
> > > IO/HLT on.
> >
> > I think you are right.
>
> AFAICS, cpuidle_play_dead() needs to be reworked to not bail out after
> receiving an error from :play_dead() for one state, but only after all
> of them have failed.
That and ideally we remove that whole ACPI_STATE_C[123] filter on
setting enter_dead. I don't see how that makes sense.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry
2024-11-13 1:19 ` Thomas Gleixner
@ 2024-11-14 17:13 ` Patryk Wlazlyn
0 siblings, 0 replies; 51+ messages in thread
From: Patryk Wlazlyn @ 2024-11-14 17:13 UTC (permalink / raw)
To: Thomas Gleixner
Cc: x86, linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen, Peter Zijlstra
> That's not really working:
>
> 1) Regular kexec offlines them again.
But then, they execute hlt loop in the reboot vector right?
I think that's fine. We just don't want to be woken up from the mwait
when the RIP points to garbage.
> 2) Kexec in panic can't do any of that.
Does it make sense to change the kexec, so that every CPU is forced
into the reboot vector, which I believe happens for online CPUs?
We don't have to online them all the way. Maybe minimal bringup just
to be able to send them the IPI?
As a side note, It's not that important for this patchset. I think I
can make it work without simplifying the existing hack, but crossed my
mind at some point that maybe we could do that.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-14 11:58 ` Peter Zijlstra
@ 2024-11-14 17:24 ` Gautham R. Shenoy
2024-11-15 10:11 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Gautham R. Shenoy @ 2024-11-14 17:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, artem.bityutskiy, dave.hansen,
Mario Limonciello
Hello Peter,
On Thu, Nov 14, 2024 at 12:58:31PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
>
> > AMD platforms won't be using FFH based states for offlined CPUs. We
> > prefer IO based states when available, and HLT otherwise.
> >
> > >
> > > Robustly we'd teach the ACPI driver about FFh and set enter_dead on
> > > every state -- but we'd have to double check that with AMD.
> >
> > Works for us as long as those FFh states aren't used for play_dead on
> > AMD platforms.
>
> AFAIU AMD doesn't want to use MWAIT -- ever, not only for offline.
> Confirm?
>
AMD wants to use MWAIT for cpuidle and it does use MWAIT based C1
state on both client and server parts.
Eg: On my server box
$ cpupower idle-info | grep "FFH" -B1 -A3
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 6591
Duration: 1482606
> But if it were to use MWAIT for regular idle, then surely it's OK for
> offline too, right?
I tried this out today and there is no functional issue.
However, I would like to run some experiments on whether HLT provides
better power savings than MWAIT C1 with CPUs offlined. I will get back
with this information tomorrow.
--
Thanks and Regards
gautham.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-13 16:22 ` Peter Zijlstra
2024-11-13 16:27 ` Wysocki, Rafael J
@ 2024-11-14 17:36 ` Gautham R. Shenoy
2024-11-14 17:58 ` Rafael J. Wysocki
1 sibling, 1 reply; 51+ messages in thread
From: Gautham R. Shenoy @ 2024-11-14 17:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, artem.bityutskiy, dave.hansen,
Mario Limonciello
On Wed, Nov 13, 2024 at 05:22:22PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
>
> > How about something like this (completely untested)
> >
> > ---------------------x8----------------------------------------------------
> >
> > diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> > index f3ffd0a3a012..bd611771fa6c 100644
> > --- a/arch/x86/kernel/acpi/cstate.c
> > +++ b/arch/x86/kernel/acpi/cstate.c
> > @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
> > }
> > EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
> >
> > +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
> > +{
> > + unsigned int cpu = smp_processor_id();
> > + struct cstate_entry *percpu_entry;
> > +
> > + /*
> > + * This is ugly. But AMD processors don't prefer MWAIT based
> > + * C-states when processors are offlined.
> > + */
> > + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> > + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> > + return -ENODEV;
>
> Are there AMD systems with FFh idle states at all?
Yes. On AMD systems, the ACPI C1 state is an FFh idle state. An
example from my server box.
$ cpupower idle-info | grep "FFH" -B1 -A3
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 6591
Duration: 1482606
>
> Also, I don't think this works right, the loop in cpuidle_play_dead()
> will terminate on this, and not try a shallower state which might have
> IO/HLT on.
Ah yes. You are right. I didn't observe that cpuidle_play_dead() calls
"return idle_state.enter_state". I suppose the solution would be to
not populate .enter_dead callback for FFH based C-States on AMD/Hygon
Platforms.
How about something like the following? I have tested this on AMD
servers by disabling the IO based C2 state, and I could observe that
the offlined CPUs entered HLT bypassing the MWAIT based C1. When IO
based C2 state is enabled, offlined CPUs enter the C2 state as before.
If Global C-States are disabled, then, offlined CPUs enter HLT.
This is on top of Patrick's series.
I have defined X86_FEATURE_NO_MWAIT_OFFLINE as suggested by Dave to
prevent MWAIT based C-states being used for CPU Offline on AMD and
Hygon.
---------------------x8----------------------------------------------------
From: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Date: Thu, 14 Nov 2024 14:48:17 +0530
Subject: [PATCH] acpi_idle: Teach acpi_idle_play_dead about FFH states
Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/acpi/cstate.c | 14 ++++++++++++++
arch/x86/kernel/smpboot.c | 6 +++---
drivers/acpi/processor_idle.c | 24 ++++++++++++++++++++++--
include/acpi/processor.h | 6 ++++++
5 files changed, 46 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 05e985ce9880..ceb002406d0c 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -236,6 +236,7 @@
#define X86_FEATURE_PVUNLOCK ( 8*32+20) /* PV unlock function */
#define X86_FEATURE_VCPUPREEMPT ( 8*32+21) /* PV vcpu_is_preempted function */
#define X86_FEATURE_TDX_GUEST ( 8*32+22) /* "tdx_guest" Intel Trust Domain Extensions Guest */
+#define X86_FEATURE_NO_MWAIT_OFFLINE ( 8*32+23) /* Don't use MWAIT states for offlined CPUs*/
/* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
#define X86_FEATURE_FSGSBASE ( 9*32+ 0) /* "fsgsbase" RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index f3ffd0a3a012..a5255a603745 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -215,6 +215,16 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
}
EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
+void acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
+{
+ unsigned int cpu = smp_processor_id();
+ struct cstate_entry *percpu_entry;
+
+ percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
+ mwait_play_dead_with_hint(percpu_entry->states[cx->index].eax);
+}
+EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
+
static int __init ffh_cstate_init(void)
{
struct cpuinfo_x86 *c = &boot_cpu_data;
@@ -224,6 +234,10 @@ static int __init ffh_cstate_init(void)
c->x86_vendor != X86_VENDOR_HYGON)
return -1;
+ if (c->x86_vendor == X86_VENDOR_AMD ||
+ c->x86_vendor == X86_VENDOR_HYGON)
+ set_cpu_cap(c, X86_FEATURE_NO_MWAIT_OFFLINE);
+
cpu_cstate_entry = alloc_percpu(struct cstate_entry);
return 0;
}
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 106f26e976b8..b5939bf96be6 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1285,9 +1285,9 @@ static inline int mwait_play_dead(void)
unsigned int highest_subcstate = 0;
int i;
- if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
- boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
- return 1;
+ if (boot_cpu_has(X86_FEATURE_NO_MWAIT_OFFLINE))
+ return -ENODEV;
+
if (!this_cpu_has(X86_FEATURE_MWAIT))
return 1;
if (!this_cpu_has(X86_FEATURE_CLFLUSH))
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 831fa4a12159..1e2259b4395b 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -590,6 +590,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
raw_safe_halt();
else if (cx->entry_method == ACPI_CSTATE_SYSTEMIO) {
io_idle(cx->address);
+ } else if (cx->entry_method == ACPI_CSTATE_FFH) {
+ acpi_processor_ffh_play_dead(cx);
} else
return -ENODEV;
}
@@ -772,6 +774,25 @@ static int acpi_processor_setup_cpuidle_cx(struct acpi_processor *pr,
return 0;
}
+static inline bool valid_acpi_idle_type(struct acpi_processor_cx *cx)
+{
+ if (cx->type == ACPI_STATE_C1 ||
+ cx->type == ACPI_STATE_C2 ||
+ cx->type == ACPI_STATE_C3)
+ return true;
+
+ return false;
+}
+
+static inline bool acpi_idle_can_play_dead(struct acpi_processor_cx *cx)
+{
+ if (cx->entry_method == ACPI_CSTATE_FFH &&
+ boot_cpu_has(X86_FEATURE_NO_MWAIT_OFFLINE))
+ return false;
+
+ return true;
+}
+
static int acpi_processor_setup_cstates(struct acpi_processor *pr)
{
int i, count;
@@ -803,8 +824,7 @@ static int acpi_processor_setup_cstates(struct acpi_processor *pr)
state->enter = acpi_idle_enter;
state->flags = 0;
- if (cx->type == ACPI_STATE_C1 || cx->type == ACPI_STATE_C2 ||
- cx->type == ACPI_STATE_C3) {
+ if (valid_acpi_idle_type(cx) && acpi_idle_can_play_dead(cx)) {
state->enter_dead = acpi_idle_play_dead;
if (cx->type != ACPI_STATE_C3)
drv->safe_state_index = count;
diff --git a/include/acpi/processor.h b/include/acpi/processor.h
index e6f6074eadbf..349fe47a579e 100644
--- a/include/acpi/processor.h
+++ b/include/acpi/processor.h
@@ -280,6 +280,7 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
struct acpi_processor_cx *cx,
struct acpi_power_register *reg);
void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cstate);
+void acpi_processor_ffh_play_dead(struct acpi_processor_cx *cstate);
#else
static inline void acpi_processor_power_init_bm_check(struct
acpi_processor_flags
@@ -300,6 +301,11 @@ static inline void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx
{
return;
}
+
+static inline void acpi_processor_ffh_play_dead(struct acpi_processor_cx *cstate)
+{
+ return;
+}
#endif
static inline int call_on_cpu(int cpu, long (*fn)(void *), void *arg,
--
Thanks and Regards
gautham.
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-14 17:36 ` Gautham R. Shenoy
@ 2024-11-14 17:58 ` Rafael J. Wysocki
0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2024-11-14 17:58 UTC (permalink / raw)
To: Gautham R. Shenoy
Cc: Peter Zijlstra, Rafael J. Wysocki, Patryk Wlazlyn, x86,
linux-kernel, linux-pm, rafael.j.wysocki, len.brown,
artem.bityutskiy, dave.hansen, Mario Limonciello
On Thu, Nov 14, 2024 at 6:36 PM Gautham R. Shenoy
<gautham.shenoy@amd.com> wrote:
>
> On Wed, Nov 13, 2024 at 05:22:22PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
> >
> > > How about something like this (completely untested)
> > >
> > > ---------------------x8----------------------------------------------------
> > >
> > > diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> > > index f3ffd0a3a012..bd611771fa6c 100644
> > > --- a/arch/x86/kernel/acpi/cstate.c
> > > +++ b/arch/x86/kernel/acpi/cstate.c
> > > @@ -215,6 +215,24 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
> > > }
> > > EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
> > >
> > > +static int acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
> > > +{
> > > + unsigned int cpu = smp_processor_id();
> > > + struct cstate_entry *percpu_entry;
> > > +
> > > + /*
> > > + * This is ugly. But AMD processors don't prefer MWAIT based
> > > + * C-states when processors are offlined.
> > > + */
> > > + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> > > + boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> > > + return -ENODEV;
> >
> > Are there AMD systems with FFh idle states at all?
>
> Yes. On AMD systems, the ACPI C1 state is an FFh idle state. An
> example from my server box.
>
>
> $ cpupower idle-info | grep "FFH" -B1 -A3
> C1:
> Flags/Description: ACPI FFH MWAIT 0x0
> Latency: 1
> Usage: 6591
> Duration: 1482606
>
> >
> > Also, I don't think this works right, the loop in cpuidle_play_dead()
> > will terminate on this, and not try a shallower state which might have
> > IO/HLT on.
>
> Ah yes. You are right. I didn't observe that cpuidle_play_dead() calls
> "return idle_state.enter_state". I suppose the solution would be to
> not populate .enter_dead callback for FFH based C-States on AMD/Hygon
> Platforms.
No, no, the solution is to change cpuidle_play_dead() to do the right thing.
Please see
https://lore.kernel.org/linux-pm/4992010.31r3eYUQgx@rjwysocki.net/
that I've just posted.
> How about something like the following? I have tested this on AMD
> servers by disabling the IO based C2 state, and I could observe that
> the offlined CPUs entered HLT bypassing the MWAIT based C1. When IO
> based C2 state is enabled, offlined CPUs enter the C2 state as before.
> If Global C-States are disabled, then, offlined CPUs enter HLT.
>
> This is on top of Patrick's series.
>
> I have defined X86_FEATURE_NO_MWAIT_OFFLINE as suggested by Dave to
> prevent MWAIT based C-states being used for CPU Offline on AMD and
> Hygon.
>
> ---------------------x8----------------------------------------------------
> From: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
> Date: Thu, 14 Nov 2024 14:48:17 +0530
> Subject: [PATCH] acpi_idle: Teach acpi_idle_play_dead about FFH states
>
> Signed-off-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
> ---
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/kernel/acpi/cstate.c | 14 ++++++++++++++
> arch/x86/kernel/smpboot.c | 6 +++---
> drivers/acpi/processor_idle.c | 24 ++++++++++++++++++++++--
> include/acpi/processor.h | 6 ++++++
> 5 files changed, 46 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 05e985ce9880..ceb002406d0c 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -236,6 +236,7 @@
> #define X86_FEATURE_PVUNLOCK ( 8*32+20) /* PV unlock function */
> #define X86_FEATURE_VCPUPREEMPT ( 8*32+21) /* PV vcpu_is_preempted function */
> #define X86_FEATURE_TDX_GUEST ( 8*32+22) /* "tdx_guest" Intel Trust Domain Extensions Guest */
> +#define X86_FEATURE_NO_MWAIT_OFFLINE ( 8*32+23) /* Don't use MWAIT states for offlined CPUs*/
>
> /* Intel-defined CPU features, CPUID level 0x00000007:0 (EBX), word 9 */
> #define X86_FEATURE_FSGSBASE ( 9*32+ 0) /* "fsgsbase" RDFSBASE, WRFSBASE, RDGSBASE, WRGSBASE instructions*/
> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> index f3ffd0a3a012..a5255a603745 100644
> --- a/arch/x86/kernel/acpi/cstate.c
> +++ b/arch/x86/kernel/acpi/cstate.c
> @@ -215,6 +215,16 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
> }
> EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
>
> +void acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
> +{
> + unsigned int cpu = smp_processor_id();
> + struct cstate_entry *percpu_entry;
> +
> + percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
> + mwait_play_dead_with_hint(percpu_entry->states[cx->index].eax);
> +}
> +EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
Why does it need to be exported to modules?
> +
> static int __init ffh_cstate_init(void)
> {
> struct cpuinfo_x86 *c = &boot_cpu_data;
> @@ -224,6 +234,10 @@ static int __init ffh_cstate_init(void)
> c->x86_vendor != X86_VENDOR_HYGON)
> return -1;
>
> + if (c->x86_vendor == X86_VENDOR_AMD ||
> + c->x86_vendor == X86_VENDOR_HYGON)
> + set_cpu_cap(c, X86_FEATURE_NO_MWAIT_OFFLINE);
> +
> cpu_cstate_entry = alloc_percpu(struct cstate_entry);
> return 0;
> }
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 106f26e976b8..b5939bf96be6 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1285,9 +1285,9 @@ static inline int mwait_play_dead(void)
> unsigned int highest_subcstate = 0;
> int i;
>
> - if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> - boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> - return 1;
> + if (boot_cpu_has(X86_FEATURE_NO_MWAIT_OFFLINE))
> + return -ENODEV;
> +
> if (!this_cpu_has(X86_FEATURE_MWAIT))
> return 1;
> if (!this_cpu_has(X86_FEATURE_CLFLUSH))
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 831fa4a12159..1e2259b4395b 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -590,6 +590,8 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
> raw_safe_halt();
> else if (cx->entry_method == ACPI_CSTATE_SYSTEMIO) {
> io_idle(cx->address);
> + } else if (cx->entry_method == ACPI_CSTATE_FFH) {
> + acpi_processor_ffh_play_dead(cx);
> } else
> return -ENODEV;
> }
> @@ -772,6 +774,25 @@ static int acpi_processor_setup_cpuidle_cx(struct acpi_processor *pr,
> return 0;
> }
>
> +static inline bool valid_acpi_idle_type(struct acpi_processor_cx *cx)
> +{
> + if (cx->type == ACPI_STATE_C1 ||
> + cx->type == ACPI_STATE_C2 ||
> + cx->type == ACPI_STATE_C3)
> + return true;
> +
> + return false;
This could do
return cx->type == ACPI_STATE_C1 || cx->type == ACPI_STATE_C2 ||
cx->type == ACPI_STATE_C3;
but it may just not be necessary.
Please see
https://lore.kernel.org/linux-pm/2373563.ElGaqSPkdT@rjwysocki.net/
> +}
> +
> +static inline bool acpi_idle_can_play_dead(struct acpi_processor_cx *cx)
> +{
> + if (cx->entry_method == ACPI_CSTATE_FFH &&
> + boot_cpu_has(X86_FEATURE_NO_MWAIT_OFFLINE))
> + return false;
> +
> + return true;
> +}
> +
> static int acpi_processor_setup_cstates(struct acpi_processor *pr)
> {
> int i, count;
> @@ -803,8 +824,7 @@ static int acpi_processor_setup_cstates(struct acpi_processor *pr)
> state->enter = acpi_idle_enter;
>
> state->flags = 0;
> - if (cx->type == ACPI_STATE_C1 || cx->type == ACPI_STATE_C2 ||
> - cx->type == ACPI_STATE_C3) {
> + if (valid_acpi_idle_type(cx) && acpi_idle_can_play_dead(cx)) {
> state->enter_dead = acpi_idle_play_dead;
The check below is separate from the "enter_dead" check, so it should
not depend on acpi_idle_can_play_dead(cx).
> if (cx->type != ACPI_STATE_C3)
> drv->safe_state_index = count;
> diff --git a/include/acpi/processor.h b/include/acpi/processor.h
> index e6f6074eadbf..349fe47a579e 100644
> --- a/include/acpi/processor.h
> +++ b/include/acpi/processor.h
> @@ -280,6 +280,7 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
> struct acpi_processor_cx *cx,
> struct acpi_power_register *reg);
> void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cstate);
> +void acpi_processor_ffh_play_dead(struct acpi_processor_cx *cstate);
> #else
> static inline void acpi_processor_power_init_bm_check(struct
> acpi_processor_flags
> @@ -300,6 +301,11 @@ static inline void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx
> {
> return;
> }
> +
> +static inline void acpi_processor_ffh_play_dead(struct acpi_processor_cx *cstate)
> +{
> + return;
> +}
> #endif
>
> static inline int call_on_cpu(int cpu, long (*fn)(void *), void *arg,
>
>
> --
> Thanks and Regards
> gautham.
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-14 12:03 ` Peter Zijlstra
@ 2024-11-15 1:21 ` Thomas Gleixner
2024-11-15 10:07 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2024-11-15 1:21 UTC (permalink / raw)
To: Peter Zijlstra, Artem Bityutskiy
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, dave.hansen
On Thu, Nov 14 2024 at 13:03, Peter Zijlstra wrote:
> On Tue, Nov 12, 2024 at 03:01:27PM +0100, Peter Zijlstra wrote:
>
>> No, not mwait hint. We need an instruction that:
>>
>> - goes to deepest C state
>> - drops into WAIT-for-Start-IPI (SIPI)
>>
>> Notably, it should not wake from:
>>
>> - random memory writes
>> - NMI, MCE, SMI and other such non-maskable thingies
>> - anything else -- the memory pointed to by RIP might no longer exist
>>
>> Lets call the instruction: DEAD.
>
> So, turns out that when you send INIT to an AP it does the whole drop
> into Wait-for-SIPI and ignore non-maskable crap.
>
> The reason we don't do that is because INIT to CPU0 (BP) is somewhat
> fatal, but since Thomas killed all that CPU0 hotplug crap, I think we
> can actually go do that.
Instead of playing dead or to kick out CPUs from whatever dead play
routine they are in?
playimg dead is to stay because INIT will bring back the MCE broadcast
problem, which we try to avoid by bringing SMT siblings up just to shut
them down again by playing dead.
You need a MCE broadcast free system and/or some sensible BIOS bringup
code for that to work...
Thanks,
tglx
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-15 1:21 ` Thomas Gleixner
@ 2024-11-15 10:07 ` Peter Zijlstra
2024-11-15 15:37 ` Thomas Gleixner
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-15 10:07 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Artem Bityutskiy, Rafael J. Wysocki, Patryk Wlazlyn, x86,
linux-kernel, linux-pm, rafael.j.wysocki, len.brown, dave.hansen
On Fri, Nov 15, 2024 at 02:21:19AM +0100, Thomas Gleixner wrote:
> On Thu, Nov 14 2024 at 13:03, Peter Zijlstra wrote:
> > On Tue, Nov 12, 2024 at 03:01:27PM +0100, Peter Zijlstra wrote:
> >
> >> No, not mwait hint. We need an instruction that:
> >>
> >> - goes to deepest C state
> >> - drops into WAIT-for-Start-IPI (SIPI)
> >>
> >> Notably, it should not wake from:
> >>
> >> - random memory writes
> >> - NMI, MCE, SMI and other such non-maskable thingies
> >> - anything else -- the memory pointed to by RIP might no longer exist
> >>
> >> Lets call the instruction: DEAD.
> >
> > So, turns out that when you send INIT to an AP it does the whole drop
> > into Wait-for-SIPI and ignore non-maskable crap.
> >
> > The reason we don't do that is because INIT to CPU0 (BP) is somewhat
> > fatal, but since Thomas killed all that CPU0 hotplug crap, I think we
> > can actually go do that.
>
> Instead of playing dead or to kick out CPUs from whatever dead play
> routine they are in?
>
> playimg dead is to stay because INIT will bring back the MCE broadcast
> problem, which we try to avoid by bringing SMT siblings up just to shut
> them down again by playing dead.
>
> You need a MCE broadcast free system and/or some sensible BIOS bringup
> code for that to work...
Isn't INIT a better state to be in during kexec than HLT?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-14 17:24 ` Gautham R. Shenoy
@ 2024-11-15 10:11 ` Peter Zijlstra
2024-11-25 5:45 ` Gautham R. Shenoy
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2024-11-15 10:11 UTC (permalink / raw)
To: Gautham R. Shenoy
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, artem.bityutskiy, dave.hansen,
Mario Limonciello
On Thu, Nov 14, 2024 at 10:54:15PM +0530, Gautham R. Shenoy wrote:
> Hello Peter,
>
> On Thu, Nov 14, 2024 at 12:58:31PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 13, 2024 at 05:11:38PM +0530, Gautham R. Shenoy wrote:
> >
> > > AMD platforms won't be using FFH based states for offlined CPUs. We
> > > prefer IO based states when available, and HLT otherwise.
> > >
> > > >
> > > > Robustly we'd teach the ACPI driver about FFh and set enter_dead on
> > > > every state -- but we'd have to double check that with AMD.
> > >
> > > Works for us as long as those FFh states aren't used for play_dead on
> > > AMD platforms.
> >
> > AFAIU AMD doesn't want to use MWAIT -- ever, not only for offline.
> > Confirm?
> >
>
> AMD wants to use MWAIT for cpuidle and it does use MWAIT based C1
> state on both client and server parts.
>
> Eg: On my server box
>
> $ cpupower idle-info | grep "FFH" -B1 -A3
> C1:
> Flags/Description: ACPI FFH MWAIT 0x0
> Latency: 1
> Usage: 6591
> Duration: 1482606
>
> > But if it were to use MWAIT for regular idle, then surely it's OK for
> > offline too, right?
>
> I tried this out today and there is no functional issue.
>
> However, I would like to run some experiments on whether HLT provides
> better power savings than MWAIT C1 with CPUs offlined. I will get back
> with this information tomorrow.
Right, but in most cases you'll have C2/C3 with io ports specified and
those will be picked for play_dead anyway. It's just the exceptionally
weird BIOS case where you'll have C2/C3 as FFh -- because random BIOS
person was on drugs that day or something like that.
Anyway, what I'm trying to say is that you'll probably fine without
adding a bunch of if (AMD|HYGON) logic -- the less of that we have, the
better, etc..
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-15 10:07 ` Peter Zijlstra
@ 2024-11-15 15:37 ` Thomas Gleixner
0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gleixner @ 2024-11-15 15:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Artem Bityutskiy, Rafael J. Wysocki, Patryk Wlazlyn, x86,
linux-kernel, linux-pm, rafael.j.wysocki, len.brown, dave.hansen
On Fri, Nov 15 2024 at 11:07, Peter Zijlstra wrote:
> On Fri, Nov 15, 2024 at 02:21:19AM +0100, Thomas Gleixner wrote:
>> On Thu, Nov 14 2024 at 13:03, Peter Zijlstra wrote:
>> > On Tue, Nov 12, 2024 at 03:01:27PM +0100, Peter Zijlstra wrote:
>> >
>> >> No, not mwait hint. We need an instruction that:
>> >>
>> >> - goes to deepest C state
>> >> - drops into WAIT-for-Start-IPI (SIPI)
>> >>
>> >> Notably, it should not wake from:
>> >>
>> >> - random memory writes
>> >> - NMI, MCE, SMI and other such non-maskable thingies
>> >> - anything else -- the memory pointed to by RIP might no longer exist
>> >>
>> >> Lets call the instruction: DEAD.
>> >
>> > So, turns out that when you send INIT to an AP it does the whole drop
>> > into Wait-for-SIPI and ignore non-maskable crap.
>> >
>> > The reason we don't do that is because INIT to CPU0 (BP) is somewhat
>> > fatal, but since Thomas killed all that CPU0 hotplug crap, I think we
>> > can actually go do that.
>>
>> Instead of playing dead or to kick out CPUs from whatever dead play
>> routine they are in?
>>
>> playimg dead is to stay because INIT will bring back the MCE broadcast
>> problem, which we try to avoid by bringing SMT siblings up just to shut
>> them down again by playing dead.
>>
>> You need a MCE broadcast free system and/or some sensible BIOS bringup
>> code for that to work...
>
> Isn't INIT a better state to be in during kexec than HLT?
For the kexec transition I agree that INIT is better. We just need to be
careful for a crash kexec when the crashing CPU is not CPU0...
Thanks,
tglx
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
2024-11-15 10:11 ` Peter Zijlstra
@ 2024-11-25 5:45 ` Gautham R. Shenoy
0 siblings, 0 replies; 51+ messages in thread
From: Gautham R. Shenoy @ 2024-11-25 5:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, Patryk Wlazlyn, x86, linux-kernel, linux-pm,
rafael.j.wysocki, len.brown, artem.bityutskiy, dave.hansen,
Mario Limonciello
Hello Peter,
Apologies for the delay in response. I was away most of last week.
On Fri, Nov 15, 2024 at 11:11:00AM +0100, Peter Zijlstra wrote:
[..snip..]
> >
> > > But if it were to use MWAIT for regular idle, then surely it's OK for
> > > offline too, right?
Yes, this is not a problem.
> >
> > I tried this out today and there is no functional issue.
> >
> > However, I would like to run some experiments on whether HLT provides
> > better power savings than MWAIT C1 with CPUs offlined. I will get back
> > with this information tomorrow.
>
> Right, but in most cases you'll have C2/C3 with io ports specified and
> those will be picked for play_dead anyway. It's just the exceptionally
> weird BIOS case where you'll have C2/C3 as FFh -- because random BIOS
> person was on drugs that day or something like that.
It is unlikely to have a BIOSes on AMD platforms that allow users to
disable ACPI C2 state. That would be the only scenario when an
offlined CPU would enter FFH based C1.
>
> Anyway, what I'm trying to say is that you'll probably fine without
> adding a bunch of if (AMD|HYGON) logic -- the less of that we have, the
> better, etc..
I agree, I will resend the patch without this check.
--
Thanks and Regards
gautham.
^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2024-11-25 5:46 UTC | newest]
Thread overview: 51+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-08 12:29 [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry Patryk Wlazlyn
2024-11-08 12:29 ` [PATCH v3 1/3] x86/smp: Allow calling mwait_play_dead with arbitrary hint Patryk Wlazlyn
2024-11-08 16:03 ` Dave Hansen
2024-11-12 10:54 ` Patryk Wlazlyn
2024-11-08 12:29 ` [PATCH v3 2/3] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead() Patryk Wlazlyn
2024-11-08 16:14 ` Dave Hansen
2024-11-12 10:55 ` Patryk Wlazlyn
2024-11-12 11:47 ` Peter Zijlstra
2024-11-12 12:03 ` Rafael J. Wysocki
2024-11-12 12:18 ` Peter Zijlstra
2024-11-12 12:30 ` Rafael J. Wysocki
2024-11-12 12:38 ` Rafael J. Wysocki
2024-11-12 13:49 ` Peter Zijlstra
2024-11-12 14:56 ` Rafael J. Wysocki
2024-11-12 15:08 ` Peter Zijlstra
2024-11-12 16:24 ` Rafael J. Wysocki
2024-11-12 12:44 ` Artem Bityutskiy
2024-11-12 14:01 ` Peter Zijlstra
2024-11-14 12:03 ` Peter Zijlstra
2024-11-15 1:21 ` Thomas Gleixner
2024-11-15 10:07 ` Peter Zijlstra
2024-11-15 15:37 ` Thomas Gleixner
2024-11-12 13:23 ` Rafael J. Wysocki
2024-11-12 14:56 ` Peter Zijlstra
2024-11-12 15:00 ` Rafael J. Wysocki
2024-11-13 11:41 ` Gautham R. Shenoy
2024-11-13 16:14 ` Dave Hansen
2024-11-14 5:06 ` Gautham R. Shenoy
2024-11-13 16:22 ` Peter Zijlstra
2024-11-13 16:27 ` Wysocki, Rafael J
2024-11-14 11:58 ` Rafael J. Wysocki
2024-11-14 12:17 ` Peter Zijlstra
2024-11-14 17:36 ` Gautham R. Shenoy
2024-11-14 17:58 ` Rafael J. Wysocki
2024-11-14 11:58 ` Peter Zijlstra
2024-11-14 17:24 ` Gautham R. Shenoy
2024-11-15 10:11 ` Peter Zijlstra
2024-11-25 5:45 ` Gautham R. Shenoy
2024-11-08 12:29 ` [PATCH v3 3/3] intel_idle: Provide enter_dead() handler for SRF Patryk Wlazlyn
2024-11-08 16:21 ` Dave Hansen
2024-11-12 10:57 ` Patryk Wlazlyn
2024-11-12 11:28 ` Rafael J. Wysocki
2024-11-12 16:07 ` Dave Hansen
2024-11-12 19:17 ` Thomas Gleixner
2024-11-12 19:43 ` Rafael J. Wysocki
2024-11-08 22:12 ` kernel test robot
2024-11-08 16:22 ` [PATCH v3 0/3] SRF: Fix offline CPU preventing pc6 entry Dave Hansen
2024-11-12 11:45 ` Peter Zijlstra
2024-11-12 15:43 ` Patryk Wlazlyn
2024-11-13 1:19 ` Thomas Gleixner
2024-11-14 17:13 ` Patryk Wlazlyn
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).