* [PATCH v1 1/2] Revert "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"
2025-05-28 12:53 [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2 Rafael J. Wysocki
@ 2025-05-28 12:53 ` Rafael J. Wysocki
2025-05-28 12:54 ` [PATCH v1 2/2] x86/smp: Prefer cpuidle_play_dead() to mwait_play_dead_cpuid_hint() Rafael J. Wysocki
` (2 subsequent siblings)
3 siblings, 0 replies; 19+ messages in thread
From: Rafael J. Wysocki @ 2025-05-28 12:53 UTC (permalink / raw)
To: x86 Maintainers
Cc: LKML, Linux PM, Len Brown, Peter Zijlstra, Thomas Gleixner,
Borislav Petkov, Dave Hansen, Artem Bityutskiy, Gautham R. Shenoy,
Ingo Molnar, Todd Brandt
From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Revert commit 96040f7273e2 ("x86/smp: Eliminate mwait_play_dead_cpuid_hint()")
because it introduced a significant power regression on systems that start
with "nosmt" in the kernel command line.
Namely, on such systems, SMT siblings permanently go offline early,
when cpuidle has not been initialized yet, so after the above commit,
hlt_play_dead() is called for them. Later on, when the processor
attempts to enter a deep package C-state, including PC10 which is
requisite for reaching minimum power in suspend-to-idle, it is not
able to do that because of the SMT siblings staying in C1 (which
they have been put into by HLT).
Fixes: 96040f7273e2 ("x86/smp: Eliminate mwait_play_dead_cpuid_hint()")
Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Cc: 6.15+ <stable@vger.kernel.org> # 6.15+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
arch/x86/kernel/smpboot.c | 54 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 47 insertions(+), 7 deletions(-)
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1238,10 +1238,6 @@
local_irq_disable();
}
-/*
- * We need to flush the caches before going to sleep, lest we have
- * dirty data in our caches when we come back up.
- */
void __noreturn mwait_play_dead(unsigned int eax_hint)
{
struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
@@ -1288,6 +1284,50 @@
}
/*
+ * We need to flush the caches before going to sleep, lest we have
+ * dirty data in our caches when we come back up.
+ */
+static inline void mwait_play_dead_cpuid_hint(void)
+{
+ unsigned int eax, ebx, ecx, edx;
+ unsigned int highest_cstate = 0;
+ unsigned int highest_subcstate = 0;
+ int i;
+
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
+ boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
+ return;
+ if (!this_cpu_has(X86_FEATURE_MWAIT))
+ return;
+ if (!this_cpu_has(X86_FEATURE_CLFLUSH))
+ return;
+
+ eax = CPUID_LEAF_MWAIT;
+ ecx = 0;
+ native_cpuid(&eax, &ebx, &ecx, &edx);
+
+ /*
+ * eax will be 0 if EDX enumeration is not valid.
+ * Initialized below to cstate, sub_cstate value when EDX is valid.
+ */
+ if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED)) {
+ eax = 0;
+ } else {
+ edx >>= MWAIT_SUBSTATE_SIZE;
+ for (i = 0; i < 7 && edx; i++, edx >>= MWAIT_SUBSTATE_SIZE) {
+ if (edx & MWAIT_SUBSTATE_MASK) {
+ highest_cstate = i;
+ highest_subcstate = edx & MWAIT_SUBSTATE_MASK;
+ }
+ }
+ eax = (highest_cstate << MWAIT_SUBSTATE_SIZE) |
+ (highest_subcstate - 1);
+ }
+
+ mwait_play_dead(eax);
+}
+
+/*
* Kick all "offline" CPUs out of mwait on kexec(). See comment in
* mwait_play_dead().
*/
@@ -1337,9 +1377,9 @@
play_dead_common();
tboot_shutdown(TB_SHUTDOWN_WFS);
- /* Below returns only on error. */
- cpuidle_play_dead();
- hlt_play_dead();
+ mwait_play_dead_cpuid_hint();
+ if (cpuidle_play_dead())
+ hlt_play_dead();
}
#else /* ... !CONFIG_HOTPLUG_CPU */
^ permalink raw reply [flat|nested] 19+ messages in thread* [PATCH v1 2/2] x86/smp: Prefer cpuidle_play_dead() to mwait_play_dead_cpuid_hint()
2025-05-28 12:53 [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2 Rafael J. Wysocki
2025-05-28 12:53 ` [PATCH v1 1/2] Revert "x86/smp: Eliminate mwait_play_dead_cpuid_hint()" Rafael J. Wysocki
@ 2025-05-28 12:54 ` Rafael J. Wysocki
2025-05-28 13:17 ` [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2 Peter Zijlstra
2025-05-29 13:40 ` [PATCH v2] Revert "x86/smp: Eliminate mwait_play_dead_cpuid_hint()" Rafael J. Wysocki
3 siblings, 0 replies; 19+ messages in thread
From: Rafael J. Wysocki @ 2025-05-28 12:54 UTC (permalink / raw)
To: x86 Maintainers
Cc: LKML, Linux PM, Len Brown, Peter Zijlstra, Thomas Gleixner,
Borislav Petkov, Dave Hansen, Artem Bityutskiy, Gautham R. Shenoy,
Ingo Molnar, Todd Brandt
From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Currently, mwait_play_dead_cpuid_hint() looks up the MWAIT hint of the
deepest idle state by inspecting CPUID leaf 0x05 with the assumption
that, if the number of sub-states for a given major C-state is nonzero,
those sub-states are always represented by consecutive numbers starting
from 0. This assumption is not based on the documented platform behavior
and in fact it is not met on recent Intel platforms (eg. Sierra Forest).
For this reason, it is better to let the cpuidle driver for the given
platform put CPUs going offline into appropriate idle state and only
if that fails, fall back to mwait_play_dead_cpuid_hint(), which may
still be the next best "play dead" variant if cpuidle is not available.
For example, when "nosmt" is passed to the kernel in the command line,
SMT siblings are disabled early, before cpuidle gets ready, but they
need to be put into sufficiently deep idle states to allow the whole
processor to reach deep package idle states, like PC10, later on.
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
arch/x86/kernel/smpboot.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1377,9 +1377,10 @@
play_dead_common();
tboot_shutdown(TB_SHUTDOWN_WFS);
+ /* Each call in the following sequence returns only on errors. */
+ cpuidle_play_dead();
mwait_play_dead_cpuid_hint();
- if (cpuidle_play_dead())
- hlt_play_dead();
+ hlt_play_dead();
}
#else /* ... !CONFIG_HOTPLUG_CPU */
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-28 12:53 [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2 Rafael J. Wysocki
2025-05-28 12:53 ` [PATCH v1 1/2] Revert "x86/smp: Eliminate mwait_play_dead_cpuid_hint()" Rafael J. Wysocki
2025-05-28 12:54 ` [PATCH v1 2/2] x86/smp: Prefer cpuidle_play_dead() to mwait_play_dead_cpuid_hint() Rafael J. Wysocki
@ 2025-05-28 13:17 ` Peter Zijlstra
2025-05-28 13:20 ` Rafael J. Wysocki
2025-05-29 13:40 ` [PATCH v2] Revert "x86/smp: Eliminate mwait_play_dead_cpuid_hint()" Rafael J. Wysocki
3 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2025-05-28 13:17 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: x86 Maintainers, LKML, Linux PM, Len Brown, Thomas Gleixner,
Borislav Petkov, Dave Hansen, Artem Bityutskiy, Gautham R. Shenoy,
Ingo Molnar, Todd Brandt
On Wed, May 28, 2025 at 02:53:13PM +0200, Rafael J. Wysocki wrote:
> Hi Everyone,
>
> Commit 96040f7273e2 ("x86/smp: Eliminate mwait_play_dead_cpuid_hint()")
> that shipped in 6.15 introduced a nasty power regression on systems that
> start with "nosmt" in the kernel command line which prevents it from entering
> deep package idle states (for instance, PC10) later on. Idle power, including
> power in suspend-to-idle, goes up significantly on those systems as a result.
>
> Address this by reverting commit 96040f7273e2 (patch [1/2]) and using a
> different approach, which is to retain mwait_play_dead_cpuid_hint() and
> still prefer it to hlt_play_dead() in case it is needed when cpuidle is
> not available, but prefer cpuidle_play_dead() to it by default (patch [2/2]).
I don't understand. The revert says the reason it regresses is that it
goes into play_dead before cpuidle is initialized. The fix is then to
call cpuidle first.
But if cpuidle isn't initialized yet, how does that fix anything?
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-28 13:17 ` [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2 Peter Zijlstra
@ 2025-05-28 13:20 ` Rafael J. Wysocki
2025-05-28 13:38 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Rafael J. Wysocki @ 2025-05-28 13:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM, Len Brown,
Thomas Gleixner, Borislav Petkov, Dave Hansen, Artem Bityutskiy,
Gautham R. Shenoy, Ingo Molnar, Todd Brandt
On Wed, May 28, 2025 at 3:18 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, May 28, 2025 at 02:53:13PM +0200, Rafael J. Wysocki wrote:
> > Hi Everyone,
> >
> > Commit 96040f7273e2 ("x86/smp: Eliminate mwait_play_dead_cpuid_hint()")
> > that shipped in 6.15 introduced a nasty power regression on systems that
> > start with "nosmt" in the kernel command line which prevents it from entering
> > deep package idle states (for instance, PC10) later on. Idle power, including
> > power in suspend-to-idle, goes up significantly on those systems as a result.
> >
> > Address this by reverting commit 96040f7273e2 (patch [1/2]) and using a
> > different approach, which is to retain mwait_play_dead_cpuid_hint() and
> > still prefer it to hlt_play_dead() in case it is needed when cpuidle is
> > not available, but prefer cpuidle_play_dead() to it by default (patch [2/2]).
>
> I don't understand. The revert says the reason it regresses is that it
> goes into play_dead before cpuidle is initialized. The fix is then to
> call cpuidle first.
>
> But if cpuidle isn't initialized yet, how does that fix anything?
The revert fixes the bug.
The other patch does what the reverted commit was supposed to be
doing, but differently.
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-28 13:20 ` Rafael J. Wysocki
@ 2025-05-28 13:38 ` Peter Zijlstra
2025-05-28 14:25 ` Rafael J. Wysocki
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2025-05-28 13:38 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM, Len Brown,
Thomas Gleixner, Borislav Petkov, Dave Hansen, Artem Bityutskiy,
Gautham R. Shenoy, Ingo Molnar, Todd Brandt
On Wed, May 28, 2025 at 03:20:16PM +0200, Rafael J. Wysocki wrote:
> On Wed, May 28, 2025 at 3:18 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, May 28, 2025 at 02:53:13PM +0200, Rafael J. Wysocki wrote:
> > > Hi Everyone,
> > >
> > > Commit 96040f7273e2 ("x86/smp: Eliminate mwait_play_dead_cpuid_hint()")
> > > that shipped in 6.15 introduced a nasty power regression on systems that
> > > start with "nosmt" in the kernel command line which prevents it from entering
> > > deep package idle states (for instance, PC10) later on. Idle power, including
> > > power in suspend-to-idle, goes up significantly on those systems as a result.
> > >
> > > Address this by reverting commit 96040f7273e2 (patch [1/2]) and using a
> > > different approach, which is to retain mwait_play_dead_cpuid_hint() and
> > > still prefer it to hlt_play_dead() in case it is needed when cpuidle is
> > > not available, but prefer cpuidle_play_dead() to it by default (patch [2/2]).
> >
> > I don't understand. The revert says the reason it regresses is that it
> > goes into play_dead before cpuidle is initialized. The fix is then to
> > call cpuidle first.
> >
> > But if cpuidle isn't initialized yet, how does that fix anything?
>
> The revert fixes the bug.
This is not what I asked.
> The other patch does what the reverted commit was supposed to be
> doing, but differently.
No, it does not.
The whole point was that mwait_play_dead did not DTRT because hints are
stupid and it could not select the deepest C state in an unambiguous
fashion.
And now you're restoring that -- code you all argued was fundamentally
buggered.
Yes is 'fixes' things on old platforms, but it is equally broken on the
new platforms where you all argued it was broken on. So either way
around you're going to need to fix those, and this isn't it.
Now, SMT siblings are all AP, by definition. So can't we simply send
them INIT instead of doing CLI;HLT, that way they drop into
Wait-for-SIPI and the ucode can sort it out?
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-28 13:38 ` Peter Zijlstra
@ 2025-05-28 14:25 ` Rafael J. Wysocki
2025-05-28 16:05 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Rafael J. Wysocki @ 2025-05-28 14:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, Rafael J. Wysocki, x86 Maintainers, LKML,
Linux PM, Len Brown, Thomas Gleixner, Borislav Petkov,
Dave Hansen, Artem Bityutskiy, Gautham R. Shenoy, Ingo Molnar,
Todd Brandt
On Wed, May 28, 2025 at 3:45 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, May 28, 2025 at 03:20:16PM +0200, Rafael J. Wysocki wrote:
> > On Wed, May 28, 2025 at 3:18 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, May 28, 2025 at 02:53:13PM +0200, Rafael J. Wysocki wrote:
> > > > Hi Everyone,
> > > >
> > > > Commit 96040f7273e2 ("x86/smp: Eliminate mwait_play_dead_cpuid_hint()")
> > > > that shipped in 6.15 introduced a nasty power regression on systems that
> > > > start with "nosmt" in the kernel command line which prevents it from entering
> > > > deep package idle states (for instance, PC10) later on. Idle power, including
> > > > power in suspend-to-idle, goes up significantly on those systems as a result.
> > > >
> > > > Address this by reverting commit 96040f7273e2 (patch [1/2]) and using a
> > > > different approach, which is to retain mwait_play_dead_cpuid_hint() and
> > > > still prefer it to hlt_play_dead() in case it is needed when cpuidle is
> > > > not available, but prefer cpuidle_play_dead() to it by default (patch [2/2]).
> > >
> > > I don't understand. The revert says the reason it regresses is that it
> > > goes into play_dead before cpuidle is initialized. The fix is then to
> > > call cpuidle first.
> > >
> > > But if cpuidle isn't initialized yet, how does that fix anything?
> >
> > The revert fixes the bug.
>
> This is not what I asked.
>
> > The other patch does what the reverted commit was supposed to be
> > doing, but differently.
>
> No, it does not.
If cpuidle is available and works, it will do the same thing.
> The whole point was that mwait_play_dead did not DTRT because hints are
> stupid and it could not select the deepest C state in an unambiguous
> fashion.
Yes, on some systems.
> And now you're restoring that -- code you all argued was fundamentally
> buggered.
>
> Yes is 'fixes' things on old platforms, but it is equally broken on the
> new platforms where you all argued it was broken on. So either way
> around you're going to need to fix those, and this isn't it.
There are systems where mwait_play_dead_cpuid_hint() does not work and
there are systems where it works. Some of the latter are actually
new.
Regardless, if cpuidle_play_dead() runs before it and cpuidle is
there, the right thing will be done because cpuidle_play_dead() will
not return in that case.
The only question is what to do when cpuidle is not there.
The commit reverted by the first patch removed
mwait_play_dead_cpuid_hint() altogether, so it never runs and the only
fallback is hlt_play_dead(), but this doesn't work for disabling SMT
siblings.
If mwait_play_dead_cpuid_hint() is allowed to run before
hlt_play_dead() though, then worst-case it may use an unrecognized
MWAIT hint and the CPU should fall back to C1. If the MWAIT hint is
valid though, it will enter a deep idle state and that's what happens
on all of the systems tested in the lab (20+), including the most
recent ones.
> Now, SMT siblings are all AP, by definition. So can't we simply send
> them INIT instead of doing CLI;HLT, that way they drop into
> Wait-for-SIPI and the ucode can sort it out?
No, I don't think so. I don't think that Wait-for-SIPI is an idle state.
But we are discussing patch [2/2] here while really the problem is
that the commit in question is broken, so it needs to be reverted in
the first place.
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-28 14:25 ` Rafael J. Wysocki
@ 2025-05-28 16:05 ` Peter Zijlstra
2025-05-28 17:09 ` Rafael J. Wysocki
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2025-05-28 16:05 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM, Len Brown,
Thomas Gleixner, Borislav Petkov, Dave Hansen, Artem Bityutskiy,
Gautham R. Shenoy, Ingo Molnar, Todd Brandt
On Wed, May 28, 2025 at 04:25:19PM +0200, Rafael J. Wysocki wrote:
> If cpuidle is available and works, it will do the same thing.
Why can't we make it available sooner? But no, cpuidle does not do the
same thing -- it was argued it does the right thing because it has them
tables with C states on and doesn't try and divinate from CPUID.
> > The whole point was that mwait_play_dead did not DTRT because hints are
> > stupid and it could not select the deepest C state in an unambiguous
> > fashion.
>
> Yes, on some systems.
The 'on some systems' thing is irrelevant. Either it always works, or it
doesn't and we shouldnt be having it.
> > And now you're restoring that -- code you all argued was fundamentally
> > buggered.
> >
> > Yes is 'fixes' things on old platforms, but it is equally broken on the
> > new platforms where you all argued it was broken on. So either way
> > around you're going to need to fix those, and this isn't it.
> The commit reverted by the first patch removed
> mwait_play_dead_cpuid_hint() altogether, so it never runs and the only
> fallback is hlt_play_dead(), but this doesn't work for disabling SMT
> siblings.
It should either be fixed to always work or stay dead.
> > Now, SMT siblings are all AP, by definition. So can't we simply send
> > them INIT instead of doing CLI;HLT, that way they drop into
> > Wait-for-SIPI and the ucode can sort it out?
>
> No, I don't think so. I don't think that Wait-for-SIPI is an idle state.
>
> But we are discussing patch [2/2] here while really the problem is
> that the commit in question is broken, so it needs to be reverted in
> the first place.
No, you all very much argued that mwait_play_dead couldn't be fixed, as
such it must die and stay dead. Sometimes working is worse than never
working.
So no, I very much object to the revert.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-28 16:05 ` Peter Zijlstra
@ 2025-05-28 17:09 ` Rafael J. Wysocki
2025-05-29 8:53 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Rafael J. Wysocki @ 2025-05-28 17:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, Rafael J. Wysocki, x86 Maintainers, LKML,
Linux PM, Len Brown, Thomas Gleixner, Borislav Petkov,
Dave Hansen, Artem Bityutskiy, Gautham R. Shenoy, Ingo Molnar,
Todd Brandt
On Wed, May 28, 2025 at 6:05 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, May 28, 2025 at 04:25:19PM +0200, Rafael J. Wysocki wrote:
>
> > If cpuidle is available and works, it will do the same thing.
>
> Why can't we make it available sooner? But no, cpuidle does not do the
> same thing -- it was argued it does the right thing because it has them
> tables with C states on and doesn't try and divinate from CPUID.
>
> > > The whole point was that mwait_play_dead did not DTRT because hints are
> > > stupid and it could not select the deepest C state in an unambiguous
> > > fashion.
> >
> > Yes, on some systems.
>
> The 'on some systems' thing is irrelevant. Either it always works, or it
> doesn't and we shouldnt be having it.
>
> > > And now you're restoring that -- code you all argued was fundamentally
> > > buggered.
> > >
> > > Yes is 'fixes' things on old platforms, but it is equally broken on the
> > > new platforms where you all argued it was broken on. So either way
> > > around you're going to need to fix those, and this isn't it.
>
> > The commit reverted by the first patch removed
> > mwait_play_dead_cpuid_hint() altogether, so it never runs and the only
> > fallback is hlt_play_dead(), but this doesn't work for disabling SMT
> > siblings.
>
> It should either be fixed to always work or stay dead.
I'm talking about the current code which is broken on many systems.
> > > Now, SMT siblings are all AP, by definition. So can't we simply send
> > > them INIT instead of doing CLI;HLT, that way they drop into
> > > Wait-for-SIPI and the ucode can sort it out?
> >
> > No, I don't think so. I don't think that Wait-for-SIPI is an idle state.
> >
> > But we are discussing patch [2/2] here while really the problem is
> > that the commit in question is broken, so it needs to be reverted in
> > the first place.
>
> No, you all very much argued that mwait_play_dead couldn't be fixed, as
> such it must die and stay dead. Sometimes working is worse than never
> working.
>
> So no, I very much object to the revert.
And I object to leaving a user-visible regression behind.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-28 17:09 ` Rafael J. Wysocki
@ 2025-05-29 8:53 ` Peter Zijlstra
2025-05-29 9:38 ` Rafael J. Wysocki
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2025-05-29 8:53 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM, Len Brown,
Thomas Gleixner, Borislav Petkov, Dave Hansen, Artem Bityutskiy,
Gautham R. Shenoy, Ingo Molnar, Todd Brandt
On Wed, May 28, 2025 at 07:09:21PM +0200, Rafael J. Wysocki wrote:
> And I object to leaving a user-visible regression behind.
So we go fix it differently.
Why can't we initialize cpuidle before SMP bringup?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-29 8:53 ` Peter Zijlstra
@ 2025-05-29 9:38 ` Rafael J. Wysocki
2025-05-30 8:07 ` Peter Zijlstra
0 siblings, 1 reply; 19+ messages in thread
From: Rafael J. Wysocki @ 2025-05-29 9:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, Rafael J. Wysocki, x86 Maintainers, LKML,
Linux PM, Len Brown, Thomas Gleixner, Borislav Petkov,
Dave Hansen, Artem Bityutskiy, Gautham R. Shenoy, Ingo Molnar,
Todd Brandt
On Thu, May 29, 2025 at 10:54 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, May 28, 2025 at 07:09:21PM +0200, Rafael J. Wysocki wrote:
>
> > And I object to leaving a user-visible regression behind.
>
> So we go fix it differently.
>
> Why can't we initialize cpuidle before SMP bringup?
First off, I'm not sure if all of the requisite things are ready then
(sysfs etc.).
We may end up doing this eventually, but it may not be straightforward.
More importantly, this is not a change for 6.15.y (y > 0).
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-29 9:38 ` Rafael J. Wysocki
@ 2025-05-30 8:07 ` Peter Zijlstra
2025-05-30 9:18 ` Rafael J. Wysocki
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2025-05-30 8:07 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM, Len Brown,
Thomas Gleixner, Borislav Petkov, Dave Hansen, Artem Bityutskiy,
Gautham R. Shenoy, Ingo Molnar, Todd Brandt
On Thu, May 29, 2025 at 11:38:05AM +0200, Rafael J. Wysocki wrote:
> First off, I'm not sure if all of the requisite things are ready then
> (sysfs etc.).
Pretty much everything is already running at early_initcall(). Sysfs
certainly is.
> We may end up doing this eventually, but it may not be straightforward.
>
> More importantly, this is not a change for 6.15.y (y > 0).
Seriously, have you even tried?
AFAICT the below is all that is needed. That boots just fine on the one
randon system I tried, and seems to still work.
And this is plenty small enough to go into 6.15.y
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 0835da449db8..0f25de8081af 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -814,4 +814,4 @@ static int __init cpuidle_init(void)
module_param(off, int, 0444);
module_param_string(governor, param_governor, CPUIDLE_NAME_LEN, 0444);
-core_initcall(cpuidle_init);
+early_initcall(cpuidle_init);
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-30 8:07 ` Peter Zijlstra
@ 2025-05-30 9:18 ` Rafael J. Wysocki
2025-05-30 9:27 ` Rafael J. Wysocki
2025-05-30 16:59 ` Rafael J. Wysocki
0 siblings, 2 replies; 19+ messages in thread
From: Rafael J. Wysocki @ 2025-05-30 9:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, Rafael J. Wysocki, x86 Maintainers, LKML,
Linux PM, Len Brown, Thomas Gleixner, Borislav Petkov,
Dave Hansen, Artem Bityutskiy, Gautham R. Shenoy, Ingo Molnar,
Todd Brandt
On Fri, May 30, 2025 at 10:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, May 29, 2025 at 11:38:05AM +0200, Rafael J. Wysocki wrote:
>
> > First off, I'm not sure if all of the requisite things are ready then
> > (sysfs etc.).
>
> Pretty much everything is already running at early_initcall(). Sysfs
> certainly is.
>
> > We may end up doing this eventually, but it may not be straightforward.
> >
> > More importantly, this is not a change for 6.15.y (y > 0).
>
> Seriously, have you even tried?
>
> AFAICT the below is all that is needed. That boots just fine on the one
> randon system I tried, and seems to still work.
>
> And this is plenty small enough to go into 6.15.y
But there is still intel_idle_init() which is device_initcall() ATM
and this needs to be tested on other arches too.
So not really there, sorry.
> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index 0835da449db8..0f25de8081af 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -814,4 +814,4 @@ static int __init cpuidle_init(void)
>
> module_param(off, int, 0444);
> module_param_string(governor, param_governor, CPUIDLE_NAME_LEN, 0444);
> -core_initcall(cpuidle_init);
> +early_initcall(cpuidle_init);
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-30 9:18 ` Rafael J. Wysocki
@ 2025-05-30 9:27 ` Rafael J. Wysocki
2025-05-30 16:59 ` Rafael J. Wysocki
1 sibling, 0 replies; 19+ messages in thread
From: Rafael J. Wysocki @ 2025-05-30 9:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM, Len Brown,
Thomas Gleixner, Borislav Petkov, Dave Hansen, Artem Bityutskiy,
Gautham R. Shenoy, Ingo Molnar, Todd Brandt
On Fri, May 30, 2025 at 11:18 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, May 30, 2025 at 10:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, May 29, 2025 at 11:38:05AM +0200, Rafael J. Wysocki wrote:
> >
> > > First off, I'm not sure if all of the requisite things are ready then
> > > (sysfs etc.).
> >
> > Pretty much everything is already running at early_initcall(). Sysfs
> > certainly is.
> >
> > > We may end up doing this eventually, but it may not be straightforward.
> > >
> > > More importantly, this is not a change for 6.15.y (y > 0).
> >
> > Seriously, have you even tried?
> >
> > AFAICT the below is all that is needed. That boots just fine on the one
> > randon system I tried, and seems to still work.
> >
> > And this is plenty small enough to go into 6.15.y
>
> But there is still intel_idle_init() which is device_initcall() ATM
> and this needs to be tested on other arches too.
>
> So not really there, sorry.
BTW, I do think that this is the way to go, but not in a hurry, and if
6.15.y does not include a proper fix for the original issue, this is
not the end of the world.
If it turns out to be unproblematic and all goes well, we may as well
try to put it into 6.16.
> > diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> > index 0835da449db8..0f25de8081af 100644
> > --- a/drivers/cpuidle/cpuidle.c
> > +++ b/drivers/cpuidle/cpuidle.c
> > @@ -814,4 +814,4 @@ static int __init cpuidle_init(void)
> >
> > module_param(off, int, 0444);
> > module_param_string(governor, param_governor, CPUIDLE_NAME_LEN, 0444);
> > -core_initcall(cpuidle_init);
> > +early_initcall(cpuidle_init);
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-30 9:18 ` Rafael J. Wysocki
2025-05-30 9:27 ` Rafael J. Wysocki
@ 2025-05-30 16:59 ` Rafael J. Wysocki
2025-05-30 17:55 ` Rafael J. Wysocki
1 sibling, 1 reply; 19+ messages in thread
From: Rafael J. Wysocki @ 2025-05-30 16:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM, Len Brown,
Thomas Gleixner, Borislav Petkov, Dave Hansen, Artem Bityutskiy,
Gautham R. Shenoy, Ingo Molnar, Todd Brandt
On Fri, May 30, 2025 at 11:18 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, May 30, 2025 at 10:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, May 29, 2025 at 11:38:05AM +0200, Rafael J. Wysocki wrote:
> >
> > > First off, I'm not sure if all of the requisite things are ready then
> > > (sysfs etc.).
> >
> > Pretty much everything is already running at early_initcall(). Sysfs
> > certainly is.
> >
> > > We may end up doing this eventually, but it may not be straightforward.
> > >
> > > More importantly, this is not a change for 6.15.y (y > 0).
> >
> > Seriously, have you even tried?
> >
> > AFAICT the below is all that is needed. That boots just fine on the one
> > randon system I tried, and seems to still work.
> >
> > And this is plenty small enough to go into 6.15.y
>
> But there is still intel_idle_init() which is device_initcall() ATM
> and this needs to be tested on other arches too.
intel_idle_init() depends on ACPI which initializes via a
subsys_initcall() and doing that earlier would mean a major redesign.
Pretty much same for reordering APs initialization past ACPI initialization.
One more option is to kick the "dead" SMT siblings after the idle
driver initializes and let them do "play dead" again, but who'd be
responsible for doing that?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2
2025-05-30 16:59 ` Rafael J. Wysocki
@ 2025-05-30 17:55 ` Rafael J. Wysocki
0 siblings, 0 replies; 19+ messages in thread
From: Rafael J. Wysocki @ 2025-05-30 17:55 UTC (permalink / raw)
To: Peter Zijlstra, x86 Maintainers
Cc: Rafael J. Wysocki, LKML, Linux PM, Len Brown, Thomas Gleixner,
Borislav Petkov, Dave Hansen, Artem Bityutskiy, Gautham R. Shenoy,
Ingo Molnar, Todd Brandt
On Fri, May 30, 2025 at 6:59 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, May 30, 2025 at 11:18 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Fri, May 30, 2025 at 10:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, May 29, 2025 at 11:38:05AM +0200, Rafael J. Wysocki wrote:
> > >
> > > > First off, I'm not sure if all of the requisite things are ready then
> > > > (sysfs etc.).
> > >
> > > Pretty much everything is already running at early_initcall(). Sysfs
> > > certainly is.
> > >
> > > > We may end up doing this eventually, but it may not be straightforward.
> > > >
> > > > More importantly, this is not a change for 6.15.y (y > 0).
> > >
> > > Seriously, have you even tried?
> > >
> > > AFAICT the below is all that is needed. That boots just fine on the one
> > > randon system I tried, and seems to still work.
> > >
> > > And this is plenty small enough to go into 6.15.y
> >
> > But there is still intel_idle_init() which is device_initcall() ATM
> > and this needs to be tested on other arches too.
>
> intel_idle_init() depends on ACPI which initializes via a
> subsys_initcall() and doing that earlier would mean a major redesign.
>
> Pretty much same for reordering APs initialization past ACPI initialization.
>
> One more option is to kick the "dead" SMT siblings after the idle
> driver initializes and let them do "play dead" again, but who'd be
> responsible for doing that?
Interestingly enough, a similar issue is there during resume from
hibernation and that's why there is arch_resume_nosmt().
So arguably intel_idle_init() could do something analogous to it after
successfully registering the idle driver. In principle, the ACPI idle
driver could do that too on x86.
Opinions?
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v2] Revert "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"
2025-05-28 12:53 [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2 Rafael J. Wysocki
` (2 preceding siblings ...)
2025-05-28 13:17 ` [PATCH v1 0/2] x86/smp: Fix power regression introduced by commit 96040f7273e2 Peter Zijlstra
@ 2025-05-29 13:40 ` Rafael J. Wysocki
2025-05-29 14:25 ` Dave Hansen
3 siblings, 1 reply; 19+ messages in thread
From: Rafael J. Wysocki @ 2025-05-29 13:40 UTC (permalink / raw)
To: x86 Maintainers
Cc: LKML, Linux PM, Len Brown, Peter Zijlstra, Thomas Gleixner,
Borislav Petkov, Dave Hansen, Artem Bityutskiy, Gautham R. Shenoy,
Ingo Molnar, Todd Brandt
From: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Revert commit 96040f7273e2 ("x86/smp: Eliminate mwait_play_dead_cpuid_hint()")
because it introduced a significant power regression on systems that
start with "nosmt" in the kernel command line.
Namely, on such systems, SMT siblings permanently go offline early,
when cpuidle has not been initialized yet, so after the above commit,
hlt_play_dead() is called for them. Later on, when the processor
attempts to enter a deep package C-state, including PC10 which is
requisite for reaching minimum power in suspend-to-idle, it is not
able to do that because of the SMT siblings staying in C1 (which
they have been put into by HLT).
As a result, the idle power (including power in suspend-to-idle)
rises quite dramatically on those systems with all of the possible
consequences, which (needless to say) may not be expected by their
users.
This issue is hard to debug and potentially dangerous, so it needs to
be addressed as soon as possible in a way that will work for 6.15.y,
hence the revert.
Of course, after this revert, the issue that commit 96040f7273e2
attempted to address will be back and it will need to be fixed again
later.
Fixes: 96040f7273e2 ("x86/smp: Eliminate mwait_play_dead_cpuid_hint()")
Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Cc: 6.15+ <stable@vger.kernel.org> # 6.15+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
This supersedes https://lore.kernel.org/linux-pm/7811828.EvYhyI6sBW@rjwysocki.net/
v1 -> v2:
* Send as a standalone patch.
* Extend the changelog.
I honestly don't think that there is any reasonable alternative to this
revert that would be suitable for 6.15.y (y > 0), so I'm going to apply
it and include it in a PR during the remaining part of this merge window
unless somebody beats me to this.
Thanks!
---
arch/x86/kernel/smpboot.c | 54 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 47 insertions(+), 7 deletions(-)
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1238,10 +1238,6 @@
local_irq_disable();
}
-/*
- * We need to flush the caches before going to sleep, lest we have
- * dirty data in our caches when we come back up.
- */
void __noreturn mwait_play_dead(unsigned int eax_hint)
{
struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
@@ -1288,6 +1284,50 @@
}
/*
+ * We need to flush the caches before going to sleep, lest we have
+ * dirty data in our caches when we come back up.
+ */
+static inline void mwait_play_dead_cpuid_hint(void)
+{
+ unsigned int eax, ebx, ecx, edx;
+ unsigned int highest_cstate = 0;
+ unsigned int highest_subcstate = 0;
+ int i;
+
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
+ boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
+ return;
+ if (!this_cpu_has(X86_FEATURE_MWAIT))
+ return;
+ if (!this_cpu_has(X86_FEATURE_CLFLUSH))
+ return;
+
+ eax = CPUID_LEAF_MWAIT;
+ ecx = 0;
+ native_cpuid(&eax, &ebx, &ecx, &edx);
+
+ /*
+ * eax will be 0 if EDX enumeration is not valid.
+ * Initialized below to cstate, sub_cstate value when EDX is valid.
+ */
+ if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED)) {
+ eax = 0;
+ } else {
+ edx >>= MWAIT_SUBSTATE_SIZE;
+ for (i = 0; i < 7 && edx; i++, edx >>= MWAIT_SUBSTATE_SIZE) {
+ if (edx & MWAIT_SUBSTATE_MASK) {
+ highest_cstate = i;
+ highest_subcstate = edx & MWAIT_SUBSTATE_MASK;
+ }
+ }
+ eax = (highest_cstate << MWAIT_SUBSTATE_SIZE) |
+ (highest_subcstate - 1);
+ }
+
+ mwait_play_dead(eax);
+}
+
+/*
* Kick all "offline" CPUs out of mwait on kexec(). See comment in
* mwait_play_dead().
*/
@@ -1337,9 +1377,9 @@
play_dead_common();
tboot_shutdown(TB_SHUTDOWN_WFS);
- /* Below returns only on error. */
- cpuidle_play_dead();
- hlt_play_dead();
+ mwait_play_dead_cpuid_hint();
+ if (cpuidle_play_dead())
+ hlt_play_dead();
}
#else /* ... !CONFIG_HOTPLUG_CPU */
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH v2] Revert "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"
2025-05-29 13:40 ` [PATCH v2] Revert "x86/smp: Eliminate mwait_play_dead_cpuid_hint()" Rafael J. Wysocki
@ 2025-05-29 14:25 ` Dave Hansen
2025-05-29 15:39 ` Rafael J. Wysocki
0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2025-05-29 14:25 UTC (permalink / raw)
To: Rafael J. Wysocki, x86 Maintainers
Cc: LKML, Linux PM, Len Brown, Peter Zijlstra, Thomas Gleixner,
Borislav Petkov, Dave Hansen, Artem Bityutskiy, Gautham R. Shenoy,
Ingo Molnar, Todd Brandt
On 5/29/25 06:40, Rafael J. Wysocki wrote:
> This issue is hard to debug and potentially dangerous, so it needs to
> be addressed as soon as possible in a way that will work for 6.15.y,
> hence the revert.
Ugh.
I don't like the idea of reintroducing known buggy code. But the revert
does seem like the lesser of the two evils.
Seems like we should revert this for the time being and then try to fix
it properly again.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2] Revert "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"
2025-05-29 14:25 ` Dave Hansen
@ 2025-05-29 15:39 ` Rafael J. Wysocki
0 siblings, 0 replies; 19+ messages in thread
From: Rafael J. Wysocki @ 2025-05-29 15:39 UTC (permalink / raw)
To: Dave Hansen
Cc: Rafael J. Wysocki, x86 Maintainers, LKML, Linux PM, Len Brown,
Peter Zijlstra, Thomas Gleixner, Borislav Petkov, Dave Hansen,
Artem Bityutskiy, Gautham R. Shenoy, Ingo Molnar, Todd Brandt
On Thu, May 29, 2025 at 4:25 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/29/25 06:40, Rafael J. Wysocki wrote:
> > This issue is hard to debug and potentially dangerous, so it needs to
> > be addressed as soon as possible in a way that will work for 6.15.y,
> > hence the revert.
>
> Ugh.
>
> I don't like the idea of reintroducing known buggy code. But the revert
> does seem like the lesser of the two evils.
>
> Seems like we should revert this for the time being and then try to fix
> it properly again.
>
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Thanks!
Applied.
^ permalink raw reply [flat|nested] 19+ messages in thread