From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
To: Dave Hansen <dave.hansen@intel.com>,
Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>,
x86@kernel.org
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
rafael.j.wysocki@intel.com, len.brown@intel.com,
dave.hansen@linux.intel.com
Subject: Re: [PATCH v2 2/3] x86/smp: Allow forcing the mwait hint for play dead loop
Date: Wed, 06 Nov 2024 10:14:30 +0200 [thread overview]
Message-ID: <9272406507bca4eb284cd75759d8d9479b937da9.camel@linux.intel.com> (raw)
In-Reply-To: <000fd68e-2b24-4eb3-b2d7-e4856b403212@intel.com>
Hi Dave, community,
I'll cover general questions, and let Patryk cover the specific core-related
comments.
On Wed, 2024-10-30 at 12:32 -0700, Dave Hansen wrote:
> On 10/30/24 02:58, Artem Bityutskiy wrote:
> > On Tue, 2024-10-29 at 11:30 -0700, Dave Hansen wrote:
> > 1. Could we at least set up a few rules here? Like, say what the hints
> > are, what values can they have?
> >
> > The hints are 8-bit values, lower 4 bits define "sub-state", higher 4 bits
> > define the state.
> >
> > The state value (higher 4 bits) correspond to the state enumerated by CPUID
> > leaf
> > 5 (Value 0 is C0, value 1 is C1, etc). The sub-state value is an opaque
> > number.
> >
> > The hint is provided to the mwait instruction via EAX.
>
> OK, so can you distill that down to something succinct and get it in a
> comment above the new function, please?
Yes, that was my idea to write a long e-mail to cover your and other reviewer's
questions, and let Patryk turn this into nice comments at the right places.
> > 2. Where do they come from?
> >
> > Hardware C-states are defined by the specific platform (e.g., C1, C1E, CC6,
> > PC6). Then they are "mapped" to the SDM C-states (C0, C1, C2, etc). The
> > specific
> > platform defines the hint values.
> >
> > Intel typically provides the hint values in the EDS (External Design
> > Specification) document. It is typically non-public.
> >
> > Intel also discloses the hint values for open-source projects like Linux,
> > and
> > then Intel engineers submit them to the intel_idle driver.
> >
> > Some of the hints may also be found via ACPI _CST table.
>
> What about the mwait_play_dead() loop that calculates the hint? Doesn't
> that derive the hint from CPUID?
The mwait_play_dead() hint calculation algorithm is the root of the problem. It
calculates a sub-optimal hint for some platforms, such as Sierra Forest Xeon.
So at high level, mwait_play_dead() calculates the hint, saves it in the 'eax'
variable, and then just eeps issuing 'mwait(eax, ecx=0)' with that hint.
So the mwait hint calculation code is between lines [1303] and [1313] (see links
below).
The code basically finds the highest C-state index ('highest_cstate' variable)
and count of sub-states within that C-state ('highest_subcstate' variable), and
forms the mwait hint out of them.
You should also check leaf CPUID Leaf 5 SDM doc for more clarity.
Example #1.
==========
Sapphire Rapids Xeon, where the algorithm calculates the correct hint.
'highest_cstate' ends up being 2 (corresponds C3 in CPUID leaf 5 terms, note
that C0 is ignored), and highest_subcstate ends up being 1 (1 sub-state), so the
mwait hint ends up being 0x20. The math is at line [1311]:
eax = (highest_cstate << MWAIT_SUBSTATE_SIZE) | (highest_subcstate - 1) =
(2 << 4) | (1 - 1) = 0x20
This ends up being the deepest available C-state hint (C6). No problem here.
Code lines.
1303:
https://elixir.bootlin.com/linux/v6.9/source/arch/x86/kernel/smpboot.c#L1303
1313:
https://elixir.bootlin.com/linux/v6.9/source/arch/x86/kernel/smpboot.c#L1313
1311:
https://elixir.bootlin.com/linux/v6.9/source/arch/x86/kernel/smpboot.c#L1311
Exemple #2.
==========
Sierra Forest Xeon, where the algorithm calculates the incorrect hint.
Here is CPUID leaf 5.
$ cpuid -1 -l 5
CPU:
MONITOR/MWAIT (5):
smallest monitor-line size (bytes) = 0x40 (64)
largest monitor-line size (bytes) = 0x40 (64)
enum of Monitor-MWAIT exts supported = true
supports intrs as break-event for MWAIT = true
number of C0 sub C-states using MWAIT = 0x0 (0)
number of C1 sub C-states using MWAIT = 0x2 (2)
number of C2 sub C-states using MWAIT = 0x0 (0)
number of C3 sub C-states using MWAIT = 0x2 (2)
number of C4 sub C-states using MWAIT = 0x0 (0)
number of C5 sub C-states using MWAIT = 0x0 (0)
number of C6 sub C-states using MWAIT = 0x0 (0)
number of C7 sub C-states using MWAIT = 0x0 (0)
Notice that CPUID leaf 5 advertises two C3 sub-states, and with this input the
the algorithm ends up with mwait hint 0x21. This mwait hint is not documented as
supported on Sierra Forest.
The supported Sierra Forest mwait hints are defined in 'intel_idle.c':
https://github.com/torvalds/linux/blob/2e1b3cc9d7f790145a80cb705b168f05dab65df2/drivers/idle/intel_idle.c#L1345
They are 0x0, 0x01, 0x22, 0x23. And 0x23 is the deepest C-state (C6SP).
But offline code ends up issuing 0x21. The platform maps it to a "shallow
version" of core C6, which is not documented and is not supposed to be used.
At any rate, it ends up in a situation when offlining a single CPU will prevent
the platform from ever reaching PC6 state. In PC6 - happens when all CPUs
requested C6SP - all cores are powered off and there are power savings in uncore
(caches, interconnects, memory controller, etc).
Now, as I mentioned, in parallel I am working with Sierra Forest firmware team
to change platform behavior and map hint 0x21 to 0x23 in firmware. As a
workaround.
>
> > 5. What's the behavior when it is not set?
> >
> > The offline code will fall-back to the generic non-architectural algorithm,
> > which provides correct results for all server platforms I dealt with since
> > 2017.
> > It should provide the correct hint for most client platforms, as far as I am
> > aware.
> >
> > Sierra Forest Xeon is the first platform where the generic algorithm
> > provides a
> > sub-optimal value 0x21. It is not fatal, just sub-optimal.
>
> What is the non-architectural algorithm? Which Linux code are you
> referring to?
See above.
The non-architectural part of it is basically this.
The algorithm assumes that in the sub-state part of the mwait hint (4 least
significant bits) always starts with 0 and then next sub-state is always 1, and
there are no gaps. This is not documented in Intel SDM. This happens to work on
majority of Intel platforms, but this is not architectural.
Thank you,
Artem.
next prev parent reply other threads:[~2024-11-06 8:14 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-29 10:15 [PATCH v2 0/3] SRF: Fix offline CPU preventing pc6 entry Patryk Wlazlyn
2024-10-29 10:15 ` [PATCH v2 1/3] x86/smp: Move mwait hint computation out of mwait_play_dead Patryk Wlazlyn
2024-10-29 10:15 ` [PATCH v2 2/3] x86/smp: Allow forcing the mwait hint for play dead loop Patryk Wlazlyn
2024-10-29 18:30 ` Dave Hansen
2024-10-30 9:58 ` Artem Bityutskiy
2024-10-30 19:32 ` Dave Hansen
2024-10-30 19:53 ` Rafael J. Wysocki
2024-10-30 20:11 ` Dave Hansen
2024-10-30 20:14 ` Rafael J. Wysocki
2024-11-06 8:14 ` Artem Bityutskiy [this message]
2024-11-06 14:46 ` Dave Hansen
2024-10-30 13:33 ` Patryk Wlazlyn
2024-10-30 22:55 ` Dave Hansen
2024-10-29 10:15 ` [PATCH v2 3/3] intel_idle: Identify the deepest cstate for SRF Patryk Wlazlyn
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9272406507bca4eb284cd75759d8d9479b937da9.camel@linux.intel.com \
--to=artem.bityutskiy@linux.intel.com \
--cc=dave.hansen@intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=len.brown@intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pm@vger.kernel.org \
--cc=patryk.wlazlyn@linux.intel.com \
--cc=rafael.j.wysocki@intel.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox