* [PATCH] x86/S3: restore MCE (APs) and add MTRR (BSP) init
@ 2026-03-04 13:39 Jan Beulich
2026-03-04 14:36 ` Marek Marczykowski
2026-03-23 11:16 ` Roger Pau Monné
0 siblings, 2 replies; 9+ messages in thread
From: Jan Beulich @ 2026-03-04 13:39 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Andrew Cooper, Roger Pau Monné, Marek Marczykowski
MCE init for APs was broken when CPU feature re-checking was added. MTRR
(re)init for the BSP looks to never have been there on the resume path.
Fixes: bb502a8ca592 ("x86: check feature flags after resume")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Sadly we need to go by CPU number (zero vs non-zero) here. See the call
site of recheck_cpu_features() in enter_state().
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -642,16 +642,21 @@ void identify_cpu(struct cpuinfo_x86 *c)
 		       smp_processor_id());
 	}
 
-	if (system_state == SYS_STATE_resume)
-		return;
+	if (system_state == SYS_STATE_resume) {
+		unsigned int cpu = smp_processor_id();
 
+		if (cpu)
+			mcheck_init(&cpu_data[cpu], false);
+		else /* Yes, the BSP needs to use the AP function here. */
+			mtrr_ap_init();
+	}
 	/*
 	 * On SMP, boot_cpu_data holds the common feature set between
 	 * all CPUs; so make sure that we indicate which features are
 	 * common between the CPUs.  The first time this routine gets
 	 * executed, c == &boot_cpu_data.
 	 */
-	if ( c != &boot_cpu_data ) {
+	else if (c != &boot_cpu_data) {
 		/* AND the already accumulated flags with these */
 		for ( i = 0 ; i < NCAPINTS ; i++ )
 			boot_cpu_data.x86_capability[i] &= c->x86_capability[i];
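
To make the control-flow change in the hunk easier to follow, here is a
minimal standalone C model of the tail of identify_cpu() before and after
the patch as posted. It is a sketch under stated assumptions, not Xen code:
enum sys_state, struct init_log, tail_before(), tail_after() and
check_tail_model() are hypothetical names, and the flags only record which
of mcheck_init(), mtrr_ap_init() and the boot_cpu_data feature accumulation
would have been reached.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for Xen's system_state; only two values matter here. */
enum sys_state { STATE_BOOT, STATE_RESUME };

/* Records which steps the tail of identify_cpu() would reach. */
struct init_log {
    bool mce_init;       /* stands in for mcheck_init(&cpu_data[cpu], false) */
    bool mtrr_init;      /* stands in for mtrr_ap_init() */
    bool features_anded; /* stands in for the boot_cpu_data accumulation */
};

/* Before the patch: on resume the function returns without doing anything. */
static struct init_log tail_before(enum sys_state state, unsigned int cpu,
                                   bool c_is_boot_cpu_data)
{
    struct init_log log = { false, false, false };

    (void)cpu; /* the old code never looked at the CPU number */
    if (state == STATE_RESUME)
        return log;
    if (!c_is_boot_cpu_data)
        log.features_anded = true;
    return log;
}

/* As posted: resume dispatches on the CPU number, zero being the BSP. */
static struct init_log tail_after(enum sys_state state, unsigned int cpu,
                                  bool c_is_boot_cpu_data)
{
    struct init_log log = { false, false, false };

    if (state == STATE_RESUME) {
        if (cpu)
            log.mce_init = true;
        else
            log.mtrr_init = true;
    } else if (!c_is_boot_cpu_data) {
        log.features_anded = true;
    }
    return log;
}

/* Self-check: the only behavioural difference is on the resume path. */
static void check_tail_model(void)
{
    assert(!tail_before(STATE_RESUME, 1, false).mce_init);
    assert(tail_after(STATE_RESUME, 1, false).mce_init);
    assert(tail_after(STATE_RESUME, 0, false).mtrr_init);
    assert(tail_after(STATE_BOOT, 1, false).features_anded);
}
```

Calling check_tail_model() from a test harness exercises the four cases;
outside of resume the two variants behave identically in this model.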
^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] x86/S3: restore MCE (APs) and add MTRR (BSP) init
  2026-03-04 13:39 [PATCH] x86/S3: restore MCE (APs) and add MTRR (BSP) init Jan Beulich
@ 2026-03-04 14:36 ` Marek Marczykowski
  2026-03-04 14:47   ` Jan Beulich
  2026-03-23 11:16 ` Roger Pau Monné
  1 sibling, 1 reply; 9+ messages in thread
From: Marek Marczykowski @ 2026-03-04 14:36 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Roger Pau Monné

[-- Attachment #1: Type: text/plain, Size: 974 bytes --]

On Wed, Mar 04, 2026 at 02:39:01PM +0100, Jan Beulich wrote:
> MCE init for APs was broken when CPU feature re-checking was added. MTRR
> (re)init for the BSP looks to never have been there on the resume path.
>
> Fixes: bb502a8ca592 ("x86: check feature flags after resume")
> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Sadly we need to go by CPU number (zero vs non-zero) here. See the call
> site of recheck_cpu_features() in enter_state().

With this patch, I now see the "Thermal monitoring enabled" on resume
also for AP.
And then, the "Temperature above threshold" + "Running in modulated
clock mode" for AP too. But, I don't see matching "Temperature/speed
normal" for any of them...

My simple performance test says it's okay for now, though. I'll see how
it looks in a few hours...

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH] x86/S3: restore MCE (APs) and add MTRR (BSP) init
  2026-03-04 14:36 ` Marek Marczykowski
@ 2026-03-04 14:47   ` Jan Beulich
  2026-03-04 15:00     ` Marek Marczykowski
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2026-03-04 14:47 UTC (permalink / raw)
To: Marek Marczykowski
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Roger Pau Monné

On 04.03.2026 15:36, Marek Marczykowski wrote:
> On Wed, Mar 04, 2026 at 02:39:01PM +0100, Jan Beulich wrote:
>> MCE init for APs was broken when CPU feature re-checking was added. MTRR
>> (re)init for the BSP looks to never have been there on the resume path.
>>
>> Fixes: bb502a8ca592 ("x86: check feature flags after resume")
>> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> Sadly we need to go by CPU number (zero vs non-zero) here. See the call
>> site of recheck_cpu_features() in enter_state().
>
> With this patch, I now see the "Thermal monitoring enabled" on resume
> also for AP.
> And then, the "Temperature above threshold" + "Running in modulated
> clock mode" for AP too. But, I don't see matching "Temperature/speed
> normal" for any of them...

Which would imply that for each CPU you see at most one such message after
resume. Can you confirm this? (Generally for every CPU they should be
alternating, but appear no more frequently than every 5 seconds. Albeit I
can't help the impression that it is possible for the current state to not
be reflected by the most recently seen message, for a potentially
indefinite period of time.)

> My simple performance test says it's okay for now, though. I'll see how
> it looks in a few hours...

I actually don't expect the change here to make a difference in that
regard. intel_thermal_interrupt() exists only for reporting purposes.

Jan

^ permalink raw reply	[flat|nested] 9+ messages in thread
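
The concern about a stale last message can be illustrated with a small
standalone C sketch of a rate-limited reporter. This is not the code of
intel_thermal_interrupt(); struct thermal_reporter and thermal_event() are
hypothetical, and the sketch assumes that events arriving inside the
5 second window are dropped rather than deferred. Under that assumption a
transition back to normal can go unlogged, leaving "above threshold" as the
most recent message until the next event arrives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* From the thread: at most one message per CPU every 5 seconds. */
#define REPORT_INTERVAL_S 5

/* Hypothetical per-CPU reporting state. */
struct thermal_reporter {
    bool have_reported;     /* has any message been printed yet? */
    bool last_reported_hot; /* what the most recent message claimed */
    long last_report_time;  /* time of that message, in seconds */
};

/*
 * Handle one thermal event at time 'now'. Returns true if a message would
 * be printed. Assumption: events inside the window are dropped, not deferred.
 */
static bool thermal_event(struct thermal_reporter *r, long now, bool hot)
{
    assert(r != NULL);

    if (r->have_reported && now - r->last_report_time < REPORT_INTERVAL_S)
        return false; /* suppressed: this transition is never logged */

    r->have_reported = true;
    r->last_reported_hot = hot;
    r->last_report_time = now;
    return true;
}
```

For example, a "hot" event at t=0 is logged, a "normal" event at t=2 is
suppressed, and if nothing further happens the log keeps claiming the CPU
is above threshold.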
* Re: [PATCH] x86/S3: restore MCE (APs) and add MTRR (BSP) init
  2026-03-04 14:47   ` Jan Beulich
@ 2026-03-04 15:00     ` Marek Marczykowski
  2026-03-23 11:21       ` Jan Beulich
  0 siblings, 1 reply; 9+ messages in thread
From: Marek Marczykowski @ 2026-03-04 15:00 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Roger Pau Monné

[-- Attachment #1: Type: text/plain, Size: 2251 bytes --]

On Wed, Mar 04, 2026 at 03:47:14PM +0100, Jan Beulich wrote:
> On 04.03.2026 15:36, Marek Marczykowski wrote:
> > On Wed, Mar 04, 2026 at 02:39:01PM +0100, Jan Beulich wrote:
> >> MCE init for APs was broken when CPU feature re-checking was added. MTRR
> >> (re)init for the BSP looks to never have been there on the resume path.
> >>
> >> Fixes: bb502a8ca592 ("x86: check feature flags after resume")
> >> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> Sadly we need to go by CPU number (zero vs non-zero) here. See the call
> >> site of recheck_cpu_features() in enter_state().
> >
> > With this patch, I now see the "Thermal monitoring enabled" on resume
> > also for AP.
> > And then, the "Temperature above threshold" + "Running in modulated
> > clock mode" for AP too. But, I don't see matching "Temperature/speed
> > normal" for any of them...
>
> Which would imply that for each CPU you see at most one such message after
> resume. Can you confirm this?

For the current test, yes. I got the messages for CPUs 16, 6, 18, 4, 2 -
in this order. Not for 0, 8-15 or 20-21. Not sure about CPU0, but for
others it kinda looks like I got it for P cores, but not E cores? But
I'm not sure how to reliably distinguish them - I base it on the holes
in numbering due to smt=off. Specifically I have online CPUs:
0,2,4,6,8-16,18,20-21 (yeah, weird ordering...).

> (Generally for every CPU they should be
> alternating, but appear no more frequently than every 5 seconds. Albeit I
> can't help the impression that it is possible for the current state to not
> be reflected by the most recently seen message, for a potentially
> indefinite period of time.)
>
> > My simple performance test says it's okay for now, though. I'll see how
> > it looks in a few hours...
>
> I actually don't expect the change here to make a difference in that
> regard. intel_thermal_interrupt() exists only for reporting purposes.

Yeah, it's too soon to say definitely, but just after resume the test said
a stable 6ms, and now (~30min later) it's at 12-14ms.

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread
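
The "holes in numbering" reasoning above can be written down as a short
standalone C sketch. It only restates the heuristic from the message and is
not a reliable way to tell P cores from E cores: looks_like_p_core() and
all_reported_look_like_p_cores() are hypothetical helpers, and the
assumption is that with smt=off a P core leaves its sibling's ID as a gap
in the online list. The two arrays are copied from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Online CPUs from the message, 0,2,4,6,8-16,18,20-21, kept sorted. */
static const unsigned int online_cpus[] = {
    0, 2, 4, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 20, 21
};

/* CPUs that logged a thermal message, in the order they were reported. */
static const unsigned int reported_cpus[] = { 16, 6, 18, 4, 2 };

static bool in_list(const unsigned int *list, size_t n, unsigned int cpu)
{
    for (size_t i = 0; i < n; i++)
        if (list[i] == cpu)
            return true;
    return false;
}

/*
 * Heuristic (assumption): an online CPU whose successor ID is offline, and
 * lies below the highest online ID, is guessed to be a P core whose SMT
 * sibling was left offline.
 */
static bool looks_like_p_core(unsigned int cpu)
{
    unsigned int max = online_cpus[ARRAY_LEN(online_cpus) - 1];

    assert(in_list(online_cpus, ARRAY_LEN(online_cpus), cpu));
    return cpu + 1 < max &&
           !in_list(online_cpus, ARRAY_LEN(online_cpus), cpu + 1);
}

/* True if every CPU that logged a message is a P-core candidate. */
static bool all_reported_look_like_p_cores(void)
{
    for (size_t i = 0; i < ARRAY_LEN(reported_cpus); i++)
        if (!looks_like_p_core(reported_cpus[i]))
            return false;
    return true;
}
```

Under this heuristic the candidates are CPUs 0, 2, 4, 6, 16 and 18, so all
five reporting CPUs are candidates and CPU0 is the only candidate that
stayed silent, which matches the uncertainty about CPU0 in the message.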
* Re: [PATCH] x86/S3: restore MCE (APs) and add MTRR (BSP) init
  2026-03-04 15:00     ` Marek Marczykowski
@ 2026-03-23 11:21       ` Jan Beulich
  2026-03-23 11:26         ` Marek Marczykowski
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2026-03-23 11:21 UTC (permalink / raw)
To: Marek Marczykowski
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Roger Pau Monné

On 04.03.2026 16:00, Marek Marczykowski wrote:
> On Wed, Mar 04, 2026 at 03:47:14PM +0100, Jan Beulich wrote:
>> On 04.03.2026 15:36, Marek Marczykowski wrote:
>>> On Wed, Mar 04, 2026 at 02:39:01PM +0100, Jan Beulich wrote:
>>>> MCE init for APs was broken when CPU feature re-checking was added. MTRR
>>>> (re)init for the BSP looks to never have been there on the resume path.
>>>>
>>>> Fixes: bb502a8ca592 ("x86: check feature flags after resume")
>>>> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> Sadly we need to go by CPU number (zero vs non-zero) here. See the call
>>>> site of recheck_cpu_features() in enter_state().
>>>
>>> With this patch, I now see the "Thermal monitoring enabled" on resume
>>> also for AP.
>>> And then, the "Temperature above threshold" + "Running in modulated
>>> clock mode" for AP too. But, I don't see matching "Temperature/speed
>>> normal" for any of them...
>>
>> Which would imply that for each CPU you see at most one such message after
>> resume. Can you confirm this?
>
> For the current test, yes. I got the messages for CPUs 16, 6, 18, 4, 2 -
> in this order. Not for 0, 8-15 or 20-21. Not sure about CPU0, but for
> others it kinda looks like I got it for P cores, but not E cores? But
> I'm not sure how to reliably distinguish them - I base it on the holes
> in numbering due to smt=off. Specifically I have online CPUs:
> 0,2,4,6,8-16,18,20-21 (yeah, weird ordering...).

I wonder, btw, if this is good enough to translate into a Tested-by: for
this patch. Thoughts?

Jan

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH] x86/S3: restore MCE (APs) and add MTRR (BSP) init
  2026-03-23 11:21       ` Jan Beulich
@ 2026-03-23 11:26         ` Marek Marczykowski
  0 siblings, 0 replies; 9+ messages in thread
From: Marek Marczykowski @ 2026-03-23 11:26 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Roger Pau Monné

[-- Attachment #1: Type: text/plain, Size: 1896 bytes --]

On Mon, Mar 23, 2026 at 12:21:46PM +0100, Jan Beulich wrote:
> On 04.03.2026 16:00, Marek Marczykowski wrote:
> > On Wed, Mar 04, 2026 at 03:47:14PM +0100, Jan Beulich wrote:
> >> On 04.03.2026 15:36, Marek Marczykowski wrote:
> >>> On Wed, Mar 04, 2026 at 02:39:01PM +0100, Jan Beulich wrote:
> >>>> MCE init for APs was broken when CPU feature re-checking was added. MTRR
> >>>> (re)init for the BSP looks to never have been there on the resume path.
> >>>>
> >>>> Fixes: bb502a8ca592 ("x86: check feature flags after resume")
> >>>> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>> ---
> >>>> Sadly we need to go by CPU number (zero vs non-zero) here. See the call
> >>>> site of recheck_cpu_features() in enter_state().
> >>>
> >>> With this patch, I now see the "Thermal monitoring enabled" on resume
> >>> also for AP.
> >>> And then, the "Temperature above threshold" + "Running in modulated
> >>> clock mode" for AP too. But, I don't see matching "Temperature/speed
> >>> normal" for any of them...
> >>
> >> Which would imply that for each CPU you see at most one such message after
> >> resume. Can you confirm this?
> >
> > For the current test, yes. I got the messages for CPUs 16, 6, 18, 4, 2 -
> > in this order. Not for 0, 8-15 or 20-21. Not sure about CPU0, but for
> > others it kinda looks like I got it for P cores, but not E cores? But
> > I'm not sure how to reliably distinguish them - I base it on the holes
> > in numbering due to smt=off. Specifically I have online CPUs:
> > 0,2,4,6,8-16,18,20-21 (yeah, weird ordering...).
>
> I wonder, btw, if this is good enough to translate into a Tested-by: for
> this patch. Thoughts?

I think so, it clearly fixes the reporting issue.

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH] x86/S3: restore MCE (APs) and add MTRR (BSP) init
  2026-03-04 13:39 [PATCH] x86/S3: restore MCE (APs) and add MTRR (BSP) init Jan Beulich
  2026-03-04 14:36 ` Marek Marczykowski
@ 2026-03-23 11:16 ` Roger Pau Monné
  2026-03-23 11:38   ` Jan Beulich
  1 sibling, 1 reply; 9+ messages in thread
From: Roger Pau Monné @ 2026-03-23 11:16 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Marek Marczykowski

On Wed, Mar 04, 2026 at 02:39:01PM +0100, Jan Beulich wrote:
> MCE init for APs was broken when CPU feature re-checking was added. MTRR
> (re)init for the BSP looks to never have been there on the resume path.

I'm not sure the statement about MTRR init is correct, AFAICT
mtrr_aps_sync_end() will also re-init the MTRRs on the BSP, and hence
the added mtrr_ap_init() seems to duplicate what's already done in
mtrr_aps_sync_end().

> Fixes: bb502a8ca592 ("x86: check feature flags after resume")
> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Sadly we need to go by CPU number (zero vs non-zero) here. See the call
> site of recheck_cpu_features() in enter_state().
>
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -642,16 +642,21 @@ void identify_cpu(struct cpuinfo_x86 *c)
>  		       smp_processor_id());
>  	}
>
> -	if (system_state == SYS_STATE_resume)
> -		return;
> +	if (system_state == SYS_STATE_resume) {
> +		unsigned int cpu = smp_processor_id();
>
> +		if (cpu)
> +			mcheck_init(&cpu_data[cpu], false);
> +		else /* Yes, the BSP needs to use the AP function here. */
> +			mtrr_ap_init();

For symmetry with the BSP path, is it really needed to init MCE so
early for the BSP by calling it directly in enter_state(), or could it
also be done here?

Thanks, Roger.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH] x86/S3: restore MCE (APs) and add MTRR (BSP) init
  2026-03-23 11:16 ` Roger Pau Monné
@ 2026-03-23 11:38   ` Jan Beulich
  2026-03-23 11:43     ` Roger Pau Monné
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Beulich @ 2026-03-23 11:38 UTC (permalink / raw)
To: Roger Pau Monné
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Marek Marczykowski

On 23.03.2026 12:16, Roger Pau Monné wrote:
> On Wed, Mar 04, 2026 at 02:39:01PM +0100, Jan Beulich wrote:
>> MCE init for APs was broken when CPU feature re-checking was added. MTRR
>> (re)init for the BSP looks to never have been there on the resume path.
>
> I'm not sure the statement about MTRR init is correct, AFAICT
> mtrr_aps_sync_end() will also re-init the MTRRs on the BSP, and hence
> the added mtrr_ap_init() seems to duplicate what's already done in
> mtrr_aps_sync_end().

Hmm, right you are. Had I been asked, I would have confirmed that I checked
the code past the "enable_cpu" label, but clearly I must not have, or I was
blind at that time. Let me strip that out.

>> --- a/xen/arch/x86/cpu/common.c
>> +++ b/xen/arch/x86/cpu/common.c
>> @@ -642,16 +642,21 @@ void identify_cpu(struct cpuinfo_x86 *c)
>>  		       smp_processor_id());
>>  	}
>>
>> -	if (system_state == SYS_STATE_resume)
>> -		return;
>> +	if (system_state == SYS_STATE_resume) {
>> +		unsigned int cpu = smp_processor_id();
>>
>> +		if (cpu)
>> +			mcheck_init(&cpu_data[cpu], false);
>> +		else /* Yes, the BSP needs to use the AP function here. */
>> +			mtrr_ap_init();
>
> For symmetry with the BSP path, is it really needed to init MCE so
> early for the BSP by calling it directly in enter_state(), or could it
> also be done here?

To be honest, I would put the question the other way around: Is it really
okay to do it this late for APs (during boot also for the BSP [1])? Iirc
an #MC prior to mcheck_init() is going to be deadly to the system. Moving
it earlier may, however, be a more intrusive change.

Jan

[1] Us crashing (rebooting) during boot is perhaps less of an issue than
us doing so during S3 resume: In that latter case it may mean data loss
(or maybe even data corruption).

Jan

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH] x86/S3: restore MCE (APs) and add MTRR (BSP) init
  2026-03-23 11:38   ` Jan Beulich
@ 2026-03-23 11:43     ` Roger Pau Monné
  0 siblings, 0 replies; 9+ messages in thread
From: Roger Pau Monné @ 2026-03-23 11:43 UTC (permalink / raw)
To: Jan Beulich
Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Marek Marczykowski

On Mon, Mar 23, 2026 at 12:38:48PM +0100, Jan Beulich wrote:
> On 23.03.2026 12:16, Roger Pau Monné wrote:
> > On Wed, Mar 04, 2026 at 02:39:01PM +0100, Jan Beulich wrote:
> >> MCE init for APs was broken when CPU feature re-checking was added. MTRR
> >> (re)init for the BSP looks to never have been there on the resume path.
> >
> > I'm not sure the statement about MTRR init is correct, AFAICT
> > mtrr_aps_sync_end() will also re-init the MTRRs on the BSP, and hence
> > the added mtrr_ap_init() seems to duplicate what's already done in
> > mtrr_aps_sync_end().
>
> Hmm, right you are. Had I been asked, I would have confirmed that I checked
> the code past the "enable_cpu" label, but clearly I must not have, or I was
> blind at that time. Let me strip that out.
>
> >> --- a/xen/arch/x86/cpu/common.c
> >> +++ b/xen/arch/x86/cpu/common.c
> >> @@ -642,16 +642,21 @@ void identify_cpu(struct cpuinfo_x86 *c)
> >>  		       smp_processor_id());
> >>  	}
> >>
> >> -	if (system_state == SYS_STATE_resume)
> >> -		return;
> >> +	if (system_state == SYS_STATE_resume) {
> >> +		unsigned int cpu = smp_processor_id();
> >>
> >> +		if (cpu)
> >> +			mcheck_init(&cpu_data[cpu], false);
> >> +		else /* Yes, the BSP needs to use the AP function here. */
> >> +			mtrr_ap_init();
> >
> > For symmetry with the BSP path, is it really needed to init MCE so
> > early for the BSP by calling it directly in enter_state(), or could it
> > also be done here?
>
> To be honest, I would put the question the other way around: Is it really
> okay to do it this late for APs (during boot also for the BSP [1])? Iirc
> an #MC prior to mcheck_init() is going to be deadly to the system. Moving
> it earlier may, however, be a more intrusive change.

We might want to at least add a note documenting this asymmetric
initialization between the BSP and the APs? I would be perfectly happy
with moving this earlier, and it needs to be consistent between the APs
and the BSP.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 9+ messages in thread