* S3 regression related to XSA-471 patches
@ 2025-08-06 10:23 Marek Marczykowski-Górecki
2025-08-06 10:36 ` Jan Beulich
0 siblings, 1 reply; 8+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-08-06 10:23 UTC (permalink / raw)
To: xen-devel; +Cc: Andrew Cooper
[-- Attachment #1: Type: text/plain, Size: 2123 bytes --]
Hi,
We've got several reports that S3 reliability recently regressed. We
identified it's definitely related to XSA-471 patches, and bisection
points at "x86/idle: Remove broken MWAIT implementation". I don't have
reliable reproduction steps, so I'm not 100% sure if it's really this
patch, or maybe an earlier one - but it's definitely already broken at
this point in the series. Most reports are about Xen 4.17 (as that's
what stable Qubes OS version currently use), but I think I've seen
somebody reporting the issue on 4.19 too (but I don't have clear
evidence, especially if it's the same issue).
The problem manifests in system freezing on S3 resume. Sometimes it
manages to show the screenlocker password prompt, and sometimes one can
interact with it for a second or two. But then it freezes, mouse stops
moving etc (but no reboot).
One time I managed to get pass the screenlocker and interact with dom0
for a few minutes before it frozen. Resuming domUs didn't happen (the
qubes-specific script doing so resume hanged), and also no logs
persisted on the disk from this case (on disk it looked like it never
resumed). Generally it looked like some CPUs were stuck.
It appears to be more likely to hit the issue if some domUs are active
at the suspend/resume time. While Qubes OS does suspend (not just pause)
them for the host S3 time, some activity before/after does appear to
matter. My test case that has ~30-40% reproduction rate involves several
firefox instances playing youtube videos.
I've talked with Andrew about it a bit, with not much conclusions.
Initial reports mentioned only MTL and RPL systems, so we focused on
something related to weird topology. But just today I've got a report
of the same happening on KBL too...
Another observation (possibly invalidated by today's report...) is that
all reports were about systems running Coreboot (but not only Dasharo
flavor - at least one was Star Labs).
Most reports are collected at https://github.com/QubesOS/qubes-issues/issues/10110
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches
2025-08-06 10:23 S3 regression related to XSA-471 patches Marek Marczykowski-Górecki
@ 2025-08-06 10:36 ` Jan Beulich
2025-08-06 10:46 ` Marek Marczykowski-Górecki
0 siblings, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2025-08-06 10:36 UTC (permalink / raw)
To: Marek Marczykowski-Górecki; +Cc: Andrew Cooper, xen-devel
On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote:
> We've got several reports that S3 reliability recently regressed. We
> identified it's definitely related to XSA-471 patches, and bisection
> points at "x86/idle: Remove broken MWAIT implementation". I don't have
> reliable reproduction steps, so I'm not 100% sure if it's really this
> patch, or maybe an earlier one - but it's definitely already broken at
> this point in the series. Most reports are about Xen 4.17 (as that's
> what stable Qubes OS version currently use), but I think I've seen
> somebody reporting the issue on 4.19 too (but I don't have clear
> evidence, especially if it's the same issue).
At the time we've been discussing the explicit raising of TIMER_SOFTIRQ
in mwait_idle_with_hints() a lot. If it was now truly missing, that imo
shouldn't cause problems only after resume, but then it may have covered
for some omission during resume. As a far-fetched experiment, could you
try putting that back (including the calculation of the "expires" local
variable)?
Jan
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches
2025-08-06 10:36 ` Jan Beulich
@ 2025-08-06 10:46 ` Marek Marczykowski-Górecki
2025-08-07 17:29 ` Marek Marczykowski-Górecki
0 siblings, 1 reply; 8+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-08-06 10:46 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, xen-devel
[-- Attachment #1: Type: text/plain, Size: 1249 bytes --]
On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote:
> On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote:
> > We've got several reports that S3 reliability recently regressed. We
> > identified it's definitely related to XSA-471 patches, and bisection
> > points at "x86/idle: Remove broken MWAIT implementation". I don't have
> > reliable reproduction steps, so I'm not 100% sure if it's really this
> > patch, or maybe an earlier one - but it's definitely already broken at
> > this point in the series. Most reports are about Xen 4.17 (as that's
> > what stable Qubes OS version currently use), but I think I've seen
> > somebody reporting the issue on 4.19 too (but I don't have clear
> > evidence, especially if it's the same issue).
>
> At the time we've been discussing the explicit raising of TIMER_SOFTIRQ
> in mwait_idle_with_hints() a lot. If it was now truly missing, that imo
> shouldn't cause problems only after resume, but then it may have covered
> for some omission during resume. As a far-fetched experiment, could you
> try putting that back (including the calculation of the "expires" local
> variable)?
Sure, I'll try.
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches
2025-08-06 10:46 ` Marek Marczykowski-Górecki
@ 2025-08-07 17:29 ` Marek Marczykowski-Górecki
2025-08-11 13:16 ` Andrew Cooper
0 siblings, 1 reply; 8+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-08-07 17:29 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, xen-devel
[-- Attachment #1: Type: text/plain, Size: 1494 bytes --]
On Wed, Aug 06, 2025 at 12:46:36PM +0200, Marek Marczykowski-Górecki wrote:
> On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote:
> > On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote:
> > > We've got several reports that S3 reliability recently regressed. We
> > > identified it's definitely related to XSA-471 patches, and bisection
> > > points at "x86/idle: Remove broken MWAIT implementation". I don't have
> > > reliable reproduction steps, so I'm not 100% sure if it's really this
> > > patch, or maybe an earlier one - but it's definitely already broken at
> > > this point in the series. Most reports are about Xen 4.17 (as that's
> > > what stable Qubes OS version currently use), but I think I've seen
> > > somebody reporting the issue on 4.19 too (but I don't have clear
> > > evidence, especially if it's the same issue).
> >
> > At the time we've been discussing the explicit raising of TIMER_SOFTIRQ
> > in mwait_idle_with_hints() a lot. If it was now truly missing, that imo
> > shouldn't cause problems only after resume, but then it may have covered
> > for some omission during resume. As a far-fetched experiment, could you
> > try putting that back (including the calculation of the "expires" local
> > variable)?
>
> Sure, I'll try.
It appears this fixes the issue, at least in ~10 attempts so far
(usually I could reproduce the issue after 2-3 attempts).
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches
2025-08-07 17:29 ` Marek Marczykowski-Górecki
@ 2025-08-11 13:16 ` Andrew Cooper
2025-08-13 2:53 ` Marek Marczykowski-Górecki
0 siblings, 1 reply; 8+ messages in thread
From: Andrew Cooper @ 2025-08-11 13:16 UTC (permalink / raw)
To: Marek Marczykowski-Górecki, Jan Beulich; +Cc: xen-devel
On 07/08/2025 6:29 pm, Marek Marczykowski-Górecki wrote:
> On Wed, Aug 06, 2025 at 12:46:36PM +0200, Marek Marczykowski-Górecki wrote:
>> On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote:
>>> On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote:
>>>> We've got several reports that S3 reliability recently regressed. We
>>>> identified it's definitely related to XSA-471 patches, and bisection
>>>> points at "x86/idle: Remove broken MWAIT implementation". I don't have
>>>> reliable reproduction steps, so I'm not 100% sure if it's really this
>>>> patch, or maybe an earlier one - but it's definitely already broken at
>>>> this point in the series. Most reports are about Xen 4.17 (as that's
>>>> what stable Qubes OS version currently use), but I think I've seen
>>>> somebody reporting the issue on 4.19 too (but I don't have clear
>>>> evidence, especially if it's the same issue).
>>> At the time we've been discussing the explicit raising of TIMER_SOFTIRQ
>>> in mwait_idle_with_hints() a lot. If it was now truly missing, that imo
>>> shouldn't cause problems only after resume, but then it may have covered
>>> for some omission during resume. As a far-fetched experiment, could you
>>> try putting that back (including the calculation of the "expires" local
>>> variable)?
>> Sure, I'll try.
> It appears this fixes the issue, at least in ~10 attempts so far
> (usually I could reproduce the issue after 2-3 attempts).
>
Can you show the exact code which seems to have made this stable?
We discussed this in the x86 maintainers meeting, and our best guess is
a timer that's not torn down or recreated properly on S3, but this is
largely speculation.
~Andrew
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches
2025-08-11 13:16 ` Andrew Cooper
@ 2025-08-13 2:53 ` Marek Marczykowski-Górecki
2025-08-13 7:26 ` Roger Pau Monné
0 siblings, 1 reply; 8+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-08-13 2:53 UTC (permalink / raw)
To: Andrew Cooper, Jan Beulich; +Cc: xen-devel
On August 11, 2025 3:16:46 PM GMT+02:00, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>On 07/08/2025 6:29 pm, Marek Marczykowski-Górecki wrote:
>> On Wed, Aug 06, 2025 at 12:46:36PM +0200, Marek Marczykowski-Górecki wrote:
>>> On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote:
>>>> On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote:
>>>>> We've got several reports that S3 reliability recently regressed. We
>>>>> identified it's definitely related to XSA-471 patches, and bisection
>>>>> points at "x86/idle: Remove broken MWAIT implementation". I don't have
>>>>> reliable reproduction steps, so I'm not 100% sure if it's really this
>>>>> patch, or maybe an earlier one - but it's definitely already broken at
>>>>> this point in the series. Most reports are about Xen 4.17 (as that's
>>>>> what stable Qubes OS version currently use), but I think I've seen
>>>>> somebody reporting the issue on 4.19 too (but I don't have clear
>>>>> evidence, especially if it's the same issue).
>>>> At the time we've been discussing the explicit raising of TIMER_SOFTIRQ
>>>> in mwait_idle_with_hints() a lot. If it was now truly missing, that imo
>>>> shouldn't cause problems only after resume, but then it may have covered
>>>> for some omission during resume. As a far-fetched experiment, could you
>>>> try putting that back (including the calculation of the "expires" local
>>>> variable)?
>>> Sure, I'll try.
>> It appears this fixes the issue, at least in ~10 attempts so far
>> (usually I could reproduce the issue after 2-3 attempts).
>>
>
>Can you show the exact code which seems to have made this stable?
This patch: https://github.com/marmarek/qubes-vmm-xen/blob/7c9e9e312948c772d9a68090109964121c1d16fe/0001-DEBUG-S3.patch
>
>We discussed this in the x86 maintainers meeting, and our best guess is
>a timer that's not torn down or recreated properly on S3, but this is
>largely speculation.
>
>~Andrew
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches
2025-08-13 2:53 ` Marek Marczykowski-Górecki
@ 2025-08-13 7:26 ` Roger Pau Monné
2025-08-22 0:53 ` Marek Marczykowski-Górecki
0 siblings, 1 reply; 8+ messages in thread
From: Roger Pau Monné @ 2025-08-13 7:26 UTC (permalink / raw)
To: Marek Marczykowski-Górecki; +Cc: Andrew Cooper, Jan Beulich, xen-devel
On Wed, Aug 13, 2025 at 04:53:53AM +0200, Marek Marczykowski-Górecki wrote:
>
>
> On August 11, 2025 3:16:46 PM GMT+02:00, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >On 07/08/2025 6:29 pm, Marek Marczykowski-Górecki wrote:
> >> On Wed, Aug 06, 2025 at 12:46:36PM +0200, Marek Marczykowski-Górecki wrote:
> >>> On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote:
> >>>> On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote:
> >>>>> We've got several reports that S3 reliability recently regressed. We
> >>>>> identified it's definitely related to XSA-471 patches, and bisection
> >>>>> points at "x86/idle: Remove broken MWAIT implementation". I don't have
> >>>>> reliable reproduction steps, so I'm not 100% sure if it's really this
> >>>>> patch, or maybe an earlier one - but it's definitely already broken at
> >>>>> this point in the series. Most reports are about Xen 4.17 (as that's
> >>>>> what stable Qubes OS version currently use), but I think I've seen
> >>>>> somebody reporting the issue on 4.19 too (but I don't have clear
> >>>>> evidence, especially if it's the same issue).
> >>>> At the time we've been discussing the explicit raising of TIMER_SOFTIRQ
> >>>> in mwait_idle_with_hints() a lot. If it was now truly missing, that imo
> >>>> shouldn't cause problems only after resume, but then it may have covered
> >>>> for some omission during resume. As a far-fetched experiment, could you
> >>>> try putting that back (including the calculation of the "expires" local
> >>>> variable)?
> >>> Sure, I'll try.
> >> It appears this fixes the issue, at least in ~10 attempts so far
> >> (usually I could reproduce the issue after 2-3 attempts).
> >>
> >
> >Can you show the exact code which seems to have made this stable?
>
> This patch: https://github.com/marmarek/qubes-vmm-xen/blob/7c9e9e312948c772d9a68090109964121c1d16fe/0001-DEBUG-S3.patch
Hello,
Can you test if the patch below in isolation makes any difference?
Thanks, Roger.
---
diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
index 2ac162c997fe..27d672ad5dbb 100644
--- a/xen/arch/x86/acpi/power.c
+++ b/xen/arch/x86/acpi/power.c
@@ -19,6 +19,7 @@
#include <xen/iommu.h>
#include <xen/param.h>
#include <xen/sched.h>
+#include <xen/softirq.h>
#include <xen/spinlock.h>
#include <xen/watchdog.h>
@@ -310,6 +311,7 @@ static int enter_state(u32 state)
thaw_domains();
system_state = SYS_STATE_active;
spin_unlock(&pm_lock);
+ raise_softirq(TIMER_SOFTIRQ);
return error;
}
diff --git a/xen/arch/x86/apic.c b/xen/arch/x86/apic.c
index 0fd8bdba7067..9d66db861b74 100644
--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -65,7 +65,6 @@ static struct {
unsigned int apic_lvt0;
unsigned int apic_lvt1;
unsigned int apic_lvterr;
- unsigned int apic_tmict;
unsigned int apic_tdcr;
unsigned int apic_thmr;
} apic_pm_state;
@@ -658,7 +657,6 @@ int lapic_suspend(void)
apic_pm_state.apic_lvt0 = apic_read(APIC_LVT0);
apic_pm_state.apic_lvt1 = apic_read(APIC_LVT1);
apic_pm_state.apic_lvterr = apic_read(APIC_LVTERR);
- apic_pm_state.apic_tmict = apic_read(APIC_TMICT);
apic_pm_state.apic_tdcr = apic_read(APIC_TDCR);
if (maxlvt >= 5)
apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR);
@@ -718,7 +716,7 @@ int lapic_resume(void)
apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc);
apic_write(APIC_LVTT, apic_pm_state.apic_lvtt);
apic_write(APIC_TDCR, apic_pm_state.apic_tdcr);
- apic_write(APIC_TMICT, apic_pm_state.apic_tmict);
+ apic_write(APIC_TMICT, 0);
apic_write(APIC_ESR, 0);
apic_read(APIC_ESR);
apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr);
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches
2025-08-13 7:26 ` Roger Pau Monné
@ 2025-08-22 0:53 ` Marek Marczykowski-Górecki
0 siblings, 0 replies; 8+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-08-22 0:53 UTC (permalink / raw)
To: Roger Pau Monné; +Cc: Andrew Cooper, Jan Beulich, xen-devel
[-- Attachment #1: Type: text/plain, Size: 4200 bytes --]
On Wed, Aug 13, 2025 at 09:26:09AM +0200, Roger Pau Monné wrote:
> On Wed, Aug 13, 2025 at 04:53:53AM +0200, Marek Marczykowski-Górecki wrote:
> >
> >
> > On August 11, 2025 3:16:46 PM GMT+02:00, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > >On 07/08/2025 6:29 pm, Marek Marczykowski-Górecki wrote:
> > >> On Wed, Aug 06, 2025 at 12:46:36PM +0200, Marek Marczykowski-Górecki wrote:
> > >>> On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote:
> > >>>> On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote:
> > >>>>> We've got several reports that S3 reliability recently regressed. We
> > >>>>> identified it's definitely related to XSA-471 patches, and bisection
> > >>>>> points at "x86/idle: Remove broken MWAIT implementation". I don't have
> > >>>>> reliable reproduction steps, so I'm not 100% sure if it's really this
> > >>>>> patch, or maybe an earlier one - but it's definitely already broken at
> > >>>>> this point in the series. Most reports are about Xen 4.17 (as that's
> > >>>>> what stable Qubes OS version currently use), but I think I've seen
> > >>>>> somebody reporting the issue on 4.19 too (but I don't have clear
> > >>>>> evidence, especially if it's the same issue).
> > >>>> At the time we've been discussing the explicit raising of TIMER_SOFTIRQ
> > >>>> in mwait_idle_with_hints() a lot. If it was now truly missing, that imo
> > >>>> shouldn't cause problems only after resume, but then it may have covered
> > >>>> for some omission during resume. As a far-fetched experiment, could you
> > >>>> try putting that back (including the calculation of the "expires" local
> > >>>> variable)?
> > >>> Sure, I'll try.
> > >> It appears this fixes the issue, at least in ~10 attempts so far
> > >> (usually I could reproduce the issue after 2-3 attempts).
> > >>
> > >
> > >Can you show the exact code which seems to have made this stable?
> >
> > This patch: https://github.com/marmarek/qubes-vmm-xen/blob/7c9e9e312948c772d9a68090109964121c1d16fe/0001-DEBUG-S3.patch
>
> Hello,
>
> Can you test if the patch below in isolation makes any difference?
Seems to help too. At least a test similar as before did not hit the
issue anymore.
> Thanks, Roger.
> ---
> diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
> index 2ac162c997fe..27d672ad5dbb 100644
> --- a/xen/arch/x86/acpi/power.c
> +++ b/xen/arch/x86/acpi/power.c
> @@ -19,6 +19,7 @@
> #include <xen/iommu.h>
> #include <xen/param.h>
> #include <xen/sched.h>
> +#include <xen/softirq.h>
> #include <xen/spinlock.h>
> #include <xen/watchdog.h>
>
> @@ -310,6 +311,7 @@ static int enter_state(u32 state)
> thaw_domains();
> system_state = SYS_STATE_active;
> spin_unlock(&pm_lock);
> + raise_softirq(TIMER_SOFTIRQ);
> return error;
> }
>
> diff --git a/xen/arch/x86/apic.c b/xen/arch/x86/apic.c
> index 0fd8bdba7067..9d66db861b74 100644
> --- a/xen/arch/x86/apic.c
> +++ b/xen/arch/x86/apic.c
> @@ -65,7 +65,6 @@ static struct {
> unsigned int apic_lvt0;
> unsigned int apic_lvt1;
> unsigned int apic_lvterr;
> - unsigned int apic_tmict;
> unsigned int apic_tdcr;
> unsigned int apic_thmr;
> } apic_pm_state;
> @@ -658,7 +657,6 @@ int lapic_suspend(void)
> apic_pm_state.apic_lvt0 = apic_read(APIC_LVT0);
> apic_pm_state.apic_lvt1 = apic_read(APIC_LVT1);
> apic_pm_state.apic_lvterr = apic_read(APIC_LVTERR);
> - apic_pm_state.apic_tmict = apic_read(APIC_TMICT);
> apic_pm_state.apic_tdcr = apic_read(APIC_TDCR);
> if (maxlvt >= 5)
> apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR);
> @@ -718,7 +716,7 @@ int lapic_resume(void)
> apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc);
> apic_write(APIC_LVTT, apic_pm_state.apic_lvtt);
> apic_write(APIC_TDCR, apic_pm_state.apic_tdcr);
> - apic_write(APIC_TMICT, apic_pm_state.apic_tmict);
> + apic_write(APIC_TMICT, 0);
> apic_write(APIC_ESR, 0);
> apic_read(APIC_ESR);
> apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr);
>
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-08-22 0:54 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-06 10:23 S3 regression related to XSA-471 patches Marek Marczykowski-Górecki
2025-08-06 10:36 ` Jan Beulich
2025-08-06 10:46 ` Marek Marczykowski-Górecki
2025-08-07 17:29 ` Marek Marczykowski-Górecki
2025-08-11 13:16 ` Andrew Cooper
2025-08-13 2:53 ` Marek Marczykowski-Górecki
2025-08-13 7:26 ` Roger Pau Monné
2025-08-22 0:53 ` Marek Marczykowski-Górecki
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.