* S3 regression related to XSA-471 patches @ 2025-08-06 10:23 Marek Marczykowski-Górecki 2025-08-06 10:36 ` Jan Beulich 0 siblings, 1 reply; 8+ messages in thread From: Marek Marczykowski-Górecki @ 2025-08-06 10:23 UTC (permalink / raw) To: xen-devel; +Cc: Andrew Cooper [-- Attachment #1: Type: text/plain, Size: 2123 bytes --] Hi, We've got several reports that S3 reliability recently regressed. We identified it's definitely related to XSA-471 patches, and bisection points at "x86/idle: Remove broken MWAIT implementation". I don't have reliable reproduction steps, so I'm not 100% sure if it's really this patch, or maybe an earlier one - but it's definitely already broken at this point in the series. Most reports are about Xen 4.17 (as that's what stable Qubes OS version currently use), but I think I've seen somebody reporting the issue on 4.19 too (but I don't have clear evidence, especially if it's the same issue). The problem manifests in system freezing on S3 resume. Sometimes it manages to show the screenlocker password prompt, and sometimes one can interact with it for a second or two. But then it freezes, mouse stops moving etc (but no reboot). One time I managed to get pass the screenlocker and interact with dom0 for a few minutes before it frozen. Resuming domUs didn't happen (the qubes-specific script doing so resume hanged), and also no logs persisted on the disk from this case (on disk it looked like it never resumed). Generally it looked like some CPUs were stuck. It appears to be more likely to hit the issue if some domUs are active at the suspend/resume time. While Qubes OS does suspend (not just pause) them for the host S3 time, some activity before/after does appear to matter. My test case that has ~30-40% reproduction rate involves several firefox instances playing youtube videos. I've talked with Andrew about it a bit, with not much conclusions. Initial reports mentioned only MTL and RPL systems, so we focused on something related to weird topology. But just today I've got a report of the same happening on KBL too... Another observation (possibly invalidated by today's report...) is that all reports were about systems running Coreboot (but not only Dasharo flavor - at least one was Star Labs). Most reports are collected at https://github.com/QubesOS/qubes-issues/issues/10110 -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches 2025-08-06 10:23 S3 regression related to XSA-471 patches Marek Marczykowski-Górecki @ 2025-08-06 10:36 ` Jan Beulich 2025-08-06 10:46 ` Marek Marczykowski-Górecki 0 siblings, 1 reply; 8+ messages in thread From: Jan Beulich @ 2025-08-06 10:36 UTC (permalink / raw) To: Marek Marczykowski-Górecki; +Cc: Andrew Cooper, xen-devel On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote: > We've got several reports that S3 reliability recently regressed. We > identified it's definitely related to XSA-471 patches, and bisection > points at "x86/idle: Remove broken MWAIT implementation". I don't have > reliable reproduction steps, so I'm not 100% sure if it's really this > patch, or maybe an earlier one - but it's definitely already broken at > this point in the series. Most reports are about Xen 4.17 (as that's > what stable Qubes OS version currently use), but I think I've seen > somebody reporting the issue on 4.19 too (but I don't have clear > evidence, especially if it's the same issue). At the time we've been discussing the explicit raising of TIMER_SOFTIRQ in mwait_idle_with_hints() a lot. If it was now truly missing, that imo shouldn't cause problems only after resume, but then it may have covered for some omission during resume. As a far-fetched experiment, could you try putting that back (including the calculation of the "expires" local variable)? Jan ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches 2025-08-06 10:36 ` Jan Beulich @ 2025-08-06 10:46 ` Marek Marczykowski-Górecki 2025-08-07 17:29 ` Marek Marczykowski-Górecki 0 siblings, 1 reply; 8+ messages in thread From: Marek Marczykowski-Górecki @ 2025-08-06 10:46 UTC (permalink / raw) To: Jan Beulich; +Cc: Andrew Cooper, xen-devel [-- Attachment #1: Type: text/plain, Size: 1249 bytes --] On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote: > On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote: > > We've got several reports that S3 reliability recently regressed. We > > identified it's definitely related to XSA-471 patches, and bisection > > points at "x86/idle: Remove broken MWAIT implementation". I don't have > > reliable reproduction steps, so I'm not 100% sure if it's really this > > patch, or maybe an earlier one - but it's definitely already broken at > > this point in the series. Most reports are about Xen 4.17 (as that's > > what stable Qubes OS version currently use), but I think I've seen > > somebody reporting the issue on 4.19 too (but I don't have clear > > evidence, especially if it's the same issue). > > At the time we've been discussing the explicit raising of TIMER_SOFTIRQ > in mwait_idle_with_hints() a lot. If it was now truly missing, that imo > shouldn't cause problems only after resume, but then it may have covered > for some omission during resume. As a far-fetched experiment, could you > try putting that back (including the calculation of the "expires" local > variable)? Sure, I'll try. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches 2025-08-06 10:46 ` Marek Marczykowski-Górecki @ 2025-08-07 17:29 ` Marek Marczykowski-Górecki 2025-08-11 13:16 ` Andrew Cooper 0 siblings, 1 reply; 8+ messages in thread From: Marek Marczykowski-Górecki @ 2025-08-07 17:29 UTC (permalink / raw) To: Jan Beulich; +Cc: Andrew Cooper, xen-devel [-- Attachment #1: Type: text/plain, Size: 1494 bytes --] On Wed, Aug 06, 2025 at 12:46:36PM +0200, Marek Marczykowski-Górecki wrote: > On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote: > > On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote: > > > We've got several reports that S3 reliability recently regressed. We > > > identified it's definitely related to XSA-471 patches, and bisection > > > points at "x86/idle: Remove broken MWAIT implementation". I don't have > > > reliable reproduction steps, so I'm not 100% sure if it's really this > > > patch, or maybe an earlier one - but it's definitely already broken at > > > this point in the series. Most reports are about Xen 4.17 (as that's > > > what stable Qubes OS version currently use), but I think I've seen > > > somebody reporting the issue on 4.19 too (but I don't have clear > > > evidence, especially if it's the same issue). > > > > At the time we've been discussing the explicit raising of TIMER_SOFTIRQ > > in mwait_idle_with_hints() a lot. If it was now truly missing, that imo > > shouldn't cause problems only after resume, but then it may have covered > > for some omission during resume. As a far-fetched experiment, could you > > try putting that back (including the calculation of the "expires" local > > variable)? > > Sure, I'll try. It appears this fixes the issue, at least in ~10 attempts so far (usually I could reproduce the issue after 2-3 attempts). -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches 2025-08-07 17:29 ` Marek Marczykowski-Górecki @ 2025-08-11 13:16 ` Andrew Cooper 2025-08-13 2:53 ` Marek Marczykowski-Górecki 0 siblings, 1 reply; 8+ messages in thread From: Andrew Cooper @ 2025-08-11 13:16 UTC (permalink / raw) To: Marek Marczykowski-Górecki, Jan Beulich; +Cc: xen-devel On 07/08/2025 6:29 pm, Marek Marczykowski-Górecki wrote: > On Wed, Aug 06, 2025 at 12:46:36PM +0200, Marek Marczykowski-Górecki wrote: >> On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote: >>> On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote: >>>> We've got several reports that S3 reliability recently regressed. We >>>> identified it's definitely related to XSA-471 patches, and bisection >>>> points at "x86/idle: Remove broken MWAIT implementation". I don't have >>>> reliable reproduction steps, so I'm not 100% sure if it's really this >>>> patch, or maybe an earlier one - but it's definitely already broken at >>>> this point in the series. Most reports are about Xen 4.17 (as that's >>>> what stable Qubes OS version currently use), but I think I've seen >>>> somebody reporting the issue on 4.19 too (but I don't have clear >>>> evidence, especially if it's the same issue). >>> At the time we've been discussing the explicit raising of TIMER_SOFTIRQ >>> in mwait_idle_with_hints() a lot. If it was now truly missing, that imo >>> shouldn't cause problems only after resume, but then it may have covered >>> for some omission during resume. As a far-fetched experiment, could you >>> try putting that back (including the calculation of the "expires" local >>> variable)? >> Sure, I'll try. > It appears this fixes the issue, at least in ~10 attempts so far > (usually I could reproduce the issue after 2-3 attempts). > Can you show the exact code which seems to have made this stable? We discussed this in the x86 maintainers meeting, and our best guess is a timer that's not torn down or recreated properly on S3, but this is largely speculation. ~Andrew ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches 2025-08-11 13:16 ` Andrew Cooper @ 2025-08-13 2:53 ` Marek Marczykowski-Górecki 2025-08-13 7:26 ` Roger Pau Monné 0 siblings, 1 reply; 8+ messages in thread From: Marek Marczykowski-Górecki @ 2025-08-13 2:53 UTC (permalink / raw) To: Andrew Cooper, Jan Beulich; +Cc: xen-devel On August 11, 2025 3:16:46 PM GMT+02:00, Andrew Cooper <andrew.cooper3@citrix.com> wrote: >On 07/08/2025 6:29 pm, Marek Marczykowski-Górecki wrote: >> On Wed, Aug 06, 2025 at 12:46:36PM +0200, Marek Marczykowski-Górecki wrote: >>> On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote: >>>> On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote: >>>>> We've got several reports that S3 reliability recently regressed. We >>>>> identified it's definitely related to XSA-471 patches, and bisection >>>>> points at "x86/idle: Remove broken MWAIT implementation". I don't have >>>>> reliable reproduction steps, so I'm not 100% sure if it's really this >>>>> patch, or maybe an earlier one - but it's definitely already broken at >>>>> this point in the series. Most reports are about Xen 4.17 (as that's >>>>> what stable Qubes OS version currently use), but I think I've seen >>>>> somebody reporting the issue on 4.19 too (but I don't have clear >>>>> evidence, especially if it's the same issue). >>>> At the time we've been discussing the explicit raising of TIMER_SOFTIRQ >>>> in mwait_idle_with_hints() a lot. If it was now truly missing, that imo >>>> shouldn't cause problems only after resume, but then it may have covered >>>> for some omission during resume. As a far-fetched experiment, could you >>>> try putting that back (including the calculation of the "expires" local >>>> variable)? >>> Sure, I'll try. >> It appears this fixes the issue, at least in ~10 attempts so far >> (usually I could reproduce the issue after 2-3 attempts). >> > >Can you show the exact code which seems to have made this stable? This patch: https://github.com/marmarek/qubes-vmm-xen/blob/7c9e9e312948c772d9a68090109964121c1d16fe/0001-DEBUG-S3.patch > >We discussed this in the x86 maintainers meeting, and our best guess is >a timer that's not torn down or recreated properly on S3, but this is >largely speculation. > >~Andrew ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches 2025-08-13 2:53 ` Marek Marczykowski-Górecki @ 2025-08-13 7:26 ` Roger Pau Monné 2025-08-22 0:53 ` Marek Marczykowski-Górecki 0 siblings, 1 reply; 8+ messages in thread From: Roger Pau Monné @ 2025-08-13 7:26 UTC (permalink / raw) To: Marek Marczykowski-Górecki; +Cc: Andrew Cooper, Jan Beulich, xen-devel On Wed, Aug 13, 2025 at 04:53:53AM +0200, Marek Marczykowski-Górecki wrote: > > > On August 11, 2025 3:16:46 PM GMT+02:00, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > >On 07/08/2025 6:29 pm, Marek Marczykowski-Górecki wrote: > >> On Wed, Aug 06, 2025 at 12:46:36PM +0200, Marek Marczykowski-Górecki wrote: > >>> On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote: > >>>> On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote: > >>>>> We've got several reports that S3 reliability recently regressed. We > >>>>> identified it's definitely related to XSA-471 patches, and bisection > >>>>> points at "x86/idle: Remove broken MWAIT implementation". I don't have > >>>>> reliable reproduction steps, so I'm not 100% sure if it's really this > >>>>> patch, or maybe an earlier one - but it's definitely already broken at > >>>>> this point in the series. Most reports are about Xen 4.17 (as that's > >>>>> what stable Qubes OS version currently use), but I think I've seen > >>>>> somebody reporting the issue on 4.19 too (but I don't have clear > >>>>> evidence, especially if it's the same issue). > >>>> At the time we've been discussing the explicit raising of TIMER_SOFTIRQ > >>>> in mwait_idle_with_hints() a lot. If it was now truly missing, that imo > >>>> shouldn't cause problems only after resume, but then it may have covered > >>>> for some omission during resume. As a far-fetched experiment, could you > >>>> try putting that back (including the calculation of the "expires" local > >>>> variable)? > >>> Sure, I'll try. > >> It appears this fixes the issue, at least in ~10 attempts so far > >> (usually I could reproduce the issue after 2-3 attempts). > >> > > > >Can you show the exact code which seems to have made this stable? > > This patch: https://github.com/marmarek/qubes-vmm-xen/blob/7c9e9e312948c772d9a68090109964121c1d16fe/0001-DEBUG-S3.patch Hello, Can you test if the patch below in isolation makes any difference? Thanks, Roger. --- diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c index 2ac162c997fe..27d672ad5dbb 100644 --- a/xen/arch/x86/acpi/power.c +++ b/xen/arch/x86/acpi/power.c @@ -19,6 +19,7 @@ #include <xen/iommu.h> #include <xen/param.h> #include <xen/sched.h> +#include <xen/softirq.h> #include <xen/spinlock.h> #include <xen/watchdog.h> @@ -310,6 +311,7 @@ static int enter_state(u32 state) thaw_domains(); system_state = SYS_STATE_active; spin_unlock(&pm_lock); + raise_softirq(TIMER_SOFTIRQ); return error; } diff --git a/xen/arch/x86/apic.c b/xen/arch/x86/apic.c index 0fd8bdba7067..9d66db861b74 100644 --- a/xen/arch/x86/apic.c +++ b/xen/arch/x86/apic.c @@ -65,7 +65,6 @@ static struct { unsigned int apic_lvt0; unsigned int apic_lvt1; unsigned int apic_lvterr; - unsigned int apic_tmict; unsigned int apic_tdcr; unsigned int apic_thmr; } apic_pm_state; @@ -658,7 +657,6 @@ int lapic_suspend(void) apic_pm_state.apic_lvt0 = apic_read(APIC_LVT0); apic_pm_state.apic_lvt1 = apic_read(APIC_LVT1); apic_pm_state.apic_lvterr = apic_read(APIC_LVTERR); - apic_pm_state.apic_tmict = apic_read(APIC_TMICT); apic_pm_state.apic_tdcr = apic_read(APIC_TDCR); if (maxlvt >= 5) apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR); @@ -718,7 +716,7 @@ int lapic_resume(void) apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc); apic_write(APIC_LVTT, apic_pm_state.apic_lvtt); apic_write(APIC_TDCR, apic_pm_state.apic_tdcr); - apic_write(APIC_TMICT, apic_pm_state.apic_tmict); + apic_write(APIC_TMICT, 0); apic_write(APIC_ESR, 0); apic_read(APIC_ESR); apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr); ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: S3 regression related to XSA-471 patches 2025-08-13 7:26 ` Roger Pau Monné @ 2025-08-22 0:53 ` Marek Marczykowski-Górecki 0 siblings, 0 replies; 8+ messages in thread From: Marek Marczykowski-Górecki @ 2025-08-22 0:53 UTC (permalink / raw) To: Roger Pau Monné; +Cc: Andrew Cooper, Jan Beulich, xen-devel [-- Attachment #1: Type: text/plain, Size: 4200 bytes --] On Wed, Aug 13, 2025 at 09:26:09AM +0200, Roger Pau Monné wrote: > On Wed, Aug 13, 2025 at 04:53:53AM +0200, Marek Marczykowski-Górecki wrote: > > > > > > On August 11, 2025 3:16:46 PM GMT+02:00, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > > >On 07/08/2025 6:29 pm, Marek Marczykowski-Górecki wrote: > > >> On Wed, Aug 06, 2025 at 12:46:36PM +0200, Marek Marczykowski-Górecki wrote: > > >>> On Wed, Aug 06, 2025 at 12:36:56PM +0200, Jan Beulich wrote: > > >>>> On 06.08.2025 12:23, Marek Marczykowski-Górecki wrote: > > >>>>> We've got several reports that S3 reliability recently regressed. We > > >>>>> identified it's definitely related to XSA-471 patches, and bisection > > >>>>> points at "x86/idle: Remove broken MWAIT implementation". I don't have > > >>>>> reliable reproduction steps, so I'm not 100% sure if it's really this > > >>>>> patch, or maybe an earlier one - but it's definitely already broken at > > >>>>> this point in the series. Most reports are about Xen 4.17 (as that's > > >>>>> what stable Qubes OS version currently use), but I think I've seen > > >>>>> somebody reporting the issue on 4.19 too (but I don't have clear > > >>>>> evidence, especially if it's the same issue). > > >>>> At the time we've been discussing the explicit raising of TIMER_SOFTIRQ > > >>>> in mwait_idle_with_hints() a lot. If it was now truly missing, that imo > > >>>> shouldn't cause problems only after resume, but then it may have covered > > >>>> for some omission during resume. As a far-fetched experiment, could you > > >>>> try putting that back (including the calculation of the "expires" local > > >>>> variable)? > > >>> Sure, I'll try. > > >> It appears this fixes the issue, at least in ~10 attempts so far > > >> (usually I could reproduce the issue after 2-3 attempts). > > >> > > > > > >Can you show the exact code which seems to have made this stable? > > > > This patch: https://github.com/marmarek/qubes-vmm-xen/blob/7c9e9e312948c772d9a68090109964121c1d16fe/0001-DEBUG-S3.patch > > Hello, > > Can you test if the patch below in isolation makes any difference? Seems to help too. At least a test similar as before did not hit the issue anymore. > Thanks, Roger. > --- > diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c > index 2ac162c997fe..27d672ad5dbb 100644 > --- a/xen/arch/x86/acpi/power.c > +++ b/xen/arch/x86/acpi/power.c > @@ -19,6 +19,7 @@ > #include <xen/iommu.h> > #include <xen/param.h> > #include <xen/sched.h> > +#include <xen/softirq.h> > #include <xen/spinlock.h> > #include <xen/watchdog.h> > > @@ -310,6 +311,7 @@ static int enter_state(u32 state) > thaw_domains(); > system_state = SYS_STATE_active; > spin_unlock(&pm_lock); > + raise_softirq(TIMER_SOFTIRQ); > return error; > } > > diff --git a/xen/arch/x86/apic.c b/xen/arch/x86/apic.c > index 0fd8bdba7067..9d66db861b74 100644 > --- a/xen/arch/x86/apic.c > +++ b/xen/arch/x86/apic.c > @@ -65,7 +65,6 @@ static struct { > unsigned int apic_lvt0; > unsigned int apic_lvt1; > unsigned int apic_lvterr; > - unsigned int apic_tmict; > unsigned int apic_tdcr; > unsigned int apic_thmr; > } apic_pm_state; > @@ -658,7 +657,6 @@ int lapic_suspend(void) > apic_pm_state.apic_lvt0 = apic_read(APIC_LVT0); > apic_pm_state.apic_lvt1 = apic_read(APIC_LVT1); > apic_pm_state.apic_lvterr = apic_read(APIC_LVTERR); > - apic_pm_state.apic_tmict = apic_read(APIC_TMICT); > apic_pm_state.apic_tdcr = apic_read(APIC_TDCR); > if (maxlvt >= 5) > apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR); > @@ -718,7 +716,7 @@ int lapic_resume(void) > apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc); > apic_write(APIC_LVTT, apic_pm_state.apic_lvtt); > apic_write(APIC_TDCR, apic_pm_state.apic_tdcr); > - apic_write(APIC_TMICT, apic_pm_state.apic_tmict); > + apic_write(APIC_TMICT, 0); > apic_write(APIC_ESR, 0); > apic_read(APIC_ESR); > apic_write(APIC_LVTERR, apic_pm_state.apic_lvterr); > -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-08-22 0:54 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-08-06 10:23 S3 regression related to XSA-471 patches Marek Marczykowski-Górecki 2025-08-06 10:36 ` Jan Beulich 2025-08-06 10:46 ` Marek Marczykowski-Górecki 2025-08-07 17:29 ` Marek Marczykowski-Górecki 2025-08-11 13:16 ` Andrew Cooper 2025-08-13 2:53 ` Marek Marczykowski-Górecki 2025-08-13 7:26 ` Roger Pau Monné 2025-08-22 0:53 ` Marek Marczykowski-Górecki
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.