All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] ci: add yet another HW runner
@ 2025-03-14  3:06 Marek Marczykowski-Górecki
  2025-03-14 21:19 ` Stefano Stabellini
  0 siblings, 1 reply; 17+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-03-14  3:06 UTC (permalink / raw)
  To: xen-devel
  Cc: Marek Marczykowski-Górecki, Jan Beulich, Andrew Cooper,
	Doug Goldstein, Stefano Stabellini

This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.

This one has working S3, so add a test for it here.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

The suspend test added here currently fails on staging[1], but passes on
staging-4.19[2]. So the regression wants fixing before committing this
patch.
For this to work, the runner needs to be added to relevant repositories
(especially hardware/xen one). Somebody with appropriate access need to
go to Settings->CI/CD->Runners and click "enable for this project" on
hal9001 runner.

[1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
[2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
---
 automation/gitlab-ci/test.yaml | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 9530e36e9aaa..6b8e1b830e3d 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -155,6 +155,16 @@
   tags:
     - qubes-hw2
 
+.zen2-x86-64:
+  # it's really similar to the above
+  extends: .adl-x86-64
+  variables:
+    PCIDEV: "01:00.0"
+    PCIDEV_INTR: "MSI-X"
+    CONSOLE_OPTS: "console=com1 com1=115200,8n1,pci,msi,04:00.0"
+  tags:
+    - qubes-hw1
+
 .zen3p-x86-64:
   # it's really similar to the above
   extends: .adl-x86-64
@@ -301,6 +311,22 @@ adl-tools-tests-pvh-x86-64-gcc-debug:
     - *x86-64-test-needs
     - alpine-3.18-gcc-debug
 
+zen2-smoke-x86-64-gcc-debug:
+  extends: .zen2-x86-64
+  script:
+    - ./automation/scripts/qubes-x86-64.sh 2>&1 | tee ${LOGFILE}
+  needs:
+    - *x86-64-test-needs
+    - alpine-3.18-gcc-debug
+
+zen2-suspend-x86-64-gcc-debug:
+  extends: .zen2-x86-64
+  script:
+    - ./automation/scripts/qubes-x86-64.sh s3 2>&1 | tee ${LOGFILE}
+  needs:
+    - *x86-64-test-needs
+    - alpine-3.18-gcc-debug
+
 zen3p-smoke-x86-64-gcc-debug:
   extends: .zen3p-x86-64
   script:
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] ci: add yet another HW runner
  2025-03-14  3:06 [PATCH] ci: add yet another HW runner Marek Marczykowski-Górecki
@ 2025-03-14 21:19 ` Stefano Stabellini
  2025-03-14 22:23   ` S3 regression on AMD in 4.20 (was: Re: [PATCH] ci: add yet another HW runner) Marek Marczykowski-Górecki
  2025-03-31 13:16   ` [PATCH] ci: add yet another HW runner Marek Marczykowski-Górecki
  0 siblings, 2 replies; 17+ messages in thread
From: Stefano Stabellini @ 2025-03-14 21:19 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Doug Goldstein,
	Stefano Stabellini, Roger Pau Monné

[-- Attachment #1: Type: text/plain, Size: 2474 bytes --]

On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
> 
> This one has working S3, so add a test for it here.
> 
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> 
> The suspend test added here currently fails on staging[1], but passes on
> staging-4.19[2]. So the regression wants fixing before committing this
> patch.

We could commit the patch now without the s3 test.

I don't know what the x86 maintainers think about fixing the suspend
bug, but one idea would be to run a bisection between 4.20 and 4.19.


> For this to work, the runner needs to be added to relevant repositories
> (especially hardware/xen one). Somebody with appropriate access need to
> go to Settings->CI/CD->Runners and click "enable for this project" on
> hal9001 runner.

I did that now


> [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
> [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
> ---
>  automation/gitlab-ci/test.yaml | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
> index 9530e36e9aaa..6b8e1b830e3d 100644
> --- a/automation/gitlab-ci/test.yaml
> +++ b/automation/gitlab-ci/test.yaml
> @@ -155,6 +155,16 @@
>    tags:
>      - qubes-hw2
>  
> +.zen2-x86-64:
> +  # it's really similar to the above
> +  extends: .adl-x86-64
> +  variables:
> +    PCIDEV: "01:00.0"
> +    PCIDEV_INTR: "MSI-X"
> +    CONSOLE_OPTS: "console=com1 com1=115200,8n1,pci,msi,04:00.0"
> +  tags:
> +    - qubes-hw1
> +
>  .zen3p-x86-64:
>    # it's really similar to the above
>    extends: .adl-x86-64
> @@ -301,6 +311,22 @@ adl-tools-tests-pvh-x86-64-gcc-debug:
>      - *x86-64-test-needs
>      - alpine-3.18-gcc-debug
>  
> +zen2-smoke-x86-64-gcc-debug:
> +  extends: .zen2-x86-64
> +  script:
> +    - ./automation/scripts/qubes-x86-64.sh 2>&1 | tee ${LOGFILE}
> +  needs:
> +    - *x86-64-test-needs
> +    - alpine-3.18-gcc-debug
> +
> +zen2-suspend-x86-64-gcc-debug:
> +  extends: .zen2-x86-64
> +  script:
> +    - ./automation/scripts/qubes-x86-64.sh s3 2>&1 | tee ${LOGFILE}
> +  needs:
> +    - *x86-64-test-needs
> +    - alpine-3.18-gcc-debug
> +
>  zen3p-smoke-x86-64-gcc-debug:
>    extends: .zen3p-x86-64
>    script:
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* S3 regression on AMD in 4.20 (was: Re: [PATCH] ci: add yet another HW runner)
  2025-03-14 21:19 ` Stefano Stabellini
@ 2025-03-14 22:23   ` Marek Marczykowski-Górecki
  2025-03-14 23:53     ` Marek Marczykowski-Górecki
  2025-03-31 13:16   ` [PATCH] ci: add yet another HW runner Marek Marczykowski-Górecki
  1 sibling, 1 reply; 17+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-03-14 22:23 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Doug Goldstein,
	Roger Pau Monné, Oleksii Kurochko, Mykola Kvach,
	Mykyta Poturai

[-- Attachment #1: Type: text/plain, Size: 2413 bytes --]

On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
> > This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
> > 
> > This one has working S3, so add a test for it here.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > ---
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > 
> > The suspend test added here currently fails on staging[1], but passes on
> > staging-4.19[2]. So the regression wants fixing before committing this
> > patch.
> >
> > [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
> > [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
> 
> We could commit the patch now without the s3 test.
> 
> I don't know what the x86 maintainers think about fixing the suspend
> bug, but one idea would be to run a bisection between 4.20 and 4.19.

I'm on it already, but it's annoying. Lets convert this thread to
discussion about the issue:

So, I bisected it between staging-4.19 and master. The breakage is
somewhere between (inclusive):
eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
and
47990ecef286 x86/boot: Improve MBI2 structure check

But, the first one breaks booting on this system and it remains broken
until the second commit (or its parent) - at which point S3 is already
broken. So, there is a range of 71 commits that may be responsible...

But then, based on a matrix chat and Jan's observation I've tried
reverting f75780d26b2f "xen: move per-cpu area management into common
code" just on top of 47990ecef286, and that fixed suspend.
Applying "xen/percpu: don't initialize percpu on resume" on top of
47990ecef286 fixes suspend too.
But applying it on top of master
(91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
but the failure mode is different than without the patch - system resets
on S3 resume, with no crash message on the serial console (even with
sync_console), instead of hanging.
And one more data point: reverting f75780d26b2f on top of master is the
same as applying "xen/percpu: don't initialize percpu on resume" on
master - system reset on S3 resume.
So, it looks like there are more issues...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: S3 regression on AMD in 4.20 (was: Re: [PATCH] ci: add yet another HW runner)
  2025-03-14 22:23   ` S3 regression on AMD in 4.20 (was: Re: [PATCH] ci: add yet another HW runner) Marek Marczykowski-Górecki
@ 2025-03-14 23:53     ` Marek Marczykowski-Górecki
  2025-03-15  0:02       ` S3 regression on AMD in 4.20 Andrew Cooper
  0 siblings, 1 reply; 17+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-03-14 23:53 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Doug Goldstein,
	Roger Pau Monné, Oleksii Kurochko, Mykola Kvach,
	Mykyta Poturai

[-- Attachment #1: Type: text/plain, Size: 2869 bytes --]

On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek Marczykowski-Górecki wrote:
> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
> > On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
> > > This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
> > > 
> > > This one has working S3, so add a test for it here.
> > > 
> > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > ---
> > > Cc: Jan Beulich <jbeulich@suse.com>
> > > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > > 
> > > The suspend test added here currently fails on staging[1], but passes on
> > > staging-4.19[2]. So the regression wants fixing before committing this
> > > patch.
> > >
> > > [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
> > > [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
> > 
> > We could commit the patch now without the s3 test.
> > 
> > I don't know what the x86 maintainers think about fixing the suspend
> > bug, but one idea would be to run a bisection between 4.20 and 4.19.
> 
> I'm on it already, but it's annoying. Lets convert this thread to
> discussion about the issue:
> 
> So, I bisected it between staging-4.19 and master. The breakage is
> somewhere between (inclusive):
> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
> and
> 47990ecef286 x86/boot: Improve MBI2 structure check
> 
> But, the first one breaks booting on this system and it remains broken
> until the second commit (or its parent) - at which point S3 is already
> broken. So, there is a range of 71 commits that may be responsible...
> 
> But then, based on a matrix chat and Jan's observation I've tried
> reverting f75780d26b2f "xen: move per-cpu area management into common
> code" just on top of 47990ecef286, and that fixed suspend.
> Applying "xen/percpu: don't initialize percpu on resume" on top of
> 47990ecef286 fixes suspend too.
> But applying it on top of master
> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
> but the failure mode is different than without the patch - system resets
> on S3 resume, with no crash message on the serial console (even with
> sync_console), instead of hanging.
> And one more data point: reverting f75780d26b2f on top of master is the
> same as applying "xen/percpu: don't initialize percpu on resume" on
> master - system reset on S3 resume.
> So, it looks like there are more issues...

Another bisection round and I have the second culprit:

    8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed

With master+"xen/percpu: don't initialize percpu on resume"+revert of
8e60d47cf011 suspend works again on this AMD system.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: S3 regression on AMD in 4.20
  2025-03-14 23:53     ` Marek Marczykowski-Górecki
@ 2025-03-15  0:02       ` Andrew Cooper
  2025-03-17 15:56         ` Roger Pau Monné
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Cooper @ 2025-03-15  0:02 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Stefano Stabellini
  Cc: xen-devel, Jan Beulich, Doug Goldstein, Roger Pau Monné,
	Oleksii Kurochko, Mykola Kvach, Mykyta Poturai

On 14/03/2025 11:53 pm, Marek Marczykowski-Górecki wrote:
> On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek Marczykowski-Górecki wrote:
>> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
>>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
>>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
>>>>
>>>> This one has working S3, so add a test for it here.
>>>>
>>>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>>>> ---
>>>> Cc: Jan Beulich <jbeulich@suse.com>
>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>
>>>> The suspend test added here currently fails on staging[1], but passes on
>>>> staging-4.19[2]. So the regression wants fixing before committing this
>>>> patch.
>>>>
>>>> [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
>>>> [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
>>> We could commit the patch now without the s3 test.
>>>
>>> I don't know what the x86 maintainers think about fixing the suspend
>>> bug, but one idea would be to run a bisection between 4.20 and 4.19.
>> I'm on it already, but it's annoying. Lets convert this thread to
>> discussion about the issue:
>>
>> So, I bisected it between staging-4.19 and master. The breakage is
>> somewhere between (inclusive):
>> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
>> and
>> 47990ecef286 x86/boot: Improve MBI2 structure check
>>
>> But, the first one breaks booting on this system and it remains broken
>> until the second commit (or its parent) - at which point S3 is already
>> broken. So, there is a range of 71 commits that may be responsible...
>>
>> But then, based on a matrix chat and Jan's observation I've tried
>> reverting f75780d26b2f "xen: move per-cpu area management into common
>> code" just on top of 47990ecef286, and that fixed suspend.
>> Applying "xen/percpu: don't initialize percpu on resume" on top of
>> 47990ecef286 fixes suspend too.
>> But applying it on top of master
>> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
>> but the failure mode is different than without the patch - system resets
>> on S3 resume, with no crash message on the serial console (even with
>> sync_console), instead of hanging.
>> And one more data point: reverting f75780d26b2f on top of master is the
>> same as applying "xen/percpu: don't initialize percpu on resume" on
>> master - system reset on S3 resume.
>> So, it looks like there are more issues...
> Another bisection round and I have the second culprit:
>
>     8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed
>
> With master+"xen/percpu: don't initialize percpu on resume"+revert of
> 8e60d47cf011 suspend works again on this AMD system.

That's not surprising in the slightest.

Caching hardware values in Xen isn't safe across S3, which QubesOS has
found time and time again, and for which we still have outstanding bugs.

S3 turns most of the system off.  RAM gets preserved, but devices and
plenty of internal registers don't.

~Andrew


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: S3 regression on AMD in 4.20
  2025-03-15  0:02       ` S3 regression on AMD in 4.20 Andrew Cooper
@ 2025-03-17 15:56         ` Roger Pau Monné
  2025-03-17 16:11           ` Jan Beulich
  2025-03-17 16:35           ` Marek Marczykowski-Górecki
  0 siblings, 2 replies; 17+ messages in thread
From: Roger Pau Monné @ 2025-03-17 15:56 UTC (permalink / raw)
  To: Andrew Cooper, Marek Marczykowski-Górecki
  Cc: Stefano Stabellini, xen-devel, Jan Beulich, Doug Goldstein,
	Oleksii Kurochko, Mykola Kvach, Mykyta Poturai

On Sat, Mar 15, 2025 at 12:02:50AM +0000, Andrew Cooper wrote:
> On 14/03/2025 11:53 pm, Marek Marczykowski-Górecki wrote:
> > On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek Marczykowski-Górecki wrote:
> >> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
> >>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
> >>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
> >>>>
> >>>> This one has working S3, so add a test for it here.
> >>>>
> >>>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> >>>> ---
> >>>> Cc: Jan Beulich <jbeulich@suse.com>
> >>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> >>>>
> >>>> The suspend test added here currently fails on staging[1], but passes on
> >>>> staging-4.19[2]. So the regression wants fixing before committing this
> >>>> patch.
> >>>>
> >>>> [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
> >>>> [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
> >>> We could commit the patch now without the s3 test.
> >>>
> >>> I don't know what the x86 maintainers think about fixing the suspend
> >>> bug, but one idea would be to run a bisection between 4.20 and 4.19.
> >> I'm on it already, but it's annoying. Lets convert this thread to
> >> discussion about the issue:
> >>
> >> So, I bisected it between staging-4.19 and master. The breakage is
> >> somewhere between (inclusive):
> >> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
> >> and
> >> 47990ecef286 x86/boot: Improve MBI2 structure check
> >>
> >> But, the first one breaks booting on this system and it remains broken
> >> until the second commit (or its parent) - at which point S3 is already
> >> broken. So, there is a range of 71 commits that may be responsible...
> >>
> >> But then, based on a matrix chat and Jan's observation I've tried
> >> reverting f75780d26b2f "xen: move per-cpu area management into common
> >> code" just on top of 47990ecef286, and that fixed suspend.
> >> Applying "xen/percpu: don't initialize percpu on resume" on top of
> >> 47990ecef286 fixes suspend too.
> >> But applying it on top of master
> >> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
> >> but the failure mode is different than without the patch - system resets
> >> on S3 resume, with no crash message on the serial console (even with
> >> sync_console), instead of hanging.
> >> And one more data point: reverting f75780d26b2f on top of master is the
> >> same as applying "xen/percpu: don't initialize percpu on resume" on
> >> master - system reset on S3 resume.
> >> So, it looks like there are more issues...
> > Another bisection round and I have the second culprit:
> >
> >     8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed
> >
> > With master+"xen/percpu: don't initialize percpu on resume"+revert of
> > 8e60d47cf011 suspend works again on this AMD system.
> 
> That's not surprising in the slightest.
> 
> Caching hardware values in Xen isn't safe across S3, which QubesOS has
> found time and time again, and for which we still have outstanding bugs.
> 
> S3 turns most of the system off.  RAM gets preserved, but devices and
> plenty of internal registers don't.

I think I've spotted the issue.  enable_iommu() called on resume
(ab)uses set_msi_affinity() to force an MSI register write, as it's
previous behavior was to unconditionally propagate the values.  With
my change it would no longer perform such writes on resume.

I think the patch below should help.

I wonder if we should unconditionally propagate the write from
__setup_msi_irq(), as it's also unlikely to make any difference to
skip that write, and would further keep the previous behavior.

Thanks, Roger.
---
commit 1d9bfd0d45f6b547b19f0d2f752fc3bd10103971
Author: Roger Pau Monne <roger.pau@citrix.com>
Date:   Mon Mar 17 15:40:11 2025 +0100

    x86/msi: always propagate MSI writes when not in active system mode
    
    Relax the limitation on MSI register writes, and only apply it when the
    system is in active state.  For example AMD IOMMU drivers rely on using
    set_msi_affinity() to force an MSI register write on resume from
    suspension.
    
    The original patch intention was to reduce the number of MSI register
    writes when the system is in active state.  Leave the other states to
    always perform the writes, as it's safer given the existing code, and it's
    expected to not make a difference performance wise.
    
    Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
    Fixes: ('8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed')
    Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 163ccf874720..8bb3bb18af61 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -189,6 +189,15 @@ static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg,
 {
     entry->msg = *msg;
 
+    if ( unlikely(system_state != SYS_STATE_active) )
+        /*
+         * Always propagate writes when not in the 'active' state.  The
+         * optimization to avoid the MSI address and data registers write is
+         * only relevant for runtime state, and drivers on resume (at least)
+         * rely on set_msi_affinity() to update the hardware state.
+         */
+        force = true;
+
     if ( iommu_intremap != iommu_intremap_off )
     {
         int rc;


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: S3 regression on AMD in 4.20
  2025-03-17 15:56         ` Roger Pau Monné
@ 2025-03-17 16:11           ` Jan Beulich
  2025-03-17 17:38             ` Roger Pau Monné
  2025-03-17 16:35           ` Marek Marczykowski-Górecki
  1 sibling, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2025-03-17 16:11 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, xen-devel, Doug Goldstein, Oleksii Kurochko,
	Mykola Kvach, Mykyta Poturai, Andrew Cooper,
	Marek Marczykowski-Górecki

On 17.03.2025 16:56, Roger Pau Monné wrote:
> On Sat, Mar 15, 2025 at 12:02:50AM +0000, Andrew Cooper wrote:
>> On 14/03/2025 11:53 pm, Marek Marczykowski-Górecki wrote:
>>> On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek Marczykowski-Górecki wrote:
>>>> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
>>>>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
>>>>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
>>>>>>
>>>>>> This one has working S3, so add a test for it here.
>>>>>>
>>>>>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>>>>>> ---
>>>>>> Cc: Jan Beulich <jbeulich@suse.com>
>>>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>>>
>>>>>> The suspend test added here currently fails on staging[1], but passes on
>>>>>> staging-4.19[2]. So the regression wants fixing before committing this
>>>>>> patch.
>>>>>>
>>>>>> [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
>>>>>> [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
>>>>> We could commit the patch now without the s3 test.
>>>>>
>>>>> I don't know what the x86 maintainers think about fixing the suspend
>>>>> bug, but one idea would be to run a bisection between 4.20 and 4.19.
>>>> I'm on it already, but it's annoying. Lets convert this thread to
>>>> discussion about the issue:
>>>>
>>>> So, I bisected it between staging-4.19 and master. The breakage is
>>>> somewhere between (inclusive):
>>>> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
>>>> and
>>>> 47990ecef286 x86/boot: Improve MBI2 structure check
>>>>
>>>> But, the first one breaks booting on this system and it remains broken
>>>> until the second commit (or its parent) - at which point S3 is already
>>>> broken. So, there is a range of 71 commits that may be responsible...
>>>>
>>>> But then, based on a matrix chat and Jan's observation I've tried
>>>> reverting f75780d26b2f "xen: move per-cpu area management into common
>>>> code" just on top of 47990ecef286, and that fixed suspend.
>>>> Applying "xen/percpu: don't initialize percpu on resume" on top of
>>>> 47990ecef286 fixes suspend too.
>>>> But applying it on top of master
>>>> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
>>>> but the failure mode is different than without the patch - system resets
>>>> on S3 resume, with no crash message on the serial console (even with
>>>> sync_console), instead of hanging.
>>>> And one more data point: reverting f75780d26b2f on top of master is the
>>>> same as applying "xen/percpu: don't initialize percpu on resume" on
>>>> master - system reset on S3 resume.
>>>> So, it looks like there are more issues...
>>> Another bisection round and I have the second culprit:
>>>
>>>     8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed
>>>
>>> With master+"xen/percpu: don't initialize percpu on resume"+revert of
>>> 8e60d47cf011 suspend works again on this AMD system.
>>
>> That's not surprising in the slightest.
>>
>> Caching hardware values in Xen isn't safe across S3, which QubesOS has
>> found time and time again, and for which we still have outstanding bugs.
>>
>> S3 turns most of the system off.  RAM gets preserved, but devices and
>> plenty of internal registers don't.
> 
> I think I've spotted the issue.  enable_iommu() called on resume
> (ab)uses set_msi_affinity() to force an MSI register write, as it's
> previous behavior was to unconditionally propagate the values.  With
> my change it would no longer perform such writes on resume.
> 
> I think the patch below should help.
> 
> I wonder if we should unconditionally propagate the write from
> __setup_msi_irq(), as it's also unlikely to make any difference to
> skip that write, and would further keep the previous behavior.

That might be better. In fact I wonder whether it wouldn't better be
callers of write_msi_msg() to use ...

> ---
> commit 1d9bfd0d45f6b547b19f0d2f752fc3bd10103971
> Author: Roger Pau Monne <roger.pau@citrix.com>
> Date:   Mon Mar 17 15:40:11 2025 +0100
> 
>     x86/msi: always propagate MSI writes when not in active system mode
>     
>     Relax the limitation on MSI register writes, and only apply it when the
>     system is in active state.  For example AMD IOMMU drivers rely on using
>     set_msi_affinity() to force an MSI register write on resume from
>     suspension.
>     
>     The original patch intention was to reduce the number of MSI register
>     writes when the system is in active state.  Leave the other states to
>     always perform the writes, as it's safer given the existing code, and it's
>     expected to not make a difference performance wise.
>     
>     Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>     Fixes: ('8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed')
>     Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -189,6 +189,15 @@ static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg,
>  {
>      entry->msg = *msg;
>  
> +    if ( unlikely(system_state != SYS_STATE_active) )

... this condition as needed. Conceivably there might be cases where even
during resume writes aren't always necessary to propagate to hardware.

Jan

> +        /*
> +         * Always propagate writes when not in the 'active' state.  The
> +         * optimization to avoid the MSI address and data registers write is
> +         * only relevant for runtime state, and drivers on resume (at least)
> +         * rely on set_msi_affinity() to update the hardware state.
> +         */
> +        force = true;
> +
>      if ( iommu_intremap != iommu_intremap_off )
>      {
>          int rc;



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: S3 regression on AMD in 4.20
  2025-03-17 15:56         ` Roger Pau Monné
  2025-03-17 16:11           ` Jan Beulich
@ 2025-03-17 16:35           ` Marek Marczykowski-Górecki
  2025-03-17 17:35             ` Roger Pau Monné
  1 sibling, 1 reply; 17+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-03-17 16:35 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Stefano Stabellini, xen-devel, Jan Beulich,
	Doug Goldstein, Oleksii Kurochko, Mykola Kvach, Mykyta Poturai

[-- Attachment #1: Type: text/plain, Size: 10480 bytes --]

On Mon, Mar 17, 2025 at 04:56:15PM +0100, Roger Pau Monné wrote:
> On Sat, Mar 15, 2025 at 12:02:50AM +0000, Andrew Cooper wrote:
> > On 14/03/2025 11:53 pm, Marek Marczykowski-Górecki wrote:
> > > On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek Marczykowski-Górecki wrote:
> > >> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
> > >>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
> > >>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
> > >>>>
> > >>>> This one has working S3, so add a test for it here.
> > >>>>
> > >>>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > >>>> ---
> > >>>> Cc: Jan Beulich <jbeulich@suse.com>
> > >>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > >>>>
> > >>>> The suspend test added here currently fails on staging[1], but passes on
> > >>>> staging-4.19[2]. So the regression wants fixing before committing this
> > >>>> patch.
> > >>>>
> > >>>> [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
> > >>>> [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
> > >>> We could commit the patch now without the s3 test.
> > >>>
> > >>> I don't know what the x86 maintainers think about fixing the suspend
> > >>> bug, but one idea would be to run a bisection between 4.20 and 4.19.
> > >> I'm on it already, but it's annoying. Lets convert this thread to
> > >> discussion about the issue:
> > >>
> > >> So, I bisected it between staging-4.19 and master. The breakage is
> > >> somewhere between (inclusive):
> > >> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
> > >> and
> > >> 47990ecef286 x86/boot: Improve MBI2 structure check
> > >>
> > >> But, the first one breaks booting on this system and it remains broken
> > >> until the second commit (or its parent) - at which point S3 is already
> > >> broken. So, there is a range of 71 commits that may be responsible...
> > >>
> > >> But then, based on a matrix chat and Jan's observation I've tried
> > >> reverting f75780d26b2f "xen: move per-cpu area management into common
> > >> code" just on top of 47990ecef286, and that fixed suspend.
> > >> Applying "xen/percpu: don't initialize percpu on resume" on top of
> > >> 47990ecef286 fixes suspend too.
> > >> But applying it on top of master
> > >> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
> > >> but the failure mode is different than without the patch - system resets
> > >> on S3 resume, with no crash message on the serial console (even with
> > >> sync_console), instead of hanging.
> > >> And one more data point: reverting f75780d26b2f on top of master is the
> > >> same as applying "xen/percpu: don't initialize percpu on resume" on
> > >> master - system reset on S3 resume.
> > >> So, it looks like there are more issues...
> > > Another bisection round and I have the second culprit:
> > >
> > >     8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed
> > >
> > > With master+"xen/percpu: don't initialize percpu on resume"+revert of
> > > 8e60d47cf011 suspend works again on this AMD system.
> > 
> > That's not surprising in the slightest.
> > 
> > Caching hardware values in Xen isn't safe across S3, which QubesOS has
> > found time and time again, and for which we still have outstanding bugs.
> > 
> > S3 turns most of the system off.  RAM gets preserved, but devices and
> > plenty of internal registers don't.
> 
> I think I've spotted the issue.  enable_iommu() called on resume
> (ab)uses set_msi_affinity() to force an MSI register write, as it's
> previous behavior was to unconditionally propagate the values.  With
> my change it would no longer perform such writes on resume.
> 
> I think the patch below should help.

It doesn't, I still got reboot on resume, with no crash message on
serial (even with sync_console).

And the other patch (setting force=true in all calls) gets me panic
on boot:

(XEN) [   11.890757] Assertion '(msg->data & (entry[-nr].msi.nvec - 1)) == nr' failed at arch/x86/msi.c:210
(XEN) [   11.900440] ----[ Xen-4.21-unstable  x86_64  debug=y  Tainted:   C    ]----
(XEN) [   11.908082] CPU:    3
(XEN) [   11.910923] RIP:    e008:[<ffff82d040304bdc>] msi.c#write_msi_msg+0xf5/0x16c
(XEN) [   11.918652] RFLAGS: 0000000000010002   CONTEXT: hypervisor (d0v3)
(XEN) [   11.925404] rax: 0000000000000001   rbx: ffff830401ded840   rcx: ffff8303f5367d38
(XEN) [   11.933576] rdx: 0000000000000000   rsi: ffff8303f536e1d8   rdi: 8000000000000000
(XEN) [   11.941750] rbp: ffff8303f5367d90   rsp: ffff8303f5367d68   r8:  0000000000000000
(XEN) [   11.949924] r9:  0000000000000000   r10: ffff8303f5367d58   r11: 0000000000000000
(XEN) [   11.958097] r12: ffff8303f5367da0   r13: 0000000000000000   r14: 0000000000000001
(XEN) [   11.966269] r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000003506e0
(XEN) [   11.974442] cr3: 00000003f2a0c000   cr2: ffff888100109890
(XEN) [   11.980483] fsb: 0000000000000000   gsb: ffff8881352c0000   gss: 0000000000000000
(XEN) [   11.988656] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) [   11.996301] Xen code around <ffff82d040304bdc> (msi.c#write_msi_msg+0xf5/0x16c):
(XEN) [   12.004382]  41 bd 00 00 00 00 eb c2 <0f> 0b 4c 8b 73 28 44 0f b6 7b 01 41 8b 14 24 41
(XEN) [   12.012998] Xen stack trace from rsp=ffff8303f5367d68:
(XEN) [   12.018772]    ffff8303f5582000 0000000000000000 ffff830401ded840 ffff8303f555e000
(XEN) [   12.027036]    0000000000000000 ffff8303f5367dc8 ffff82d040304cd9 00000000fee00000
(XEN) [   12.035299]    0000000000004061 ffff8303f5582000 ffff830401ded380 ffff82d0403c6340
(XEN) [   12.043560]    ffff830401ded9a0 ffff82d04030b00a 000000010000002a ffff82d0403a5aaf
(XEN) [   12.051821]    ffff8303f5367ef8 ffffc90040087afc ffff8303f556f000 ffff8303f5569a80
(XEN) [   12.060083]    000000000000002a 00000000000004d8 ffff82d04020ca6f 0000002aaaaaaaaa
(XEN) [   12.068344]    ffff830401ded380 ffff8303f555e000 00000000000004d8 aaaaaaaa00000000
(XEN) [   12.076606]    aaaaaaaaaaaaaaaa ffff8303f5367ef8 0000000000000020 ffff8303f554f000
(XEN) [   12.084869]    ffffc90040087afc 0000000000000000 0000000000000002 ffff82d0402eb7fc
(XEN) [   12.093130]    ffff88810ff6a228 ffff888100448c90 ffffffffffffffff 0000000000000000
(XEN) [   12.101393]    0000000000000000 ffff8303f5367ef8 0000000000000000 ffff8303f554f000
(XEN) [   12.109654]    0000000000000000 0000000000000000 0000000000000000 ffff8303f5367fff
(XEN) [   12.117916]    0000000000000000 ffff82d0402012cd 0000000000000000 ffff88810ff6a2a4
(XEN) [   12.126179]    ffff88810ff6a360 0000000000000000 ffff88810236fde0 0000000000000060
(XEN) [   12.134442]    0000000000000246 0000000000000000 ffffffff82b3cce0 ffff888100448c90
(XEN) [   12.142703]    0000000000000020 ffffffff81df540a ffff88810ff6a228 ffffc90040087afc
(XEN) [   12.150966]    0000000000000002 0000010000000000 ffffffff81df540a 000000000000e033
(XEN) [   12.159227]    0000000000000246 ffffc90040087ae0 000000000000e02b c2c2c2c2c2c2c2c2
(XEN) [   12.167489]    c2c2c2c2c2c2c2c2 c2c2c2c2c2c2c2c2 c2c2c2c2c2c2c2c2 0000e01000000003
(XEN) [   12.175752]    ffff8303f554f000 00000033b4d9f000 00000000003506e0 0000000000000000
(XEN) [   12.184014] Xen call trace:
(XEN) [   12.187388]    [<ffff82d040304bdc>] R msi.c#write_msi_msg+0xf5/0x16c
(XEN) [   12.194407]    [<ffff82d040304cd9>] S set_msi_affinity+0x86/0x93
(XEN) [   12.201071]    [<ffff82d04030b00a>] S pirq_guest_bind+0x2c0/0x484
(XEN) [   12.207823]    [<ffff82d04020ca6f>] S do_event_channel_op+0xad1/0x11b0
(XEN) [   12.215019]    [<ffff82d0402eb7fc>] S pv_hypercall+0x55f/0x5dd
(XEN) [   12.221504]    [<ffff82d0402012cd>] S lstar_enter+0x13d/0x150
(XEN) [   12.227899] 
(XEN) [   12.229943] 
(XEN) [   12.231987] ****************************************
(XEN) [   12.237584] Panic on CPU 3:
(XEN) [   12.240960] Assertion '(msg->data & (entry[-nr].msi.nvec - 1)) == nr' failed at arch/x86/msi.c:210
(XEN) [   12.250644] ****************************************

I wonder if similar crash happens on resume with this patch.

BTW, the -nr in the above assert looks suspicious, is it trying to fine
the first entry?

> I wonder if we should unconditionally propagate the write from
> __setup_msi_irq(), as it's also unlikely to make any difference to
> skip that write, and would further keep the previous behavior.
> 
> Thanks, Roger.
> ---
> commit 1d9bfd0d45f6b547b19f0d2f752fc3bd10103971
> Author: Roger Pau Monne <roger.pau@citrix.com>
> Date:   Mon Mar 17 15:40:11 2025 +0100
> 
>     x86/msi: always propagate MSI writes when not in active system mode
>     
>     Relax the limitation on MSI register writes, and only apply it when the
>     system is in active state.  For example AMD IOMMU drivers rely on using
>     set_msi_affinity() to force an MSI register write on resume from
>     suspension.
>     
>     The original patch intention was to reduce the number of MSI register
>     writes when the system is in active state.  Leave the other states to
>     always perform the writes, as it's safer given the existing code, and it's
>     expected to not make a difference performance wise.
>     
>     Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>     Fixes: ('8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed')
>     Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 163ccf874720..8bb3bb18af61 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -189,6 +189,15 @@ static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg,
>  {
>      entry->msg = *msg;
>  
> +    if ( unlikely(system_state != SYS_STATE_active) )
> +        /*
> +         * Always propagate writes when not in the 'active' state.  The
> +         * optimization to avoid the MSI address and data registers write is
> +         * only relevant for runtime state, and drivers on resume (at least)
> +         * rely on set_msi_affinity() to update the hardware state.
> +         */
> +        force = true;
> +
>      if ( iommu_intremap != iommu_intremap_off )
>      {
>          int rc;

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: S3 regression on AMD in 4.20
  2025-03-17 16:35           ` Marek Marczykowski-Górecki
@ 2025-03-17 17:35             ` Roger Pau Monné
  2025-03-17 17:54               ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 17+ messages in thread
From: Roger Pau Monné @ 2025-03-17 17:35 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Andrew Cooper, Stefano Stabellini, xen-devel, Jan Beulich,
	Doug Goldstein, Oleksii Kurochko, Mykola Kvach, Mykyta Poturai

On Mon, Mar 17, 2025 at 05:35:51PM +0100, Marek Marczykowski-Górecki wrote:
> On Mon, Mar 17, 2025 at 04:56:15PM +0100, Roger Pau Monné wrote:
> > On Sat, Mar 15, 2025 at 12:02:50AM +0000, Andrew Cooper wrote:
> > > On 14/03/2025 11:53 pm, Marek Marczykowski-Górecki wrote:
> > > > On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek Marczykowski-Górecki wrote:
> > > >> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
> > > >>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
> > > >>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
> > > >>>>
> > > >>>> This one has working S3, so add a test for it here.
> > > >>>>
> > > >>>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > >>>> ---
> > > >>>> Cc: Jan Beulich <jbeulich@suse.com>
> > > >>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > > >>>>
> > > >>>> The suspend test added here currently fails on staging[1], but passes on
> > > >>>> staging-4.19[2]. So the regression wants fixing before committing this
> > > >>>> patch.
> > > >>>>
> > > >>>> [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
> > > >>>> [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
> > > >>> We could commit the patch now without the s3 test.
> > > >>>
> > > >>> I don't know what the x86 maintainers think about fixing the suspend
> > > >>> bug, but one idea would be to run a bisection between 4.20 and 4.19.
> > > >> I'm on it already, but it's annoying. Lets convert this thread to
> > > >> discussion about the issue:
> > > >>
> > > >> So, I bisected it between staging-4.19 and master. The breakage is
> > > >> somewhere between (inclusive):
> > > >> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
> > > >> and
> > > >> 47990ecef286 x86/boot: Improve MBI2 structure check
> > > >>
> > > >> But, the first one breaks booting on this system and it remains broken
> > > >> until the second commit (or its parent) - at which point S3 is already
> > > >> broken. So, there is a range of 71 commits that may be responsible...
> > > >>
> > > >> But then, based on a matrix chat and Jan's observation I've tried
> > > >> reverting f75780d26b2f "xen: move per-cpu area management into common
> > > >> code" just on top of 47990ecef286, and that fixed suspend.
> > > >> Applying "xen/percpu: don't initialize percpu on resume" on top of
> > > >> 47990ecef286 fixes suspend too.
> > > >> But applying it on top of master
> > > >> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
> > > >> but the failure mode is different than without the patch - system resets
> > > >> on S3 resume, with no crash message on the serial console (even with
> > > >> sync_console), instead of hanging.
> > > >> And one more data point: reverting f75780d26b2f on top of master is the
> > > >> same as applying "xen/percpu: don't initialize percpu on resume" on
> > > >> master - system reset on S3 resume.
> > > >> So, it looks like there are more issues...
> > > > Another bisection round and I have the second culprit:
> > > >
> > > >     8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed
> > > >
> > > > With master+"xen/percpu: don't initialize percpu on resume"+revert of
> > > > 8e60d47cf011 suspend works again on this AMD system.
> > > 
> > > That's not surprising in the slightest.
> > > 
> > > Caching hardware values in Xen isn't safe across S3, which QubesOS has
> > > found time and time again, and for which we still have outstanding bugs.
> > > 
> > > S3 turns most of the system off.  RAM gets preserved, but devices and
> > > plenty of internal registers don't.
> > 
> > I think I've spotted the issue.  enable_iommu() called on resume
> > (ab)uses set_msi_affinity() to force an MSI register write, as it's
> > previous behavior was to unconditionally propagate the values.  With
> > my change it would no longer perform such writes on resume.
> > 
> > I think the patch below should help.
> 
> It doesn't, I still got reboot on resume, with no crash message on
> serial (even with sync_console).

There was an error with the previous patch, and it's also a bug in the
original patch.  Could you give a try to the updated patch below?

Sorry for bothering you, but ATM I haven't found a way to trigger
this myself.

Thanks, Roger.
---
commit 70ea4301d5ca663ac6d850658b3fe832950ec363
Author: Roger Pau Monne <roger.pau@citrix.com>
Date:   Mon Mar 17 15:40:11 2025 +0100

    x86/msi: always propagate MSI writes when not in active system mode
    
    Relax the limitation on MSI register writes, and only apply it when the
    system is in active state.  For example AMD IOMMU drivers rely on using
    set_msi_affinity() to force an MSI register write on resume from
    suspension.
    
    The original patch intention was to reduce the number of MSI register
    writes when the system is in active state.  Leave the other states to
    always perform the writes, as it's safer given the existing code, and it's
    expected to not make a difference performance wise.
    
    For such propagation to work even when the IRT index is not updated the MSI
    message must be adjusted in all success cases for AMD IOMMU, not just when
    the index has been newly allocated.
    
    Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
    Fixes: ('8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed')
    Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 163ccf874720..8bb3bb18af61 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -189,6 +189,15 @@ static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg,
 {
     entry->msg = *msg;
 
+    if ( unlikely(system_state != SYS_STATE_active) )
+        /*
+         * Always propagate writes when not in the 'active' state.  The
+         * optimization to avoid the MSI address and data registers write is
+         * only relevant for runtime state, and drivers on resume (at least)
+         * rely on set_msi_affinity() to update the hardware state.
+         */
+        force = true;
+
     if ( iommu_intremap != iommu_intremap_off )
     {
         int rc;
diff --git a/xen/drivers/passthrough/amd/iommu_intr.c b/xen/drivers/passthrough/amd/iommu_intr.c
index 9abdc38053d7..08766122b421 100644
--- a/xen/drivers/passthrough/amd/iommu_intr.c
+++ b/xen/drivers/passthrough/amd/iommu_intr.c
@@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire(
     rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
                                             &msi_desc->remap_index,
                                             msg, &data);
-    if ( rc > 0 )
+    if ( rc >= 0 )
     {
         for ( i = 1; i < nr; ++i )
             msi_desc[i].remap_index = msi_desc->remap_index + i;



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: S3 regression on AMD in 4.20
  2025-03-17 16:11           ` Jan Beulich
@ 2025-03-17 17:38             ` Roger Pau Monné
  2025-03-18  6:24               ` Jan Beulich
  0 siblings, 1 reply; 17+ messages in thread
From: Roger Pau Monné @ 2025-03-17 17:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, xen-devel, Doug Goldstein, Oleksii Kurochko,
	Mykola Kvach, Mykyta Poturai, Andrew Cooper,
	Marek Marczykowski-Górecki

On Mon, Mar 17, 2025 at 05:11:56PM +0100, Jan Beulich wrote:
> On 17.03.2025 16:56, Roger Pau Monné wrote:
> > On Sat, Mar 15, 2025 at 12:02:50AM +0000, Andrew Cooper wrote:
> >> On 14/03/2025 11:53 pm, Marek Marczykowski-Górecki wrote:
> >>> On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek Marczykowski-Górecki wrote:
> >>>> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
> >>>>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
> >>>>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
> >>>>>>
> >>>>>> This one has working S3, so add a test for it here.
> >>>>>>
> >>>>>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> >>>>>> ---
> >>>>>> Cc: Jan Beulich <jbeulich@suse.com>
> >>>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> >>>>>>
> >>>>>> The suspend test added here currently fails on staging[1], but passes on
> >>>>>> staging-4.19[2]. So the regression wants fixing before committing this
> >>>>>> patch.
> >>>>>>
> >>>>>> [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
> >>>>>> [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
> >>>>> We could commit the patch now without the s3 test.
> >>>>>
> >>>>> I don't know what the x86 maintainers think about fixing the suspend
> >>>>> bug, but one idea would be to run a bisection between 4.20 and 4.19.
> >>>> I'm on it already, but it's annoying. Lets convert this thread to
> >>>> discussion about the issue:
> >>>>
> >>>> So, I bisected it between staging-4.19 and master. The breakage is
> >>>> somewhere between (inclusive):
> >>>> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
> >>>> and
> >>>> 47990ecef286 x86/boot: Improve MBI2 structure check
> >>>>
> >>>> But, the first one breaks booting on this system and it remains broken
> >>>> until the second commit (or its parent) - at which point S3 is already
> >>>> broken. So, there is a range of 71 commits that may be responsible...
> >>>>
> >>>> But then, based on a matrix chat and Jan's observation I've tried
> >>>> reverting f75780d26b2f "xen: move per-cpu area management into common
> >>>> code" just on top of 47990ecef286, and that fixed suspend.
> >>>> Applying "xen/percpu: don't initialize percpu on resume" on top of
> >>>> 47990ecef286 fixes suspend too.
> >>>> But applying it on top of master
> >>>> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
> >>>> but the failure mode is different than without the patch - system resets
> >>>> on S3 resume, with no crash message on the serial console (even with
> >>>> sync_console), instead of hanging.
> >>>> And one more data point: reverting f75780d26b2f on top of master is the
> >>>> same as applying "xen/percpu: don't initialize percpu on resume" on
> >>>> master - system reset on S3 resume.
> >>>> So, it looks like there are more issues...
> >>> Another bisection round and I have the second culprit:
> >>>
> >>>     8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed
> >>>
> >>> With master+"xen/percpu: don't initialize percpu on resume"+revert of
> >>> 8e60d47cf011 suspend works again on this AMD system.
> >>
> >> That's not surprising in the slightest.
> >>
> >> Caching hardware values in Xen isn't safe across S3, which QubesOS has
> >> found time and time again, and for which we still have outstanding bugs.
> >>
> >> S3 turns most of the system off.  RAM gets preserved, but devices and
> >> plenty of internal registers don't.
> > 
> > I think I've spotted the issue.  enable_iommu() called on resume
> > (ab)uses set_msi_affinity() to force an MSI register write, as it's
> > previous behavior was to unconditionally propagate the values.  With
> > my change it would no longer perform such writes on resume.
> > 
> > I think the patch below should help.
> > 
> > I wonder if we should unconditionally propagate the write from
> > __setup_msi_irq(), as it's also unlikely to make any difference to
> > skip that write, and would further keep the previous behavior.
> 
> That might be better. In fact I wonder whether it wouldn't better be
> callers of write_msi_msg() to use ...
> 
> > ---
> > commit 1d9bfd0d45f6b547b19f0d2f752fc3bd10103971
> > Author: Roger Pau Monne <roger.pau@citrix.com>
> > Date:   Mon Mar 17 15:40:11 2025 +0100
> > 
> >     x86/msi: always propagate MSI writes when not in active system mode
> >     
> >     Relax the limitation on MSI register writes, and only apply it when the
> >     system is in active state.  For example AMD IOMMU drivers rely on using
> >     set_msi_affinity() to force an MSI register write on resume from
> >     suspension.
> >     
> >     The original patch intention was to reduce the number of MSI register
> >     writes when the system is in active state.  Leave the other states to
> >     always perform the writes, as it's safer given the existing code, and it's
> >     expected to not make a difference performance wise.
> >     
> >     Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> >     Fixes: ('8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed')
> >     Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > 
> > --- a/xen/arch/x86/msi.c
> > +++ b/xen/arch/x86/msi.c
> > @@ -189,6 +189,15 @@ static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg,
> >  {
> >      entry->msg = *msg;
> >  
> > +    if ( unlikely(system_state != SYS_STATE_active) )
> 
> ... this condition as needed. Conceivably there might be cases where even
> during resume writes aren't always necessary to propagate to hardware.

Isn't this going to be quite more cumbersome, and I don't think the
cases outside of active mode are very interesting or relevant
performance wise?

The main purpose of the original change was the reduce the number of
PCI accesses during 'active' state, as every time an interrupt is
migrated to a different CPU such a (possibly redundant) PCI config
write would happen.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: S3 regression on AMD in 4.20
  2025-03-17 17:35             ` Roger Pau Monné
@ 2025-03-17 17:54               ` Marek Marczykowski-Górecki
  2025-03-18  8:31                 ` Roger Pau Monné
  0 siblings, 1 reply; 17+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-03-17 17:54 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Stefano Stabellini, xen-devel, Jan Beulich,
	Doug Goldstein, Oleksii Kurochko, Mykola Kvach, Mykyta Poturai

[-- Attachment #1: Type: text/plain, Size: 7532 bytes --]

On Mon, Mar 17, 2025 at 06:35:08PM +0100, Roger Pau Monné wrote:
> On Mon, Mar 17, 2025 at 05:35:51PM +0100, Marek Marczykowski-Górecki wrote:
> > On Mon, Mar 17, 2025 at 04:56:15PM +0100, Roger Pau Monné wrote:
> > > On Sat, Mar 15, 2025 at 12:02:50AM +0000, Andrew Cooper wrote:
> > > > On 14/03/2025 11:53 pm, Marek Marczykowski-Górecki wrote:
> > > > > On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek Marczykowski-Górecki wrote:
> > > > >> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
> > > > >>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
> > > > >>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
> > > > >>>>
> > > > >>>> This one has working S3, so add a test for it here.
> > > > >>>>
> > > > >>>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > > >>>> ---
> > > > >>>> Cc: Jan Beulich <jbeulich@suse.com>
> > > > >>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > > > >>>>
> > > > >>>> The suspend test added here currently fails on staging[1], but passes on
> > > > >>>> staging-4.19[2]. So the regression wants fixing before committing this
> > > > >>>> patch.
> > > > >>>>
> > > > >>>> [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
> > > > >>>> [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
> > > > >>> We could commit the patch now without the s3 test.
> > > > >>>
> > > > >>> I don't know what the x86 maintainers think about fixing the suspend
> > > > >>> bug, but one idea would be to run a bisection between 4.20 and 4.19.
> > > > >> I'm on it already, but it's annoying. Lets convert this thread to
> > > > >> discussion about the issue:
> > > > >>
> > > > >> So, I bisected it between staging-4.19 and master. The breakage is
> > > > >> somewhere between (inclusive):
> > > > >> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
> > > > >> and
> > > > >> 47990ecef286 x86/boot: Improve MBI2 structure check
> > > > >>
> > > > >> But, the first one breaks booting on this system and it remains broken
> > > > >> until the second commit (or its parent) - at which point S3 is already
> > > > >> broken. So, there is a range of 71 commits that may be responsible...
> > > > >>
> > > > >> But then, based on a matrix chat and Jan's observation I've tried
> > > > >> reverting f75780d26b2f "xen: move per-cpu area management into common
> > > > >> code" just on top of 47990ecef286, and that fixed suspend.
> > > > >> Applying "xen/percpu: don't initialize percpu on resume" on top of
> > > > >> 47990ecef286 fixes suspend too.
> > > > >> But applying it on top of master
> > > > >> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
> > > > >> but the failure mode is different than without the patch - system resets
> > > > >> on S3 resume, with no crash message on the serial console (even with
> > > > >> sync_console), instead of hanging.
> > > > >> And one more data point: reverting f75780d26b2f on top of master is the
> > > > >> same as applying "xen/percpu: don't initialize percpu on resume" on
> > > > >> master - system reset on S3 resume.
> > > > >> So, it looks like there are more issues...
> > > > > Another bisection round and I have the second culprit:
> > > > >
> > > > >     8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed
> > > > >
> > > > > With master+"xen/percpu: don't initialize percpu on resume"+revert of
> > > > > 8e60d47cf011 suspend works again on this AMD system.
> > > > 
> > > > That's not surprising in the slightest.
> > > > 
> > > > Caching hardware values in Xen isn't safe across S3, which QubesOS has
> > > > found time and time again, and for which we still have outstanding bugs.
> > > > 
> > > > S3 turns most of the system off.  RAM gets preserved, but devices and
> > > > plenty of internal registers don't.
> > > 
> > > I think I've spotted the issue.  enable_iommu() called on resume
> > > (ab)uses set_msi_affinity() to force an MSI register write, as it's
> > > previous behavior was to unconditionally propagate the values.  With
> > > my change it would no longer perform such writes on resume.
> > > 
> > > I think the patch below should help.
> > 
> > It doesn't, I still got reboot on resume, with no crash message on
> > serial (even with sync_console).
> 
> There was an error with the previous patch, and it's also a bug in the
> original patch.  Could you give a try to the updated patch below?
> 
> Sorry for bothering you, but ATM I haven't found a way to trigger
> this myself.

This one does fix the issue :)


> Thanks, Roger.
> ---
> commit 70ea4301d5ca663ac6d850658b3fe832950ec363
> Author: Roger Pau Monne <roger.pau@citrix.com>
> Date:   Mon Mar 17 15:40:11 2025 +0100
> 
>     x86/msi: always propagate MSI writes when not in active system mode
>     
>     Relax the limitation on MSI register writes, and only apply it when the
>     system is in active state.  For example AMD IOMMU drivers rely on using
>     set_msi_affinity() to force an MSI register write on resume from
>     suspension.
>     
>     The original patch intention was to reduce the number of MSI register
>     writes when the system is in active state.  Leave the other states to
>     always perform the writes, as it's safer given the existing code, and it's
>     expected to not make a difference performance wise.
>     
>     For such propagation to work even when the IRT index is not updated the MSI
>     message must be adjusted in all success cases for AMD IOMMU, not just when
>     the index has been newly allocated.
>     
>     Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>     Fixes: ('8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed')
>     Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 163ccf874720..8bb3bb18af61 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -189,6 +189,15 @@ static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg,
>  {
>      entry->msg = *msg;
>  
> +    if ( unlikely(system_state != SYS_STATE_active) )
> +        /*
> +         * Always propagate writes when not in the 'active' state.  The
> +         * optimization to avoid the MSI address and data registers write is
> +         * only relevant for runtime state, and drivers on resume (at least)
> +         * rely on set_msi_affinity() to update the hardware state.
> +         */
> +        force = true;
> +
>      if ( iommu_intremap != iommu_intremap_off )
>      {
>          int rc;
> diff --git a/xen/drivers/passthrough/amd/iommu_intr.c b/xen/drivers/passthrough/amd/iommu_intr.c
> index 9abdc38053d7..08766122b421 100644
> --- a/xen/drivers/passthrough/amd/iommu_intr.c
> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
> @@ -546,7 +546,7 @@ int cf_check amd_iommu_msi_msg_update_ire(
>      rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
>                                              &msi_desc->remap_index,
>                                              msg, &data);
> -    if ( rc > 0 )
> +    if ( rc >= 0 )
>      {
>          for ( i = 1; i < nr; ++i )
>              msi_desc[i].remap_index = msi_desc->remap_index + i;
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: S3 regression on AMD in 4.20
  2025-03-17 17:38             ` Roger Pau Monné
@ 2025-03-18  6:24               ` Jan Beulich
  2025-03-18  8:25                 ` Roger Pau Monné
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2025-03-18  6:24 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Stefano Stabellini, xen-devel, Doug Goldstein, Oleksii Kurochko,
	Mykola Kvach, Mykyta Poturai, Andrew Cooper,
	Marek Marczykowski-Górecki

On 17.03.2025 18:38, Roger Pau Monné wrote:
> On Mon, Mar 17, 2025 at 05:11:56PM +0100, Jan Beulich wrote:
>> On 17.03.2025 16:56, Roger Pau Monné wrote:
>>> On Sat, Mar 15, 2025 at 12:02:50AM +0000, Andrew Cooper wrote:
>>>> On 14/03/2025 11:53 pm, Marek Marczykowski-Górecki wrote:
>>>>> On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek Marczykowski-Górecki wrote:
>>>>>> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
>>>>>>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
>>>>>>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
>>>>>>>>
>>>>>>>> This one has working S3, so add a test for it here.
>>>>>>>>
>>>>>>>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>>>>>>>> ---
>>>>>>>> Cc: Jan Beulich <jbeulich@suse.com>
>>>>>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>>>>>
>>>>>>>> The suspend test added here currently fails on staging[1], but passes on
>>>>>>>> staging-4.19[2]. So the regression wants fixing before committing this
>>>>>>>> patch.
>>>>>>>>
>>>>>>>> [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
>>>>>>>> [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
>>>>>>> We could commit the patch now without the s3 test.
>>>>>>>
>>>>>>> I don't know what the x86 maintainers think about fixing the suspend
>>>>>>> bug, but one idea would be to run a bisection between 4.20 and 4.19.
>>>>>> I'm on it already, but it's annoying. Lets convert this thread to
>>>>>> discussion about the issue:
>>>>>>
>>>>>> So, I bisected it between staging-4.19 and master. The breakage is
>>>>>> somewhere between (inclusive):
>>>>>> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
>>>>>> and
>>>>>> 47990ecef286 x86/boot: Improve MBI2 structure check
>>>>>>
>>>>>> But, the first one breaks booting on this system and it remains broken
>>>>>> until the second commit (or its parent) - at which point S3 is already
>>>>>> broken. So, there is a range of 71 commits that may be responsible...
>>>>>>
>>>>>> But then, based on a matrix chat and Jan's observation I've tried
>>>>>> reverting f75780d26b2f "xen: move per-cpu area management into common
>>>>>> code" just on top of 47990ecef286, and that fixed suspend.
>>>>>> Applying "xen/percpu: don't initialize percpu on resume" on top of
>>>>>> 47990ecef286 fixes suspend too.
>>>>>> But applying it on top of master
>>>>>> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
>>>>>> but the failure mode is different than without the patch - system resets
>>>>>> on S3 resume, with no crash message on the serial console (even with
>>>>>> sync_console), instead of hanging.
>>>>>> And one more data point: reverting f75780d26b2f on top of master is the
>>>>>> same as applying "xen/percpu: don't initialize percpu on resume" on
>>>>>> master - system reset on S3 resume.
>>>>>> So, it looks like there are more issues...
>>>>> Another bisection round and I have the second culprit:
>>>>>
>>>>>     8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed
>>>>>
>>>>> With master+"xen/percpu: don't initialize percpu on resume"+revert of
>>>>> 8e60d47cf011 suspend works again on this AMD system.
>>>>
>>>> That's not surprising in the slightest.
>>>>
>>>> Caching hardware values in Xen isn't safe across S3, which QubesOS has
>>>> found time and time again, and for which we still have outstanding bugs.
>>>>
>>>> S3 turns most of the system off.  RAM gets preserved, but devices and
>>>> plenty of internal registers don't.
>>>
>>> I think I've spotted the issue.  enable_iommu() called on resume
>>> (ab)uses set_msi_affinity() to force an MSI register write, as it's
>>> previous behavior was to unconditionally propagate the values.  With
>>> my change it would no longer perform such writes on resume.
>>>
>>> I think the patch below should help.
>>>
>>> I wonder if we should unconditionally propagate the write from
>>> __setup_msi_irq(), as it's also unlikely to make any difference to
>>> skip that write, and would further keep the previous behavior.
>>
>> That might be better. In fact I wonder whether it wouldn't better be
>> callers of write_msi_msg() to use ...
>>
>>> ---
>>> commit 1d9bfd0d45f6b547b19f0d2f752fc3bd10103971
>>> Author: Roger Pau Monne <roger.pau@citrix.com>
>>> Date:   Mon Mar 17 15:40:11 2025 +0100
>>>
>>>     x86/msi: always propagate MSI writes when not in active system mode
>>>     
>>>     Relax the limitation on MSI register writes, and only apply it when the
>>>     system is in active state.  For example AMD IOMMU drivers rely on using
>>>     set_msi_affinity() to force an MSI register write on resume from
>>>     suspension.
>>>     
>>>     The original patch intention was to reduce the number of MSI register
>>>     writes when the system is in active state.  Leave the other states to
>>>     always perform the writes, as it's safer given the existing code, and it's
>>>     expected to not make a difference performance wise.
>>>     
>>>     Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>>>     Fixes: ('8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed')
>>>     Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>>
>>> --- a/xen/arch/x86/msi.c
>>> +++ b/xen/arch/x86/msi.c
>>> @@ -189,6 +189,15 @@ static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg,
>>>  {
>>>      entry->msg = *msg;
>>>  
>>> +    if ( unlikely(system_state != SYS_STATE_active) )
>>
>> ... this condition as needed. Conceivably there might be cases where even
>> during resume writes aren't always necessary to propagate to hardware.
> 
> Isn't this going to be quite more cumbersome, and I don't think the
> cases outside of active mode are very interesting or relevant
> performance wise?

Certainly. As to cumbersome - having the conditions at the call sites
would also serve as kind of documentation. That said, this was just a
suggestion; I'm not going to insist. I see from other traffic on this
thread the updated patch is now fully addressing the regression. The
secondary change, if not broken out to a separate patch, would want
mentioning in the description, though (which may have been the plan
anyway).

Jan

> The main purpose of the original change was the reduce the number of
> PCI accesses during 'active' state, as every time an interrupt is
> migrated to a different CPU such a (possibly redundant) PCI config
> write would happen.
> 
> Thanks, Roger.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: S3 regression on AMD in 4.20
  2025-03-18  6:24               ` Jan Beulich
@ 2025-03-18  8:25                 ` Roger Pau Monné
  0 siblings, 0 replies; 17+ messages in thread
From: Roger Pau Monné @ 2025-03-18  8:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, xen-devel, Doug Goldstein, Oleksii Kurochko,
	Mykola Kvach, Mykyta Poturai, Andrew Cooper,
	Marek Marczykowski-Górecki

On Tue, Mar 18, 2025 at 07:24:31AM +0100, Jan Beulich wrote:
> On 17.03.2025 18:38, Roger Pau Monné wrote:
> > On Mon, Mar 17, 2025 at 05:11:56PM +0100, Jan Beulich wrote:
> >> On 17.03.2025 16:56, Roger Pau Monné wrote:
> >>> On Sat, Mar 15, 2025 at 12:02:50AM +0000, Andrew Cooper wrote:
> >>>> On 14/03/2025 11:53 pm, Marek Marczykowski-Górecki wrote:
> >>>>> On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek Marczykowski-Górecki wrote:
> >>>>>> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
> >>>>>>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
> >>>>>>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
> >>>>>>>>
> >>>>>>>> This one has working S3, so add a test for it here.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> >>>>>>>> ---
> >>>>>>>> Cc: Jan Beulich <jbeulich@suse.com>
> >>>>>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> >>>>>>>>
> >>>>>>>> The suspend test added here currently fails on staging[1], but passes on
> >>>>>>>> staging-4.19[2]. So the regression wants fixing before committing this
> >>>>>>>> patch.
> >>>>>>>>
> >>>>>>>> [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
> >>>>>>>> [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
> >>>>>>> We could commit the patch now without the s3 test.
> >>>>>>>
> >>>>>>> I don't know what the x86 maintainers think about fixing the suspend
> >>>>>>> bug, but one idea would be to run a bisection between 4.20 and 4.19.
> >>>>>> I'm on it already, but it's annoying. Lets convert this thread to
> >>>>>> discussion about the issue:
> >>>>>>
> >>>>>> So, I bisected it between staging-4.19 and master. The breakage is
> >>>>>> somewhere between (inclusive):
> >>>>>> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
> >>>>>> and
> >>>>>> 47990ecef286 x86/boot: Improve MBI2 structure check
> >>>>>>
> >>>>>> But, the first one breaks booting on this system and it remains broken
> >>>>>> until the second commit (or its parent) - at which point S3 is already
> >>>>>> broken. So, there is a range of 71 commits that may be responsible...
> >>>>>>
> >>>>>> But then, based on a matrix chat and Jan's observation I've tried
> >>>>>> reverting f75780d26b2f "xen: move per-cpu area management into common
> >>>>>> code" just on top of 47990ecef286, and that fixed suspend.
> >>>>>> Applying "xen/percpu: don't initialize percpu on resume" on top of
> >>>>>> 47990ecef286 fixes suspend too.
> >>>>>> But applying it on top of master
> >>>>>> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
> >>>>>> but the failure mode is different than without the patch - system resets
> >>>>>> on S3 resume, with no crash message on the serial console (even with
> >>>>>> sync_console), instead of hanging.
> >>>>>> And one more data point: reverting f75780d26b2f on top of master is the
> >>>>>> same as applying "xen/percpu: don't initialize percpu on resume" on
> >>>>>> master - system reset on S3 resume.
> >>>>>> So, it looks like there are more issues...
> >>>>> Another bisection round and I have the second culprit:
> >>>>>
> >>>>>     8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed
> >>>>>
> >>>>> With master+"xen/percpu: don't initialize percpu on resume"+revert of
> >>>>> 8e60d47cf011 suspend works again on this AMD system.
> >>>>
> >>>> That's not surprising in the slightest.
> >>>>
> >>>> Caching hardware values in Xen isn't safe across S3, which QubesOS has
> >>>> found time and time again, and for which we still have outstanding bugs.
> >>>>
> >>>> S3 turns most of the system off.  RAM gets preserved, but devices and
> >>>> plenty of internal registers don't.
> >>>
> >>> I think I've spotted the issue.  enable_iommu() called on resume
> >>> (ab)uses set_msi_affinity() to force an MSI register write, as it's
> >>> previous behavior was to unconditionally propagate the values.  With
> >>> my change it would no longer perform such writes on resume.
> >>>
> >>> I think the patch below should help.
> >>>
> >>> I wonder if we should unconditionally propagate the write from
> >>> __setup_msi_irq(), as it's also unlikely to make any difference to
> >>> skip that write, and would further keep the previous behavior.
> >>
> >> That might be better. In fact I wonder whether it wouldn't better be
> >> callers of write_msi_msg() to use ...
> >>
> >>> ---
> >>> commit 1d9bfd0d45f6b547b19f0d2f752fc3bd10103971
> >>> Author: Roger Pau Monne <roger.pau@citrix.com>
> >>> Date:   Mon Mar 17 15:40:11 2025 +0100
> >>>
> >>>     x86/msi: always propagate MSI writes when not in active system mode
> >>>     
> >>>     Relax the limitation on MSI register writes, and only apply it when the
> >>>     system is in active state.  For example AMD IOMMU drivers rely on using
> >>>     set_msi_affinity() to force an MSI register write on resume from
> >>>     suspension.
> >>>     
> >>>     The original patch intention was to reduce the number of MSI register
> >>>     writes when the system is in active state.  Leave the other states to
> >>>     always perform the writes, as it's safer given the existing code, and it's
> >>>     expected to not make a difference performance wise.
> >>>     
> >>>     Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> >>>     Fixes: ('8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed')
> >>>     Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >>>
> >>> --- a/xen/arch/x86/msi.c
> >>> +++ b/xen/arch/x86/msi.c
> >>> @@ -189,6 +189,15 @@ static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg,
> >>>  {
> >>>      entry->msg = *msg;
> >>>  
> >>> +    if ( unlikely(system_state != SYS_STATE_active) )
> >>
> >> ... this condition as needed. Conceivably there might be cases where even
> >> during resume writes aren't always necessary to propagate to hardware.
> > 
> > Isn't this going to be quite more cumbersome, and I don't think the
> > cases outside of active mode are very interesting or relevant
> > performance wise?
> 
> Certainly. As to cumbersome - having the conditions at the call sites
> would also serve as kind of documentation. That said, this was just a
> suggestion; I'm not going to insist. I see from other traffic on this
> thread the updated patch is now fully addressing the regression. The
> secondary change, if not broken out to a separate patch, would want
> mentioning in the description, though (which may have been the plan
> anyway).

Indeed, it's already mentioned in the description.  Let me formally
send the patch.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: S3 regression on AMD in 4.20
  2025-03-17 17:54               ` Marek Marczykowski-Górecki
@ 2025-03-18  8:31                 ` Roger Pau Monné
  0 siblings, 0 replies; 17+ messages in thread
From: Roger Pau Monné @ 2025-03-18  8:31 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Andrew Cooper, Stefano Stabellini, xen-devel, Jan Beulich,
	Doug Goldstein, Oleksii Kurochko, Mykola Kvach, Mykyta Poturai

On Mon, Mar 17, 2025 at 06:54:06PM +0100, Marek Marczykowski-Górecki wrote:
> On Mon, Mar 17, 2025 at 06:35:08PM +0100, Roger Pau Monné wrote:
> > On Mon, Mar 17, 2025 at 05:35:51PM +0100, Marek Marczykowski-Górecki wrote:
> > > On Mon, Mar 17, 2025 at 04:56:15PM +0100, Roger Pau Monné wrote:
> > > > On Sat, Mar 15, 2025 at 12:02:50AM +0000, Andrew Cooper wrote:
> > > > > On 14/03/2025 11:53 pm, Marek Marczykowski-Górecki wrote:
> > > > > > On Fri, Mar 14, 2025 at 11:23:28PM +0100, Marek Marczykowski-Górecki wrote:
> > > > > >> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
> > > > > >>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
> > > > > >>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
> > > > > >>>>
> > > > > >>>> This one has working S3, so add a test for it here.
> > > > > >>>>
> > > > > >>>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > > > >>>> ---
> > > > > >>>> Cc: Jan Beulich <jbeulich@suse.com>
> > > > > >>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > > > > >>>>
> > > > > >>>> The suspend test added here currently fails on staging[1], but passes on
> > > > > >>>> staging-4.19[2]. So the regression wants fixing before committing this
> > > > > >>>> patch.
> > > > > >>>>
> > > > > >>>> [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
> > > > > >>>> [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
> > > > > >>> We could commit the patch now without the s3 test.
> > > > > >>>
> > > > > >>> I don't know what the x86 maintainers think about fixing the suspend
> > > > > >>> bug, but one idea would be to run a bisection between 4.20 and 4.19.
> > > > > >> I'm on it already, but it's annoying. Lets convert this thread to
> > > > > >> discussion about the issue:
> > > > > >>
> > > > > >> So, I bisected it between staging-4.19 and master. The breakage is
> > > > > >> somewhere between (inclusive):
> > > > > >> eb21ce14d709 x86/boot: Rewrite EFI/MBI2 code partly in C
> > > > > >> and
> > > > > >> 47990ecef286 x86/boot: Improve MBI2 structure check
> > > > > >>
> > > > > >> But, the first one breaks booting on this system and it remains broken
> > > > > >> until the second commit (or its parent) - at which point S3 is already
> > > > > >> broken. So, there is a range of 71 commits that may be responsible...
> > > > > >>
> > > > > >> But then, based on a matrix chat and Jan's observation I've tried
> > > > > >> reverting f75780d26b2f "xen: move per-cpu area management into common
> > > > > >> code" just on top of 47990ecef286, and that fixed suspend.
> > > > > >> Applying "xen/percpu: don't initialize percpu on resume" on top of
> > > > > >> 47990ecef286 fixes suspend too.
> > > > > >> But applying it on top of master
> > > > > >> (91772f8420dfa2fcfe4db68480c216db5b79c512 specifically) does not fix it,
> > > > > >> but the failure mode is different than without the patch - system resets
> > > > > >> on S3 resume, with no crash message on the serial console (even with
> > > > > >> sync_console), instead of hanging.
> > > > > >> And one more data point: reverting f75780d26b2f on top of master is the
> > > > > >> same as applying "xen/percpu: don't initialize percpu on resume" on
> > > > > >> master - system reset on S3 resume.
> > > > > >> So, it looks like there are more issues...
> > > > > > Another bisection round and I have the second culprit:
> > > > > >
> > > > > >     8e60d47cf011 x86/iommu: avoid MSI address and data writes if IRT index hasn't changed
> > > > > >
> > > > > > With master+"xen/percpu: don't initialize percpu on resume"+revert of
> > > > > > 8e60d47cf011 suspend works again on this AMD system.
> > > > > 
> > > > > That's not surprising in the slightest.
> > > > > 
> > > > > Caching hardware values in Xen isn't safe across S3, which QubesOS has
> > > > > found time and time again, and for which we still have outstanding bugs.
> > > > > 
> > > > > S3 turns most of the system off.  RAM gets preserved, but devices and
> > > > > plenty of internal registers don't.
> > > > 
> > > > I think I've spotted the issue.  enable_iommu() called on resume
> > > > (ab)uses set_msi_affinity() to force an MSI register write, as it's
> > > > previous behavior was to unconditionally propagate the values.  With
> > > > my change it would no longer perform such writes on resume.
> > > > 
> > > > I think the patch below should help.
> > > 
> > > It doesn't, I still got reboot on resume, with no crash message on
> > > serial (even with sync_console).
> > 
> > There was an error with the previous patch, and it's also a bug in the
> > original patch.  Could you give a try to the updated patch below?
> > 
> > Sorry for bothering you, but ATM I haven't found a way to trigger
> > this myself.
> 
> This one does fix the issue :)

I've now formally sent the patch, could you provide a Tested-by for
it?

https://lore.kernel.org/xen-devel/20250318082945.52019-1-roger.pau@citrix.com/T/#u

Thanks, Roger.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] ci: add yet another HW runner
  2025-03-14 21:19 ` Stefano Stabellini
  2025-03-14 22:23   ` S3 regression on AMD in 4.20 (was: Re: [PATCH] ci: add yet another HW runner) Marek Marczykowski-Górecki
@ 2025-03-31 13:16   ` Marek Marczykowski-Górecki
  2025-03-31 13:45     ` Andrew Cooper
  1 sibling, 1 reply; 17+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-03-31 13:16 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: xen-devel, Jan Beulich, Andrew Cooper, Doug Goldstein,
	Roger Pau Monné

[-- Attachment #1: Type: text/plain, Size: 3004 bytes --]

On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
> > This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
> > 
> > This one has working S3, so add a test for it here.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > ---
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > 
> > The suspend test added here currently fails on staging[1], but passes on
> > staging-4.19[2]. So the regression wants fixing before committing this
> > patch.
> 
> We could commit the patch now without the s3 test.
> 
> I don't know what the x86 maintainers think about fixing the suspend
> bug, but one idea would be to run a bisection between 4.20 and 4.19.

This passes on staging now:
https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1744682789

So, the patch is good to commit as is.

> > For this to work, the runner needs to be added to relevant repositories
> > (especially hardware/xen one). Somebody with appropriate access need to
> > go to Settings->CI/CD->Runners and click "enable for this project" on
> > hal9001 runner.
> 
> I did that now
> 
> 
> > [1] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408437140
> > [2] https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/9408943441
> > ---
> >  automation/gitlab-ci/test.yaml | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> > 
> > diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
> > index 9530e36e9aaa..6b8e1b830e3d 100644
> > --- a/automation/gitlab-ci/test.yaml
> > +++ b/automation/gitlab-ci/test.yaml
> > @@ -155,6 +155,16 @@
> >    tags:
> >      - qubes-hw2
> >  
> > +.zen2-x86-64:
> > +  # it's really similar to the above
> > +  extends: .adl-x86-64
> > +  variables:
> > +    PCIDEV: "01:00.0"
> > +    PCIDEV_INTR: "MSI-X"
> > +    CONSOLE_OPTS: "console=com1 com1=115200,8n1,pci,msi,04:00.0"
> > +  tags:
> > +    - qubes-hw1
> > +
> >  .zen3p-x86-64:
> >    # it's really similar to the above
> >    extends: .adl-x86-64
> > @@ -301,6 +311,22 @@ adl-tools-tests-pvh-x86-64-gcc-debug:
> >      - *x86-64-test-needs
> >      - alpine-3.18-gcc-debug
> >  
> > +zen2-smoke-x86-64-gcc-debug:
> > +  extends: .zen2-x86-64
> > +  script:
> > +    - ./automation/scripts/qubes-x86-64.sh 2>&1 | tee ${LOGFILE}
> > +  needs:
> > +    - *x86-64-test-needs
> > +    - alpine-3.18-gcc-debug
> > +
> > +zen2-suspend-x86-64-gcc-debug:
> > +  extends: .zen2-x86-64
> > +  script:
> > +    - ./automation/scripts/qubes-x86-64.sh s3 2>&1 | tee ${LOGFILE}
> > +  needs:
> > +    - *x86-64-test-needs
> > +    - alpine-3.18-gcc-debug
> > +
> >  zen3p-smoke-x86-64-gcc-debug:
> >    extends: .zen3p-x86-64
> >    script:
> > -- 
> > 2.48.1
> > 


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] ci: add yet another HW runner
  2025-03-31 13:16   ` [PATCH] ci: add yet another HW runner Marek Marczykowski-Górecki
@ 2025-03-31 13:45     ` Andrew Cooper
  2025-03-31 13:47       ` Marek Marczykowski-Górecki
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Cooper @ 2025-03-31 13:45 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, Stefano Stabellini
  Cc: xen-devel, Jan Beulich, Doug Goldstein, Roger Pau Monné

On 31/03/2025 2:16 pm, Marek Marczykowski-Górecki wrote:
> On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
>> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
>>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
>>>
>>> This one has working S3, so add a test for it here.
>>>
>>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>>> ---
>>> Cc: Jan Beulich <jbeulich@suse.com>
>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>>
>>> The suspend test added here currently fails on staging[1], but passes on
>>> staging-4.19[2]. So the regression wants fixing before committing this
>>> patch.
>> We could commit the patch now without the s3 test.
>>
>> I don't know what the x86 maintainers think about fixing the suspend
>> bug, but one idea would be to run a bisection between 4.20 and 4.19.
> This passes on staging now:
> https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1744682789
>
> So, the patch is good to commit as is.

adl and zen3p both have SUT_ADDR.  zen2 inherits test-2.testnet from adl.

Presumably it wants to be test-1.testnet to match it's qubes-hw tag ?

~Andrew


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] ci: add yet another HW runner
  2025-03-31 13:45     ` Andrew Cooper
@ 2025-03-31 13:47       ` Marek Marczykowski-Górecki
  0 siblings, 0 replies; 17+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-03-31 13:47 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Stefano Stabellini, xen-devel, Jan Beulich, Doug Goldstein,
	Roger Pau Monné

[-- Attachment #1: Type: text/plain, Size: 1513 bytes --]

On Mon, Mar 31, 2025 at 02:45:05PM +0100, Andrew Cooper wrote:
> On 31/03/2025 2:16 pm, Marek Marczykowski-Górecki wrote:
> > On Fri, Mar 14, 2025 at 02:19:19PM -0700, Stefano Stabellini wrote:
> >> On Fri, 14 Mar 2025, Marek Marczykowski-Górecki wrote:
> >>> This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
> >>>
> >>> This one has working S3, so add a test for it here.
> >>>
> >>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> >>> ---
> >>> Cc: Jan Beulich <jbeulich@suse.com>
> >>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> >>>
> >>> The suspend test added here currently fails on staging[1], but passes on
> >>> staging-4.19[2]. So the regression wants fixing before committing this
> >>> patch.
> >> We could commit the patch now without the s3 test.
> >>
> >> I don't know what the x86 maintainers think about fixing the suspend
> >> bug, but one idea would be to run a bisection between 4.20 and 4.19.
> > This passes on staging now:
> > https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1744682789
> >
> > So, the patch is good to commit as is.
> 
> adl and zen3p both have SUT_ADDR.  zen2 inherits test-2.testnet from adl.
> 
> Presumably it wants to be test-1.testnet to match it's qubes-hw tag ?

Indeed, yes. Right now it's used only for "tools-tests", not added for
this runner, but it wants to be consistent anyway.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-03-31 13:47 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-14  3:06 [PATCH] ci: add yet another HW runner Marek Marczykowski-Górecki
2025-03-14 21:19 ` Stefano Stabellini
2025-03-14 22:23   ` S3 regression on AMD in 4.20 (was: Re: [PATCH] ci: add yet another HW runner) Marek Marczykowski-Górecki
2025-03-14 23:53     ` Marek Marczykowski-Górecki
2025-03-15  0:02       ` S3 regression on AMD in 4.20 Andrew Cooper
2025-03-17 15:56         ` Roger Pau Monné
2025-03-17 16:11           ` Jan Beulich
2025-03-17 17:38             ` Roger Pau Monné
2025-03-18  6:24               ` Jan Beulich
2025-03-18  8:25                 ` Roger Pau Monné
2025-03-17 16:35           ` Marek Marczykowski-Górecki
2025-03-17 17:35             ` Roger Pau Monné
2025-03-17 17:54               ` Marek Marczykowski-Górecki
2025-03-18  8:31                 ` Roger Pau Monné
2025-03-31 13:16   ` [PATCH] ci: add yet another HW runner Marek Marczykowski-Górecki
2025-03-31 13:45     ` Andrew Cooper
2025-03-31 13:47       ` Marek Marczykowski-Górecki

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.