Re: Lost MSIs during hibernate

Linux Power Management development

From: Thomas Gleixner <tglx@linutronix.de>
To: Evan Green <evgreen@google.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Rajat Jain <rajatja@chromium.org>,
	Linux PM <linux-pm@vger.kernel.org>,
	linux-pci <linux-pci@vger.kernel.org>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Mathias Nyman <mathias.nyman@intel.com>,
	"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Subject: Re: Lost MSIs during hibernate
Date: Tue, 05 Apr 2022 16:06:50 +0200
Message-ID: <87a6cz39qd.ffs@tglx>
In-Reply-To: <CAE=gft4a-QL82iFJE_xRQ3JrMmz-KZKWREtz=MghhjFbJeK=8A@mail.gmail.com>

Evan!

On Mon, Apr 04 2022 at 12:11, Evan Green wrote:
> To my surprise, I'm back with another MSI problem, and hoping to get
> some advice on how to approach fixing it.

Why am I not surprised?

> What worries me is those IRQ "no longer affine" messages, as well as
> my "EVAN don't touch hw" prints, indicating that requests to change
> the MSI are being dropped. These ignored requests are coming in when
> we try to migrate all IRQs off of the non-boot CPU, and they get
> ignored because all devices are "frozen" at this point, and presumably
> not in D0.

They are disabled at that point.
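To make the "dropped" requests concrete, here is a toy model of one MSI vector, assuming the behaviour the debug prints suggest: a write to an inaccessible device only updates a software copy, and that copy is written to the hardware when the device becomes accessible again. The class, its method names and the single `accessible` flag are hypothetical simplifications, not kernel code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# An MSI message reduced to (address_lo, data); address_hi is omitted.
Msg = Tuple[int, int]


@dataclass
class ModelMsiDevice:
    """Toy model of one MSI vector (all names here are hypothetical)."""
    accessible: bool = True          # assumption: one flag stands in for the power state check
    cached: Optional[Msg] = None     # software copy, updated on every request
    hw: Optional[Msg] = None         # what the device would actually signal
    log: List[str] = field(default_factory=list)

    def write_msi(self, msg: Msg) -> None:
        if self.accessible:
            self.hw = msg
            self.log.append(f"Write MSI {msg[0]:x} {msg[1]:x}")
        else:
            # Hardware is left alone; only the cached copy changes.
            self.log.append(f"Don't touch hw {msg[0]:x} {msg[1]:x}")
        self.cached = msg

    def freeze_noirq(self) -> None:
        self.accessible = False

    def thaw_noirq(self) -> None:
        self.accessible = True
        if self.cached is not None:
            # Early restore: replay the last requested message.
            self.write_msi(self.cached)


dev = ModelMsiDevice()
dev.write_msi((0xFEE1E000, 0x4023))   # set up at driver load
dev.freeze_noirq()
dev.write_msi((0xFEE1E000, 0x4045))   # retarget requested while disabled
dev.write_msi((0xFEE00000, 0x4045))   # second request, also deferred
stale = dev.hw                        # hardware still holds the old message
dev.thaw_noirq()
```

In this model the hardware keeps the old message between `freeze_noirq()` and `thaw_noirq()`. That is harmless only as long as the device raises no interrupt in that window, which is what the freeze step is supposed to guarantee.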

> To further try and prove that theory, I wrote a script to do the
> hibernate prepare image step in a loop, but messed with XHCI's IRQ
> affinity beforehand. If I move the IRQ to core 0, so far I have never
> seen a hang. But if I move it to another core, I can usually get a
> hang in the first attempt. I also very occasionally see wifi splats
> when trying this, and those "no longer affine" prints are all the wifi
> queue IRQs. So I think a wifi packet coming in at the wrong time can
> do the same thing.
>
> I wanted to see what thoughts you might have on this. Should I try to
> make a patch that moves all IRQs to CPU 0 *before* the devices all
> freeze? Sounds a little unpleasant. Or should PCI be doing something
> different to avoid this combination of "you're not allowed to modify
> my MSIs, but I might still generate interrupts that must not be lost"?

PCI cannot do much here and moving interrupts around is papering over
the underlying problem.

xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee1e000 4023

  This sets up the interrupt when the driver is loaded

xhci_hcd 0000:00:14.0: EVAN Write MSI 0 fee01000 4024

  Ditto

xhci_hcd 0000:00:0d.0: calling pci_pm_freeze+0x0/0xad @ 423, parent: pci0000:00
xhci_hcd 0000:00:14.0: calling pci_pm_freeze+0x0/0xad @ 4644, parent: pci0000:00
xhci_hcd 0000:00:14.0: pci_pm_freeze+0x0/0xad returned 0 after 0 usecs
xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee1e000 4023
xhci_hcd 0000:00:0d.0: pci_pm_freeze+0x0/0xad returned 0 after 196000 usecs

Those freeze() calls end up in xhci_suspend(), which tears down the XHCI
and ensures that no interrupts are in flight.

xhci_hcd 0000:00:0d.0: calling pci_pm_freeze_noirq+0x0/0xb2 @ 4645, parent: pci0000:00
xhci_hcd 0000:00:0d.0: pci_pm_freeze_noirq+0x0/0xb2 returned 0 after 30 usecs
xhci_hcd 0000:00:14.0: calling pci_pm_freeze_noirq+0x0/0xb2 @ 4644, parent: pci0000:00
xhci_hcd 0000:00:14.0: pci_pm_freeze_noirq+0x0/0xb2 returned 0 after 3118 usecs

   Now the devices are disabled and not accessible

xhci_hcd 0000:00:14.0: EVAN Don't touch hw 0 fee00000 4024
xhci_hcd 0000:00:0d.0: EVAN Don't touch hw 0 fee1e000 4045
xhci_hcd 0000:00:0d.0: EVAN Don't touch hw 0 fee00000 4045
xhci_hcd 0000:00:14.0: calling pci_pm_thaw_noirq+0x0/0x70 @ 9, parent: pci0000:00
xhci_hcd 0000:00:14.0: EVAN Write MSI 0 fee00000 4024

   This is the early restore _before_ the XHCI resume code is called.
   This interrupt is targeted at CPU0 (it's the one which could not be
   written above).

xhci_hcd 0000:00:14.0: pci_pm_thaw_noirq+0x0/0x70 returned 0 after 5272 usecs
xhci_hcd 0000:00:0d.0: calling pci_pm_thaw_noirq+0x0/0x70 @ 1123, parent: pci0000:00
xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee00000 4045

   Ditto

xhci_hcd 0000:00:0d.0: pci_pm_thaw_noirq+0x0/0x70 returned 0 after 623 usecs
xhci_hcd 0000:00:14.0: calling pci_pm_thaw+0x0/0x7c @ 3856, parent: pci0000:00
xhci_hcd 0000:00:14.0: pci_pm_thaw+0x0/0x7c returned 0 after 0 usecs
xhci_hcd 0000:00:0d.0: calling pci_pm_thaw+0x0/0x7c @ 4664, parent: pci0000:00
xhci_hcd 0000:00:0d.0: pci_pm_thaw+0x0/0x7c returned 0 after 0 usecs

That means the suspend/resume logic is doing the right thing.
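For anyone following the log, the target of each message can be read off the address and data values. The sketch below assumes the three trailing hex fields of the debug prints are address_hi, address_lo and data, and that the messages use the plain (non-remapped) x86 MSI format, where address bits 19:12 carry the destination APIC ID and data bits 7:0 the vector. The function names are made up for this example, and how an APIC ID maps to a Linux CPU number depends on the platform.

```python
def decode_x86_msi(address_lo: int, data: int) -> dict:
    """Decode the basic fields of a compatibility-format x86 MSI message."""
    if (address_lo >> 20) != 0xFEE:
        raise ValueError("not an x86 MSI address")
    return {
        "dest_apic_id": (address_lo >> 12) & 0xFF,  # address bits 19:12
        "vector": data & 0xFF,                      # data bits 7:0
        "delivery_mode": (data >> 8) & 0x7,         # data bits 10:8, 0 = fixed
    }


def decode_log_line(line: str) -> dict:
    """Decode a debug line ending in '<address_hi> <address_lo> <data>' (assumed layout)."""
    _hi, lo, data = (int(x, 16) for x in line.split()[-3:])
    return decode_x86_msi(lo, data)


at_load = decode_log_line("xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee1e000 4023")
at_thaw = decode_log_line("xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee00000 4045")
```

Under those assumptions, `at_load` has destination APIC ID 0x1e and vector 0x23, while `at_thaw` has destination APIC ID 0 and vector 0x45. The restored message is the one requested while the device was disabled.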

How the XHCI ends up being confused here is a mystery. Cc'ed a few more folks.

Thanks,

        tglx
