From: "Limonciello, Mario" <mario.limonciello@amd.com>
To: Bjorn Helgaas <helgaas@kernel.org>, Basavaraj Natikar <bnatikar@amd.com>
Cc: "Natikar, Basavaraj" <Basavaraj.Natikar@amd.com>,
"bhelgaas@google.com" <bhelgaas@google.com>,
"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
"thomas@glanzmann.de" <thomas@glanzmann.de>
Subject: Re: [PATCH] PCI: Add quirk to clear MSI-X
Date: Thu, 9 Mar 2023 12:32:41 -0600 [thread overview]
Message-ID: <3edd370c-e9e2-733c-2d79-51a08dd10e9d@amd.com> (raw)
In-Reply-To: <20230309182514.GA1152206@bhelgaas>
On 3/9/2023 12:25, Bjorn Helgaas wrote:
> On Thu, Mar 09, 2023 at 01:04:17PM +0530, Basavaraj Natikar wrote:
>> On 3/9/2023 4:34 AM, Limonciello, Mario wrote:
>>>> -----Original Message-----
>>>> From: Bjorn Helgaas <helgaas@kernel.org>
>>>> Sent: Wednesday, March 8, 2023 16:44
>>>> To: Natikar, Basavaraj <Basavaraj.Natikar@amd.com>
>>>> Cc: bhelgaas@google.com; linux-pci@vger.kernel.org; Limonciello, Mario
>>>> <Mario.Limonciello@amd.com>; thomas@glanzmann.de
>>>> Subject: Re: [PATCH] PCI: Add quirk to clear MSI-X
>>>>
>>>> Let's mention the vendor and device name in the subject to make the
>>>> log more useful.
>>
>> Sure will change subject as below.
>> Add quirk on AMD 0x15b8 device to clear MSI-X enable bit
>
> "0x15b8" is not really useful in a subject line. Use a name
> meaningful to users, like something "lspci" reports (I don't see
> "1002:15b8" in https://pci-ids.ucw.cz/read/PC/1002; it would be nice
> to add it) or at least something like "USB controller". You can look
> at the history of drivers/pci/quirks.c to see examples.
>
>>>> On Mon, Mar 06, 2023 at 12:53:40PM +0530, Basavaraj Natikar wrote:
>>>>> One of the AMD USB controllers fails to maintain internal functional
>>>>> context when transitioning from D3 to D0, desynchronizing MSI-X bits.
>>>>> As a result, add a quirk to this controller to clear the MSI-X bits
>>>>> on suspend.
>>> ...
>>> FYI - it's not a hardware defect, it's a BIOS defect.
>
> Commit log ("controller fails to maintain") suggested hardware defect
> to me; maybe could be clarified. If it's a defect in the way BIOS
> initialized something, maybe the workaround could be a one-time thing
> instead of an every-resume quirk?
>
>>>> The quick clears the Function Mask bit, so the MSI-X vectors may be
>>>> *unmasked* depending on the state of each vectors Mask bit. I assume
>>>> the potential unmasking is safe because you also clear the MSI-X
>>>> Enable bit, so the function can't use MSI-X at all.
>>
>> Sure, will remove Function Mask bit only clear MSI-X enable bit is enough,
>> actually MSI-X enable bit doesn't change the internal hardware and there
>> will be no interrupts after resume hence below command timeout and eventually
>> error observed more logs below:
>>
>> [ 418.572737] xhci_hcd 0000:03:00.0: xhci_hub_status_data: stopping usb5 port polling
>> *[ 423.724511] xhci_hcd 0000:03:00.0: Command timeout, USBSTS: 0x00000000****[ 423.724517] xhci_hcd 0000:03:00.0: Command timeout*
>> [ 423.724519] xhci_hcd 0000:03:00.0: Abort command ring
>> [ 425.740742] xhci_hcd 0000:03:00.0: No stop event for abort, ring start fail?
>> *[ 425.740771] xhci_hcd 0000:03:00.0: Error while assigning device slot ID****[ 425.740777] xhci_hcd 0000:03:00.0: Max number of devices this xHCI
>> host supports is 64*.
>> [ 425.740782] usb usb5-port1: couldn't allocate usb_device
>> [ 425.740794] xhci_hcd 0000:03:00.0: disable port 5-1, portsc: 0x6e1
>> [ 425.740818] hub 5-0:1.0: hub_suspend
>> [ 425.740826] usb usb5: bus auto-suspend, wakeup 1
>> [ 425.740835] xhci_hcd 0000:03:00.0: xhci_hub_status_data: stopping usb5 port polling
>> [ 425.740842] xhci_hcd 0000:03:00.0: xhci_suspend: stopping usb5 port polling.
>> [ 425.756878] xhci_hcd 0000:03:00.0: // Setting command ring address to 0xffffe001
>> [ 425.776898] xhci_hcd 0000:03:00.0: WARN: xHC save state timeout
>> [ 425.776910] xhci_hcd 0000:03:00.0: PM: suspend_common(): xhci_pci_suspend+0x0/0x170 [xhci_pci] returns -110
>> [ 425.776917] xhci_hcd 0000:03:00.0: hcd_pci_runtime_suspend: -110
>> [ 425.776918] xhci_hcd 0000:03:00.0: can't suspend (hcd_pci_runtime_suspend returned -110)
>>
>> will change function name accordingly quirk_clear_msix_en
>> and with only ctrl &= ~PCI_MSIX_FLAGS_ENABLE;
>>
>>>> All state is lost in D3cold, so I guess this problem must occur during
>>>> a D3hot to D0 transition, right? I assume this device sets
>>>> No_Soft_Reset, so the function is supposed to return to D0active with
>>>> all internal state intact. But this device returns to D0active with
>>>> the MSI-X internal state corrupted?
>>>>
>>>> I assume this relies on pci_restore_state() to restore the MSI-X
>>>> state. Seems like that might be enough to restore the internal state
>>>> even without this quirk, but I guess it must not be.
>>>
>>> The important part is the register value changing to make
>>> the internal hardware move. Because it restores identically it doesn't change
>>> the internal hardware.
>>
>> Yes correct, even though pci_restore_state restores all pci registers states
>> including MSI-X bits __pci_restore_msix_state after resume but internal AMD
>> controller's MSI_X enable bit is out of sync and AMD controller fails to maintain
>> internal MSI-X enable bits.
>
> So the register value *change* is important, and you force a different
> value by writing something different at suspend-time so the value at
> restore-time will be different. That's a little obscure since those
> points are far separated.
>
> Also it changes the behavior (masking MSI-X at suspend-time), which
> complicates the analysis since we have to verify that we don't need
> MSI-X after the quirk runs. And the current quirk relies on the fact
> that PCI_MSIX_FLAGS_ENABLE is set, which again complicates the
> analysis (I guess if MSI-X is *not* enabled, you might not need the
> quirk at all?)
>
> Is there any way you could do the quirk at resume-time, e.g., if MSI-X
> is supposed to be enabled, first disable it and immediately re-enable
> it?
>
>>>>> Note: This quirk works in all scenarios, regardless of whether the
>>>>> integrated GPU is disabled in the BIOS.
>>>>
>>>> I don't know how the integrated GPU is related to this USB controller,
>>>> but I assume this fact is important somehow?
>>>
>>> This bug is due to a BIOS bug with the initialization. We also posted in
>>> parallel a different workaround that fixes the initialization to match what
>>> the BIOS should have set via the GPU driver.
>>>
>>> It should be going in for 6.3-rc2.
>>> https://gitlab.freedesktop.org/agd5f/linux/-/commit/07494a25fc8881e122c242a46b5c53e0e4403139
>
> That nbio_v7.2.c patch and this patch don't look anything alike. It
> looks like the nbio_v7.2.c patch might run once? Could *this* be done
> once at enumeration-time, too?
>
They don't look anything alike because they're attacking the problem
from different angles.
The NBIO patch fixes the initialization value for the internal
registers. This is what the BIOS "should" have done. When the internal
registers are configured properly then the behavior the kernel expects
works as well.
The NBIO patch will run both at amdgpu startup as well as when resuming
from suspend.
This patch we're discussing treats the symptoms of the deficiency and
avoids the impact.
This patch runs any time the controller is runtime resumed. So, yes it
will run more frequently. Because this patch is treating the symptoms
it needs to be applied every single time the controller exits D3.
next prev parent reply other threads:[~2023-03-09 18:32 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-06 7:23 [PATCH] PCI: Add quirk to clear MSI-X Basavaraj Natikar
2023-03-06 8:14 ` Thomas Glanzmann
2023-03-08 22:44 ` Bjorn Helgaas
2023-03-08 23:04 ` Limonciello, Mario
2023-03-09 7:34 ` Basavaraj Natikar
2023-03-09 18:25 ` Bjorn Helgaas
2023-03-09 18:32 ` Limonciello, Mario [this message]
2023-03-09 22:30 ` Bjorn Helgaas
2023-03-10 0:57 ` Mario Limonciello
2023-03-10 7:41 ` Basavaraj Natikar
2023-03-10 22:13 ` Bjorn Helgaas
2023-03-20 1:32 ` Limonciello, Mario
2023-03-20 17:14 ` Bjorn Helgaas
2023-03-20 17:20 ` Limonciello, Mario
2023-03-20 19:36 ` Bjorn Helgaas
2023-03-20 19:47 ` Limonciello, Mario
2023-03-20 21:30 ` Bjorn Helgaas
2023-03-20 21:37 ` Limonciello, Mario
2023-03-20 22:08 ` Bjorn Helgaas
2023-03-20 22:52 ` Mario Limonciello
2023-03-21 11:07 ` Bjorn Helgaas
2023-03-28 13:15 ` Basavaraj Natikar
2023-03-28 13:25 ` Limonciello, Mario
2023-03-28 17:42 ` Bjorn Helgaas
2023-03-10 7:22 ` Basavaraj Natikar
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3edd370c-e9e2-733c-2d79-51a08dd10e9d@amd.com \
--to=mario.limonciello@amd.com \
--cc=Basavaraj.Natikar@amd.com \
--cc=bhelgaas@google.com \
--cc=bnatikar@amd.com \
--cc=helgaas@kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=thomas@glanzmann.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).