From: Alex Williamson <alex.williamson@redhat.com>
To: "Tarun Gupta (SW-GPU)" <targupta@nvidia.com>
Cc: Bjorn Helgaas <helgaas@kernel.org>,
linux-pci@vger.kernel.org, abhsahu@nvidia.com, zhguo@redhat.com,
Sajid Dalvi <sdalvi@google.com>
Subject: Re: [RFC PATCH] PCI: Extend D3hot delay for NVIDIA HDA controllers
Date: Wed, 12 Apr 2023 14:02:55 -0600 [thread overview]
Message-ID: <20230412140255.50c507dd.alex.williamson@redhat.com> (raw)
In-Reply-To: <29f51464-55f1-8ff5-db75-df93693e8d4f@nvidia.com>
On Thu, 13 Apr 2023 00:40:17 +0530
"Tarun Gupta (SW-GPU)" <targupta@nvidia.com> wrote:
> On 4/7/2023 3:31 AM, Alex Williamson wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > On Thu, 6 Apr 2023 16:50:49 -0500
> > Bjorn Helgaas<helgaas@kernel.org> wrote:
> >
> >> [+cc Sajid, author of 3e347969a577]
> >>
> >> On Tue, Mar 28, 2023 at 04:59:30PM -0600, Alex Williamson wrote:
> >>> Assignment of NVIDIA Ampere-based GPUs have seen a regression since the
> >>> below referenced commit, where the reduced D3hot transition delay appears
> >>> to introduce a small window where a D3hot->D0 transition followed by a bus
> >>> reset can wedge the device. The entire device is subsequently unavailable,
> >>> returning -1 on config space read and is unrecoverable without a host reset.
> >>>
> >>> This has been observed with RTX A2000 and A5000 GPU and audio functions
> >>> assigned to a Windows VM, where shutdown of the VM places the devices in
> >>> D3hot prior to vfio-pci performing a bus reset when userspace releases the
> >>> devices. The issue has roughly a 2-3% chance of occurring per shutdown.
> >>>
> >>> Restoring the HDA controller d3hot_delay to the effective value before the
> >>> below commit has been shown to resolve the issue.
> >> Interesting. This sounds like it was a hassle to track down. I guess
> >> we knew there was some risk in reducing those delays.
> >>
> >> Did you by chance notice whether the actual delay when the device gets
> >> wedged is sufficient per spec?
> >>
> >> If there's a case where the usleep_range() doesn't quite wait the
> >> spec-mandated time, we should adjust that in case we have the same
> >> problem with other devices.
> > That would have been a good test, unfortunately I didn't check and
> > don't currently have access to the system anymore. Perhaps this is
> > something the NVIDIA folks can check as they're investigating the scope
> > of affected hardware. Thanks,
> >
> > Alex
> >
> >>> I'm looking for input from NVIDIA whether this issue is unique to
> >>> Ampere-based HDA controllers or should be assumed to linger in both older
> >>> and newer controllers as well. Currently we've not been able to reproduce
> >>> the issue other than on Ampere HDA controllers, however the implementation
> >>> here includes all NVIDIA HDA controllers based on PCI vendor and device
> >>> class.
>
> Hi, I believe it is safe to include this patch for all HDA controllers and not
> limit to only Ampere-based HDA controllers. This patch looks fine to me.
>
> Haven't explicitly tested whether usleep_range() performs as mandated, but
> from our testing it doesn't seem to be the problem. On the same platform,
> the problem reproduces on certain GPU configs but doesn't reproduce on other.
Thanks, Tarun. Feel free to also send an explicit Acked-by: or
Reviewed-by: if you'd like.
Bjorn, I can resend this as a non-RFC if you prefer, the only change
would be deleting "I'm looking for input..." through to the Cc:s.
Thanks,
Alex
> >>>
> >>> If we were to limit the quirk to Ampere HDA controllers, I think that would
> >>> include:
> >>>
> >>> 1aef GA102 High Definition Audio Controller
> >>> 228b GA104 High Definition Audio Controller
> >>> 228e GA106 High Definition Audio Controller
> >>>
> >>> Cc: Abhishek Sahu<abhsahu@nvidia.com>
> >>> Cc: Tarun Gupta<targupta@nvidia.com>
> >>> Fixes: 3e347969a577 ("PCI/PM: Reduce D3hot delay with usleep_range()")
> >>> Reported-by: Zhiyi Guo<zhguo@redhat.com>
> >>> Signed-off-by: Alex Williamson<alex.williamson@redhat.com>
> >>> ---
> >>> drivers/pci/quirks.c | 13 +++++++++++++
> >>> 1 file changed, 13 insertions(+)
> >>>
> >>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> >>> index 44cab813bf95..f4e2a88729fd 100644
> >>> --- a/drivers/pci/quirks.c
> >>> +++ b/drivers/pci/quirks.c
> >>> @@ -1939,6 +1939,19 @@ static void quirk_radeon_pm(struct pci_dev *dev)
> >>> }
> >>> DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6741, quirk_radeon_pm);
> >>>
> >>> +/*
> >>> + * NVIDIA Ampere-based HDA controllers can wedge the whole device if a bus
> >>> + * reset is performed too soon after transition to D0, extend d3hot_delay
> >>> + * to previous effective default for all NVIDIA HDA controllers.
> >>> + */
> >>> +static void quirk_nvidia_hda_pm(struct pci_dev *dev)
> >>> +{
> >>> + quirk_d3hot_delay(dev, 20);
> >>> +}
> >>> +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> >>> + PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8,
> >>> + quirk_nvidia_hda_pm);
> >>> +
> >>> /*
> >>> * Ryzen5/7 XHCI controllers fail upon resume from runtime suspend or s2idle.
> >>> *https://bugzilla.kernel.org/show_bug.cgi?id=205587
> >>>
> >>>
next prev parent reply other threads:[~2023-04-12 20:04 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-28 22:59 [RFC PATCH] PCI: Extend D3hot delay for NVIDIA HDA controllers Alex Williamson
2023-04-06 21:50 ` Bjorn Helgaas
2023-04-06 22:01 ` Alex Williamson
[not found] ` <29f51464-55f1-8ff5-db75-df93693e8d4f@nvidia.com>
2023-04-12 20:02 ` Alex Williamson [this message]
2023-04-13 19:40 ` [PATCH] " Alex Williamson
2023-04-17 21:14 ` Bjorn Helgaas
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230412140255.50c507dd.alex.williamson@redhat.com \
--to=alex.williamson@redhat.com \
--cc=abhsahu@nvidia.com \
--cc=helgaas@kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=sdalvi@google.com \
--cc=targupta@nvidia.com \
--cc=zhguo@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox