From: Bjorn Helgaas <helgaas@kernel.org>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: linux-pci@vger.kernel.org, abhsahu@nvidia.com,
	targupta@nvidia.com, zhguo@redhat.com, sdalvi@google.com,
	Mika Westerberg <mika.westerberg@linux.intel.com>,
	Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>,
	Lukas Wunner <lukas@wunner.de>
Subject: Re: [PATCH] PCI: Extend D3hot delay for NVIDIA HDA controllers
Date: Mon, 17 Apr 2023 16:14:14 -0500
Message-ID: <20230417211414.GA48587@bhelgaas>
In-Reply-To: <20230413194042.605768-1-alex.williamson@redhat.com>

[+cc Mika, Sathy, Lukas since they've been looking at similar delays]

On Thu, Apr 13, 2023 at 01:40:42PM -0600, Alex Williamson wrote:
> Assignment of NVIDIA Ampere-based GPUs has seen a regression since the
> below referenced commit, where the reduced D3hot transition delay appears
> to introduce a small window where a D3hot->D0 transition followed by a bus
> reset can wedge the device. The entire device is subsequently unavailable,
> returning -1 on config space read and is unrecoverable without a host reset.
>
> This has been observed with RTX A2000 and A5000 GPU and audio functions
> assigned to a Windows VM, where shutdown of the VM places the devices in
> D3hot prior to vfio-pci performing a bus reset when userspace releases the
> devices. The issue has roughly a 2-3% chance of occurring per shutdown.
>
> Restoring the HDA controller d3hot_delay to the effective value before the
> below commit has been shown to resolve the issue. NVIDIA confirms this
> change should be safe for all of their HDA controllers.
>
> Cc: Abhishek Sahu <abhsahu@nvidia.com>
> Cc: Tarun Gupta <targupta@nvidia.com>
> Fixes: 3e347969a577 ("PCI/PM: Reduce D3hot delay with usleep_range()")
> Reported-by: Zhiyi Guo <zhguo@redhat.com>
> Reviewed-by: Tarun Gupta <targupta@nvidia.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Applied to pci/reset for v6.4, thanks, Alex!

I guess there's no real risk here since we're waiting *longer*. It
only makes NVIDIA GPU resets take longer.

Mika has some patches in flight that increase delays generically in
some cases, but I think that applies to D3cold -> D0 transitions,
which I don't *think* you're doing here.

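The "waiting longer" point above can be illustrated with a small userspace model. This is only a sketch under assumptions: `struct model_pci_dev` and `model_quirk_d3hot_delay()` are hypothetical stand-ins, not kernel code, and the raise-only behavior is what quirk_d3hot_delay() in drivers/pci/quirks.c is assumed to do. The 10 msec starting value in the usage note is likewise an assumed default.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the kernel's struct pci_dev. */
struct model_pci_dev {
	unsigned int d3hot_delay;	/* msec to wait after D3hot -> D0 */
};

/*
 * Models the behavior assumed of quirk_d3hot_delay(): the delay is only
 * ever raised, so a device that already waits longer is left untouched.
 */
static void model_quirk_d3hot_delay(struct model_pci_dev *dev,
				    unsigned int delay)
{
	assert(dev != NULL);

	if (dev->d3hot_delay >= delay)
		return;

	dev->d3hot_delay = delay;
	printf("extending delay after power-on from D3hot to %u msec\n",
	       dev->d3hot_delay);
}
```

Calling `model_quirk_d3hot_delay(&dev, 20)` on a device whose delay is 10 raises it to 20, while a device already at a larger value keeps that value. Under this model the quirk can only lengthen the wait, never shorten it.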
> ---
>
> Unfortunately Tarun's reply with confirmation doesn't show up on lore,
> possibly due to html email, or else I'd provide that as a Link:.
>
> drivers/pci/quirks.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 44cab813bf95..f4e2a88729fd 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -1939,6 +1939,19 @@ static void quirk_radeon_pm(struct pci_dev *dev)
> }
> DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6741, quirk_radeon_pm);
>
> +/*
> + * NVIDIA Ampere-based HDA controllers can wedge the whole device if a bus
> + * reset is performed too soon after transition to D0, extend d3hot_delay
> + * to previous effective default for all NVIDIA HDA controllers.
> + */
> +static void quirk_nvidia_hda_pm(struct pci_dev *dev)
> +{
> +	quirk_d3hot_delay(dev, 20);
> +}
> +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> +			      PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8,
> +			      quirk_nvidia_hda_pm);
> +
> /*
> * Ryzen5/7 XHCI controllers fail upon resume from runtime suspend or s2idle.
> * https://bugzilla.kernel.org/show_bug.cgi?id=205587
> --
> 2.39.2
>
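
For readers unfamiliar with class-based fixups, here is a rough userspace sketch of how the DECLARE_PCI_FIXUP_CLASS_FINAL() entry in the patch is assumed to select devices: any NVIDIA function whose class code, shifted right by 8 bits to drop the programming interface byte, equals the HD audio class. All `model_*` and `MODEL_*` names are hypothetical, and the matching logic is a simplification, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ANY_ID		0xffffffffu	/* wildcard, like PCI_ANY_ID */
#define MODEL_VENDOR_NVIDIA	0x10de		/* PCI_VENDOR_ID_NVIDIA */
#define MODEL_CLASS_HD_AUDIO	0x0403		/* PCI_CLASS_MULTIMEDIA_HD_AUDIO */

/* Hypothetical, simplified fixup table entry. */
struct model_fixup {
	uint16_t vendor;
	uint16_t device;
	uint32_t class_code;
	unsigned int class_shift;
};

/* Hypothetical device: class_code is base class, subclass, prog-if. */
struct model_dev {
	uint16_t vendor;
	uint16_t device;
	uint32_t class_code;
};

/* True if the fixup entry applies to the device. */
static bool model_fixup_matches(const struct model_fixup *f,
				const struct model_dev *d)
{
	assert(f != NULL && d != NULL);

	bool class_ok = f->class_code == MODEL_ANY_ID ||
			f->class_code == (d->class_code >> f->class_shift);
	bool vendor_ok = f->vendor == (uint16_t)MODEL_ANY_ID ||
			 f->vendor == d->vendor;
	bool device_ok = f->device == (uint16_t)MODEL_ANY_ID ||
			 f->device == d->device;

	return class_ok && vendor_ok && device_ok;
}
```

With an entry of `{ MODEL_VENDOR_NVIDIA, (uint16_t)MODEL_ANY_ID, MODEL_CLASS_HD_AUDIO, 8 }`, an NVIDIA function with class code 0x040300 matches, while an NVIDIA VGA function (0x030000) or another vendor's audio controller does not. That is why the quirk covers every NVIDIA HDA controller without listing device IDs.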