public inbox for linux-pci@vger.kernel.org
From: Alex Williamson <alex.williamson@redhat.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: linux-pci@vger.kernel.org, abhsahu@nvidia.com,
	targupta@nvidia.com, zhguo@redhat.com,
	Sajid Dalvi <sdalvi@google.com>
Subject: Re: [RFC PATCH] PCI: Extend D3hot delay for NVIDIA HDA controllers
Date: Thu, 6 Apr 2023 16:01:10 -0600
Message-ID: <20230406160110.121cdc14.alex.williamson@redhat.com>
In-Reply-To: <20230406215049.GA3741554@bhelgaas>

On Thu, 6 Apr 2023 16:50:49 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> [+cc Sajid, author of 3e347969a577]
> 
> On Tue, Mar 28, 2023 at 04:59:30PM -0600, Alex Williamson wrote:
> > Assignment of NVIDIA Ampere-based GPUs has seen a regression since the
> > below referenced commit, where the reduced D3hot transition delay appears
> > to introduce a small window where a D3hot->D0 transition followed by a bus
> > reset can wedge the device.  The entire device is subsequently unavailable,
> > returning -1 on config space read and is unrecoverable without a host reset.
> > 
> > This has been observed with RTX A2000 and A5000 GPU and audio functions
> > assigned to a Windows VM, where shutdown of the VM places the devices in
> > D3hot prior to vfio-pci performing a bus reset when userspace releases the
> > devices.  The issue has roughly a 2-3% chance of occurring per shutdown.
> > 
> > Restoring the HDA controller d3hot_delay to the effective value before the
> > below commit has been shown to resolve the issue.  
> 
> Interesting.  This sounds like it was a hassle to track down.  I guess
> we knew there was some risk in reducing those delays.
> 
> Did you by chance notice whether the actual delay when the device gets
> wedged is sufficient per spec?
> 
> If there's a case where the usleep_range() doesn't quite wait the
> spec-mandated time, we should adjust that in case we have the same
> problem with other devices.

That would have been a good test.  Unfortunately I didn't check, and I
no longer have access to the system.  Perhaps this is something the
NVIDIA folks can check as they investigate the scope of affected
hardware.  Thanks,

Alex

> > I'm looking for input from NVIDIA whether this issue is unique to
> > Ampere-based HDA controllers or should be assumed to linger in both older
> > and newer controllers as well.  Currently we've not been able to reproduce
> > the issue other than on Ampere HDA controllers, however the implementation
> > here includes all NVIDIA HDA controllers based on PCI vendor and device
> > class.
> > 
> > If we were to limit the quirk to Ampere HDA controllers, I think that would
> > include:
> > 
> > 1aef	GA102 High Definition Audio Controller
> > 228b	GA104 High Definition Audio Controller
> > 228e	GA106 High Definition Audio Controller
> > 
> > Cc: Abhishek Sahu <abhsahu@nvidia.com>
> > Cc: Tarun Gupta <targupta@nvidia.com>
> > Fixes: 3e347969a577 ("PCI/PM: Reduce D3hot delay with usleep_range()")
> > Reported-by: Zhiyi Guo <zhguo@redhat.com>
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  drivers/pci/quirks.c |   13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 44cab813bf95..f4e2a88729fd 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -1939,6 +1939,19 @@ static void quirk_radeon_pm(struct pci_dev *dev)
> >  }
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6741, quirk_radeon_pm);
> >  
> > +/*
> > + * NVIDIA Ampere-based HDA controllers can wedge the whole device if a bus
> > + * reset is performed too soon after transition to D0, extend d3hot_delay
> > + * to previous effective default for all NVIDIA HDA controllers.
> > + */
> > +static void quirk_nvidia_hda_pm(struct pci_dev *dev)
> > +{
> > +	quirk_d3hot_delay(dev, 20);
> > +}
> > +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
> > +			      PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8,
> > +			      quirk_nvidia_hda_pm);
> > +
> >  /*
> >   * Ryzen5/7 XHCI controllers fail upon resume from runtime suspend or s2idle.
> >   * https://bugzilla.kernel.org/show_bug.cgi?id=205587
> > 
> >   
> 
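For anyone who can still reproduce this, the check Bjorn asks about comes down to timing the sleep and comparing it against the 10 ms minimum for the D3hot->D0 transition.  Below is a minimal user-space sketch of that measurement, not kernel code: the helper names (now_us, timed_sleep_us, d3hot_delay_sufficient) and the D3HOT_MIN_US constant are made up for illustration, and nanosleep() stands in for usleep_range().  In the kernel, the equivalent would presumably be a pair of ktime_get() readings around the sleep in pci_dev_d3_sleep().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Minimum D3hot->D0 settle time, in microseconds (10 ms). */
#define D3HOT_MIN_US 10000u

/* Hypothetical helper: monotonic clock reading in microseconds. */
static uint64_t now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/*
 * Hypothetical helper: sleep for at least 'us' microseconds and return
 * the elapsed time actually observed, in microseconds.
 */
static uint64_t timed_sleep_us(unsigned int us)
{
	struct timespec req = {
		.tv_sec = us / 1000000u,
		.tv_nsec = (long)(us % 1000000u) * 1000L,
	};
	uint64_t start = now_us();

	/* nanosleep() writes the remaining time back on EINTR; resume it. */
	while (nanosleep(&req, &req) == -1 && errno == EINTR)
		;
	return now_us() - start;
}

/* Returns nonzero if the observed delay met the 10 ms minimum. */
static int d3hot_delay_sufficient(uint64_t elapsed_us)
{
	return elapsed_us >= D3HOT_MIN_US;
}
```

Logging the value returned by timed_sleep_us() on each transition, and flagging any sample where d3hot_delay_sufficient() is false, would show whether a wedge ever lines up with a sleep that came in under 10 ms.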


Thread overview: 6+ messages
2023-03-28 22:59 [RFC PATCH] PCI: Extend D3hot delay for NVIDIA HDA controllers Alex Williamson
2023-04-06 21:50 ` Bjorn Helgaas
2023-04-06 22:01   ` Alex Williamson [this message]
     [not found]     ` <29f51464-55f1-8ff5-db75-df93693e8d4f@nvidia.com>
2023-04-12 20:02       ` Alex Williamson
2023-04-13 19:40 ` [PATCH] " Alex Williamson
2023-04-17 21:14   ` Bjorn Helgaas
