From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0F0FD1549A for ; Mon, 8 May 2023 11:20:30 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 83829C433EF; Mon, 8 May 2023 11:20:29 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linuxfoundation.org; s=korg; t=1683544829; bh=kPlUUA4DjkPa9wuT4S37MyKf4L7fho3IXW1JRMFiZGY=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=M+bTT26Ck8yPS6oPE4slY5kgey36T09scYxSF/KLaxVX3zIlyf8+ni4V0Oe2zbDvq FfopBURQqXbdkAzIti94K5rKm+s9WpiOcm4QRzk8MSIOxX+IWMx5pQsAJAEd8IEwnn HzEc2F3oBDJkqXGfIJDAtC+D3CR2SdopWb3GiS5Y= From: Greg Kroah-Hartman To: stable@vger.kernel.org Cc: Greg Kroah-Hartman , patches@lists.linux.dev, Zhiyi Guo , Alex Williamson , Bjorn Helgaas , Tarun Gupta , Abhishek Sahu , Sasha Levin Subject: [PATCH 6.3 536/694] PCI/PM: Extend D3hot delay for NVIDIA HDA controllers Date: Mon, 8 May 2023 11:46:11 +0200 Message-Id: <20230508094451.796392376@linuxfoundation.org> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230508094432.603705160@linuxfoundation.org> References: <20230508094432.603705160@linuxfoundation.org> User-Agent: quilt/0.67 Precedence: bulk X-Mailing-List: patches@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit From: Alex Williamson [ Upstream commit a5a6dd2624698b6e3045c3a1450874d8c790d5d9 ] Assignment of NVIDIA Ampere-based GPUs have seen a regression since the below referenced commit, where the reduced D3hot transition delay appears to introduce a small window where a D3hot->D0 transition followed by a bus reset can wedge the device. The entire device is subsequently unavailable, returning -1 on config space read and is unrecoverable without a host reset. This has been observed with RTX A2000 and A5000 GPU and audio functions assigned to a Windows VM, where shutdown of the VM places the devices in D3hot prior to vfio-pci performing a bus reset when userspace releases the devices. The issue has roughly a 2-3% chance of occurring per shutdown. Restoring the HDA controller d3hot_delay to the effective value before the below commit has been shown to resolve the issue. NVIDIA confirms this change should be safe for all of their HDA controllers. Fixes: 3e347969a577 ("PCI/PM: Reduce D3hot delay with usleep_range()") Link: https://lore.kernel.org/r/20230413194042.605768-1-alex.williamson@redhat.com Reported-by: Zhiyi Guo Signed-off-by: Alex Williamson Signed-off-by: Bjorn Helgaas Reviewed-by: Tarun Gupta Cc: Abhishek Sahu Cc: Tarun Gupta Signed-off-by: Sasha Levin --- drivers/pci/quirks.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 44cab813bf951..f4e2a88729fd1 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -1939,6 +1939,19 @@ static void quirk_radeon_pm(struct pci_dev *dev) } DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6741, quirk_radeon_pm); +/* + * NVIDIA Ampere-based HDA controllers can wedge the whole device if a bus + * reset is performed too soon after transition to D0, extend d3hot_delay + * to previous effective default for all NVIDIA HDA controllers. + */ +static void quirk_nvidia_hda_pm(struct pci_dev *dev) +{ + quirk_d3hot_delay(dev, 20); +} +DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, + PCI_CLASS_MULTIMEDIA_HD_AUDIO, 8, + quirk_nvidia_hda_pm); + /* * Ryzen5/7 XHCI controllers fail upon resume from runtime suspend or s2idle. * https://bugzilla.kernel.org/show_bug.cgi?id=205587 -- 2.39.2