public inbox for linux-kernel@vger.kernel.org
* [PATCH] PCI: Avoid FLR for NVIDIA 5090 GPU
@ 2026-04-16  7:07 yuan.gao
  2026-04-16 17:31 ` Bjorn Helgaas
  0 siblings, 1 reply; 4+ messages in thread
From: yuan.gao @ 2026-04-16  7:07 UTC
  To: Bjorn Helgaas, linux-pci, linux-kernel; +Cc: yuan.gao

When an NVIDIA 5090 GPU is passed through to a VM, an FLR timeout is
sometimes hit during VM shutdown, which subsequently leads to a soft
lockup on the host CPU.

The issue is also described in this post:
https://forum.level1techs.com/t/do-your-rtx-5090-or-general-rtx-50-series-has-reset-bug-in-vm-passthrough/228549

dmesg shows:

 [401106.011979] vfio-pci 0000:d8:00.0: not ready 1023ms after FLR; waiting
 [401108.700074] vfio-pci 0000:d8:00.0: not ready 2047ms after FLR; waiting
 [401112.412204] vfio-pci 0000:d8:00.0: not ready 4095ms after FLR; waiting
 [401118.620399] vfio-pci 0000:d8:00.0: not ready 8191ms after FLR; waiting
 [401128.860788] vfio-pci 0000:d8:00.0: not ready 16383ms after FLR; waiting
 [401147.293518] vfio-pci 0000:d8:00.0: not ready 32767ms after FLR; waiting
 [401185.694859] vfio-pci 0000:d8:00.0: not ready 65535ms after FLR; giving up
 [401195.372583] vfio-pci 0000:38:00.2: Relaying device request to user (#0)

 [401208.274941] watchdog: BUG: soft lockup - CPU#11 stuck for 21s! [CPU 22/KVM:30337]

 [401209.887848] CPU: 11 PID: 30337 Comm: CPU 22/KVM Kdump: loaded Not tainted
 [401209.887854] RIP: 0010:pci_mmcfg_read+0xaa/0xd0

 [401209.887866] Call Trace:
 [401209.887872]  pci_bus_read_config_dword+0x43/0x70
 [401209.887876]  pci_find_next_ext_capability.part.20+0x65/0xc0
 [401209.887879]  pci_restore_state.part.39+0x6d/0x3f0
 [401209.887883]  vfio_pci_disable+0x22b/0x4d0 [vfio_pci]
 [401209.887886]  ? __dentry_kill+0x118/0x160
 [401209.887888]  vfio_pci_release+0x5a/0xb0 [vfio_pci]
 [401209.887891]  vfio_device_fops_release+0x18/0x30 [vfio]
 [401209.887894]  __fput+0x98/0x240
 [401209.887897]  task_work_run+0x6a/0xa0
 [401209.887899]  do_exit+0x375/0xb10
 [401209.887900]  do_group_exit+0x3a/0xa0
 [401209.887902]  get_signal+0x140/0x7d0
 [401209.887906]  arch_do_signal+0x2c/0x260
 [401209.887909]  exit_to_user_mode_prepare+0xc0/0x120
 [401209.887912]  syscall_exit_to_user_mode+0x27/0x180
 [401209.887915]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

FLR appears to be unreliable on the NVIDIA 5090 GPU, so add
quirk_no_flr entries for device IDs 0x2b85, 0x2b87 and 0x2b8c.

With this patch applied, the host kernel no longer exhibits these
problems, and the VM starts up and works as expected with the
passed-through NVIDIA 5090 GPU.

Signed-off-by: yuan.gao <yuan.gao@ucloud.cn>
---
 drivers/pci/quirks.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 48946cca4be72..71f833f3e2d84 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5618,6 +5618,9 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x7901, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_MEDIATEK, 0x0616, quirk_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_NVIDIA, 0x2b85, quirk_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_NVIDIA, 0x2b87, quirk_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_NVIDIA, 0x2b8c, quirk_no_flr);
 
 /* FLR may cause the SolidRun SNET DPU (rev 0x1) to hang */
 static void quirk_no_flr_snet(struct pci_dev *dev)
-- 
2.32.0
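
For readers unfamiliar with how a quirk_no_flr entry takes effect, the
following is a minimal userspace sketch of the idea, not the kernel's
implementation. It assumes that the fixup marks matching devices with a
"no FLR" flag and that the reset path refuses FLR when that flag is set.
All model_* and MODEL_* names are hypothetical stand-ins invented for
illustration; only the vendor/device IDs come from the patch above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; these are not kernel symbols. */
#define MODEL_VENDOR_ID_NVIDIA       0x10de
#define MODEL_DEV_FLAGS_NO_FLR_RESET (1u << 0)

struct model_pci_dev {
	uint16_t vendor;
	uint16_t device;
	unsigned int dev_flags;  /* quirk flags set by fixups */
	bool has_flr_cap;        /* device advertises FLR capability */
};

struct model_fixup {
	uint16_t vendor;
	uint16_t device;
	void (*hook)(struct model_pci_dev *dev);
};

/* Assumed behavior of the quirk: only record that FLR must not be used. */
static void model_quirk_no_flr(struct model_pci_dev *dev)
{
	dev->dev_flags |= MODEL_DEV_FLAGS_NO_FLR_RESET;
}

/* Mirrors the three entries added by the patch. */
static const struct model_fixup model_fixups[] = {
	{ MODEL_VENDOR_ID_NVIDIA, 0x2b85, model_quirk_no_flr },
	{ MODEL_VENDOR_ID_NVIDIA, 0x2b87, model_quirk_no_flr },
	{ MODEL_VENDOR_ID_NVIDIA, 0x2b8c, model_quirk_no_flr },
};

/* Run every fixup whose vendor/device pair matches the device. */
static void model_apply_fixups(struct model_pci_dev *dev)
{
	assert(dev != NULL);
	for (size_t i = 0; i < sizeof(model_fixups) / sizeof(model_fixups[0]); i++) {
		const struct model_fixup *f = &model_fixups[i];

		if (f->vendor == dev->vendor && f->device == dev->device)
			f->hook(dev);
	}
}

/* FLR is only attempted if the device supports it and no quirk forbids it. */
static bool model_flr_allowed(const struct model_pci_dev *dev)
{
	return dev->has_flr_cap &&
	       !(dev->dev_flags & MODEL_DEV_FLAGS_NO_FLR_RESET);
}
```

In this model, a device with ID 0x2b85 that advertises FLR would still
report model_flr_allowed() as false once model_apply_fixups() has run,
so a caller would have to fall back to another reset method.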



end of thread, other threads:[~2026-04-17  7:49 UTC | newest]

Thread overview: 4+ messages
2026-04-16  7:07 [PATCH] PCI: Avoid FLR for NVIDIA 5090 GPU yuan.gao
2026-04-16 17:31 ` Bjorn Helgaas
2026-04-16 22:37   ` Jason Gunthorpe
2026-04-17  2:06     ` yuan.gao
