[PATCH] PCI: Avoid FLR for NVIDIA 5090 GPU

public inbox for linux-kernel@vger.kernel.org

* [PATCH] PCI: Avoid FLR for NVIDIA 5090 GPU
@ 2026-04-16  7:07 yuan.gao
  2026-04-16 17:31 ` Bjorn Helgaas
  0 siblings, 1 reply; 4+ messages in thread
From: yuan.gao @ 2026-04-16  7:07 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci, linux-kernel; +Cc: yuan.gao

When an NVIDIA 5090 GPU is passed through to a VM, VM shutdown
intermittently hits an FLR timeout, which is followed by a soft lockup
on the host CPU.

The same problem is described in this forum post:
https://forum.level1techs.com/t/do-your-rtx-5090-or-general-rtx-50-series-has-reset-bug-in-vm-passthrough/228549

The host dmesg shows:

 [401106.011979] vfio-pci 0000:d8:00.0: not ready 1023ms after FLR; waiting
 [401108.700074] vfio-pci 0000:d8:00.0: not ready 2047ms after FLR; waiting
 [401112.412204] vfio-pci 0000:d8:00.0: not ready 4095ms after FLR; waiting
 [401118.620399] vfio-pci 0000:d8:00.0: not ready 8191ms after FLR; waiting
 [401128.860788] vfio-pci 0000:d8:00.0: not ready 16383ms after FLR; waiting
 [401147.293518] vfio-pci 0000:d8:00.0: not ready 32767ms after FLR; waiting
 [401185.694859] vfio-pci 0000:d8:00.0: not ready 65535ms after FLR; giving up
 [401195.372583] vfio-pci 0000:38:00.2: Relaying device request to user (#0)

 [401208.274941] watchdog: BUG: soft lockup - CPU#11 stuck for 21s! [CPU 22/KVM:30337]

 [401209.887848] CPU: 11 PID: 30337 Comm: CPU 22/KVM Kdump: loaded Not tainted
 [401209.887854] RIP: 0010:pci_mmcfg_read+0xaa/0xd0

 [401209.887866] Call Trace:
 [401209.887872]  pci_bus_read_config_dword+0x43/0x70
 [401209.887876]  pci_find_next_ext_capability.part.20+0x65/0xc0
 [401209.887879]  pci_restore_state.part.39+0x6d/0x3f0
 [401209.887883]  vfio_pci_disable+0x22b/0x4d0 [vfio_pci]
 [401209.887886]  ? __dentry_kill+0x118/0x160
 [401209.887888]  vfio_pci_release+0x5a/0xb0 [vfio_pci]
 [401209.887891]  vfio_device_fops_release+0x18/0x30 [vfio]
 [401209.887894]  __fput+0x98/0x240
 [401209.887897]  task_work_run+0x6a/0xa0
 [401209.887899]  do_exit+0x375/0xb10
 [401209.887900]  do_group_exit+0x3a/0xa0
 [401209.887902]  get_signal+0x140/0x7d0
 [401209.887906]  arch_do_signal+0x2c/0x260
 [401209.887909]  exit_to_user_mode_prepare+0xc0/0x120
 [401209.887912]  syscall_exit_to_user_mode+0x27/0x180
 [401209.887915]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

FLR appears to be unreliable on the NVIDIA 5090 GPU, so add
quirk_no_flr entries for device IDs 0x2b85, 0x2b87 and 0x2b8c.

With this patch applied, the host kernel no longer exhibits these
problems, and the VM starts up and works as expected with the
passed-through NVIDIA 5090 GPU.

Signed-off-by: yuan.gao <yuan.gao@ucloud.cn>
---
 drivers/pci/quirks.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 48946cca4be72..71f833f3e2d84 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5618,6 +5618,9 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x7901, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_MEDIATEK, 0x0616, quirk_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_NVIDIA, 0x2b85, quirk_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_NVIDIA, 0x2b87, quirk_no_flr);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_NVIDIA, 0x2b8c, quirk_no_flr);
 
 /* FLR may cause the SolidRun SNET DPU (rev 0x1) to hang */
 static void quirk_no_flr_snet(struct pci_dev *dev)
-- 
2.32.0


* Re: [PATCH] PCI: Avoid FLR for NVIDIA 5090 GPU
  2026-04-16  7:07 [PATCH] PCI: Avoid FLR for NVIDIA 5090 GPU yuan.gao
@ 2026-04-16 17:31 ` Bjorn Helgaas
  2026-04-16 22:37   ` Jason Gunthorpe
  0 siblings, 1 reply; 4+ messages in thread
From: Bjorn Helgaas @ 2026-04-16 17:31 UTC (permalink / raw)
  To: yuan.gao
  Cc: Bjorn Helgaas, linux-pci, linux-kernel, Alex Williamson,
	Jason Gunthorpe

[+cc Alex, Jason]

On Thu, Apr 16, 2026 at 03:07:06PM +0800, yuan.gao wrote:
> When passing through the NVIDIA 5090 GPU to a vm, there is a certain
> probability of encountering an flr timeout during vm shutdown, which
> subsequently leads to a soft lock of the host cpu.

If possible, I would like confirmation of a device erratum from NVIDIA.
If there's no known erratum, there might be something wrong in the
Linux FLR and wait code.
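
For reference, the delays in the quoted log (1023ms, 2047ms, ...,
65535ms) are what a doubling backoff would report. Below is a minimal
userspace sketch of such a loop, not the kernel's code. The 1000ms
logging threshold and 60000ms timeout are assumptions chosen to match
the log, and wait_after_flr(), ready_fn, never_ready() and
ready_after_3s() are hypothetical names used only for illustration.

```c
#include <stdio.h>

#define RESET_WAIT_MS   1000   /* assumed: stay quiet until this much delay */
#define READY_POLL_MS  60000   /* assumed: overall timeout */

/* Hypothetical stand-in for "does a config read succeed yet?" */
typedef int (*ready_fn)(int elapsed_ms);

/* Models a device that never comes back after FLR. */
static int never_ready(int elapsed_ms) { (void)elapsed_ms; return 0; }

/* Models a device that becomes ready 3 seconds after FLR. */
static int ready_after_3s(int elapsed_ms) { return elapsed_ms >= 3000; }

/*
 * Poll with a doubling delay. Returns 0 once the device is ready, or -1
 * when the next delay would exceed timeout_ms. *waited_ms receives the
 * total time waited so far (1 + 2 + 4 + ... = delay - 1).
 */
static int wait_after_flr(ready_fn ready, int timeout_ms, int *waited_ms)
{
    int delay = 1;

    for (;;) {
        *waited_ms = delay - 1;
        if (ready(*waited_ms))
            return 0;

        if (delay > timeout_ms) {
            printf("not ready %dms after FLR; giving up\n", delay - 1);
            return -1;
        }
        if (delay > RESET_WAIT_MS)
            printf("not ready %dms after FLR; waiting\n", delay - 1);

        /* A real implementation would sleep for `delay` ms here. */
        delay *= 2;
    }
}
```

With never_ready() and READY_POLL_MS this prints six "waiting" lines
(1023ms through 32767ms) followed by "giving up" at 65535ms, the same
sequence as in the report. With ready_after_3s() it returns 0 after
4095ms.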

> As described in this post
> (https://forum.level1techs.com/t/do-your-rtx-5090-or-general-rtx-50-series-has-reset-bug-in-vm-passthrough/228549).
> 
> And in dmesg:
> 
>  [401106.011979] vfio-pci 0000:d8:00.0: not ready 1023ms after FLR; waiting
>  [401108.700074] vfio-pci 0000:d8:00.0: not ready 2047ms after FLR; waiting
>  [401112.412204] vfio-pci 0000:d8:00.0: not ready 4095ms after FLR; waiting
>  [401118.620399] vfio-pci 0000:d8:00.0: not ready 8191ms after FLR; waiting
>  [401128.860788] vfio-pci 0000:d8:00.0: not ready 16383ms after FLR; waiting
>  [401147.293518] vfio-pci 0000:d8:00.0: not ready 32767ms after FLR; waiting
>  [401185.694859] vfio-pci 0000:d8:00.0: not ready 65535ms after FLR; giving up
>  [401195.372583] vfio-pci 0000:38:00.2: Relaying device request to user (#0)
> 
>  [401208.274941] watchdog: BUG: soft lockup - CPU#11 stuck for 21s! [CPU 22/KVM:30337]
> 
>  [401209.887848] CPU: 11 PID: 30337 Comm: CPU 22/KVM Kdump: loaded Not tainted
>  [401209.887854] RIP: 0010:pci_mmcfg_read+0xaa/0xd0
> 
>  [401209.887866] Call Trace:
>  [401209.887872]  pci_bus_read_config_dword+0x43/0x70
>  [401209.887876]  pci_find_next_ext_capability.part.20+0x65/0xc0
>  [401209.887879]  pci_restore_state.part.39+0x6d/0x3f0
>  [401209.887883]  vfio_pci_disable+0x22b/0x4d0 [vfio_pci]
>  [401209.887886]  ? __dentry_kill+0x118/0x160
>  [401209.887888]  vfio_pci_release+0x5a/0xb0 [vfio_pci]
>  [401209.887891]  vfio_device_fops_release+0x18/0x30 [vfio]
>  [401209.887894]  __fput+0x98/0x240
>  [401209.887897]  task_work_run+0x6a/0xa0
>  [401209.887899]  do_exit+0x375/0xb10
>  [401209.887900]  do_group_exit+0x3a/0xa0
>  [401209.887902]  get_signal+0x140/0x7d0
>  [401209.887906]  arch_do_signal+0x2c/0x260
>  [401209.887909]  exit_to_user_mode_prepare+0xc0/0x120
>  [401209.887912]  syscall_exit_to_user_mode+0x27/0x180
>  [401209.887915]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The flr seems to have some issues on the NVIDIA 5090 GPU,
> so I’ve added flr-related quirks for these devices.
> 
> And with this patch in place, the host kernel doesn't exhibit these
> problems. The vm starts up and works as expected with the passed-through
> NVIDIA 5090 GPU.
> 
> Signed-off-by: yuan.gao <yuan.gao@ucloud.cn>
> ---
>  drivers/pci/quirks.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 48946cca4be72..71f833f3e2d84 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5618,6 +5618,9 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_AMD, 0x7901, quirk_no_flr);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502, quirk_no_flr);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503, quirk_no_flr);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_MEDIATEK, 0x0616, quirk_no_flr);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_NVIDIA, 0x2b85, quirk_no_flr);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_NVIDIA, 0x2b87, quirk_no_flr);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_NVIDIA, 0x2b8c, quirk_no_flr);
>  
>  /* FLR may cause the SolidRun SNET DPU (rev 0x1) to hang */
>  static void quirk_no_flr_snet(struct pci_dev *dev)
> -- 
> 2.32.0
> 

* Re: [PATCH] PCI: Avoid FLR for NVIDIA 5090 GPU
  2026-04-16 17:31 ` Bjorn Helgaas
@ 2026-04-16 22:37   ` Jason Gunthorpe
  2026-04-17  2:06     ` yuan.gao
  0 siblings, 1 reply; 4+ messages in thread
From: Jason Gunthorpe @ 2026-04-16 22:37 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: yuan.gao, Bjorn Helgaas, linux-pci, linux-kernel, Alex Williamson

On Thu, Apr 16, 2026 at 12:31:00PM -0500, Bjorn Helgaas wrote:
> On Thu, Apr 16, 2026 at 03:07:06PM +0800, yuan.gao wrote:
> > When passing through the NVIDIA 5090 GPU to a vm, there is a certain
> > probability of encountering an flr timeout during vm shutdown, which
> > subsequently leads to a soft lock of the host cpu.
> 
> If possible, would like confirmation of device erratum from Nvidia.
> If there's no known erratum, there might be something wrong in the
> Linux FLR and wait.

I asked and was told there is a known device firmware defect that
causes this.

So blanket disabling FLR without detecting good and bad FW is not a
good idea.
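
To illustrate the difference between a blanket quirk and one gated on
firmware, here is a userspace model only, not kernel code. The device
IDs come from the patch. Everything about firmware is a placeholder:
the thread does not say how affected firmware can be identified, so
fw_version, FIRST_FIXED_FW and fw_has_flr_defect() are hypothetical,
as are struct model_dev and FLAG_NO_FLR (which stands in for the
kernel's PCI_DEV_FLAGS_NO_FLR_RESET).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VENDOR_NVIDIA   0x10de
#define FLAG_NO_FLR     (1u << 0)   /* models PCI_DEV_FLAGS_NO_FLR_RESET */

/* Hypothetical placeholder: first firmware revision without the defect. */
#define FIRST_FIXED_FW  2u

struct model_dev {
    uint16_t vendor;
    uint16_t device;
    uint32_t fw_version;    /* hypothetical: source of this value is unknown */
    unsigned int flags;
};

/* Device IDs listed in the patch. */
static const uint16_t affected_ids[] = { 0x2b85, 0x2b87, 0x2b8c };

/* Hypothetical predicate; a real check would need vendor guidance. */
static bool fw_has_flr_defect(const struct model_dev *dev)
{
    return dev->fw_version < FIRST_FIXED_FW;
}

/* Avoid FLR only on listed devices whose firmware is known to be bad. */
static void quirk_no_flr_if_bad_fw(struct model_dev *dev)
{
    size_t i;

    if (dev->vendor != VENDOR_NVIDIA)
        return;

    for (i = 0; i < sizeof(affected_ids) / sizeof(affected_ids[0]); i++) {
        if (dev->device == affected_ids[i] && fw_has_flr_defect(dev)) {
            dev->flags |= FLAG_NO_FLR;
            return;
        }
    }
}
```

In this model a listed device with good firmware keeps FLR available,
which the blanket quirk in the patch would not allow.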

I suggest Yuan use an NVIDIA support channel to resolve the issue
with their card.

Jason

* Re: [PATCH] PCI: Avoid FLR for NVIDIA 5090 GPU
  2026-04-16 22:37   ` Jason Gunthorpe
@ 2026-04-17  2:06     ` yuan.gao
  0 siblings, 0 replies; 4+ messages in thread
From: yuan.gao @ 2026-04-17  2:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bjorn Helgaas, Bjorn Helgaas, linux-pci, linux-kernel,
	Alex Williamson

On Thu, Apr 16, 2026 at 07:37:54PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 16, 2026 at 12:31:00PM -0500, Bjorn Helgaas wrote:
> > On Thu, Apr 16, 2026 at 03:07:06PM +0800, yuan.gao wrote:
> > > When passing through the NVIDIA 5090 GPU to a vm, there is a certain
> > > probability of encountering an flr timeout during vm shutdown, which
> > > subsequently leads to a soft lock of the host cpu.
> > 
> > If possible, would like confirmation of device erratum from Nvidia.
> > If there's no known erratum, there might be something wrong in the
> > Linux FLR and wait.
> 
> I asked and was told there is a known device firmware defect that
> causes this.
> 
> So blanket disabling FLR without detecting good and bad FW is not a
> good idea.
> 
> I suggest Yuan try to use an NVIDIA support channel to try to resolve
> the issue with their card..
> 
> Jason

Got it, thanks.

Cheers,
Yuan Gao
