[RFC PATCH] PCI: readiness condition with Configuration RRS in pci_dev_wait()

* [RFC PATCH] PCI: readiness condition with Configuration RRS in pci_dev_wait()
@ 2026-06-03  9:32 Yingying Zheng
  2026-06-03 12:44 ` Lukas Wunner
  0 siblings, 1 reply; 4+ messages in thread
From: Yingying Zheng @ 2026-06-03  9:32 UTC
  To: bhelgaas, linux-pci, linux-kernel; +Cc: 丁辉, zhengyingying

We are seeing reproducible AER/DPC fatal errors during VM boot when passing
through one or more NVIDIA RTX 4090 GPUs via VFIO. The issue is triggered
during QEMU device initialization, before the guest starts running, when
QEMU issues the VFIO_DEVICE_PCI_HOT_RESET ioctl. After this hot reset,
the subsequent PCI config restore may happen before the GPU is fully
re-initialized, which correlates with the AER/DPC fatal errors.

Kernel: based on Linux 6.6 stable

Call chain (simplified):
ioctl(..., VFIO_DEVICE_PCI_HOT_RESET, ...)         (QEMU)
      vfio_pci_core_ioctl                            (kernel)
          vfio_pci_ioctl_pci_hot_reset
              vfio_pci_ioctl_pci_hot_reset_groups
                  vfio_pci_dev_set_hot_reset
                      pci_reset_bus
                          __pci_reset_bus
                              pci_bridge_secondary_bus_reset

Hardware (example BDFs):
Root Port: 0000:b7:01.0 PCI bridge [0604]: Intel Corporation PCI Express Gen5 Port A [8086:352a]
PCIe switch: 0000:b9:00.0 PCI bridge [0604]: Broadcom / LSI PEX890xx PCIe Gen 5 Switch [1000:c030]
GPU: 0000:ba:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684]
           0000:ba:00.1 Audio device [0403]: NVIDIA Corporation AD102 High Definition Audio Controller [10de:22ba]

The GPU functions 0000:ba:00.0 and 0000:ba:00.1 are bound to vfio-pci on
the host and assigned to the guest. The GPU is connected to the Root Port
through the Broadcom/LSI PCIe switch.

Topology (lspci -vtnn excerpt):
+-[0000:b7]-+-...
  |           \-01.0-[b8-bb]----00.0-[b9-bb]--+-00.0-[ba]--+-00.0 NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684]
  |                                           |            \-00.1 NVIDIA Corporation AD102 High Definition Audio Controller [10de:22ba]
  |                                           \-01.0-[bb]--+-00.0 NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684]
  |                                                        \-00.1 NVIDIA Corporation AD102 High Definition Audio Controller [10de:22ba]
  ...
  +-[0000:97]-+-...
  |           \-01.0-[98-9d]----00.0-[99-9d]--+-00.0-[9a]--+-00.0 NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684]
  |                                           |            \-00.1 NVIDIA Corporation AD102 High Definition Audio Controller [10de:22ba]
  |                                           +-01.0-[9b]--+-00.0 NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684]
  |                                           |            \-00.1 NVIDIA Corporation AD102 High Definition Audio Controller [10de:22ba]
  |                                           +-02.0-[9c]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint [1000:02b2]
  |                                           \-1f.0-[9d]----00.0 Broadcom / LSI PCIe Switch management endpoint [1000:00b2]

During VM power-on, the host logs show a DPC containment event and an AER fatal
Transaction Layer error on the upstream Root Port:
pcieport 0000:b7:01.0: DPC: containment event, status:0x1f01 source:0x0000
pcieport 0000:b7:01.0: DPC: unmasked uncorrectable error detected
pcieport 0000:b7:01.0: PCIe Bus Error: severity=Uncorrected (Fatal),
                                       type=Transaction Layer, (Receiver ID)
pcieport 0000:b7:01.0: device [8086:352a] error status/mask=00040000/00180020
pcieport 0000:b7:01.0: [18] MalfTLP (First)
pcieport 0000:b7:01.0: AER: TLP Header: 60701001 ba00000f 00000001 2e45f000

On the upstream port, the Virtual Channel capability indicates only TC0
is mapped to VC0, e.g.:
Capabilities: [280 v1] Virtual Channel
                      VC0 ... Ctrl: ... TC/VC=01
                      VC1 ... Ctrl: ... TC/VC=00
However, on the NVIDIA GPU function (e.g. 0000:ba:00.0), after a reset
the GPU's Virtual Channel resource control (TC/VC mapping) is observed
to change from the expected 01 to ff:

Before VM power-on:
ba:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1) (prog-if 00 [VGA controller])
     Subsystem: Gigabyte Technology Co., Ltd Device [1458:40de]
     ...
     Capabilities: [100 v1] Virtual Channel
         ...
         VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
             Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
             Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=01
             Status: NegoPending- InProgress-

After VM power-on:
ba:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1) (prog-if 00 [VGA controller])
     Subsystem: Gigabyte Technology Co., Ltd Device [1458:40de]
     ...
     Capabilities: [100 v1] Virtual Channel
         ...
         VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
             Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
             Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
             Status:    NegoPending- InProgress-

With TC/VC=ff, the GPU may emit TLPs with a Traffic Class other than
TC0. The upstream port, which only maps TC0 to VC0, treats those TLPs
as Malformed TLPs, triggering the AER fatal error above.
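
As an illustration of the TC/VC reasoning above, here is a minimal sketch
(not kernel code; the helper names are hypothetical). It assumes the AER
log prints TLP header DW0 with header byte 0 in the most significant
position, so the Traffic Class sits in bits 22:20, and it treats the
TC/VC Map as a bitmap in which bit n means TCn is mapped to the VC.
Under those assumptions the logged DW0 0x60701001 decodes to TC7, which
a map of ff covers and a map of 01 does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Traffic Class field of TLP header DW0 (bits 22:20, byte 0 as MSB). */
static unsigned int tlp_traffic_class(uint32_t dw0)
{
	return (dw0 >> 20) & 0x7;
}

/* TC/VC Map is a bitmap: bit n set means TCn is mapped to this VC. */
static bool tc_is_mapped(uint8_t tc_vc_map, unsigned int tc)
{
	assert(tc < 8);
	return (tc_vc_map >> tc) & 1;
}
```

For example, tc_is_mapped(0x01, tlp_traffic_class(0x60701001)) is false,
matching the upstream port rejecting the TLP.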

We are running a Linux 6.6 stable kernel. After comparing behavior
with older kernels, we traced the regression to commit ac91e6980563
("PCI: Unify delay handling for reset and resume").

The key behavioral change is that pci_reset_secondary_bus() no longer
includes the previous 1-second delay after deasserting secondary bus reset.

On our system, after the GPU is reset, the GPU hardware temporarily ends
up with an unexpected Virtual Channel mapping (e.g. VC0 resource control
TC/VC=ff). The VC state had been saved before the reset via
pci_save_vc_state(), but on the restore path pci_restore_vc_state() does
not restore it because pci_find_ext_capability(dev, PCI_EXT_CAP_ID_VC)
returns 0 at that moment, i.e. the VC extended capability is not yet
accessible. As a result, the saved VC state is never restored and the
device keeps operating with the incorrect mapping, which later triggers
the AER error on the upstream port.
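
To make the failure mode concrete, the following is a simplified,
self-contained model of an extended capability walk. It is not the
kernel's pci_find_ext_capability(); the callback type and the two mock
config spaces are hypothetical. It also assumes that reads at offset
0x100 return 0 during the not-ready window, which the observations above
do not state explicitly (only that the lookup returns 0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXT_CAP_START   0x100   /* first extended capability header */
#define EXT_CAP_ID_VC   0x0002  /* Virtual Channel capability ID */
#define CFG_SPACE_SIZE  4096

/* Header layout: ID [15:0], version [19:16], next offset [31:20]. */
#define EXT_CAP_ID(h)   ((h) & 0xffff)
#define EXT_CAP_NEXT(h) (((h) >> 20) & 0xffc)

typedef uint32_t (*cfg_read32_fn)(void *ctx, unsigned int offset);

/* Returns the offset of the capability, or 0 if it is not found. */
static unsigned int find_ext_cap(cfg_read32_fn rd, void *ctx, uint16_t cap_id)
{
	unsigned int pos = EXT_CAP_START;
	int ttl = (CFG_SPACE_SIZE - EXT_CAP_START) / 8; /* bound the walk */

	assert(rd != NULL);
	while (ttl-- > 0 && pos >= EXT_CAP_START) {
		uint32_t hdr = rd(ctx, pos);

		if (hdr == 0 || hdr == 0xffffffff)
			break;          /* no (more) capabilities visible */
		if (EXT_CAP_ID(hdr) == cap_id)
			return pos;
		pos = EXT_CAP_NEXT(hdr);
	}
	return 0;
}

/* Mock: device is ready, VC capability (version 1) at 0x100, end of list. */
static uint32_t read_ready(void *ctx, unsigned int offset)
{
	(void)ctx;
	return offset == 0x100 ? 0x00010002 : 0;
}

/* Mock (assumption): nothing is visible in extended config space yet. */
static uint32_t read_not_ready(void *ctx, unsigned int offset)
{
	(void)ctx;
	(void)offset;
	return 0;
}
```

A restore routine that bails out when the lookup returns 0 would silently
skip the VC registers in the second case, which is the behavior described
above.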

As a workaround, reintroducing a 1-second delay after pci_reset_secondary_bus()
makes the issue go away on our system.

We then found commit d591f6804e7e ("PCI: Wait for device readiness with
Configuration RRS").

This looks like the proper direction: when the upstream Root Port enables
Configuration RRS Software Visibility, software can detect Configuration
RRS responses by reading Vendor ID and observing the reserved 0x0001 value,
so pci_dev_wait() can perform correct exponential backoff until the device
is actually ready for config accesses.
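
A rough user-space sketch of that polling idea follows. It is not the
body of pci_dev_wait(); read_id() is a hypothetical callback that returns
the Vendor/Device ID dword as seen a given number of milliseconds after
the reset, and sleeping is modeled by simply advancing a counter. The
delay starts at 1 ms and doubles on every retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RRS_VENDOR_ID 0x0001  /* reserved Vendor ID seen for Configuration RRS */

static bool is_rrs_vendor_id(uint32_t id)
{
	return (id & 0xffff) == RRS_VENDOR_ID;
}

/*
 * Returns the simulated elapsed time (ms) at which a non-RRS Vendor ID was
 * first read, or -1 if the next delay would exceed timeout_ms.
 */
static int wait_ready_ms(uint32_t (*read_id)(int elapsed_ms), int timeout_ms)
{
	int delay = 1, elapsed = 0;

	assert(read_id && timeout_ms > 0);
	for (;;) {
		if (!is_rrs_vendor_id(read_id(elapsed)))
			return elapsed;
		if (delay > timeout_ms)
			return -1;
		elapsed += delay;       /* stands in for msleep(delay) */
		delay *= 2;
	}
}

/* Mock device: answers with RRS for 100 ms, then with its real ID. */
static uint32_t rrs_for_100ms(int elapsed_ms)
{
	return elapsed_ms < 100 ? 0xffff0001u : 0x268410deu;
}

/* Mock device that never becomes ready. */
static uint32_t always_rrs(int elapsed_ms)
{
	(void)elapsed_ms;
	return 0xffff0001u;
}
```

With the first mock, the polls land at 0, 1, 3, 7, 15, 31, 63 and 127 ms,
so readiness is detected at the 127 ms poll.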

On our system, the upstream Root Port does report CRSVisible enabled, e.g. (excerpt):

b7:01.0 PCI bridge [0604]: Intel Corporation PCI Express Gen5 Port A [8086:352a] (rev 04) (prog-if 00 [Normal decode])
     ...
     Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
         ...
         RootCap: CRSVisible+
         RootCtl: ErrCorrectable- ErrNon-Fatal+ ErrFatal+ PMEIntEna+ CRSVisible+

However, with some PCIe switches (in our case Broadcom/LSI PEX890xx PCIe
Gen5 switch), when the downstream device is in reset or link training is
not completed, the switch exposes a Virtual PCIe Placeholder Endpoint on
the downstream side. During this window, reads to the GPU BDF Vendor ID
return the placeholder endpoint Vendor/Device ID (Broadcom/LSI) instead
of the expected 0x0001 RRS-visible value.

Based on this behavior, we have a candidate change for discussion: only
treat the device as ready once reads of PCI_VENDOR_ID appear to be coming
from the actual endpoint, i.e. the returned Vendor/Device ID matches the
dev->vendor/dev->device recorded at enumeration time.

If we keep reading PCI_VENDOR_ID from 0000:ba:00.0 over time, we observe
the following:

t+  0ms: 1000:02b2
t+ 16ms: 1000:02b2
t+ 28ms: 1000:02b2
t+ 40ms: 1000:02b2
t+ 56ms: 1000:02b2
t+120ms: 10de:2684

In our case, the candidate change would make pci_dev_wait() keep waiting
until the PCI_VENDOR_ID read transitions from 1000:02b2 to 10de:2684
(around t+120ms in the sequence above), instead of returning immediately
at t+0ms.
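
The effect can be replayed against the samples listed above. This is a
stand-alone sketch, not kernel code: the sample table is copied from the
observations, first_ready_ms() and its "strict" flag are hypothetical, and
the sample times are those of the manual reads rather than the actual
backoff schedule of pci_dev_wait().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sample {
	int t_ms;
	uint32_t id;    /* Device ID in [31:16], Vendor ID in [15:0] */
};

/* Reads of PCI_VENDOR_ID at 0000:ba:00.0, as listed above. */
static const struct sample observed[] = {
	{   0, 0x02b21000 }, {  16, 0x02b21000 }, {  28, 0x02b21000 },
	{  40, 0x02b21000 }, {  56, 0x02b21000 }, { 120, 0x268410de },
};

/*
 * Returns the time of the first sample considered "ready", or -1.
 * strict == false: current condition (anything but the RRS value 0x0001).
 * strict == true:  proposed condition (ID must also match the device).
 */
static int first_ready_ms(bool strict, uint16_t vendor, uint16_t device)
{
	for (size_t i = 0; i < sizeof(observed) / sizeof(observed[0]); i++) {
		uint32_t id = observed[i].id;
		bool ready = (id & 0xffff) != 0x0001;

		if (strict)
			ready = ready && (id & 0xffff) == vendor &&
				(id >> 16) == device;
		if (ready)
			return observed[i].t_ms;
	}
	return -1;
}

static void self_check(void)
{
	assert(first_ready_ms(false, 0x10de, 0x2684) == 0);
	assert(first_ready_ms(true, 0x10de, 0x2684) == 120);
}
```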

We are not sure about potential side effects of making pci_dev_wait()
more strict (e.g. for SR-IOV VFs or other devices/platforms), so we
would appreciate feedback on whether this approach is acceptable and
whether it should be handled generically or via a quirk.

We can provide more details (full topology, exact reset trigger path
in VFIO/QEMU, kernel logs, and config diffs before and after reset)
if that would help.

Any comments and suggestions are appreciated, thanks.

Signed-off-by: Yingying Zheng <zhengyingying@sangfor.com.cn>
Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
---
  drivers/pci/pci.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b98e04865..1e6d8a84a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1219,7 +1219,9 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)

          if (root && root->config_crs_sv) {
              pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
-            if (!pci_bus_crs_vendor_id(id))
+            if (!pci_bus_crs_vendor_id(id) &&
+                (id & 0xffff) == dev->vendor &&
+                (id >> 16) == dev->device)
                  break;
          } else {
              pci_read_config_dword(dev, PCI_COMMAND, &id);
-- 
2.18.4

* Re: [RFC PATCH] PCI: readiness condition with Configuration RRS in pci_dev_wait()
  2026-06-03  9:32 [RFC PATCH] PCI: readiness condition with Configuration RRS in pci_dev_wait() Yingying Zheng
@ 2026-06-03 12:44 ` Lukas Wunner
  2026-06-03 12:49   ` Lukas Wunner
  2026-06-04  9:26   ` Yingying Zheng
  0 siblings, 2 replies; 4+ messages in thread
From: Lukas Wunner @ 2026-06-03 12:44 UTC
  To: Yingying Zheng; +Cc: bhelgaas, linux-pci, linux-kernel, 丁辉

On Wed, Jun 03, 2026 at 05:32:54PM +0800, Yingying Zheng wrote:
> However, with some PCIe switches (in our case Broadcom/LSI PEX890xx PCIe
> Gen5 switch), when the downstream device is in reset or link training is
> not completed, the switch exposes a Virtual PCIe Placeholder Endpoint on
> the downstream side. During this window, reads to the GPU BDF Vendor ID
> return the placeholder endpoint Vendor/Device ID (Broadcom/LSI) instead
> of the expected 0x0001 RRS-visible value.

What's the OEM and model name of the system you're observing this on?

We ran into this issue on a Supermicro SYS-421-GE-TNRT server with
AOM-PCIE5-418P-1-P board.  The two Broadcom PCIe switches are located
on the AOM board.

Do you happen to use the same system?

The Broadcom PCIe switches support a "Synthetic Mode" (alternatively to
"Base Mode") for use cases where a single PCI function is accessible to 
multiple hosts.

In Synthetic Mode, the Broadcom switch spoofs responses when the actual
device downstream of the switch is inaccessible.  The spoofed responses
come from the virtual placeholder device with ID [1000:02b2].

When I investigated the issue, I cooked up a tentative patch to recognize
spoofed responses in the PCI core:

https://github.com/l1k/linux/commits/broadcom/
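
For readers following along, recognizing a spoofed response by its ID
could look roughly like the sketch below. This is only an illustration
and is not taken from the branch linked above; the function name is
hypothetical, and the only fact relied on is the placeholder ID
[1000:02b2] mentioned in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PLACEHOLDER_VENDOR 0x1000  /* Broadcom / LSI */
#define PLACEHOLDER_DEVICE 0x02b2  /* Virtual PCIe Placeholder Endpoint */

/* id is the first config dword: Device ID in [31:16], Vendor ID in [15:0]. */
static bool is_synthetic_placeholder(uint32_t id)
{
	return (id & 0xffff) == PLACEHOLDER_VENDOR &&
	       (id >> 16) == PLACEHOLDER_DEVICE;
}

static void self_check(void)
{
	assert(is_synthetic_placeholder(0x02b21000));   /* spoofed response */
	assert(!is_synthetic_placeholder(0x268410de));  /* RTX 4090 */
	assert(!is_synthetic_placeholder(0xffff0001));  /* Configuration RRS */
}
```

Compared with matching against the ID recorded at enumeration time, a
check like this is narrower and only covers the one placeholder device.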

After some back and forth with Broadcom and Supermicro FAEs, it turned out
that the switch had an outdated firmware with version 04.101.00.00.

The Supermicro server had already been updated to the latest BIOS version,
but the BIOS update does not include updates for the PCIe switches.
One has to contact Broadcom directly and ask for a separate firmware
update.  In our case, we received a file called
"AOM-PCIe5-418P-1 _PLX FW 4.16.0 Package.zip".

The zip file contained a g4Xdiagnostics.efi utility which can be run from
an EFI shell to query the current firmware version on the Broadcom switches.
It can also flash the switches with a new firmware, which was included in
the zip file as:
"AOM-PCIe5-418P-1_4.16.0GCA_PLX1_07172024.fw" and
"AOM-PCIe5-418P-1_4.16.0GCA_PLX2_07172024.fw".
                               ^
After updating the switches with these files, they reported 04.16.00.00
as firmware version.  In that version, Synthetic Mode is deactivated,
the placeholder device does not appear and reset on passthrough works
as it should.

According to the Broadcom FAE, the OEMs own the switch firmware settings
and so the kernel is not permitted to deactivate Synthetic Mode at runtime.

Because the problem was no longer reproducible after updating the switch
firmware, we decided that this is the recommended way to solve the problem
and I did not pursue an in-kernel workaround as a result.

As an aside, the amdgpu driver contains a workaround for Broadcom Synthetic
Mode which was introduced with 1dd2fa0e00f1 (and subsequently fixed up with
9b608fe94870 and 4e89d629dc72).  Obviously, this only helps if there are
AMD devices downstream of the Broadcom switch, and not with those of other
vendors.

Thanks,

Lukas

* Re: [RFC PATCH] PCI: readiness condition with Configuration RRS in pci_dev_wait()
  2026-06-03 12:44 ` Lukas Wunner
@ 2026-06-03 12:49   ` Lukas Wunner
  2026-06-04  9:26   ` Yingying Zheng
  1 sibling, 0 replies; 4+ messages in thread
From: Lukas Wunner @ 2026-06-03 12:49 UTC
  To: Yingying Zheng; +Cc: bhelgaas, linux-pci, linux-kernel, 丁辉

On Wed, Jun 03, 2026 at 02:44:27PM +0200, Lukas Wunner wrote:
> The Supermicro server had already been updated to the latest BIOS version,
> but the BIOS update does not include updates for the PCIe switches.
> One has to contact Broadcom directly and ask for a separate firmware
> update.

Sorry, this should have been "One has to contact *Supermicro*".

We got the firmware update from Supermicro.
Broadcom always pointed to the OEM, i.e. Supermicro.

Thanks,

Lukas

* Re: [RFC PATCH] PCI: readiness condition with Configuration RRS in pci_dev_wait()
  2026-06-03 12:44 ` Lukas Wunner
  2026-06-03 12:49   ` Lukas Wunner
@ 2026-06-04  9:26   ` Yingying Zheng
  1 sibling, 0 replies; 4+ messages in thread
From: Yingying Zheng @ 2026-06-04  9:26 UTC
  To: Lukas Wunner; +Cc: bhelgaas, linux-pci, linux-kernel, 丁辉

Thanks a lot for the detailed explanation and the pointers about Broadcom
"Synthetic Mode". This matches what we are observing very well.

On 2026/6/3 20:44, Lukas Wunner wrote:
> On Wed, Jun 03, 2026 at 05:32:54PM +0800, Yingying Zheng wrote:
>> However, with some PCIe switches (in our case Broadcom/LSI PEX890xx PCIe
>> Gen5 switch), when the downstream device is in reset or link training is
>> not completed, the switch exposes a Virtual PCIe Placeholder Endpoint on
>> the downstream side. During this window, reads to the GPU BDF Vendor ID
>> return the placeholder endpoint Vendor/Device ID (Broadcom/LSI) instead
>> of the expected 0x0001 RRS-visible value.
> 
> What's the OEM and model name of the system you're observing this on?
> 
> We ran into this issue on a Supermicro SYS-421-GE-TNRT server with
> AOM-PCIE5-418P-1-P board.  The two Broadcom PCIe switches are located
> on the AOM board.
> 
> Do you happen to use the same system?
> 

OEM: our system is from a different OEM (not Supermicro)
Switch firmware: we checked with Broadcom’s g4Xdiagnostics tool and both
PEX89104 switches report FW version 01.05.11.01

> The Broadcom PCIe switches support a "Synthetic Mode" (alternatively to
> "Base Mode") for use cases where a single PCI function is accessible to
> multiple hosts.
> 
> In Synthetic Mode, the Broadcom switch spoofs responses when the actual
> device downstream of the switch is inaccessible.  The spoofed responses
> come from the virtual placeholder device with ID [1000:02b2].
> 
> When I investigated the issue, I cooked up a tentative patch to recognize
> spoofed responses in the PCI core:
> 
> https://github.com/l1k/linux/commits/broadcom/
> 
> After some back and forth with Broadcom and Supermicro FAEs, it turned out
> that the switch had an outdated firmware with version 04.101.00.00.
> 
> The Supermicro server had already been updated to the latest BIOS version,
> but the BIOS update does not include updates for the PCIe switches.
> One has to contact Broadcom directly and ask for a separate firmware
> update.  In our case, we received a file called
> "AOM-PCIe5-418P-1 _PLX FW 4.16.0 Package.zip".
> 
> The zip file contained a g4Xdiagnostics.efi utility which can be run from
> an EFI shell to query the current firmware version on the Broadcom switches.
> It can also flash the switches with a new firmware, which was included in
> the zip file as:
> "AOM-PCIe5-418P-1_4.16.0GCA_PLX1_07172024.fw" and
> "AOM-PCIe5-418P-1_4.16.0GCA_PLX2_07172024.fw".
>                                 ^
> After updating the switches with these files, they reported 04.16.00.00
> as firmware version.  In that version, Synthetic Mode is deactivated,
> the placeholder device does not appear and reset on passthrough works
> as it should.
> 

Given your findings, this firmware version seems quite old and is likely
the root cause of the spoofed placeholder responses during reset/link
training on our platform as well.

At the moment, we are not sure whether we can obtain updated switch
firmware from the OEM in a timely manner. We will contact the OEM and
report back whether the issue disappears after the update.

Thanks again for sharing your investigation and the recommended resolution.

> According to the Broadcom FAE, the OEMs own the switch firmware settings
> and so the kernel is not permitted to deactivate Synthetic Mode at runtime.
> 
> Because the problem was no longer reproducible after updating the switch
> firmware, we decided that this is the recommended way to solve the problem
> and I did not pursue an in-kernel workaround as a result.
> 
> As an aside, the amdgpu driver contains a workaround for Broadcom Synthetic
> Mode which was introduced with 1dd2fa0e00f1 (and subsequently fixed up with
> 9b608fe94870 and 4e89d629dc72).  Obviously, this only helps if there are
> AMD devices downstream of the Broadcom switch, and not with those of other
> vendors.
> 
> Thanks,
> 
> Lukas
> 
> 

Best regards,
Yingying Zheng
