linux-pci.vger.kernel.org archive mirror
* [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
@ 2025-06-11 10:14 Hui Wang
  2025-06-12 16:48 ` Bjorn Helgaas
  2025-08-14 15:55 ` Bjorn Helgaas
  0 siblings, 2 replies; 14+ messages in thread
From: Hui Wang @ 2025-06-11 10:14 UTC (permalink / raw)
  To: linux-pci, bhelgaas
  Cc: raphael.norwitz, alay.shah, suresh.gumpula, ilpo.jarvinen,
	hui.wang

Prior to commit d591f6804e7e ("PCI: Wait for device readiness with
Configuration RRS"), this Intel NVMe device [8086:0a54] worked well.
Since that commit was merged, the device no longer works.

Through debugging, we found that the commit introduced RRS polling in
pci_dev_wait(). For this NVMe device, polling PCI_VENDOR_ID returns ~0
if the config access is not ready yet, but the polling loop expects
either 0x0001 (an RRS completion) or a valid Vendor ID, so RRS polling
does not work for this device.

Add a PCI quirk that disables RRS polling for this device.
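
For illustration only (a user-space sketch modeled on the kernel's
pci_bus_rrs_vendor_id() check, not part of this patch), this shows why a
synthesized ~0 slips past the polling loop:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of the RRS check in pci_dev_wait(): a config read
 * that completes with RRS status returns the reserved Vendor ID 0x0001
 * in the low 16 bits; any other value is treated as "device ready".
 */
#define RRS_VENDOR_ID 0x0001

static int rrs_pending(uint32_t id)
{
	/* Keep polling only while the Vendor ID field reads 0x0001 */
	return (id & 0xffff) == RRS_VENDOR_ID;
}
```

A synthesized error value of 0xffffffff has a Vendor ID field of 0xffff,
not 0x0001, so the loop stops waiting even though the device never
responded.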

Fixes: d591f6804e7e ("PCI: Wait for device readiness with Configuration RRS")
Link: https://bugs.launchpad.net/bugs/2111521
Signed-off-by: Hui Wang <hui.wang@canonical.com>
---
 drivers/pci/probe.c  |  3 ++-
 drivers/pci/quirks.c | 18 ++++++++++++++++++
 include/linux/pci.h  |  2 ++
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 4b8693ec9e4c..848fa0e6cf60 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1270,7 +1270,8 @@ static void pci_enable_rrs_sv(struct pci_dev *pdev)
 	if (root_cap & PCI_EXP_RTCAP_RRS_SV) {
 		pcie_capability_set_word(pdev, PCI_EXP_RTCTL,
 					 PCI_EXP_RTCTL_RRS_SVE);
-		pdev->config_rrs_sv = 1;
+		if (!(pdev->dev_flags & PCI_DEV_FLAGS_NO_RRS_SV))
+			pdev->config_rrs_sv = 1;
 	}
 }
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index cf483d82572c..519e48ff6448 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -6336,3 +6336,21 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
 #endif
+
+/*
+ * Although the Root Port claims to support RRS, some devices don't work with
+ * RRS polling: when the Vendor ID is read, they simply return ~0 if config
+ * access is not ready yet, which breaks pci_dev_wait(). Forcibly disable RRS
+ * polling for such devices.
+ */
+static void quirk_no_rrs_sv(struct pci_dev *dev)
+{
+	struct pci_dev *root;
+
+	root = pcie_find_root_port(dev);
+	if (root) {
+		root->dev_flags |= PCI_DEV_FLAGS_NO_RRS_SV;
+		root->config_rrs_sv = 0;
+	}
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0a54, quirk_no_rrs_sv);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 05e68f35f392..f4dd9ada12e4 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -247,6 +247,8 @@ enum pci_dev_flags {
 	PCI_DEV_FLAGS_HAS_MSI_MASKING = (__force pci_dev_flags_t) (1 << 12),
 	/* Device requires write to PCI_MSIX_ENTRY_DATA before any MSIX reads */
 	PCI_DEV_FLAGS_MSIX_TOUCH_ENTRY_DATA_FIRST = (__force pci_dev_flags_t) (1 << 13),
+	/* Do not use Configuration Request Retry Status polling in pci_dev_wait() */
+	PCI_DEV_FLAGS_NO_RRS_SV = (__force pci_dev_flags_t) (1 << 14),
 };
 
 enum pci_irq_reroute_variant {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-06-11 10:14 [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme Hui Wang
@ 2025-06-12 16:48 ` Bjorn Helgaas
  2025-06-16 11:55   ` Hui Wang
  2025-08-14 15:55 ` Bjorn Helgaas
  1 sibling, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2025-06-12 16:48 UTC (permalink / raw)
  To: Hui Wang
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, Nirmal Patel, Jonathan Derrick

[+cc VMD folks]

On Wed, Jun 11, 2025 at 06:14:42PM +0800, Hui Wang wrote:
> Prior to commit d591f6804e7e ("PCI: Wait for device readiness with
> Configuration RRS"), this Intel nvme [8086:0a54] works well. Since
> that patch is merged to the kernel, this nvme stops working.
> 
> Through debugging, we found that commit introduces the RRS polling in
> the pci_dev_wait(), for this nvme, when polling the PCI_VENDOR_ID, it
> will return ~0 if the config access is not ready yet, but the polling
> expects a return value of 0x0001 or a valid vendor_id, so the RRS
> polling doesn't work for this nvme.

Sorry for breaking this, and thanks for all your work in debugging
this!  Issues like this are really hard to track down.

I would think we would have heard about this earlier if the NVMe
device were broken on all systems.  Maybe there's some connection with
VMD?  From the non-working dmesg log in your bug report
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/+attachment/5879970/+files/dmesg-60.txt):

  DMI: ASUSTeK COMPUTER INC. ESC8000 G4/Z11PG-D24 Series, BIOS 5501 04/17/2019
  vmd 0000:d7:05.5: PCI host bridge to bus 10000:00
  pci 10000:00:02.0: [8086:2032] type 01 class 0x060400 PCIe Root Port
  pci 10000:00:02.0: PCI bridge to [bus 01]
  pci 10000:00:02.0: bridge window [mem 0xf8000000-0xf81fffff]: assigned
  pci 10000:01:00.0: [8086:0a54] type 00 class 0x010802 PCIe Endpoint
  pci 10000:01:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]

  <I think vmd_enable_domain() calls pci_reset_bus() here>

  pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
  pci 10000:01:00.0: BAR 0: error updating (high 0x00000000 != 0xffffffff)
  pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
  pci 10000:01:00.0: BAR 0: error updating (0xf8010004 != 0xffffffff)
  nvme nvme0: pci function 10000:01:00.0
  nvme 10000:01:00.0: enabling device (0000 -> 0002)

Things I notice:

  - The 10000:01:00.0 NVMe device is behind a VMD bridge

  - We successfully read the Vendor & Device IDs (8086:0a54)

  - The NVMe device is uninitialized.  We successfully sized the BAR,
    which included successful config reads and writes.  The BAR
    wasn't assigned by BIOS, which is normal since it's behind VMD.

  - We allocated space for BAR 0 but the config writes to program the
    BAR failed.  The read back from the BAR was 0xffffffff; probably a
    PCIe error, e.g., the NVMe device didn't respond.

  - The device *did* respond when nvme_probe() enabled it: the
    "enabling device (0000 -> 0002)" means pci_enable_resources() read
    PCI_COMMAND and got 0x0000.

  - The dmesg from the working config doesn't include the "enabling
    device" line, which suggests that pci_enable_resources() saw
    PCI_COMMAND_MEMORY (0x0002) already set and didn't bother setting
    it again.  I don't know why it would already be set.
   
d591f6804e7e really only changes pci_dev_wait(), which is used after
device resets.  I think vmd_enable_domain() resets the VMD Root Ports
after pci_scan_child_bus(), and maybe we're not waiting long enough
afterwards.

My guess is that we got the ~0 because we did a config read too soon
after reset and the device didn't respond.  The Root Port would time
out, log an error, and synthesize ~0 data to complete the CPU read
(see PCIe r6.0, sec 2.3.2 implementation note).

It's *possible* that we waited long enough but the NVMe device is
broken and didn't respond when it should have, but my money is on a
software defect.
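
To summarize the possible outcomes, here is a sketch (a hypothetical
classify helper for illustration, not kernel code) of what a
PCI_VENDOR_ID dword read can yield after a reset; note that ~0 is
inherently ambiguous because an SR-IOV VF legitimately reads a Vendor ID
of 0xffff:

```c
#include <assert.h>
#include <stdint.h>

/* Possible results of reading the PCI_VENDOR_ID dword after a reset. */
enum vid_result {
	VID_READY,		/* real Vendor/Device ID: device responded   */
	VID_RRS_PENDING,	/* Vendor ID 0x0001: device asked for a retry */
	VID_POSSIBLE_ERROR,	/* ~0: error synthesized by the Root Complex,
				 * or a legitimate VF Vendor ID of 0xffff    */
};

static enum vid_result classify_vendor_id(uint32_t id)
{
	if ((id & 0xffff) == 0x0001)
		return VID_RRS_PENDING;
	if (id == 0xffffffff)
		return VID_POSSIBLE_ERROR;
	return VID_READY;
}
```

The current pci_dev_wait() RRS path collapses the last two cases into
"ready", which is where this device gets into trouble.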

There are a few pci_dbg() calls about these delays; can you set
CONFIG_DYNAMIC_DEBUG=y and boot with dyndbg="file drivers/pci/* +p" to
collect that output?  Please also collect the "sudo lspci -vv" output
from a working system.

Bjorn

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-06-12 16:48 ` Bjorn Helgaas
@ 2025-06-16 11:55   ` Hui Wang
  2025-06-16 13:38     ` Hui Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Hui Wang @ 2025-06-16 11:55 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, Nirmal Patel, Jonathan Derrick


On 6/13/25 00:48, Bjorn Helgaas wrote:
> [+cc VMD folks]
>
> On Wed, Jun 11, 2025 at 06:14:42PM +0800, Hui Wang wrote:
>> Prior to commit d591f6804e7e ("PCI: Wait for device readiness with
>> Configuration RRS"), this Intel nvme [8086:0a54] works well. Since
>> that patch is merged to the kernel, this nvme stops working.
>>
>> Through debugging, we found that commit introduces the RRS polling in
>> the pci_dev_wait(), for this nvme, when polling the PCI_VENDOR_ID, it
>> will return ~0 if the config access is not ready yet, but the polling
>> expects a return value of 0x0001 or a valid vendor_id, so the RRS
>> polling doesn't work for this nvme.
> Sorry for breaking this, and thanks for all your work in debugging
> this!  Issues like this are really hard to track down.
>
> I would think we would have heard about this earlier if the NVMe
> device were broken on all systems.  Maybe there's some connection with
> VMD?  From the non-working dmesg log in your bug report
> (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/+attachment/5879970/+files/dmesg-60.txt):
>
>    DMI: ASUSTeK COMPUTER INC. ESC8000 G4/Z11PG-D24 Series, BIOS 5501 04/17/2019
>    vmd 0000:d7:05.5: PCI host bridge to bus 10000:00
>    pci 10000:00:02.0: [8086:2032] type 01 class 0x060400 PCIe Root Port
>    pci 10000:00:02.0: PCI bridge to [bus 01]
>    pci 10000:00:02.0: bridge window [mem 0xf8000000-0xf81fffff]: assigned
>    pci 10000:01:00.0: [8086:0a54] type 00 class 0x010802 PCIe Endpoint
>    pci 10000:01:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]
>
>    <I think vmd_enable_domain() calls pci_reset_bus() here>

Yes, and pci_dev_wait() is called here. With RRS polling, we get ~0 from
PCI_VENDOR_ID, and subsequently get 0xffffffff when configuring BAR 0.
With the original polling method, pci_dev_wait() provides enough delay,
so the nvme works normally.

The line "[   10.193589] hhhhhhhhhhhhhhhhhhhhhhhhhhhh dev->device = 0a54 
id = ffffffff" is output from pci_dev_wait(); please refer to 
https://launchpadlibrarian.net/798708446/LP2111521-dmesg-test9.txt
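
To make the contrast concrete, a simplified sketch of the two loop
conditions (illustrative only, not the actual pci_dev_wait() code):

```c
#include <assert.h>
#include <stdint.h>

/* Old method: poll PCI_COMMAND; ~0 means "not ready yet", keep waiting. */
static int old_method_keep_waiting(uint32_t command)
{
	return command == 0xffffffff;
}

/* RRS method: poll PCI_VENDOR_ID; only 0x0001 means "keep waiting". */
static int rrs_method_keep_waiting(uint32_t vendor_id)
{
	return (vendor_id & 0xffff) == 0x0001;
}
```

On a synthesized ~0 read, the old method keeps polling (and so
accumulates enough delay), while the RRS method stops waiting
immediately, and the subsequent BAR 0 programming fails as seen in the
log.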

>
>    pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
>    pci 10000:01:00.0: BAR 0: error updating (high 0x00000000 != 0xffffffff)
>    pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
>    pci 10000:01:00.0: BAR 0: error updating (0xf8010004 != 0xffffffff)
>    nvme nvme0: pci function 10000:01:00.0
>    nvme 10000:01:00.0: enabling device (0000 -> 0002)
>
> Things I notice:
>
>    - The 10000:01:00.0 NVMe device is behind a VMD bridge
>
>    - We successfully read the Vendor & Device IDs (8086:0a54)
>
>    - The NVMe device is uninitialized.  We successfully sized the BAR,
>      which included successful config reads and writes.  The BAR
>      wasn't assigned by BIOS, which is normal since it's behind VMD.
>
>    - We allocated space for BAR 0 but the config writes to program the
>      BAR failed.  The read back from the BAR was 0xffffffff; probably a
>      PCIe error, e.g., the NVMe device didn't respond.
>
>    - The device *did* respond when nvme_probe() enabled it: the
>      "enabling device (0000 -> 0002)" means pci_enable_resources() read
>      PCI_COMMAND and got 0x0000.
>
>    - The dmesg from the working config doesn't include the "enabling
>      device" line, which suggests that pci_enable_resources() saw
>      PCI_COMMAND_MEMORY (0x0002) already set and didn't bother setting
>      it again.  I don't know why it would already be set.
>     
> d591f6804e7e really only changes pci_dev_wait(), which is used after
> device resets.  I think vmd_enable_domain() resets the VMD Root Ports
> after pci_scan_child_bus(), and maybe we're not waiting long enough
> afterwards.
>
> My guess is that we got the ~0 because we did a config read too soon
> after reset and the device didn't respond.  The Root Port would time
> out, log an error, and synthesize ~0 data to complete the CPU read
> (see PCIe r6.0, sec 2.3.2 implementation note).
>
> It's *possible* that we waited long enough but the NVMe device is
> broken and didn't respond when it should have, but my money is on a
> software defect.
>
> There are a few pci_dbg() calls about these delays; can you set
> CONFIG_DYNAMIC_DEBUG=y and boot with dyndbg="file drivers/pci/* +p" to
> collect that output?  Please also collect the "sudo lspci -vv" output
> from a working system.

I've already passed the testing request on to the bug reporters and am 
waiting for their feedback.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/55

Thanks,

Hui.


>
> Bjorn

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-06-16 11:55   ` Hui Wang
@ 2025-06-16 13:38     ` Hui Wang
  2025-06-17 20:12       ` Bjorn Helgaas
  0 siblings, 1 reply; 14+ messages in thread
From: Hui Wang @ 2025-06-16 13:38 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, Nirmal Patel, Jonathan Derrick


On 6/16/25 19:55, Hui Wang wrote:
>
> On 6/13/25 00:48, Bjorn Helgaas wrote:
>> [+cc VMD folks]
>>
>> On Wed, Jun 11, 2025 at 06:14:42PM +0800, Hui Wang wrote:
>>> Prior to commit d591f6804e7e ("PCI: Wait for device readiness with
>>> Configuration RRS"), this Intel nvme [8086:0a54] works well. Since
>>> that patch is merged to the kernel, this nvme stops working.
>>>
>>> Through debugging, we found that commit introduces the RRS polling in
>>> the pci_dev_wait(), for this nvme, when polling the PCI_VENDOR_ID, it
>>> will return ~0 if the config access is not ready yet, but the polling
>>> expects a return value of 0x0001 or a valid vendor_id, so the RRS
>>> polling doesn't work for this nvme.
>> Sorry for breaking this, and thanks for all your work in debugging
>> this!  Issues like this are really hard to track down.
>>
>> I would think we would have heard about this earlier if the NVMe
>> device were broken on all systems.  Maybe there's some connection with
>> VMD?  From the non-working dmesg log in your bug report
>> (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/+attachment/5879970/+files/dmesg-60.txt): 
>>
>>
>>    DMI: ASUSTeK COMPUTER INC. ESC8000 G4/Z11PG-D24 Series, BIOS 5501 
>> 04/17/2019
>>    vmd 0000:d7:05.5: PCI host bridge to bus 10000:00
>>    pci 10000:00:02.0: [8086:2032] type 01 class 0x060400 PCIe Root Port
>>    pci 10000:00:02.0: PCI bridge to [bus 01]
>>    pci 10000:00:02.0: bridge window [mem 0xf8000000-0xf81fffff]: 
>> assigned
>>    pci 10000:01:00.0: [8086:0a54] type 00 class 0x010802 PCIe Endpoint
>>    pci 10000:01:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]
>>
>>    <I think vmd_enable_domain() calls pci_reset_bus() here>
>
> Yes, and the pci_dev_wait() is called here. With the RRS polling, will 
> get a ~0 from PCI_VENDOR_ID, then will get 0xfffffff when configuring 
> the BAR0 subsequently. With the original polling method, it will get 
> enough delay in the pci_dev_wait(), so nvme works normally.
>
> The line "[   10.193589] hhhhhhhhhhhhhhhhhhhhhhhhhhhh dev->device = 
> 0a54 id = ffffffff" is output from pci_dev_wait(), please refer to 
> https://launchpadlibrarian.net/798708446/LP2111521-dmesg-test9.txt
>
>>
>>    pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
>>    pci 10000:01:00.0: BAR 0: error updating (high 0x00000000 != 
>> 0xffffffff)
>>    pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
>>    pci 10000:01:00.0: BAR 0: error updating (0xf8010004 != 0xffffffff)
>>    nvme nvme0: pci function 10000:01:00.0
>>    nvme 10000:01:00.0: enabling device (0000 -> 0002)
>>
>> Things I notice:
>>
>>    - The 10000:01:00.0 NVMe device is behind a VMD bridge
>>
>>    - We successfully read the Vendor & Device IDs (8086:0a54)
>>
>>    - The NVMe device is uninitialized.  We successfully sized the BAR,
>>      which included successful config reads and writes.  The BAR
>>      wasn't assigned by BIOS, which is normal since it's behind VMD.
>>
>>    - We allocated space for BAR 0 but the config writes to program the
>>      BAR failed.  The read back from the BAR was 0xffffffff; probably a
>>      PCIe error, e.g., the NVMe device didn't respond.
>>
>>    - The device *did* respond when nvme_probe() enabled it: the
>>      "enabling device (0000 -> 0002)" means pci_enable_resources() read
>>      PCI_COMMAND and got 0x0000.
>>
>>    - The dmesg from the working config doesn't include the "enabling
>>      device" line, which suggests that pci_enable_resources() saw
>>      PCI_COMMAND_MEMORY (0x0002) already set and didn't bother setting
>>      it again.  I don't know why it would already be set.
>>
>> d591f6804e7e really only changes pci_dev_wait(), which is used after
>> device resets.  I think vmd_enable_domain() resets the VMD Root Ports
>> after pci_scan_child_bus(), and maybe we're not waiting long enough
>> afterwards.
>>
>> My guess is that we got the ~0 because we did a config read too soon
>> after reset and the device didn't respond.  The Root Port would time
>> out, log an error, and synthesize ~0 data to complete the CPU read
>> (see PCIe r6.0, sec 2.3.2 implementation note).
>>
>> It's *possible* that we waited long enough but the NVMe device is
>> broken and didn't respond when it should have, but my money is on a
>> software defect.
>>
>> There are a few pci_dbg() calls about these delays; can you set
>> CONFIG_DYNAMIC_DEBUG=y and boot with dyndbg="file drivers/pci/* +p" to
>> collect that output?  Please also collect the "sudo lspci -vv" output
>> from a working system.
>
> Already passed the testing request to bug reporters, wait for their 
> feedback.
>
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/55
>
> Thanks,
>
> Hui.

This is the dmesg with dyndbg="file drivers/pci/* +p"

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/56

And this is the lspci output:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/57

>
>
>>
>> Bjorn

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-06-16 13:38     ` Hui Wang
@ 2025-06-17 20:12       ` Bjorn Helgaas
  2025-06-23 22:58         ` Bjorn Helgaas
  0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2025-06-17 20:12 UTC (permalink / raw)
  To: Hui Wang
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, Nirmal Patel, Jonathan Derrick,
	Chaitanya Kumar Borah, Ville Syrjälä

[+cc Chaitanya, Ville for possible Intel NVMe contacts]

On Mon, Jun 16, 2025 at 09:38:25PM +0800, Hui Wang wrote:
> On 6/16/25 19:55, Hui Wang wrote:
> > On 6/13/25 00:48, Bjorn Helgaas wrote:
> > > [+cc VMD folks]
> > > 
> > > On Wed, Jun 11, 2025 at 06:14:42PM +0800, Hui Wang wrote:
> > > > Prior to commit d591f6804e7e ("PCI: Wait for device readiness with
> > > > Configuration RRS"), this Intel nvme [8086:0a54] works well. Since
> > > > that patch is merged to the kernel, this nvme stops working.
> > > > 
> > > > Through debugging, we found that commit introduces the RRS polling in
> > > > the pci_dev_wait(), for this nvme, when polling the PCI_VENDOR_ID, it
> > > > will return ~0 if the config access is not ready yet, but the polling
> > > > expects a return value of 0x0001 or a valid vendor_id, so the RRS
> > > > polling doesn't work for this nvme.
> > >
> > > Sorry for breaking this, and thanks for all your work in debugging
> > > this!  Issues like this are really hard to track down.
> > > 
> > > I would think we would have heard about this earlier if the NVMe
> > > device were broken on all systems.  Maybe there's some connection with
> > > VMD?  From the non-working dmesg log in your bug report
> > > (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/+attachment/5879970/+files/dmesg-60.txt):
> > > 
> > > 
> > >    DMI: ASUSTeK COMPUTER INC. ESC8000 G4/Z11PG-D24 Series, BIOS 5501
> > > 04/17/2019
> > >    vmd 0000:d7:05.5: PCI host bridge to bus 10000:00
> > >    pci 10000:00:02.0: [8086:2032] type 01 class 0x060400 PCIe Root Port
> > >    pci 10000:00:02.0: PCI bridge to [bus 01]
> > >    pci 10000:00:02.0: bridge window [mem 0xf8000000-0xf81fffff]:
> > > assigned
> > >    pci 10000:01:00.0: [8086:0a54] type 00 class 0x010802 PCIe Endpoint
> > >    pci 10000:01:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]
> > > 
> > >    <I think vmd_enable_domain() calls pci_reset_bus() here>
> > 
> > Yes, and the pci_dev_wait() is called here. With the RRS polling, will
> > get a ~0 from PCI_VENDOR_ID, then will get 0xfffffff when configuring
> > the BAR0 subsequently. With the original polling method, it will get
> > enough delay in the pci_dev_wait(), so nvme works normally.
> > 
> > The line "[   10.193589] hhhhhhhhhhhhhhhhhhhhhhhhhhhh dev->device = 0a54
> > id = ffffffff" is output from pci_dev_wait(), please refer to
> > https://launchpadlibrarian.net/798708446/LP2111521-dmesg-test9.txt
> > 
> > >    pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
> > >    pci 10000:01:00.0: BAR 0: error updating (high 0x00000000 !=
> > > 0xffffffff)
> > >    pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
> > >    pci 10000:01:00.0: BAR 0: error updating (0xf8010004 != 0xffffffff)
> > >    nvme nvme0: pci function 10000:01:00.0
> > >    nvme 10000:01:00.0: enabling device (0000 -> 0002)
> > > 
> > > Things I notice:
> > > 
> > >    - The 10000:01:00.0 NVMe device is behind a VMD bridge
> > > 
> > >    - We successfully read the Vendor & Device IDs (8086:0a54)
> > > 
> > >    - The NVMe device is uninitialized.  We successfully sized the BAR,
> > >      which included successful config reads and writes.  The BAR
> > >      wasn't assigned by BIOS, which is normal since it's behind VMD.
> > > 
> > >    - We allocated space for BAR 0 but the config writes to program the
> > >      BAR failed.  The read back from the BAR was 0xffffffff; probably a
> > >      PCIe error, e.g., the NVMe device didn't respond.
> > > 
> > >    - The device *did* respond when nvme_probe() enabled it: the
> > >      "enabling device (0000 -> 0002)" means pci_enable_resources() read
> > >      PCI_COMMAND and got 0x0000.
> > > 
> > >    - The dmesg from the working config doesn't include the "enabling
> > >      device" line, which suggests that pci_enable_resources() saw
> > >      PCI_COMMAND_MEMORY (0x0002) already set and didn't bother setting
> > >      it again.  I don't know why it would already be set.
> > > 
> > > d591f6804e7e really only changes pci_dev_wait(), which is used after
> > > device resets.  I think vmd_enable_domain() resets the VMD Root Ports
> > > after pci_scan_child_bus(), and maybe we're not waiting long enough
> > > afterwards.
> > > 
> > > My guess is that we got the ~0 because we did a config read too soon
> > > after reset and the device didn't respond.  The Root Port would time
> > > out, log an error, and synthesize ~0 data to complete the CPU read
> > > (see PCIe r6.0, sec 2.3.2 implementation note).
> > > 
> > > It's *possible* that we waited long enough but the NVMe device is
> > > broken and didn't respond when it should have, but my money is on a
> > > software defect.
> > > 
> > > There are a few pci_dbg() calls about these delays; can you set
> > > CONFIG_DYNAMIC_DEBUG=y and boot with dyndbg="file drivers/pci/* +p" to
> > > collect that output?  Please also collect the "sudo lspci -vv" output
> > > from a working system.
> > 
> > Already passed the testing request to bug reporters, wait for their
> > feedback.
> > 
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/55
> 
> This is the dmesg with dyndbg="file drivers/pci/* +p"
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/56

Thank you very much!  I'm stumped.

Both Ports here are capable of 8GT/s, and the Root Port is hot-plug
capable and supports Data Link Layer Link Active Reporting but not DRS:

  10000:00:02.0 Intel Sky Lake-E PCI Express Root Port C
    LnkCap: Port #11, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
            ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
    LnkSta: Speed 8GT/s, Width x4
            TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
    SltCap: AttnBtn- PwrCtrl- MRL- AttnInd+ PwrInd+ HotPlug+ Surprise+
    LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-

  10000:01:00.0 Intel NVMe Datacenter SSD
    LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s <64ns

After a reset, we have to delay before doing config reads.  The wait
time requirements are in PCIe r6.0, sec 6.6.1:

  • ... For cases where system software cannot determine that DRS is
    supported by the attached device, or by the Downstream Port above
    the attached device [as in this case]:

    ◦ ...

    ◦ With a Downstream Port that supports Link speeds greater than
      5.0 GT/s, software must wait a minimum of 100 ms after Link
      training completes before sending a Configuration Request to the
      device immediately below that Port. Software can determine when
      Link training completes by polling the Data Link Layer Link
      Active bit or by setting up an associated interrupt (see §
      Section 6.7.3.3). It is strongly recommended for software to
      use this mechanism whenever the Downstream Port supports it.

Since the Port supports 8GT/s, we should wait 100ms after Link training
completes (PCI_EXP_LNKSTA_DLLLA becomes set), and based on the code
path below, I think we *do*:

  pci_bridge_secondary_bus_reset
    pcibios_reset_secondary_bus
      pci_reset_secondary_bus
        # assert PCI_BRIDGE_CTL_BUS_RESET
    pci_bridge_wait_for_secondary_bus(bridge, "bus reset")
      pci_dbg("waiting %d ms for downstream link, after activation\n") # 10.86
      delay = pci_bus_max_d3cold_delay()    # default 100ms
      pcie_wait_for_link_delay(bridge, active=true, delay=100)
        pcie_wait_for_link_status(use_lt=false, active=true)
          end_jiffies = jiffies + PCIE_LINK_RETRAIN_TIMEOUT_MS
          do {
            pcie_capability_read_word(PCI_EXP_LNKSTA, &lnksta)
            if ((lnksta & PCI_EXP_LNKSTA_DLLLA) == PCI_EXP_LNKSTA_DLLLA)
              return 0
            msleep(1)                       # likely wait several ms for link active
          } while (time_before(jiffies, end_jiffies))
        if (active)                         # (true)
          msleep(delay)                     # wait 100ms here
      pci_dev_wait(child, "bus reset", PCIE_RESET_READY_POLL_MS - delay)
        delay = 1
        for (;;) {
          pci_read_config_dword(PCI_VENDOR_ID, &id)
          if (!pci_bus_rrs_vendor_id(id))   # got 0xffff, assume valid
            break;
        }
        pci_dbg("ready 0ms after bus reset", delay - 1)  # 11.11

And from the dmesg log:

  [   10.862226] pci 10000:00:02.0: waiting 100 ms for downstream link, after activation
  [   11.109581] pci 10000:01:00.0: ready 0ms after bus reset

it looks like we waited about .25s (250ms) after the link came up
before trying to read the Vendor ID, which should be plenty.
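
As a quick check of the arithmetic, using the two dmesg timestamps
quoted above:

```c
#include <assert.h>

/* dmesg timestamps, in seconds */
static const double t_link_wait = 10.862226; /* "waiting 100 ms for downstream link" */
static const double t_ready     = 11.109581; /* "ready 0ms after bus reset" */

/* Elapsed time between the two messages, rounded to milliseconds */
static int elapsed_ms(void)
{
	return (int)((t_ready - t_link_wait) * 1000.0 + 0.5);
}
```

That is about 247 ms, well over the 100 ms minimum required by sec
6.6.1.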

I guess maybe this device requires more than 250ms after reset before
it can respond?  That still seems surprising to me; I *assume* Intel
would have tested this for spec conformance.  But maybe there's some
erratum.  I added a couple Intel folks who are mentioned in the nvme
history in case they have pointers.

But it *is* also true that pci_dev_wait() doesn't know how to handle
PCIe errors at all, and maybe there's a way to make it smarter.

Can you try adding the patch below and collect another dmesg log (with
dyndbg="file drivers/pci/* +p" as before)?  This isn't a fix, but
might give a little more insight into what's happening.

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e9448d55113b..42a36ff5c6cd 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1268,6 +1268,7 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 	bool retrain = false;
 	struct pci_dev *root, *bridge;
 
+	pci_dbg(dev, "%s: %s timeout %d\n", __func__, reset_type, timeout);
 	root = pcie_find_root_port(dev);
 
 	if (pci_is_pcie(dev)) {
@@ -1305,14 +1306,32 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 
 		if (root && root->config_rrs_sv) {
 			pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
-			if (!pci_bus_rrs_vendor_id(id))
-				break;
+			pci_dbg(dev, "%s: vf %d read %#06x\n", __func__,
+				dev->is_virtfn, id);
+			if (pci_bus_rrs_vendor_id(id))
+				goto retry;
+
+			/*
+			 * We might read 0xffff if the device is a VF and
+			 * the read was successful (the VF Vendor ID is
+			 * 0xffff per spec).
+			 *
+			 * If the device is not a VF, 0xffff likely means
+			 * there was an error on PCIe.  E.g., maybe the
+			 * device couldn't even respond with RRS status,
+			 * and the RC timed out and synthesized ~0 data.
+			 */
+			if (PCI_POSSIBLE_ERROR(id) && !dev->is_virtfn)
+				goto retry;
+
+			break;
 		} else {
 			pci_read_config_dword(dev, PCI_COMMAND, &id);
 			if (!PCI_POSSIBLE_ERROR(id))
 				break;
 		}
 
+retry:
 		if (delay > timeout) {
 			pci_warn(dev, "not ready %dms after %s; giving up\n",
 				 delay - 1, reset_type);
@@ -4760,6 +4779,8 @@ static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
 	 * Some controllers might not implement link active reporting. In this
 	 * case, we wait for 1000 ms + any delay requested by the caller.
 	 */
+	pci_dbg(pdev, "%s: active %d delay %d link_active_reporting %d\n",
+		__func__, active, delay, pdev->link_active_reporting);
 	if (!pdev->link_active_reporting) {
 		msleep(PCIE_LINK_RETRAIN_TIMEOUT_MS + delay);
 		return true;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-06-17 20:12       ` Bjorn Helgaas
@ 2025-06-23 22:58         ` Bjorn Helgaas
  2025-06-24  0:58           ` Hui Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2025-06-23 22:58 UTC (permalink / raw)
  To: Hui Wang
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, Nirmal Patel, Jonathan Derrick,
	Chaitanya Kumar Borah, Ville Syrjälä

Ping, any chance you could try the patch below to collect a little
more data?  The proposed quirk covers up a deeper problem, and I want
to fix that problem instead of covering it up.

On Tue, Jun 17, 2025 at 03:12:55PM -0500, Bjorn Helgaas wrote:
> [+cc Chaitanya, Ville for possible Intel NVMe contacts]
> 
> On Mon, Jun 16, 2025 at 09:38:25PM +0800, Hui Wang wrote:
> > On 6/16/25 19:55, Hui Wang wrote:
> > > On 6/13/25 00:48, Bjorn Helgaas wrote:
> > > > [+cc VMD folks]
> > > > 
> > > > On Wed, Jun 11, 2025 at 06:14:42PM +0800, Hui Wang wrote:
> > > > > Prior to commit d591f6804e7e ("PCI: Wait for device readiness with
> > > > > Configuration RRS"), this Intel nvme [8086:0a54] works well. Since
> > > > > that patch is merged to the kernel, this nvme stops working.
> > > > > 
> > > > > Through debugging, we found that commit introduces the RRS polling in
> > > > > the pci_dev_wait(), for this nvme, when polling the PCI_VENDOR_ID, it
> > > > > will return ~0 if the config access is not ready yet, but the polling
> > > > > expects a return value of 0x0001 or a valid vendor_id, so the RRS
> > > > > polling doesn't work for this nvme.
> > > >
> > > > Sorry for breaking this, and thanks for all your work in debugging
> > > > this!  Issues like this are really hard to track down.
> > > > 
> > > > I would think we would have heard about this earlier if the NVMe
> > > > device were broken on all systems.  Maybe there's some connection with
> > > > VMD?  From the non-working dmesg log in your bug report
> > > > (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/+attachment/5879970/+files/dmesg-60.txt):
> > > > 
> > > > 
> > > >    DMI: ASUSTeK COMPUTER INC. ESC8000 G4/Z11PG-D24 Series, BIOS 5501 04/17/2019
> > > >    vmd 0000:d7:05.5: PCI host bridge to bus 10000:00
> > > >    pci 10000:00:02.0: [8086:2032] type 01 class 0x060400 PCIe Root Port
> > > >    pci 10000:00:02.0: PCI bridge to [bus 01]
> > > >    pci 10000:00:02.0: bridge window [mem 0xf8000000-0xf81fffff]: assigned
> > > >    pci 10000:01:00.0: [8086:0a54] type 00 class 0x010802 PCIe Endpoint
> > > >    pci 10000:01:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]
> > > > 
> > > >    <I think vmd_enable_domain() calls pci_reset_bus() here>
> > > 
> > > Yes, and pci_dev_wait() is called here. With the RRS polling, it
> > > gets ~0 from PCI_VENDOR_ID, and then gets 0xffffffff when programming
> > > BAR0 subsequently. With the original polling method, it gets enough
> > > delay in pci_dev_wait(), so the nvme works normally.
> > > 
> > > The line "[   10.193589] hhhhhhhhhhhhhhhhhhhhhhhhhhhh dev->device = 0a54
> > > id = ffffffff" is output from pci_dev_wait(), please refer to
> > > https://launchpadlibrarian.net/798708446/LP2111521-dmesg-test9.txt
> > > 
> > > >    pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
> > > >    pci 10000:01:00.0: BAR 0: error updating (high 0x00000000 != 0xffffffff)
> > > >    pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
> > > >    pci 10000:01:00.0: BAR 0: error updating (0xf8010004 != 0xffffffff)
> > > >    nvme nvme0: pci function 10000:01:00.0
> > > >    nvme 10000:01:00.0: enabling device (0000 -> 0002)
> > > > 
> > > > Things I notice:
> > > > 
> > > >    - The 10000:01:00.0 NVMe device is behind a VMD bridge
> > > > 
> > > >    - We successfully read the Vendor & Device IDs (8086:0a54)
> > > > 
> > > >    - The NVMe device is uninitialized.  We successfully sized the BAR,
> > > >      which included successful config reads and writes.  The BAR
> > > >      wasn't assigned by BIOS, which is normal since it's behind VMD.
> > > > 
> > > >    - We allocated space for BAR 0 but the config writes to program the
> > > >      BAR failed.  The read back from the BAR was 0xffffffff; probably a
> > > >      PCIe error, e.g., the NVMe device didn't respond.
> > > > 
> > > >    - The device *did* respond when nvme_probe() enabled it: the
> > > >      "enabling device (0000 -> 0002)" means pci_enable_resources() read
> > > >      PCI_COMMAND and got 0x0000.
> > > > 
> > > >    - The dmesg from the working config doesn't include the "enabling
> > > >      device" line, which suggests that pci_enable_resources() saw
> > > >      PCI_COMMAND_MEMORY (0x0002) already set and didn't bother setting
> > > >      it again.  I don't know why it would already be set.
> > > >      d591f6804e7e really only changes pci_dev_wait(), which is used after
> > > >      device resets.  I think vmd_enable_domain() resets the VMD Root Ports
> > > >      after pci_scan_child_bus(), and maybe we're not waiting long enough
> > > >      afterwards.
> > > > 
> > > > My guess is that we got the ~0 because we did a config read too soon
> > > > after reset and the device didn't respond.  The Root Port would time
> > > > out, log an error, and synthesize ~0 data to complete the CPU read
> > > > (see PCIe r6.0, sec 2.3.2 implementation note).
> > > > 
> > > > It's *possible* that we waited long enough but the NVMe device is
> > > > broken and didn't respond when it should have, but my money is on a
> > > > software defect.
> > > > 
> > > > There are a few pci_dbg() calls about these delays; can you set
> > > > CONFIG_DYNAMIC_DEBUG=y and boot with dyndbg="file drivers/pci/* +p" to
> > > > collect that output?  Please also collect the "sudo lspci -vv" output
> > > > from a working system.
> > > 
> > > Already passed the testing request to the bug reporters; waiting
> > > for their feedback.
> > > 
> > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/55
> > 
> > This is the dmesg with dyndbg="file drivers/pci/* +p"
> > 
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/56
> 
> Thank you very much!  I'm stumped.
> 
> Both Ports here are capable of 8GT/s, and the Root Port is hot-plug
> capable and supports Data Link Layer Link Active Reporting but not DRS:
> 
>   10000:00:02.0 Intel Sky Lake-E PCI Express Root Port C
>     LnkCap: Port #11, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
>             ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
>     LnkSta: Speed 8GT/s, Width x4
>             TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
>     SltCap: AttnBtn- PwrCtrl- MRL- AttnInd+ PwrInd+ HotPlug+ Surprise+
>     LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
> 
>   10000:01:00.0 Intel NVMe Datacenter SSD
>     LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s <64ns
> 
> After a reset, we have to delay before doing config reads.  The wait
> time requirements are in PCIe r6.0, sec 6.6.1:
> 
>   • ... For cases where system software cannot determine that DRS is
>     supported by the attached device, or by the Downstream Port above
>     the attached device [as in this case]:
> 
>     ◦ ...
> 
>     ◦ With a Downstream Port that supports Link speeds greater than
>       5.0 GT/s, software must wait a minimum of 100 ms after Link
>       training completes before sending a Configuration Request to the
>       device immediately below that Port. Software can determine when
>       Link training completes by polling the Data Link Layer Link
>       Active bit or by setting up an associated interrupt (see §
>       Section 6.7.3.3). It is strongly recommended for software to
>       use this mechanism whenever the Downstream Port supports it.
> 
> Since the Port supports 8GT/s, we should wait 100ms after Link training
> completes (PCI_EXP_LNKSTA_DLLLA becomes set), and based on the code
> path below, I think we *do*:
> 
>   pci_bridge_secondary_bus_reset
>     pcibios_reset_secondary_bus
>       pci_reset_secondary_bus
>         # assert PCI_BRIDGE_CTL_BUS_RESET
>     pci_bridge_wait_for_secondary_bus(bridge, "bus reset")
>       pci_dbg("waiting %d ms for downstream link, after activation\n") # 10.86
>       delay = pci_bus_max_d3cold_delay()    # default 100ms
>       pcie_wait_for_link_delay(bridge, active=true, delay=100)
>         pcie_wait_for_link_status(use_lt=false, active=true)
>           end_jiffies = jiffies + PCIE_LINK_RETRAIN_TIMEOUT_MS
>           do {
>             pcie_capability_read_word(PCI_EXP_LNKSTA, &lnksta)
>             if ((lnksta & PCI_EXP_LNKSTA_DLLLA) == PCI_EXP_LNKSTA_DLLLA)
>               return 0
>             msleep(1)                       # likely wait several ms for link active
>           } while (time_before(jiffies, end_jiffies))
>         if (active)                         # (true)
>           msleep(delay)                     # wait 100ms here
>       pci_dev_wait(child, "bus reset", PCIE_RESET_READY_POLL_MS - delay)
>         delay = 1
>         for (;;) {
>           pci_read_config_dword(PCI_VENDOR_ID, &id)
>           if (!pci_bus_rrs_vendor_id(id))   # got 0xffff, assume valid
>             break;
>         }
>         pci_dbg("ready 0ms after bus reset", delay - 1)  # 11.11
> 
> And from the dmesg log:
> 
>   [   10.862226] pci 10000:00:02.0: waiting 100 ms for downstream link, after activation
>   [   11.109581] pci 10000:01:00.0: ready 0ms after bus reset
> 
> it looks like we waited about .25s (250ms) after the link came up
> before trying to read the Vendor ID, which should be plenty.
> 
> I guess maybe this device requires more than 250ms after reset before
> it can respond?  That still seems surprising to me; I *assume* Intel
> would have tested this for spec conformance.  But maybe there's some
> erratum.  I added a couple Intel folks who are mentioned in the nvme
> history in case they have pointers.
> 
> But it *is* also true that pci_dev_wait() doesn't know how to handle
> PCIe errors at all, and maybe there's a way to make it smarter.
> 
> Can you try adding the patch below and collect another dmesg log (with
> dyndbg="file drivers/pci/* +p" as before)?  This isn't a fix, but
> might give a little more insight into what's happening.
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e9448d55113b..42a36ff5c6cd 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1268,6 +1268,7 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  	bool retrain = false;
>  	struct pci_dev *root, *bridge;
>  
> +	pci_dbg(dev, "%s: %s timeout %d\n", __func__, reset_type, timeout);
>  	root = pcie_find_root_port(dev);
>  
>  	if (pci_is_pcie(dev)) {
> @@ -1305,14 +1306,32 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>  
>  		if (root && root->config_rrs_sv) {
>  			pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> -			if (!pci_bus_rrs_vendor_id(id))
> -				break;
> +			pci_dbg(dev, "%s: vf %d read %#06x\n", __func__,
> +				dev->is_virtfn, id);
> +			if (pci_bus_rrs_vendor_id(id))
> +				goto retry;
> +
> +			/*
> +			 * We might read 0xffff if the device is a VF and
> +			 * the read was successful (the VF Vendor ID is
> +			 * 0xffff per spec).
> +			 *
> +			 * If the device is not a VF, 0xffff likely means
> +			 * there was an error on PCIe.  E.g., maybe the
> +			 * device couldn't even respond with RRS status,
> +			 * and the RC timed out and synthesized ~0 data.
> +			 */
> +			if (PCI_POSSIBLE_ERROR(id) && !dev->is_virtfn)
> +				    goto retry;
> +
> +			break;
>  		} else {
>  			pci_read_config_dword(dev, PCI_COMMAND, &id);
>  			if (!PCI_POSSIBLE_ERROR(id))
>  				break;
>  		}
>  
> +retry:
>  		if (delay > timeout) {
>  			pci_warn(dev, "not ready %dms after %s; giving up\n",
>  				 delay - 1, reset_type);
> @@ -4760,6 +4779,8 @@ static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
>  	 * Some controllers might not implement link active reporting. In this
>  	 * case, we wait for 1000 ms + any delay requested by the caller.
>  	 */
> +	pci_dbg(pdev, "%s: active %d delay %d link_active_reporting %d\n",
> +		__func__, active, delay, pdev->link_active_reporting);
>  	if (!pdev->link_active_reporting) {
>  		msleep(PCIE_LINK_RETRAIN_TIMEOUT_MS + delay);
>  		return true;


* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-06-23 22:58         ` Bjorn Helgaas
@ 2025-06-24  0:58           ` Hui Wang
  2025-07-01 23:23             ` Bjorn Helgaas
  0 siblings, 1 reply; 14+ messages in thread
From: Hui Wang @ 2025-06-24  0:58 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, Nirmal Patel, Jonathan Derrick,
	Chaitanya Kumar Borah, Ville Syrjälä

Sorry for the late response; I was OOO the past week.

This is the log after applying your patch:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/61

Looks like the "retry" makes the nvme work.


On 6/24/25 06:58, Bjorn Helgaas wrote:
> Ping, any chance you could try the patch below to collect a little
> more data?  The proposed quirk covers up a deeper problem, and I want
> to fix that problem instead of covering it up.
>
> On Tue, Jun 17, 2025 at 03:12:55PM -0500, Bjorn Helgaas wrote:
>> [+cc Chaitanya, Ville for possible Intel NVMe contacts]
>>
>> On Mon, Jun 16, 2025 at 09:38:25PM +0800, Hui Wang wrote:
>>> On 6/16/25 19:55, Hui Wang wrote:
>>>> On 6/13/25 00:48, Bjorn Helgaas wrote:
>>>>> [+cc VMD folks]
>>>>>
>>>>> On Wed, Jun 11, 2025 at 06:14:42PM +0800, Hui Wang wrote:
>>>>>> Prior to commit d591f6804e7e ("PCI: Wait for device readiness with
>>>>>> Configuration RRS"), this Intel nvme [8086:0a54] works well. Since
>>>>>> that patch was merged into the kernel, this nvme has stopped working.
>>>>>>
>>>>>> Through debugging, we found that commit introduces RRS polling in
>>>>>> pci_dev_wait(). For this nvme, polling PCI_VENDOR_ID returns ~0 if
>>>>>> the config access is not ready yet, but the polling expects a return
>>>>>> value of 0x0001 or a valid vendor ID, so the RRS polling doesn't
>>>>>> work for this nvme.
>>>>> Sorry for breaking this, and thanks for all your work in debugging
>>>>> this!  Issues like this are really hard to track down.
>>>>>
>>>>> I would think we would have heard about this earlier if the NVMe
>>>>> device were broken on all systems.  Maybe there's some connection with
>>>>> VMD?  From the non-working dmesg log in your bug report
>>>>> (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/+attachment/5879970/+files/dmesg-60.txt):
>>>>>
>>>>>
>>>>>     DMI: ASUSTeK COMPUTER INC. ESC8000 G4/Z11PG-D24 Series, BIOS 5501 04/17/2019
>>>>>     vmd 0000:d7:05.5: PCI host bridge to bus 10000:00
>>>>>     pci 10000:00:02.0: [8086:2032] type 01 class 0x060400 PCIe Root Port
>>>>>     pci 10000:00:02.0: PCI bridge to [bus 01]
>>>>>     pci 10000:00:02.0: bridge window [mem 0xf8000000-0xf81fffff]: assigned
>>>>>     pci 10000:01:00.0: [8086:0a54] type 00 class 0x010802 PCIe Endpoint
>>>>>     pci 10000:01:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]
>>>>>
>>>>>     <I think vmd_enable_domain() calls pci_reset_bus() here>
>>>> Yes, and pci_dev_wait() is called here. With the RRS polling, it
>>>> gets ~0 from PCI_VENDOR_ID, and then gets 0xffffffff when programming
>>>> BAR0 subsequently. With the original polling method, it gets enough
>>>> delay in pci_dev_wait(), so the nvme works normally.
>>>>
>>>> The line "[   10.193589] hhhhhhhhhhhhhhhhhhhhhhhhhhhh dev->device = 0a54
>>>> id = ffffffff" is output from pci_dev_wait(), please refer to
>>>> https://launchpadlibrarian.net/798708446/LP2111521-dmesg-test9.txt
>>>>
>>>>>     pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
>>>>>     pci 10000:01:00.0: BAR 0: error updating (high 0x00000000 != 0xffffffff)
>>>>>     pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
>>>>>     pci 10000:01:00.0: BAR 0: error updating (0xf8010004 != 0xffffffff)
>>>>>     nvme nvme0: pci function 10000:01:00.0
>>>>>     nvme 10000:01:00.0: enabling device (0000 -> 0002)
>>>>>
>>>>> Things I notice:
>>>>>
>>>>>     - The 10000:01:00.0 NVMe device is behind a VMD bridge
>>>>>
>>>>>     - We successfully read the Vendor & Device IDs (8086:0a54)
>>>>>
>>>>>     - The NVMe device is uninitialized.  We successfully sized the BAR,
>>>>>       which included successful config reads and writes.  The BAR
>>>>>       wasn't assigned by BIOS, which is normal since it's behind VMD.
>>>>>
>>>>>     - We allocated space for BAR 0 but the config writes to program the
>>>>>       BAR failed.  The read back from the BAR was 0xffffffff; probably a
>>>>>       PCIe error, e.g., the NVMe device didn't respond.
>>>>>
>>>>>     - The device *did* respond when nvme_probe() enabled it: the
>>>>>       "enabling device (0000 -> 0002)" means pci_enable_resources() read
>>>>>       PCI_COMMAND and got 0x0000.
>>>>>
>>>>>     - The dmesg from the working config doesn't include the "enabling
>>>>>       device" line, which suggests that pci_enable_resources() saw
>>>>>       PCI_COMMAND_MEMORY (0x0002) already set and didn't bother setting
>>>>>       it again.  I don't know why it would already be set.
>>>>>       d591f6804e7e really only changes pci_dev_wait(), which is used after
>>>>>       device resets.  I think vmd_enable_domain() resets the VMD Root Ports
>>>>>       after pci_scan_child_bus(), and maybe we're not waiting long enough
>>>>>       afterwards.
>>>>>
>>>>> My guess is that we got the ~0 because we did a config read too soon
>>>>> after reset and the device didn't respond.  The Root Port would time
>>>>> out, log an error, and synthesize ~0 data to complete the CPU read
>>>>> (see PCIe r6.0, sec 2.3.2 implementation note).
>>>>>
>>>>> It's *possible* that we waited long enough but the NVMe device is
>>>>> broken and didn't respond when it should have, but my money is on a
>>>>> software defect.
>>>>>
>>>>> There are a few pci_dbg() calls about these delays; can you set
>>>>> CONFIG_DYNAMIC_DEBUG=y and boot with dyndbg="file drivers/pci/* +p" to
>>>>> collect that output?  Please also collect the "sudo lspci -vv" output
>>>>> from a working system.
>>>> Already passed the testing request to the bug reporters; waiting
>>>> for their feedback.
>>>>
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/55
>>> This is the dmesg with dyndbg="file drivers/pci/* +p"
>>>
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/56
>> Thank you very much!  I'm stumped.
>>
>> Both Ports here are capable of 8GT/s, and the Root Port is hot-plug
>> capable and supports Data Link Layer Link Active Reporting but not DRS:
>>
>>    10000:00:02.0 Intel Sky Lake-E PCI Express Root Port C
>>      LnkCap: Port #11, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
>>              ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
>>      LnkSta: Speed 8GT/s, Width x4
>>              TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
>>      SltCap: AttnBtn- PwrCtrl- MRL- AttnInd+ PwrInd+ HotPlug+ Surprise+
>>      LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
>>
>>    10000:01:00.0 Intel NVMe Datacenter SSD
>>      LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s <64ns
>>
>> After a reset, we have to delay before doing config reads.  The wait
>> time requirements are in PCIe r6.0, sec 6.6.1:
>>
>>    • ... For cases where system software cannot determine that DRS is
>>      supported by the attached device, or by the Downstream Port above
>>      the attached device [as in this case]:
>>
>>      ◦ ...
>>
>>      ◦ With a Downstream Port that supports Link speeds greater than
>>        5.0 GT/s, software must wait a minimum of 100 ms after Link
>>        training completes before sending a Configuration Request to the
>>        device immediately below that Port. Software can determine when
>>        Link training completes by polling the Data Link Layer Link
>>        Active bit or by setting up an associated interrupt (see §
>>        Section 6.7.3.3). It is strongly recommended for software to
>>        use this mechanism whenever the Downstream Port supports it.
>>
>> Since the Port supports 8GT/s, we should wait 100ms after Link training
>> completes (PCI_EXP_LNKSTA_DLLLA becomes set), and based on the code
>> path below, I think we *do*:
>>
>>    pci_bridge_secondary_bus_reset
>>      pcibios_reset_secondary_bus
>>        pci_reset_secondary_bus
>>          # assert PCI_BRIDGE_CTL_BUS_RESET
>>      pci_bridge_wait_for_secondary_bus(bridge, "bus reset")
>>        pci_dbg("waiting %d ms for downstream link, after activation\n") # 10.86
>>        delay = pci_bus_max_d3cold_delay()    # default 100ms
>>        pcie_wait_for_link_delay(bridge, active=true, delay=100)
>>          pcie_wait_for_link_status(use_lt=false, active=true)
>>            end_jiffies = jiffies + PCIE_LINK_RETRAIN_TIMEOUT_MS
>>            do {
>>              pcie_capability_read_word(PCI_EXP_LNKSTA, &lnksta)
>>              if ((lnksta & PCI_EXP_LNKSTA_DLLLA) == PCI_EXP_LNKSTA_DLLLA)
>>                return 0
>>              msleep(1)                       # likely wait several ms for link active
>>            } while (time_before(jiffies, end_jiffies))
>>          if (active)                         # (true)
>>            msleep(delay)                     # wait 100ms here
>>        pci_dev_wait(child, "bus reset", PCIE_RESET_READY_POLL_MS - delay)
>>          delay = 1
>>          for (;;) {
>>            pci_read_config_dword(PCI_VENDOR_ID, &id)
>>            if (!pci_bus_rrs_vendor_id(id))   # got 0xffff, assume valid
>>              break;
>>          }
>>          pci_dbg("ready 0ms after bus reset", delay - 1)  # 11.11
>>
>> And from the dmesg log:
>>
>>    [   10.862226] pci 10000:00:02.0: waiting 100 ms for downstream link, after activation
>>    [   11.109581] pci 10000:01:00.0: ready 0ms after bus reset
>>
>> it looks like we waited about .25s (250ms) after the link came up
>> before trying to read the Vendor ID, which should be plenty.
>>
>> I guess maybe this device requires more than 250ms after reset before
>> it can respond?  That still seems surprising to me; I *assume* Intel
>> would have tested this for spec conformance.  But maybe there's some
>> erratum.  I added a couple Intel folks who are mentioned in the nvme
>> history in case they have pointers.
>>
>> But it *is* also true that pci_dev_wait() doesn't know how to handle
>> PCIe errors at all, and maybe there's a way to make it smarter.
>>
>> Can you try adding the patch below and collect another dmesg log (with
>> dyndbg="file drivers/pci/* +p" as before)?  This isn't a fix, but
>> might give a little more insight into what's happening.
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e9448d55113b..42a36ff5c6cd 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -1268,6 +1268,7 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>>   	bool retrain = false;
>>   	struct pci_dev *root, *bridge;
>>   
>> +	pci_dbg(dev, "%s: %s timeout %d\n", __func__, reset_type, timeout);
>>   	root = pcie_find_root_port(dev);
>>   
>>   	if (pci_is_pcie(dev)) {
>> @@ -1305,14 +1306,32 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>>   
>>   		if (root && root->config_rrs_sv) {
>>   			pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
>> -			if (!pci_bus_rrs_vendor_id(id))
>> -				break;
>> +			pci_dbg(dev, "%s: vf %d read %#06x\n", __func__,
>> +				dev->is_virtfn, id);
>> +			if (pci_bus_rrs_vendor_id(id))
>> +				goto retry;
>> +
>> +			/*
>> +			 * We might read 0xffff if the device is a VF and
>> +			 * the read was successful (the VF Vendor ID is
>> +			 * 0xffff per spec).
>> +			 *
>> +			 * If the device is not a VF, 0xffff likely means
>> +			 * there was an error on PCIe.  E.g., maybe the
>> +			 * device couldn't even respond with RRS status,
>> +			 * and the RC timed out and synthesized ~0 data.
>> +			 */
>> +			if (PCI_POSSIBLE_ERROR(id) && !dev->is_virtfn)
>> +				    goto retry;
>> +
>> +			break;
>>   		} else {
>>   			pci_read_config_dword(dev, PCI_COMMAND, &id);
>>   			if (!PCI_POSSIBLE_ERROR(id))
>>   				break;
>>   		}
>>   
>> +retry:
>>   		if (delay > timeout) {
>>   			pci_warn(dev, "not ready %dms after %s; giving up\n",
>>   				 delay - 1, reset_type);
>> @@ -4760,6 +4779,8 @@ static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
>>   	 * Some controllers might not implement link active reporting. In this
>>   	 * case, we wait for 1000 ms + any delay requested by the caller.
>>   	 */
>> +	pci_dbg(pdev, "%s: active %d delay %d link_active_reporting %d\n",
>> +		__func__, active, delay, pdev->link_active_reporting);
>>   	if (!pdev->link_active_reporting) {
>>   		msleep(PCIE_LINK_RETRAIN_TIMEOUT_MS + delay);
>>   		return true;


* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-06-24  0:58           ` Hui Wang
@ 2025-07-01 23:23             ` Bjorn Helgaas
  2025-07-02  9:43               ` Hui Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2025-07-01 23:23 UTC (permalink / raw)
  To: Hui Wang
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, Nirmal Patel, Jonathan Derrick,
	Chaitanya Kumar Borah, Ville Syrjälä

On Tue, Jun 24, 2025 at 08:58:57AM +0800, Hui Wang wrote:
> Sorry for the late response; I was OOO the past week.
> 
> This is the log after applying your patch:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/61
> 
> Looks like the "retry" makes the nvme work.

Thank you!  It seems like we get 0xffffffff (probably PCIe error) for
a long time after we think the device should be able to respond with
RRS.

I always thought the spec required that after the delays, a device
should respond with RRS if it's not ready, but now I guess I'm not
100% sure.  Maybe it's allowed to just do nothing, which would lead to
the Root Port timing out and logging an Unsupported Request error.

Can I trouble you to try the patch below?  I think we might have to
start explicitly checking for that error.  That probably would require
some setup to enable the error, check for it, and clear it.  I hacked
in some of that here, but ultimately some of it should go elsewhere.

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e9448d55113b..c276d0a2b522 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1264,10 +1264,13 @@ void pci_resume_bus(struct pci_bus *bus)
 
 static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 {
-	int delay = 1;
+	int delay = 10;
 	bool retrain = false;
 	struct pci_dev *root, *bridge;
+	u16 devctl, devsta;
 
+	pci_info(dev, "%s: VF%c %s timeout %d\n", __func__,
+		 dev->is_virtfn ? '+' : '-', reset_type, timeout);
 	root = pcie_find_root_port(dev);
 
 	if (pci_is_pcie(dev)) {
@@ -1276,6 +1279,19 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 			retrain = true;
 	}
 
+	if (root) {
+		pcie_capability_read_word(root, PCI_EXP_DEVCTL, &devctl);
+		if (!(devctl & PCI_EXP_DEVCTL_URRE))
+			pcie_capability_write_word(root, PCI_EXP_DEVCTL,
+					    devctl | PCI_EXP_DEVCTL_URRE);
+		pcie_capability_read_word(root, PCI_EXP_DEVSTA, &devsta);
+		if (devsta & PCI_EXP_DEVSTA_URD)
+			pcie_capability_write_word(root, PCI_EXP_DEVSTA,
+						   PCI_EXP_DEVSTA_URD);
+		pci_info(root, "%s: DEVCTL %#06x DEVSTA %#06x\n", __func__,
+			 devctl, devsta);
+	}
+
 	/*
 	 * The caller has already waited long enough after a reset that the
 	 * device should respond to config requests, but it may respond
@@ -1305,14 +1321,33 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 
 		if (root && root->config_rrs_sv) {
 			pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
-			if (!pci_bus_rrs_vendor_id(id))
-				break;
+
+			if (pci_bus_rrs_vendor_id(id)) {
+				pci_info(dev, "%s: read %#06x (RRS)\n",
+					 __func__, id);
+				goto retry;
+			}
+
+			if (PCI_POSSIBLE_ERROR(id)) {
+				pcie_capability_read_word(root, PCI_EXP_DEVSTA,
+							  &devsta);
+				if (devsta & PCI_EXP_DEVSTA_URD)
+					pcie_capability_write_word(root,
+							    PCI_EXP_DEVSTA,
+							    PCI_EXP_DEVSTA_URD);
+				pci_info(root, "%s: read %#06x DEVSTA %#06x\n",
+					 __func__, id, devsta);
+				goto retry;
+			}
+
+			break;
 		} else {
 			pci_read_config_dword(dev, PCI_COMMAND, &id);
 			if (!PCI_POSSIBLE_ERROR(id))
 				break;
 		}
 
+retry:
 		if (delay > timeout) {
 			pci_warn(dev, "not ready %dms after %s; giving up\n",
 				 delay - 1, reset_type);
@@ -1332,7 +1367,6 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
 		}
 
 		msleep(delay);
-		delay *= 2;
 	}
 
 	if (delay > PCI_RESET_WAIT)
@@ -4670,8 +4704,10 @@ static int pcie_wait_for_link_status(struct pci_dev *pdev,
 	end_jiffies = jiffies + msecs_to_jiffies(PCIE_LINK_RETRAIN_TIMEOUT_MS);
 	do {
 		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnksta);
-		if ((lnksta & lnksta_mask) == lnksta_match)
+		if ((lnksta & lnksta_mask) == lnksta_match) {
+			pci_info(pdev, "%s: LNKSTA %#06x\n", __func__, lnksta);
 			return 0;
+		}
 		msleep(1);
 	} while (time_before(jiffies, end_jiffies));
 
@@ -4760,6 +4796,8 @@ static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
 	 * Some controllers might not implement link active reporting. In this
 	 * case, we wait for 1000 ms + any delay requested by the caller.
 	 */
+	pci_info(pdev, "%s: active %d delay %d link_active_reporting %d\n",
+		 __func__, active, delay, pdev->link_active_reporting);
 	if (!pdev->link_active_reporting) {
 		msleep(PCIE_LINK_RETRAIN_TIMEOUT_MS + delay);
 		return true;
@@ -4784,6 +4822,7 @@ static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
 			return false;
 
 		msleep(delay);
+		pci_info(pdev, "%s: waited %dms\n", __func__, delay);
 		return true;
 	}
 
@@ -4960,6 +4999,7 @@ void pci_reset_secondary_bus(struct pci_dev *dev)
 
 	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
 	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
+	pci_info(dev, "%s: PCI_BRIDGE_CTL_BUS_RESET deasserted\n", __func__);
 }
 
 void __weak pcibios_reset_secondary_bus(struct pci_dev *dev)


* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-07-01 23:23             ` Bjorn Helgaas
@ 2025-07-02  9:43               ` Hui Wang
  2025-07-03  0:05                 ` Hui Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Hui Wang @ 2025-07-02  9:43 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, Nirmal Patel, Jonathan Derrick,
	Chaitanya Kumar Borah, Ville Syrjälä


On 7/2/25 07:23, Bjorn Helgaas wrote:
> On Tue, Jun 24, 2025 at 08:58:57AM +0800, Hui Wang wrote:
>> Sorry for the late response; I was OOO the past week.
>>
>> This is the log after applying your patch:
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/61
>>
>> Looks like the "retry" makes the nvme work.
> Thank you!  It seems like we get 0xffffffff (probably PCIe error) for
> a long time after we think the device should be able to respond with
> RRS.
>
> I always thought the spec required that after the delays, a device
> should respond with RRS if it's not ready, but now I guess I'm not
> 100% sure.  Maybe it's allowed to just do nothing, which would lead to
> the Root Port timing out and logging an Unsupported Request error.
>
> Can I trouble you to try the patch below?  I think we might have to
> start explicitly checking for that error.  That probably would require
> some setup to enable the error, check for it, and clear it.  I hacked
> in some of that here, but ultimately some of it should go elsewhere.

OK, I built a test kernel and am waiting for the bug reporter to test
it and collect the log.


> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e9448d55113b..c276d0a2b522 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1264,10 +1264,13 @@ void pci_resume_bus(struct pci_bus *bus)
>   
>   static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>   {
> -	int delay = 1;
> +	int delay = 10;
>   	bool retrain = false;
>   	struct pci_dev *root, *bridge;
> +	u16 devctl, devsta;
>   
> +	pci_info(dev, "%s: VF%c %s timeout %d\n", __func__,
> +		 dev->is_virtfn ? '+' : '-', reset_type, timeout);
>   	root = pcie_find_root_port(dev);
>   
>   	if (pci_is_pcie(dev)) {
> @@ -1276,6 +1279,19 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>   			retrain = true;
>   	}
>   
> +	if (root) {
> +		pcie_capability_read_word(root, PCI_EXP_DEVCTL, &devctl);
> +		if (!(devctl & PCI_EXP_DEVCTL_URRE))
> +			pcie_capability_write_word(root, PCI_EXP_DEVCTL,
> +					    devctl | PCI_EXP_DEVCTL_URRE);
> +		pcie_capability_read_word(root, PCI_EXP_DEVSTA, &devsta);
> +		if (devsta & PCI_EXP_DEVSTA_URD)
> +			pcie_capability_write_word(root, PCI_EXP_DEVSTA,
> +						   PCI_EXP_DEVSTA_URD);
> +		pci_info(root, "%s: DEVCTL %#06x DEVSTA %#06x\n", __func__,
> +			 devctl, devsta);
> +	}
> +
>   	/*
>   	 * The caller has already waited long enough after a reset that the
>   	 * device should respond to config requests, but it may respond
> @@ -1305,14 +1321,33 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>   
>   		if (root && root->config_rrs_sv) {
>   			pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> -			if (!pci_bus_rrs_vendor_id(id))
> -				break;
> +
> +			if (pci_bus_rrs_vendor_id(id)) {
> +				pci_info(dev, "%s: read %#06x (RRS)\n",
> +					 __func__, id);
> +				goto retry;
> +			}
> +
> +			if (PCI_POSSIBLE_ERROR(id)) {
> +				pcie_capability_read_word(root, PCI_EXP_DEVSTA,
> +							  &devsta);
> +				if (devsta & PCI_EXP_DEVSTA_URD)
> +					pcie_capability_write_word(root,
> +							    PCI_EXP_DEVSTA,
> +							    PCI_EXP_DEVSTA_URD);
> +				pci_info(root, "%s: read %#06x DEVSTA %#06x\n",
> +					 __func__, id, devsta);
> +				goto retry;
> +			}
> +
> +			break;
>   		} else {
>   			pci_read_config_dword(dev, PCI_COMMAND, &id);
>   			if (!PCI_POSSIBLE_ERROR(id))
>   				break;
>   		}
>   
> +retry:
>   		if (delay > timeout) {
>   			pci_warn(dev, "not ready %dms after %s; giving up\n",
>   				 delay - 1, reset_type);
> @@ -1332,7 +1367,6 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>   		}
>   
>   		msleep(delay);
> -		delay *= 2;
>   	}
>   
>   	if (delay > PCI_RESET_WAIT)
> @@ -4670,8 +4704,10 @@ static int pcie_wait_for_link_status(struct pci_dev *pdev,
>   	end_jiffies = jiffies + msecs_to_jiffies(PCIE_LINK_RETRAIN_TIMEOUT_MS);
>   	do {
>   		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnksta);
> -		if ((lnksta & lnksta_mask) == lnksta_match)
> +		if ((lnksta & lnksta_mask) == lnksta_match) {
> +			pci_info(pdev, "%s: LNKSTA %#06x\n", __func__, lnksta);
>   			return 0;
> +		}
>   		msleep(1);
>   	} while (time_before(jiffies, end_jiffies));
>   
> @@ -4760,6 +4796,8 @@ static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
>   	 * Some controllers might not implement link active reporting. In this
>   	 * case, we wait for 1000 ms + any delay requested by the caller.
>   	 */
> +	pci_info(pdev, "%s: active %d delay %d link_active_reporting %d\n",
> +		 __func__, active, delay, pdev->link_active_reporting);
>   	if (!pdev->link_active_reporting) {
>   		msleep(PCIE_LINK_RETRAIN_TIMEOUT_MS + delay);
>   		return true;
> @@ -4784,6 +4822,7 @@ static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
>   			return false;
>   
>   		msleep(delay);
> +		pci_info(pdev, "%s: waited %dms\n", __func__, delay);
>   		return true;
>   	}
>   
> @@ -4960,6 +4999,7 @@ void pci_reset_secondary_bus(struct pci_dev *dev)
>   
>   	ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>   	pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
> +	pci_info(dev, "%s: PCI_BRIDGE_CTL_BUS_RESET deasserted\n", __func__);
>   }
>   
>   void __weak pcibios_reset_secondary_bus(struct pci_dev *dev)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-07-02  9:43               ` Hui Wang
@ 2025-07-03  0:05                 ` Hui Wang
  2025-08-08  2:23                   ` Hui Wang
  2025-08-21 16:39                   ` Bjorn Helgaas
  0 siblings, 2 replies; 14+ messages in thread
From: Hui Wang @ 2025-07-03  0:05 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, Nirmal Patel, Jonathan Derrick,
	Chaitanya Kumar Borah, Ville Syrjälä


On 7/2/25 17:43, Hui Wang wrote:
>
> On 7/2/25 07:23, Bjorn Helgaas wrote:
>> On Tue, Jun 24, 2025 at 08:58:57AM +0800, Hui Wang wrote:
>>> Sorry for late response, I was OOO the past week.
>>>
>>> This is the log after applied your patch:
>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/61 
>>>
>>>
>>> Looks like the "retry" makes the nvme work.
>> Thank you!  It seems like we get 0xffffffff (probably PCIe error) for
>> a long time after we think the device should be able to respond with
>> RRS.
>>
>> I always thought the spec required that after the delays, a device
>> should respond with RRS if it's not ready, but now I guess I'm not
>> 100% sure.  Maybe it's allowed to just do nothing, which would lead to
>> the Root Port timing out and logging an Unsupported Request error.
>>
>> Can I trouble you to try the patch below?  I think we might have to
>> start explicitly checking for that error.  That probably would require
>> some setup to enable the error, check for it, and clear it.  I hacked
>> in some of that here, but ultimately some of it should go elsewhere.
>
> OK, built a testing kernel, wait for bug reporter to test it and 
> collect the log.
>
This is the testing result and log. 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/65


>
>> ...


* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-07-03  0:05                 ` Hui Wang
@ 2025-08-08  2:23                   ` Hui Wang
  2025-08-11 23:04                     ` Bjorn Helgaas
  2025-08-21 16:39                   ` Bjorn Helgaas
  1 sibling, 1 reply; 14+ messages in thread
From: Hui Wang @ 2025-08-08  2:23 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, Nirmal Patel, Jonathan Derrick,
	Chaitanya Kumar Borah, Ville Syrjälä, Hui Wang

Hi Bjorn,

Any progress on this issue? Do we have a fix for it now? The Ubuntu 
users are waiting for a fix :-).

Thanks,

Hui.


On 7/3/25 08:05, Hui Wang wrote:
>
> On 7/2/25 17:43, Hui Wang wrote:
>>
>> On 7/2/25 07:23, Bjorn Helgaas wrote:
>>> On Tue, Jun 24, 2025 at 08:58:57AM +0800, Hui Wang wrote:
>>>> Sorry for late response, I was OOO the past week.
>>>>
>>>> This is the log after applied your patch:
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/61 
>>>>
>>>>
>>>> Looks like the "retry" makes the nvme work.
>>> Thank you!  It seems like we get 0xffffffff (probably PCIe error) for
>>> a long time after we think the device should be able to respond with
>>> RRS.
>>>
>>> I always thought the spec required that after the delays, a device
>>> should respond with RRS if it's not ready, but now I guess I'm not
>>> 100% sure.  Maybe it's allowed to just do nothing, which would lead to
>>> the Root Port timing out and logging an Unsupported Request error.
>>>
>>> Can I trouble you to try the patch below?  I think we might have to
>>> start explicitly checking for that error.  That probably would require
>>> some setup to enable the error, check for it, and clear it.  I hacked
>>> in some of that here, but ultimately some of it should go elsewhere.
>>
>> OK, built a testing kernel, wait for bug reporter to test it and 
>> collect the log.
>>
> This is the testing result and log. 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/65
>
>
>>
>>> ...


* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-08-08  2:23                   ` Hui Wang
@ 2025-08-11 23:04                     ` Bjorn Helgaas
  0 siblings, 0 replies; 14+ messages in thread
From: Bjorn Helgaas @ 2025-08-11 23:04 UTC (permalink / raw)
  To: Hui Wang
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, Nirmal Patel, Jonathan Derrick,
	Chaitanya Kumar Borah, Ville Syrjälä

On Fri, Aug 08, 2025 at 10:23:45AM +0800, Hui Wang wrote:
> Hi Bjorn,
> 
> Any progress on this issue, do we have a fix for this now? The
> ubuntu users are waiting for a fix :-).

Not yet, but thanks for the reminder.  Keep bugging me!

PCIe r7.0, sec 2.3.1, makes it clear that devices are permitted to
return RRS after FLR:

  ◦ For Configuration Requests only, if Device Readiness Status is not
    supported, following reset it is permitted for a Function to
    terminate the request and indicate that it is temporarily unable
    to process the Request, but will be able to process the Request in
    the future - in this case, the Request Retry Status (RRS)
    Completion Status must be used (see § Section 6.6). Valid reset
    conditions after which a device/Function is permitted to return
    RRS in response to a Configuration Request are:

    ▪ FLRs

    ...

But I am a little bit concerned because sec 2.3.2, which talks about
how a Root Complex handles that RRS and the RRS Software Visibility
feature, says (note the "system reset" period):

  Root Complex handling of a Completion with Request Retry Status for
  a Configuration Request is implementation specific, except for the
  period following SYSTEM RESET (see § Section 6.6). For Root
  Complexes that support Configuration RRS Software Visibility, the
  following rules apply:

    ◦ If Configuration RRS Software Visibility is enabled:

      ▪ For a Configuration Read Request that includes both bytes of
	the Vendor ID field of a device Function's Configuration Space
	Header, the Root Complex must complete the Request to the host
	by returning a read-data value of 0001h for the Vendor ID
	field and all 1's for any additional bytes included in the
	request.

So I'm worried that the Software Visibility feature might work after
*system reset*, but not necessarily after an FLR.  That might make
sense because I don't think the RC can tell when we are doing an FLR
to a device.

It seems that after FLR, most RCs *do* make RRS visible via SV.  But
if we can't rely on that, I don't know how we're supposed to learn
when a device becomes ready.

Bjorn

> On 7/3/25 08:05, Hui Wang wrote:
> > On 7/2/25 17:43, Hui Wang wrote:
> > > On 7/2/25 07:23, Bjorn Helgaas wrote:
> > > > On Tue, Jun 24, 2025 at 08:58:57AM +0800, Hui Wang wrote:
> > > > > Sorry for late response, I was OOO the past week.
> > > > > 
> > > > > This is the log after applied your patch:
> > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/61
> > > > > 
> > > > > Looks like the "retry" makes the nvme work.
> > > >
> > > > Thank you!  It seems like we get 0xffffffff (probably PCIe
> > > > error) for a long time after we think the device should be
> > > > able to respond with RRS.
> > > > 
> > > > I always thought the spec required that after the delays, a
> > > > device should respond with RRS if it's not ready, but now I
> > > > guess I'm not 100% sure.  Maybe it's allowed to just do
> > > > nothing, which would lead to the Root Port timing out and
> > > > logging an Unsupported Request error.
> > > > 
> > > > Can I trouble you to try the patch below?  I think we might
> > > > have to start explicitly checking for that error.  That
> > > > probably would require some setup to enable the error, check
> > > > for it, and clear it.  I hacked in some of that here, but
> > > > ultimately some of it should go elsewhere.
> ...


* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-06-11 10:14 [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme Hui Wang
  2025-06-12 16:48 ` Bjorn Helgaas
@ 2025-08-14 15:55 ` Bjorn Helgaas
  1 sibling, 0 replies; 14+ messages in thread
From: Bjorn Helgaas @ 2025-08-14 15:55 UTC (permalink / raw)
  To: Hui Wang
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, regressions

On Wed, Jun 11, 2025 at 06:14:42PM +0800, Hui Wang wrote:
> Prior to commit d591f6804e7e ("PCI: Wait for device readiness with
> Configuration RRS"), this Intel NVMe device [8086:0a54] worked well.
> Since that patch was merged into the kernel, the NVMe has stopped
> working.
> 
> Through debugging, we found that the commit introduces RRS polling in
> pci_dev_wait(). For this NVMe device, polling PCI_VENDOR_ID returns
> ~0 while the config access is not yet ready, but the polling expects
> either 0x0001 or a valid Vendor ID, so RRS polling does not work for
> this device.
> 
> Add a PCI quirk to disable RRS polling for the device.
> 
> Fixes: d591f6804e7e ("PCI: Wait for device readiness with Configuration RRS")
> Link: https://bugs.launchpad.net/bugs/2111521
> Signed-off-by: Hui Wang <hui.wang@canonical.com>

To help keep this top of mind:

#regzbot introduced: d591f6804e7e ("PCI: Wait for device readiness with Configuration RRS")

> ---
>  drivers/pci/probe.c  |  3 ++-
>  drivers/pci/quirks.c | 18 ++++++++++++++++++
>  include/linux/pci.h  |  2 ++
>  3 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 4b8693ec9e4c..848fa0e6cf60 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1270,7 +1270,8 @@ static void pci_enable_rrs_sv(struct pci_dev *pdev)
>  	if (root_cap & PCI_EXP_RTCAP_RRS_SV) {
>  		pcie_capability_set_word(pdev, PCI_EXP_RTCTL,
>  					 PCI_EXP_RTCTL_RRS_SVE);
> -		pdev->config_rrs_sv = 1;
> +		if (!(pdev->dev_flags & PCI_DEV_FLAGS_NO_RRS_SV))
> +			pdev->config_rrs_sv = 1;
>  	}
>  }
>  
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index cf483d82572c..519e48ff6448 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -6336,3 +6336,21 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
>  #endif
> +
> +/*
> + * Although the Root Port claims to support RRS, some devices don't work
> + * with RRS polling: when the Vendor ID is read, they just return ~0 if
> + * config access is not yet ready, which breaks pci_dev_wait(). Forcibly
> + * disable RRS for this type of device.
> + */
> +static void quirk_no_rrs_sv(struct pci_dev *dev)
> +{
> +	struct pci_dev *root;
> +
> +	root = pcie_find_root_port(dev);
> +	if (root) {
> +		root->dev_flags |= PCI_DEV_FLAGS_NO_RRS_SV;
> +		root->config_rrs_sv = 0;
> +	}
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0a54, quirk_no_rrs_sv);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 05e68f35f392..f4dd9ada12e4 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -247,6 +247,8 @@ enum pci_dev_flags {
>  	PCI_DEV_FLAGS_HAS_MSI_MASKING = (__force pci_dev_flags_t) (1 << 12),
>  	/* Device requires write to PCI_MSIX_ENTRY_DATA before any MSIX reads */
>  	PCI_DEV_FLAGS_MSIX_TOUCH_ENTRY_DATA_FIRST = (__force pci_dev_flags_t) (1 << 13),
> +	/* Do not use Configuration Request Retry Status polling in pci_dev_wait() */
> +	PCI_DEV_FLAGS_NO_RRS_SV = (__force pci_dev_flags_t) (1 << 14),
>  };
>  
>  enum pci_irq_reroute_variant {
> -- 
> 2.34.1
> 


* Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
  2025-07-03  0:05                 ` Hui Wang
  2025-08-08  2:23                   ` Hui Wang
@ 2025-08-21 16:39                   ` Bjorn Helgaas
  1 sibling, 0 replies; 14+ messages in thread
From: Bjorn Helgaas @ 2025-08-21 16:39 UTC (permalink / raw)
  To: Hui Wang
  Cc: linux-pci, bhelgaas, raphael.norwitz, alay.shah, suresh.gumpula,
	ilpo.jarvinen, Nirmal Patel, Jonathan Derrick,
	Chaitanya Kumar Borah, Ville Syrjälä

On Thu, Jul 03, 2025 at 08:05:05AM +0800, Hui Wang wrote:
> On 7/2/25 17:43, Hui Wang wrote:
> > On 7/2/25 07:23, Bjorn Helgaas wrote:
> > > On Tue, Jun 24, 2025 at 08:58:57AM +0800, Hui Wang wrote:
> > > > Sorry for late response, I was OOO the past week.
> > > > 
> > > > This is the log after applied your patch:
> > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/61
> > > > 
> > > > 
> > > > Looks like the "retry" makes the nvme work.
> > > Thank you!  It seems like we get 0xffffffff (probably PCIe error) for
> > > a long time after we think the device should be able to respond with
> > > RRS.
> > > 
> > > I always thought the spec required that after the delays, a device
> > > should respond with RRS if it's not ready, but now I guess I'm not
> > > 100% sure.  Maybe it's allowed to just do nothing, which would lead to
> > > the Root Port timing out and logging an Unsupported Request error.
> > > 
> > > Can I trouble you to try the patch below?  I think we might have to
> > > start explicitly checking for that error.  That probably would require
> > > some setup to enable the error, check for it, and clear it.  I hacked
> > > in some of that here, but ultimately some of it should go elsewhere.
> > 
> > OK, built a testing kernel, wait for bug reporter to test it and collect
> > the log.
> > 
> This is the testing result and log.
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/65

Thanks!  This looks like an Intel S2600WFT, and I assume it has a BMC
that maintains a System Event Log.  Any chance you can check or keep
that log?

> > > @@ -1305,14 +1321,33 @@ static int pci_dev_wait(struct pci_dev *dev,
> > > char *reset_type, int timeout)
> > >             if (root && root->config_rrs_sv) {
> > >               pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> > > -            if (!pci_bus_rrs_vendor_id(id))
> > > -                break;
> > > +
> > > +            if (pci_bus_rrs_vendor_id(id)) {
> > > +                pci_info(dev, "%s: read %#06x (RRS)\n",
> > > +                     __func__, id);
> > > +                goto retry;
> > > +            }
> > > +
> > > +            if (PCI_POSSIBLE_ERROR(id)) {
> > > +                pcie_capability_read_word(root, PCI_EXP_DEVSTA,
> > > +                              &devsta);
> > > +                if (devsta & PCI_EXP_DEVSTA_URD)
> > > +                    pcie_capability_write_word(root,
> > > +                                PCI_EXP_DEVSTA,
> > > +                                PCI_EXP_DEVSTA_URD);
> > > +                pci_info(root, "%s: read %#06x DEVSTA %#06x\n",
> > > +                     __func__, id, devsta);

We're waiting for 01:00.0, and we're seeing the poll message for about
375 ms:

  [   10.334786] pci 10000:01:00.0: pci_dev_wait: VF- bus reset timeout 59900
  [   10.334792] pci 10000:00:02.0: pci_dev_wait: read 0xffffffff DEVSTA 0x0000
  ...
  [   10.708367] pci 10000:00:02.0: pci_dev_wait: read 0xffffffff DEVSTA 0x0000

The 00:02.0 Root Port has RRS enabled, but the config reads of the
01:00.0 Vendor ID did not return the RRS value (0x0001).  Instead,
they returned 0xffffffff, which typically means an error on PCIe.

If an error occurred, I think it *should* set one of the Error
Detected bits in the Device Status register, but we always see 0
there.

I think the platform enabled firmware-first error handling and
declined to give Linux control of AER, so I'm wondering if BIOS is
capturing and clearing those errors before Linux would see them, hence
my question about the SEL.

  [    6.565996] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
  [    6.702329] acpi PNP0A08:00: _OSC: platform does not support [SHPCHotplug AER LTR DPC]
  [    6.702463] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]

Even if this is the case and the SEL has error info, I don't know how
that would help us, other than maybe to understand why Linux doesn't
see the errors.

Bjorn


end of thread, other threads:[~2025-08-21 16:39 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2025-06-11 10:14 [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme Hui Wang
2025-06-12 16:48 ` Bjorn Helgaas
2025-06-16 11:55   ` Hui Wang
2025-06-16 13:38     ` Hui Wang
2025-06-17 20:12       ` Bjorn Helgaas
2025-06-23 22:58         ` Bjorn Helgaas
2025-06-24  0:58           ` Hui Wang
2025-07-01 23:23             ` Bjorn Helgaas
2025-07-02  9:43               ` Hui Wang
2025-07-03  0:05                 ` Hui Wang
2025-08-08  2:23                   ` Hui Wang
2025-08-11 23:04                     ` Bjorn Helgaas
2025-08-21 16:39                   ` Bjorn Helgaas
2025-08-14 15:55 ` Bjorn Helgaas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).