From: Bjorn Helgaas <helgaas@kernel.org>
To: Hui Wang <hui.wang@canonical.com>
Cc: linux-pci@vger.kernel.org, bhelgaas@google.com,
raphael.norwitz@nutanix.com, alay.shah@nutanix.com,
suresh.gumpula@nutanix.com, ilpo.jarvinen@linux.intel.com,
"Nirmal Patel" <nirmal.patel@linux.intel.com>,
"Jonathan Derrick" <jonathan.derrick@linux.dev>,
"Chaitanya Kumar Borah" <chaitanya.kumar.borah@intel.com>,
"Ville Syrjälä" <ville.syrjala@linux.intel.com>
Subject: Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
Date: Mon, 23 Jun 2025 17:58:35 -0500 [thread overview]
Message-ID: <20250623225835.GA1449671@bhelgaas> (raw)
In-Reply-To: <20250617201255.GA1100556@bhelgaas>
Ping, any chance you could try the patch below to collect a little
more data? The proposed quirk only papers over a deeper problem, and
I want to fix that problem instead of covering it up.
On Tue, Jun 17, 2025 at 03:12:55PM -0500, Bjorn Helgaas wrote:
> [+cc Chaitanya, Ville for possible Intel NVMe contacts]
>
> On Mon, Jun 16, 2025 at 09:38:25PM +0800, Hui Wang wrote:
> > On 6/16/25 19:55, Hui Wang wrote:
> > > On 6/13/25 00:48, Bjorn Helgaas wrote:
> > > > [+cc VMD folks]
> > > >
> > > > On Wed, Jun 11, 2025 at 06:14:42PM +0800, Hui Wang wrote:
> > > > > Prior to commit d591f6804e7e ("PCI: Wait for device readiness with
> > > > > Configuration RRS"), this Intel NVMe device [8086:0a54] worked well.
> > > > > Since that patch was merged into the kernel, the NVMe device has
> > > > > stopped working.
> > > > >
> > > > > Through debugging, we found that the commit introduces RRS polling
> > > > > in pci_dev_wait(). For this NVMe device, polling PCI_VENDOR_ID
> > > > > returns ~0 if config access is not ready yet, but the polling
> > > > > expects either 0x0001 or a valid Vendor ID, so the RRS polling
> > > > > doesn't work for this device.
> > > >
> > > > Sorry for breaking this, and thanks for all your work in debugging
> > > > this! Issues like this are really hard to track down.
> > > >
> > > > I would think we would have heard about this earlier if the NVMe
> > > > device were broken on all systems. Maybe there's some connection with
> > > > VMD? From the non-working dmesg log in your bug report
> > > > (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/+attachment/5879970/+files/dmesg-60.txt):
> > > >
> > > >
> > > > DMI: ASUSTeK COMPUTER INC. ESC8000 G4/Z11PG-D24 Series, BIOS 5501 04/17/2019
> > > > vmd 0000:d7:05.5: PCI host bridge to bus 10000:00
> > > > pci 10000:00:02.0: [8086:2032] type 01 class 0x060400 PCIe Root Port
> > > > pci 10000:00:02.0: PCI bridge to [bus 01]
> > > > pci 10000:00:02.0: bridge window [mem 0xf8000000-0xf81fffff]: assigned
> > > > pci 10000:01:00.0: [8086:0a54] type 00 class 0x010802 PCIe Endpoint
> > > > pci 10000:01:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]
> > > >
> > > > <I think vmd_enable_domain() calls pci_reset_bus() here>
> > >
> > > Yes, and pci_dev_wait() is called here. With the RRS polling, we get
> > > ~0 from PCI_VENDOR_ID, and then get 0xffffffff when programming BAR 0
> > > afterwards. With the original polling method, pci_dev_wait() delays
> > > long enough, so the NVMe device works normally.
> > >
> > > The line "[ 10.193589] hhhhhhhhhhhhhhhhhhhhhhhhhhhh dev->device = 0a54
> > > id = ffffffff" is output from pci_dev_wait(); please refer to
> > > https://launchpadlibrarian.net/798708446/LP2111521-dmesg-test9.txt
> > >
> > > > pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
> > > > pci 10000:01:00.0: BAR 0: error updating (high 0x00000000 != 0xffffffff)
> > > > pci 10000:01:00.0: BAR 0 [mem 0xf8010000-0xf8013fff 64bit]: assigned
> > > > pci 10000:01:00.0: BAR 0: error updating (0xf8010004 != 0xffffffff)
> > > > nvme nvme0: pci function 10000:01:00.0
> > > > nvme 10000:01:00.0: enabling device (0000 -> 0002)
> > > >
> > > > Things I notice:
> > > >
> > > > - The 10000:01:00.0 NVMe device is behind a VMD bridge
> > > >
> > > > - We successfully read the Vendor & Device IDs (8086:0a54)
> > > >
> > > > - The NVMe device is uninitialized. We successfully sized the BAR,
> > > > which included successful config reads and writes. The BAR
> > > > wasn't assigned by BIOS, which is normal since it's behind VMD.
> > > >
> > > > - We allocated space for BAR 0 but the config writes to program the
> > > > BAR failed. The read back from the BAR was 0xffffffff; probably a
> > > > PCIe error, e.g., the NVMe device didn't respond.
> > > >
> > > > - The device *did* respond when nvme_probe() enabled it: the
> > > > "enabling device (0000 -> 0002)" means pci_enable_resources() read
> > > > PCI_COMMAND and got 0x0000.
> > > >
> > > > - The dmesg from the working config doesn't include the "enabling
> > > > device" line, which suggests that pci_enable_resources() saw
> > > > PCI_COMMAND_MEMORY (0x0002) already set and didn't bother setting
> > > > it again. I don't know why it would already be set.
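> > > >
> > > > Both "enabling device" observations above follow from the way
> > > > pci_enable_resources() only logs and writes PCI_COMMAND when the
> > > > enable bits it needs aren't already set. Roughly (a simplified
> > > > sketch, not the exact kernel code; the per-resource loop is
> > > > elided and the *_has_*_bars names are placeholders):
> > > >
> > > >     u16 cmd, old_cmd;
> > > >
> > > >     pci_read_config_word(dev, PCI_COMMAND, &cmd);
> > > >     old_cmd = cmd;
> > > >
> > > >     /* OR in the decode bits the assigned BARs need */
> > > >     if (dev_has_io_bars)        /* placeholder for the real resource loop */
> > > >             cmd |= PCI_COMMAND_IO;
> > > >     if (dev_has_mem_bars)
> > > >             cmd |= PCI_COMMAND_MEMORY;
> > > >
> > > >     if (cmd != old_cmd) {
> > > >             /* "enabling device (0000 -> 0002)" in the non-working log */
> > > >             pci_info(dev, "enabling device (%04x -> %04x)\n", old_cmd, cmd);
> > > >             pci_write_config_word(dev, PCI_COMMAND, cmd);
> > > >     }
> > > >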
> > > > d591f6804e7e really only changes pci_dev_wait(), which is used after
> > > > device resets. I think vmd_enable_domain() resets the VMD Root Ports
> > > > after pci_scan_child_bus(), and maybe we're not waiting long enough
> > > > afterwards.
> > > >
> > > > My guess is that we got the ~0 because we did a config read too soon
> > > > after reset and the device didn't respond. The Root Port would time
> > > > out, log an error, and synthesize ~0 data to complete the CPU read
> > > > (see PCIe r6.0, sec 2.3.2 implementation note).
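> > > >
> > > > That would also explain why the wait returned immediately: the RRS
> > > > check only treats Vendor ID 0x0001 as "not ready", so a synthesized
> > > > 0xffffffff looks like a valid ID and pci_dev_wait() stops polling.
> > > > A simplified sketch of that check (hypothetical helper name; the
> > > > actual kernel helper is pci_bus_rrs_vendor_id()):
> > > >
> > > >     /* RRS completions put 0x0001 (PCI_VENDOR_ID_PCI_SIG) in the Vendor ID */
> > > >     static bool vendor_id_is_rrs(u32 id)
> > > >     {
> > > >             return (id & 0xffff) == 0x0001;
> > > >     }
> > > >
> > > >     pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> > > >     if (!vendor_id_is_rrs(id))
> > > >             break;      /* ~0 from a timed-out read also exits here */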
> > > >
> > > > It's *possible* that we waited long enough but the NVMe device is
> > > > broken and didn't respond when it should have, but my money is on a
> > > > software defect.
> > > >
> > > > There are a few pci_dbg() calls about these delays; can you set
> > > > CONFIG_DYNAMIC_DEBUG=y and boot with dyndbg="file drivers/pci/* +p" to
> > > > collect that output? Please also collect the "sudo lspci -vv" output
> > > > from a working system.
> > >
> > > I've already passed the testing request to the bug reporters and am
> > > waiting for their feedback.
> > >
> > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/55
> >
> > This is the dmesg with dyndbg="file drivers/pci/* +p"
> >
> > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/56
>
> Thank you very much! I'm stumped.
>
> Both Ports here are capable of 8GT/s, and the Root Port is hot-plug
> capable and supports Data Link Layer Link Active Reporting but not DRS:
>
> 10000:00:02.0 Intel Sky Lake-E PCI Express Root Port C
> LnkCap: Port #11, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <16us
> ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
> LnkSta: Speed 8GT/s, Width x4
> TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
> SltCap: AttnBtn- PwrCtrl- MRL- AttnInd+ PwrInd+ HotPlug+ Surprise+
> LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
>
> 10000:01:00.0 Intel NVMe Datacenter SSD
> LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s, Exit Latency L0s <64ns
>
> After a reset, we have to delay before doing config reads. The wait
> time requirements are in PCIe r6.0, sec 6.6.1:
>
> • ... For cases where system software cannot determine that DRS is
> supported by the attached device, or by the Downstream Port above
> the attached device [as in this case]:
>
> ◦ ...
>
> ◦ With a Downstream Port that supports Link speeds greater than
> 5.0 GT/s, software must wait a minimum of 100 ms after Link
> training completes before sending a Configuration Request to the
> device immediately below that Port. Software can determine when
> Link training completes by polling the Data Link Layer Link
> Active bit or by setting up an associated interrupt (see §
> Section 6.7.3.3). It is strongly recommended for software to
> use this mechanism whenever the Downstream Port supports it.
>
> Since the Port supports 8GT/s, we should wait 100ms after Link training
> completes (PCI_EXP_LNKSTA_DLLLA becomes set), and based on the code
> path below, I think we *do*:
>
> pci_bridge_secondary_bus_reset
> pcibios_reset_secondary_bus
> pci_reset_secondary_bus
> # assert PCI_BRIDGE_CTL_BUS_RESET
> pci_bridge_wait_for_secondary_bus(bridge, "bus reset")
> pci_dbg("waiting %d ms for downstream link, after activation\n") # 10.86
> delay = pci_bus_max_d3cold_delay() # default 100ms
> pcie_wait_for_link_delay(bridge, active=true, delay=100)
> pcie_wait_for_link_status(use_lt=false, active=true)
> end_jiffies = jiffies + PCIE_LINK_RETRAIN_TIMEOUT_MS
> do {
> pcie_capability_read_word(PCI_EXP_LNKSTA, &lnksta)
> if ((lnksta & PCI_EXP_LNKSTA_DLLLA) == PCI_EXP_LNKSTA_DLLLA)
> return 0
> msleep(1) # likely wait several ms for link active
> } while (time_before(jiffies, end_jiffies))
> if (active) # (true)
> msleep(delay) # wait 100ms here
> pci_dev_wait(child, "bus reset", PCIE_RESET_READY_POLL_MS - delay)
> delay = 1
> for (;;) {
> pci_read_config_dword(PCI_VENDOR_ID, &id)
> if (!pci_bus_rrs_vendor_id(id)) # got 0xffff, assume valid
> break;
> }
> pci_dbg("ready 0ms after bus reset", delay - 1) # 11.11
>
> And from the dmesg log:
>
> [ 10.862226] pci 10000:00:02.0: waiting 100 ms for downstream link, after activation
> [ 11.109581] pci 10000:01:00.0: ready 0ms after bus reset
>
> it looks like we waited about .25s (250ms) after the link came up
> before trying to read the Vendor ID, which should be plenty.
>
> I guess maybe this device requires more than 250ms after reset before
> it can respond? That still seems surprising to me; I *assume* Intel
> would have tested this for spec conformance. But maybe there's some
> erratum. I added a couple Intel folks who are mentioned in the nvme
> history in case they have pointers.
>
> But it *is* also true that pci_dev_wait() doesn't know how to handle
> PCIe errors at all, and maybe there's a way to make it smarter.
>
> Can you try adding the patch below and collect another dmesg log (with
> dyndbg="file drivers/pci/* +p" as before)? This isn't a fix, but
> might give a little more insight into what's happening.
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e9448d55113b..42a36ff5c6cd 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1268,6 +1268,7 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
> bool retrain = false;
> struct pci_dev *root, *bridge;
>
> + pci_dbg(dev, "%s: %s timeout %d\n", __func__, reset_type, timeout);
> root = pcie_find_root_port(dev);
>
> if (pci_is_pcie(dev)) {
> @@ -1305,14 +1306,32 @@ static int pci_dev_wait(struct pci_dev *dev, char *reset_type, int timeout)
>
> if (root && root->config_rrs_sv) {
> pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> - if (!pci_bus_rrs_vendor_id(id))
> - break;
> + pci_dbg(dev, "%s: vf %d read %#06x\n", __func__,
> + dev->is_virtfn, id);
> + if (pci_bus_rrs_vendor_id(id))
> + goto retry;
> +
> + /*
> + * We might read 0xffff if the device is a VF and
> + * the read was successful (the VF Vendor ID is
> + * 0xffff per spec).
> + *
> + * If the device is not a VF, 0xffff likely means
> + * there was an error on PCIe. E.g., maybe the
> + * device couldn't even respond with RRS status,
> + * and the RC timed out and synthesized ~0 data.
> + */
> + if (PCI_POSSIBLE_ERROR(id) && !dev->is_virtfn)
> + goto retry;
> +
> + break;
> } else {
> pci_read_config_dword(dev, PCI_COMMAND, &id);
> if (!PCI_POSSIBLE_ERROR(id))
> break;
> }
>
> +retry:
> if (delay > timeout) {
> pci_warn(dev, "not ready %dms after %s; giving up\n",
> delay - 1, reset_type);
> @@ -4760,6 +4779,8 @@ static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
> * Some controllers might not implement link active reporting. In this
> * case, we wait for 1000 ms + any delay requested by the caller.
> */
> + pci_dbg(pdev, "%s: active %d delay %d link_active_reporting %d\n",
> + __func__, active, delay, pdev->link_active_reporting);
> if (!pdev->link_active_reporting) {
> msleep(PCIE_LINK_RETRAIN_TIMEOUT_MS + delay);
> return true;