Linux PCI subsystem development
From: Bjorn Helgaas <helgaas@kernel.org>
To: Hui Wang <hui.wang@canonical.com>
Cc: linux-pci@vger.kernel.org, bhelgaas@google.com,
	raphael.norwitz@nutanix.com, alay.shah@nutanix.com,
	suresh.gumpula@nutanix.com, ilpo.jarvinen@linux.intel.com,
	"Nirmal Patel" <nirmal.patel@linux.intel.com>,
	"Jonathan Derrick" <jonathan.derrick@linux.dev>,
	"Chaitanya Kumar Borah" <chaitanya.kumar.borah@intel.com>,
	"Ville Syrjälä" <ville.syrjala@linux.intel.com>
Subject: Re: [PATCH] PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme
Date: Thu, 21 Aug 2025 11:39:36 -0500
Message-ID: <20250821163936.GA681451@bhelgaas>
In-Reply-To: <eae31738-5d5d-4c74-af1c-66168c36ead5@canonical.com>

On Thu, Jul 03, 2025 at 08:05:05AM +0800, Hui Wang wrote:
> On 7/2/25 17:43, Hui Wang wrote:
> > On 7/2/25 07:23, Bjorn Helgaas wrote:
> > > On Tue, Jun 24, 2025 at 08:58:57AM +0800, Hui Wang wrote:
> > > > Sorry for late response, I was OOO the past week.
> > > > 
> > > > > This is the log after applying your patch:
> > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/61
> > > > 
> > > > 
> > > > Looks like the "retry" makes the nvme work.
> > > Thank you!  It seems like we get 0xffffffff (probably PCIe error) for
> > > a long time after we think the device should be able to respond with
> > > RRS.
> > > 
> > > I always thought the spec required that after the delays, a device
> > > should respond with RRS if it's not ready, but now I guess I'm not
> > > 100% sure.  Maybe it's allowed to just do nothing, which would lead to
> > > the Root Port timing out and logging an Unsupported Request error.
> > > 
> > > Can I trouble you to try the patch below?  I think we might have to
> > > start explicitly checking for that error.  That probably would require
> > > some setup to enable the error, check for it, and clear it.  I hacked
> > > in some of that here, but ultimately some of it should go elsewhere.
> > 
> > > OK, I built a test kernel; waiting for the bug reporter to test it
> > > and collect the log.
> > 
> This is the testing result and log.
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2111521/comments/65

Thanks!  This looks like an Intel S2600WFT, and I assume it has a BMC
that maintains a System Event Log (SEL).  Any chance you could check
or save that log?

> > > @@ -1305,14 +1321,33 @@ static int pci_dev_wait(struct pci_dev *dev,
> > > char *reset_type, int timeout)
> > >             if (root && root->config_rrs_sv) {
> > >               pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> > > -            if (!pci_bus_rrs_vendor_id(id))
> > > -                break;
> > > +
> > > +            if (pci_bus_rrs_vendor_id(id)) {
> > > +                pci_info(dev, "%s: read %#06x (RRS)\n",
> > > +                     __func__, id);
> > > +                goto retry;
> > > +            }
> > > +
> > > +            if (PCI_POSSIBLE_ERROR(id)) {
> > > +                pcie_capability_read_word(root, PCI_EXP_DEVSTA,
> > > +                              &devsta);
> > > +                if (devsta & PCI_EXP_DEVSTA_URD)
> > > +                    pcie_capability_write_word(root,
> > > +                                PCI_EXP_DEVSTA,
> > > +                                PCI_EXP_DEVSTA_URD);
> > > +                pci_info(root, "%s: read %#06x DEVSTA %#06x\n",
> > > +                     __func__, id, devsta);

We're waiting for 01:00.0, and we're seeing the poll message for about
375 ms:

  [   10.334786] pci 10000:01:00.0: pci_dev_wait: VF- bus reset timeout 59900
  [   10.334792] pci 10000:00:02.0: pci_dev_wait: read 0xffffffff DEVSTA 0x0000
  ...
  [   10.708367] pci 10000:00:02.0: pci_dev_wait: read 0xffffffff DEVSTA 0x0000

The 00:02.0 Root Port has RRS enabled, but the config reads of the
01:00.0 Vendor ID did not return the RRS value (0x0001).  Instead,
they returned 0xffffffff, which typically means an error on PCIe.
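
For reference, the three outcomes of a Vendor ID read that the poll
loop distinguishes can be sketched in standalone C.  This is only an
illustration, not the kernel code: `classify_id_read()` and
`enum id_read` are hypothetical names, and the two checks merely mirror
what pci_bus_rrs_vendor_id() and PCI_POSSIBLE_ERROR() are understood to
test (low 16 bits equal to 0x0001, and all ones, respectively).

```c
#include <stdint.h>

/* Reserved Vendor ID returned in the low 16 bits when RRS is visible. */
#define VENDOR_ID_RRS 0x0001u

/* Hypothetical names, for illustration only. */
enum id_read {
	ID_READ_RRS,	/* device answered "not ready yet, retry" */
	ID_READ_ERROR,	/* all ones: the read probably failed on PCIe */
	ID_READ_VALID,	/* anything else: treat as a real Vendor/Device ID */
};

/* Classify the 32-bit value read from config offset 0x00. */
static enum id_read classify_id_read(uint32_t id)
{
	if ((id & 0xffffu) == VENDOR_ID_RRS)
		return ID_READ_RRS;
	if (id == UINT32_MAX)
		return ID_READ_ERROR;
	return ID_READ_VALID;
}

/* Short label suitable for a log message. */
static const char *id_read_name(enum id_read r)
{
	switch (r) {
	case ID_READ_RRS:
		return "RRS";
	case ID_READ_ERROR:
		return "error";
	default:
		return "valid";
	}
}
```

With this sketch, every poll shown in the log above (0xffffffff) falls
into the ID_READ_ERROR case rather than ID_READ_RRS.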

If an error occurred, I think it *should* set one of the Error
Detected bits in the Device Status register, but we always see 0
there.
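
As a rough aid for reading the DEVSTA values in the log, the four
Error Detected bits can be decoded as below.  This is a userspace
sketch under stated assumptions: the DEVSTA_* macros copy the values of
the kernel's PCI_EXP_DEVSTA_CED/NFED/FED/URD definitions, and the
helper names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Error Detected bits of the PCIe Device Status register. */
#define DEVSTA_CED	0x0001	/* Correctable Error Detected */
#define DEVSTA_NFED	0x0002	/* Non-Fatal Error Detected */
#define DEVSTA_FED	0x0004	/* Fatal Error Detected */
#define DEVSTA_URD	0x0008	/* Unsupported Request Detected */
#define DEVSTA_ERR_MASK	(DEVSTA_CED | DEVSTA_NFED | DEVSTA_FED | DEVSTA_URD)

/* Keep only the error bits; other status bits are not errors. */
static uint16_t devsta_error_bits(uint16_t devsta)
{
	return devsta & DEVSTA_ERR_MASK;
}

/* Print the names of whichever error bits are set, or "none". */
static void devsta_print(FILE *out, uint16_t devsta)
{
	uint16_t err = devsta_error_bits(devsta);

	fprintf(out, "DEVSTA %#06x:", (unsigned int)devsta);
	if (!err)
		fprintf(out, " none");
	if (err & DEVSTA_CED)
		fprintf(out, " CorrErr");
	if (err & DEVSTA_NFED)
		fprintf(out, " NonFatalErr");
	if (err & DEVSTA_FED)
		fprintf(out, " FatalErr");
	if (err & DEVSTA_URD)
		fprintf(out, " UnsupReq");
	fputc('\n', out);
}
```

For the value 0x0000 seen in every poll message, devsta_error_bits()
returns 0, i.e. no error of any class was latched at the time of the
read.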

I think the platform enabled firmware-first error handling and
declined to give Linux control of AER, so I'm wondering if BIOS is
capturing and clearing those errors before Linux would see them, hence
my question about the SEL.

  [    6.565996] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
  [    6.702329] acpi PNP0A08:00: _OSC: platform does not support [SHPCHotplug AER LTR DPC]
  [    6.702463] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
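
To check the same thing on another machine's dmesg, a small helper can
test whether a feature appears in the bracketed list of an _OSC line.
This is a hypothetical sketch (`osc_list_has()` is not a kernel or
library function) and it assumes the line format shown above, with
space-separated feature names inside one pair of square brackets.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if "feature" is one of the space-separated tokens inside
 * the first [...] list on the line.  Whole tokens are compared, so
 * "PCIe" does not match "PCIeHotplug".
 */
static bool osc_list_has(const char *line, const char *feature)
{
	const char *p = strchr(line, '[');
	const char *end = p ? strchr(p, ']') : NULL;
	size_t flen = strlen(feature);

	if (!end || flen == 0)
		return false;

	p++;	/* skip the opening bracket */
	while (p < end) {
		const char *tok;

		while (p < end && *p == ' ')
			p++;
		tok = p;
		while (p < end && *p != ' ')
			p++;
		if ((size_t)(p - tok) == flen && strncmp(tok, feature, flen) == 0)
			return true;
	}
	return false;
}
```

Applied to the two _OSC lines above, "AER" is found in the "platform
does not support" list and not in the "OS now controls" list.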

Even if this is the case and the SEL has error info, I don't know how
that would help us, other than maybe to understand why Linux doesn't
see the errors.

Bjorn
