* [bugzilla-daemon@kernel.org: [Bug 216511] New: Spurious PCI_EXP_SLTSTA_DLLSC when hot plugging]
@ 2022-09-21 11:40 Bjorn Helgaas
2022-09-21 18:03 ` Bjorn Helgaas
0 siblings, 1 reply; 3+ messages in thread
From: Bjorn Helgaas @ 2022-09-21 11:40 UTC (permalink / raw)
To: linux-pci, Lukas Wunner; +Cc: Richard Weinberger
----- Forwarded message from bugzilla-daemon@kernel.org -----
Date: Wed, 21 Sep 2022 11:30:47 +0000
From: bugzilla-daemon@kernel.org
To: bjorn@helgaas.com
Subject: [Bug 216511] New: Spurious PCI_EXP_SLTSTA_DLLSC when hot plugging
Message-ID: <bug-216511-41252@https.bugzilla.kernel.org/>
https://bugzilla.kernel.org/show_bug.cgi?id=216511
Bug ID: 216511
Summary: Spurious PCI_EXP_SLTSTA_DLLSC when hot plugging
Product: Drivers
Version: 2.5
Kernel Version: Any
Hardware: All
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: PCI
Assignee: drivers_pci@kernel-bugs.osdl.org
Reporter: richard@nod.at
Regression: No
Created attachment 301842
--> https://bugzilla.kernel.org/attachment.cgi?id=301842&action=edit
full dmesg while hotplugging two nvmes and spurious link change
An x86_64 machine has a PCI switch (PEX 8747) with four ports; NVMe disks can
be attached to two of them.
Using a vendor-specific tool I can power each port on and off.
When I power on both ports and hot plug an NVMe into either port, it works
perfectly fine,
but as soon as I plug in a second one, *both* ports receive a
PCI_EXP_SLTSTA_DLLSC event.
As a consequence, the previously attached NVMe is detached and only one device
remains, or the previously attached NVMe gets detached and immediately
reattached but all I/O fails later.
To me it seems very wrong that both ports see PCI_EXP_SLTSTA_DLLSC.
The problem can be observed with any kernel tried so far.
Could this be a firmware issue? What further debugging methods do you suggest?
Thanks,
//richard
--
You may reply to this email to add a comment.
You are receiving this mail because:
You are watching the assignee of the bug.
----- End forwarded message -----
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [bugzilla-daemon@kernel.org: [Bug 216511] New: Spurious PCI_EXP_SLTSTA_DLLSC when hot plugging]
2022-09-21 11:40 [bugzilla-daemon@kernel.org: [Bug 216511] New: Spurious PCI_EXP_SLTSTA_DLLSC when hot plugging] Bjorn Helgaas
@ 2022-09-21 18:03 ` Bjorn Helgaas
2022-09-21 18:56 ` Lukas Wunner
0 siblings, 1 reply; 3+ messages in thread
From: Bjorn Helgaas @ 2022-09-21 18:03 UTC (permalink / raw)
To: linux-pci, Lukas Wunner; +Cc: Richard Weinberger, aaron
On Wed, Sep 21, 2022 at 06:40:20AM -0500, Bjorn Helgaas wrote:
> ----- Forwarded message from bugzilla-daemon@kernel.org -----
>
> Date: Wed, 21 Sep 2022 11:30:47 +0000
> From: bugzilla-daemon@kernel.org
> To: bjorn@helgaas.com
> Subject: [Bug 216511] New: Spurious PCI_EXP_SLTSTA_DLLSC when hot plugging
> Message-ID: <bug-216511-41252@https.bugzilla.kernel.org/>
>
> https://bugzilla.kernel.org/show_bug.cgi?id=216511
>
> Bug ID: 216511
> Summary: Spurious PCI_EXP_SLTSTA_DLLSC when hot plugging
> ...
> An x86_64 machine has a PCI switch (PEX 8747) with four ports; NVMe disks can
> be attached to two of them.
> Using a vendor-specific tool I can power each port on and off.
> When I power on both ports and hot plug an NVMe into either port, it works
> perfectly fine,
> but as soon as I plug in a second one, *both* ports receive a
> PCI_EXP_SLTSTA_DLLSC event.
> As a consequence, the previously attached NVMe is detached and only one device
> remains, or the previously attached NVMe gets detached and immediately
> reattached but all I/O fails later.
>
> To me it seems very wrong that both ports see PCI_EXP_SLTSTA_DLLSC.
>
> The problem can be observed with any kernel tried so far.
> Could this be a firmware issue? What further debugging methods do you suggest?
Relevant devices from lspci:

  0a:00.0 PLX 8748 Upstream Port to [bus 0b-1b]
  0b:08.0 PLX 8747 Downstream Port to [bus 0c-0f]   # Slot 0
  0c:00.0 NVMe
  0b:09.0 PLX 8747 Downstream Port to [bus 10-13]   # Slot 0-1
  10:00.0 NVMe

From dmesg log, we add 10:00.0 in Slot 0-1 first, then add 0c:00.0 in
Slot 0. When 0c:00.0 is added, Slot 0-1 gets a PCI_EXP_SLTSTA_DLLSC
interrupt for 10:00.0:

  pcieport 0000:0b:09.0: pciehp: pending interrupts 0x0008 from Slot Status
    presence detect changed   # Slot 0-1
  pcieport 0000:0b:09.0: pciehp: pending interrupts 0x0100 from Slot Status
    DLL state changed   # Slot 0-1
  pcieport 0000:0b:09.0: pciehp: pciehp_check_link_status: lnk_status = a023
    PCI_EXP_LNKSTA_LABS
    PCI_EXP_LNKSTA_DLLLA
    PCI_EXP_LNKSTA_NLW_X2
    PCI_EXP_LNKSTA_CLS_8_0GB
  pci 0000:10:00.0: [27d1:5216] type 00 class 0x010802   # NVMe in Slot 0-1
  pcieport 0000:0b:08.0: pciehp: pending interrupts 0x0008 from Slot Status
    presence detect changed   # Slot 0
  pcieport 0000:0b:09.0: pciehp: pending interrupts 0x0100 from Slot Status
    DLL state changed   # Slot 0-1 (?)
  pcieport 0000:0b:09.0: pciehp: Slot(0-1): Link Down

Here's the call chain when handling that DLL state change:

  pciehp_ist
    pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status)
    status &= ... PCI_EXP_SLTSTA_DLLSC
    events |= status
    if (events & PCI_EXP_SLTSTA_DLLSC)
      pciehp_handle_presence_or_link_change
        pciehp_disable_slot
          __pciehp_disable_slot
            remove_board
              pciehp_unconfigure_device
                pci_stop_and_remove_bus_device

Per spec, "software must read the Data Link Layer Link Active bit of
the Link Status Register to determine if the Link is active before
initiating configuration cycles to the hot plugged device" (PCIe r6.0,
sec 7.5.3.11).

It looks like Linux depends on PCI_EXP_SLTSTA_DLLSC but does not
actually read PCI_EXP_LNKSTA in this path, so this looks like a pciehp
defect.

Bjorn
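For readers following the register values in the log above, here is a
minimal user-space sketch (not kernel code) that decodes them. The bit
masks are meant to mirror the definitions in the kernel's
include/uapi/linux/pci_regs.h; the struct and the helpers
decode_lnksta(), lnksta_speed_name() and describe_sltsta() are
hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bit masks intended to match include/uapi/linux/pci_regs.h */
#define PCI_EXP_LNKSTA_CLS        0x000f  /* Current Link Speed */
#define PCI_EXP_LNKSTA_NLW        0x03f0  /* Negotiated Link Width */
#define PCI_EXP_LNKSTA_NLW_SHIFT  4
#define PCI_EXP_LNKSTA_DLLLA      0x2000  /* Data Link Layer Link Active */
#define PCI_EXP_LNKSTA_LABS       0x8000  /* Link Autonomous Bandwidth Status */
#define PCI_EXP_SLTSTA_PDC        0x0008  /* Presence Detect Changed */
#define PCI_EXP_SLTSTA_DLLSC      0x0100  /* Data Link Layer State Changed */

/* Hypothetical helper type: the Link Status fields used in this thread. */
struct lnksta_fields {
	unsigned int speed_code;  /* raw Current Link Speed encoding */
	unsigned int width;       /* negotiated number of lanes */
	bool dll_active;          /* DLLLA bit */
	bool autonomous_bw;       /* LABS bit */
};

/* Split a raw Link Status value into its fields. */
struct lnksta_fields decode_lnksta(uint16_t lnksta)
{
	struct lnksta_fields f;

	f.speed_code = lnksta & PCI_EXP_LNKSTA_CLS;
	f.width = (lnksta & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT;
	f.dll_active = (lnksta & PCI_EXP_LNKSTA_DLLLA) != 0;
	f.autonomous_bw = (lnksta & PCI_EXP_LNKSTA_LABS) != 0;
	return f;
}

/* Map the Current Link Speed encoding to a human-readable rate. */
const char *lnksta_speed_name(unsigned int code)
{
	static const char *const names[] = {
		"unknown", "2.5 GT/s", "5.0 GT/s", "8.0 GT/s",
		"16.0 GT/s", "32.0 GT/s", "64.0 GT/s",
	};

	if (code >= sizeof(names) / sizeof(names[0]))
		return "unknown";
	return names[code];
}

/*
 * Write the names of the two Slot Status event bits discussed here into
 * buf and return the number of characters snprintf() produced.
 */
int describe_sltsta(uint16_t events, char *buf, size_t len)
{
	assert(buf && len > 0);
	return snprintf(buf, len, "%s%s",
			(events & PCI_EXP_SLTSTA_PDC) ? "PDC " : "",
			(events & PCI_EXP_SLTSTA_DLLSC) ? "DLLSC " : "");
}
```

Calling decode_lnksta(0xa023) yields speed code 3 (8.0 GT/s), width 2,
with both DLLLA and LABS set, which agrees with the flags listed next to
lnk_status in the log. The pending-interrupt values 0x0008 and 0x0100
are the PDC and DLLSC bits respectively.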
* Re: [bugzilla-daemon@kernel.org: [Bug 216511] New: Spurious PCI_EXP_SLTSTA_DLLSC when hot plugging]
2022-09-21 18:03 ` Bjorn Helgaas
@ 2022-09-21 18:56 ` Lukas Wunner
0 siblings, 0 replies; 3+ messages in thread
From: Lukas Wunner @ 2022-09-21 18:56 UTC (permalink / raw)
To: Bjorn Helgaas; +Cc: linux-pci, Richard Weinberger, aaron
On Wed, Sep 21, 2022 at 01:03:26PM -0500, Bjorn Helgaas wrote:
> On Wed, Sep 21, 2022 at 06:40:20AM -0500, Bjorn Helgaas wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=216511
[...]
> Here's the call chain when handling that DLL state change:
>
>   pciehp_ist
>     pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status)
>     status &= ... PCI_EXP_SLTSTA_DLLSC
>     events |= status
>     if (events & PCI_EXP_SLTSTA_DLLSC)
>       pciehp_handle_presence_or_link_change
>         pciehp_disable_slot
>           __pciehp_disable_slot
>             remove_board
>               pciehp_unconfigure_device
>                 pci_stop_and_remove_bus_device
>
> Per spec, "software must read the Data Link Layer Link Active bit of
> the Link Status Register to determine if the Link is active before
> initiating configuration cycles to the hot plugged device" (PCIe r6.0,
> sec 7.5.3.11).
>
> It looks like Linux depends on PCI_EXP_SLTSTA_DLLSC but does not
> actually read PCI_EXP_LNKSTA in this path, so this looks like a pciehp
> defect.

I disagree. The spec citation pertains to *bringup* of the slot,
but this is the bringdown code path.

The logic in pciehp is such that if we receive DLLSC or PDC and the
slot is up, we always bring it down. Only then do we check whether
the slot is occupied or link is up. If that's the case, we attempt
to bring the slot up again.

pciehp assumes that the card may have changed when it receives DLLSC
or PDC. That's the rationale behind this behavior.

In theory one might think that if DLLSC is received without a concurrent
PDC event, then the card in the slot is still the same and only the link
went down (probably flapped).
Unfortunately the reality is not that simple: For one, DLLSC and PDC
events may come in arbitrary order and with quite a delay between them.
Second, there are broken slots which hardwire PDC to 0 and we support
those. So we can't reliably determine if presence hasn't changed and
only link has.

In this particular case, the PEX switch is clearly broken because it
shouldn't signal DLLSC both for a slot where the link change occurred
and its sibling.

A while ago Jon Derrick submitted a patch for a similar problem:
A bifurcated SSD where bringing down one half of the SSD results in
a spurious DLLSC event for the other half:

https://lore.kernel.org/linux-pci/20210830155628.130054-1-jonathan.derrick@linux.dev/

I'm not really happy with that patch because it adds a quirk in the
middle of the code path for interpreting slot events which makes it
difficult to reason about the code's correctness.

I'm starting to wonder if instead of Jon's patch, we should just
disable DLLSC events on broken devices such as this PEX switch
or Jon's SSD. We'd only rely on PDC then but that's probably
sufficient. And the code changes would be less intrusive.

FWIW, Jon is still interested in upstreaming his quirk:

https://lore.kernel.org/linux-pci/446a21e2-aea2-773f-ca88-b6676b54b292@linux.dev/

@Richard: I think Jon's patch doesn't solve your issue does it?
Because I think the issue he's seeing is slightly different albeit
likewise caused by unreliable DLLSC. (His pertains to bringdown,
yours to bringup of the slot it seems.)

Thanks,

Lukas
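The bring-down-then-bring-up behavior described in this message can be
illustrated with a small toy model. This is only a sketch under stated
assumptions and does not reproduce the pciehp source: struct model_slot,
model_handle_events() and the dllsc_unreliable flag are hypothetical
names, and the flag stands in for the idea of disabling DLLSC events on
broken devices, which is a proposal in this thread rather than existing
kernel behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_EXP_SLTSTA_PDC    0x0008  /* Presence Detect Changed */
#define PCI_EXP_SLTSTA_DLLSC  0x0100  /* Data Link Layer State Changed */

/* Hypothetical toy model of one hotplug slot; not a kernel structure. */
struct model_slot {
	bool up;                /* slot is currently brought up */
	bool occupied;          /* presence detect reports a card */
	bool link_active;       /* Data Link Layer Link Active */
	bool dllsc_unreliable;  /* hypothetical quirk: ignore DLLSC */
	int removals;           /* how often the device was removed */
	int bringups;           /* how often the slot was brought up */
};

/*
 * Model of the described policy: on PDC or DLLSC, a slot that is up is
 * always brought down first. Only afterwards are presence and link
 * state checked, and the slot is brought up again if either is set.
 */
void model_handle_events(struct model_slot *s, uint16_t events)
{
	assert(s);

	/* Sketch of the proposal: rely on PDC only for broken devices. */
	if (s->dllsc_unreliable)
		events &= (uint16_t)~PCI_EXP_SLTSTA_DLLSC;

	if (!(events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)))
		return;

	if (s->up) {
		s->up = false;
		s->removals++;
	}

	if (s->occupied || s->link_active) {
		s->up = true;
		s->bringups++;
	}
}
```

In this model, a spurious DLLSC delivered to a slot that is up, occupied
and has an active link produces one removal followed by one bring-up,
which corresponds to the "detached and immediately reattached" symptom
in the bug report. With dllsc_unreliable set, the same event is ignored
and only PDC triggers the sequence.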
end of thread

Thread overview: 3+ messages
2022-09-21 11:40 [bugzilla-daemon@kernel.org: [Bug 216511] New: Spurious PCI_EXP_SLTSTA_DLLSC when hot plugging] Bjorn Helgaas
2022-09-21 18:03 ` Bjorn Helgaas
2022-09-21 18:56 ` Lukas Wunner