From: Lukas Wunner <lukas@wunner.de>
To: Shuai Xue <xueshuai@linux.alibaba.com>
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
linuxppc-dev@lists.ozlabs.org, bhelgaas@google.com,
kbusch@kernel.org, sathyanarayanan.kuppuswamy@linux.intel.com,
mahesh@linux.ibm.com, oohall@gmail.com,
Jonathan.Cameron@huawei.com, terry.bowman@amd.com,
tianruidong@linux.alibaba.com
Subject: Re: [PATCH v7 2/5] PCI/DPC: Run recovery on device that detected the error
Date: Mon, 2 Feb 2026 22:09:27 +0100
Message-ID: <aYESh4bCE2lzTg2S@wunner.de>
In-Reply-To: <aYCujqZIvxElSsOE@wunner.de>

On Mon, Feb 02, 2026 at 03:02:54PM +0100, Lukas Wunner wrote:
> You're assuming that the parent of the Requester is always identical
> to the containing Downstream Port. But that's not necessarily the case:
>
> E.g., imagine a DPC-capable Root Port with a PCIe switch below
> whose Downstream Ports are not DPC-capable. Let's say an Endpoint
> beneath the PCIe switch sends ERR_FATAL upstream. AFAICS, your patch
> will cause pcie_do_recovery() to invoke dpc_reset_link() with the
> Downstream Port of the PCIe switch as argument. That function will
> then happily use pdev->dpc_cap even though it's 0.

Thinking about this some more, I realized there's another problem:
In a scenario like the one I've outlined above, after your change,
pcie_do_recovery() will only broadcast error_detected (and other
callbacks) below the Downstream Port of the PCIe switch -- and not
to any other devices below the containing Root Port.

However, the DPC-induced Link Down event at the Root Port results
in a Hot Reset being propagated down the hierarchy to any device
below the Root Port. So with your change, the siblings of the
Downstream Port on the PCIe switch will no longer be informed of
the reset and thus will no longer be given an opportunity to recover
after reset.
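
To make the difference in broadcast scope concrete, here is a minimal
userspace sketch. It is not kernel code: struct mock_dev, notify_below()
and the topology are hypothetical stand-ins, and the walk only models
"every device below a bridge gets its callback invoked".

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for a PCI device; not struct pci_dev. */
struct mock_dev {
	const char *name;
	struct mock_dev *child[MAX_CHILDREN];	/* NULL-terminated */
	int notified;				/* set once a callback ran */
};

/* Notify every device below @bridge and return how many were reached. */
static int notify_below(struct mock_dev *bridge)
{
	int i, n = 0;

	assert(bridge);
	for (i = 0; i < MAX_CHILDREN && bridge->child[i]; i++) {
		bridge->child[i]->notified = 1;
		n += 1 + notify_below(bridge->child[i]);
	}
	return n;
}

/* Root Port -> switch Upstream Port -> two Downstream Ports -> Endpoints */
static struct mock_dev ep_a      = { "ep_a",      { NULL }, 0 };
static struct mock_dev ep_b      = { "ep_b",      { NULL }, 0 };
static struct mock_dev dsp_a     = { "dsp_a",     { &ep_a }, 0 };
static struct mock_dev dsp_b     = { "dsp_b",     { &ep_b }, 0 };
static struct mock_dev sw_up     = { "sw_up",     { &dsp_a, &dsp_b }, 0 };
static struct mock_dev root_port = { "root_port", { &sw_up }, 0 };
```

Calling notify_below(&dsp_a) reaches only ep_a, whereas
notify_below(&root_port) reaches all five devices that the Hot Reset
affects in this mock topology, including dsp_b and ep_b.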

The premise on which this patch is built is false: it assumes that
the bridge upstream of the error-reporting device is always the
containing Downstream Port.
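
A small userspace sketch of why the two can differ. Everything here is
hypothetical (struct mock_pci_dev, find_dpc_port(), the 0x100 offset);
only the dpc_cap field name mirrors the pdev->dpc_cap mentioned above.
It also assumes the nearest DPC-capable ancestor is the port that
contained the error, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; only the fields needed here. */
struct mock_pci_dev {
	const char *name;
	struct mock_pci_dev *parent;	/* upstream bridge, NULL at the top */
	unsigned short dpc_cap;		/* DPC capability offset, 0 if absent */
};

/*
 * Walk upstream from the error-reporting device and return the nearest
 * port with a DPC capability, or NULL if no ancestor has one.
 */
static struct mock_pci_dev *find_dpc_port(struct mock_pci_dev *reporter)
{
	struct mock_pci_dev *port;

	assert(reporter);
	for (port = reporter->parent; port; port = port->parent)
		if (port->dpc_cap)
			return port;
	return NULL;
}

/* DPC-capable Root Port with a non-DPC switch and an Endpoint below it. */
static struct mock_pci_dev root_port     = { "root_port", NULL, 0x100 };
static struct mock_pci_dev sw_upstream   = { "sw_upstream", &root_port, 0 };
static struct mock_pci_dev sw_downstream = { "sw_downstream", &sw_upstream, 0 };
static struct mock_pci_dev endpoint      = { "endpoint", &sw_downstream, 0 };
```

Here endpoint.parent is sw_downstream, whose dpc_cap is 0, while
find_dpc_port(&endpoint) returns root_port.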

It seems the only reason why you want to pass the reporting device
to pcie_do_recovery() is that you want to call pcie_clear_device_status()
and pci_aer_clear_nonfatal_status() with that device.

However, as I've said before, those calls are AER-specific and should
be moved out of pcie_do_recovery() so that it becomes generic and can
be used by EEH and s390:

https://lore.kernel.org/all/aPYKe1UKKkR7qrt1@wunner.de/

There's another problem: When a device experiences an error while DPC
is ongoing (i.e. while the link is down), its ERR_FATAL or ERR_NONFATAL
Message may not come through. Still, the error bits are set in the
device's Uncorrectable Error Status register. So I think what we need to
do is walk the hierarchy below the containing Downstream Port after the
link is back up and search for devices with any error bits set,
then report and clear those errors. We may do this after first
examining the device in the DPC Error Source ID register.
Any additional errors found while walking the hierarchy can then
be identified as "occurred during DPC recovery".
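
The walk described above could look roughly like the following
userspace sketch. It is an illustration only: struct mock_dev,
scan_below() and the status values are hypothetical, the status field
merely models the Uncorrectable Error Status register, and for brevity
the device named by the Error Source ID is handled inside the same walk
rather than examined beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for a PCI device; not struct pci_dev. */
struct mock_dev {
	const char *name;
	uint32_t uncor_status;	/* models Uncorrectable Error Status */
	struct mock_dev *child[MAX_CHILDREN];	/* NULL-terminated */
};

/*
 * Report and clear latched errors in every device below @port.
 * @source is the device named by the DPC Error Source ID. Any other
 * device with bits set is tagged as having erred during DPC recovery.
 * Returns the number of such additional devices.
 */
static int scan_below(struct mock_dev *port, const struct mock_dev *source)
{
	int i, extra = 0;

	assert(port);
	for (i = 0; i < MAX_CHILDREN && port->child[i]; i++) {
		struct mock_dev *dev = port->child[i];

		if (dev->uncor_status) {
			int during = (dev != source);

			printf("%s: status 0x%08x%s\n", dev->name,
			       (unsigned int)dev->uncor_status,
			       during ? " (occurred during DPC recovery)" : "");
			/* The mock simply zeroes the field to clear it. */
			dev->uncor_status = 0;
			extra += during;
		}
		extra += scan_below(dev, source);
	}
	return extra;
}

/* Root Port with three Endpoints: two have error bits latched. */
static struct mock_dev ep_src    = { "ep_src",   0x10,   { NULL } };
static struct mock_dev ep_other  = { "ep_other", 0x4000, { NULL } };
static struct mock_dev ep_clean  = { "ep_clean", 0,      { NULL } };
static struct mock_dev root_port = { "root_port", 0,
				     { &ep_src, &ep_other, &ep_clean } };
```

With ep_src as the Error Source ID device, scan_below(&root_port,
&ep_src) reports both ep_src and ep_other, tags only ep_other as having
occurred during DPC recovery, and leaves all status fields cleared.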

Thanks,

Lukas