From: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
To: "guanghui.fgh" <guanghuifeng@linux.alibaba.com>
Cc: bhelgaas <bhelgaas@google.com>,
linux-pci <linux-pci@vger.kernel.org>,
kanie <kanie@linux.alibaba.com>,
alikernel-developer <alikernel-developer@linux.alibaba.com>
Subject: Re: [PATCH] PCI: Fix PCIe SBR dev/link wait error
Date: Wed, 26 Nov 2025 16:47:56 +0200 (EET) [thread overview]
Message-ID: <2e3a1e6b-40ae-3878-e237-fb9032796af8@linux.intel.com> (raw)
In-Reply-To: <6b654498-f43f-4772-aad8-d84798d25107.guanghuifeng@linux.alibaba.com>
[-- Attachment #1: Type: text/plain, Size: 11084 bytes --]
On Wed, 26 Nov 2025, guanghui.fgh wrote:
> Refer to the previous email discussion replies:
>
> 1. When a multifunction device is connected to a PCIe slot that supports hotplug,
> during the passthrogh device to guest process based on VFIO, some devices
> need to perform a hot reset to initialize the device:
> vfio_pci_dev_set_hot_reset ---> __pci_reset_slot/__pci_reset_bus --->
> pci_bridge_secondary_bus_reset ---> pci_bridge_wait_for_secondary_bus.
>
> After __pci_reset_slot/__pci_reset_bus calls pci_bridge_wait_for_secondary_bus,
> the device will be restored via pci_dev_restore.
...and then they call to pci_slot_restore_locked()/pci_bus_restore_locked()
that do recursively wait + restore config spaces since the commit
3e40aa29d47e ("PCI: Wait for Link before restoring Downstream Buses").
> However, when a ====== multifunction PCIe device =========
> is connected, executing pci_bridge_wait_for_secondary_bus only guarantees the restoration
> of a random device. For other devices that are still restoring, executing pci_dev_restore cannot
> restore the device state normally, resulting in errors or even device offline.
Addressing this doesn't require recursion AFAICT.
> 2. In the PCIe dpc_handler process, do pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link)
> to restores the PCIe link.
> But dpc_reset_link waits for the link to recover via pci_bridge_wait_for_secondary_bus before returning.
> Similarly, pcie_do_recovery restores devices via pci_walk_bridge(bridge, report_resume, &status),
>
> ====== which also requires pci_bridge_wait_for_secondary_bus to wait for all devices to recover completely.
> For other devices that are still restoring, executing report_resume cannot restore the device state normally,
> resulting in errors or even device offline.
This might lack wait for subordinate buses and ordering with restore, but
I don't think the right place to add that is
pci_bridge_wait_for_secondary_bus() that is used by others.
> 3. The PCIe specification only constrains the minimum wait time under different resets. In the historical kernel implementation,
> SBR wait also did not need to restore the PCIe device configuration space before waiting for the coordinate.
I'm not saying there isn't anything to fix/change but I'm not convinced by
the approach taken by your patch.
--
i.
> ------------------------------------------------------------------
> On Wed, 26 Nov 2025, guanghui.fgh wrote:
>
> > 1. Does this add recursion without restoring config space on each level
> > before starting the child wait?
> >
> > Yes
> > The current implementation does not require restoring the PCIe device
> > configuration space. The status is determined during the waiting
> > process, either based on software-recorded status or on the device's RO
> > and HWINIT type registers.
>
> What guarantees a link even come up without restoring the config space
> first (it may work in your case but is that true universally)?
>
> The commit 3e40aa29d47e ("PCI: Wait for Link before restoring Downstream
> Buses") tried to address this sub-hierarchy wait issue within the
> recursive bus restore (without me ever encountering this problem for
> real). So I don't understand why you had to add the recursion to the wait
> side as well?
>
> --
> i.
>
> > > When executing a PCIe secondary bus reset, all downstream switches and
> > > endpoints will generate reset events. Simultaneously, all PCIe links
> > > will undergo retraining, and each link will independently re-execute the
> > > LTSSM state machine training. Therefore, after executing the SBR, it is
> > > necessary to wait for all downstream links and devices to complete
> > > recovery. Otherwise, after the SBR returns, accessing devices with some
> > > links or endpoints not yet fully recovered may result in driver errors,
> > > or even trigger device offline issues.
> > >
> > > Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
> > > Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
> > > ---
> > > drivers/pci/pci.c | 112 ++++++++++++++++++++++++++++++++++++++--------
> > > 1 file changed, 94 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index b14dd064006c..9cf13fe69d94 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -4788,9 +4788,74 @@ static int pci_bus_max_d3cold_delay(const struct pci_bus *bus)
> > > return max(min_delay, max_delay);
> > > }
> > >
> > > +int __pci_bridge_wait_for_secondary_bus(struct pci_dev *, unsigned long, char *);
> > > +
> > > +static int pci_dev_wait_child(struct pci_dev *pdev, unsigned long start_t, int timeout,
> > > + char *reset_type)
> > > +{
> > > + struct pci_dev *child, **dev = NULL;
> > > + int ct = 0, i = 0, ret = 0, left = 1;
> > > + unsigned long dev_start_t;
> > > +
> > > + down_read(&pci_bus_sem);
> > > +
> > > + list_for_each_entry(child, &pdev->subordinate->devices, bus_list)
> > > + ct++;
> > > +
> > > + if (ct) {
> > > + dev = kzalloc(sizeof(struct pci_dev *) * ct, GFP_KERNEL);
> > > +
> > > + if(!dev) {
> > > + pci_err(pdev, "dev mem alloc err\n");
> > > + up_read(&pci_bus_sem);
> > > + return -ENOMEM;
> > > + }
> > > +
> > > + list_for_each_entry(child, &pdev->subordinate->devices, bus_list)
> > > + dev[i++] = pci_dev_get(child);
> > > + }
> > > +
> > > + up_read(&pci_bus_sem);
> > > +
> > > + for (i = 0; i < ct; ++i) {
> > > + left = 1;
> > > +
> > > +dev_wait:
> > > + dev_start_t = jiffies;
> > > + ret = pci_dev_wait(dev[i], reset_type, left);
> > > + timeout -= jiffies_to_msecs(jiffies - dev_start_t);
> > > +
> > > + if (ret) {
> > > + if (pci_dev_is_disconnected(dev[i]))
> > > + continue;
> > > +
> > > + if (timeout <= 0)
> > > + goto end;
> > > +
> > > + left <<= 1;
> > > + left = timeout > left ? left : timeout;
> > > + goto dev_wait;
> > > + }
> > > + }
> > > +
> > > + for (i = 0; i < ct; ++i) {
> > > + ret = __pci_bridge_wait_for_secondary_bus(dev[i], start_t, reset_type);
> >
> > Does this add recursion without restoring config space on each level
> > before starting the child wait?
> >
> > > + if (ret)
> > > + break;
> > > + }
> > > +
> > > +end:
> > > + for (i = 0; i < ct; ++i)
> > > + pci_dev_put(dev[i]);
> > > +
> > > + kfree(dev);
> > > + return ret;
> > > +}
> > > +
> > > /**
> > > - * pci_bridge_wait_for_secondary_bus - Wait for secondary bus to be accessible
> > > + * __pci_bridge_wait_for_secondary_bus - Wait for secondary bus to be accessible
> > > * @dev: PCI bridge
> > > + * @start_t: wait start jiffies time
> > > * @reset_type: reset type in human-readable form
> > > *
> > > * Handle necessary delays before access to the devices on the secondary
> > > @@ -4804,10 +4869,9 @@ static int pci_bus_max_d3cold_delay(const struct pci_bus *bus)
> > > * Return 0 on success or -ENOTTY if the first device on the secondary bus
> > > * failed to become accessible.
> > > */
> > > -int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
> > > +int __pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, unsigned long start_t, char *reset_type)
> > > {
> > > - struct pci_dev *child __free(pci_dev_put) = NULL;
> > > - int delay;
> > > + int delay, left;
> > >
> > > if (pci_dev_is_disconnected(dev))
> > > return 0;
> > > @@ -4835,8 +4899,6 @@ int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
> > > return 0;
> > > }
> > >
> > > - child = pci_dev_get(list_first_entry(&dev->subordinate->devices,
> > > - struct pci_dev, bus_list));
> > > up_read(&pci_bus_sem);
> > >
> > > /*
> > > @@ -4844,8 +4906,12 @@ int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
> > > * accessing the device after reset (that is 1000 ms + 100 ms).
> > > */
> > > if (!pci_is_pcie(dev)) {
> > > - pci_dbg(dev, "waiting %d ms for secondary bus\n", 1000 + delay);
> > > - msleep(1000 + delay);
> > > + left = 1000 + delay - jiffies_to_msecs(jiffies - start_t);
> > > + pci_dbg(dev, "waiting %d ms for secondary bus\n", left > 0 ? left : 0);
> > > +
> > > + if (left > 0)
> > > + msleep(left);
> > > +
> > > return 0;
> > > }
> > >
> > > @@ -4870,10 +4936,14 @@ int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
> > > if (pcie_get_speed_cap(dev) <= PCIE_SPEED_5_0GT) {
> > > u16 status;
> > >
> > > - pci_dbg(dev, "waiting %d ms for downstream link\n", delay);
> > > - msleep(delay);
> > > + left = delay - jiffies_to_msecs(jiffies - start_t);
> > > + pci_dbg(dev, "waiting %d ms for downstream link\n", left > 0 ? left : 0);
> > > +
> > > + if (left > 0)
> > > + msleep(left);
> > >
> > > - if (!pci_dev_wait(child, reset_type, PCI_RESET_WAIT - delay))
> > > + left = PCI_RESET_WAIT - jiffies_to_msecs(jiffies - start_t);
> > > + if(!pci_dev_wait_child(dev, start_t, left > 0 ? left : 0, reset_type))
> > > return 0;
> > >
> > > /*
> > > @@ -4888,20 +4958,26 @@ int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
> > > if (!(status & PCI_EXP_LNKSTA_DLLLA))
> > > return -ENOTTY;
> > >
> > > - return pci_dev_wait(child, reset_type,
> > > - PCIE_RESET_READY_POLL_MS - PCI_RESET_WAIT);
> > > + left = PCIE_RESET_READY_POLL_MS - jiffies_to_msecs(jiffies - start_t);
> > > + return pci_dev_wait_child(dev, start_t, left > 0 ? left : 0, reset_type);
> > > }
> > >
> > > - pci_dbg(dev, "waiting %d ms for downstream link, after activation\n",
> > > - delay);
> > > - if (!pcie_wait_for_link_delay(dev, true, delay)) {
> > > + left = delay - jiffies_to_msecs(jiffies - start_t);
> > > + pci_dbg(dev, "waiting %d ms for downstream link, after activation\n", left > 0 ? left : 0);
> > > +
> > > + if (!pcie_wait_for_link_delay(dev, true, left > 0 ? left : 0)) {
> > > /* Did not train, no need to wait any further */
> > > pci_info(dev, "Data Link Layer Link Active not set in %d msec\n", delay);
> > > return -ENOTTY;
> > > }
> > >
> > > - return pci_dev_wait(child, reset_type,
> > > - PCIE_RESET_READY_POLL_MS - delay);
> > > + left = PCIE_RESET_READY_POLL_MS - jiffies_to_msecs(jiffies - start_t);
> > > + return pci_dev_wait_child(dev, start_t, left > 0 ? left : 0, reset_type);
> > > +}
> > > +
> > > +int pci_bridge_wait_for_secondary_bus(struct pci_dev *dev, char *reset_type)
> > > +{
> > > + return __pci_bridge_wait_for_secondary_bus(dev, jiffies, reset_type);
> > > }
> > >
> > > void pci_reset_secondary_bus(struct pci_dev *dev)
> > >
> >
> > --
> > i.
> >
>
--
i.
next prev parent reply other threads:[~2025-11-26 14:48 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-24 10:45 [PATCH] PCI: Fix PCIe SBR dev/link wait error Guanghui Feng
2025-11-24 23:58 ` Bjorn Helgaas
2025-11-25 6:20 ` guanghui.fgh
2025-12-01 10:03 ` Lukas Wunner
2025-12-01 12:56 ` guanghuifeng
2025-12-01 13:26 ` Lukas Wunner
2025-12-01 14:46 ` guanghuifeng
2025-12-01 16:18 ` Lukas Wunner
2025-11-26 8:20 ` Ilpo Järvinen
2025-11-26 12:08 ` guanghui.fgh
2025-11-26 12:37 ` Ilpo Järvinen
2025-11-26 14:22 ` guanghui.fgh
2025-11-26 14:47 ` Ilpo Järvinen [this message]
2025-11-29 16:36 ` [PATCH v2] " Guanghui Feng
2025-12-01 9:21 ` Ilpo Järvinen
2025-12-01 12:21 ` guanghuifeng
2025-12-01 13:08 ` Ilpo Järvinen
2025-12-02 4:32 ` [PATCH v4 0/1] " Guanghui Feng
2025-12-02 4:32 ` [PATCH v4 v4 1/1] " Guanghui Feng
2025-12-02 16:49 ` Bjorn Helgaas
2025-12-02 16:51 ` Bjorn Helgaas
2025-12-03 14:41 ` Ilpo Järvinen
2025-11-30 5:17 ` [PATCH v3] " Guanghui Feng
2025-12-01 9:24 ` Ilpo Järvinen
2025-12-01 12:31 ` guanghuifeng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2e3a1e6b-40ae-3878-e237-fb9032796af8@linux.intel.com \
--to=ilpo.jarvinen@linux.intel.com \
--cc=alikernel-developer@linux.alibaba.com \
--cc=bhelgaas@google.com \
--cc=guanghuifeng@linux.alibaba.com \
--cc=kanie@linux.alibaba.com \
--cc=linux-pci@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.