* [PATCH] PCI: Drop unnecessary retries when restoring BARs
@ 2026-05-03 13:34 Lukas Wunner
2026-05-03 13:51 ` sashiko-bot
0 siblings, 1 reply; 10+ messages in thread
From: Lukas Wunner @ 2026-05-03 13:34 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: linux-pci, Marco Nenciarini, Michal Winiarski, Ilpo Jarvinen,
Rafael J. Wysocki, Eric Chanudet, Jean Guyader, Alex Williamson,
Sinan Kaya
In 2012, commit 26f41062f28d ("PCI: check for pci bar restore completion
and retry") amended pci_restore_state() to attempt BAR restoration up to
10 times. This was necessary because back in the day, only a 100 msec
delay was observed after pcie_flr() carried out a Function Level Reset.
The retries ensured that BARs were restored even if devices needed more
time to come out of reset.
In 2016, commit 5adecf817dd6 ("PCI: Wait for up to 1000ms after FLR
reset") extended the delay to 1 sec. Commit a2758b6b8fdb ("PCI: Rename
pci_flr_wait() to pci_dev_wait() and make it generic") subsequently
extended it further to 60 sec.
The lengthened delay makes it unnecessary to retry BAR restoration, so
drop it.
Reported-by: Bjorn Helgaas <bhelgaas@google.com>
Closes: https://lore.kernel.org/r/20260416225745.GA41850@bhelgaas/
Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
Jesse did ask "why we don't do this delay/check when we perform the FLR
in the first place" when he applied the patch, but nobody followed up
with a patch until Alex did four years later:
https://lore.kernel.org/r/003AAFE53969E14CB1F09B6FD68C3CD409A8B4F2@ORSMSX105.amr.corp.intel.com/
drivers/pci/pci.c | 36 +++++++++++++-----------------------
1 file changed, 13 insertions(+), 23 deletions(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8f7cfcc..1e386df 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1764,7 +1764,7 @@ int pci_save_state(struct pci_dev *dev)
EXPORT_SYMBOL(pci_save_state);
static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
- u32 saved_val, int retry, bool force)
+ u32 saved_val, bool force)
{
u32 val;
@@ -1772,52 +1772,42 @@ static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
if (!force && val == saved_val)
return;
- for (;;) {
- pci_dbg(pdev, "restore config %#04x: %#010x -> %#010x\n",
- offset, val, saved_val);
- pci_write_config_dword(pdev, offset, saved_val);
- if (retry-- <= 0)
- return;
+ pci_dbg(pdev, "restore config %#04x: %#010x -> %#010x\n", offset, val,
+ saved_val);
- pci_read_config_dword(pdev, offset, &val);
- if (val == saved_val)
- return;
-
- mdelay(1);
- }
+ pci_write_config_dword(pdev, offset, saved_val);
}
static void pci_restore_config_space_range(struct pci_dev *pdev,
- int start, int end, int retry,
- bool force)
+ int start, int end, bool force)
{
int index;
for (index = end; index >= start; index--)
pci_restore_config_dword(pdev, 4 * index,
pdev->saved_config_space[index],
- retry, force);
+ force);
}
static void pci_restore_config_space(struct pci_dev *pdev)
{
if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) {
- pci_restore_config_space_range(pdev, 10, 15, 0, false);
+ pci_restore_config_space_range(pdev, 10, 15, false);
/* Restore BARs before the command register. */
- pci_restore_config_space_range(pdev, 4, 9, 10, false);
- pci_restore_config_space_range(pdev, 0, 3, 0, false);
+ pci_restore_config_space_range(pdev, 4, 9, false);
+ pci_restore_config_space_range(pdev, 0, 3, false);
} else if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
- pci_restore_config_space_range(pdev, 12, 15, 0, false);
+ pci_restore_config_space_range(pdev, 12, 15, false);
/*
* Force rewriting of prefetch registers to avoid S3 resume
* issues on Intel PCI bridges that occur when these
* registers are not explicitly written.
*/
- pci_restore_config_space_range(pdev, 9, 11, 0, true);
- pci_restore_config_space_range(pdev, 0, 8, 0, false);
+ pci_restore_config_space_range(pdev, 9, 11, true);
+ pci_restore_config_space_range(pdev, 0, 8, false);
} else {
- pci_restore_config_space_range(pdev, 0, 15, 0, false);
+ pci_restore_config_space_range(pdev, 0, 15, false);
}
}
--
2.51.0
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH] PCI: Drop unnecessary retries when restoring BARs
2026-05-03 13:34 [PATCH] PCI: Drop unnecessary retries when restoring BARs Lukas Wunner
@ 2026-05-03 13:51 ` sashiko-bot
2026-05-04 7:49 ` Lukas Wunner
0 siblings, 1 reply; 10+ messages in thread
From: sashiko-bot @ 2026-05-03 13:51 UTC (permalink / raw)
To: Lukas Wunner; +Cc: linux-pci
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Removing the read-back and retry loop for BAR restoration in `pci_restore_state()` introduces a risk of silent regressions for hardware resuming from non-FLR resets (such as D3hot to D0 transitions or custom driver resets). The commit incorrectly assumes the 60s delay from `pci_dev_wait()` covers all usages, but standard PM resume paths only delay for 10ms (`PCI_PM_D3_WAIT`) before calling `pci_restore_state()`. Historically, hardware that needed slightly longer to accept configuration writes relied on the 10x 1ms retry loop to successfully restore BARs. By removing both the retry and the read-back verification, BAR writes to slow devices will be silently dropped, leaving hardware unconfigured and causing MMIO accesses to result in IOMMU faults or kernel crashes.
--
commit 28d12557c0e570f2b2d21da1dcbb7889df2f4672
Author: Lukas Wunner <lukas@wunner.de>
PCI: Drop unnecessary retries when restoring BARs
This commit removes the retry loop when restoring PCI configuration space.
It relies on the lengthened FLR delays now present in the kernel, concluding
that retries for BAR restoration are no longer needed.
Closes: https://lore.kernel.org/r/20260416225745.GA41850@bhelgaas/
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 8f7cfcc00090..1e386dfaa669 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1764,7 +1764,7 @@ int pci_save_state(struct pci_dev *dev)
[ ... ]
> static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
> - u32 saved_val, int retry, bool force)
> + u32 saved_val, bool force)
> {
> u32 val;
>
> @@ -1772,52 +1772,42 @@ static void pci_restore_config_dword(struct pci_dev *pdev, int offset,
> if (!force && val == saved_val)
> return;
>
> - for (;;) {
> - pci_dbg(pdev, "restore config %#04x: %#010x -> %#010x\n",
> - offset, val, saved_val);
> - pci_write_config_dword(pdev, offset, saved_val);
> - if (retry-- <= 0)
> - return;
> + pci_dbg(pdev, "restore config %#04x: %#010x -> %#010x\n", offset, val,
> + saved_val);
>
> - pci_read_config_dword(pdev, offset, &val);
> - if (val == saved_val)
> - return;
> -
> - mdelay(1);
> - }
> + pci_write_config_dword(pdev, offset, saved_val);
> }
Does removing the read-back and retry loop introduce a risk for hardware
resuming from non-FLR resets?
While FLR paths might now have a 60 second delay, pci_restore_state() is
used globally across the kernel. For example, during a PM resume from
D3hot to D0, the kernel typically only delays for 10ms (PCI_PM_D3_WAIT)
before restoring state.
If a device is slow to exit reset, will these configuration writes be
silently dropped without the fallback retry loop, potentially leaving BARs
unconfigured and causing MMIO accesses to fault?
--
Sashiko AI review · https://sashiko.dev/#/patchset/785c98b50a7a00d0698848c75d51b8f5669ad18f.1777814679.git.lukas@wunner.de?part=1
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] PCI: Drop unnecessary retries when restoring BARs
2026-05-03 13:51 ` sashiko-bot
@ 2026-05-04 7:49 ` Lukas Wunner
2026-05-04 17:09 ` Bjorn Helgaas
0 siblings, 1 reply; 10+ messages in thread
From: Lukas Wunner @ 2026-05-04 7:49 UTC (permalink / raw)
To: sashiko-bot
Cc: linux-pci, sashiko, Bjorn Helgaas, Marco Nenciarini,
Michal Winiarski, Ilpo Jarvinen, Rafael J. Wysocki, Eric Chanudet,
Jean Guyader, Alex Williamson, Sinan Kaya
On Sun, May 03, 2026 at 01:51:08PM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential
> issue(s) to consider:
> - [High] Removing the read-back and retry loop for BAR restoration in
> `pci_restore_state()` introduces a risk of silent regressions for
> hardware resuming from non-FLR resets (such as D3hot to D0 transitions
> or custom driver resets). The commit incorrectly assumes the 60s delay
> from `pci_dev_wait()` covers all usages, but standard PM resume paths
> only delay for 10ms (`PCI_PM_D3_WAIT`) before calling `pci_restore_state()`.
> Historically, hardware that needed slightly longer to accept
> configuration writes relied on the 10x 1ms retry loop to successfully
> restore BARs. By removing both the retry and the read-back verification,
> BAR writes to slow devices will be silently dropped, leaving hardware
> unconfigured and causing MMIO accesses to result in IOMMU faults or
> kernel crashes.
Hallucination alert:
PCI_PM_D3_WAIT does not exist, it was renamed to PCI_PM_D3HOT_WAIT
six years ago by commit 3789af9a13e5, which went into v5.10.
The macro is used in:
pci_pm_resume_noirq()
pci_pm_default_resume_early()
pci_pm_power_up_and_verify_state()
pci_power_up()
pci_dev_d3_sleep()
However before pci_power_up() calls pci_dev_d3_sleep(), it reads
the PMCSR register and errors out if config space is inaccessible.
Hence when pci_restore_state() is invoked a bit later, config space
can be assumed to be accessible.
sashiko reviews are a mixed bag, I think this one doesn't even reach
junior developer level. :(
Thanks,
Lukas
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] PCI: Drop unnecessary retries when restoring BARs
2026-05-04 7:49 ` Lukas Wunner
@ 2026-05-04 17:09 ` Bjorn Helgaas
2026-05-04 19:31 ` Rafael J. Wysocki
0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2026-05-04 17:09 UTC (permalink / raw)
To: Lukas Wunner
Cc: sashiko-bot, linux-pci, sashiko, Marco Nenciarini,
Michal Winiarski, Ilpo Jarvinen, Rafael J. Wysocki, Eric Chanudet,
Jean Guyader, Alex Williamson, Sinan Kaya, Mario Limonciello
[+cc Mario]
On Mon, May 04, 2026 at 09:49:36AM +0200, Lukas Wunner wrote:
> On Sun, May 03, 2026 at 01:51:08PM +0000, sashiko-bot@kernel.org wrote:
> > Thank you for your contribution! Sashiko AI review found 1 potential
> > issue(s) to consider:
> > - [High] Removing the read-back and retry loop for BAR restoration in
> > `pci_restore_state()` introduces a risk of silent regressions for
> > hardware resuming from non-FLR resets (such as D3hot to D0 transitions
> > or custom driver resets). The commit incorrectly assumes the 60s delay
> > from `pci_dev_wait()` covers all usages, but standard PM resume paths
> > only delay for 10ms (`PCI_PM_D3_WAIT`) before calling `pci_restore_state()`.
> > Historically, hardware that needed slightly longer to accept
> > configuration writes relied on the 10x 1ms retry loop to successfully
> > restore BARs. By removing both the retry and the read-back verification,
> > BAR writes to slow devices will be silently dropped, leaving hardware
> > unconfigured and causing MMIO accesses to result in IOMMU faults or
> > kernel crashes.
>
> Hallucination alert:
>
> PCI_PM_D3_WAIT does not exist, it was renamed to PCI_PM_D3HOT_WAIT
> six years ago by commit 3789af9a13e5, which went into v5.10.
>
> The macro is used in:
>
> pci_pm_resume_noirq()
> pci_pm_default_resume_early()
> pci_pm_power_up_and_verify_state()
> pci_power_up()
> pci_dev_d3_sleep()
>
> However before pci_power_up() calls pci_dev_d3_sleep(), it reads
> the PMCSR register and errors out if config space is inaccessible.
>
> Hence when pci_restore_state() is invoked a bit later, config space
> can be assumed to be accessible.
I don't quite follow this. In this path, pci_power_up() changes a
device from some low-power state to D0. If the device was in D3hot or
D3cold, we must delay at least 10ms before any access to it (PCIe
r7.0, sec 5.9). pci_power_up() doesn't do any delay before the PMCSR
read. That part seems like a pre-existing issue even before this
patch.
If the PMCSR read returns PCI_POSSIBLE_ERROR(), pci_power_up() does
complain "Unable to change power state ... to D0" and return -EIO, but
pci_pm_power_up_and_verify_state() doesn't look at it, and
pci_pm_default_resume_early() continues on to pci_restore_state(), so
it looks to me like we could try to restore state to an inaccessible
device.
We do call pci_dev_wait() in pci_pm_reset(), which does a D3hot -> D0
transition; shouldn't we do the same in pci_power_up()?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] PCI: Drop unnecessary retries when restoring BARs
2026-05-04 17:09 ` Bjorn Helgaas
@ 2026-05-04 19:31 ` Rafael J. Wysocki
2026-05-04 21:17 ` Bjorn Helgaas
0 siblings, 1 reply; 10+ messages in thread
From: Rafael J. Wysocki @ 2026-05-04 19:31 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Lukas Wunner, sashiko-bot, linux-pci, sashiko, Marco Nenciarini,
Michal Winiarski, Ilpo Jarvinen, Rafael J. Wysocki, Eric Chanudet,
Jean Guyader, Alex Williamson, Sinan Kaya, Mario Limonciello,
Mika Westerberg
On Mon, May 4, 2026 at 7:09 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Mario]
>
> On Mon, May 04, 2026 at 09:49:36AM +0200, Lukas Wunner wrote:
> > On Sun, May 03, 2026 at 01:51:08PM +0000, sashiko-bot@kernel.org wrote:
> > > Thank you for your contribution! Sashiko AI review found 1 potential
> > > issue(s) to consider:
> > > - [High] Removing the read-back and retry loop for BAR restoration in
> > > `pci_restore_state()` introduces a risk of silent regressions for
> > > hardware resuming from non-FLR resets (such as D3hot to D0 transitions
> > > or custom driver resets). The commit incorrectly assumes the 60s delay
> > > from `pci_dev_wait()` covers all usages, but standard PM resume paths
> > > only delay for 10ms (`PCI_PM_D3_WAIT`) before calling `pci_restore_state()`.
> > > Historically, hardware that needed slightly longer to accept
> > > configuration writes relied on the 10x 1ms retry loop to successfully
> > > restore BARs. By removing both the retry and the read-back verification,
> > > BAR writes to slow devices will be silently dropped, leaving hardware
> > > unconfigured and causing MMIO accesses to result in IOMMU faults or
> > > kernel crashes.
> >
> > Hallucination alert:
> >
> > PCI_PM_D3_WAIT does not exist, it was renamed to PCI_PM_D3HOT_WAIT
> > six years ago by commit 3789af9a13e5, which went into v5.10.
> >
> > The macro is used in:
> >
> > pci_pm_resume_noirq()
> > pci_pm_default_resume_early()
> > pci_pm_power_up_and_verify_state()
> > pci_power_up()
> > pci_dev_d3_sleep()
> >
> > However before pci_power_up() calls pci_dev_d3_sleep(), it reads
> > the PMCSR register and errors out if config space is inaccessible.
> >
> > Hence when pci_restore_state() is invoked a bit later, config space
> > can be assumed to be accessible.
>
> I don't quite follow this. In this path, pci_power_up() changes a
> device from some low-power state to D0. If the device was in D3hot or
> D3cold, we must delay at least 10ms before any access to it (PCIe
> r7.0, sec 5.9).
D3hot and D3cold are different in this respect, so using "or" here is
inaccurate and confusing.
Accessing the config space of a device in D3hot is entirely correct
and doesn't require any delay.
Accessing the config space of a device in D3cold is questionable
because the device may not be accessible, but then the host bridge
should just fail the access.
The 10 ms delay is after attempting to program the device into D0 from
D3hot (or the other way around) and it is observed as required.
> pci_power_up() doesn't do any delay before the PMCSR read.
It assumes that platform_pci_set_power_state() has run and either it
has succeeded or the config space is not accessible which is when
PCI_POSSIBLE_ERROR() will trigger. Unfortunately, there is no
canonical way to verify that power has been restored to the device
other than attempting to access it.
> That part seems like a pre-existing issue even before this patch.
I beg to differ.
> If the PMCSR read returns PCI_POSSIBLE_ERROR(), pci_power_up() does
> complain "Unable to change power state ... to D0" and return -EIO, but
> pci_pm_power_up_and_verify_state() doesn't look at it,
In fact, pci_update_current_state() changes the power state to D3cold
if the config space is not accessible.
> and pci_pm_default_resume_early() continues on to pci_restore_state(), so
> it looks to me like we could try to restore state to an inaccessible
> device.
So the pci_restore_state() call could be avoided if the power state of
the device was D3cold, but then the question is how much of a problem
that is in practice.
> We do call pci_dev_wait() in pci_pm_reset(), which does a D3hot -> D0
> transition; shouldn't we do the same in pci_power_up()?
It would be good to make them consistent, but maybe the other way around?
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] PCI: Drop unnecessary retries when restoring BARs
2026-05-04 19:31 ` Rafael J. Wysocki
@ 2026-05-04 21:17 ` Bjorn Helgaas
2026-05-05 10:43 ` Rafael J. Wysocki
0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2026-05-04 21:17 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Lukas Wunner, sashiko-bot, linux-pci, sashiko, Marco Nenciarini,
Michal Winiarski, Ilpo Jarvinen, Eric Chanudet, Jean Guyader,
Alex Williamson, Sinan Kaya, Mario Limonciello, Mika Westerberg
On Mon, May 04, 2026 at 09:31:51PM +0200, Rafael J. Wysocki wrote:
> On Mon, May 4, 2026 at 7:09 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, May 04, 2026 at 09:49:36AM +0200, Lukas Wunner wrote:
> > > On Sun, May 03, 2026 at 01:51:08PM +0000, sashiko-bot@kernel.org wrote:
> > > > Thank you for your contribution! Sashiko AI review found 1 potential
> > > > issue(s) to consider:
> > > > - [High] Removing the read-back and retry loop for BAR restoration in
> > > > `pci_restore_state()` introduces a risk of silent regressions for
> > > > hardware resuming from non-FLR resets (such as D3hot to D0 transitions
> > > > or custom driver resets). The commit incorrectly assumes the 60s delay
> > > > from `pci_dev_wait()` covers all usages, but standard PM resume paths
> > > > only delay for 10ms (`PCI_PM_D3_WAIT`) before calling `pci_restore_state()`.
> > > > Historically, hardware that needed slightly longer to accept
> > > > configuration writes relied on the 10x 1ms retry loop to successfully
> > > > restore BARs. By removing both the retry and the read-back verification,
> > > > BAR writes to slow devices will be silently dropped, leaving hardware
> > > > unconfigured and causing MMIO accesses to result in IOMMU faults or
> > > > kernel crashes.
> > >
> > > Hallucination alert:
> > >
> > > PCI_PM_D3_WAIT does not exist, it was renamed to PCI_PM_D3HOT_WAIT
> > > six years ago by commit 3789af9a13e5, which went into v5.10.
> > >
> > > The macro is used in:
> > >
> > > pci_pm_resume_noirq()
> > > pci_pm_default_resume_early()
> > > pci_pm_power_up_and_verify_state()
> > > pci_power_up()
> > > pci_dev_d3_sleep()
> > >
> > > However before pci_power_up() calls pci_dev_d3_sleep(), it reads
> > > the PMCSR register and errors out if config space is inaccessible.
> > >
> > > Hence when pci_restore_state() is invoked a bit later, config space
> > > can be assumed to be accessible.
> >
> > I don't quite follow this. In this path, pci_power_up() changes a
> > device from some low-power state to D0. If the device was in D3hot or
> > D3cold, we must delay at least 10ms before any access to it (PCIe
> > r7.0, sec 5.9).
>
> D3hot and D3cold are different in this respect, so using "or" here is
> inaccurate and confusing.
>
> Accessing the config space of a device in D3hot is entirely correct
> and doesn't require any delay.
>
> Accessing the config space of a device in D3cold is questionable
> because the device may not be accessible, but then the host bridge
> should just fail the access.
>
> The 10 ms delay is after attempting to program the device into D0 from
> D3hot (or the other way around) and it is observed as required.
Thanks for taking a look at this!
If pci_power_up() is doing D3cold -> D0, main power is initially off,
so platform_pci_set_power_state(PCI_D0) would turn on main power,
which is a Fundamental Reset leaving the device in D0uninitialized.
In general the device needs at least 100 ms after that reset before
any access, e.g., before the PMCSR read.
If pci_power_up() is doing D3hot -> D0, the device already has main
power, so I suppose platform_pci_set_power_state(PCI_D0) doesn't do
anything. The device remains in D3hot, the PMCSR read should be fine,
and the PMCSR write to transition to D0 is immediately followed by the
10 ms delay. When No_Soft_Reset == 1, this should be enough.
But when No_Soft_Reset == 0, the transition is to D0uninitialized,
which sec 2.3.1 includes as a reset condition after which the device
is permitted to return RRS.
So to me it looks like we need these delays:
- For D3cold -> D0, msleep(100) and pci_dev_wait() between
platform_pci_set_power_state(PCI_D0) and the PMCSR read, similar
to what pcie_flr() does.
Maybe platform_pci_set_power_state() takes care of some or all of
this internally?
- For D3hot -> D0 with No_Soft_Reset == 0, pci_dev_wait() after the
PMCSR write to transition to D0.
abbcf0e2a99d ("PCI: Wait for device to become ready after a power
management reset") added the similar wait in pci_pm_reset().
> > pci_power_up() doesn't do any delay before the PMCSR read.
>
> It assumes that platform_pci_set_power_state() has run and either it
> has succeeded or the config space is not accessible which is when
> PCI_POSSIBLE_ERROR() will trigger. Unfortunately, there is no
> canonical way to verify that power has been restored to the device
> other than attempting to access it.
>
> > That part seems like a pre-existing issue even before this patch.
>
> I beg to differ.
>
> > If the PMCSR read returns PCI_POSSIBLE_ERROR(), pci_power_up() does
> > complain "Unable to change power state ... to D0" and return -EIO, but
> > pci_pm_power_up_and_verify_state() doesn't look at it,
>
> In fact, pci_update_current_state() changes the power state to D3cold
> if the config space is not accessible.
>
> > and pci_pm_default_resume_early() continues on to pci_restore_state(), so
> > it looks to me like we could try to restore state to an inaccessible
> > device.
>
> So the pci_restore_state() call could be avoided if the power state of
> the device was D3cold, but then the question is how much of a problem
> that is in practice.
>
> > We do call pci_dev_wait() in pci_pm_reset(), which does a D3hot -> D0
> > transition; shouldn't we do the same in pci_power_up()?
>
> It would be good to make them consistent, but maybe the other way around?
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] PCI: Drop unnecessary retries when restoring BARs
2026-05-04 21:17 ` Bjorn Helgaas
@ 2026-05-05 10:43 ` Rafael J. Wysocki
2026-05-08 0:17 ` Bjorn Helgaas
0 siblings, 1 reply; 10+ messages in thread
From: Rafael J. Wysocki @ 2026-05-05 10:43 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Rafael J. Wysocki, Lukas Wunner, sashiko-bot, linux-pci, sashiko,
Marco Nenciarini, Michal Winiarski, Ilpo Jarvinen, Eric Chanudet,
Jean Guyader, Alex Williamson, Sinan Kaya, Mario Limonciello,
Mika Westerberg
On Mon, May 4, 2026 at 11:17 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, May 04, 2026 at 09:31:51PM +0200, Rafael J. Wysocki wrote:
> > On Mon, May 4, 2026 at 7:09 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Mon, May 04, 2026 at 09:49:36AM +0200, Lukas Wunner wrote:
> > > > On Sun, May 03, 2026 at 01:51:08PM +0000, sashiko-bot@kernel.org wrote:
> > > > > Thank you for your contribution! Sashiko AI review found 1 potential
> > > > > issue(s) to consider:
> > > > > - [High] Removing the read-back and retry loop for BAR restoration in
> > > > > `pci_restore_state()` introduces a risk of silent regressions for
> > > > > hardware resuming from non-FLR resets (such as D3hot to D0 transitions
> > > > > or custom driver resets). The commit incorrectly assumes the 60s delay
> > > > > from `pci_dev_wait()` covers all usages, but standard PM resume paths
> > > > > only delay for 10ms (`PCI_PM_D3_WAIT`) before calling `pci_restore_state()`.
> > > > > Historically, hardware that needed slightly longer to accept
> > > > > configuration writes relied on the 10x 1ms retry loop to successfully
> > > > > restore BARs. By removing both the retry and the read-back verification,
> > > > > BAR writes to slow devices will be silently dropped, leaving hardware
> > > > > unconfigured and causing MMIO accesses to result in IOMMU faults or
> > > > > kernel crashes.
> > > >
> > > > Hallucination alert:
> > > >
> > > > PCI_PM_D3_WAIT does not exist, it was renamed to PCI_PM_D3HOT_WAIT
> > > > six years ago by commit 3789af9a13e5, which went into v5.10.
> > > >
> > > > The macro is used in:
> > > >
> > > > pci_pm_resume_noirq()
> > > > pci_pm_default_resume_early()
> > > > pci_pm_power_up_and_verify_state()
> > > > pci_power_up()
> > > > pci_dev_d3_sleep()
> > > >
> > > > However before pci_power_up() calls pci_dev_d3_sleep(), it reads
> > > > the PMCSR register and errors out if config space is inaccessible.
> > > >
> > > > Hence when pci_restore_state() is invoked a bit later, config space
> > > > can be assumed to be accessible.
> > >
> > > I don't quite follow this. In this path, pci_power_up() changes a
> > > device from some low-power state to D0. If the device was in D3hot or
> > > D3cold, we must delay at least 10ms before any access to it (PCIe
> > > r7.0, sec 5.9).
> >
> > D3hot and D3cold are different in this respect, so using "or" here is
> > inaccurate and confusing.
> >
> > Accessing the config space of a device in D3hot is entirely correct
> > and doesn't require any delay.
> >
> > Accessing the config space of a device in D3cold is questionable
> > because the device may not be accessible, but then the host bridge
> > should just fail the access.
> >
> > The 10 ms delay is after attempting to program the device into D0 from
> > D3hot (or the other way around) and it is observed as required.
>
> Thanks for taking a look at this!
No prob.
> If pci_power_up() is doing D3cold -> D0, main power is initially off,
> so platform_pci_set_power_state(PCI_D0) would turn on main power,
> which is a Fundamental Reset leaving the device in D0uninitialized.
> In general the device needs at least 100 ms after that reset before
> any access, e.g., before the PMCSR read.
Yes, it does in general, but that should be covered by
platform_pci_set_power_state() because on some platforms the delay can
be shorter.
> If pci_power_up() is doing D3hot -> D0, the device already has main
> power, so I suppose platform_pci_set_power_state(PCI_D0) doesn't do
> anything.
It may or may not. Some platforms supply AML to run during D3hot ->
D0 transitions (and the ACPI spec allows that).
> The device remains in D3hot, the PMCSR read should be fine,
> and the PMCSR write to transition to D0 is immediately followed by the
> 10 ms delay. When No_Soft_Reset == 1, this should be enough.
>
> But when No_Soft_Reset == 0, the transition is to D0uninitialized,
> which sec 2.3.1 includes as a reset condition after which the device
> is permitted to return RRS.
>
> So to me it looks like we need these delays:
>
> - For D3cold -> D0, msleep(100) and pci_dev_wait() between
> platform_pci_set_power_state(PCI_D0) and the PMCSR read, similar
> to what pcie_flr() does.
Adding those delays directly to the PCI PM code would ruin the timing
of system resume on many systems where devices go into D3cold during
suspend, or are powered off by the platform firmware later in the case
of S3 transitions. For instance, during resume from S3, everything is
typically powered up by the platform firmware which also takes care of
the delays and the kernel effectively gets all devices in D0uninit.
Also, as stated above, IMO it is better if
platform_pci_set_power_state() takes care of the delay.
> Maybe platform_pci_set_power_state() takes care of some or all of
> this internally?
At least on some platforms, yes it does.
Moreover, the power-on may not actually happen during
platform_pci_set_power_state() for the device in question, but for its
parent. It is quite common to provide an ACPI power resource for a
PCIe root port for bringing up the downstream link, which doesn't
really change the power state of the root port itself, but it affects
the subordinate device (and possibly everything below it).
Of course, a good question is what Alternative OSes do when
transitioning PCIe devices from D3cold to D0, but I think I can find
that out.
> - For D3hot -> D0 with No_Soft_Reset == 0, pci_dev_wait() after the
> PMCSR write to transition to D0.
>
> abbcf0e2a99d ("PCI: Wait for device to become ready after a power
> management reset") added the similar wait in pci_pm_reset().
>
> > > pci_power_up() doesn't do any delay before the PMCSR read.
> >
> > It assumes that platform_pci_set_power_state() has run and either it
> > has succeeded or the config space is not accessible which is when
> > PCI_POSSIBLE_ERROR() will trigger. Unfortunately, there is no
> > canonical way to verify that power has been restored to the device
> > other than attempting to access it.
> >
> > > That part seems like a pre-existing issue even before this patch.
> >
> > I beg to differ.
> >
> > > If the PMCSR read returns PCI_POSSIBLE_ERROR(), pci_power_up() does
> > > complain "Unable to change power state ... to D0" and return -EIO, but
> > > pci_pm_power_up_and_verify_state() doesn't look at it,
> >
> > In fact, pci_update_current_state() changes the power state to D3cold
> > if the config space is not accessible.
> >
> > > and pci_pm_default_resume_early() continues on to pci_restore_state(), so
> > > it looks to me like we could try to restore state to an inaccessible
> > > device.
> >
> > So the pci_restore_state() call could be avoided if the power state of
> > the device was D3cold, but then the question is how much of a problem
> > that is in practice.
> >
> > > We do call pci_dev_wait() in pci_pm_reset(), which does a D3hot -> D0
> > > transition; shouldn't we do the same in pci_power_up()?
> >
> > It would be good to make them consistent, but maybe the other way around?
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] PCI: Drop unnecessary retries when restoring BARs
2026-05-05 10:43 ` Rafael J. Wysocki
@ 2026-05-08 0:17 ` Bjorn Helgaas
2026-05-08 12:51 ` Rafael J. Wysocki
0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2026-05-08 0:17 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Lukas Wunner, sashiko-bot, linux-pci, sashiko, Marco Nenciarini,
Michal Winiarski, Ilpo Jarvinen, Eric Chanudet, Jean Guyader,
Alex Williamson, Sinan Kaya, Mario Limonciello, Mika Westerberg
On Tue, May 05, 2026 at 12:43:17PM +0200, Rafael J. Wysocki wrote:
> On Mon, May 4, 2026 at 11:17 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> ...
> > If pci_power_up() is doing D3cold -> D0, main power is initially
> > off, so platform_pci_set_power_state(PCI_D0) would turn on main
> > power, which is a Fundamental Reset leaving the device in
> > D0uninitialized. In general the device needs at least 100 ms
> > after that reset before any access, e.g., before the PMCSR read.
>
> Yes, it does in general, but that should be covered by
> platform_pci_set_power_state() because on some platforms the delay
> can be shorter.
It's surprising to me that platform_pci_set_power_state() takes
care of that delay because there's so much PCIe infrastructure
(dependencies on link speed, RRS polling, Immediate Readiness,
Readiness Notifications, etc), much of which sounds more like
OS-level support than firmware support.
But if platform_pci_set_power_state() does guarantee that the device
is Configuration-Ready, we should document that somewhere.
> > If pci_power_up() is doing D3hot -> D0, the device already has
> > main power, so I suppose platform_pci_set_power_state(PCI_D0)
> > doesn't do anything.
>
> It may or may not. Some platforms supply AML to run during D3hot ->
> D0 transitions (and the ACPI spec allows that).
If the device started in D3hot, it never lost main power, and I
assumed platform_pci_set_power_state(PCI_D0) might just leave it in
D3hot. It sounds like it takes it all the way to D0, although the
subsequent code that checks the PMCSR state does suggest that the
device might *not* be in D0:
platform_pci_set_power_state(dev, PCI_D0);
pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
state = pmcsr & PCI_PM_CTRL_STATE_MASK;
if (state == PCI_D0)
goto end;
pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, 0);
if (state == PCI_D3hot)
pci_dev_d3_sleep(dev);
...
end:
If platform_pci_set_power_state(PCI_D0) puts the device in D0 and
takes care of all the delays, "state" should always be PCI_D0, and we
shouldn't need the D3hot and D2 delays here.
Bjorn
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] PCI: Drop unnecessary retries when restoring BARs
2026-05-08 0:17 ` Bjorn Helgaas
@ 2026-05-08 12:51 ` Rafael J. Wysocki
2026-05-08 21:43 ` Bjorn Helgaas
0 siblings, 1 reply; 10+ messages in thread
From: Rafael J. Wysocki @ 2026-05-08 12:51 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Rafael J. Wysocki, Lukas Wunner, sashiko-bot, linux-pci, sashiko,
Marco Nenciarini, Michal Winiarski, Ilpo Jarvinen, Eric Chanudet,
Jean Guyader, Alex Williamson, Sinan Kaya, Mario Limonciello,
Mika Westerberg
On Fri, May 8, 2026 at 2:17 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Tue, May 05, 2026 at 12:43:17PM +0200, Rafael J. Wysocki wrote:
> > On Mon, May 4, 2026 at 11:17 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > ...
>
> > > If pci_power_up() is doing D3cold -> D0, main power is initially
> > > off, so platform_pci_set_power_state(PCI_D0) would turn on main
> > > power, which is a Fundamental Reset leaving the device in
> > > D0uninitialized. In general the device needs at least 100 ms
> > > after that reset before any access, e.g., before the PMCSR read.
> >
> > Yes, it does in general, but that should be covered by
> > platform_pci_set_power_state() because on some platforms the delay
> > can be shorter.
>
> It's surprising to me that platform_pci_set_power_state() takes
> care of that delay because there's so much PCIe infrastructure
> (dependencies on link speed, RRS polling, Immediate Readiness,
> Readiness Notifications, etc), much of which sounds more like
> OS-level support than firmware support.
It is kind of a mixed bag because it involves platform-specific
operations (the PCIe spec doesn't define a standard method of
transitioning devices in D3cold into D0).
The consensus in the industry appears to be that if AML is used for
carrying out D3cold->D0 transitions, it is expected to take the
requisite delays into account.
> But if platform_pci_set_power_state() does guarantee that the device
> is Configuration-Ready, we should document that somewhere.
It is more along the lines of what I said before: After
platform_pci_set_power_state() the device is either
Configuration-Ready, or inaccessible (in which case it can be assumed
to have dropped off the bus, but this is all platform-specific, so on
some platforms there may be ways to revive such devices).
That said, I'm all for documenting it.
> > > If pci_power_up() is doing D3hot -> D0, the device already has
> > > main power, so I suppose platform_pci_set_power_state(PCI_D0)
> > > doesn't do anything.
> >
> > It may or may not. Some platforms supply AML to run during D3hot ->
> > D0 transitions (and the ACPI spec allows that).
>
> If the device started in D3hot, it never lost main power, and I
> assumed platform_pci_set_power_state(PCI_D0) might just leave it in
> D3hot. It sounds like it takes it all the way to D0, although the
> subsequent code that checks the PMCSR state does suggest that the
> device might *not* be in D0:
On platforms using ACPI, platform_pci_set_power_state() carries out
the AML part of the transition which may do nothing for devices in
D3hot (for instance, there may be an ACPI power resource that needs to
be turned "on" in order to restore power to the device, but if the
power resource is "on" already, say because it is shared with another
device that is in D0 ATM, its state may not change).
platform_pci_set_power_state() may also be a no-op for the given
device (for example, if this is an add-on device without any
associated AML).
The rule of thumb is that if there is a programmatic way to remove
power from a PCIe device (so it can go into D3cold), it is generally
platform-specific and there should be a complementary programmatic way
to do the reverse (also generally platform-specific). Otherwise, the
device can go only as deep as D3hot except when the system as a whole
is powered down or it is on a Thunderbolt link that can be
disconnected at any time etc.
> platform_pci_set_power_state(dev, PCI_D0);
>
> pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> state = pmcsr & PCI_PM_CTRL_STATE_MASK;
> if (state == PCI_D0)
> goto end;
>
> pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, 0);
> if (state == PCI_D3hot)
> pci_dev_d3_sleep(dev);
> ...
>
> end:
>
> If platform_pci_set_power_state(PCI_D0) puts the device in D0 and
> takes care of all the delays, "state" should always be PCI_D0, and we
> shouldn't need the D3hot and D2 delays here.
Well, not quite, as per the above.
I guess the name is somewhat confusing, but I'm not sure what other
name would be better.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] PCI: Drop unnecessary retries when restoring BARs
2026-05-08 12:51 ` Rafael J. Wysocki
@ 2026-05-08 21:43 ` Bjorn Helgaas
0 siblings, 0 replies; 10+ messages in thread
From: Bjorn Helgaas @ 2026-05-08 21:43 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Lukas Wunner, sashiko-bot, linux-pci, sashiko, Marco Nenciarini,
Michal Winiarski, Ilpo Jarvinen, Eric Chanudet, Jean Guyader,
Alex Williamson, Sinan Kaya, Mario Limonciello, Mika Westerberg
On Fri, May 08, 2026 at 02:51:37PM +0200, Rafael J. Wysocki wrote:
> On Fri, May 8, 2026 at 2:17 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Tue, May 05, 2026 at 12:43:17PM +0200, Rafael J. Wysocki wrote:
> > > On Mon, May 4, 2026 at 11:17 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > ...
> >
> > > > If pci_power_up() is doing D3cold -> D0, main power is initially
> > > > off, so platform_pci_set_power_state(PCI_D0) would turn on main
> > > > power, which is a Fundamental Reset leaving the device in
> > > > D0uninitialized. In general the device needs at least 100 ms
> > > > after that reset before any access, e.g., before the PMCSR read.
> > >
> > > Yes, it does in general, but that should be covered by
> > > platform_pci_set_power_state() because on some platforms the delay
> > > can be shorter.
> >
> > It's surprising to me that platform_pci_set_power_state() takes
> > care of that delay because there's so much PCIe infrastructure
> > (dependencies on link speed, RRS polling, Immediate Readiness,
> > Readiness Notifications, etc), much of which sounds more like
> > OS-level support than firmware support.
>
> It is kind of a mixed bag because it involves platform-specific
> operations (the PCIe spec doesn't define a standard method of
> transitioning devices in D3cold into D0).
>
> The consensus in the industry appears to be that if AML is used for
> carrying out D3cold->D0 transitions, it is expected to take the
> requisite delays into account.
>
> > But if platform_pci_set_power_state() does guarantee that the device
> > is Configuration-Ready, we should document that somewhere.
>
> It is more along the lines of what I said before: After
> platform_pci_set_power_state() the device is either
> Configuration-Ready, or inaccessible (in which case it can be assumed
> to have dropped off the bus, but this is all platform-specific, so on
> some platforms there may be ways to revive such devices).
If the device is Configuration-Ready or inaccessible after
platform_pci_set_power_state(PCI_D0), then I agree we shouldn't add
that delay in pci_power_up().
> That said, I'm all for documenting it.
Strawman below. "Inaccessible upon return" is bound to lead to
questions, but if the PCI core can't do anything I don't know what
else to say.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ed1b1d7a7b46..2b0513c35387 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1068,6 +1068,12 @@ static inline bool platform_pci_power_manageable(struct pci_dev *dev)
return acpi_pci_power_manageable(dev);
}
+/*
+ * When setting power state to D0, platform_pci_set_power_state() ensures
+ * main power is on and that any required delays after a transition to D0
+ * have been completed. The device is either Configuration-Ready or
+ * inaccessible upon return.
+ */
static inline int platform_pci_set_power_state(struct pci_dev *dev,
pci_power_t t)
{
> > > > If pci_power_up() is doing D3hot -> D0, the device already has
> > > > main power, so I suppose platform_pci_set_power_state(PCI_D0)
> > > > doesn't do anything.
> > >
> > > It may or may not. Some platforms supply AML to run during D3hot ->
> > > D0 transitions (and the ACPI spec allows that).
> >
> > If the device started in D3hot, it never lost main power, and I
> > assumed platform_pci_set_power_state(PCI_D0) might just leave it in
> > D3hot. It sounds like it takes it all the way to D0, although the
> > subsequent code that checks the PMCSR state does suggest that the
> > device might *not* be in D0:
>
> On platforms using ACPI, platform_pci_set_power_state() carries out
> the AML part of the transition which may do nothing for devices in
> D3hot (for instance, there may be an ACPI power resource that needs to
> be turned "on" in order to restore power to the device, but if the
> power resource is "on" already, say because it is shared with another
> device that is in D0 ATM, its state may not change).
>
> platform_pci_set_power_state() may also be a no-op for the given
> device (for example, if this is an add-on device without any
> associated AML).
>
> The rule of thumb is that if there is a programmatic way to remove
> power from a PCIe device (so it can go into D3cold), it is generally
> platform-specific and there should be a complementary programmatic way
> to do the reverse (also generally platform-specific). Otherwise, the
> device can go only as deep as D3hot except when the system as a whole
> is powered down or it is on a Thunderbolt link that can be
> disconnected at any time etc.
>
> > platform_pci_set_power_state(dev, PCI_D0);
> >
> > pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > state = pmcsr & PCI_PM_CTRL_STATE_MASK;
> > if (state == PCI_D0)
> > goto end;
> >
> > pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, 0);
> > if (state == PCI_D3hot)
> > pci_dev_d3_sleep(dev);
> > ...
> >
> > end:
> >
> > If platform_pci_set_power_state(PCI_D0) puts the device in D0 and
> > takes care of all the delays, "state" should always be PCI_D0, and we
> > shouldn't need the D3hot and D2 delays here.
>
> Well, not quite, as per the above.
OK, so after platform_pci_set_power_state(PCI_D0), the device may be
in D3hot in some cases. Then there's a D3hot -> D0 transition, and we
do pci_dev_d3_sleep(), but not pci_dev_wait(). Sec 2.3.1 says RRS is
permitted after a D3hot to D0uninitialized, and pci_dev_wait() is what
waits and retries for the RRS case.
It looks to me like we need pci_dev_wait() after that transition when
No_Soft_Reset == 0, i.e., something like the below. Do you think this
is unnecessary?
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8f7cfcc00090..ed1b1d7a7b46 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1341,10 +1341,14 @@ int pci_power_up(struct pci_dev *dev)
pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, 0);
/* Mandatory transition delays; see PCI PM 1.2. */
- if (state == PCI_D3hot)
+ if (state == PCI_D3hot) {
pci_dev_d3_sleep(dev);
- else if (state == PCI_D2)
+ if (!(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET))
+ pci_dev_wait(dev, "power up D3hot->D0",
+ PCIE_RESET_READY_POLL_MS);
+ } else if (state == PCI_D2) {
udelay(PCI_PM_D2_DELAY);
+ }
end:
dev->current_state = PCI_D0;
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-05-08 21:43 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-03 13:34 [PATCH] PCI: Drop unnecessary retries when restoring BARs Lukas Wunner
2026-05-03 13:51 ` sashiko-bot
2026-05-04 7:49 ` Lukas Wunner
2026-05-04 17:09 ` Bjorn Helgaas
2026-05-04 19:31 ` Rafael J. Wysocki
2026-05-04 21:17 ` Bjorn Helgaas
2026-05-05 10:43 ` Rafael J. Wysocki
2026-05-08 0:17 ` Bjorn Helgaas
2026-05-08 12:51 ` Rafael J. Wysocki
2026-05-08 21:43 ` Bjorn Helgaas
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox