From: Raag Jadav <raag.jadav@intel.com>
To: Mario Limonciello <superm1@kernel.org>
Cc: Denis Benato <benato.denis96@gmail.com>,
rafael@kernel.org, mahesh@linux.ibm.com, oohall@gmail.com,
bhelgaas@google.com, linux-pci@vger.kernel.org,
linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
ilpo.jarvinen@linux.intel.com, lukas@wunner.de,
aravind.iddamsetty@linux.intel.com
Subject: Re: [PATCH v4] PCI: Prevent power state transition of erroneous device
Date: Tue, 20 May 2025 18:47:02 +0300
Message-ID: <aCyj9nbnIRet93O-@black.fi.intel.com>
In-Reply-To: <a8c83435-4c91-495c-950c-4d12b955c54c@kernel.org>
On Tue, May 20, 2025 at 10:23:57AM -0500, Mario Limonciello wrote:
> On 5/20/2025 4:48 AM, Raag Jadav wrote:
> > On Mon, May 19, 2025 at 11:42:31PM +0200, Denis Benato wrote:
> > > On 5/19/25 12:41, Raag Jadav wrote:
> > > > On Mon, May 19, 2025 at 03:58:08PM +0530, Raag Jadav wrote:
> > > > > If error status is set on an AER capable device, most likely either the
> > > > > device recovery is in progress or has already failed. Neither of the
> > > > > cases are well suited for power state transition of the device, since
> > > > > this can lead to unpredictable consequences like resume failure, or in
> > > > > worst case the device is lost because of it. Leave the device in its
> > > > > existing power state to avoid such issues.
> > > > >
> > > > > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > > > > ---
> > > > >
> > > > > v2: Synchronize AER handling with PCI PM (Rafael)
> > > > > v3: Move pci_aer_in_progress() to pci_set_low_power_state() (Rafael)
> > > > > Elaborate "why" (Bjorn)
> > > > > v4: Rely on error status instead of device status
> > > > > Condense comment (Lukas)
> > > > Since pci_aer_in_progress() is changed I've not included Rafael's tag with
> > > > my understanding of this needing a revisit. If this was a mistake, please
> > > > let me know.
> > > >
> > > > Denis, Mario, does this fix your issue?
> > > >
> > > Hello,
> > >
> > > Unfortunately no. I have prepared a dmesg log but had to remove the boot-up portion because it was a few kB too long: https://pastebin.com/1uBEA1FL
> >
> > Thanks for the test. It seems there's no hotplug event this time around
> > and endpoint device is still intact without any PCI related failure.
> >
> > Also,
> >
> > amdgpu 0000:09:00.0: PCI PM: Suspend power state: D3hot
> >
> > Which means whatever you're facing is either not related to this patch,
> > or at best exposed some nasty side-effect that's not handled correctly
> > by the driver.
> >
> > I'd say amdgpu folks would be of better help for your case.
> >
> > Raag
>
> So according to the logs Denis shared with v4
> (https://pastebin.com/1uBEA1FL) the GPU should have been going to BOCO. This
> stands for "Bus off Chip Off"
>
> amdgpu 0000:09:00.0: amdgpu: Using BOCO for runtime pm
>
> If it's going to D3hot - that's not going to be BOCO, it should be going to
> D3cold.

Yes, because upstream port is in D0 for some reason (might be this patch
but not sure) and so will be the root port.

pcieport 0000:07:00.0: PCI PM: Suspend power state: D0
pcieport 0000:07:00.0: PCI PM: Skipped

and my best guess is the driver is not able to cope with the lack of D3cold.

Raag

> Denis, can you redo your logs without Raag's patch and set
> CONFIG_PCI_DEBUG to compare? The 6.14.6 log you shared already
> (https://pastebin.com/kLZtibcD) also chooses BOCO, but I suspect it picks
> D3cold like it should.
>
> >
> > > > > More discussion on [1].
> > > > > [1] https://lore.kernel.org/all/CAJZ5v0g-aJXfVH+Uc=9eRPuW08t-6PwzdyMXsC6FZRKYJtY03Q@mail.gmail.com/
> > > > >
> > > > > drivers/pci/pci.c | 9 +++++++++
> > > > > drivers/pci/pcie/aer.c | 13 +++++++++++++
> > > > > include/linux/aer.h | 2 ++
> > > > > 3 files changed, 24 insertions(+)
> > > > >
> > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > > > index 4d7c9f64ea24..a20018692933 100644
> > > > > --- a/drivers/pci/pci.c
> > > > > +++ b/drivers/pci/pci.c
> > > > > @@ -9,6 +9,7 @@
> > > > > */
> > > > > #include <linux/acpi.h>
> > > > > +#include <linux/aer.h>
> > > > > #include <linux/kernel.h>
> > > > > #include <linux/delay.h>
> > > > > #include <linux/dmi.h>
> > > > > @@ -1539,6 +1540,14 @@ static int pci_set_low_power_state(struct pci_dev *dev, pci_power_t state, bool
> > > > > || (state == PCI_D2 && !dev->d2_support))
> > > > > return -EIO;
> > > > > + /*
> > > > > + * If error status is set on an AER capable device, it is not well
> > > > > + * suited for power state transition. Leave it in its existing power
> > > > > + * state to avoid issues like unpredictable resume failure.
> > > > > + */
> > > > > + if (pci_aer_in_progress(dev))
> > > > > + return -EIO;
> > > > > +
> > > > > pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > > > > if (PCI_POSSIBLE_ERROR(pmcsr)) {
> > > > > pci_err(dev, "Unable to change power state from %s to %s, device inaccessible\n",
> > > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > > > > index a1cf8c7ef628..617fbac0d38a 100644
> > > > > --- a/drivers/pci/pcie/aer.c
> > > > > +++ b/drivers/pci/pcie/aer.c
> > > > > @@ -237,6 +237,19 @@ int pcie_aer_is_native(struct pci_dev *dev)
> > > > > }
> > > > > EXPORT_SYMBOL_NS_GPL(pcie_aer_is_native, "CXL");
> > > > > > +bool pci_aer_in_progress(struct pci_dev *dev)
> > > > > > +{
> > > > > > +	int aer = dev->aer_cap;
> > > > > > +	u32 cor, uncor;
> > > > > > +
> > > > > > +	if (!pcie_aer_is_native(dev))
> > > > > > +		return false;
> > > > > > +
> > > > > > +	pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS, &cor);
> > > > > > +	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, &uncor);
> > > > > > +	return cor || uncor;
> > > > > > +}
> > > > > > +
> > > > > >  static int pci_enable_pcie_error_reporting(struct pci_dev *dev)
> > > > > >  {
> > > > > >  	int rc;
> > > > > diff --git a/include/linux/aer.h b/include/linux/aer.h
> > > > > index 02940be66324..e6a380bb2e68 100644
> > > > > --- a/include/linux/aer.h
> > > > > +++ b/include/linux/aer.h
> > > > > @@ -56,12 +56,14 @@ struct aer_capability_regs {
> > > > > #if defined(CONFIG_PCIEAER)
> > > > > int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
> > > > > int pcie_aer_is_native(struct pci_dev *dev);
> > > > > +bool pci_aer_in_progress(struct pci_dev *dev);
> > > > > #else
> > > > > >  static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
> > > > > >  {
> > > > > >  	return -EINVAL;
> > > > > >  }
> > > > > static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
> > > > > +static inline bool pci_aer_in_progress(struct pci_dev *dev) { return false; }
> > > > > #endif
> > > > > void pci_print_aer(struct pci_dev *dev, int aer_severity,
> > > > > --
> > > > > 2.34.1
> > > > >
>
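
For readers following the quoted v4 patch, the gating logic can be modelled outside the kernel. The following is a minimal userspace sketch, not kernel code: `struct fake_dev`, `fake_aer_in_progress()` and `fake_set_low_power_state()` are hypothetical stand-ins, and the two status fields are assumed to hold the values that the patch reads from PCI_ERR_COR_STATUS and PCI_ERR_UNCOR_STATUS.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pci_dev; only what the sketch needs. */
struct fake_dev {
	bool aer_native;       /* models pcie_aer_is_native() returning nonzero */
	uint32_t cor_status;   /* models the PCI_ERR_COR_STATUS register value */
	uint32_t uncor_status; /* models the PCI_ERR_UNCOR_STATUS register value */
};

/* Mirrors the check in the quoted patch: any set status bit counts. */
static bool fake_aer_in_progress(const struct fake_dev *dev)
{
	if (!dev->aer_native)
		return false;

	return dev->cor_status || dev->uncor_status;
}

/* Models only the early return added to pci_set_low_power_state(). */
static int fake_set_low_power_state(const struct fake_dev *dev)
{
	if (fake_aer_in_progress(dev))
		return -EIO; /* leave the device in its existing power state */

	return 0; /* the real function would go on to program PMCSR here */
}
```

Note that in this model, as in the quoted patch, a correctable-only status is enough to refuse the transition, since the check is `cor || uncor`. A device for which AER is not natively handled is never blocked by this check.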