[PATCH 0/4] PCI: Introduce pci_dev_suspend_retention

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API
@ 2026-04-14 15:59 Manivannan Sadhasivam via B4 Relay
  2026-04-14 15:59 ` [PATCH 1/4] PCI: Introduce an API to check if RC/platform can retain device context during suspend Manivannan Sadhasivam via B4 Relay
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Manivannan Sadhasivam via B4 Relay @ 2026-04-14 15:59 UTC (permalink / raw)
  To: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg
  Cc: linux-pci, linux-kernel, linux-arm-msm, linux-nvme,
	Manivannan Sadhasivam

Hi all,

This series introduces a new PCI API pci_dev_suspend_retention_supported() to
let the client drivers know whether they can expect context retention across
suspend/resume or not and uses it in the NVMe PCI host driver.

This new API is targeted to abstract the PCI power management details away from
the client drivers. This is needed because client drivers like NVMe make use of
APIs such as pm_suspend_via_firmware() and decide to keep the device in low
power mode if this API returns 'false'. But some platforms may have other
limitations like in the case of Qcom, where if the RC driver removes the
resource vote to allow the SoC to enter low power mode, it cannot reliably exit
the L1ss state when the endpoint asserts CLKREQ#. So in this case also, the
client drivers cannot keep the device in low power state during suspend and
expect context retention.

And these limitations may just keep adding in the future. Without a unified
API, the client drivers have to implement their own logic which may cause code
duplication and may also lead to drivers missing some of the platform
limitations.

Once this series gets merged, we can extend this API usage to other client
drivers as well.

Testing
=======

This series is tested on Qualcomm Hamoa based Lenovo Thinkpad T14s latop with
NVMe drive.

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
---
Manivannan Sadhasivam (4):
      PCI: Introduce an API to check if RC/platform can retain device context during suspend
      PCI: Indicate context lost if L1ss exit is broken during resume from system suspend
      PCI: qcom: Indicate broken L1ss exit during resume from system suspend
      nvme-pci: Use pci_dev_suspend_retention_supported() API during suspend

 drivers/nvme/host/pci.c                |  3 ++-
 drivers/pci/controller/dwc/pcie-qcom.c | 11 +++++++++++
 drivers/pci/pci.c                      | 34 ++++++++++++++++++++++++++++++++++
 include/linux/pci.h                    |  9 +++++++++
 4 files changed, 56 insertions(+), 1 deletion(-)
---
base-commit: 591cd656a1bf5ea94a222af5ef2ee76df029c1d2
change-id: 20260414-l1ss-fix-6c9cf2451944

Best regards,
--  
Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/4] PCI: Introduce an API to check if RC/platform can retain device context during suspend
  2026-04-14 15:59 [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API Manivannan Sadhasivam via B4 Relay
@ 2026-04-14 15:59 ` Manivannan Sadhasivam via B4 Relay
  2026-04-16 19:18   ` Bjorn Helgaas
  2026-04-14 15:59 ` [PATCH 2/4] PCI: Indicate context lost if L1ss exit is broken during resume from system suspend Manivannan Sadhasivam via B4 Relay
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Manivannan Sadhasivam via B4 Relay @ 2026-04-14 15:59 UTC (permalink / raw)
  To: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg
  Cc: linux-pci, linux-kernel, linux-arm-msm, linux-nvme,
	Manivannan Sadhasivam

From: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>

Currently, the PCI endpoint drivers like NVMe checks whether the device
context will be retained or not during system suspend, with the help of
pm_suspend_via_firmware() API.

But it is possible that the device context might be lost due to some
platform limitation as well. Having those checks in the endpoint drivers
will not scale and will cause a lot of code duplication.

So introduce an API that acts as a sole point of truth that the endpoint
drivers can rely on to check whether they can expect the device context
to be retained or not.

If the API returns 'false', then the client drivers need to prepare for
context loss by performing actions such as resetting the device, saving
the context, shutting it down etc... If it returns 'true', then the drivers
do not need to perform any special action and can leave the device in
active state.

Right now, this API only incorporates the pm_suspend_via_firmware() check.
But will be extended in the future commits.

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
---
 drivers/pci/pci.c   | 23 +++++++++++++++++++++++
 include/linux/pci.h |  7 +++++++
 2 files changed, 30 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8479c2e1f74f..211616467a77 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -33,6 +33,7 @@
 #include <asm/dma.h>
 #include <linux/aer.h>
 #include <linux/bitfield.h>
+#include <linux/suspend.h>
 #include "pci.h"
 
 DEFINE_MUTEX(pci_slot_mutex);
@@ -2900,6 +2901,28 @@ void pci_config_pm_runtime_put(struct pci_dev *pdev)
 		pm_runtime_put_sync(parent);
 }
 
+/**
+ * pci_dev_suspend_retention_supported - Check if the platform can retain the device
+ *					 context during system suspend
+ * @pdev: PCI device to check
+ *
+ * Returns true if the platform can guarantee to retain the device context,
+ * false otherwise.
+ */
+bool pci_dev_suspend_retention_supported(struct pci_dev *pdev)
+{
+	/*
+	 * If the platform firmware (like ACPI) is involved at the end of system
+	 * suspend, device context may not be retained.
+	 */
+	if (pm_suspend_via_firmware())
+		return false;
+
+	/* Assume that the context is retained by default */
+	return true;
+}
+EXPORT_SYMBOL_GPL(pci_dev_suspend_retention_supported);
+
 static const struct dmi_system_id bridge_d3_blacklist[] = {
 #ifdef CONFIG_X86
 	{
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 1c270f1d5123..d9bc7ad4eaa4 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2080,6 +2080,8 @@ pci_release_mem_regions(struct pci_dev *pdev)
 			    pci_select_bars(pdev, IORESOURCE_MEM));
 }
 
+bool pci_dev_suspend_retention_supported(struct pci_dev *pdev);
+
 #else /* CONFIG_PCI is not enabled */
 
 static inline void pci_set_flags(int flags) { }
@@ -2239,6 +2241,11 @@ pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
 static inline void pci_free_irq_vectors(struct pci_dev *dev)
 {
 }
+
+static inline bool pci_dev_suspend_retention_supported(struct pci_dev *pdev)
+{
+	return true;
+}
 #endif /* CONFIG_PCI */
 
 /* Include architecture-dependent settings and functions */

-- 
2.51.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/4] PCI: Indicate context lost if L1ss exit is broken during resume from system suspend
  2026-04-14 15:59 [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API Manivannan Sadhasivam via B4 Relay
  2026-04-14 15:59 ` [PATCH 1/4] PCI: Introduce an API to check if RC/platform can retain device context during suspend Manivannan Sadhasivam via B4 Relay
@ 2026-04-14 15:59 ` Manivannan Sadhasivam via B4 Relay
  2026-04-14 15:59 ` [PATCH 3/4] PCI: qcom: Indicate broken L1ss exit " Manivannan Sadhasivam via B4 Relay
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Manivannan Sadhasivam via B4 Relay @ 2026-04-14 15:59 UTC (permalink / raw)
  To: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg
  Cc: linux-pci, linux-kernel, linux-arm-msm, linux-nvme,
	Manivannan Sadhasivam

From: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>

The PCIe spec v7.0, sec 5.5.3.3.1, states that for exiting L1.2 due to an
endpoint asserting CLKREQ# signal, the refclk must be turned on no earlier
than TL10_REFCLK_ON, and within the latency advertised in the LTR message.
This same behavior applies to L1.1 as well.

On some platforms like Qcom, these requirements are satisfied during OS
runtime, but not while resuming from the system suspend. This happens
because the PCIe RC driver may remove all resource votes and turns off the
PHY during suspend to maximize power savings while keeping the link in
L1ss.

Consequently, when the endpoint asserts CLKREQ# to wake up, the OS must
first resume and the RC driver must restore the PHY and enable the refclk.
This recovery process exceeds the L1ss exit latency time. And this latency
violation results in an L1ss exit timeout, followed by Link Down (LDn). If
the endpoint device is used to host the RootFS, it will result in an OS
crash. For other endpoints, it may result in a complete device
reset/recovery.

So to indicate this platform limitation to the client drivers, introduce a
new flag 'pci_host_bridge::broken_l1ss_resume' and check it in the
pci_dev_suspend_retention_supported() API. If the flag is set by the RC
driver, the API will return 'false' indicating the client drivers that the
device context may not be retained and the drivers must be prepared for
context loss.

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
---
 drivers/pci/pci.c   | 11 +++++++++++
 include/linux/pci.h |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 211616467a77..e871cccf24ae 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2911,6 +2911,8 @@ void pci_config_pm_runtime_put(struct pci_dev *pdev)
  */
 bool pci_dev_suspend_retention_supported(struct pci_dev *pdev)
 {
+	struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
+
 	/*
 	 * If the platform firmware (like ACPI) is involved at the end of system
 	 * suspend, device context may not be retained.
@@ -2918,6 +2920,15 @@ bool pci_dev_suspend_retention_supported(struct pci_dev *pdev)
 	if (pm_suspend_via_firmware())
 		return false;

+	/*
+	 * Some host bridges power off the PHY to enter deep low-power modes
+	 * during system suspend. Exiting L1 PM Substates from this condition
+	 * violates strict timing requirements and results in Link Down (LDn).
+	 * On such platforms, the endpoint must be prepared for context loss.
+	 */
+	if (bridge && bridge->broken_l1ss_resume)
+		return false;
+
 	/* Assume that the context is retained by default */
 	return true;
 }
diff --git a/include/linux/pci.h b/include/linux/pci.h
index d9bc7ad4eaa4..860d8a774b51 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -658,6 +658,8 @@ struct pci_host_bridge {
 	unsigned int	preserve_config:1;	/* Preserve FW resource setup */
 	unsigned int	size_windows:1;		/* Enable root bus sizing */
 	unsigned int	msi_domain:1;		/* Bridge wants MSI domain */
+	unsigned int	broken_l1ss_resume:1;	/* Resuming from L1ss during
+						   system suspend is broken */

 	/* Resource alignment requirements */
 	resource_size_t (*align_resource)(struct pci_dev *dev,

-- 
2.51.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/4] PCI: qcom: Indicate broken L1ss exit during resume from system suspend
  2026-04-14 15:59 [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API Manivannan Sadhasivam via B4 Relay
  2026-04-14 15:59 ` [PATCH 1/4] PCI: Introduce an API to check if RC/platform can retain device context during suspend Manivannan Sadhasivam via B4 Relay
  2026-04-14 15:59 ` [PATCH 2/4] PCI: Indicate context lost if L1ss exit is broken during resume from system suspend Manivannan Sadhasivam via B4 Relay
@ 2026-04-14 15:59 ` Manivannan Sadhasivam via B4 Relay
  2026-04-16 19:20   ` Bjorn Helgaas
  2026-04-14 15:59 ` [PATCH 4/4] nvme-pci: Use pci_dev_suspend_retention_supported() API during suspend Manivannan Sadhasivam via B4 Relay
  2026-04-16 19:11 ` [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API Bjorn Helgaas
  4 siblings, 1 reply; 15+ messages in thread
From: Manivannan Sadhasivam via B4 Relay @ 2026-04-14 15:59 UTC (permalink / raw)
  To: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg
  Cc: linux-pci, linux-kernel, linux-arm-msm, linux-nvme,
	Manivannan Sadhasivam

From: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>

Qcom PCIe RCs can successfully exit from L1ss during OS runtime. However,
during system suspend, the Qcom PCIe RC driver may remove all resource
votes and turns off the PHY to maximize power savings.

Consequently, when the host is in system suspend with the link in L1ss and
the endpoint asserts CLKREQ#, the OS must first wake up and the RC driver
must restore the PHY and enable the refclk. This recovery process causes
the strict L1ss exit latency time to be exceeded. (If the RC driver were to
retain all votes during suspend, L1ss exit would succeed without issue, but
at the expense of higher power consumption).

This latency violation leads to an L1ss exit timeout, followed by a Link
Down (LDn) condition during resume. This LDn can crash the OS if the
endpoint hosts the RootFS, and for other types of devices, it may result in
a full device reset/recovery.

So to ensure that the client drivers can properly handle this scenario, let
them know about this platform limitation by setting the
'pci_host_bridge::broken_l1ss_resume' flag.

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
---
 drivers/pci/controller/dwc/pcie-qcom.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/pci/controller/dwc/pcie-qcom.c b/drivers/pci/controller/dwc/pcie-qcom.c
index 67a16af69ddc..01afffd384f2 100644
--- a/drivers/pci/controller/dwc/pcie-qcom.c
+++ b/drivers/pci/controller/dwc/pcie-qcom.c
@@ -1363,6 +1363,17 @@ static void qcom_pcie_host_post_init(struct dw_pcie_rp *pp)
 	struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
 	struct qcom_pcie *pcie = to_qcom_pcie(pci);
 
+	/*
+	 * During system suspend, the Qcom RC driver may turn off the PHY and
+	 * remove votes to save power. If the endpoint asserts CLKREQ# to
+	 * exit L1ss, the time required to wake the system and restore the
+	 * PHY/refclk exceeds the strict L1ss exit timing, resulting in Link
+	 * Down (LDn). Set this flag to indicate this limitation to client
+	 * drivers so that they will avoid relying on L1ss during system
+	 * suspend.
+	 */
+	pp->bridge->broken_l1ss_resume = true;
+
 	if (pcie->cfg->ops->host_post_init)
 		pcie->cfg->ops->host_post_init(pcie);
 }

-- 
2.51.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/4] nvme-pci: Use pci_dev_suspend_retention_supported() API during suspend
  2026-04-14 15:59 [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API Manivannan Sadhasivam via B4 Relay
                   ` (2 preceding siblings ...)
  2026-04-14 15:59 ` [PATCH 3/4] PCI: qcom: Indicate broken L1ss exit " Manivannan Sadhasivam via B4 Relay
@ 2026-04-14 15:59 ` Manivannan Sadhasivam via B4 Relay
  2026-04-16 19:11 ` [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API Bjorn Helgaas
  4 siblings, 0 replies; 15+ messages in thread
From: Manivannan Sadhasivam via B4 Relay @ 2026-04-14 15:59 UTC (permalink / raw)
  To: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg
  Cc: linux-pci, linux-kernel, linux-arm-msm, linux-nvme,
	Manivannan Sadhasivam

From: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>

The pci_dev_suspend_retention_supported() API lets PCI client drivers
know if the platform can retain the device context during suspend. This
is decided based on several factors like:

1. Firmware involvement at the end of suspend
2. Any platform limitation in waking from low power state (L1ss)

And this API might also get extended in the future to cover other platform
specific issues impacting the device low power mode during system suspend.

So use this API instead of checks like pm_suspend_via_firmware(). When this
API returns false, then assume that the platform cannot retain the context
and shutdown the controller. If it returns true, then assume that the
context will be retained and keep the device in low power mode.

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
---
 drivers/nvme/host/pci.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b78ba239c8ea..19010330469f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3915,6 +3915,7 @@ static int nvme_suspend(struct device *dev)
 	 * use host managed nvme power settings for lowest idle power if
 	 * possible. This should have quicker resume latency than a full device
 	 * shutdown.  But if the firmware is involved after the suspend or the
+	 * platform has any limitation in waking from low power states or the
 	 * device does not support any non-default power states, shut down the
 	 * device fully.
 	 *
@@ -3923,7 +3924,7 @@ static int nvme_suspend(struct device *dev)
 	 * down, so as to allow the platform to achieve its minimum low-power
 	 * state (which may not be possible if the link is up).
 	 */
-	if (pm_suspend_via_firmware() || !ctrl->npss ||
+	if (!pci_dev_suspend_retention_supported(pdev) || !ctrl->npss ||
 	    !pcie_aspm_enabled(pdev) ||
 	    (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND))
 		return nvme_disable_prepare_reset(ndev, true);

-- 
2.51.0



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API
  2026-04-14 15:59 [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API Manivannan Sadhasivam via B4 Relay
                   ` (3 preceding siblings ...)
  2026-04-14 15:59 ` [PATCH 4/4] nvme-pci: Use pci_dev_suspend_retention_supported() API during suspend Manivannan Sadhasivam via B4 Relay
@ 2026-04-16 19:11 ` Bjorn Helgaas
  2026-04-17 11:04   ` Manivannan Sadhasivam
  4 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2026-04-16 19:11 UTC (permalink / raw)
  To: manivannan.sadhasivam
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Rafael J. Wysocki, linux-pci,
	linux-kernel, linux-arm-msm, linux-nvme

[+cc Rafael]

On Tue, Apr 14, 2026 at 09:29:38PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> This series introduces a new PCI API
> pci_dev_suspend_retention_supported() to let the client drivers know
> whether they can expect context retention across suspend/resume or
> not and uses it in the NVMe PCI host driver.
> 
> This new API is targeted to abstract the PCI power management
> details away from the client drivers. This is needed because client
> drivers like NVMe make use of APIs such as pm_suspend_via_firmware()
> and decide to keep the device in low power mode if this API returns
> 'false'. But some platforms may have other limitations like in the
> case of Qcom, where if the RC driver removes the resource vote to
> allow the SoC to enter low power mode, it cannot reliably exit the
> L1ss state when the endpoint asserts CLKREQ#. So in this case also,
> the client drivers cannot keep the device in low power state during
> suspend and expect context retention.

I don't know what pm_suspend_via_firmware() means.  The kernel-doc
says "platform firmware is going to be invoked at the end of the
system-wide power management transition," but that doesn't say
anything about what firmware might do or what it means to drivers.

Based on d916b1be94b6 ("nvme-pci: use host managed power state for
suspend"), which used it in nvme_suspend(), I guess the assumption is
that pm_suspend_via_firmware() means the device might be put in D3cold
and lose all its internal state, and conversely,
!pm_suspend_via_firmware() means the device will *never* be put in a
low-power state that loses internal state.

> And these limitations may just keep adding in the future. Without a
> unified API, the client drivers have to implement their own logic
> which may cause code duplication and may also lead to drivers
> missing some of the platform limitations.
> 
> Once this series gets merged, we can extend this API usage to other
> client drivers as well.

> 
> Testing
> =======
> 
> This series is tested on Qualcomm Hamoa based Lenovo Thinkpad T14s latop with
> NVMe drive.
> 
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
> ---
> Manivannan Sadhasivam (4):
>       PCI: Introduce an API to check if RC/platform can retain device context during suspend
>       PCI: Indicate context lost if L1ss exit is broken during resume from system suspend
>       PCI: qcom: Indicate broken L1ss exit during resume from system suspend
>       nvme-pci: Use pci_dev_suspend_retention_supported() API during suspend
> 
>  drivers/nvme/host/pci.c                |  3 ++-
>  drivers/pci/controller/dwc/pcie-qcom.c | 11 +++++++++++
>  drivers/pci/pci.c                      | 34 ++++++++++++++++++++++++++++++++++
>  include/linux/pci.h                    |  9 +++++++++
>  4 files changed, 56 insertions(+), 1 deletion(-)
> ---
> base-commit: 591cd656a1bf5ea94a222af5ef2ee76df029c1d2
> change-id: 20260414-l1ss-fix-6c9cf2451944
> 
> Best regards,
> --  
> Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] PCI: Introduce an API to check if RC/platform can retain device context during suspend
  2026-04-14 15:59 ` [PATCH 1/4] PCI: Introduce an API to check if RC/platform can retain device context during suspend Manivannan Sadhasivam via B4 Relay
@ 2026-04-16 19:18   ` Bjorn Helgaas
  2026-04-17 11:11     ` Manivannan Sadhasivam
  0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2026-04-16 19:18 UTC (permalink / raw)
  To: manivannan.sadhasivam
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Rafael J. Wysocki, linux-pci,
	linux-kernel, linux-arm-msm, linux-nvme

[+cc Rafael]

On Tue, Apr 14, 2026 at 09:29:39PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> From: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
> 
> Currently, the PCI endpoint drivers like NVMe checks whether the device
> context will be retained or not during system suspend, with the help of
> pm_suspend_via_firmware() API.
> 
> But it is possible that the device context might be lost due to some
> platform limitation as well. Having those checks in the endpoint drivers
> will not scale and will cause a lot of code duplication.
> 
> So introduce an API that acts as a sole point of truth that the endpoint
> drivers can rely on to check whether they can expect the device context
> to be retained or not.
> 
> If the API returns 'false', then the client drivers need to prepare for
> context loss by performing actions such as resetting the device, saving
> the context, shutting it down etc... If it returns 'true', then the drivers
> do not need to perform any special action and can leave the device in
> active state.
> 
> Right now, this API only incorporates the pm_suspend_via_firmware() check.
> But will be extended in the future commits.
> 
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
> ---
>  drivers/pci/pci.c   | 23 +++++++++++++++++++++++
>  include/linux/pci.h |  7 +++++++
>  2 files changed, 30 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 8479c2e1f74f..211616467a77 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -33,6 +33,7 @@
>  #include <asm/dma.h>
>  #include <linux/aer.h>
>  #include <linux/bitfield.h>
> +#include <linux/suspend.h>
>  #include "pci.h"
>  
>  DEFINE_MUTEX(pci_slot_mutex);
> @@ -2900,6 +2901,28 @@ void pci_config_pm_runtime_put(struct pci_dev *pdev)
>  		pm_runtime_put_sync(parent);
>  }
>  
> +/**
> + * pci_dev_suspend_retention_supported - Check if the platform can retain the device
> + *					 context during system suspend
> + * @pdev: PCI device to check
> + *
> + * Returns true if the platform can guarantee to retain the device context,
> + * false otherwise.
> + */
> +bool pci_dev_suspend_retention_supported(struct pci_dev *pdev)

This doesn't seem like the right name.  This isn't a property of the
*device*; that's all determined by the PCI spec (devices must retain
all internal state in D0, D1, and D2, they retain it in D3hot if
No_Soft_Reset, and they never do in D3cold).

So this seems like something to do with the *platform* behavior.  It
sounds like this is basically a way to learn whether the device might
be put in D3cold on system suspend.

> +{
> +	/*
> +	 * If the platform firmware (like ACPI) is involved at the end of system
> +	 * suspend, device context may not be retained.
> +	 */
> +	if (pm_suspend_via_firmware())
> +		return false;
> +
> +	/* Assume that the context is retained by default */
> +	return true;
> +}
> +EXPORT_SYMBOL_GPL(pci_dev_suspend_retention_supported);
> +
>  static const struct dmi_system_id bridge_d3_blacklist[] = {
>  #ifdef CONFIG_X86
>  	{
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 1c270f1d5123..d9bc7ad4eaa4 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -2080,6 +2080,8 @@ pci_release_mem_regions(struct pci_dev *pdev)
>  			    pci_select_bars(pdev, IORESOURCE_MEM));
>  }
>  
> +bool pci_dev_suspend_retention_supported(struct pci_dev *pdev);
> +
>  #else /* CONFIG_PCI is not enabled */
>  
>  static inline void pci_set_flags(int flags) { }
> @@ -2239,6 +2241,11 @@ pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
>  static inline void pci_free_irq_vectors(struct pci_dev *dev)
>  {
>  }
> +
> +static inline bool pci_dev_suspend_retention_supported(struct pci_dev *pdev)
> +{
> +	return true;
> +}
>  #endif /* CONFIG_PCI */
>  
>  /* Include architecture-dependent settings and functions */
> 
> -- 
> 2.51.0
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/4] PCI: qcom: Indicate broken L1ss exit during resume from system suspend
  2026-04-14 15:59 ` [PATCH 3/4] PCI: qcom: Indicate broken L1ss exit " Manivannan Sadhasivam via B4 Relay
@ 2026-04-16 19:20   ` Bjorn Helgaas
  2026-04-17 12:06     ` Manivannan Sadhasivam
  0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2026-04-16 19:20 UTC (permalink / raw)
  To: manivannan.sadhasivam
  Cc: Bjorn Helgaas, Manivannan Sadhasivam, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Rafael J. Wysocki, linux-pci,
	linux-kernel, linux-arm-msm, linux-nvme

[+cc Rafael]

On Tue, Apr 14, 2026 at 09:29:41PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> From: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
> 
> Qcom PCIe RCs can successfully exit from L1ss during OS runtime. However,
> during system suspend, the Qcom PCIe RC driver may remove all resource
> votes and turns off the PHY to maximize power savings.
> 
> Consequently, when the host is in system suspend with the link in L1ss and
> the endpoint asserts CLKREQ#, the OS must first wake up and the RC driver
> must restore the PHY and enable the refclk. This recovery process causes
> the strict L1ss exit latency time to be exceeded. (If the RC driver were to
> retain all votes during suspend, L1ss exit would succeed without issue, but
> at the expense of higher power consumption).

I don't think the link can be in L1.x if the PHY is turned off, can
it?  I assume if the PHY is off, the link would be in L2 (if aux power
is available) or L3.

L2 and L3 both correspond to the downstream device being in D3cold
(PCIe r7.0, sec 5.3.2), so I assume this is a reset as far as the
device is concerned, and we need all the delays associated with reset
and the D3cold -> D0 transition.

> This latency violation leads to an L1ss exit timeout, followed by a Link
> Down (LDn) condition during resume. This LDn can crash the OS if the
> endpoint hosts the RootFS, and for other types of devices, it may result in
> a full device reset/recovery.

What does "L1SS exit timeout" mean in PCIe terms?  Is there some event
(Message, interrupt, etc) that is triggered by the timeout?

> So to ensure that the client drivers can properly handle this scenario, let
> them know about this platform limitation by setting the
> 'pci_host_bridge::broken_l1ss_resume' flag.

I don't see how this means L1SS is broken.  If the device is
effectively reset, of course we can't go from L1.x to L0 because we
didn't start from L1.x.

> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
> ---
>  drivers/pci/controller/dwc/pcie-qcom.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/pci/controller/dwc/pcie-qcom.c b/drivers/pci/controller/dwc/pcie-qcom.c
> index 67a16af69ddc..01afffd384f2 100644
> --- a/drivers/pci/controller/dwc/pcie-qcom.c
> +++ b/drivers/pci/controller/dwc/pcie-qcom.c
> @@ -1363,6 +1363,17 @@ static void qcom_pcie_host_post_init(struct dw_pcie_rp *pp)
>  	struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
>  	struct qcom_pcie *pcie = to_qcom_pcie(pci);
>  
> +	/*
> +	 * During system suspend, the Qcom RC driver may turn off the PHY and
> +	 * remove votes to save power. If the endpoint asserts CLKREQ# to
> +	 * exit L1ss, the time required to wake the system and restore the
> +	 * PHY/refclk exceeds the strict L1ss exit timing, resulting in Link
> +	 * Down (LDn). Set this flag to indicate this limitation to client
> +	 * drivers so that they will avoid relying on L1ss during system
> +	 * suspend.
> +	 */
> +	pp->bridge->broken_l1ss_resume = true;
> +
>  	if (pcie->cfg->ops->host_post_init)
>  		pcie->cfg->ops->host_post_init(pcie);
>  }
> 
> -- 
> 2.51.0
> 
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API
  2026-04-16 19:11 ` [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API Bjorn Helgaas
@ 2026-04-17 11:04   ` Manivannan Sadhasivam
  2026-04-17 22:29     ` Bjorn Helgaas
  0 siblings, 1 reply; 15+ messages in thread
From: Manivannan Sadhasivam @ 2026-04-17 11:04 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: manivannan.sadhasivam, Bjorn Helgaas, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Rafael J. Wysocki, linux-pci,
	linux-kernel, linux-arm-msm, linux-nvme

On Thu, Apr 16, 2026 at 02:11:11PM -0500, Bjorn Helgaas wrote:
> [+cc Rafael]
> 
> On Tue, Apr 14, 2026 at 09:29:38PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > This series introduces a new PCI API
> > pci_dev_suspend_retention_supported() to let the client drivers know
> > whether they can expect context retention across suspend/resume or
> > not and uses it in the NVMe PCI host driver.
> > 
> > This new API is targeted to abstract the PCI power management
> > details away from the client drivers. This is needed because client
> > drivers like NVMe make use of APIs such as pm_suspend_via_firmware()
> > and decide to keep the device in low power mode if this API returns
> > 'false'. But some platforms may have other limitations like in the
> > case of Qcom, where if the RC driver removes the resource vote to
> > allow the SoC to enter low power mode, it cannot reliably exit the
> > L1ss state when the endpoint asserts CLKREQ#. So in this case also,
> > the client drivers cannot keep the device in low power state during
> > suspend and expect context retention.
> 
> I don't know what pm_suspend_via_firmware() means.  The kernel-doc
> says "platform firmware is going to be invoked at the end of the
> system-wide power management transition," but that doesn't say
> anything about what firmware might do or what it means to drivers.
> 

It's hard to predict what the firmware might do after it gains control from the
OS. But as far as the API goes, it just expects the drivers to save the context
and reset the device so that the firmware could do anything it want.

> Based on d916b1be94b6 ("nvme-pci: use host managed power state for
> suspend"), which used it in nvme_suspend(), I guess the assumption is
> that pm_suspend_via_firmware() means the device might be put in D3cold
> and lose all its internal state, and conversely,
> !pm_suspend_via_firmware() means the device will *never* be put in a
> low-power state that loses internal state.
> 

Yes, that's the assumption. Though, the firmware might not do D3Cold at all,
but the drivers should be prepared for that to be compatible with all firmware
implementations.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] PCI: Introduce an API to check if RC/platform can retain device context during suspend
  2026-04-16 19:18   ` Bjorn Helgaas
@ 2026-04-17 11:11     ` Manivannan Sadhasivam
  0 siblings, 0 replies; 15+ messages in thread
From: Manivannan Sadhasivam @ 2026-04-17 11:11 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: manivannan.sadhasivam, Bjorn Helgaas, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Rafael J. Wysocki, linux-pci,
	linux-kernel, linux-arm-msm, linux-nvme

On Thu, Apr 16, 2026 at 02:18:55PM -0500, Bjorn Helgaas wrote:
> [+cc Rafael]
> 
> On Tue, Apr 14, 2026 at 09:29:39PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > From: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
> > 
> > Currently, the PCI endpoint drivers like NVMe checks whether the device
> > context will be retained or not during system suspend, with the help of
> > pm_suspend_via_firmware() API.
> > 
> > But it is possible that the device context might be lost due to some
> > platform limitation as well. Having those checks in the endpoint drivers
> > will not scale and will cause a lot of code duplication.
> > 
> > So introduce an API that acts as a sole point of truth that the endpoint
> > drivers can rely on to check whether they can expect the device context
> > to be retained or not.
> > 
> > If the API returns 'false', then the client drivers need to prepare for
> > context loss by performing actions such as resetting the device, saving
> > the context, shutting it down etc... If it returns 'true', then the drivers
> > do not need to perform any special action and can leave the device in
> > active state.
> > 
> > Right now, this API only incorporates the pm_suspend_via_firmware() check.
> > But will be extended in the future commits.
> > 
> > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
> > ---
> >  drivers/pci/pci.c   | 23 +++++++++++++++++++++++
> >  include/linux/pci.h |  7 +++++++
> >  2 files changed, 30 insertions(+)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 8479c2e1f74f..211616467a77 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -33,6 +33,7 @@
> >  #include <asm/dma.h>
> >  #include <linux/aer.h>
> >  #include <linux/bitfield.h>
> > +#include <linux/suspend.h>
> >  #include "pci.h"
> >  
> >  DEFINE_MUTEX(pci_slot_mutex);
> > @@ -2900,6 +2901,28 @@ void pci_config_pm_runtime_put(struct pci_dev *pdev)
> >  		pm_runtime_put_sync(parent);
> >  }
> >  
> > +/**
> > + * pci_dev_suspend_retention_supported - Check if the platform can retain the device
> > + *					 context during system suspend
> > + * @pdev: PCI device to check
> > + *
> > + * Returns true if the platform can guarantee to retain the device context,
> > + * false otherwise.
> > + */
> > +bool pci_dev_suspend_retention_supported(struct pci_dev *pdev)
> 
> This doesn't seem like the right name.  This isn't a property of the
> *device*; that's all determined by the PCI spec (devices must retain
> all internal state in D0, D1, and D2, they retain it in D3hot if
> No_Soft_Reset, and they never do in D3cold).
> 
> So this seems like something to do with the *platform* behavior.  It
> sounds like this is basically a way to learn whether the device might
> be put in D3cold on system suspend.
> 

That's correct. But I wanted to keep it device specific, since apart from
pm_suspend_via_firmware() there could be other issues causing context to be
lost. Like the issue with RC, brought up in the successive patches. There could
be chances that only one hierarchy might be affected. So making it device
specific would give us the granularity.

- Mani

> > +{
> > +	/*
> > +	 * If the platform firmware (like ACPI) is involved at the end of system
> > +	 * suspend, device context may not be retained.
> > +	 */
> > +	if (pm_suspend_via_firmware())
> > +		return false;
> > +
> > +	/* Assume that the context is retained by default */
> > +	return true;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_dev_suspend_retention_supported);
> > +
> >  static const struct dmi_system_id bridge_d3_blacklist[] = {
> >  #ifdef CONFIG_X86
> >  	{
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 1c270f1d5123..d9bc7ad4eaa4 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -2080,6 +2080,8 @@ pci_release_mem_regions(struct pci_dev *pdev)
> >  			    pci_select_bars(pdev, IORESOURCE_MEM));
> >  }
> >  
> > +bool pci_dev_suspend_retention_supported(struct pci_dev *pdev);
> > +
> >  #else /* CONFIG_PCI is not enabled */
> >  
> >  static inline void pci_set_flags(int flags) { }
> > @@ -2239,6 +2241,11 @@ pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
> >  static inline void pci_free_irq_vectors(struct pci_dev *dev)
> >  {
> >  }
> > +
> > +static inline bool pci_dev_suspend_retention_supported(struct pci_dev *pdev)
> > +{
> > +	return true;
> > +}
> >  #endif /* CONFIG_PCI */
> >  
> >  /* Include architecture-dependent settings and functions */
> > 
> > -- 
> > 2.51.0
> > 
> > 

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/4] PCI: qcom: Indicate broken L1ss exit during resume from system suspend
  2026-04-16 19:20   ` Bjorn Helgaas
@ 2026-04-17 12:06     ` Manivannan Sadhasivam
  2026-04-17 22:26       ` Bjorn Helgaas
  0 siblings, 1 reply; 15+ messages in thread
From: Manivannan Sadhasivam @ 2026-04-17 12:06 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: manivannan.sadhasivam, Bjorn Helgaas, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Rafael J. Wysocki, linux-pci,
	linux-kernel, linux-arm-msm, linux-nvme

On Thu, Apr 16, 2026 at 02:20:00PM -0500, Bjorn Helgaas wrote:
> [+cc Rafael]
> 
> On Tue, Apr 14, 2026 at 09:29:41PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > From: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
> > 
> > Qcom PCIe RCs can successfully exit from L1ss during OS runtime. However,
> > during system suspend, the Qcom PCIe RC driver may remove all resource
> > votes and turns off the PHY to maximize power savings.
> > 
> > Consequently, when the host is in system suspend with the link in L1ss and
> > the endpoint asserts CLKREQ#, the OS must first wake up and the RC driver
> > must restore the PHY and enable the refclk. This recovery process causes
> > the strict L1ss exit latency time to be exceeded. (If the RC driver were to
> > retain all votes during suspend, L1ss exit would succeed without issue, but
> > at the expense of higher power consumption).
> 
> I don't think the link can be in L1.x if the PHY is turned off, can
> it?  I assume if the PHY is off, the link would be in L2 (if aux power
> is available) or L3.
> 

As per the spec, if the link is in L1.2, the entire analog circuitry of the PHY
can be powered off and that's what I meant here. The LTSSM state would be
preserved by the MAC layer, whose context is always retained.

The only problem is that, CLKREQ# is routed to an Always-on-Domain (AON) inside
the SoC. So when the endpoint asserts CLKREQ#, AON wakes up the SoC and later
the PCIe controller driver turns ON the PHY. But by that time, the L1ss exit
latency would've elapsed, causing LDn.

> L2 and L3 both correspond to the downstream device being in D3cold
> (PCIe r7.0, sec 5.3.2), so I assume this is a reset as far as the
> device is concerned, and we need all the delays associated with reset
> and the D3cold -> D0 transition.
> 
> > This latency violation leads to an L1ss exit timeout, followed by a Link
> > Down (LDn) condition during resume. This LDn can crash the OS if the
> > endpoint hosts the RootFS, and for other types of devices, it may result in
> > a full device reset/recovery.
> 
> What does "L1SS exit timeout" mean in PCIe terms?  Is there some event
> (Message, interrupt, etc) that is triggered by the timeout?
> 

By 'L1ss exit timeout' I meant the failure to move to L0 state post L1.2 exit.
During L1.2 exit, the endpoint expects the refclk and common mode voltage to be
restored within the negotiated time. Per spec, r7.0, sec 5.5.3.3.1, Exit from
L1.2:

```
Next state is L1.0 after waiting for TPOWER_ON

* Common mode is permitted to be established passively during L1.0, and actively
during Recovery. In order to ensure common mode has been established, the
Downstream Port must maintain a timer, and the Downstream Port must continue to
send TS1 training sequences until a minimum of TCOMMONMODE has elapsed since the
Downstream Port has started transmitting TS1 training sequences and has detected
electrical idle exit on any Lane of the configured Link.
```

So if this condition is not satisfied, then the link would move to the LDn
state and that's the only event triggered to the OS.

> > So to ensure that the client drivers can properly handle this scenario, let
> > them know about this platform limitation by setting the
> > 'pci_host_bridge::broken_l1ss_resume' flag.
> 
> I don't see how this means L1SS is broken.  If the device is
> effectively reset, of course we can't go from L1.x to L0 because we
> didn't start from L1.x.
> 

From the OS perspective, the link would still be in L1ss and not expected to
move to L2/L3 during suspend/resume, since that transition is controlled by the
OS itself. But when the OS resumes, the link would go to LDn state and it can
only be brought back to L0, after a complete reset.

- Mani

> > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
> > ---
> >  drivers/pci/controller/dwc/pcie-qcom.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/drivers/pci/controller/dwc/pcie-qcom.c b/drivers/pci/controller/dwc/pcie-qcom.c
> > index 67a16af69ddc..01afffd384f2 100644
> > --- a/drivers/pci/controller/dwc/pcie-qcom.c
> > +++ b/drivers/pci/controller/dwc/pcie-qcom.c
> > @@ -1363,6 +1363,17 @@ static void qcom_pcie_host_post_init(struct dw_pcie_rp *pp)
> >  	struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> >  	struct qcom_pcie *pcie = to_qcom_pcie(pci);
> >  
> > +	/*
> > +	 * During system suspend, the Qcom RC driver may turn off the PHY and
> > +	 * remove votes to save power. If the endpoint asserts CLKREQ# to
> > +	 * exit L1ss, the time required to wake the system and restore the
> > +	 * PHY/refclk exceeds the strict L1ss exit timing, resulting in Link
> > +	 * Down (LDn). Set this flag to indicate this limitation to client
> > +	 * drivers so that they will avoid relying on L1ss during system
> > +	 * suspend.
> > +	 */
> > +	pp->bridge->broken_l1ss_resume = true;
> > +
> >  	if (pcie->cfg->ops->host_post_init)
> >  		pcie->cfg->ops->host_post_init(pcie);
> >  }
> > 
> > -- 
> > 2.51.0
> > 
> > 

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/4] PCI: qcom: Indicate broken L1ss exit during resume from system suspend
  2026-04-17 12:06     ` Manivannan Sadhasivam
@ 2026-04-17 22:26       ` Bjorn Helgaas
  2026-04-18  5:39         ` Manivannan Sadhasivam
  0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2026-04-17 22:26 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: manivannan.sadhasivam, Bjorn Helgaas, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Rafael J. Wysocki, linux-pci,
	linux-kernel, linux-arm-msm, linux-nvme

On Fri, Apr 17, 2026 at 05:36:42PM +0530, Manivannan Sadhasivam wrote:
> On Thu, Apr 16, 2026 at 02:20:00PM -0500, Bjorn Helgaas wrote:
> > On Tue, Apr 14, 2026 at 09:29:41PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > > From: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
> > > 
> > > Qcom PCIe RCs can successfully exit from L1ss during OS runtime.
> > > However, during system suspend, the Qcom PCIe RC driver may
> > > remove all resource votes and turns off the PHY to maximize
> > > power savings.
> > > 
> > > Consequently, when the host is in system suspend with the link
> > > in L1ss and the endpoint asserts CLKREQ#, the OS must first wake
> > > up and the RC driver must restore the PHY and enable the refclk.
> > > This recovery process causes the strict L1ss exit latency time
> > > to be exceeded. (If the RC driver were to retain all votes
> > > during suspend, L1ss exit would succeed without issue, but at
> > > the expense of higher power consumption).
> > 
> > I don't think the link can be in L1.x if the PHY is turned off,
> > can it?  I assume if the PHY is off, the link would be in L2 (if
> > aux power is available) or L3.
> 
> As per the spec, if the link is in L1.2, the entire analog circuitry
> of the PHY can be powered off and that's what I meant here. The
> LTSSM state would be preserved by the MAC layer, whose context is
> always retained.
> 
> The only problem is that, CLKREQ# is routed to an Always-on-Domain
> (AON) inside the SoC. So when the endpoint asserts CLKREQ#, AON
> wakes up the SoC and later the PCIe controller driver turns ON the
> PHY. But by that time, the L1ss exit latency would've elapsed,
> causing LDn.
> 
> > L2 and L3 both correspond to the downstream device being in D3cold
> > (PCIe r7.0, sec 5.3.2), so I assume this is a reset as far as the
> > device is concerned, and we need all the delays associated with
> > reset and the D3cold -> D0 transition.
> > 
> > > This latency violation leads to an L1ss exit timeout, followed
> > > by a Link Down (LDn) condition during resume. This LDn can crash
> > > the OS if the endpoint hosts the RootFS, and for other types of
> > > devices, it may result in a full device reset/recovery.
> > 
> > What does "L1SS exit timeout" mean in PCIe terms?  Is there some
> > event (Message, interrupt, etc) that is triggered by the timeout?
> 
> By 'L1ss exit timeout' I meant the failure to move to L0 state post
> L1.2 exit.  During L1.2 exit, the endpoint expects the refclk and
> common mode voltage to be restored within the negotiated time. Per
> spec, r7.0, sec 5.5.3.3.1, Exit from L1.2:
> 
> ```
> Next state is L1.0 after waiting for TPOWER_ON
> 
> * Common mode is permitted to be established passively during L1.0,
> and actively during Recovery. In order to ensure common mode has
> been established, the Downstream Port must maintain a timer, and the
> Downstream Port must continue to send TS1 training sequences until a
> minimum of TCOMMONMODE has elapsed since the Downstream Port has
> started transmitting TS1 training sequences and has detected
> electrical idle exit on any Lane of the configured Link.
> ```
> 
> So if this condition is not satisfied, then the link would move to
> the LDn state and that's the only event triggered to the OS.
> 
> > > So to ensure that the client drivers can properly handle this
> > > scenario, let them know about this platform limitation by
> > > setting the 'pci_host_bridge::broken_l1ss_resume' flag.
> > 
> > I don't see how this means L1SS is broken.  If the device is
> > effectively reset, of course we can't go from L1.x to L0 because
> > we didn't start from L1.x.
> 
> From the OS perspective, the link would still be in L1ss and not
> expected to move to L2/L3 during suspend/resume, since that
> transition is controlled by the OS itself. But when the OS resumes,
> the link would go to LDn state and it can only be brought back to
> L0, after a complete reset.

Thanks for the background.  It would help a lot if I had more of a
hardware background!

Does L1.2 have to meet the advertised L1 Exit Latency?  I assume maybe
it does because I don't see an exception for L1.x or any exit
latencies advertised in the L1 PM Substates Capability.

Regardless, I'd be kind of surprised if *any* system could meet an
L1.2 exit latency from a system suspend situation where PHY power is
removed.  On ACPI systems, the OS doesn't know how to remove PHY
power, so I don't think that situation can happen unless firmware
is involved in the suspend.

Maybe that's part of why pm_suspend_via_firmware() exists.  What if
native host drivers just called pm_set_suspend_via_firmware()?  After
all, if they support suspend, they're doing things that are done by
firmware on other systems.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API
  2026-04-17 11:04   ` Manivannan Sadhasivam
@ 2026-04-17 22:29     ` Bjorn Helgaas
  2026-04-18  5:16       ` Manivannan Sadhasivam
  0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2026-04-17 22:29 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: manivannan.sadhasivam, Bjorn Helgaas, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Rafael J. Wysocki, linux-pci,
	linux-kernel, linux-arm-msm, linux-nvme

On Fri, Apr 17, 2026 at 04:34:53PM +0530, Manivannan Sadhasivam wrote:
> On Thu, Apr 16, 2026 at 02:11:11PM -0500, Bjorn Helgaas wrote:
> > On Tue, Apr 14, 2026 at 09:29:38PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > > This series introduces a new PCI API
> > > pci_dev_suspend_retention_supported() to let the client drivers
> > > know whether they can expect context retention across
> > > suspend/resume or not and uses it in the NVMe PCI host driver.
> > > 
> > > This new API is targeted to abstract the PCI power management
> > > details away from the client drivers. This is needed because
> > > client drivers like NVMe make use of APIs such as
> > > pm_suspend_via_firmware() and decide to keep the device in low
> > > power mode if this API returns 'false'. But some platforms may
> > > have other limitations like in the case of Qcom, where if the RC
> > > driver removes the resource vote to allow the SoC to enter low
> > > power mode, it cannot reliably exit the L1ss state when the
> > > endpoint asserts CLKREQ#. So in this case also, the client
> > > drivers cannot keep the device in low power state during suspend
> > > and expect context retention.
> > 
> > I don't know what pm_suspend_via_firmware() means.  The kernel-doc
> > says "platform firmware is going to be invoked at the end of the
> > system-wide power management transition," but that doesn't say
> > anything about what firmware might do or what it means to drivers.
> 
> It's hard to predict what the firmware might do after it gains
> control from the OS. But as far as the API goes, it just expects the
> drivers to save the context and reset the device so that the
> firmware could do anything it want.

I don't see anything about the driver needing to reset the device.
(Kernel-doc says "driver *may* need to reset it" but no hint about how
to know.)

Adding something like "device internal state is not preserved" would
go a long ways here.

> > Based on d916b1be94b6 ("nvme-pci: use host managed power state for
> > suspend"), which used it in nvme_suspend(), I guess the assumption
> > is that pm_suspend_via_firmware() means the device might be put in
> > D3cold and lose all its internal state, and conversely,
> > !pm_suspend_via_firmware() means the device will *never* be put in
> > a low-power state that loses internal state.
> 
> Yes, that's the assumption. Though, the firmware might not do D3Cold
> at all, but the drivers should be prepared for that to be compatible
> with all firmware implementations.

I don't think it's useful for a driver to know "firmware might not do
D3cold".  What could a driver do with that?  Unless the driver *knows*
internal state will be preserved, it must act as though the state is
lost.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API
  2026-04-17 22:29     ` Bjorn Helgaas
@ 2026-04-18  5:16       ` Manivannan Sadhasivam
  0 siblings, 0 replies; 15+ messages in thread
From: Manivannan Sadhasivam @ 2026-04-18  5:16 UTC (permalink / raw)
  To: Bjorn Helgaas, Rafael J. Wysocki
  Cc: manivannan.sadhasivam, Bjorn Helgaas, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, linux-pci, linux-kernel,
	linux-arm-msm, linux-nvme

On Fri, Apr 17, 2026 at 05:29:04PM -0500, Bjorn Helgaas wrote:
> On Fri, Apr 17, 2026 at 04:34:53PM +0530, Manivannan Sadhasivam wrote:
> > On Thu, Apr 16, 2026 at 02:11:11PM -0500, Bjorn Helgaas wrote:
> > > On Tue, Apr 14, 2026 at 09:29:38PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > > > This series introduces a new PCI API
> > > > pci_dev_suspend_retention_supported() to let the client drivers
> > > > know whether they can expect context retention across
> > > > suspend/resume or not and uses it in the NVMe PCI host driver.
> > > > 
> > > > This new API is targeted to abstract the PCI power management
> > > > details away from the client drivers. This is needed because
> > > > client drivers like NVMe make use of APIs such as
> > > > pm_suspend_via_firmware() and decide to keep the device in low
> > > > power mode if this API returns 'false'. But some platforms may
> > > > have other limitations like in the case of Qcom, where if the RC
> > > > driver removes the resource vote to allow the SoC to enter low
> > > > power mode, it cannot reliably exit the L1ss state when the
> > > > endpoint asserts CLKREQ#. So in this case also, the client
> > > > drivers cannot keep the device in low power state during suspend
> > > > and expect context retention.
> > > 
> > > I don't know what pm_suspend_via_firmware() means.  The kernel-doc
> > > says "platform firmware is going to be invoked at the end of the
> > > system-wide power management transition," but that doesn't say
> > > anything about what firmware might do or what it means to drivers.
> > 
> > It's hard to predict what the firmware might do after it gains
> > control from the OS. But as far as the API goes, it just expects the
> > drivers to save the context and reset the device so that the
> > firmware could do anything it want.
> 
> I don't see anything about the driver needing to reset the device.
> (Kernel-doc says "driver *may* need to reset it" but no hint about how
> to know.)
> 
> Adding something like "device internal state is not preserved" would
> go a long ways here.
> 

IIUC, 'may' is used in the description because not all firmware are going to
turn off or do something with the device. But for a driver that is supposed to
work with all firmware implementations, like a NIC/storage client driver, it
should save the internal state and prepare for a possible power loss. This is
what the NVMe driver does currently.

> > > Based on d916b1be94b6 ("nvme-pci: use host managed power state for
> > > suspend"), which used it in nvme_suspend(), I guess the assumption
> > > is that pm_suspend_via_firmware() means the device might be put in
> > > D3cold and lose all its internal state, and conversely,
> > > !pm_suspend_via_firmware() means the device will *never* be put in
> > > a low-power state that loses internal state.
> > 
> > Yes, that's the assumption. Though, the firmware might not do D3Cold
> > at all, but the drivers should be prepared for that to be compatible
> > with all firmware implementations.
> 
> I don't think it's useful for a driver to know "firmware might not do
> D3cold".  What could a driver do with that?  Unless the driver *knows*
> internal state will be preserved, it must act as though the state is
> lost.

A driver doesn't need to know whether device will be put into D3Cold or not. But
it does need to know whether there is a possibility or not. Because, AFAIK,
there is no way the OS can query what the firmware is going to do at the end of
the suspend. So to be on the conservative side, this API gives an indication to
the client drivers saying 'hey, firmware is going to be invoked at the end of
suspend and it may do something with the device state like invoking D3Cold or
doing something else. So be prepared for that.'

And 'be prepared' means, saving the context and resetting the device.

@Rafael: Please correct me if my above understanding is wrong.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/4] PCI: qcom: Indicate broken L1ss exit during resume from system suspend
  2026-04-17 22:26       ` Bjorn Helgaas
@ 2026-04-18  5:39         ` Manivannan Sadhasivam
  0 siblings, 0 replies; 15+ messages in thread
From: Manivannan Sadhasivam @ 2026-04-18  5:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: manivannan.sadhasivam, Bjorn Helgaas, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Rob Herring, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg, Rafael J. Wysocki, linux-pci,
	linux-kernel, linux-arm-msm, linux-nvme

On Fri, Apr 17, 2026 at 05:26:15PM -0500, Bjorn Helgaas wrote:
> On Fri, Apr 17, 2026 at 05:36:42PM +0530, Manivannan Sadhasivam wrote:
> > On Thu, Apr 16, 2026 at 02:20:00PM -0500, Bjorn Helgaas wrote:
> > > On Tue, Apr 14, 2026 at 09:29:41PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > > > From: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
> > > > 
> > > > Qcom PCIe RCs can successfully exit from L1ss during OS runtime.
> > > > However, during system suspend, the Qcom PCIe RC driver may
> > > > remove all resource votes and turns off the PHY to maximize
> > > > power savings.
> > > > 
> > > > Consequently, when the host is in system suspend with the link
> > > > in L1ss and the endpoint asserts CLKREQ#, the OS must first wake
> > > > up and the RC driver must restore the PHY and enable the refclk.
> > > > This recovery process causes the strict L1ss exit latency time
> > > > to be exceeded. (If the RC driver were to retain all votes
> > > > during suspend, L1ss exit would succeed without issue, but at
> > > > the expense of higher power consumption).
> > > 
> > > I don't think the link can be in L1.x if the PHY is turned off,
> > > can it?  I assume if the PHY is off, the link would be in L2 (if
> > > aux power is available) or L3.
> > 
> > As per the spec, if the link is in L1.2, the entire analog circuitry
> > of the PHY can be powered off and that's what I meant here. The
> > LTSSM state would be preserved by the MAC layer, whose context is
> > always retained.
> > 
> > The only problem is that, CLKREQ# is routed to an Always-on-Domain
> > (AON) inside the SoC. So when the endpoint asserts CLKREQ#, AON
> > wakes up the SoC and later the PCIe controller driver turns ON the
> > PHY. But by that time, the L1ss exit latency would've elapsed,
> > causing LDn.
> > 
> > > L2 and L3 both correspond to the downstream device being in D3cold
> > > (PCIe r7.0, sec 5.3.2), so I assume this is a reset as far as the
> > > device is concerned, and we need all the delays associated with
> > > reset and the D3cold -> D0 transition.
> > > 
> > > > This latency violation leads to an L1ss exit timeout, followed
> > > > by a Link Down (LDn) condition during resume. This LDn can crash
> > > > the OS if the endpoint hosts the RootFS, and for other types of
> > > > devices, it may result in a full device reset/recovery.
> > > 
> > > What does "L1SS exit timeout" mean in PCIe terms?  Is there some
> > > event (Message, interrupt, etc) that is triggered by the timeout?
> > 
> > By 'L1ss exit timeout' I meant the failure to move to L0 state post
> > L1.2 exit.  During L1.2 exit, the endpoint expects the refclk and
> > common mode voltage to be restored within the negotiated time. Per
> > spec, r7.0, sec 5.5.3.3.1, Exit from L1.2:
> > 
> > ```
> > Next state is L1.0 after waiting for TPOWER_ON
> > 
> > * Common mode is permitted to be established passively during L1.0,
> > and actively during Recovery. In order to ensure common mode has
> > been established, the Downstream Port must maintain a timer, and the
> > Downstream Port must continue to send TS1 training sequences until a
> > minimum of TCOMMONMODE has elapsed since the Downstream Port has
> > started transmitting TS1 training sequences and has detected
> > electrical idle exit on any Lane of the configured Link.
> > ```
> > 
> > So if this condition is not satisfied, then the link would move to
> > the LDn state and that's the only event triggered to the OS.
> > 
> > > > So to ensure that the client drivers can properly handle this
> > > > scenario, let them know about this platform limitation by
> > > > setting the 'pci_host_bridge::broken_l1ss_resume' flag.
> > > 
> > > I don't see how this means L1SS is broken.  If the device is
> > > effectively reset, of course we can't go from L1.x to L0 because
> > > we didn't start from L1.x.
> > 
> > From the OS perspective, the link would still be in L1ss and not
> > expected to move to L2/L3 during suspend/resume, since that
> > transition is controlled by the OS itself. But when the OS resumes,
> > the link would go to LDn state and it can only be brought back to
> > L0, after a complete reset.
> 
> Thanks for the background.  It would help a lot if I had more of a
> hardware background!
> 
> Does L1.2 have to meet the advertised L1 Exit Latency?  I assume maybe
> it does because I don't see an exception for L1.x or any exit
> latencies advertised in the L1 PM Substates Capability.
> 

As per my understanding, 'L1 Exit Latency' only covers ASPM L1 state, not L1ss.
Because, 'L1 Exit Latency' field exists even before L1 PM Substates got
introduced in r3.1. So it doesn't cover L1.2 exit latency.

> Regardless, I'd be kind of surprised if *any* system could meet an
> L1.2 exit latency from a system suspend situation where PHY power is
> removed.  On ACPI systems, the OS doesn't know how to remove PHY
> power, so I don't think that situation can happen unless firmware
> is involved in the suspend.
> 

Yes, you are right. Even for systems turning off the PHY completely, they should
have some mechanism to detect the CLKREQ# assert and turn ON the PHY within the
expected time. On our Qcom platforms, we do have some co-processors handling
this even before the OS wakesup. But support for that co-processor is currently
not available in upstream and we don't know when it is going to be added. Until
then, we only have one option to not put the link to L1ss during suspend and
keep the devices into D3Cold to achieve the SoC low power state.

> Maybe that's part of why pm_suspend_via_firmware() exists.  What if
> native host drivers just called pm_set_suspend_via_firmware()?  After
> all, if they support suspend, they're doing things that are done by
> firmware on other systems.

No, that would be inappropriate. pm_set_suspend_via_firmware() is supposed to
be called only when the firmware is invoked at the end of suspend. If OS handles
everything and not the firmware, there is no need to invoke this API.

- Mani

-- 
மணிவண்ணன் சதாசிவம்

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2026-04-18  5:39 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-14 15:59 [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API Manivannan Sadhasivam via B4 Relay
2026-04-14 15:59 ` [PATCH 1/4] PCI: Introduce an API to check if RC/platform can retain device context during suspend Manivannan Sadhasivam via B4 Relay
2026-04-16 19:18   ` Bjorn Helgaas
2026-04-17 11:11     ` Manivannan Sadhasivam
2026-04-14 15:59 ` [PATCH 2/4] PCI: Indicate context lost if L1ss exit is broken during resume from system suspend Manivannan Sadhasivam via B4 Relay
2026-04-14 15:59 ` [PATCH 3/4] PCI: qcom: Indicate broken L1ss exit " Manivannan Sadhasivam via B4 Relay
2026-04-16 19:20   ` Bjorn Helgaas
2026-04-17 12:06     ` Manivannan Sadhasivam
2026-04-17 22:26       ` Bjorn Helgaas
2026-04-18  5:39         ` Manivannan Sadhasivam
2026-04-14 15:59 ` [PATCH 4/4] nvme-pci: Use pci_dev_suspend_retention_supported() API during suspend Manivannan Sadhasivam via B4 Relay
2026-04-16 19:11 ` [PATCH 0/4] PCI: Introduce pci_dev_suspend_retention_supported() API Bjorn Helgaas
2026-04-17 11:04   ` Manivannan Sadhasivam
2026-04-17 22:29     ` Bjorn Helgaas
2026-04-18  5:16       ` Manivannan Sadhasivam

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox