* [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
@ 2024-11-18 8:23 Manivannan Sadhasivam
2024-11-18 12:58 ` Christoph Hellwig
2024-11-22 22:20 ` Bjorn Helgaas
0 siblings, 2 replies; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-11-18 8:23 UTC (permalink / raw)
To: kbusch, axboe, hch, sagi
Cc: linux-nvme, linux-kernel, linux-pci, andersson, konradybcio,
Manivannan Sadhasivam
PCI core allows users to configure the D3Cold state for each PCI device
through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This
attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users
to allow/disallow the PCI devices to enter D3Cold during system suspend.
So make use of this flag in the NVMe driver to shutdown the NVMe device
during system suspend if the user has allowed D3Cold for the device.
Existing checks in the NVMe driver decide whether to shut down the device
(based on platform/device limitations), so use this flag as the last resort
to keep the existing behavior.
The default behavior of the 'pci_dev:d3cold_allowed' flag is to allow
D3Cold and the users can disallow it through sysfs if they want.
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
---
drivers/nvme/host/pci.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4b9fda0b1d9a..a4d4687854bf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3287,7 +3287,8 @@ static int nvme_suspend(struct device *dev)
*/
if (pm_suspend_via_firmware() || !ctrl->npss ||
!pcie_aspm_enabled(pdev) ||
- (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND))
+ (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND) ||
+ pdev->d3cold_allowed)
return nvme_disable_prepare_reset(ndev, true);
nvme_start_freeze(ctrl);
--
2.25.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-11-18 8:23 [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user Manivannan Sadhasivam
@ 2024-11-18 12:58 ` Christoph Hellwig
2024-11-18 14:58 ` Manivannan Sadhasivam
2024-11-22 22:20 ` Bjorn Helgaas
1 sibling, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-11-18 12:58 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: kbusch, axboe, hch, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Bjorn Helgaas, Acked-by
On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> PCI core allows users to configure the D3Cold state for each PCI device
> through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This
> attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users
> to allow/disallow the PCI devices to enter D3Cold during system suspend.
>
> So make use of this flag in the NVMe driver to shutdown the NVMe device
> during system suspend if the user has allowed D3Cold for the device.
> Existing checks in the NVMe driver decide whether to shut down the device
> (based on platform/device limitations), so use this flag as the last resort
> to keep the existing behavior.
Umm, what? The documentation of this attribute says:
"d3cold_allowed is bit to control whether the corresponding PCI
device can be put into D3Cold state. If it is cleared, the
device will never be put into D3Cold state. If it is set, the
device may be put into D3Cold state if other requirements are
satisfied too. Reading this attribute will show the current
value of d3cold_allowed bit. Writing this attribute will
the value of d3cold_allowed bit."
Which honestly already sounds rather non-specific, but everything but
a mandate for drivers to act on it.
The only place currently checking it is pci_dev_check_d3cold in the
PCI core, which is used to set the bridge_d3 attibute.
So blindly using it in a driver to force a different PM strategy feels
completely wrong. Even if the attrite should have that effect it
needs to happen through a well documented PCI or PM layer helper and
open coded like this.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-11-18 12:58 ` Christoph Hellwig
@ 2024-11-18 14:58 ` Manivannan Sadhasivam
0 siblings, 0 replies; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-11-18 14:58 UTC (permalink / raw)
To: Christoph Hellwig
Cc: kbusch, axboe, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Bjorn Helgaas, Rafael J. Wysocki
On Mon, Nov 18, 2024 at 01:58:17PM +0100, Christoph Hellwig wrote:
> On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> > PCI core allows users to configure the D3Cold state for each PCI device
> > through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This
> > attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users
> > to allow/disallow the PCI devices to enter D3Cold during system suspend.
> >
> > So make use of this flag in the NVMe driver to shutdown the NVMe device
> > during system suspend if the user has allowed D3Cold for the device.
> > Existing checks in the NVMe driver decide whether to shut down the device
> > (based on platform/device limitations), so use this flag as the last resort
> > to keep the existing behavior.
>
> Umm, what? The documentation of this attribute says:
>
> "d3cold_allowed is bit to control whether the corresponding PCI
> device can be put into D3Cold state. If it is cleared, the
> device will never be put into D3Cold state. If it is set, the
> device may be put into D3Cold state if other requirements are
> satisfied too. Reading this attribute will show the current
> value of d3cold_allowed bit. Writing this attribute will
> the value of d3cold_allowed bit."
>
> Which honestly already sounds rather non-specific, but everything but
> a mandate for drivers to act on it.
>
> The only place currently checking it is pci_dev_check_d3cold in the
> PCI core, which is used to set the bridge_d3 attibute.
>
Yeah, it is pretty much used internally up until now. But the attribute looks
like a close match of what I could find for this usecase and that's why I used
it.
> So blindly using it in a driver to force a different PM strategy feels
> completely wrong. Even if the attrite should have that effect it
> needs to happen through a well documented PCI or PM layer helper and
> open coded like this.
>
Ok. I'd like to get some feedback from Bjorn H (PCI maintainer) about using this
attribute before moving forward with a helper.
Thanks!
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-11-18 8:23 [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user Manivannan Sadhasivam
2024-11-18 12:58 ` Christoph Hellwig
@ 2024-11-22 22:20 ` Bjorn Helgaas
2024-11-23 9:01 ` Manivannan Sadhasivam
1 sibling, 1 reply; 47+ messages in thread
From: Bjorn Helgaas @ 2024-11-22 22:20 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: kbusch, axboe, hch, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Rafael J. Wysocki
[+cc Rafael]
On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> PCI core allows users to configure the D3Cold state for each PCI device
> through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This
> attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users
> to allow/disallow the PCI devices to enter D3Cold during system suspend.
>
> So make use of this flag in the NVMe driver to shutdown the NVMe device
> during system suspend if the user has allowed D3Cold for the device.
> Existing checks in the NVMe driver decide whether to shut down the device
> (based on platform/device limitations), so use this flag as the last resort
> to keep the existing behavior.
>
> The default behavior of the 'pci_dev:d3cold_allowed' flag is to allow
> D3Cold and the users can disallow it through sysfs if they want.
What problem does this solve? I guess there must be a case where
suspend leaves NVMe in a higher power state than you want?
What does it mean that this is the "last resort to keep the existing
behavior"? All the checks are OR'd together and none have side
effects, so the order doesn't really matter. It changes the existing
behavior *unless* the user has explicitly cleared d3cold_allowed via
sysfs.
pdev->d3cold_allowed is set by default, so I guess this change means
that unless the user clears d3cold_allowed, we let the PCI core decide
the suspend state instead of managing it directly in nvme_suspend()?
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> ---
> drivers/nvme/host/pci.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 4b9fda0b1d9a..a4d4687854bf 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3287,7 +3287,8 @@ static int nvme_suspend(struct device *dev)
> */
> if (pm_suspend_via_firmware() || !ctrl->npss ||
> !pcie_aspm_enabled(pdev) ||
> - (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND))
> + (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND) ||
> + pdev->d3cold_allowed)
> return nvme_disable_prepare_reset(ndev, true);
I guess your intent is that if users prevent use of D3cold via sysfs,
we'll use the NVMe-specific power states, and otherwise, the PCI core
will decide?
I think returning 0 here means the PCI core decides what state to use
in the pci_pm_suspend_noirq() -> pci_prepare_to_sleep() path. This
could be any state from D0 to D3cold depending on platform support and
wakeup considerations (see pci_target_state()).
I'm not sure the use of pdev->d3cold_allowed here really expresses
your underlying intent. It suggests that you're really hoping for
D3cold, but that's only a possibility if firmware supports it, and we
have no visibility into that here.
It also seems contrary to the earlier comment that suggests we prefer
host managed nvme power settings:
* The platform does not remove power for a kernel managed suspend so
* use host managed nvme power settings for lowest idle power if
* possible. This should have quicker resume latency than a full device
* shutdown. But if the firmware is involved after the suspend or the
* device does not support any non-default power states, shut down the
* device fully.
Bjorn
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-11-22 22:20 ` Bjorn Helgaas
@ 2024-11-23 9:01 ` Manivannan Sadhasivam
2024-11-26 17:11 ` Bjorn Andersson
2024-12-05 23:29 ` Bjorn Helgaas
0 siblings, 2 replies; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-11-23 9:01 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: kbusch, axboe, hch, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Rafael J. Wysocki, Ulf Hansson
+ Ulf (also interested in this topic)
On Fri, Nov 22, 2024 at 04:20:50PM -0600, Bjorn Helgaas wrote:
> [+cc Rafael]
>
> On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> > PCI core allows users to configure the D3Cold state for each PCI device
> > through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This
> > attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users
> > to allow/disallow the PCI devices to enter D3Cold during system suspend.
> >
> > So make use of this flag in the NVMe driver to shutdown the NVMe device
> > during system suspend if the user has allowed D3Cold for the device.
> > Existing checks in the NVMe driver decide whether to shut down the device
> > (based on platform/device limitations), so use this flag as the last resort
> > to keep the existing behavior.
> >
> > The default behavior of the 'pci_dev:d3cold_allowed' flag is to allow
> > D3Cold and the users can disallow it through sysfs if they want.
>
> What problem does this solve? I guess there must be a case where
> suspend leaves NVMe in a higher power state than you want?
>
Yeah, this is the case for all systems that doesn't fit into the existing checks
in the NVMe suspend path:
1. ACPI based platforms
2. Controller doesn't support NPSS (hardware issue/limitation)
3. ASPM not enabled
4. Devices/systems setting NVME_QUIRK_SIMPLE_SUSPEND flag
In my case, all the Qualcomm SoCs using Devicetree doesn't fall into the above
checks. Hence, they were not fully powered down during system suspend and always
in low power state. This means, I cannot achieve 'CX power collapse', a Qualcomm
specific SoC powered down state that consumes just enough power to wake up the
SoC. Since the controller driver keeps the PCI resource vote because of NVMe,
the firmware in the Qualcomm SoCs cannot put the SoC into above mentioned low
power state.
> What does it mean that this is the "last resort to keep the existing
> behavior"? All the checks are OR'd together and none have side
> effects, so the order doesn't really matter. It changes the existing
> behavior *unless* the user has explicitly cleared d3cold_allowed via
> sysfs.
>
Since the checks are ORed, this new check is not going to change the existing
behavior for systems satisfying above checks i.e., even if the user changes the
flag to forbid D3Cold, it won't affect them and it *shoudn't*.
> pdev->d3cold_allowed is set by default, so I guess this change means
> that unless the user clears d3cold_allowed, we let the PCI core decide
> the suspend state instead of managing it directly in nvme_suspend()?
>
If 'pdev->d3cold_allowed' is set, then NVMe driver will shutdown the device and
the PCI controller driver can turn off all bus specific resources. Otherwise,
NVMe driver will put the device into low power mode and the controller driver
has to do something similar to retain the device power.
> > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > ---
> > drivers/nvme/host/pci.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 4b9fda0b1d9a..a4d4687854bf 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -3287,7 +3287,8 @@ static int nvme_suspend(struct device *dev)
> > */
> > if (pm_suspend_via_firmware() || !ctrl->npss ||
> > !pcie_aspm_enabled(pdev) ||
> > - (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND))
> > + (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND) ||
> > + pdev->d3cold_allowed)
> > return nvme_disable_prepare_reset(ndev, true);
>
> I guess your intent is that if users prevent use of D3cold via sysfs,
> we'll use the NVMe-specific power states, and otherwise, the PCI core
> will decide?
>
Not PCI core, but the controller drivers actually which takes care of powering
down the PCI bus resources.
> I think returning 0 here means the PCI core decides what state to use
> in the pci_pm_suspend_noirq() -> pci_prepare_to_sleep() path. This
> could be any state from D0 to D3cold depending on platform support and
> wakeup considerations (see pci_target_state()).
>
> I'm not sure the use of pdev->d3cold_allowed here really expresses
> your underlying intent. It suggests that you're really hoping for
> D3cold, but that's only a possibility if firmware supports it, and we
> have no visibility into that here.
>
I'm not relying on firmware to do anything here. If firmware has to decide the
suspend state, it should already satisfy the pm_suspend_via_firmware() check in
nvme_suspend(). Here, the controller driver takes care of putting the device
into D3Cold. Currently, the controller drivers cannot do it (on DT platforms)
because of NVMe driver's behavior.
> It also seems contrary to the earlier comment that suggests we prefer
> host managed nvme power settings:
>
> * The platform does not remove power for a kernel managed suspend so
> * use host managed nvme power settings for lowest idle power if
> * possible. This should have quicker resume latency than a full device
> * shutdown. But if the firmware is involved after the suspend or the
> * device does not support any non-default power states, shut down the
> * device fully.
This above comment satisfies the ACPI platforms as the firmware controls the
suspend state. On DT platforms, even though the firmware takes care of suspend
state, it still relies on the controller driver to relinquish the votes for PCI
resources. Only then, the firmware will put the whole SoC in power down mode
a.k.a CX power collapse mode in Qcom SoCs.
We did attempt so solve this problem in multiple ways, but the lesson learned
was, kernel cannot decide the power mode without help from userspace. That's the
reason I wanted to make use of this 'd3cold_allowed' sysfs attribute to allow
userspace to override the D3Cold if it wants based on platform requirement.
This is similar to how UFS allows users to configure power states of both the
device and controller:
/sys/bus/platform/drivers/ufshcd/*/spm_lvl
/sys/bus/platform/devices/*.ufs/spm_lvl
If the 'd3cold_allowed' attribute is not a good fit for this usecase, then I'd
like to introduce a new attribute similar to UFS.
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-11-23 9:01 ` Manivannan Sadhasivam
@ 2024-11-26 17:11 ` Bjorn Andersson
2024-11-27 5:49 ` Manivannan Sadhasivam
2024-12-05 23:29 ` Bjorn Helgaas
1 sibling, 1 reply; 47+ messages in thread
From: Bjorn Andersson @ 2024-11-26 17:11 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: Bjorn Helgaas, kbusch, axboe, hch, sagi, linux-nvme, linux-kernel,
linux-pci, konradybcio, Rafael J. Wysocki, Ulf Hansson
On Sat, Nov 23, 2024 at 02:31:13PM +0530, Manivannan Sadhasivam wrote:
> + Ulf (also interested in this topic)
>
> On Fri, Nov 22, 2024 at 04:20:50PM -0600, Bjorn Helgaas wrote:
> > [+cc Rafael]
> >
> > On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> > > PCI core allows users to configure the D3Cold state for each PCI device
> > > through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This
> > > attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users
> > > to allow/disallow the PCI devices to enter D3Cold during system suspend.
> > >
> > > So make use of this flag in the NVMe driver to shutdown the NVMe device
> > > during system suspend if the user has allowed D3Cold for the device.
> > > Existing checks in the NVMe driver decide whether to shut down the device
> > > (based on platform/device limitations), so use this flag as the last resort
> > > to keep the existing behavior.
> > >
> > > The default behavior of the 'pci_dev:d3cold_allowed' flag is to allow
> > > D3Cold and the users can disallow it through sysfs if they want.
> >
> > What problem does this solve? I guess there must be a case where
> > suspend leaves NVMe in a higher power state than you want?
> >
>
> Yeah, this is the case for all systems that doesn't fit into the existing checks
> in the NVMe suspend path:
>
> 1. ACPI based platforms
> 2. Controller doesn't support NPSS (hardware issue/limitation)
> 3. ASPM not enabled
> 4. Devices/systems setting NVME_QUIRK_SIMPLE_SUSPEND flag
>
> In my case, all the Qualcomm SoCs using Devicetree doesn't fall into the above
> checks. Hence, they were not fully powered down during system suspend and always
> in low power state. This means, I cannot achieve 'CX power collapse', a Qualcomm
> specific SoC powered down state that consumes just enough power to wake up the
> SoC. Since the controller driver keeps the PCI resource vote because of NVMe,
> the firmware in the Qualcomm SoCs cannot put the SoC into above mentioned low
> power state.
>
Per our previous discussions, I think we concluded that we have two
cases:
a) of_machine_is_compatible("qcom,sc8280xp")
b) Everything else
In #a we have to power down NVMe as the link doesn't survive the CX
collapse, is this your #2?. For #b it's primarily a policy decision
about the tradeoff between battery and flash life (and in some cases, as
already seen in the nvme driver, some device-specific policy).
> > What does it mean that this is the "last resort to keep the existing
> > behavior"? All the checks are OR'd together and none have side
> > effects, so the order doesn't really matter. It changes the existing
> > behavior *unless* the user has explicitly cleared d3cold_allowed via
> > sysfs.
> >
>
> Since the checks are ORed, this new check is not going to change the existing
> behavior for systems satisfying above checks i.e., even if the user changes the
> flag to forbid D3Cold, it won't affect them and it *shoudn't*.
>
> > pdev->d3cold_allowed is set by default, so I guess this change means
> > that unless the user clears d3cold_allowed, we let the PCI core decide
> > the suspend state instead of managing it directly in nvme_suspend()?
> >
>
> If 'pdev->d3cold_allowed' is set, then NVMe driver will shutdown the device and
> the PCI controller driver can turn off all bus specific resources. Otherwise,
> NVMe driver will put the device into low power mode and the controller driver
> has to do something similar to retain the device power.
>
> > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > > ---
> > > drivers/nvme/host/pci.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > index 4b9fda0b1d9a..a4d4687854bf 100644
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -3287,7 +3287,8 @@ static int nvme_suspend(struct device *dev)
> > > */
> > > if (pm_suspend_via_firmware() || !ctrl->npss ||
> > > !pcie_aspm_enabled(pdev) ||
> > > - (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND))
> > > + (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND) ||
> > > + pdev->d3cold_allowed)
> > > return nvme_disable_prepare_reset(ndev, true);
> >
> > I guess your intent is that if users prevent use of D3cold via sysfs,
> > we'll use the NVMe-specific power states, and otherwise, the PCI core
> > will decide?
> >
>
> Not PCI core, but the controller drivers actually which takes care of powering
> down the PCI bus resources.
>
> > I think returning 0 here means the PCI core decides what state to use
> > in the pci_pm_suspend_noirq() -> pci_prepare_to_sleep() path. This
> > could be any state from D0 to D3cold depending on platform support and
> > wakeup considerations (see pci_target_state()).
> >
> > I'm not sure the use of pdev->d3cold_allowed here really expresses
> > your underlying intent. It suggests that you're really hoping for
> > D3cold, but that's only a possibility if firmware supports it, and we
> > have no visibility into that here.
> >
>
> I'm not relying on firmware to do anything here. If firmware has to decide the
> suspend state, it should already satisfy the pm_suspend_via_firmware() check in
> nvme_suspend(). Here, the controller driver takes care of putting the device
> into D3Cold. Currently, the controller drivers cannot do it (on DT platforms)
> because of NVMe driver's behavior.
>
> > It also seems contrary to the earlier comment that suggests we prefer
> > host managed nvme power settings:
> >
> > * The platform does not remove power for a kernel managed suspend so
> > * use host managed nvme power settings for lowest idle power if
> > * possible. This should have quicker resume latency than a full device
> > * shutdown. But if the firmware is involved after the suspend or the
> > * device does not support any non-default power states, shut down the
> > * device fully.
>
> This above comment satisfies the ACPI platforms as the firmware controls the
> suspend state. On DT platforms, even though the firmware takes care of suspend
> state, it still relies on the controller driver to relinquish the votes for PCI
> resources. Only then, the firmware will put the whole SoC in power down mode
> a.k.a CX power collapse mode in Qcom SoCs.
>
> We did attempt so solve this problem in multiple ways, but the lesson learned
> was, kernel cannot decide the power mode without help from userspace. That's the
> reason I wanted to make use of this 'd3cold_allowed' sysfs attribute to allow
> userspace to override the D3Cold if it wants based on platform requirement.
>
What do you mean "without help from userspace". Which entity in
userspace is going to help make this decision?
> This is similar to how UFS allows users to configure power states of both the
> device and controller:
>
"Allow users to configure" is an optimization, is the purpose of this
patch to allow similar kind of optimization? Or is it supposed to allow
userspace to help solve case #a above (hardware doesn't survive CX
collapse)?
Regards,
Bjorn
> /sys/bus/platform/drivers/ufshcd/*/spm_lvl
> /sys/bus/platform/devices/*.ufs/spm_lvl
>
> If the 'd3cold_allowed' attribute is not a good fit for this usecase, then I'd
> like to introduce a new attribute similar to UFS.
>
> - Mani
>
> --
> மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-11-26 17:11 ` Bjorn Andersson
@ 2024-11-27 5:49 ` Manivannan Sadhasivam
0 siblings, 0 replies; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-11-27 5:49 UTC (permalink / raw)
To: Bjorn Andersson
Cc: Bjorn Helgaas, kbusch, axboe, hch, sagi, linux-nvme, linux-kernel,
linux-pci, konradybcio, Rafael J. Wysocki, Ulf Hansson
On Tue, Nov 26, 2024 at 11:11:38AM -0600, Bjorn Andersson wrote:
> On Sat, Nov 23, 2024 at 02:31:13PM +0530, Manivannan Sadhasivam wrote:
> > + Ulf (also interested in this topic)
> >
> > On Fri, Nov 22, 2024 at 04:20:50PM -0600, Bjorn Helgaas wrote:
> > > [+cc Rafael]
> > >
> > > On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> > > > PCI core allows users to configure the D3Cold state for each PCI device
> > > > through the sysfs attribute '/sys/bus/pci/devices/.../d3cold_allowed'. This
> > > > attribute sets the 'pci_dev:d3cold_allowed' flag and could be used by users
> > > > to allow/disallow the PCI devices to enter D3Cold during system suspend.
> > > >
> > > > So make use of this flag in the NVMe driver to shutdown the NVMe device
> > > > during system suspend if the user has allowed D3Cold for the device.
> > > > Existing checks in the NVMe driver decide whether to shut down the device
> > > > (based on platform/device limitations), so use this flag as the last resort
> > > > to keep the existing behavior.
> > > >
> > > > The default behavior of the 'pci_dev:d3cold_allowed' flag is to allow
> > > > D3Cold and the users can disallow it through sysfs if they want.
> > >
> > > What problem does this solve? I guess there must be a case where
> > > suspend leaves NVMe in a higher power state than you want?
> > >
> >
> > Yeah, this is the case for all systems that doesn't fit into the existing checks
> > in the NVMe suspend path:
> >
> > 1. ACPI based platforms
> > 2. Controller doesn't support NPSS (hardware issue/limitation)
> > 3. ASPM not enabled
> > 4. Devices/systems setting NVME_QUIRK_SIMPLE_SUSPEND flag
> >
> > In my case, all the Qualcomm SoCs using Devicetree doesn't fall into the above
> > checks. Hence, they were not fully powered down during system suspend and always
> > in low power state. This means, I cannot achieve 'CX power collapse', a Qualcomm
> > specific SoC powered down state that consumes just enough power to wake up the
> > SoC. Since the controller driver keeps the PCI resource vote because of NVMe,
> > the firmware in the Qualcomm SoCs cannot put the SoC into above mentioned low
> > power state.
> >
>
> Per our previous discussions, I think we concluded that we have two
> cases:
> a) of_machine_is_compatible("qcom,sc8280xp")
> b) Everything else
>
> In #a we have to power down NVMe as the link doesn't survive the CX
> collapse, is this your #2?. For #b it's primarily a policy decision
> about the tradeoff between battery and flash life (and in some cases, as
> already seen in the nvme driver, some device-specific policy).
>
With this proposed usage of the 'd3cold_allowed' sysfs attribute, all devices
will enter D3Cold by default, unless overridden by userspace. So #a will be
covered by default, and #b is left to userspace.
> > > What does it mean that this is the "last resort to keep the existing
> > > behavior"? All the checks are OR'd together and none have side
> > > effects, so the order doesn't really matter. It changes the existing
> > > behavior *unless* the user has explicitly cleared d3cold_allowed via
> > > sysfs.
> > >
> >
> > Since the checks are ORed, this new check is not going to change the existing
> > behavior for systems satisfying above checks i.e., even if the user changes the
> > flag to forbid D3Cold, it won't affect them and it *shoudn't*.
> >
> > > pdev->d3cold_allowed is set by default, so I guess this change means
> > > that unless the user clears d3cold_allowed, we let the PCI core decide
> > > the suspend state instead of managing it directly in nvme_suspend()?
> > >
> >
> > If 'pdev->d3cold_allowed' is set, then NVMe driver will shutdown the device and
> > the PCI controller driver can turn off all bus specific resources. Otherwise,
> > NVMe driver will put the device into low power mode and the controller driver
> > has to do something similar to retain the device power.
> >
> > > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > > > ---
> > > > drivers/nvme/host/pci.c | 3 ++-
> > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > > index 4b9fda0b1d9a..a4d4687854bf 100644
> > > > --- a/drivers/nvme/host/pci.c
> > > > +++ b/drivers/nvme/host/pci.c
> > > > @@ -3287,7 +3287,8 @@ static int nvme_suspend(struct device *dev)
> > > > */
> > > > if (pm_suspend_via_firmware() || !ctrl->npss ||
> > > > !pcie_aspm_enabled(pdev) ||
> > > > - (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND))
> > > > + (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND) ||
> > > > + pdev->d3cold_allowed)
> > > > return nvme_disable_prepare_reset(ndev, true);
> > >
> > > I guess your intent is that if users prevent use of D3cold via sysfs,
> > > we'll use the NVMe-specific power states, and otherwise, the PCI core
> > > will decide?
> > >
> >
> > Not PCI core, but the controller drivers actually which takes care of powering
> > down the PCI bus resources.
> >
> > > I think returning 0 here means the PCI core decides what state to use
> > > in the pci_pm_suspend_noirq() -> pci_prepare_to_sleep() path. This
> > > could be any state from D0 to D3cold depending on platform support and
> > > wakeup considerations (see pci_target_state()).
> > >
> > > I'm not sure the use of pdev->d3cold_allowed here really expresses
> > > your underlying intent. It suggests that you're really hoping for
> > > D3cold, but that's only a possibility if firmware supports it, and we
> > > have no visibility into that here.
> > >
> >
> > I'm not relying on firmware to do anything here. If firmware has to decide the
> > suspend state, it should already satisfy the pm_suspend_via_firmware() check in
> > nvme_suspend(). Here, the controller driver takes care of putting the device
> > into D3Cold. Currently, the controller drivers cannot do it (on DT platforms)
> > because of NVMe driver's behavior.
> >
> > > It also seems contrary to the earlier comment that suggests we prefer
> > > host managed nvme power settings:
> > >
> > > * The platform does not remove power for a kernel managed suspend so
> > > * use host managed nvme power settings for lowest idle power if
> > > * possible. This should have quicker resume latency than a full device
> > > * shutdown. But if the firmware is involved after the suspend or the
> > > * device does not support any non-default power states, shut down the
> > > * device fully.
> >
> > This above comment satisfies the ACPI platforms as the firmware controls the
> > suspend state. On DT platforms, even though the firmware takes care of suspend
> > state, it still relies on the controller driver to relinquish the votes for PCI
> > resources. Only then, the firmware will put the whole SoC in power down mode
> > a.k.a CX power collapse mode in Qcom SoCs.
> >
> > We did attempt so solve this problem in multiple ways, but the lesson learned
> > was, kernel cannot decide the power mode without help from userspace. That's the
> > reason I wanted to make use of this 'd3cold_allowed' sysfs attribute to allow
> > userspace to override the D3Cold if it wants based on platform requirement.
> >
>
> What do you mean "without help from userspace". Which entity in
> userspace is going to help make this decision?
>
The entities controlling the similar UFS 'spm_lvl' attributes (Android userspace
is one such example). Right now, most of the platforms (not just Qcom) require
the PCIe devices to be powered down during system suspend. So no userspace
intervention is required. But if someone puts the same chipsets on Android
form factor devices, then they need to teach the Android userspace to control
this attribute similar to how it is done for UFS.
> > This is similar to how UFS allows users to configure power states of both the
> > device and controller:
> >
>
> "Allow users to configure" is an optimization, is the purpose of this
> patch to allow similar kind of optimization? Or is it supposed to allow
> userspace to help solve case #a above (hardware doesn't survive CX
> collapse)?
>
Both as explained above.
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-11-23 9:01 ` Manivannan Sadhasivam
2024-11-26 17:11 ` Bjorn Andersson
@ 2024-12-05 23:29 ` Bjorn Helgaas
2024-12-06 1:49 ` Bjorn Helgaas
2024-12-09 14:57 ` Manivannan Sadhasivam
1 sibling, 2 replies; 47+ messages in thread
From: Bjorn Helgaas @ 2024-12-05 23:29 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: kbusch, axboe, hch, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Rafael J. Wysocki, Ulf Hansson
On Sat, Nov 23, 2024 at 02:31:13PM +0530, Manivannan Sadhasivam wrote:
> On Fri, Nov 22, 2024 at 04:20:50PM -0600, Bjorn Helgaas wrote:
> > On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> > > PCI core allows users to configure the D3Cold state for each PCI
> > > device through the sysfs attribute
> > > '/sys/bus/pci/devices/.../d3cold_allowed'. This attribute sets
> > > the 'pci_dev:d3cold_allowed' flag and could be used by users to
> > > allow/disallow the PCI devices to enter D3Cold during system
> > > suspend.
> > >
> > > So make use of this flag in the NVMe driver to shutdown the NVMe
> > > device during system suspend if the user has allowed D3Cold for
> > > the device. Existing checks in the NVMe driver decide whether
> > > to shut down the device (based on platform/device limitations),
> > > so use this flag as the last resort to keep the existing
> > > behavior.
> > >
> > > The default behavior of the 'pci_dev:d3cold_allowed' flag is to
> > > allow D3Cold and the users can disallow it through sysfs if they
> > > want.
> >
> > What problem does this solve? I guess there must be a case where
> > suspend leaves NVMe in a higher power state than you want?
>
> Yeah, this is the case for all systems that doesn't fit into the
> existing checks in the NVMe suspend path:
>
> 1. ACPI based platforms
> 2. Controller doesn't support NPSS (hardware issue/limitation)
> 3. ASPM not enabled
> 4. Devices/systems setting NVME_QUIRK_SIMPLE_SUSPEND flag
>
> In my case, all the Qualcomm SoCs using Devicetree doesn't fall into
> the above checks. Hence, they were not fully powered down during
> system suspend and always in low power state. This means, I cannot
> achieve 'CX power collapse', a Qualcomm specific SoC powered down
> state that consumes just enough power to wake up the SoC. Since the
> controller driver keeps the PCI resource vote because of NVMe, the
> firmware in the Qualcomm SoCs cannot put the SoC into above
> mentioned low power state.
IIUC nvme_suspend() has two paths:
- Do nvme_disable_prepare_reset() without calling pci_save_state(),
so the PCI core chooses and sets the low-power state.
- Put the device in an NVMe-specific low-power state and call
pci_save_state(), which prevents the PCI core from putting the
device in a low-power state.
(The PCI core part is in pci_pm_suspend_noirq(),
pci_pm_poweroff_noirq(), pci_pm_runtime_suspend())
And I guess you want the first path for basically all systems? The
only systems that would use the NVMe-specific path are those where:
- !pm_suspend_via_firmware() (not an ACPI system), AND
- ctrl->npss (device supports NVMe power states), AND
- pcie_aspm_enabled(), AND
- !NVME_QUIRK_SIMPLE_SUSPEND (it's not something with a weird
quirk), AND
- !pdev->d3cold_allowed (user has cleared it via sysfs)
This frankly seems almost unintelligible to me, so I'm glad I'm not
responsible for nvme :)
> > I'm not sure the use of pdev->d3cold_allowed here really expresses
> > your underlying intent. It suggests that you're really hoping for
> > D3cold, but that's only a possibility if firmware supports it, and
> > we have no visibility into that here.
>
> I'm not relying on firmware to do anything here. If firmware has to
> decide the suspend state, it should already satisfy the
> pm_suspend_via_firmware() check in nvme_suspend(). ...
I'm confused about this because we want to use PCI core power
management, which chooses the new state with pci_target_state(),
which looks like it will choose D3hot unless we're on an ACPI system
and acpi_pci_choose_state() returns D3cold.
But your system is not an ACPI system, so we should get D3hot, but yet
you decide based on D3*cold* being allowed?
> Here, the controller driver takes care of putting the device into
> D3Cold. Currently, the controller drivers cannot do it (on DT
> platforms) because of NVMe driver's behavior.
I'm missing the connection to the controller driver (I assume you mean
qcom-pcie?). Maybe it's that having the NVMe device in a PCI
low-power state allows qcom_pcie_suspend_noirq() to reduce the ICC
bandwidth vote and do other power-saving things? And it can't do
those things if we're using the NVMe low-power state because that
state is not visible to qcom-pcie?
> We did attempt to solve this problem in multiple ways, but the
> lesson learned was, kernel cannot decide the power mode without help
> from userspace. That's the reason I wanted to make use of this
> 'd3cold_allowed' sysfs attribute to allow userspace to override the
> D3Cold if it wants based on platform requirement.
It seems sub-optimal that this only works how you want if the user
intervenes.
Bjorn
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-05 23:29 ` Bjorn Helgaas
@ 2024-12-06 1:49 ` Bjorn Helgaas
2024-12-09 13:36 ` Christoph Hellwig
2024-12-09 14:43 ` Manivannan Sadhasivam
2024-12-09 14:57 ` Manivannan Sadhasivam
1 sibling, 2 replies; 47+ messages in thread
From: Bjorn Helgaas @ 2024-12-06 1:49 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: kbusch, axboe, hch, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Rafael J. Wysocki, Ulf Hansson
On Thu, Dec 05, 2024 at 05:29:00PM -0600, Bjorn Helgaas wrote:
> On Sat, Nov 23, 2024 at 02:31:13PM +0530, Manivannan Sadhasivam wrote:
> > On Fri, Nov 22, 2024 at 04:20:50PM -0600, Bjorn Helgaas wrote:
> > > On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> > > > PCI core allows users to configure the D3Cold state for each PCI
> > > > device through the sysfs attribute
> > > > '/sys/bus/pci/devices/.../d3cold_allowed'. This attribute sets
> > > > the 'pci_dev:d3cold_allowed' flag and could be used by users to
> > > > allow/disallow the PCI devices to enter D3Cold during system
> > > > suspend.
> ...
> > We did attempt to solve this problem in multiple ways, but the
> > lesson learned was, kernel cannot decide the power mode without help
> > from userspace. That's the reason I wanted to make use of this
> > 'd3cold_allowed' sysfs attribute to allow userspace to override the
> > D3Cold if it wants based on platform requirement.
>
> It seems sub-optimal that this only works how you want if the user
> intervenes.
Oops, I think I got this part backwards. The patch uses PCI PM if
d3cold_allowed is set, and it's set by default, so it does what you
need for the Qualcomm platform *without* user intervention.
But I guess using the flag allows users in other situations to force
use of NVMe power management by clearing d3cold_allowed via sysfs.
Does that mean some unspecified other platforms might only work
correctly with that user intervention?
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-06 1:49 ` Bjorn Helgaas
@ 2024-12-09 13:36 ` Christoph Hellwig
2024-12-09 14:38 ` Manivannan Sadhasivam
2024-12-09 14:43 ` Manivannan Sadhasivam
1 sibling, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-12-09 13:36 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Manivannan Sadhasivam, kbusch, axboe, hch, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio,
Rafael J. Wysocki, Ulf Hansson
On Thu, Dec 05, 2024 at 07:49:34PM -0600, Bjorn Helgaas wrote:
> Oops, I think I got this part backwards. The patch uses PCI PM if
> d3cold_allowed is set, and it's set by default, so it does what you
> need for the Qualcomm platform *without* user intervention.
>
> But I guess using the flag allows users in other situations to force
> use of NVMe power management by clearing d3cold_allowed via sysfs.
> Does that mean some unspecified other platforms might only work
> correctly with that user intervention?
Still seems awkward to overload fields like this.
The istory here is the the NVMe internal power states are significantly
better for the SSDs. It avoid shutting down the SSD frequently, which
creates a lot of extra erase cycles and reduces life time. It also
prevents the SSD from performing maintainance operations while the host
system is idle, which is the perfect time for them. But the idea of
putting all periphals into D3 is gaining a lot of ground because it
makes the platform vendors life a lot simpler at the cost of others.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-09 13:36 ` Christoph Hellwig
@ 2024-12-09 14:38 ` Manivannan Sadhasivam
2024-12-12 5:59 ` Christoph Hellwig
0 siblings, 1 reply; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-09 14:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme, linux-kernel,
linux-pci, andersson, konradybcio, Rafael J. Wysocki, Ulf Hansson
On Mon, Dec 09, 2024 at 02:36:06PM +0100, Christoph Hellwig wrote:
> On Thu, Dec 05, 2024 at 07:49:34PM -0600, Bjorn Helgaas wrote:
> > Oops, I think I got this part backwards. The patch uses PCI PM if
> > d3cold_allowed is set, and it's set by default, so it does what you
> > need for the Qualcomm platform *without* user intervention.
> >
> > But I guess using the flag allows users in other situations to force
> > use of NVMe power management by clearing d3cold_allowed via sysfs.
> > Does that mean some unspecified other platforms might only work
> > correctly with that user intervention?
>
> Still seems awkward to overload fields like this.
>
> The istory here is the the NVMe internal power states are significantly
> better for the SSDs. It avoid shutting down the SSD frequently, which
> creates a lot of extra erase cycles and reduces life time. It also
> prevents the SSD from performing maintainance operations while the host
> system is idle, which is the perfect time for them. But the idea of
> putting all periphals into D3 is gaining a lot of ground because it
> makes the platform vendors life a lot simpler at the cost of others.
No, I disagree with the last comment. When the system goes to low power mode
(like S2R/hibernate), it *does* makes a lot of sense to put the devices into
D3Cold to save power. Using NVMe or other endpoint devices internal power states
only matters when the system is idle (which is not S2R/hibernate). This is what
ACPI is doing currently.
Then one might suggest to check the suspend state using
'pm_suspend_target_state' in device drivers and decide to keep the devices in
D3Cold. Unfortunately, it won't work for Qcom platforms as most of them (except
chromebooks) don't have S2R (a.k.a PSCI_SYSTEM_SUSPEND), but only S2Idle.
When it comes to non-Qcom platforms based on devicetree, they also cannot power
down the NVMe device during suspend (as they cannot satisfy existing checks in
nvme_suspend()).
The current reality is that most of the devicetree platforms *do* want to power
down the NVMe during suspend. Only when NVMe is used in an OS like Android, we
might not want to do so (that's something for future, but still a possibility).
So this is how I ended up using the existing 'd3cold_allowed' attribute as it
allows the devices to be powered down by default and if the OS doesn't want to
do so, it can tweak the attribute from userspace (similar to UFS in Android).
The flexibility of this attribute is that it just acts as a hint for the device
drivers. If the device driver doesn't want to do D3Cold (when used as a wakeup
source etc...) it can still opt out (assuming that the platform would also honor
the wakeup capability of the device).
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-06 1:49 ` Bjorn Helgaas
2024-12-09 13:36 ` Christoph Hellwig
@ 2024-12-09 14:43 ` Manivannan Sadhasivam
1 sibling, 0 replies; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-09 14:43 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: kbusch, axboe, hch, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Rafael J. Wysocki, Ulf Hansson
On Thu, Dec 05, 2024 at 07:49:34PM -0600, Bjorn Helgaas wrote:
> On Thu, Dec 05, 2024 at 05:29:00PM -0600, Bjorn Helgaas wrote:
> > On Sat, Nov 23, 2024 at 02:31:13PM +0530, Manivannan Sadhasivam wrote:
> > > On Fri, Nov 22, 2024 at 04:20:50PM -0600, Bjorn Helgaas wrote:
> > > > On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> > > > > PCI core allows users to configure the D3Cold state for each PCI
> > > > > device through the sysfs attribute
> > > > > '/sys/bus/pci/devices/.../d3cold_allowed'. This attribute sets
> > > > > the 'pci_dev:d3cold_allowed' flag and could be used by users to
> > > > > allow/disallow the PCI devices to enter D3Cold during system
> > > > > suspend.
> > ...
>
> > > We did attempt to solve this problem in multiple ways, but the
> > > lesson learned was, kernel cannot decide the power mode without help
> > > from userspace. That's the reason I wanted to make use of this
> > > 'd3cold_allowed' sysfs attribute to allow userspace to override the
> > > D3Cold if it wants based on platform requirement.
> >
> > It seems sub-optimal that this only works how you want if the user
> > intervenes.
>
> Oops, I think I got this part backwards. The patch uses PCI PM if
> d3cold_allowed is set, and it's set by default, so it does what you
> need for the Qualcomm platform *without* user intervention.
>
> But I guess using the flag allows users in other situations to force
> use of NVMe power management by clearing d3cold_allowed via sysfs.
> Does that mean some unspecified other platforms might only work
> correctly with that user intervention?
The 'unspecified platforms' would be Android devices making use of NVMe, which
is not a reality now but a possibility in the future. But even when it happens,
it won't be a problem for Android as it has userspace knobs to control the power
state of the storage devices (as like how it controls UFS power states today).
UFS Reference:
'/sys/bus/platform/devices/*.ufs/spm_lvl' documented in:
Documentation/ABI/testing/sysfs-driver-ufs
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-05 23:29 ` Bjorn Helgaas
2024-12-06 1:49 ` Bjorn Helgaas
@ 2024-12-09 14:57 ` Manivannan Sadhasivam
1 sibling, 0 replies; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-09 14:57 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: kbusch, axboe, hch, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Rafael J. Wysocki, Ulf Hansson
On Thu, Dec 05, 2024 at 05:29:00PM -0600, Bjorn Helgaas wrote:
> On Sat, Nov 23, 2024 at 02:31:13PM +0530, Manivannan Sadhasivam wrote:
> > On Fri, Nov 22, 2024 at 04:20:50PM -0600, Bjorn Helgaas wrote:
> > > On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> > > > PCI core allows users to configure the D3Cold state for each PCI
> > > > device through the sysfs attribute
> > > > '/sys/bus/pci/devices/.../d3cold_allowed'. This attribute sets
> > > > the 'pci_dev:d3cold_allowed' flag and could be used by users to
> > > > allow/disallow the PCI devices to enter D3Cold during system
> > > > suspend.
> > > >
> > > > So make use of this flag in the NVMe driver to shutdown the NVMe
> > > > device during system suspend if the user has allowed D3Cold for
> > > > the device. Existing checks in the NVMe driver decide whether
> > > > to shut down the device (based on platform/device limitations),
> > > > so use this flag as the last resort to keep the existing
> > > > behavior.
> > > >
> > > > The default behavior of the 'pci_dev:d3cold_allowed' flag is to
> > > > allow D3Cold and the users can disallow it through sysfs if they
> > > > want.
> > >
> > > What problem does this solve? I guess there must be a case where
> > > suspend leaves NVMe in a higher power state than you want?
> >
> > Yeah, this is the case for all systems that doesn't fit into the
> > existing checks in the NVMe suspend path:
> >
> > 1. ACPI based platforms
> > 2. Controller doesn't support NPSS (hardware issue/limitation)
> > 3. ASPM not enabled
> > 4. Devices/systems setting NVME_QUIRK_SIMPLE_SUSPEND flag
> >
> > In my case, all the Qualcomm SoCs using Devicetree doesn't fall into
> > the above checks. Hence, they were not fully powered down during
> > system suspend and always in low power state. This means, I cannot
> > achieve 'CX power collapse', a Qualcomm specific SoC powered down
> > state that consumes just enough power to wake up the SoC. Since the
> > controller driver keeps the PCI resource vote because of NVMe, the
> > firmware in the Qualcomm SoCs cannot put the SoC into above
> > mentioned low power state.
>
> IIUC nvme_suspend() has two paths:
>
> - Do nvme_disable_prepare_reset() without calling pci_save_state(),
> so the PCI core chooses and sets the low-power state.
>
> - Put the device in an NVMe-specific low-power state and call
> pci_save_state(), which prevents the PCI core from putting the
> device in a low-power state.
>
> (The PCI core part is in pci_pm_suspend_noirq(),
> pci_pm_poweroff_noirq(), pci_pm_runtime_suspend())
>
> And I guess you want the first path for basically all systems? The
> only systems that would use the NVMe-specific path are those where:
>
> - !pm_suspend_via_firmware() (not an ACPI system), AND
>
> - ctrl->npss (device supports NVMe power states), AND
>
> - pcie_aspm_enabled(), AND
>
> - !NVME_QUIRK_SIMPLE_SUSPEND (it's not something with a weird
> quirk), AND
>
> - !pdev->d3cold_allowed (user has cleared it via sysfs)
>
> This frankly seems almost unintelligible to me, so I'm glad I'm not
> responsible for nvme :)
>
I agree that using the attribute is not a great idea in the NVMe driver where
there are already a handful of quirks/checks, but that seems unavoidable.
> > > I'm not sure the use of pdev->d3cold_allowed here really expresses
> > > your underlying intent. It suggests that you're really hoping for
> > > D3cold, but that's only a possibility if firmware supports it, and
> > > we have no visibility into that here.
> >
> > I'm not relying on firmware to do anything here. If firmware has to
> > decide the suspend state, it should already satisfy the
> > pm_suspend_via_firmware() check in nvme_suspend(). ...
>
> I'm confused about this because we want to use PCI core power
> management, which chooses the new state with pci_target_state(),
> which looks like it will choose D3hot unless we're on an ACPI system
> and acpi_pci_choose_state() returns D3cold.
>
> But your system is not an ACPI system, so we should get D3hot, but yet
> you decide based on D3*cold* being allowed?
>
This is an existing ungliness of DT platforms. D3Cold is not tied to the PCI
core, but the controller drivers decide on their own.
> > Here, the controller driver takes care of putting the device into
> > D3Cold. Currently, the controller drivers cannot do it (on DT
> > platforms) because of NVMe driver's behavior.
>
> I'm missing the connection to the controller driver (I assume you mean
> qcom-pcie?). Maybe it's that having the NVMe device in a PCI
> low-power state allows qcom_pcie_suspend_noirq() to reduce the ICC
> bandwidth vote and do other power-saving things? And it can't do
> those things if we're using the NVMe low-power state because that
> state is not visible to qcom-pcie?
>
Yes. In DT platforms, peripheral power state is not controlled by the firmware
to some extent. For PCIe, the controller driver is responsible for handling the
endpoint devices power state. As you referred qcom-pcie driver, it currently has
the 1 Kbps ICC vote in qcom_pcie_suspend_noirq() just because we cannot fully
remove the ICC vote due to NVMe. Because of this, even if NVMe is in its optimal
low power state, the SoC cannot enter its own low power state. Plus it doesn't
make a lot of sense to keep NVMe in low power mode even if you suspend your DT
based laptop for hours.
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-09 14:38 ` Manivannan Sadhasivam
@ 2024-12-12 5:59 ` Christoph Hellwig
2024-12-12 12:21 ` Rafael J. Wysocki
0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-12-12 5:59 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme, linux-kernel,
linux-pci, andersson, konradybcio, Rafael J. Wysocki, Ulf Hansson,
Len Brown, linux-pm
On Mon, Dec 09, 2024 at 08:08:21PM +0530, Manivannan Sadhasivam wrote:
> >
> > The istory here is the the NVMe internal power states are significantly
> > better for the SSDs. It avoid shutting down the SSD frequently, which
> > creates a lot of extra erase cycles and reduces life time. It also
> > prevents the SSD from performing maintainance operations while the host
> > system is idle, which is the perfect time for them. But the idea of
> > putting all periphals into D3 is gaining a lot of ground because it
> > makes the platform vendors life a lot simpler at the cost of others.
>
> No, I disagree with the last comment. When the system goes to low power mode
> (like S2R/hibernate), it *does* makes a lot of sense to put the devices into
> D3Cold to save power.
Yes. That's what the pm_suspend_via_firmware call in nvme_suspend is
supposed to catch. If that is not the right way to check for a
non-runtime suspend we'll need to improve the interface. Which really
are a mess, and the last thing I want is more "intelligence" in the
drivers. We need go information from the PM core what is going on so
that things work out of the box. Overloading obscure sysfs files with
new logic queried from a driver is a complete no-go.
> The current reality is that most of the devicetree platforms *do* want to power
> down the NVMe during suspend. Only when NVMe is used in an OS like Android, we
> might not want to do so (that's something for future, but still a possibility).
So fix the bloody interface instead of piling up hacks in drivers.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-12 5:59 ` Christoph Hellwig
@ 2024-12-12 12:21 ` Rafael J. Wysocki
2024-12-12 12:49 ` Ulf Hansson
0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2024-12-12 12:21 UTC (permalink / raw)
To: Manivannan Sadhasivam, Christoph Hellwig
Cc: Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme, linux-kernel,
linux-pci, andersson, konradybcio, Ulf Hansson, Len Brown,
linux-pm, Rafael J. Wysocki
On Thursday, December 12, 2024 6:59:20 AM CET Christoph Hellwig wrote:
> On Mon, Dec 09, 2024 at 08:08:21PM +0530, Manivannan Sadhasivam wrote:
> > >
> > > The istory here is the the NVMe internal power states are significantly
> > > better for the SSDs. It avoid shutting down the SSD frequently, which
> > > creates a lot of extra erase cycles and reduces life time. It also
> > > prevents the SSD from performing maintainance operations while the host
> > > system is idle, which is the perfect time for them. But the idea of
> > > putting all periphals into D3 is gaining a lot of ground because it
> > > makes the platform vendors life a lot simpler at the cost of others.
> >
> > No, I disagree with the last comment. When the system goes to low power mode
> > (like S2R/hibernate), it *does* makes a lot of sense to put the devices into
> > D3Cold to save power.
>
> Yes. That's what the pm_suspend_via_firmware call in nvme_suspend is
> supposed to catch.
pm_suspend_via_firmware() is to distinguish different flavors of system
suspend.
For runtime-suspend, there are different callbacks, but it doesn't look
like they are used by NVMe. At least nvme_dev_pm_ops doesn't set them.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-12 12:21 ` Rafael J. Wysocki
@ 2024-12-12 12:49 ` Ulf Hansson
2024-12-12 15:13 ` Christoph Hellwig
0 siblings, 1 reply; 47+ messages in thread
From: Ulf Hansson @ 2024-12-12 12:49 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Manivannan Sadhasivam, Christoph Hellwig, Bjorn Helgaas, kbusch,
axboe, sagi, linux-nvme, linux-kernel, linux-pci, andersson,
konradybcio, Len Brown, linux-pm, Rafael J. Wysocki
On Thu, 12 Dec 2024 at 13:21, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> On Thursday, December 12, 2024 6:59:20 AM CET Christoph Hellwig wrote:
> > On Mon, Dec 09, 2024 at 08:08:21PM +0530, Manivannan Sadhasivam wrote:
> > > >
> > > > The istory here is the the NVMe internal power states are significantly
> > > > better for the SSDs. It avoid shutting down the SSD frequently, which
> > > > creates a lot of extra erase cycles and reduces life time. It also
> > > > prevents the SSD from performing maintainance operations while the host
> > > > system is idle, which is the perfect time for them. But the idea of
> > > > putting all periphals into D3 is gaining a lot of ground because it
> > > > makes the platform vendors life a lot simpler at the cost of others.
> > >
> > > No, I disagree with the last comment. When the system goes to low power mode
> > > (like S2R/hibernate), it *does* makes a lot of sense to put the devices into
> > > D3Cold to save power.
> >
> > Yes. That's what the pm_suspend_via_firmware call in nvme_suspend is
> > supposed to catch.
>
> pm_suspend_via_firmware() is to distinguish different flavors of system
> suspend.
Right. This seems to somewhat work for ACPI types of systems, because
ACPI is controlling the low power state for all the devices. Based on
the requested system wide low power state, ACPI can then decide to
call pm_set_suspend_via_firmware() or not.
Still there is a problem with this for ACPI too.
How does ACPI know whether it's actually a good idea to keep the NVMe
storage powered in s2idle (ACPI calls pm_set_suspend_via_firmware()
only for S2R and S2disk!?)? Especially when my laptop only supports
s2idle and that's what I will use when I close the lid. In this way,
the NMVe storage will certainly contribute to draining the battery,
especially when I won't be using my laptop for a couple of days.
In my opinion, we need a better approach that is both flexible and
that dynamically adjusts based upon the use case.
[...]
Kind regards
Uffe
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-12 12:49 ` Ulf Hansson
@ 2024-12-12 15:13 ` Christoph Hellwig
2024-12-13 14:35 ` Rafael J. Wysocki
0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-12-12 15:13 UTC (permalink / raw)
To: Ulf Hansson
Cc: Rafael J. Wysocki, Manivannan Sadhasivam, Christoph Hellwig,
Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme, linux-kernel,
linux-pci, andersson, konradybcio, Len Brown, linux-pm,
Rafael J. Wysocki
On Thu, Dec 12, 2024 at 01:49:15PM +0100, Ulf Hansson wrote:
> Right. This seems to somewhat work for ACPI types of systems, because
> ACPI is controlling the low power state for all the devices. Based on
> the requested system wide low power state, ACPI can then decide to
> call pm_set_suspend_via_firmware() or not.
>
> Still there is a problem with this for ACPI too.
>
> How does ACPI know whether it's actually a good idea to keep the NVMe
> storage powered in s2idle (ACPI calls pm_set_suspend_via_firmware()
> only for S2R and S2disk!?)? Especially when my laptop only supports
> s2idle and that's what I will use when I close the lid. In this way,
> the NMVe storage will certainly contribute to draining the battery,
> especially when I won't be using my laptop for a couple of days.
>
> In my opinion, we need a better approach that is both flexible and
> that dynamically adjusts based upon the use case.
Agreed. I'd be happy to work with the PM maintainers to do this,
but I don't really know enough about the PM core to drive it
(as the reply from Rafael to my mail makes pretty clear :))
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-12 15:13 ` Christoph Hellwig
@ 2024-12-13 14:35 ` Rafael J. Wysocki
2024-12-14 6:30 ` Manivannan Sadhasivam
0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2024-12-13 14:35 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ulf Hansson, Rafael J. Wysocki, Manivannan Sadhasivam,
Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme, linux-kernel,
linux-pci, andersson, konradybcio, Len Brown, linux-pm,
Rafael J. Wysocki
On Thu, Dec 12, 2024 at 4:14 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, Dec 12, 2024 at 01:49:15PM +0100, Ulf Hansson wrote:
> > Right. This seems to somewhat work for ACPI types of systems, because
> > ACPI is controlling the low power state for all the devices. Based on
> > the requested system wide low power state, ACPI can then decide to
> > call pm_set_suspend_via_firmware() or not.
> >
> > Still there is a problem with this for ACPI too.
> >
> > How does ACPI know whether it's actually a good idea to keep the NVMe
> > storage powered in s2idle (ACPI calls pm_set_suspend_via_firmware()
> > only for S2R and S2disk!?)? Especially when my laptop only supports
> > s2idle and that's what I will use when I close the lid. In this way,
> > the NMVe storage will certainly contribute to draining the battery,
> > especially when I won't be using my laptop for a couple of days.
> >
> > In my opinion, we need a better approach that is both flexible and
> > that dynamically adjusts based upon the use case.
>
> Agreed. I'd be happy to work with the PM maintainers to do this,
> but I don't really know enough about the PM core to drive it
> (as the reply from Rafael to my mail makes pretty clear :))
I'm here to help.
Let me know what exactly you want to achieve and we'll see how to make it work.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-13 14:35 ` Rafael J. Wysocki
@ 2024-12-14 6:30 ` Manivannan Sadhasivam
2024-12-16 16:23 ` Christoph Hellwig
2024-12-16 16:24 ` Rafael J. Wysocki
0 siblings, 2 replies; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-14 6:30 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Christoph Hellwig, Ulf Hansson, Rafael J. Wysocki, Bjorn Helgaas,
kbusch, axboe, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Len Brown, linux-pm
On Fri, Dec 13, 2024 at 03:35:15PM +0100, Rafael J. Wysocki wrote:
> On Thu, Dec 12, 2024 at 4:14 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Thu, Dec 12, 2024 at 01:49:15PM +0100, Ulf Hansson wrote:
> > > Right. This seems to somewhat work for ACPI types of systems, because
> > > ACPI is controlling the low power state for all the devices. Based on
> > > the requested system wide low power state, ACPI can then decide to
> > > call pm_set_suspend_via_firmware() or not.
> > >
> > > Still there is a problem with this for ACPI too.
> > >
> > > How does ACPI know whether it's actually a good idea to keep the NVMe
> > > storage powered in s2idle (ACPI calls pm_set_suspend_via_firmware()
> > > only for S2R and S2disk!?)? Especially when my laptop only supports
> > > s2idle and that's what I will use when I close the lid. In this way,
> > > the NMVe storage will certainly contribute to draining the battery,
> > > especially when I won't be using my laptop for a couple of days.
> > >
> > > In my opinion, we need a better approach that is both flexible and
> > > that dynamically adjusts based upon the use case.
> >
> > Agreed. I'd be happy to work with the PM maintainers to do this,
> > but I don't really know enough about the PM core to drive it
> > (as the reply from Rafael to my mail makes pretty clear :))
>
> I'm here to help.
>
> Let me know what exactly you want to achieve and we'll see how to make it work.
I'll try to summarize the requirement here since I started this thread:
Problem statement
=================
We need a PM core API that tells the device drivers when it is safe to powerdown
the devices. The usecase here is with PCIe based NVMe devices but the problem is
applicable to other devices as well.
Drivers are relying on couple of options now:
1. If pm_suspend_via_firmware() returns true, then drivers will shutdown the
device assuming that the firmware is going to handle the suspend. But this API
is currently used only by ACPI. Even there, ACPI relies on S2R being supported
by the platform and it sets pm_set_suspend_via_firmware() only when the suspend
is S2R. But if the platform doesn't support S2R (current case of most of the
Qcom SoCs), then pm_suspend_via_firmware() will return false and NVMe won't be
powered down draining the battery.
If the platform is using DT, then there is no entity setting
pm_set_suspend_via_firmware(). So NVMe will be kept in low power state all the
time (still draining the battery). There were attempts to set this flag from
PSCI [1], but there were objections on setting this flag when PSCI_SUSPEND is
not supported by the platform (again, the case with Qcom SoCs). Even if this
approach succeeds, then there are concerns that if the platform is used in an
OS like Android where the S2Idle cycle is far more high, NVMe will wear out
very quickly. So this is where the forthcoming API need to "dynamically adjusts
based upon the use case" as quoted by Ulf in his previous reply. One way to
achieve would be by giving the flexibility to the userspace to choose the
suspend state (if platform has options to select). UFS does something similar
with 'spm_lvl' [2] sysfs attribute that I believe Android userspace itself makes
use of.
2. Making use of pm_suspend_target_state to differentiate between suspend states
and powering down the devices only during PM_SUSPEND_MEM (S2R). But this also
suffers from the same issue as mentioned above (when platform doesn't support
S2R).
TLDR: We need a PM core API that that sets a sane default suspend state for the
platform and also allows dynamically changing/overriding the state (by taking
inputs from userspace etc...). This API should also forbid override, if the
platform has limitations (like if it requires powering down the devices all the
time (x13s laptops)). Finally, this API would be used by the device drivers to
decide when to safely power down the devices during suspend.
@Ulf/others: Please chime in if you don't agree with anything I said above.
- Mani
[1] https://lore.kernel.org/all/20241028-topic-cpu_suspend_s2ram-v1-0-9fdd9a04b75c@oss.qualcomm.com
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/sysfs-driver-ufs#n1041
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-14 6:30 ` Manivannan Sadhasivam
@ 2024-12-16 16:23 ` Christoph Hellwig
2024-12-16 16:42 ` Rafael J. Wysocki
2024-12-16 16:24 ` Rafael J. Wysocki
1 sibling, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-12-16 16:23 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: Rafael J. Wysocki, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Sat, Dec 14, 2024 at 12:00:23PM +0530, Manivannan Sadhasivam wrote:
> We need a PM core API that tells the device drivers when it is safe to powerdown
> the devices. The usecase here is with PCIe based NVMe devices but the problem is
> applicable to other devices as well.
Maybe I'm misunderstanding things, but I think the important part is
to indicate when a suspend actually MUST put the device into D3. Because
doing that should always be safe, but not always optimal.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-14 6:30 ` Manivannan Sadhasivam
2024-12-16 16:23 ` Christoph Hellwig
@ 2024-12-16 16:24 ` Rafael J. Wysocki
2024-12-16 17:11 ` Manivannan Sadhasivam
1 sibling, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2024-12-16 16:24 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: Rafael J. Wysocki, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Sat, Dec 14, 2024 at 7:30 AM Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> On Fri, Dec 13, 2024 at 03:35:15PM +0100, Rafael J. Wysocki wrote:
> > On Thu, Dec 12, 2024 at 4:14 PM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > On Thu, Dec 12, 2024 at 01:49:15PM +0100, Ulf Hansson wrote:
> > > > Right. This seems to somewhat work for ACPI types of systems, because
> > > > ACPI is controlling the low power state for all the devices. Based on
> > > > the requested system wide low power state, ACPI can then decide to
> > > > call pm_set_suspend_via_firmware() or not.
> > > >
> > > > Still there is a problem with this for ACPI too.
> > > >
> > > > How does ACPI know whether it's actually a good idea to keep the NVMe
> > > > storage powered in s2idle (ACPI calls pm_set_suspend_via_firmware()
> > > > only for S2R and S2disk!?)? Especially when my laptop only supports
> > > > s2idle and that's what I will use when I close the lid. In this way,
> > > > the NMVe storage will certainly contribute to draining the battery,
> > > > especially when I won't be using my laptop for a couple of days.
> > > >
> > > > In my opinion, we need a better approach that is both flexible and
> > > > that dynamically adjusts based upon the use case.
> > >
> > > Agreed. I'd be happy to work with the PM maintainers to do this,
> > > but I don't really know enough about the PM core to drive it
> > > (as the reply from Rafael to my mail makes pretty clear :))
> >
> > I'm here to help.
> >
> > Let me know what exactly you want to achieve and we'll see how to make it work.
>
> I'll try to summarize the requirement here since I started this thread:
>
> Problem statement
> =================
>
> We need a PM core API that tells the device drivers when it is safe to powerdown
> the devices. The usecase here is with PCIe based NVMe devices but the problem is
> applicable to other devices as well.
>
> Drivers are relying on couple of options now:
>
> 1. If pm_suspend_via_firmware() returns true, then drivers will shutdown the
> device assuming that the firmware is going to handle the suspend. But this API
> is currently used only by ACPI. Even there, ACPI relies on S2R being supported
> by the platform and it sets pm_set_suspend_via_firmware() only when the suspend
> is S2R. But if the platform doesn't support S2R (current case of most of the
> Qcom SoCs), then pm_suspend_via_firmware() will return false and NVMe won't be
> powered down draining the battery.
So my question here would be why is it not powered down always during
system-wide suspend?
Why exactly is it necessary to distinguish one case from the other
(assuming that we are talking about system-wide suspend only)?
There are drivers that use pm_suspend_via_firmware() to check whether
or not something special needs to be done to the device because if
"false" is returned, the platform firmware is not going to remove
power from it.
However, you seem to be talking about the opposite, so doing something
special to the device if "true" is returned. I'm not sure why this is
necessary.
> If the platform is using DT, then there is no entity setting
> pm_set_suspend_via_firmware().
That's true and so the assumption is that in this case the handling of
all devices will always be the same regardless of which flavor of
system suspend is chosen by user space.
> So NVMe will be kept in low power state all the
> time (still draining the battery).
So what would be the problem with powering it down unconditionally?
> There were attempts to set this flag from
> PSCI [1], but there were objections on setting this flag when PSCI_SUSPEND is
> not supported by the platform (again, the case with Qcom SoCs). Even if this
> approach succeeds, then there are concerns that if the platform is used in an
> OS like Android where the S2Idle cycle is far more high, NVMe will wear out
> very quickly.
I see.
> So this is where the forthcoming API need to "dynamically adjusts
> based upon the use case" as quoted by Ulf in his previous reply. One way to
> achieve would be by giving the flexibility to the userspace to choose the
> suspend state (if platform has options to select). UFS does something similar
> with 'spm_lvl' [2] sysfs attribute that I believe Android userspace itself makes
> use of.
Before we're talking about APIs, let's talk about the desired behavior.
It looks like there are cases in which you'd want to turn the device
off completely (say put it into D3cold in the PCI terminology) and
there are cases in which you'd want it to stay in a somewhat-powered
low-power state.
It is unclear to me what they are at this point.
> 2. Making use of pm_suspend_target_state to differentiate between suspend states
> and powering down the devices only during PM_SUSPEND_MEM (S2R). But this also
> suffers from the same issue as mentioned above (when platform doesn't support
> S2R).
>
> TLDR: We need a PM core API that that sets a sane default suspend state for the
> platform and also allows dynamically changing/overriding the state (by taking
> inputs from userspace etc...). This API should also forbid override, if the
> platform has limitations (like if it requires powering down the devices all the
> time (x13s laptops)). Finally, this API would be used by the device drivers to
> decide when to safely power down the devices during suspend.
>
> @Ulf/others: Please chime in if you don't agree with anything I said above.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 16:23 ` Christoph Hellwig
@ 2024-12-16 16:42 ` Rafael J. Wysocki
2024-12-16 16:48 ` Manivannan Sadhasivam
2024-12-20 15:15 ` Konrad Dybcio
0 siblings, 2 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2024-12-16 16:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Manivannan Sadhasivam, Rafael J. Wysocki, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Mon, Dec 16, 2024 at 5:23 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Sat, Dec 14, 2024 at 12:00:23PM +0530, Manivannan Sadhasivam wrote:
> > We need a PM core API that tells the device drivers when it is safe to powerdown
> > the devices. The usecase here is with PCIe based NVMe devices but the problem is
> > applicable to other devices as well.
>
> Maybe I'm misunderstanding things, but I think the important part is
> to indicate when a suspend actually MUST put the device into D3. Because
> doing that should always be safe, but not always optimal.
I'm not aware of any cases when a device must be put into D3cold
(which I think is what you mean) during system-wide suspend.
Suspend-to-idle on x86 doesn't require this, at least not for
correctness. I don't think any platforms using DT require it either.
In theory, ACPI S3 or hibernation may request that, but I've never
seen it happen in practice.
Suspend-to-idle on x86 may want devices to end up in specific power
states in order to be able to switch the entire platform into a deep
energy-saving mode, but that's never been D3cold so far.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 16:42 ` Rafael J. Wysocki
@ 2024-12-16 16:48 ` Manivannan Sadhasivam
2024-12-16 17:28 ` Rafael J. Wysocki
2024-12-20 15:15 ` Konrad Dybcio
1 sibling, 1 reply; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-16 16:48 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Christoph Hellwig, Ulf Hansson, Rafael J. Wysocki, Bjorn Helgaas,
kbusch, axboe, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Len Brown, linux-pm
On Mon, Dec 16, 2024 at 05:42:30PM +0100, Rafael J. Wysocki wrote:
> On Mon, Dec 16, 2024 at 5:23 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Sat, Dec 14, 2024 at 12:00:23PM +0530, Manivannan Sadhasivam wrote:
> > > We need a PM core API that tells the device drivers when it is safe to powerdown
> > > the devices. The usecase here is with PCIe based NVMe devices but the problem is
> > > applicable to other devices as well.
> >
> > Maybe I'm misunderstanding things, but I think the important part is
> > to indicate when a suspend actually MUST put the device into D3. Because
> > doing that should always be safe, but not always optimal.
>
> I'm not aware of any cases when a device must be put into D3cold
> (which I think is what you mean) during system-wide suspend.
>
> Suspend-to-idle on x86 doesn't require this, at least not for
> correctness. I don't think any platforms using DT require it either.
>
On suspend-to-idle, yes D3Cold doesn't make sense, but on suspend-to-ram it is
pretty much required. That applies to DT as well.
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 16:24 ` Rafael J. Wysocki
@ 2024-12-16 17:11 ` Manivannan Sadhasivam
2024-12-16 17:35 ` Rafael J. Wysocki
0 siblings, 1 reply; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-16 17:11 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Christoph Hellwig, Ulf Hansson, Rafael J. Wysocki, Bjorn Helgaas,
kbusch, axboe, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Len Brown, linux-pm
On Mon, Dec 16, 2024 at 05:24:40PM +0100, Rafael J. Wysocki wrote:
> On Sat, Dec 14, 2024 at 7:30 AM Manivannan Sadhasivam
> <manivannan.sadhasivam@linaro.org> wrote:
> >
> > On Fri, Dec 13, 2024 at 03:35:15PM +0100, Rafael J. Wysocki wrote:
> > > On Thu, Dec 12, 2024 at 4:14 PM Christoph Hellwig <hch@lst.de> wrote:
> > > >
> > > > On Thu, Dec 12, 2024 at 01:49:15PM +0100, Ulf Hansson wrote:
> > > > > Right. This seems to somewhat work for ACPI types of systems, because
> > > > > ACPI is controlling the low power state for all the devices. Based on
> > > > > the requested system wide low power state, ACPI can then decide to
> > > > > call pm_set_suspend_via_firmware() or not.
> > > > >
> > > > > Still there is a problem with this for ACPI too.
> > > > >
> > > > > How does ACPI know whether it's actually a good idea to keep the NVMe
> > > > > storage powered in s2idle (ACPI calls pm_set_suspend_via_firmware()
> > > > > only for S2R and S2disk!?)? Especially when my laptop only supports
> > > > > s2idle and that's what I will use when I close the lid. In this way,
> > > > > the NMVe storage will certainly contribute to draining the battery,
> > > > > especially when I won't be using my laptop for a couple of days.
> > > > >
> > > > > In my opinion, we need a better approach that is both flexible and
> > > > > that dynamically adjusts based upon the use case.
> > > >
> > > > Agreed. I'd be happy to work with the PM maintainers to do this,
> > > > but I don't really know enough about the PM core to drive it
> > > > (as the reply from Rafael to my mail makes pretty clear :))
> > >
> > > I'm here to help.
> > >
> > > Let me know what exactly you want to achieve and we'll see how to make it work.
> >
> > I'll try to summarize the requirement here since I started this thread:
> >
> > Problem statement
> > =================
> >
> > We need a PM core API that tells the device drivers when it is safe to powerdown
> > the devices. The usecase here is with PCIe based NVMe devices but the problem is
> > applicable to other devices as well.
> >
> > Drivers are relying on couple of options now:
> >
> > 1. If pm_suspend_via_firmware() returns true, then drivers will shutdown the
> > device assuming that the firmware is going to handle the suspend. But this API
> > is currently used only by ACPI. Even there, ACPI relies on S2R being supported
> > by the platform and it sets pm_set_suspend_via_firmware() only when the suspend
> > is S2R. But if the platform doesn't support S2R (current case of most of the
> > Qcom SoCs), then pm_suspend_via_firmware() will return false and NVMe won't be
> > powered down draining the battery.
>
> So my question here would be why is it not powered down always during
> system-wide suspend?
>
> Why exactly is it necessary to distinguish one case from the other
> (assuming that we are talking about system-wide suspend only)?
>
To support Android like usecase with firmware that only supports
suspend-to-idle (Qcom platforms). This usecase is not applicable right now, but
one can't just rule out the possibility in the near future.
And the problem is that we need to support both Android and non-Android systems
with same firmware :/
> There are drivers that use pm_suspend_via_firmware() to check whether
> or not something special needs to be done to the device because if
> "false" is returned, the platform firmware is not going to remove
> power from it.
>
> However, you seem to be talking about the opposite, so doing something
> special to the device if "true" is returned. I'm not sure why this is
> necessary.
>
Because, since 'false' is returned, drivers like NVMe are assuming that the
platform won't remove power on all DT systems and they just keep the devices in
low power state (not powering them down). But why would I want my NVMe in DT
based laptop to be always powered in system suspend?
> > If the platform is using DT, then there is no entity setting
> > pm_set_suspend_via_firmware().
>
> That's true and so the assumption is that in this case the handling of
> all devices will always be the same regardless of which flavor of
> system suspend is chosen by user space.
>
Right and that's why the above concern is raised.
> > So NVMe will be kept in low power state all the
> > time (still draining the battery).
>
> So what would be the problem with powering it down unconditionally?
>
I'm not sure how would you do that (by checking dev_of_node()?). Even so, it
will wear out the NVMe devices if used in Android tablets etc...
> > There were attempts to set this flag from
> > PSCI [1], but there were objections on setting this flag when PSCI_SUSPEND is
> > not supported by the platform (again, the case with Qcom SoCs). Even if this
> > approach succeeds, then there are concerns that if the platform is used in an
> > OS like Android where the S2Idle cycle is far more high, NVMe will wear out
> > very quickly.
>
> I see.
>
> > So this is where the forthcoming API need to "dynamically adjusts
> > based upon the use case" as quoted by Ulf in his previous reply. One way to
> > achieve would be by giving the flexibility to the userspace to choose the
> > suspend state (if platform has options to select). UFS does something similar
> > with 'spm_lvl' [2] sysfs attribute that I believe Android userspace itself makes
> > use of.
>
> Before we're talking about APIs, let's talk about the desired behavior.
>
> It looks like there are cases in which you'd want to turn the device
> off completely (say put it into D3cold in the PCI terminology) and
> there are cases in which you'd want it to stay in a somewhat-powered
> low-power state.
>
> It is unclear to me what they are at this point.
>
I hope that my above explanation clarifies. Here is the short version of the
suspend requirement across platforms:
1. D3Cold (power down) - Laptops/Automotive
2. D3hot (low power) - Android Tablets
FWIW, I did receive feedback from people asking to just ignore the Android
usecase and always power down the devices for DT platforms. But I happen to
disagree with them. Let me know if I was wrong and I should not worry about
Android usecase as it is for the future.
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 16:48 ` Manivannan Sadhasivam
@ 2024-12-16 17:28 ` Rafael J. Wysocki
2024-12-16 17:39 ` Manivannan Sadhasivam
0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2024-12-16 17:28 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: Rafael J. Wysocki, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Mon, Dec 16, 2024 at 5:48 PM Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> On Mon, Dec 16, 2024 at 05:42:30PM +0100, Rafael J. Wysocki wrote:
> > On Mon, Dec 16, 2024 at 5:23 PM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > On Sat, Dec 14, 2024 at 12:00:23PM +0530, Manivannan Sadhasivam wrote:
> > > > We need a PM core API that tells the device drivers when it is safe to powerdown
> > > > the devices. The usecase here is with PCIe based NVMe devices but the problem is
> > > > applicable to other devices as well.
> > >
> > > Maybe I'm misunderstanding things, but I think the important part is
> > > to indicate when a suspend actually MUST put the device into D3. Because
> > > doing that should always be safe, but not always optimal.
> >
> > I'm not aware of any cases when a device must be put into D3cold
> > (which I think is what you mean) during system-wide suspend.
> >
> > Suspend-to-idle on x86 doesn't require this, at least not for
> > correctness. I don't think any platforms using DT require it either.
> >
>
> On suspend-to-idle, yes D3Cold doesn't make sense,
Why?
> but on suspend-to-ram it is pretty much required.
Well, I know for a fact that on x86 platforms ACPI S3 does not require
putting devices into D3cold in general.
Why is it required for NVMe?
> That applies to DT as well.
Again, why?
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 17:11 ` Manivannan Sadhasivam
@ 2024-12-16 17:35 ` Rafael J. Wysocki
2024-12-16 17:52 ` Manivannan Sadhasivam
0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2024-12-16 17:35 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: Rafael J. Wysocki, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Mon, Dec 16, 2024 at 6:11 PM Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> On Mon, Dec 16, 2024 at 05:24:40PM +0100, Rafael J. Wysocki wrote:
> > On Sat, Dec 14, 2024 at 7:30 AM Manivannan Sadhasivam
> > <manivannan.sadhasivam@linaro.org> wrote:
> > >
> > > On Fri, Dec 13, 2024 at 03:35:15PM +0100, Rafael J. Wysocki wrote:
> > > > On Thu, Dec 12, 2024 at 4:14 PM Christoph Hellwig <hch@lst.de> wrote:
> > > > >
> > > > > On Thu, Dec 12, 2024 at 01:49:15PM +0100, Ulf Hansson wrote:
> > > > > > Right. This seems to somewhat work for ACPI types of systems, because
> > > > > > ACPI is controlling the low power state for all the devices. Based on
> > > > > > the requested system wide low power state, ACPI can then decide to
> > > > > > call pm_set_suspend_via_firmware() or not.
> > > > > >
> > > > > > Still there is a problem with this for ACPI too.
> > > > > >
> > > > > > How does ACPI know whether it's actually a good idea to keep the NVMe
> > > > > > storage powered in s2idle (ACPI calls pm_set_suspend_via_firmware()
> > > > > > only for S2R and S2disk!?)? Especially when my laptop only supports
> > > > > > s2idle and that's what I will use when I close the lid. In this way,
> > > > > > the NMVe storage will certainly contribute to draining the battery,
> > > > > > especially when I won't be using my laptop for a couple of days.
> > > > > >
> > > > > > In my opinion, we need a better approach that is both flexible and
> > > > > > that dynamically adjusts based upon the use case.
> > > > >
> > > > > Agreed. I'd be happy to work with the PM maintainers to do this,
> > > > > but I don't really know enough about the PM core to drive it
> > > > > (as the reply from Rafael to my mail makes pretty clear :))
> > > >
> > > > I'm here to help.
> > > >
> > > > Let me know what exactly you want to achieve and we'll see how to make it work.
> > >
> > > I'll try to summarize the requirement here since I started this thread:
> > >
> > > Problem statement
> > > =================
> > >
> > > We need a PM core API that tells the device drivers when it is safe to powerdown
> > > the devices. The usecase here is with PCIe based NVMe devices but the problem is
> > > applicable to other devices as well.
> > >
> > > Drivers are relying on couple of options now:
> > >
> > > 1. If pm_suspend_via_firmware() returns true, then drivers will shutdown the
> > > device assuming that the firmware is going to handle the suspend. But this API
> > > is currently used only by ACPI. Even there, ACPI relies on S2R being supported
> > > by the platform and it sets pm_set_suspend_via_firmware() only when the suspend
> > > is S2R. But if the platform doesn't support S2R (current case of most of the
> > > Qcom SoCs), then pm_suspend_via_firmware() will return false and NVMe won't be
> > > powered down draining the battery.
> >
> > So my question here would be why is it not powered down always during
> > system-wide suspend?
> >
> > Why exactly is it necessary to distinguish one case from the other
> > (assuming that we are talking about system-wide suspend only)?
> >
>
> To support Android like usecase with firmware that only supports
> suspend-to-idle (Qcom platforms). This usecase is not applicable right now, but
> one can't just rule out the possibility in the near future.
This doesn't explain anything to me, sorry.
> And the problem is that we need to support both Android and non-Android systems
> with same firmware :/
So what technically is the problem?
> > There are drivers that use pm_suspend_via_firmware() to check whether
> > or not something special needs to be done to the device because if
> > "false" is returned, the platform firmware is not going to remove
> > power from it.
> >
> > However, you seem to be talking about the opposite, so doing something
> > special to the device if "true" is returned. I'm not sure why this is
> > necessary.
> >
>
> Because, since 'false' is returned, drivers like NVMe are assuming that the
> platform won't remove power on all DT systems and they just keep the devices in
> low power state (not powering them down). But why would I want my NVMe in DT
> based laptop to be always powered in system suspend?
Because it causes the system to use less energy when suspended.
> > > If the platform is using DT, then there is no entity setting
> > > pm_set_suspend_via_firmware().
> >
> > That's true and so the assumption is that in this case the handling of
> > all devices will always be the same regardless of which flavor of
> > system suspend is chosen by user space.
> >
>
> Right and that's why the above concern is raised.
And it is still unclear to me what the problem with it is.
What exactly can go wrong?
> > > So NVMe will be kept in low power state all the
> > > time (still draining the battery).
> >
> > So what would be the problem with powering it down unconditionally?
> >
>
> I'm not sure how would you do that (by checking dev_of_node()?). Even so, it
> will wear out the NVMe devices if used in Android tablets etc...
I understand the wear-out concern.
Is there anything else?
> > > There were attempts to set this flag from
> > > PSCI [1], but there were objections on setting this flag when PSCI_SUSPEND is
> > > not supported by the platform (again, the case with Qcom SoCs). Even if this
> > > approach succeeds, then there are concerns that if the platform is used in an
> > > OS like Android where the S2Idle cycle is far more high, NVMe will wear out
> > > very quickly.
> >
> > I see.
> >
> > > So this is where the forthcoming API need to "dynamically adjusts
> > > based upon the use case" as quoted by Ulf in his previous reply. One way to
> > > achieve would be by giving the flexibility to the userspace to choose the
> > > suspend state (if platform has options to select). UFS does something similar
> > > with 'spm_lvl' [2] sysfs attribute that I believe Android userspace itself makes
> > > use of.
> >
> > Before we're talking about APIs, let's talk about the desired behavior.
> >
> > It looks like there are cases in which you'd want to turn the device
> > off completely (say put it into D3cold in the PCI terminology) and
> > there are cases in which you'd want it to stay in a somewhat-powered
> > low-power state.
> >
> > It is unclear to me what they are at this point.
> >
>
> I hope that my above explanation clarifies.
Sorry, but not really.
> Here is the short version of the suspend requirement across platforms:
>
> 1. D3Cold (power down) - Laptops/Automotive
> 2. D3hot (low power) - Android Tablets
Where do the above requirements come from?
> FWIW, I did receive feedback from people asking to just ignore the Android
> usecase and always power down the devices for DT platforms. But I happen to
> disagree with them. Let me know if I was wrong and I should not worry about
> Android usecase as it is for the future.
I'm not sure what you mean by the "Android usecase" TBH. Do you mean
the wear-out concern in the Android usage scenario or is there more to
it?
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 17:28 ` Rafael J. Wysocki
@ 2024-12-16 17:39 ` Manivannan Sadhasivam
2024-12-16 19:10 ` Rafael J. Wysocki
0 siblings, 1 reply; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-16 17:39 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Christoph Hellwig, Ulf Hansson, Rafael J. Wysocki, Bjorn Helgaas,
kbusch, axboe, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Len Brown, linux-pm
On Mon, Dec 16, 2024 at 06:28:55PM +0100, Rafael J. Wysocki wrote:
> On Mon, Dec 16, 2024 at 5:48 PM Manivannan Sadhasivam
> <manivannan.sadhasivam@linaro.org> wrote:
> >
> > On Mon, Dec 16, 2024 at 05:42:30PM +0100, Rafael J. Wysocki wrote:
> > > On Mon, Dec 16, 2024 at 5:23 PM Christoph Hellwig <hch@lst.de> wrote:
> > > >
> > > > On Sat, Dec 14, 2024 at 12:00:23PM +0530, Manivannan Sadhasivam wrote:
> > > > > We need a PM core API that tells the device drivers when it is safe to powerdown
> > > > > the devices. The usecase here is with PCIe based NVMe devices but the problem is
> > > > > applicable to other devices as well.
> > > >
> > > > Maybe I'm misunderstanding things, but I think the important part is
> > > > to indicate when a suspend actually MUST put the device into D3. Because
> > > > doing that should always be safe, but not always optimal.
> > >
> > > I'm not aware of any cases when a device must be put into D3cold
> > > (which I think is what you mean) during system-wide suspend.
> > >
> > > Suspend-to-idle on x86 doesn't require this, at least not for
> > > correctness. I don't think any platforms using DT require it either.
> > >
> >
> > On suspend-to-idle, yes D3Cold doesn't make sense,
>
> Why?
>
Because there is no requirement to remove power during S2Idle, isn't it?
From Documentation/admin-guide/pm/sleep-states.rst:
'This is a generic, pure software, light-weight variant of system suspend'.
> > but on suspend-to-ram it is pretty much required.
>
> Well, I know for a fact that on x86 platforms ACPI S3 does not require
> putting devices into D3cold in general.
>
> Why is it required for NVMe?
>
But ACPI code currently calls pm_set_suspend_via_firmware() for S3 suspend. And
that causes NVMe to be powered down because of pm_suspend_via_firmware() check.
> > That applies to DT as well.
>
> Again, why?
On DT systems if firmware supports both S2Idle and S2R, devices can be kept in
low power state during S2Idle and powered down during S2R.
The problem comes if the firmware only supports the former state.
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 17:35 ` Rafael J. Wysocki
@ 2024-12-16 17:52 ` Manivannan Sadhasivam
2024-12-16 19:34 ` Rafael J. Wysocki
2024-12-19 6:30 ` Christoph Hellwig
0 siblings, 2 replies; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-16 17:52 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Christoph Hellwig, Ulf Hansson, Rafael J. Wysocki, Bjorn Helgaas,
kbusch, axboe, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Len Brown, linux-pm
On Mon, Dec 16, 2024 at 06:35:52PM +0100, Rafael J. Wysocki wrote:
> On Mon, Dec 16, 2024 at 6:11 PM Manivannan Sadhasivam
> <manivannan.sadhasivam@linaro.org> wrote:
> >
> > On Mon, Dec 16, 2024 at 05:24:40PM +0100, Rafael J. Wysocki wrote:
> > > On Sat, Dec 14, 2024 at 7:30 AM Manivannan Sadhasivam
> > > <manivannan.sadhasivam@linaro.org> wrote:
> > > >
> > > > On Fri, Dec 13, 2024 at 03:35:15PM +0100, Rafael J. Wysocki wrote:
> > > > > On Thu, Dec 12, 2024 at 4:14 PM Christoph Hellwig <hch@lst.de> wrote:
> > > > > >
> > > > > > On Thu, Dec 12, 2024 at 01:49:15PM +0100, Ulf Hansson wrote:
> > > > > > > Right. This seems to somewhat work for ACPI types of systems, because
> > > > > > > ACPI is controlling the low power state for all the devices. Based on
> > > > > > > the requested system wide low power state, ACPI can then decide to
> > > > > > > call pm_set_suspend_via_firmware() or not.
> > > > > > >
> > > > > > > Still there is a problem with this for ACPI too.
> > > > > > >
> > > > > > > How does ACPI know whether it's actually a good idea to keep the NVMe
> > > > > > > storage powered in s2idle (ACPI calls pm_set_suspend_via_firmware()
> > > > > > > only for S2R and S2disk!?)? Especially when my laptop only supports
> > > > > > > s2idle and that's what I will use when I close the lid. In this way,
> > > > > > > the NMVe storage will certainly contribute to draining the battery,
> > > > > > > especially when I won't be using my laptop for a couple of days.
> > > > > > >
> > > > > > > In my opinion, we need a better approach that is both flexible and
> > > > > > > that dynamically adjusts based upon the use case.
> > > > > >
> > > > > > Agreed. I'd be happy to work with the PM maintainers to do this,
> > > > > > but I don't really know enough about the PM core to drive it
> > > > > > (as the reply from Rafael to my mail makes pretty clear :))
> > > > >
> > > > > I'm here to help.
> > > > >
> > > > > Let me know what exactly you want to achieve and we'll see how to make it work.
> > > >
> > > > I'll try to summarize the requirement here since I started this thread:
> > > >
> > > > Problem statement
> > > > =================
> > > >
> > > > We need a PM core API that tells the device drivers when it is safe to powerdown
> > > > the devices. The usecase here is with PCIe based NVMe devices but the problem is
> > > > applicable to other devices as well.
> > > >
> > > > Drivers are relying on couple of options now:
> > > >
> > > > 1. If pm_suspend_via_firmware() returns true, then drivers will shutdown the
> > > > device assuming that the firmware is going to handle the suspend. But this API
> > > > is currently used only by ACPI. Even there, ACPI relies on S2R being supported
> > > > by the platform and it sets pm_set_suspend_via_firmware() only when the suspend
> > > > is S2R. But if the platform doesn't support S2R (current case of most of the
> > > > Qcom SoCs), then pm_suspend_via_firmware() will return false and NVMe won't be
> > > > powered down draining the battery.
> > >
> > > So my question here would be why is it not powered down always during
> > > system-wide suspend?
> > >
> > > Why exactly is it necessary to distinguish one case from the other
> > > (assuming that we are talking about system-wide suspend only)?
> > >
> >
> > To support Android like usecase with firmware that only supports
> > suspend-to-idle (Qcom platforms). This usecase is not applicable right now, but
> > one can't just rule out the possibility in the near future.
>
> This doesn't explain anything to me, sorry.
>
> > And the problem is that we need to support both Android and non-Android systems
> > with same firmware :/
>
> So what technically is the problem?
>
NVMe wear out is the problem.
> > > There are drivers that use pm_suspend_via_firmware() to check whether
> > > or not something special needs to be done to the device because if
> > > "false" is returned, the platform firmware is not going to remove
> > > power from it.
> > >
> > > However, you seem to be talking about the opposite, so doing something
> > > special to the device if "true" is returned. I'm not sure why this is
> > > necessary.
> > >
> >
> > Because, since 'false' is returned, drivers like NVMe are assuming that the
> > platform won't remove power on all DT systems and they just keep the devices in
> > low power state (not powering them down). But why would I want my NVMe in DT
> > based laptop to be always powered in system suspend?
>
> Because it causes the system to use less energy when suspended.
>
But the NVMe driver would be still operational. Powering it down would cause the
system to use much less energy. There is also a case where some devices like
(Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
powered down in order for the SoC to reach its low power state (CX power
collapse in Qcom terms). If not, the SoC would continue to draw more power
leading to battery draining quickly. This platform is supported in upstream and
we keep the PCIe interconnect voted during suspend as the NVMe driver is
expecting the device to retain its state during resume. Because of this
requirement, this platform is not reaching SoC level low power state with
upstream kernel.
> > > > If the platform is using DT, then there is no entity setting
> > > > pm_set_suspend_via_firmware().
> > >
> > > That's true and so the assumption is that in this case the handling of
> > > all devices will always be the same regardless of which flavor of
> > > system suspend is chosen by user space.
> > >
> >
> > Right and that's why the above concern is raised.
>
> And it is still unclear to me what the problem with it is.
>
> What exactly can go wrong?
>
> > > > So NVMe will be kept in low power state all the
> > > > time (still draining the battery).
> > >
> > > So what would be the problem with powering it down unconditionally?
> > >
> >
> > I'm not sure how would you do that (by checking dev_of_node()?). Even so, it
> > will wear out the NVMe devices if used in Android tablets etc...
>
> I understand the wear-out concern.
>
> Is there anything else?
>
No, that's the only concern.
> > > > There were attempts to set this flag from
> > > > PSCI [1], but there were objections on setting this flag when PSCI_SUSPEND is
> > > > not supported by the platform (again, the case with Qcom SoCs). Even if this
> > > > approach succeeds, then there are concerns that if the platform is used in an
> > > > OS like Android where the S2Idle cycle is far more high, NVMe will wear out
> > > > very quickly.
> > >
> > > I see.
> > >
> > > > So this is where the forthcoming API need to "dynamically adjusts
> > > > based upon the use case" as quoted by Ulf in his previous reply. One way to
> > > > achieve would be by giving the flexibility to the userspace to choose the
> > > > suspend state (if platform has options to select). UFS does something similar
> > > > with 'spm_lvl' [2] sysfs attribute that I believe Android userspace itself makes
> > > > use of.
> > >
> > > Before we're talking about APIs, let's talk about the desired behavior.
> > >
> > > It looks like there are cases in which you'd want to turn the device
> > > off completely (say put it into D3cold in the PCI terminology) and
> > > there are cases in which you'd want it to stay in a somewhat-powered
> > > low-power state.
> > >
> > > It is unclear to me what they are at this point.
> > >
> >
> > I hope that my above explanation clarifies.
>
> Sorry, but not really.
>
> > Here is the short version of the suspend requirement across platforms:
> >
> > 1. D3Cold (power down) - Laptops/Automotive
> > 2. D3hot (low power) - Android Tablets
>
> Where do the above requirements come from?
>
In my case, it is coming from the SoC vendor, Qcom.
> > FWIW, I did receive feedback from people asking to just ignore the Android
> > usecase and always power down the devices for DT platforms. But I happen to
> > disagree with them. Let me know if I was wrong and I should not worry about
> > Android usecase as it is for the future.
>
> I'm not sure what you mean by the "Android usecase" TBH. Do you mean
> the wear-out concern in the Android usage scenario or is there more to
> it?
No, just the wear out concern in Android usecase.
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 17:39 ` Manivannan Sadhasivam
@ 2024-12-16 19:10 ` Rafael J. Wysocki
0 siblings, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2024-12-16 19:10 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: Rafael J. Wysocki, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Mon, Dec 16, 2024 at 6:39 PM Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> On Mon, Dec 16, 2024 at 06:28:55PM +0100, Rafael J. Wysocki wrote:
> > On Mon, Dec 16, 2024 at 5:48 PM Manivannan Sadhasivam
> > <manivannan.sadhasivam@linaro.org> wrote:
> > >
> > > On Mon, Dec 16, 2024 at 05:42:30PM +0100, Rafael J. Wysocki wrote:
> > > > On Mon, Dec 16, 2024 at 5:23 PM Christoph Hellwig <hch@lst.de> wrote:
> > > > >
> > > > > On Sat, Dec 14, 2024 at 12:00:23PM +0530, Manivannan Sadhasivam wrote:
> > > > > > We need a PM core API that tells the device drivers when it is safe to powerdown
> > > > > > the devices. The usecase here is with PCIe based NVMe devices but the problem is
> > > > > > applicable to other devices as well.
> > > > >
> > > > > Maybe I'm misunderstanding things, but I think the important part is
> > > > > to indicate when a suspend actually MUST put the device into D3. Because
> > > > > doing that should always be safe, but not always optimal.
> > > >
> > > > I'm not aware of any cases when a device must be put into D3cold
> > > > (which I think is what you mean) during system-wide suspend.
> > > >
> > > > Suspend-to-idle on x86 doesn't require this, at least not for
> > > > correctness. I don't think any platforms using DT require it either.
> > > >
> > >
> > > On suspend-to-idle, yes D3Cold doesn't make sense,
> >
> > Why?
> >
>
> Because there is no requirement to remove power during S2Idle, isn't it?
There is no requirement, but there is a reason that I've already
mentioned: A device in D3cold causes a system in s2idle use less
energy.
I think this reason is good enough.
> From Documentation/admin-guide/pm/sleep-states.rst:
>
> 'This is a generic, pure software, light-weight variant of system suspend'.
Sure. Which basically means that platform firmware is not involved
(at least not as much as in the ACPI S3 case) and CPUs are managed in
a more lightweight way.
The power states of devices are what they are.
Moreover, the whole idea of suspend-to-idle is to re-use the
suspend-to-RAM (ACPI S3, basically) control flow, along with the
handling of devices. The devices handled differently are exceptions,
not a rule.
> > > but on suspend-to-ram it is pretty much required.
> >
> > Well, I know for a fact that on x86 platforms ACPI S3 does not require
> > putting devices into D3cold in general.
> >
> > Why is it required for NVMe?
> >
>
> But ACPI code currently calls pm_set_suspend_via_firmware() for S3 suspend. And
> that causes NVMe to be powered down because of pm_suspend_via_firmware() check.
That's how the driver is designed, but is it actually required to be
designed this way?
> > > That applies to DT as well.
> >
> > Again, why?
>
> On DT systems if firmware supports both S2Idle and S2R, devices can be kept in
> low power state during S2Idle and powered down during S2R.
>
> The problem comes if the firmware only supports the former state.
I don't get it, sorry.
Firmware support for suspend-to-idle is not required, at least in
principle, so I'm not sure what you mean by firmware support for it.
I'm also not sure how S2R is supported by firmware on DT systems, so
care to explain?
In any case, I don't see why in principle devices need to be handled
differently depending on what flavor of system suspend in used at any
given time.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 17:52 ` Manivannan Sadhasivam
@ 2024-12-16 19:34 ` Rafael J. Wysocki
2024-12-16 19:40 ` Keith Busch
2024-12-17 5:26 ` manivannan.sadhasivam
2024-12-19 6:30 ` Christoph Hellwig
1 sibling, 2 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2024-12-16 19:34 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: Rafael J. Wysocki, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Mon, Dec 16, 2024 at 6:52 PM Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> On Mon, Dec 16, 2024 at 06:35:52PM +0100, Rafael J. Wysocki wrote:
> > On Mon, Dec 16, 2024 at 6:11 PM Manivannan Sadhasivam
> > <manivannan.sadhasivam@linaro.org> wrote:
> > >
> > > On Mon, Dec 16, 2024 at 05:24:40PM +0100, Rafael J. Wysocki wrote:
> > > > On Sat, Dec 14, 2024 at 7:30 AM Manivannan Sadhasivam
> > > > <manivannan.sadhasivam@linaro.org> wrote:
> > > > >
> > > > > On Fri, Dec 13, 2024 at 03:35:15PM +0100, Rafael J. Wysocki wrote:
> > > > > > On Thu, Dec 12, 2024 at 4:14 PM Christoph Hellwig <hch@lst.de> wrote:
> > > > > > >
> > > > > > > On Thu, Dec 12, 2024 at 01:49:15PM +0100, Ulf Hansson wrote:
> > > > > > > > Right. This seems to somewhat work for ACPI types of systems, because
> > > > > > > > ACPI is controlling the low power state for all the devices. Based on
> > > > > > > > the requested system wide low power state, ACPI can then decide to
> > > > > > > > call pm_set_suspend_via_firmware() or not.
> > > > > > > >
> > > > > > > > Still there is a problem with this for ACPI too.
> > > > > > > >
> > > > > > > > How does ACPI know whether it's actually a good idea to keep the NVMe
> > > > > > > > storage powered in s2idle (ACPI calls pm_set_suspend_via_firmware()
> > > > > > > > only for S2R and S2disk!?)? Especially when my laptop only supports
> > > > > > > > s2idle and that's what I will use when I close the lid. In this way,
> > > > > > > > the NMVe storage will certainly contribute to draining the battery,
> > > > > > > > especially when I won't be using my laptop for a couple of days.
> > > > > > > >
> > > > > > > > In my opinion, we need a better approach that is both flexible and
> > > > > > > > that dynamically adjusts based upon the use case.
> > > > > > >
> > > > > > > Agreed. I'd be happy to work with the PM maintainers to do this,
> > > > > > > but I don't really know enough about the PM core to drive it
> > > > > > > (as the reply from Rafael to my mail makes pretty clear :))
> > > > > >
> > > > > > I'm here to help.
> > > > > >
> > > > > > Let me know what exactly you want to achieve and we'll see how to make it work.
> > > > >
> > > > > I'll try to summarize the requirement here since I started this thread:
> > > > >
> > > > > Problem statement
> > > > > =================
> > > > >
> > > > > We need a PM core API that tells the device drivers when it is safe to powerdown
> > > > > the devices. The usecase here is with PCIe based NVMe devices but the problem is
> > > > > applicable to other devices as well.
> > > > >
> > > > > Drivers are relying on couple of options now:
> > > > >
> > > > > 1. If pm_suspend_via_firmware() returns true, then drivers will shutdown the
> > > > > device assuming that the firmware is going to handle the suspend. But this API
> > > > > is currently used only by ACPI. Even there, ACPI relies on S2R being supported
> > > > > by the platform and it sets pm_set_suspend_via_firmware() only when the suspend
> > > > > is S2R. But if the platform doesn't support S2R (current case of most of the
> > > > > Qcom SoCs), then pm_suspend_via_firmware() will return false and NVMe won't be
> > > > > powered down draining the battery.
> > > >
> > > > So my question here would be why is it not powered down always during
> > > > system-wide suspend?
> > > >
> > > > Why exactly is it necessary to distinguish one case from the other
> > > > (assuming that we are talking about system-wide suspend only)?
> > > >
> > >
> > > To support Android like usecase with firmware that only supports
> > > suspend-to-idle (Qcom platforms). This usecase is not applicable right now, but
> > > one can't just rule out the possibility in the near future.
> >
> > This doesn't explain anything to me, sorry.
> >
> > > And the problem is that we need to support both Android and non-Android systems
> > > with same firmware :/
> >
> > So what technically is the problem?
> >
>
> NVMe wear out is the problem.
>
> > > > There are drivers that use pm_suspend_via_firmware() to check whether
> > > > or not something special needs to be done to the device because if
> > > > "false" is returned, the platform firmware is not going to remove
> > > > power from it.
> > > >
> > > > However, you seem to be talking about the opposite, so doing something
> > > > special to the device if "true" is returned. I'm not sure why this is
> > > > necessary.
> > > >
> > >
> > > Because, since 'false' is returned, drivers like NVMe are assuming that the
> > > platform won't remove power on all DT systems and they just keep the devices in
> > > low power state (not powering them down). But why would I want my NVMe in DT
> > > based laptop to be always powered in system suspend?
> >
> > Because it causes the system to use less energy when suspended.
> >
>
> But the NVMe driver would be still operational. Powering it down would cause the
> system to use much less energy.
Yes, that's what I wanted to say, sorry for the confusion.
IIRC, there are two reasons why the NVMe driver does this:
(1) On some platforms the devices handled by it couldn't come back
from D3cold without platform firmware's help.
(2) Putting devices into D3hot alone was not sufficient to reduce
their power sufficiently and it didn't make much difference on top of
the other things done to them for this purpose.
It avoids putting devices into D3cold (or even into D3hot for that
matter) because of the above, not because it is generally required to
do so in suspend-to-idle. So this really is exceptional behavior and
not following a rule of some kind.
> There is also a case where some devices like
> (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
> powered down in order for the SoC to reach its low power state (CX power
> collapse in Qcom terms). If not, the SoC would continue to draw more power
> leading to battery draining quickly. This platform is supported in upstream and
> we keep the PCIe interconnect voted during suspend as the NVMe driver is
> expecting the device to retain its state during resume. Because of this
> requirement, this platform is not reaching SoC level low power state with
> upstream kernel.
OK, now all of this makes sense and that's why you really want NVMe
devices to end up in some form of PCI D3 in suspend-to-idle.
Would D3hot be sufficient for this platform or does it need to be
D3cold? If the latter, what exactly is the method by which they are
put into D3cold?
> > > > > If the platform is using DT, then there is no entity setting
> > > > > pm_set_suspend_via_firmware().
> > > >
> > > > That's true and so the assumption is that in this case the handling of
> > > > all devices will always be the same regardless of which flavor of
> > > > system suspend is chosen by user space.
> > > >
> > >
> > > Right and that's why the above concern is raised.
> >
> > And it is still unclear to me what the problem with it is.
> >
> > What exactly can go wrong?
> >
> > > > > So NVMe will be kept in low power state all the
> > > > > time (still draining the battery).
> > > >
> > > > So what would be the problem with powering it down unconditionally?
> > > >
> > >
> > > I'm not sure how would you do that (by checking dev_of_node()?). Even so, it
> > > will wear out the NVMe devices if used in Android tablets etc...
> >
> > I understand the wear-out concern.
> >
> > Is there anything else?
> >
>
> No, that's the only concern.
OK
I think we're getting to the bottom of the issue.
First off, there really is no requirement to avoid putting PCI devices
into D3hot or D3cold during suspend-to-idle. On all modern Intel
platforms, PCIe devices are put into D3(hot or cold) during
suspend-to-idle and I don't see why this should be any different on
platforms from any other vendors.
The NVMe driver is an exception and it avoids D3(hot or cold) during
suspend-to-idle because of problems with some hardware or platforms.
It might in principle allow devices to go into D3(hot or cold) during
suspend-to-idle, so long as it knows that this is going to work.
However, there is an additional concern that putting an NVMe device
into D3cold every time during system suspend on Android might cause it
to wear out more quickly.
Is there anything else?
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 19:34 ` Rafael J. Wysocki
@ 2024-12-16 19:40 ` Keith Busch
2024-12-16 19:43 ` Rafael J. Wysocki
2024-12-17 5:26 ` manivannan.sadhasivam
1 sibling, 1 reply; 47+ messages in thread
From: Keith Busch @ 2024-12-16 19:40 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Manivannan Sadhasivam, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote:
> However, there is an additional concern that putting an NVMe device
> into D3cold every time during system suspend on Android might cause it
> to wear out more quickly.
>
> Is there anything else?
I recall a previous reason for this behavior was because the resume
latency was significantly faster if we don't prepare the device for D3,
and the nvme protocol specific power states for some platforms was
sufficiently low enough. Apparently this choice hasn't been universally
optimal.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 19:40 ` Keith Busch
@ 2024-12-16 19:43 ` Rafael J. Wysocki
0 siblings, 0 replies; 47+ messages in thread
From: Rafael J. Wysocki @ 2024-12-16 19:43 UTC (permalink / raw)
To: Keith Busch
Cc: Rafael J. Wysocki, Manivannan Sadhasivam, Christoph Hellwig,
Ulf Hansson, Rafael J. Wysocki, Bjorn Helgaas, axboe, sagi,
linux-nvme, linux-kernel, linux-pci, andersson, konradybcio,
Len Brown, linux-pm
On Mon, Dec 16, 2024 at 8:40 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote:
> > However, there is an additional concern that putting an NVMe device
> > into D3cold every time during system suspend on Android might cause it
> > to wear out more quickly.
> >
> > Is there anything else?
>
> I recall a previous reason for this behavior was because the resume
> latency was significantly faster if we don't prepare the device for D3,
> and the nvme protocol specific power states for some platforms was
> sufficiently low enough. Apparently this choice hasn't been universally
> optimal.
Thanks for chiming in!
I'm recalling some other reasons too (mentioned in one of my previous
messages), but overall this is a driver design choice, it is not done
to adhere to some general rule.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 19:34 ` Rafael J. Wysocki
2024-12-16 19:40 ` Keith Busch
@ 2024-12-17 5:26 ` manivannan.sadhasivam
2024-12-17 19:45 ` Rafael J. Wysocki
1 sibling, 1 reply; 47+ messages in thread
From: manivannan.sadhasivam @ 2024-12-17 5:26 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Christoph Hellwig, Ulf Hansson, Rafael J. Wysocki, Bjorn Helgaas,
kbusch, axboe, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Len Brown, linux-pm
On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote:
[...]
> > There is also a case where some devices like
> > (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
> > powered down in order for the SoC to reach its low power state (CX power
> > collapse in Qcom terms). If not, the SoC would continue to draw more power
> > leading to battery draining quickly. This platform is supported in upstream and
> > we keep the PCIe interconnect voted during suspend as the NVMe driver is
> > expecting the device to retain its state during resume. Because of this
> > requirement, this platform is not reaching SoC level low power state with
> > upstream kernel.
>
> OK, now all of this makes sense and that's why you really want NVMe
> devices to end up in some form of PCI D3 in suspend-to-idle.
>
> Would D3hot be sufficient for this platform or does it need to be
> D3cold? If the latter, what exactly is the method by which they are
> put into D3cold?
>
D3Cold is what preferred. Earlier the controller driver used to transition the
device into D3Cold by sending PME_Turn_Off, turning off refclk, regulators
etc... Now we have a new framework called 'pwrctrl' that handles the
clock/regulators supplied to the device. So both controller and pwrctrl drivers
need to work in a tandem to put the device into D3Cold.
Once the PCIe client driver (NVMe in this case) powers down the device, then
controller/pwrctrl drivers will check the PCIe device state and transition the
device into D3Cold. This is a TODO btw.
But right now there is no generic D3Cold handling available for DT platforms. I
am hoping to fix that too once this NVMe issue is handled.
> > > > > > If the platform is using DT, then there is no entity setting
> > > > > > pm_set_suspend_via_firmware().
> > > > >
> > > > > That's true and so the assumption is that in this case the handling of
> > > > > all devices will always be the same regardless of which flavor of
> > > > > system suspend is chosen by user space.
> > > > >
> > > >
> > > > Right and that's why the above concern is raised.
> > >
> > > And it is still unclear to me what the problem with it is.
> > >
> > > What exactly can go wrong?
> > >
> > > > > > So NVMe will be kept in low power state all the
> > > > > > time (still draining the battery).
> > > > >
> > > > > So what would be the problem with powering it down unconditionally?
> > > > >
> > > >
> > > > I'm not sure how would you do that (by checking dev_of_node()?). Even so, it
> > > > will wear out the NVMe devices if used in Android tablets etc...
> > >
> > > I understand the wear-out concern.
> > >
> > > Is there anything else?
> > >
> >
> > No, that's the only concern.
>
> OK
>
> I think we're getting to the bottom of the issue.
>
> First off, there really is no requirement to avoid putting PCI devices
> into D3hot or D3cold during suspend-to-idle. On all modern Intel
> platforms, PCIe devices are put into D3(hot or cold) during
> suspend-to-idle and I don't see why this should be any different on
> platforms from any other vendors.
>
> The NVMe driver is an exception and it avoids D3(hot or cold) during
> suspend-to-idle because of problems with some hardware or platforms.
> It might in principle allow devices to go into D3(hot or cold) during
> suspend-to-idle, so long as it knows that this is going to work.
>
Slight correction here.
NVMe driver avoids PCI PM _only_ when it wants to handle the NVMe power
state on its own, not all the time. It has some checks [1] in the suspend path
and if the platform/device passes one of the checks, it will power down the
device.
DT platforms doesn't pass any of the checks, so the NVMe driver always manages
the power state on its own. Unfortunately, the resultant power saving is not
enough, so the vendors (Laptop/Automotive) using DT want NVMe to be powered down
all the time. This is the first problem we are trying to solve.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c#n3448
> However, there is an additional concern that putting an NVMe device
> into D3cold every time during system suspend on Android might cause it
> to wear out more quickly.
>
Right, this is the second problem.
> Is there anything else?
We also need to consider the fact that the firmware in some platforms doesn't
support S2R. So we cannot take a decision based on S2I/S2R alone.
I think there are atleast a couple of ways to solve above mentioned problems:
1. Go extra mile, take account of all issues/limitations mentioned above and
come up with an API. This API will provide a sane default suspend behavior to
drivers (fixing first problem) and will also allow changing the behavior
dynamically (to fix the second problem where kernel cannot distinguish Android
vs other OS).
2. Allow DT platforms to call pm_set_suspend_via_firmware() and make use of the
existing behavior in the NVMe driver. This would only solve the first problem
though. But would be a good start.
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-17 5:26 ` manivannan.sadhasivam
@ 2024-12-17 19:45 ` Rafael J. Wysocki
2024-12-19 8:02 ` Manivannan Sadhasivam
0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2024-12-17 19:45 UTC (permalink / raw)
To: manivannan.sadhasivam
Cc: Rafael J. Wysocki, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Tue, Dec 17, 2024 at 6:26 AM <manivannan.sadhasivam@linaro.org> wrote:
>
> On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote:
>
> [...]
>
> > > There is also a case where some devices like
> > > (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
> > > powered down in order for the SoC to reach its low power state (CX power
> > > collapse in Qcom terms). If not, the SoC would continue to draw more power
> > > leading to battery draining quickly. This platform is supported in upstream and
> > > we keep the PCIe interconnect voted during suspend as the NVMe driver is
> > > expecting the device to retain its state during resume. Because of this
> > > requirement, this platform is not reaching SoC level low power state with
> > > upstream kernel.
> >
> > OK, now all of this makes sense and that's why you really want NVMe
> > devices to end up in some form of PCI D3 in suspend-to-idle.
> >
> > Would D3hot be sufficient for this platform or does it need to be
> > D3cold? If the latter, what exactly is the method by which they are
> > put into D3cold?
> >
>
> D3Cold is what preferred. Earlier the controller driver used to transition the
> device into D3Cold by sending PME_Turn_Off, turning off refclk, regulators
> etc... Now we have a new framework called 'pwrctrl' that handles the
> clock/regulators supplied to the device. So both controller and pwrctrl drivers
> need to work in a tandem to put the device into D3Cold.
>
> Once the PCIe client driver (NVMe in this case) powers down the device, then
> controller/pwrctrl drivers will check the PCIe device state and transition the
> device into D3Cold. This is a TODO btw.
>
> But right now there is no generic D3Cold handling available for DT platforms. I
> am hoping to fix that too once this NVMe issue is handled.
There's no generic D3cold handling for PCIe devices at all AFAICS. At
least, I'm not aware of any standard way to do it. Yes, there are
vendor-specific conventions that may even be followed by multiple
vendors, but not much beyond that.
> > > > > > > If the platform is using DT, then there is no entity setting
> > > > > > > pm_set_suspend_via_firmware().
> > > > > >
> > > > > > That's true and so the assumption is that in this case the handling of
> > > > > > all devices will always be the same regardless of which flavor of
> > > > > > system suspend is chosen by user space.
> > > > > >
> > > > >
> > > > > Right and that's why the above concern is raised.
> > > >
> > > > And it is still unclear to me what the problem with it is.
> > > >
> > > > What exactly can go wrong?
> > > >
> > > > > > > So NVMe will be kept in low power state all the
> > > > > > > time (still draining the battery).
> > > > > >
> > > > > > So what would be the problem with powering it down unconditionally?
> > > > > >
> > > > >
> > > > > I'm not sure how would you do that (by checking dev_of_node()?). Even so, it
> > > > > will wear out the NVMe devices if used in Android tablets etc...
> > > >
> > > > I understand the wear-out concern.
> > > >
> > > > Is there anything else?
> > > >
> > >
> > > No, that's the only concern.
> >
> > OK
> >
> > I think we're getting to the bottom of the issue.
> >
> > First off, there really is no requirement to avoid putting PCI devices
> > into D3hot or D3cold during suspend-to-idle. On all modern Intel
> > platforms, PCIe devices are put into D3(hot or cold) during
> > suspend-to-idle and I don't see why this should be any different on
> > platforms from any other vendors.
> >
> > The NVMe driver is an exception and it avoids D3(hot or cold) during
> > suspend-to-idle because of problems with some hardware or platforms.
> > It might in principle allow devices to go into D3(hot or cold) during
> > suspend-to-idle, so long as it knows that this is going to work.
> >
>
> Slight correction here.
>
> NVMe driver avoids PCI PM _only_ when it wants to handle the NVMe power
> state on its own, not all the time. It has some checks [1] in the suspend path
> and if the platform/device passes one of the checks, it will power down the
> device.
Yes, there is a comment in that code explaining what's going on and
that is basically "if key ingredients are missing or the firmware is
going to do its thing anyway, don't bother".
> DT platforms doesn't pass any of the checks, so the NVMe driver always manages
> the power state on its own. Unfortunately, the resultant power saving is not
> enough, so the vendors (Laptop/Automotive) using DT want NVMe to be powered down
> all the time. This is the first problem we are trying to solve.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c#n3448
I see.
This cannot be done by the driver itself, though, at least not in
general. The PCI layer needs to be involved and, if we are talking
about D3cold, the platform firmware needs to be involved either.
As a rule, the PCI layer reaches out to the platform firmware for help
as needed and drivers don't take part in this directly.
The NVMe driver would need to let the PCI layer take over and set the
PCI power state of the device if it wanted to get any deeper than NVMe
protocol specific power states.
In principle, this can be addressed with some kind of a driver opt-in.
That is, the NVMe driver would continue to work the way it does, but
instead of completely preventing the PCI layer from taking over, it
would opt in (the exact opt-in mechanism is TBD) for D3cold if (a) the
platform firmware provides a mechanism to do so and (b) the DT
indicates that this mechanism should be used for the given device.
> > However, there is an additional concern that putting an NVMe device
> > into D3cold every time during system suspend on Android might cause it
> > to wear out more quickly.
> >
>
> Right, this is the second problem.
Let's set this one somewhat aside for now. We'll get back to it when
we have clarity in the above.
> > Is there anything else?
>
> We also need to consider the fact that the firmware in some platforms doesn't
> support S2R. So we cannot take a decision based on S2I/S2R alone.
Right, the S2I/S2R thing is ACPI-specific.
On platforms using ACPI, pm_suspend_via_firmware() means that the
firmware is at least likely to power down the whole platform, so the
PCI layer may as well be allowed to do what it wants with the device.
> I think there are atleast a couple of ways to solve above mentioned problems:
>
> 1. Go extra mile, take account of all issues/limitations mentioned above and
> come up with an API. This API will provide a sane default suspend behavior to
> drivers (fixing first problem) and will also allow changing the behavior
> dynamically (to fix the second problem where kernel cannot distinguish Android
> vs other OS).
I don't quite follow TBH.
A "default suspend behavior" is there already and it is to allow the
PCI layer to set the device's power state (in collaboration with the
platform firmware). Thing is, the NVMe driver doesn't always want to
rely on it.
> 2. Allow DT platforms to call pm_set_suspend_via_firmware() and make use of the
> existing behavior in the NVMe driver. This would only solve the first problem
> though. But would be a good start.
That would mean just letting the PCI layer always take over on the
platforms that would call pm_set_suspend_via_firmware().
There is a potential issue with doing it, which is that everybody
calling pm_suspend_via_firmware() would then assume that the platform
would be powered down by the firmware. I'm not sure how much of an
issue that would be in practice, though.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 17:52 ` Manivannan Sadhasivam
2024-12-16 19:34 ` Rafael J. Wysocki
@ 2024-12-19 6:30 ` Christoph Hellwig
2024-12-19 8:03 ` Manivannan Sadhasivam
1 sibling, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-12-19 6:30 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: Rafael J. Wysocki, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Mon, Dec 16, 2024 at 11:22:10PM +0530, Manivannan Sadhasivam wrote:
> > So what technically is the problem?
> >
>
> NVMe wear out is the problem.
Btw, it's not just the PE cycles, it's also the sheer time a clean
shutdown can take (or the time for an unclean enable if we don't wait
long enough).
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-17 19:45 ` Rafael J. Wysocki
@ 2024-12-19 8:02 ` Manivannan Sadhasivam
2024-12-19 12:45 ` Rafael J. Wysocki
0 siblings, 1 reply; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-19 8:02 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Christoph Hellwig, Ulf Hansson, Rafael J. Wysocki, Bjorn Helgaas,
kbusch, axboe, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Len Brown, linux-pm
On Tue, Dec 17, 2024 at 08:45:55PM +0100, Rafael J. Wysocki wrote:
> On Tue, Dec 17, 2024 at 6:26 AM <manivannan.sadhasivam@linaro.org> wrote:
> >
> > On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote:
> >
> > [...]
> >
> > > > There is also a case where some devices like
> > > > (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
> > > > powered down in order for the SoC to reach its low power state (CX power
> > > > collapse in Qcom terms). If not, the SoC would continue to draw more power
> > > > leading to battery draining quickly. This platform is supported in upstream and
> > > > we keep the PCIe interconnect voted during suspend as the NVMe driver is
> > > > expecting the device to retain its state during resume. Because of this
> > > > requirement, this platform is not reaching SoC level low power state with
> > > > upstream kernel.
> > >
> > > OK, now all of this makes sense and that's why you really want NVMe
> > > devices to end up in some form of PCI D3 in suspend-to-idle.
> > >
> > > Would D3hot be sufficient for this platform or does it need to be
> > > D3cold? If the latter, what exactly is the method by which they are
> > > put into D3cold?
> > >
> >
> > D3Cold is what preferred. Earlier the controller driver used to transition the
> > device into D3Cold by sending PME_Turn_Off, turning off refclk, regulators
> > etc... Now we have a new framework called 'pwrctrl' that handles the
> > clock/regulators supplied to the device. So both controller and pwrctrl drivers
> > need to work in a tandem to put the device into D3Cold.
> >
> > Once the PCIe client driver (NVMe in this case) powers down the device, then
> > controller/pwrctrl drivers will check the PCIe device state and transition the
> > device into D3Cold. This is a TODO btw.
> >
> > But right now there is no generic D3Cold handling available for DT platforms. I
> > am hoping to fix that too once this NVMe issue is handled.
>
> There's no generic D3cold handling for PCIe devices at all AFAICS. At
> least, I'm not aware of any standard way to do it. Yes, there are
> vendor-specific conventions that may even be followed by multiple
> vendors, but not much beyond that.
>
Yeah, right. Atleast ACPI has its own way of handling D3Cold and that's what I
meant. There is no such option available for DT right now. I was shoping that
once this NVMe issue is resolved, then I could look into D3Cold for DT
platforms.
> > > > > > > > If the platform is using DT, then there is no entity setting
> > > > > > > > pm_set_suspend_via_firmware().
> > > > > > >
> > > > > > > That's true and so the assumption is that in this case the handling of
> > > > > > > all devices will always be the same regardless of which flavor of
> > > > > > > system suspend is chosen by user space.
> > > > > > >
> > > > > >
> > > > > > Right and that's why the above concern is raised.
> > > > >
> > > > > And it is still unclear to me what the problem with it is.
> > > > >
> > > > > What exactly can go wrong?
> > > > >
> > > > > > > > So NVMe will be kept in low power state all the
> > > > > > > > time (still draining the battery).
> > > > > > >
> > > > > > > So what would be the problem with powering it down unconditionally?
> > > > > > >
> > > > > >
> > > > > > I'm not sure how would you do that (by checking dev_of_node()?). Even so, it
> > > > > > will wear out the NVMe devices if used in Android tablets etc...
> > > > >
> > > > > I understand the wear-out concern.
> > > > >
> > > > > Is there anything else?
> > > > >
> > > >
> > > > No, that's the only concern.
> > >
> > > OK
> > >
> > > I think we're getting to the bottom of the issue.
> > >
> > > First off, there really is no requirement to avoid putting PCI devices
> > > into D3hot or D3cold during suspend-to-idle. On all modern Intel
> > > platforms, PCIe devices are put into D3(hot or cold) during
> > > suspend-to-idle and I don't see why this should be any different on
> > > platforms from any other vendors.
> > >
> > > The NVMe driver is an exception and it avoids D3(hot or cold) during
> > > suspend-to-idle because of problems with some hardware or platforms.
> > > It might in principle allow devices to go into D3(hot or cold) during
> > > suspend-to-idle, so long as it knows that this is going to work.
> > >
> >
> > Slight correction here.
> >
> > NVMe driver avoids PCI PM _only_ when it wants to handle the NVMe power
> > state on its own, not all the time. It has some checks [1] in the suspend path
> > and if the platform/device passes one of the checks, it will power down the
> > device.
>
> Yes, there is a comment in that code explaining what's going on and
> that is basically "if key ingredients are missing or the firmware is
> going to do its thing anyway, don't bother".
>
> > DT platforms doesn't pass any of the checks, so the NVMe driver always manages
> > the power state on its own. Unfortunately, the resultant power saving is not
> > enough, so the vendors (Laptop/Automotive) using DT want NVMe to be powered down
> > all the time. This is the first problem we are trying to solve.
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c#n3448
>
> I see.
>
> This cannot be done by the driver itself, though, at least not in
> general. The PCI layer needs to be involved and, if we are talking
> about D3cold, the platform firmware needs to be involved either.
>
Right, but the device driver needs to have some idea about what state PCI core
is going to choose for the device. I believe that's the purpose of
pci_choose_state() API. More below...
> As a rule, the PCI layer reaches out to the platform firmware for help
> as needed and drivers don't take part in this directly.
>
> The NVMe driver would need to let the PCI layer take over and set the
> PCI power state of the device if it wanted to get any deeper than NVMe
> protocol specific power states.
>
> In principle, this can be addressed with some kind of a driver opt-in.
> That is, the NVMe driver would continue to work the way it does, but
> instead of completely preventing the PCI layer from taking over, it
> would opt in (the exact opt-in mechanism is TBD) for D3cold if (a) the
> platform firmware provides a mechanism to do so and (b) the DT
> indicates that this mechanism should be used for the given device.
>
Ok, IIUC you are talking about something like this?
static int nvme_suspend(struct device *dev)
{
...
if (pm_suspend_via_firmware() || !ctrl->npss || ... ||
pci_choose_state(pdev, PMSG_SUSPEND) == PCI_D3cold)
return nvme_disable_prepare_reset(ndev, true);
/* continue using protocol specific low power state */
...
}
Here, pci_choose_state() should tell the driver if the device should enter
D3Cold. ACPI already supports this API, now I need to add DT support (which is
not straightforward, but doable). Since this API is already exported, I think it
makes perfect sense to use it here (and other drivers for similar usecase).
> > > However, there is an additional concern that putting an NVMe device
> > > into D3cold every time during system suspend on Android might cause it
> > > to wear out more quickly.
> > >
> >
> > Right, this is the second problem.
>
> Let's set this one somewhat aside for now. We'll get back to it when
> we have clarity in the above.
>
Ok. I believe this could be addressed in pci_choose_state() itself if required.
> > > Is there anything else?
> >
> > We also need to consider the fact that the firmware in some platforms doesn't
> > support S2R. So we cannot take a decision based on S2I/S2R alone.
>
> Right, the S2I/S2R thing is ACPI-specific.
>
> On platforms using ACPI, pm_suspend_via_firmware() means that the
> firmware is at least likely to power down the whole platform, so the
> PCI layer may as well be allowed to do what it wants with the device.
>
> > I think there are atleast a couple of ways to solve above mentioned problems:
> >
> > 1. Go extra mile, take account of all issues/limitations mentioned above and
> > come up with an API. This API will provide a sane default suspend behavior to
> > drivers (fixing first problem) and will also allow changing the behavior
> > dynamically (to fix the second problem where kernel cannot distinguish Android
> > vs other OS).
>
> I don't quite follow TBH.
>
> A "default suspend behavior" is there already and it is to allow the
> PCI layer to set the device's power state (in collaboration with the
> platform firmware). Thing is, the NVMe driver doesn't always want to
> rely on it.
>
> > 2. Allow DT platforms to call pm_set_suspend_via_firmware() and make use of the
> > existing behavior in the NVMe driver. This would only solve the first problem
> > though. But would be a good start.
>
> That would mean just letting the PCI layer always take over on the
> platforms that would call pm_set_suspend_via_firmware().
>
> There is a potential issue with doing it, which is that everybody
> calling pm_suspend_via_firmware() would then assume that the platform
> would be powered down by the firmware. I'm not sure how much of an
> issue that would be in practice, though.
Yeah, that would be a concern.
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-19 6:30 ` Christoph Hellwig
@ 2024-12-19 8:03 ` Manivannan Sadhasivam
0 siblings, 0 replies; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-19 8:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Rafael J. Wysocki, Ulf Hansson, Rafael J. Wysocki, Bjorn Helgaas,
kbusch, axboe, sagi, linux-nvme, linux-kernel, linux-pci,
andersson, konradybcio, Len Brown, linux-pm
On Thu, Dec 19, 2024 at 07:30:52AM +0100, Christoph Hellwig wrote:
> On Mon, Dec 16, 2024 at 11:22:10PM +0530, Manivannan Sadhasivam wrote:
> > > So what technically is the problem?
> > >
> >
> > NVMe wear out is the problem.
>
> Btw, it's not just the PE cycles, it's also the sheer time a clean
> shutdown can take (or the time for an unclean enable if we don't wait
> long enough).
>
I do agree with you!
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-19 8:02 ` Manivannan Sadhasivam
@ 2024-12-19 12:45 ` Rafael J. Wysocki
2024-12-19 16:41 ` Ulf Hansson
0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2024-12-19 12:45 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: Rafael J. Wysocki, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Thu, Dec 19, 2024 at 9:02 AM Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> On Tue, Dec 17, 2024 at 08:45:55PM +0100, Rafael J. Wysocki wrote:
> > On Tue, Dec 17, 2024 at 6:26 AM <manivannan.sadhasivam@linaro.org> wrote:
> > >
> > > On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote:
> > >
> > > [...]
> > >
> > > > > There is also a case where some devices like
> > > > > (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
> > > > > powered down in order for the SoC to reach its low power state (CX power
> > > > > collapse in Qcom terms). If not, the SoC would continue to draw more power
> > > > > leading to battery draining quickly. This platform is supported in upstream and
> > > > > we keep the PCIe interconnect voted during suspend as the NVMe driver is
> > > > > expecting the device to retain its state during resume. Because of this
> > > > > requirement, this platform is not reaching SoC level low power state with
> > > > > upstream kernel.
> > > >
> > > > OK, now all of this makes sense and that's why you really want NVMe
> > > > devices to end up in some form of PCI D3 in suspend-to-idle.
> > > >
> > > > Would D3hot be sufficient for this platform or does it need to be
> > > > D3cold? If the latter, what exactly is the method by which they are
> > > > put into D3cold?
> > > >
> > >
> > > D3Cold is what preferred. Earlier the controller driver used to transition the
> > > device into D3Cold by sending PME_Turn_Off, turning off refclk, regulators
> > > etc... Now we have a new framework called 'pwrctrl' that handles the
> > > clock/regulators supplied to the device. So both controller and pwrctrl drivers
> > > need to work in a tandem to put the device into D3Cold.
> > >
> > > Once the PCIe client driver (NVMe in this case) powers down the device, then
> > > controller/pwrctrl drivers will check the PCIe device state and transition the
> > > device into D3Cold. This is a TODO btw.
> > >
> > > But right now there is no generic D3Cold handling available for DT platforms. I
> > > am hoping to fix that too once this NVMe issue is handled.
> >
> > There's no generic D3cold handling for PCIe devices at all AFAICS. At
> > least, I'm not aware of any standard way to do it. Yes, there are
> > vendor-specific conventions that may even be followed by multiple
> > vendors, but not much beyond that.
> >
>
> Yeah, right. Atleast ACPI has its own way of handling D3Cold and that's what I
> meant.
Well, not so simple.
ACPI has a way to put devices into D3cold, but it is based on power
resources: If all of the ACPI power resources for state D3hot of the
given device are turned off, the device is assumed to have no power
which effectively means D3cold.
Now, this basically works for platform devices, but it doesn't work
particularly well for PCIe devices that get their power from the bus.
This is addressed by using a conventional approach that has not been
codified as a standard so far AFAICS. The convention is to define a
D3hot power resource for the parent PCIe port of the device in
question and turn that power resource off when both the device and its
parent port have been programmed to go into D3hot (via PMCSR). So the
flow is that the device goes into D3hot first. Later, its parent port
goes into D3hot and then the D3hot power resource of the port is
turned off. It usually doesn't really remove power from the port
(especially if this is a root port), but it turns down the PCIe link
from it to the device (in a platform-specific way) which effectively
cuts power from the device.
On the way back, the D3hot power resource of the port is turned on
first and this turns the PCIe link between the port and the device up.
> There is no such option available for DT right now. I was shoping that
> once this NVMe issue is resolved, then I could look into D3Cold for DT
> platforms.
Well, consider the above.
[cut]
> > > >
> > > > I think we're getting to the bottom of the issue.
> > > >
> > > > First off, there really is no requirement to avoid putting PCI devices
> > > > into D3hot or D3cold during suspend-to-idle. On all modern Intel
> > > > platforms, PCIe devices are put into D3(hot or cold) during
> > > > suspend-to-idle and I don't see why this should be any different on
> > > > platforms from any other vendors.
> > > >
> > > > The NVMe driver is an exception and it avoids D3(hot or cold) during
> > > > suspend-to-idle because of problems with some hardware or platforms.
> > > > It might in principle allow devices to go into D3(hot or cold) during
> > > > suspend-to-idle, so long as it knows that this is going to work.
> > > >
> > >
> > > Slight correction here.
> > >
> > > NVMe driver avoids PCI PM _only_ when it wants to handle the NVMe power
> > > state on its own, not all the time. It has some checks [1] in the suspend path
> > > and if the platform/device passes one of the checks, it will power down the
> > > device.
> >
> > Yes, there is a comment in that code explaining what's going on and
> > that is basically "if key ingredients are missing or the firmware is
> > going to do its thing anyway, don't bother".
> >
> > > DT platforms doesn't pass any of the checks, so the NVMe driver always manages
> > > the power state on its own. Unfortunately, the resultant power saving is not
> > > enough, so the vendors (Laptop/Automotive) using DT want NVMe to be powered down
> > > all the time. This is the first problem we are trying to solve.
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c#n3448
> >
> > I see.
> >
> > This cannot be done by the driver itself, though, at least not in
> > general. The PCI layer needs to be involved and, if we are talking
> > about D3cold, the platform firmware needs to be involved either.
> >
>
> Right, but the device driver needs to have some idea about what state PCI core
> is going to choose for the device. I believe that's the purpose of
> pci_choose_state() API. More below...
This is not really straightforward as per the above.
The driver would effectively need to know if the device's parent port
would be put into D3cold and pci_choose_state() is only about the
device itself.
> > As a rule, the PCI layer reaches out to the platform firmware for help
> > as needed and drivers don't take part in this directly.
> >
> > The NVMe driver would need to let the PCI layer take over and set the
> > PCI power state of the device if it wanted to get any deeper than NVMe
> > protocol specific power states.
> >
> > In principle, this can be addressed with some kind of a driver opt-in.
> > That is, the NVMe driver would continue to work the way it does, but
> > instead of completely preventing the PCI layer from taking over, it
> > would opt in (the exact opt-in mechanism is TBD) for D3cold if (a) the
> > platform firmware provides a mechanism to do so and (b) the DT
> > indicates that this mechanism should be used for the given device.
> >
>
> Ok, IIUC you are talking about something like this?
>
> static int nvme_suspend(struct device *dev)
> {
> ...
>
> if (pm_suspend_via_firmware() || !ctrl->npss || ... ||
> pci_choose_state(pdev, PMSG_SUSPEND) == PCI_D3cold)
> return nvme_disable_prepare_reset(ndev, true);
>
> /* continue using protocol specific low power state */
>
> ...
> }
>
> Here, pci_choose_state() should tell the driver if the device should enter
> D3Cold. ACPI already supports this API, now I need to add DT support (which is
> not straightforward, but doable). Since this API is already exported, I think it
> makes perfect sense to use it here (and other drivers for similar usecase).
I didn't really think about using pci_choose_state() here, but it is
not unreasonable IMV. It may in principle be extended to cover the
"port + device combo" approach described above. I'd rather avoid
calling pci_choose_state() twice for the same device during one
transition (once in the driver and then in the PCI layer), but that
could be addressed by rearranging the code.
However, I thought about adding a way for the driver to effectively
say "I'm fine with putting this device into power state X so long as
you have an indication from the platform firmware that this is OK",
where X would be the deepest power state allowed by the driver.
The drawback of this is that the driver would not know whether or not
the device would go into state X upfront, so it would need to do its
own thing in case this doesn't happen. In the NVMe driver case, this
would mean putting the device into a protocol low-power state before
returning from the suspend callback which may not be a bad idea
overall.
BTW, I'm wondering what could go wrong if the
pm_suspend_via_firmware() check were dropped from nvme_suspend()? The
driver would do some extra work on systems with ACPI S3 support, but
would it actually break anything? Does anybody know? May be worth
trying ...
> > > > However, there is an additional concern that putting an NVMe device
> > > > into D3cold every time during system suspend on Android might cause it
> > > > to wear out more quickly.
> > > >
> > >
> > > Right, this is the second problem.
> >
> > Let's set this one somewhat aside for now. We'll get back to it when
> > we have clarity in the above.
> >
>
> Ok. I believe this could be addressed in pci_choose_state() itself if required.
Possibly.
Unrelated notice: I will be mostly offline starting tomorrow through
Jan 6, 2025 and responses may be slow during that time.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-19 12:45 ` Rafael J. Wysocki
@ 2024-12-19 16:41 ` Ulf Hansson
2024-12-19 18:28 ` Rafael J. Wysocki
0 siblings, 1 reply; 47+ messages in thread
From: Ulf Hansson @ 2024-12-19 16:41 UTC (permalink / raw)
To: Christoph Hellwig, Manivannan Sadhasivam, Rafael J. Wysocki
Cc: Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Thu, 19 Dec 2024 at 13:45, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, Dec 19, 2024 at 9:02 AM Manivannan Sadhasivam
> <manivannan.sadhasivam@linaro.org> wrote:
> >
> > On Tue, Dec 17, 2024 at 08:45:55PM +0100, Rafael J. Wysocki wrote:
> > > On Tue, Dec 17, 2024 at 6:26 AM <manivannan.sadhasivam@linaro.org> wrote:
> > > >
> > > > On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote:
> > > >
> > > > [...]
> > > >
> > > > > > There is also a case where some devices like
> > > > > > (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
> > > > > > powered down in order for the SoC to reach its low power state (CX power
> > > > > > collapse in Qcom terms). If not, the SoC would continue to draw more power
> > > > > > leading to battery draining quickly. This platform is supported in upstream and
> > > > > > we keep the PCIe interconnect voted during suspend as the NVMe driver is
> > > > > > expecting the device to retain its state during resume. Because of this
> > > > > > requirement, this platform is not reaching SoC level low power state with
> > > > > > upstream kernel.
> > > > >
> > > > > OK, now all of this makes sense and that's why you really want NVMe
> > > > > devices to end up in some form of PCI D3 in suspend-to-idle.
> > > > >
> > > > > Would D3hot be sufficient for this platform or does it need to be
> > > > > D3cold? If the latter, what exactly is the method by which they are
> > > > > put into D3cold?
> > > > >
> > > >
> > > > D3Cold is what preferred. Earlier the controller driver used to transition the
> > > > device into D3Cold by sending PME_Turn_Off, turning off refclk, regulators
> > > > etc... Now we have a new framework called 'pwrctrl' that handles the
> > > > clock/regulators supplied to the device. So both controller and pwrctrl drivers
> > > > need to work in a tandem to put the device into D3Cold.
> > > >
> > > > Once the PCIe client driver (NVMe in this case) powers down the device, then
> > > > controller/pwrctrl drivers will check the PCIe device state and transition the
> > > > device into D3Cold. This is a TODO btw.
> > > >
> > > > But right now there is no generic D3Cold handling available for DT platforms. I
> > > > am hoping to fix that too once this NVMe issue is handled.
> > >
> > > There's no generic D3cold handling for PCIe devices at all AFAICS. At
> > > least, I'm not aware of any standard way to do it. Yes, there are
> > > vendor-specific conventions that may even be followed by multiple
> > > vendors, but not much beyond that.
> > >
> >
> > Yeah, right. Atleast ACPI has its own way of handling D3Cold and that's what I
> > meant.
>
> Well, not so simple.
>
> ACPI has a way to put devices into D3cold, but it is based on power
> resources: If all of the ACPI power resources for state D3hot of the
> given device are turned off, the device is assumed to have no power
> which effectively means D3cold.
>
> Now, this basically works for platform devices, but it doesn't work
> particularly well for PCIe devices that get their power from the bus.
>
> This is addressed by using a conventional approach that has not been
> codified as a standard so far AFAICS. The convention is to define a
> D3hot power resource for the parent PCIe port of the device in
> question and turn that power resource off when both the device and its
> parent port have been programmed to go into D3hot (via PMCSR). So the
> flow is that the device goes into D3hot first. Later, its parent port
> goes into D3hot and then the D3hot power resource of the port is
> turned off. It usually doesn't really remove power from the port
> (especially if this is a root port), but it turns down the PCIe link
> from it to the device (in a platform-specific way) which effectively
> cuts power from the device.
>
> On the way back, the D3hot power resource of the port is turned on
> first and this turns the PCIe link between the port and the device up.
>
> > There is no such option available for DT right now. I was shoping that
> > once this NVMe issue is resolved, then I could look into D3Cold for DT
> > platforms.
>
> Well, consider the above.
>
> [cut]
>
> > > > >
> > > > > I think we're getting to the bottom of the issue.
> > > > >
> > > > > First off, there really is no requirement to avoid putting PCI devices
> > > > > into D3hot or D3cold during suspend-to-idle. On all modern Intel
> > > > > platforms, PCIe devices are put into D3(hot or cold) during
> > > > > suspend-to-idle and I don't see why this should be any different on
> > > > > platforms from any other vendors.
> > > > >
> > > > > The NVMe driver is an exception and it avoids D3(hot or cold) during
> > > > > suspend-to-idle because of problems with some hardware or platforms.
> > > > > It might in principle allow devices to go into D3(hot or cold) during
> > > > > suspend-to-idle, so long as it knows that this is going to work.
> > > > >
> > > >
> > > > Slight correction here.
> > > >
> > > > NVMe driver avoids PCI PM _only_ when it wants to handle the NVMe power
> > > > state on its own, not all the time. It has some checks [1] in the suspend path
> > > > and if the platform/device passes one of the checks, it will power down the
> > > > device.
> > >
> > > Yes, there is a comment in that code explaining what's going on and
> > > that is basically "if key ingredients are missing or the firmware is
> > > going to do its thing anyway, don't bother".
> > >
> > > > DT platforms doesn't pass any of the checks, so the NVMe driver always manages
> > > > the power state on its own. Unfortunately, the resultant power saving is not
> > > > enough, so the vendors (Laptop/Automotive) using DT want NVMe to be powered down
> > > > all the time. This is the first problem we are trying to solve.
> > > >
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c#n3448
> > >
> > > I see.
> > >
> > > This cannot be done by the driver itself, though, at least not in
> > > general. The PCI layer needs to be involved and, if we are talking
> > > about D3cold, the platform firmware needs to be involved either.
> > >
> >
> > Right, but the device driver needs to have some idea about what state PCI core
> > is going to choose for the device. I believe that's the purpose of
> > pci_choose_state() API. More below...
>
> This is not really straightforward as per the above.
>
> The driver would effectively need to know if the device's parent port
> would be put into D3cold and pci_choose_state() is only about the
> device itself.
>
> > > As a rule, the PCI layer reaches out to the platform firmware for help
> > > as needed and drivers don't take part in this directly.
> > >
> > > The NVMe driver would need to let the PCI layer take over and set the
> > > PCI power state of the device if it wanted to get any deeper than NVMe
> > > protocol specific power states.
> > >
> > > In principle, this can be addressed with some kind of a driver opt-in.
> > > That is, the NVMe driver would continue to work the way it does, but
> > > instead of completely preventing the PCI layer from taking over, it
> > > would opt in (the exact opt-in mechanism is TBD) for D3cold if (a) the
> > > platform firmware provides a mechanism to do so and (b) the DT
> > > indicates that this mechanism should be used for the given device.
> > >
> >
> > Ok, IIUC you are talking about something like this?
> >
> > static int nvme_suspend(struct device *dev)
> > {
> > ...
> >
> > if (pm_suspend_via_firmware() || !ctrl->npss || ... ||
> > pci_choose_state(pdev, PMSG_SUSPEND) == PCI_D3cold)
> > return nvme_disable_prepare_reset(ndev, true);
> >
> > /* continue using protocol specific low power state */
> >
> > ...
> > }
> >
> > Here, pci_choose_state() should tell the driver if the device should enter
> > D3Cold. ACPI already supports this API, now I need to add DT support (which is
> > not straightforward, but doable). Since this API is already exported, I think it
> > makes perfect sense to use it here (and other drivers for similar usecase).
>
> I didn't really think about using pci_choose_state() here, but it is
> not unreasonable IMV. It may in principle be extended to cover the
> "port + device combo" approach described above. I'd rather avoid
> calling pci_choose_state() twice for the same device during one
> transition (once in the driver and then in the PCI layer), but that
> could be addressed by rearranging the code.
>
> However, I thought about adding a way for the driver to effectively
> say "I'm fine with putting this device into power state X so long as
> you have an indication from the platform firmware that this is OK",
> where X would be the deepest power state allowed by the driver.
>
> The drawback of this is that the driver would not know whether or not
> the device would go into state X upfront, so it would need to do its
> own thing in case this doesn't happen. In the NVMe driver case, this
> would mean putting the device into a protocol low-power state before
> returning from the suspend callback which may not be a bad idea
> overall.
>
> BTW, I'm wondering what could go wrong if the
> pm_suspend_via_firmware() check were dropped from nvme_suspend()? The
> driver would do some extra work on systems with ACPI S3 support, but
> would it actually break anything? Does anybody know? May be worth
> trying ...
>
> > > > > However, there is an additional concern that putting an NVMe device
> > > > > into D3cold every time during system suspend on Android might cause it
> > > > > to wear out more quickly.
> > > > >
> > > >
> > > > Right, this is the second problem.
> > >
> > > Let's set this one somewhat aside for now. We'll get back to it when
> > > we have clarity in the above.
> > >
> >
> > Ok. I believe this could be addressed in pci_choose_state() itself if required.
>
> Possibly.
>
> Unrelated notice: I will be mostly offline starting tomorrow through
> Jan 6, 2025 and responses may be slow during that time.
I didn't really find a good thread to reply to, but just decided to
pick the current latest email to share some of my additional thoughts.
Sorry, if it becomes confusing.
I think a lot of the requirements that have been discussed need to be
considered together (and there are possibly even more requirements
that have not been highlighted yet), to be able to find a proper good
solution. For example, if we just decide to always enter the deepest
low-power state that is available, that may hurt the internal storage
device, for *some* use cases.
That said, we really need to find a good step-by-step approach, to
avoid damaging storage devices along the road.
For example, we have reported issues on (e)MMC/SD where we could end
up too frequently powering-on/off the card during system
suspend/resume, to potentially damage it. For eMMC (or in-fact
non-removable cards), we have tried to mitigate the problem by
deferring the power-on of the card during system resume, until there
is actual I/O being requested for the card.
Kind regards
Uffe
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-19 16:41 ` Ulf Hansson
@ 2024-12-19 18:28 ` Rafael J. Wysocki
2025-01-03 7:26 ` Christoph Hellwig
0 siblings, 1 reply; 47+ messages in thread
From: Rafael J. Wysocki @ 2024-12-19 18:28 UTC (permalink / raw)
To: Ulf Hansson
Cc: Christoph Hellwig, Manivannan Sadhasivam, Rafael J. Wysocki,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Thu, Dec 19, 2024 at 5:42 PM Ulf Hansson <ulf.hansson@linaro.org> wrote:
>
> On Thu, 19 Dec 2024 at 13:45, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Thu, Dec 19, 2024 at 9:02 AM Manivannan Sadhasivam
> > <manivannan.sadhasivam@linaro.org> wrote:
> > >
> > > On Tue, Dec 17, 2024 at 08:45:55PM +0100, Rafael J. Wysocki wrote:
> > > > On Tue, Dec 17, 2024 at 6:26 AM <manivannan.sadhasivam@linaro.org> wrote:
> > > > >
> > > > > On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote:
> > > > >
> > > > > [...]
> > > > >
> > > > > > > There is also a case where some devices like
> > > > > > > (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
> > > > > > > powered down in order for the SoC to reach its low power state (CX power
> > > > > > > collapse in Qcom terms). If not, the SoC would continue to draw more power
> > > > > > > leading to battery draining quickly. This platform is supported in upstream and
> > > > > > > we keep the PCIe interconnect voted during suspend as the NVMe driver is
> > > > > > > expecting the device to retain its state during resume. Because of this
> > > > > > > requirement, this platform is not reaching SoC level low power state with
> > > > > > > upstream kernel.
> > > > > >
> > > > > > OK, now all of this makes sense and that's why you really want NVMe
> > > > > > devices to end up in some form of PCI D3 in suspend-to-idle.
> > > > > >
> > > > > > Would D3hot be sufficient for this platform or does it need to be
> > > > > > D3cold? If the latter, what exactly is the method by which they are
> > > > > > put into D3cold?
> > > > > >
> > > > >
> > > > > D3Cold is what preferred. Earlier the controller driver used to transition the
> > > > > device into D3Cold by sending PME_Turn_Off, turning off refclk, regulators
> > > > > etc... Now we have a new framework called 'pwrctrl' that handles the
> > > > > clock/regulators supplied to the device. So both controller and pwrctrl drivers
> > > > > need to work in a tandem to put the device into D3Cold.
> > > > >
> > > > > Once the PCIe client driver (NVMe in this case) powers down the device, then
> > > > > controller/pwrctrl drivers will check the PCIe device state and transition the
> > > > > device into D3Cold. This is a TODO btw.
> > > > >
> > > > > But right now there is no generic D3Cold handling available for DT platforms. I
> > > > > am hoping to fix that too once this NVMe issue is handled.
> > > >
> > > > There's no generic D3cold handling for PCIe devices at all AFAICS. At
> > > > least, I'm not aware of any standard way to do it. Yes, there are
> > > > vendor-specific conventions that may even be followed by multiple
> > > > vendors, but not much beyond that.
> > > >
> > >
> > > Yeah, right. Atleast ACPI has its own way of handling D3Cold and that's what I
> > > meant.
> >
> > Well, not so simple.
> >
> > ACPI has a way to put devices into D3cold, but it is based on power
> > resources: If all of the ACPI power resources for state D3hot of the
> > given device are turned off, the device is assumed to have no power
> > which effectively means D3cold.
> >
> > Now, this basically works for platform devices, but it doesn't work
> > particularly well for PCIe devices that get their power from the bus.
> >
> > This is addressed by using a conventional approach that has not been
> > codified as a standard so far AFAICS. The convention is to define a
> > D3hot power resource for the parent PCIe port of the device in
> > question and turn that power resource off when both the device and its
> > parent port have been programmed to go into D3hot (via PMCSR). So the
> > flow is that the device goes into D3hot first. Later, its parent port
> > goes into D3hot and then the D3hot power resource of the port is
> > turned off. It usually doesn't really remove power from the port
> > (especially if this is a root port), but it turns down the PCIe link
> > from it to the device (in a platform-specific way) which effectively
> > cuts power from the device.
> >
> > On the way back, the D3hot power resource of the port is turned on
> > first and this turns the PCIe link between the port and the device up.
> >
> > > There is no such option available for DT right now. I was shoping that
> > > once this NVMe issue is resolved, then I could look into D3Cold for DT
> > > platforms.
> >
> > Well, consider the above.
> >
> > [cut]
> >
> > > > > >
> > > > > > I think we're getting to the bottom of the issue.
> > > > > >
> > > > > > First off, there really is no requirement to avoid putting PCI devices
> > > > > > into D3hot or D3cold during suspend-to-idle. On all modern Intel
> > > > > > platforms, PCIe devices are put into D3(hot or cold) during
> > > > > > suspend-to-idle and I don't see why this should be any different on
> > > > > > platforms from any other vendors.
> > > > > >
> > > > > > The NVMe driver is an exception and it avoids D3(hot or cold) during
> > > > > > suspend-to-idle because of problems with some hardware or platforms.
> > > > > > It might in principle allow devices to go into D3(hot or cold) during
> > > > > > suspend-to-idle, so long as it knows that this is going to work.
> > > > > >
> > > > >
> > > > > Slight correction here.
> > > > >
> > > > > NVMe driver avoids PCI PM _only_ when it wants to handle the NVMe power
> > > > > state on its own, not all the time. It has some checks [1] in the suspend path
> > > > > and if the platform/device passes one of the checks, it will power down the
> > > > > device.
> > > >
> > > > Yes, there is a comment in that code explaining what's going on and
> > > > that is basically "if key ingredients are missing or the firmware is
> > > > going to do its thing anyway, don't bother".
> > > >
> > > > > DT platforms doesn't pass any of the checks, so the NVMe driver always manages
> > > > > the power state on its own. Unfortunately, the resultant power saving is not
> > > > > enough, so the vendors (Laptop/Automotive) using DT want NVMe to be powered down
> > > > > all the time. This is the first problem we are trying to solve.
> > > > >
> > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c#n3448
> > > >
> > > > I see.
> > > >
> > > > This cannot be done by the driver itself, though, at least not in
> > > > general. The PCI layer needs to be involved and, if we are talking
> > > > about D3cold, the platform firmware needs to be involved either.
> > > >
> > >
> > > Right, but the device driver needs to have some idea about what state PCI core
> > > is going to choose for the device. I believe that's the purpose of
> > > pci_choose_state() API. More below...
> >
> > This is not really straightforward as per the above.
> >
> > The driver would effectively need to know if the device's parent port
> > would be put into D3cold and pci_choose_state() is only about the
> > device itself.
> >
> > > > As a rule, the PCI layer reaches out to the platform firmware for help
> > > > as needed and drivers don't take part in this directly.
> > > >
> > > > The NVMe driver would need to let the PCI layer take over and set the
> > > > PCI power state of the device if it wanted to get any deeper than NVMe
> > > > protocol specific power states.
> > > >
> > > > In principle, this can be addressed with some kind of a driver opt-in.
> > > > That is, the NVMe driver would continue to work the way it does, but
> > > > instead of completely preventing the PCI layer from taking over, it
> > > > would opt in (the exact opt-in mechanism is TBD) for D3cold if (a) the
> > > > platform firmware provides a mechanism to do so and (b) the DT
> > > > indicates that this mechanism should be used for the given device.
> > > >
> > >
> > > Ok, IIUC you are talking about something like this?
> > >
> > > static int nvme_suspend(struct device *dev)
> > > {
> > > ...
> > >
> > > if (pm_suspend_via_firmware() || !ctrl->npss || ... ||
> > > pci_choose_state(pdev, PMSG_SUSPEND) == PCI_D3cold)
> > > return nvme_disable_prepare_reset(ndev, true);
> > >
> > > /* continue using protocol specific low power state */
> > >
> > > ...
> > > }
> > >
> > > Here, pci_choose_state() should tell the driver if the device should enter
> > > D3Cold. ACPI already supports this API, now I need to add DT support (which is
> > > not straightforward, but doable). Since this API is already exported, I think it
> > > makes perfect sense to use it here (and other drivers for similar usecase).
> >
> > I didn't really think about using pci_choose_state() here, but it is
> > not unreasonable IMV. It may in principle be extended to cover the
> > "port + device combo" approach described above. I'd rather avoid
> > calling pci_choose_state() twice for the same device during one
> > transition (once in the driver and then in the PCI layer), but that
> > could be addressed by rearranging the code.
> >
> > However, I thought about adding a way for the driver to effectively
> > say "I'm fine with putting this device into power state X so long as
> > you have an indication from the platform firmware that this is OK",
> > where X would be the deepest power state allowed by the driver.
> >
> > The drawback of this is that the driver would not know whether or not
> > the device would go into state X upfront, so it would need to do its
> > own thing in case this doesn't happen. In the NVMe driver case, this
> > would mean putting the device into a protocol low-power state before
> > returning from the suspend callback which may not be a bad idea
> > overall.
> >
> > BTW, I'm wondering what could go wrong if the
> > pm_suspend_via_firmware() check were dropped from nvme_suspend()? The
> > driver would do some extra work on systems with ACPI S3 support, but
> > would it actually break anything? Does anybody know? May be worth
> > trying ...
> >
> > > > > > However, there is an additional concern that putting an NVMe device
> > > > > > into D3cold every time during system suspend on Android might cause it
> > > > > > to wear out more quickly.
> > > > > >
> > > > >
> > > > > Right, this is the second problem.
> > > >
> > > > Let's set this one somewhat aside for now. We'll get back to it when
> > > > we have clarity in the above.
> > > >
> > >
> > > Ok. I believe this could be addressed in pci_choose_state() itself if required.
> >
> > Possibly.
> >
> > Unrelated notice: I will be mostly offline starting tomorrow through
> > Jan 6, 2025 and responses may be slow during that time.
>
> I didn't really find a good thread to reply to, but just decided to
> pick the current latest email to share some of my additional thoughts.
> Sorry, if it becomes confusing.
>
> I think a lot of the requirements that have been discussed need to be
> considered together (and there are possibly even more requirements
> that have not been highlighted yet), to be able to find a proper good
> solution. For example, if we just decide to always enter the deepest
> low-power state that is available, that may hurt the internal storage
> device, for *some* use cases.
>
> That said, we really need to find a good step-by-step approach, to
> avoid damaging storage devices along the road.
>
> For example, we have reported issues on (e)MMC/SD where we could end
> up too frequently powering-on/off the card during system
> suspend/resume, to potentially damage it. For eMMC (or in-fact
> non-removable cards), we have tried to mitigate the problem by
> deferring the power-on of the card during system resume, until there
> is actual I/O being requested for the card.
This is a reasonable approach and there are other possibilities.
In the particular case of NVMe, though, the question of how to
actually power it down is still open. Until there is a clear answer
to it, the possibility of powering NVMe devices down and up too often
is not really a practical concern.
Generally, the problem of choosing suitable power states during system
suspend is not limited to storage (for instance, if the sleep time is
going to be short enough and it is known in advance, it may not be
useful to change power states of many devices because that is just
going to take too much time) and the attempts to address it have been
very limited so far. There are many factors that come into play, like
how much energy is to be saved, how much time are suspend and resume
transitions allowed or expected to take, which devices are expected to
wake up the system from sleep, etc, and the possible device wearout is
one of them.
I kind of know where you are going with this, but I'd first like to
know what hardware and firmware interfaces are going to be used and
how and then consider problems that may arise from doing that.
Otherwise we'll end up in an endless discussion of theoretical
possibilities I'm afraid.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-16 16:42 ` Rafael J. Wysocki
2024-12-16 16:48 ` Manivannan Sadhasivam
@ 2024-12-20 15:15 ` Konrad Dybcio
2024-12-21 3:38 ` Manivannan Sadhasivam
2025-01-03 7:28 ` Christoph Hellwig
1 sibling, 2 replies; 47+ messages in thread
From: Konrad Dybcio @ 2024-12-20 15:15 UTC (permalink / raw)
To: Rafael J. Wysocki, Christoph Hellwig
Cc: Manivannan Sadhasivam, Ulf Hansson, Rafael J. Wysocki,
Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme, linux-kernel,
linux-pci, andersson, konradybcio, Len Brown, linux-pm
On 16.12.2024 5:42 PM, Rafael J. Wysocki wrote:
> On Mon, Dec 16, 2024 at 5:23 PM Christoph Hellwig <hch@lst.de> wrote:
>>
>> On Sat, Dec 14, 2024 at 12:00:23PM +0530, Manivannan Sadhasivam wrote:
>>> We need a PM core API that tells the device drivers when it is safe to powerdown
>>> the devices. The usecase here is with PCIe based NVMe devices but the problem is
>>> applicable to other devices as well.
>>
>> Maybe I'm misunderstanding things, but I think the important part is
>> to indicate when a suspend actually MUST put the device into D3. Because
>> doing that should always be safe, but not always optimal.
>
> I'm not aware of any cases when a device must be put into D3cold
> (which I think is what you mean) during system-wide suspend.
>
> Suspend-to-idle on x86 doesn't require this, at least not for
> correctness. I don't think any platforms using DT require it either.
That would be correct.
The Qualcomm platform (or class of platforms) we're looking at with this
specific issue requires PCIe (implying NVMe) shutdown for S2RAM.
The S2RAM entry mechanism is unfortunately misrepresented as an S2Idle
state by Linux as of today, and I'm trying really hard to convince some
folks to let me describe it correctly, with little success so far..
That is the real underlying issue and once/if it's solved, this patch
will not be necessary.
> In theory, ACPI S3 or hibernation may request that, but I've never
> seen it happen in practice.
>
> Suspend-to-idle on x86 may want devices to end up in specific power
> states in order to be able to switch the entire platform into a deep
> energy-saving mode, but that's never been D3cold so far.
In our case the plug is only pulled in S2RAM, otherwise the best we can
do is just turn off the devices individually to decrease the overall
power draw
(simplifying some small edge cases here, but that's mostly the story)
Konrad
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-20 15:15 ` Konrad Dybcio
@ 2024-12-21 3:38 ` Manivannan Sadhasivam
2024-12-21 11:17 ` Konrad Dybcio
2025-01-03 7:28 ` Christoph Hellwig
1 sibling, 1 reply; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-21 3:38 UTC (permalink / raw)
To: Konrad Dybcio
Cc: Rafael J. Wysocki, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Fri, Dec 20, 2024 at 04:15:21PM +0100, Konrad Dybcio wrote:
> On 16.12.2024 5:42 PM, Rafael J. Wysocki wrote:
> > On Mon, Dec 16, 2024 at 5:23 PM Christoph Hellwig <hch@lst.de> wrote:
> >>
> >> On Sat, Dec 14, 2024 at 12:00:23PM +0530, Manivannan Sadhasivam wrote:
> >>> We need a PM core API that tells the device drivers when it is safe to powerdown
> >>> the devices. The usecase here is with PCIe based NVMe devices but the problem is
> >>> applicable to other devices as well.
> >>
> >> Maybe I'm misunderstanding things, but I think the important part is
> >> to indicate when a suspend actually MUST put the device into D3. Because
> >> doing that should always be safe, but not always optimal.
> >
> > I'm not aware of any cases when a device must be put into D3cold
> > (which I think is what you mean) during system-wide suspend.
> >
> > Suspend-to-idle on x86 doesn't require this, at least not for
> > correctness. I don't think any platforms using DT require it either.
>
> That would be correct.
>
> The Qualcomm platform (or class of platforms) we're looking at with this
> specific issue requires PCIe (implying NVMe) shutdown for S2RAM.
>
> The S2RAM entry mechanism is unfortunately misrepresented as an S2Idle
> state by Linux as of today, and I'm trying really hard to convince some
> folks to let me describe it correctly, with little success so far..
>
Perhaps you should say 'S2RAM is misrepresented as S2Idle by the firmware as of
today'...
But I'll leave it up to the PSCI folks to decide whether it makes sense to
expose PSCI SYSTEM_SUSPEND through CPU_SUSPEND or not.
For the people in this thread, I'm leaving the link to the PSCI discussion here:
https://lore.kernel.org/all/20241028-topic-cpu_suspend_s2ram-v1-0-9fdd9a04b75c@oss.qualcomm.com/
> That is the real underlying issue and once/if it's solved, this patch
> will not be necessary.
>
> > In theory, ACPI S3 or hibernation may request that, but I've never
> > seen it happen in practice.
> >
> > Suspend-to-idle on x86 may want devices to end up in specific power
> > states in order to be able to switch the entire platform into a deep
> > energy-saving mode, but that's never been D3cold so far.
>
> In our case the plug is only pulled in S2RAM, otherwise the best we can
> do is just turn off the devices individually to decrease the overall
> power draw
>
I don't think this is accurate. Qcom FW (the one we are discussing in this
thread) doesn't pull the plug (except on platforms like x13s due to hw
limitation). On ACPI though, the FW *might* pull the plug, so that's why drivers
prepare the devices by powering down them (largely) if pm_suspend_via_firmware()
succeeds. On Qcom platforms, we are trying to allow the SoC to transition to low
power state and that requires relinquishing the resource votes by the drivers.
I still have doubt that pm_set_suspend_via_firmware() applies to Qcom FW or not.
Also the API description doesn't exactly match its usecase.
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-21 3:38 ` Manivannan Sadhasivam
@ 2024-12-21 11:17 ` Konrad Dybcio
2024-12-26 16:22 ` Manivannan Sadhasivam
0 siblings, 1 reply; 47+ messages in thread
From: Konrad Dybcio @ 2024-12-21 11:17 UTC (permalink / raw)
To: Manivannan Sadhasivam, Konrad Dybcio
Cc: Rafael J. Wysocki, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On 21.12.2024 4:38 AM, Manivannan Sadhasivam wrote:
> On Fri, Dec 20, 2024 at 04:15:21PM +0100, Konrad Dybcio wrote:
>> On 16.12.2024 5:42 PM, Rafael J. Wysocki wrote:
>>> On Mon, Dec 16, 2024 at 5:23 PM Christoph Hellwig <hch@lst.de> wrote:
>>>>
>>>> On Sat, Dec 14, 2024 at 12:00:23PM +0530, Manivannan Sadhasivam wrote:
>>>>> We need a PM core API that tells the device drivers when it is safe to powerdown
>>>>> the devices. The usecase here is with PCIe based NVMe devices but the problem is
>>>>> applicable to other devices as well.
>>>>
>>>> Maybe I'm misunderstanding things, but I think the important part is
>>>> to indicate when a suspend actually MUST put the device into D3. Because
>>>> doing that should always be safe, but not always optimal.
>>>
>>> I'm not aware of any cases when a device must be put into D3cold
>>> (which I think is what you mean) during system-wide suspend.
>>>
>>> Suspend-to-idle on x86 doesn't require this, at least not for
>>> correctness. I don't think any platforms using DT require it either.
>>
>> That would be correct.
>>
>> The Qualcomm platform (or class of platforms) we're looking at with this
>> specific issue requires PCIe (implying NVMe) shutdown for S2RAM.
>>
>> The S2RAM entry mechanism is unfortunately misrepresented as an S2Idle
>> state by Linux as of today, and I'm trying really hard to convince some
>> folks to let me describe it correctly, with little success so far..
>>
>
> Perhaps you should say 'S2RAM is misrepresented as S2Idle by the firmware as of
> today'...
>
> But I'll leave it up to the PSCI folks to decide whether it makes sense to
> expose PSCI SYSTEM_SUSPEND through CPU_SUSPEND or not.
The firmware happily performs the actions required to put the platform
in S2RAM, but the interface used to request entry (CPU_SUSPEND) is
mostly used for entering CPU/cluster idle states on arm64.
(although the PSCI spec also clearly states that using CPU_SUSPEND for
system-level low power states is allowed *plus* the reference
implementation literally just calls CPU_SUSPEND internally whenever
the """proper""" SYSTEM_SUSPEND call is used, anyway)
>
> For the people in this thread, I'm leaving the link to the PSCI discussion here:
> https://lore.kernel.org/all/20241028-topic-cpu_suspend_s2ram-v1-0-9fdd9a04b75c@oss.qualcomm.com/
>
>> That is the real underlying issue and once/if it's solved, this patch
>> will not be necessary.
>>
>>> In theory, ACPI S3 or hibernation may request that, but I've never
>>> seen it happen in practice.
>>>
>>> Suspend-to-idle on x86 may want devices to end up in specific power
>>> states in order to be able to switch the entire platform into a deep
>>> energy-saving mode, but that's never been D3cold so far.
>>
>> In our case the plug is only pulled in S2RAM, otherwise the best we can
>> do is just turn off the devices individually to decrease the overall
>> power draw
>>
>
> I don't think this is accurate. Qcom FW (the one we are discussing in this
> thread) doesn't pull the plug (except on platforms like x13s due to hw
> limitation). On ACPI though, the FW *might* pull the plug, so that's why drivers
> prepare the devices by powering down them (largely) if pm_suspend_via_firmware()
> succeeds. On Qcom platforms, we are trying to allow the SoC to transition to low
> power state and that requires relinquishing the resource votes by the drivers.
Look, I have a power measurement device before my eyes and I clearly see
the main power rail being cut on successful S2RAM entry.
In s2idle/runtime cpuidle, no power is removed to anything except CPUs
(as decided by the adjacent uncore MCU) and Linux-PM-managed devices.
This is what the "pure software, light-weight variant of system suspend"
wording refers to in the doc - we shut off some peripheral devices and
put the CPUs in some sort of a wait-for-event state, opportunistically
cutting power from them.
For S2RAM, in the special snowflake sc8280xp/x13s case, we need to
shut down all PCIe RCs manually from Linux, so that another power
management MCU can then cut the system power rail.
But on other platforms it'd be enough to put the RCs in a lower power
state and have something that's not controlled by the OS decide
whether power should flow to them (more like the ACPI scenario).
The latter we don't/can't support as of now, so at least getting the
first case squared out would be good, as tearing down RCs always
works, even if it's not preferred for $REASONS.
Konrad
>
> I still have doubt that pm_set_suspend_via_firmware() applies to Qcom FW or not.
> Also the API description doesn't exactly match its usecase.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-21 11:17 ` Konrad Dybcio
@ 2024-12-26 16:22 ` Manivannan Sadhasivam
0 siblings, 0 replies; 47+ messages in thread
From: Manivannan Sadhasivam @ 2024-12-26 16:22 UTC (permalink / raw)
To: Konrad Dybcio
Cc: Rafael J. Wysocki, Christoph Hellwig, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Sat, Dec 21, 2024 at 12:17:02PM +0100, Konrad Dybcio wrote:
> On 21.12.2024 4:38 AM, Manivannan Sadhasivam wrote:
> > On Fri, Dec 20, 2024 at 04:15:21PM +0100, Konrad Dybcio wrote:
> >> On 16.12.2024 5:42 PM, Rafael J. Wysocki wrote:
> >>> On Mon, Dec 16, 2024 at 5:23 PM Christoph Hellwig <hch@lst.de> wrote:
> >>>>
> >>>> On Sat, Dec 14, 2024 at 12:00:23PM +0530, Manivannan Sadhasivam wrote:
> >>>>> We need a PM core API that tells the device drivers when it is safe to powerdown
> >>>>> the devices. The usecase here is with PCIe based NVMe devices but the problem is
> >>>>> applicable to other devices as well.
> >>>>
> >>>> Maybe I'm misunderstanding things, but I think the important part is
> >>>> to indicate when a suspend actually MUST put the device into D3. Because
> >>>> doing that should always be safe, but not always optimal.
> >>>
> >>> I'm not aware of any cases when a device must be put into D3cold
> >>> (which I think is what you mean) during system-wide suspend.
> >>>
> >>> Suspend-to-idle on x86 doesn't require this, at least not for
> >>> correctness. I don't think any platforms using DT require it either.
> >>
> >> That would be correct.
> >>
> >> The Qualcomm platform (or class of platforms) we're looking at with this
> >> specific issue requires PCIe (implying NVMe) shutdown for S2RAM.
> >>
> >> The S2RAM entry mechanism is unfortunately misrepresented as an S2Idle
> >> state by Linux as of today, and I'm trying really hard to convince some
> >> folks to let me describe it correctly, with little success so far..
> >>
> >
> > Perhaps you should say 'S2RAM is misrepresented as S2Idle by the firmware as of
> > today'...
> >
> > But I'll leave it up to the PSCI folks to decide whether it makes sense to
> > expose PSCI SYSTEM_SUSPEND through CPU_SUSPEND or not.
>
> The firmware happily performs the actions required to put the platform
> in S2RAM, but the interface used to request entry (CPU_SUSPEND) is
> mostly used for entering CPU/cluster idle states on arm64.
>
> (although the PSCI spec also clearly states that using CPU_SUSPEND for
> system-level low power states is allowed *plus* the reference
> implementation literally just calls CPU_SUSPEND internally whenever
> the """proper""" SYSTEM_SUSPEND call is used, anyway)
>
Ok, sounds fair.
> >
> > For the people in this thread, I'm leaving the link to the PSCI discussion here:
> > https://lore.kernel.org/all/20241028-topic-cpu_suspend_s2ram-v1-0-9fdd9a04b75c@oss.qualcomm.com/
> >
> >> That is the real underlying issue and once/if it's solved, this patch
> >> will not be necessary.
> >>
> >>> In theory, ACPI S3 or hibernation may request that, but I've never
> >>> seen it happen in practice.
> >>>
> >>> Suspend-to-idle on x86 may want devices to end up in specific power
> >>> states in order to be able to switch the entire platform into a deep
> >>> energy-saving mode, but that's never been D3cold so far.
> >>
> >> In our case the plug is only pulled in S2RAM, otherwise the best we can
> >> do is just turn off the devices individually to decrease the overall
> >> power draw
> >>
> >
> > I don't think this is accurate. Qcom FW (the one we are discussing in this
> > thread) doesn't pull the plug (except on platforms like x13s due to hw
> > limitation). On ACPI though, the FW *might* pull the plug, so that's why drivers
> > prepare the devices by powering down them (largely) if pm_suspend_via_firmware()
> > succeeds. On Qcom platforms, we are trying to allow the SoC to transition to low
> > power state and that requires relinquishing the resource votes by the drivers.
>
> Look, I have a power measurement device before my eyes and I clearly see
> the main power rail being cut on successful S2RAM entry.
>
You seem to have misunderstood what I said. I do *know* the power state of the
SoC when it enters the CX power collapse state. What I said was in the case of
ACPI, it powers down the peripherals in S3 without any SW dependency (except for
wakeup capable devices). But in Qcom case, each driver has to relinquish the
vote for the SoC to enter CX collapse state. But anyhow, the difference doesn't
matter much here as all drivers need to drop the vote except in wakeup path.
- Mani
--
மணிவண்ணன் சதாசிவம்
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-19 18:28 ` Rafael J. Wysocki
@ 2025-01-03 7:26 ` Christoph Hellwig
0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2025-01-03 7:26 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Ulf Hansson, Christoph Hellwig, Manivannan Sadhasivam,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On Thu, Dec 19, 2024 at 07:28:53PM +0100, Rafael J. Wysocki wrote:
> In the particular case of NVMe, though, the question of how to
> actually power it down is still open.
Powering down nvme controller is down by shutting the controller
down (nvme_disable_ctrl with shutdown=true).
> Until there is a clear answer
> to it, the possibility of powering NVMe devices down and up too often
> is not really a practical concern.
Why do you think it isn't a practial concern?
> Generally, the problem of choosing suitable power states during system
> suspend is not limited to storage (for instance, if the sleep time is
> going to be short enough and it is known in advance, it may not be
> useful to change power states of many devices because that is just
> going to take too much time) and the attempts to address it have been
> very limited so far. There are many factors that come into play, like
> how much energy is to be saved, how much time are suspend and resume
> transitions allowed or expected to take, which devices are expected to
> wake up the system from sleep, etc, and the possible device wearout is
> one of them.
It is true that this applies to all kinds of devices. But for (flash)
storage devices it is more important because of the effect on device
wear, the shutdown (and unclean restart) times that are much longer than
most devices.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2024-12-20 15:15 ` Konrad Dybcio
2024-12-21 3:38 ` Manivannan Sadhasivam
@ 2025-01-03 7:28 ` Christoph Hellwig
2025-01-03 11:48 ` Konrad Dybcio
1 sibling, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2025-01-03 7:28 UTC (permalink / raw)
To: Konrad Dybcio
Cc: Rafael J. Wysocki, Christoph Hellwig, Manivannan Sadhasivam,
Ulf Hansson, Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe,
sagi, linux-nvme, linux-kernel, linux-pci, andersson, konradybcio,
Len Brown, linux-pm
On Fri, Dec 20, 2024 at 04:15:21PM +0100, Konrad Dybcio wrote:
> The Qualcomm platform (or class of platforms) we're looking at with this
> specific issue requires PCIe (implying NVMe) shutdown for S2RAM.
>
> The S2RAM entry mechanism is unfortunately misrepresented as an S2Idle
> state by Linux as of today, and I'm trying really hard to convince some
> folks to let me describe it correctly, with little success so far..
Well, not advertizing the right mechanism isn't going to cause havoc
to any scheme.
> That is the real underlying issue and once/if it's solved, this patch
> will not be necessary.
Well, maybe this thread gave good enough fodder to finally fix it?
>
> > In theory, ACPI S3 or hibernation may request that, but I've never
> > seen it happen in practice.
> >
> > Suspend-to-idle on x86 may want devices to end up in specific power
> > states in order to be able to switch the entire platform into a deep
> > energy-saving mode, but that's never been D3cold so far.
>
> In our case the plug is only pulled in S2RAM, otherwise the best we can
> do is just turn off the devices individually to decrease the overall
> power draw
FYI, going to D3 for S2RAM seems perfectly reasonable from the NVMe POV.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
2025-01-03 7:28 ` Christoph Hellwig
@ 2025-01-03 11:48 ` Konrad Dybcio
0 siblings, 0 replies; 47+ messages in thread
From: Konrad Dybcio @ 2025-01-03 11:48 UTC (permalink / raw)
To: Christoph Hellwig, Konrad Dybcio
Cc: Rafael J. Wysocki, Manivannan Sadhasivam, Ulf Hansson,
Rafael J. Wysocki, Bjorn Helgaas, kbusch, axboe, sagi, linux-nvme,
linux-kernel, linux-pci, andersson, konradybcio, Len Brown,
linux-pm
On 3.01.2025 8:28 AM, Christoph Hellwig wrote:
> On Fri, Dec 20, 2024 at 04:15:21PM +0100, Konrad Dybcio wrote:
>> The Qualcomm platform (or class of platforms) we're looking at with this
>> specific issue requires PCIe (implying NVMe) shutdown for S2RAM.
>>
>> The S2RAM entry mechanism is unfortunately misrepresented as an S2Idle
>> state by Linux as of today, and I'm trying really hard to convince some
>> folks to let me describe it correctly, with little success so far..
>
> Well, not advertizing the right mechanism isn't going to cause havoc
> to any scheme.
>
>> That is the real underlying issue and once/if it's solved, this patch
>> will not be necessary.
>
> Well, maybe this thread gave good enough fodder to finally fix it?
Oh I'd love to..
But my changes are getting rejected over philosophical FW design
disagreements, it seems.
Konrad
>
>>
>>> In theory, ACPI S3 or hibernation may request that, but I've never
>>> seen it happen in practice.
>>>
>>> Suspend-to-idle on x86 may want devices to end up in specific power
>>> states in order to be able to switch the entire platform into a deep
>>> energy-saving mode, but that's never been D3cold so far.
>>
>> In our case the plug is only pulled in S2RAM, otherwise the best we can
>> do is just turn off the devices individually to decrease the overall
>> power draw
>
> FYI, going to D3 for S2RAM seems perfectly reasonable from the NVMe POV.
>
^ permalink raw reply [flat|nested] 47+ messages in thread
end of thread, other threads:[~2025-01-03 11:49 UTC | newest]
Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-18 8:23 [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user Manivannan Sadhasivam
2024-11-18 12:58 ` Christoph Hellwig
2024-11-18 14:58 ` Manivannan Sadhasivam
2024-11-22 22:20 ` Bjorn Helgaas
2024-11-23 9:01 ` Manivannan Sadhasivam
2024-11-26 17:11 ` Bjorn Andersson
2024-11-27 5:49 ` Manivannan Sadhasivam
2024-12-05 23:29 ` Bjorn Helgaas
2024-12-06 1:49 ` Bjorn Helgaas
2024-12-09 13:36 ` Christoph Hellwig
2024-12-09 14:38 ` Manivannan Sadhasivam
2024-12-12 5:59 ` Christoph Hellwig
2024-12-12 12:21 ` Rafael J. Wysocki
2024-12-12 12:49 ` Ulf Hansson
2024-12-12 15:13 ` Christoph Hellwig
2024-12-13 14:35 ` Rafael J. Wysocki
2024-12-14 6:30 ` Manivannan Sadhasivam
2024-12-16 16:23 ` Christoph Hellwig
2024-12-16 16:42 ` Rafael J. Wysocki
2024-12-16 16:48 ` Manivannan Sadhasivam
2024-12-16 17:28 ` Rafael J. Wysocki
2024-12-16 17:39 ` Manivannan Sadhasivam
2024-12-16 19:10 ` Rafael J. Wysocki
2024-12-20 15:15 ` Konrad Dybcio
2024-12-21 3:38 ` Manivannan Sadhasivam
2024-12-21 11:17 ` Konrad Dybcio
2024-12-26 16:22 ` Manivannan Sadhasivam
2025-01-03 7:28 ` Christoph Hellwig
2025-01-03 11:48 ` Konrad Dybcio
2024-12-16 16:24 ` Rafael J. Wysocki
2024-12-16 17:11 ` Manivannan Sadhasivam
2024-12-16 17:35 ` Rafael J. Wysocki
2024-12-16 17:52 ` Manivannan Sadhasivam
2024-12-16 19:34 ` Rafael J. Wysocki
2024-12-16 19:40 ` Keith Busch
2024-12-16 19:43 ` Rafael J. Wysocki
2024-12-17 5:26 ` manivannan.sadhasivam
2024-12-17 19:45 ` Rafael J. Wysocki
2024-12-19 8:02 ` Manivannan Sadhasivam
2024-12-19 12:45 ` Rafael J. Wysocki
2024-12-19 16:41 ` Ulf Hansson
2024-12-19 18:28 ` Rafael J. Wysocki
2025-01-03 7:26 ` Christoph Hellwig
2024-12-19 6:30 ` Christoph Hellwig
2024-12-19 8:03 ` Manivannan Sadhasivam
2024-12-09 14:43 ` Manivannan Sadhasivam
2024-12-09 14:57 ` Manivannan Sadhasivam
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).