[PATCH] arm64: suspend: Remove forcing error from suspend finisher

public inbox for linux-arm-kernel@lists.infradead.org
 help / color / mirror / Atom feed

* [PATCH] arm64: suspend: Remove forcing error from suspend finisher
@ 2026-03-16  8:48 Maulik Shah
  2026-03-19 14:11 ` Will Deacon
  2026-03-19 15:14 ` Sudeep Holla
  0 siblings, 2 replies; 8+ messages in thread
From: Maulik Shah @ 2026-03-16  8:48 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: linux-arm-kernel, linux-kernel, linux-arm-msm, Maulik Shah

Successful cpu_suspend() may not always want to return to cpu_resume() to
save the work and latency involved.

consider a scenario,

when single physical CPU (pCPU) is used on different virtual machines (VMs)
as virtual CPUs (vCPUs). VM-x's vCPU can request a powerdown state after
saving the context by invoking __cpu_suspend_enter() whereas VM-y's vCPU is
requesting a shallower than powerdown state. The hypervisor aggregates to a
non powerdown state for pCPU. A wakeup event for VM-x's vCPU may want to
resume the execution at the same place instead of jumping to cpu_resume()
as the HW never reached till powerdown state which would have lost the
context.

While the vCPU of VM-x had latency impact of saving the context in suspend
entry path but having the return to same place saves the latency to restore
the context in resume path.

consider another scenario,

Newer CPUs include a feature called “powerdown abandon”. The feature is
based on the observation that events like GIC wakeups have a high
likelihood of happening while the CPU is in the middle of its powerdown
sequence (at wfi). Older CPUs will powerdown and immediately power back
up when this happens. The newer CPUs will “give up” mid way through if
no context has been lost yet. This is possible as the powerdown operation
is lengthy and a large part of it does not lose context [1].

As the wakeup arrived after SW powerdown is done but before HW is fully
powered down. From SW view this is still a successful entry to suspend
and since the HW did not loose the context there is no reason to return at
entry address cpu_resume() to restore the context.

Remove forcing the failure at kernel if the execution does not resume at
cpu_resume() as kernel has no reason to treat such returns as failures
when the firmware has already filled in return as success.

[1] https://trustedfirmware-a.readthedocs.io/en/v2.14.0/design/firmware-design.html#cpu-specific-operations-framework

Signed-off-by: Maulik Shah <maulik.shah@oss.qualcomm.com>
---
 arch/arm64/kernel/suspend.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
index eaaff94329cddb8d1fb8d1523395453f3501c9a5..b54e578f0f8b03c1dba38157c6012bb064adaa12 100644
--- a/arch/arm64/kernel/suspend.c
+++ b/arch/arm64/kernel/suspend.c
@@ -144,15 +144,14 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned long))
 		ret = fn(arg);

 		/*
-		 * Never gets here, unless the suspend finisher fails.
-		 * Successful cpu_suspend() should return from cpu_resume(),
-		 * returning through this code path is considered an error
-		 * If the return value is set to 0 force ret = -EOPNOTSUPP
-		 * to make sure a proper error condition is propagated
+		 * Successful HW power down should return at cpu_resume()
+		 * however successful SW power down may still want to
+		 * return here to save the work and latency involved in
+		 * restoring the context when the HW never lost it.
+		 *
+		 * If the return value is set to 0 do not force failure
+		 * from here.
 		 */
-		if (!ret)
-			ret = -EOPNOTSUPP;
-
 		ct_cpuidle_exit();
 	} else {
 		ct_cpuidle_exit();

---
base-commit: 343f51842f4ed7143872f3aa116a214a5619a4b9
change-id: 20260316-suspend_ret-d5e202d34cc9

Best regards,
-- 
Maulik Shah <maulik.shah@oss.qualcomm.com>

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm64: suspend: Remove forcing error from suspend finisher
  2026-03-16  8:48 [PATCH] arm64: suspend: Remove forcing error from suspend finisher Maulik Shah
@ 2026-03-19 14:11 ` Will Deacon
  2026-03-19 15:14 ` Sudeep Holla
  1 sibling, 0 replies; 8+ messages in thread
From: Will Deacon @ 2026-03-19 14:11 UTC (permalink / raw)
  To: Maulik Shah
  Cc: Catalin Marinas, linux-arm-kernel, linux-kernel, linux-arm-msm,
	mark.rutland, lpieralisi, sudeep.holla

[+Mark, Lorenzo and Sudeep]

On Mon, Mar 16, 2026 at 02:18:18PM +0530, Maulik Shah wrote:
> Successful cpu_suspend() may not always want to return to cpu_resume() to
> save the work and latency involved.
> 
> consider a scenario,
> 
> when single physical CPU (pCPU) is used on different virtual machines (VMs)
> as virtual CPUs (vCPUs). VM-x's vCPU can request a powerdown state after
> saving the context by invoking __cpu_suspend_enter() whereas VM-y's vCPU is
> requesting a shallower than powerdown state. The hypervisor aggregates to a
> non powerdown state for pCPU. A wakeup event for VM-x's vCPU may want to
> resume the execution at the same place instead of jumping to cpu_resume()
> as the HW never reached till powerdown state which would have lost the
> context.
> 
> While the vCPU of VM-x had latency impact of saving the context in suspend
> entry path but having the return to same place saves the latency to restore
> the context in resume path.
> 
> consider another scenario,
> 
> Newer CPUs include a feature called “powerdown abandon”. The feature is
> based on the observation that events like GIC wakeups have a high
> likelihood of happening while the CPU is in the middle of its powerdown
> sequence (at wfi). Older CPUs will powerdown and immediately power back
> up when this happens. The newer CPUs will “give up” mid way through if
> no context has been lost yet. This is possible as the powerdown operation
> is lengthy and a large part of it does not lose context [1].
> 
> As the wakeup arrived after SW powerdown is done but before HW is fully
> powered down. From SW view this is still a successful entry to suspend
> and since the HW did not loose the context there is no reason to return at
> entry address cpu_resume() to restore the context.
> 
> Remove forcing the failure at kernel if the execution does not resume at
> cpu_resume() as kernel has no reason to treat such returns as failures
> when the firmware has already filled in return as success.
> 
> [1] https://trustedfirmware-a.readthedocs.io/en/v2.14.0/design/firmware-design.html#cpu-specific-operations-framework
> 
> Signed-off-by: Maulik Shah <maulik.shah@oss.qualcomm.com>
> ---
>  arch/arm64/kernel/suspend.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
> index eaaff94329cddb8d1fb8d1523395453f3501c9a5..b54e578f0f8b03c1dba38157c6012bb064adaa12 100644
> --- a/arch/arm64/kernel/suspend.c
> +++ b/arch/arm64/kernel/suspend.c
> @@ -144,15 +144,14 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned long))
>  		ret = fn(arg);
>  
>  		/*
> -		 * Never gets here, unless the suspend finisher fails.
> -		 * Successful cpu_suspend() should return from cpu_resume(),
> -		 * returning through this code path is considered an error
> -		 * If the return value is set to 0 force ret = -EOPNOTSUPP
> -		 * to make sure a proper error condition is propagated
> +		 * Successful HW power down should return at cpu_resume()
> +		 * however successful SW power down may still want to
> +		 * return here to save the work and latency involved in
> +		 * restoring the context when the HW never lost it.
> +		 *
> +		 * If the return value is set to 0 do not force failure
> +		 * from here.
>  		 */
> -		if (!ret)
> -			ret = -EOPNOTSUPP;
> -

This doesn't look right to me.

afaict, the only suspend finisher we get here on arm64 is for PSCI. The
PSCI spec returns SUCCESS if a shallower state is entered than the one
requested, in which case we should return an error back to cpuidle rather
than pretend to have entered a deeper state than we actually did.

I wonder if we could remove the 'fn' paramater from cpu_suspend()
altogether for arm64 and hardwire PSCI directly, given that it's the
only one we seem to support?

Will


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm64: suspend: Remove forcing error from suspend finisher
  2026-03-16  8:48 [PATCH] arm64: suspend: Remove forcing error from suspend finisher Maulik Shah
  2026-03-19 14:11 ` Will Deacon
@ 2026-03-19 15:14 ` Sudeep Holla
  2026-03-19 16:52   ` Sudeep Holla
  1 sibling, 1 reply; 8+ messages in thread
From: Sudeep Holla @ 2026-03-19 15:14 UTC (permalink / raw)
  To: Maulik Shah
  Cc: Catalin Marinas, Sudeep Holla, Will Deacon, linux-arm-kernel,
	linux-kernel, linux-arm-msm

On Mon, Mar 16, 2026 at 02:18:18PM +0530, Maulik Shah wrote:
> Successful cpu_suspend() may not always want to return to cpu_resume() to
> save the work and latency involved.
> 
> consider a scenario,
> 
> when single physical CPU (pCPU) is used on different virtual machines (VMs)
> as virtual CPUs (vCPUs). VM-x's vCPU can request a powerdown state after
> saving the context by invoking __cpu_suspend_enter() whereas VM-y's vCPU is
> requesting a shallower than powerdown state. The hypervisor aggregates to a
> non powerdown state for pCPU. A wakeup event for VM-x's vCPU may want to
> resume the execution at the same place instead of jumping to cpu_resume()
> as the HW never reached till powerdown state which would have lost the
> context.
> 

Though I don't fully understand the intention/use-case for presenting the
VMs with powerdown states ....

> While the vCPU of VM-x had latency impact of saving the context in suspend
> entry path but having the return to same place saves the latency to restore
> the context in resume path.
> 

I understand the exit-latency aspect, though the register set involved is not
very large unless the driver notifier list is sizable on some platforms. This
is typically the case in Platform Coordinated mode.

> consider another scenario,
> 
> Newer CPUs include a feature called “powerdown abandon”. The feature is
> based on the observation that events like GIC wakeups have a high
> likelihood of happening while the CPU is in the middle of its powerdown
> sequence (at wfi). Older CPUs will powerdown and immediately power back
> up when this happens. The newer CPUs will “give up” mid way through if
> no context has been lost yet. This is possible as the powerdown operation
> is lengthy and a large part of it does not lose context [1].
> 

When you say "large part" above, do you mean that none of the CPU context, as
visible to software, is lost? Otherwise, we would need to discuss that "large
part" in more detail. From the kernel point of view, this is a simple boolean:
context is either lost or retained. Anything in between is not valid, as we do
not support partial context loss.

> As the wakeup arrived after SW powerdown is done but before HW is fully
> powered down. From SW view this is still a successful entry to suspend
> and since the HW did not loose the context there is no reason to return at
> entry address cpu_resume() to restore the context.
> 

Yes, that may be worth considering from an optimization perspective. However,
if the hardware aborts the transition, then returning success regardless of the
software state should still be counted as a failure. That would keep the
cpuidle entry statistics more accurate than returning success. And it is
a failure as the OS expected to enter that powerdown state but there was
as H/W abort.

> Remove forcing the failure at kernel if the execution does not resume at
> cpu_resume() as kernel has no reason to treat such returns as failures
> when the firmware has already filled in return as success.
> 

This is not possible with the current PSCI spec:
"Powerdown states do not return on success because restart is through the
entry point address at wakeup."

-- 
Regards,
Sudeep


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm64: suspend: Remove forcing error from suspend finisher
  2026-03-19 15:14 ` Sudeep Holla
@ 2026-03-19 16:52   ` Sudeep Holla
  2026-03-20 10:13     ` Lorenzo Pieralisi
  0 siblings, 1 reply; 8+ messages in thread
From: Sudeep Holla @ 2026-03-19 16:52 UTC (permalink / raw)
  To: Maulik Shah
  Cc: Catalin Marinas, Sudeep Holla, Will Deacon, linux-arm-kernel,
	linux-kernel, linux-arm-msm

On Thu, Mar 19, 2026 at 03:14:01PM +0000, Sudeep Holla wrote:
> On Mon, Mar 16, 2026 at 02:18:18PM +0530, Maulik Shah wrote:
> > Successful cpu_suspend() may not always want to return to cpu_resume() to
> > save the work and latency involved.
> > 
> > consider a scenario,
> > 
> > when single physical CPU (pCPU) is used on different virtual machines (VMs)
> > as virtual CPUs (vCPUs). VM-x's vCPU can request a powerdown state after
> > saving the context by invoking __cpu_suspend_enter() whereas VM-y's vCPU is
> > requesting a shallower than powerdown state. The hypervisor aggregates to a
> > non powerdown state for pCPU. A wakeup event for VM-x's vCPU may want to
> > resume the execution at the same place instead of jumping to cpu_resume()
> > as the HW never reached till powerdown state which would have lost the
> > context.
> > 
> 
> Though I don't fully understand the intention/use-case for presenting the
> VMs with powerdown states ....
> 
> > While the vCPU of VM-x had latency impact of saving the context in suspend
> > entry path but having the return to same place saves the latency to restore
> > the context in resume path.
> > 
> 
> I understand the exit-latency aspect, though the register set involved is not
> very large unless the driver notifier list is sizable on some platforms. This
> is typically the case in Platform Coordinated mode.
> 
> > consider another scenario,
> > 
> > Newer CPUs include a feature called “powerdown abandon”. The feature is
> > based on the observation that events like GIC wakeups have a high
> > likelihood of happening while the CPU is in the middle of its powerdown
> > sequence (at wfi). Older CPUs will powerdown and immediately power back
> > up when this happens. The newer CPUs will “give up” mid way through if
> > no context has been lost yet. This is possible as the powerdown operation
> > is lengthy and a large part of it does not lose context [1].
> > 
> 
> When you say "large part" above, do you mean that none of the CPU context, as
> visible to software, is lost? Otherwise, we would need to discuss that "large
> part" in more detail. From the kernel point of view, this is a simple boolean:
> context is either lost or retained. Anything in between is not valid, as we do
> not support partial context loss.
> 
> > As the wakeup arrived after SW powerdown is done but before HW is fully
> > powered down. From SW view this is still a successful entry to suspend
> > and since the HW did not loose the context there is no reason to return at
> > entry address cpu_resume() to restore the context.
> > 
> 
> Yes, that may be worth considering from an optimization perspective. However,
> if the hardware aborts the transition, then returning success regardless of the
> software state should still be counted as a failure. That would keep the
> cpuidle entry statistics more accurate than returning success. And it is
> a failure as the OS expected to enter that powerdown state but there was
> as H/W abort.
> 
> > Remove forcing the failure at kernel if the execution does not resume at
> > cpu_resume() as kernel has no reason to treat such returns as failures
> > when the firmware has already filled in return as success.
> > 
> 
> This is not possible with the current PSCI spec:
> "Powerdown states do not return on success because restart is through the
> entry point address at wakeup."
> 

OK, my bad. Sorry for that.
For some reason, I read "do not return" as "must not return".

The spec allows this:
  |  The caller must not assume that a powerdown request will return using the
  |  specified entry point address. The powerdown request might not complete due,
  |  for example, to pending interrupts. It is also possible that, because of
  |  coordination with other cores, the actual state entered is shallower
  |  than the one requested. Because of this it is possible for an
  |  implementation to downgrade the powerdown state request to a standby
  |  state. In the case of a downgrade to standby, the implementation
  |  returns at the instruction following the PSCI call, at the Exception
  |  level of the caller, instead of returning by the specified entry point
  |  address. The return code in this case is SUCCESS. In the case of an
  |  early return due to a pending wakeup event, the implementation can
  |  return at the next instruction, with a return code of SUCCESS, or
  |  resume at the specified entry point address

So we need to dig and check if there was any reason for returning "NOT
SUPPORTED" when the call returned success.

-- 
Regards,
Sudeep


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm64: suspend: Remove forcing error from suspend finisher
  2026-03-19 16:52   ` Sudeep Holla
@ 2026-03-20 10:13     ` Lorenzo Pieralisi
  2026-03-20 11:10       ` Sudeep Holla
  2026-03-23 10:30       ` Maulik Shah (mkshah)
  0 siblings, 2 replies; 8+ messages in thread
From: Lorenzo Pieralisi @ 2026-03-20 10:13 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Maulik Shah, Catalin Marinas, Will Deacon, linux-arm-kernel,
	linux-kernel, linux-arm-msm

On Thu, Mar 19, 2026 at 04:52:18PM +0000, Sudeep Holla wrote:
> On Thu, Mar 19, 2026 at 03:14:01PM +0000, Sudeep Holla wrote:
> > On Mon, Mar 16, 2026 at 02:18:18PM +0530, Maulik Shah wrote:
> > > Successful cpu_suspend() may not always want to return to cpu_resume() to
> > > save the work and latency involved.
> > > 
> > > consider a scenario,
> > > 
> > > when single physical CPU (pCPU) is used on different virtual machines (VMs)
> > > as virtual CPUs (vCPUs). VM-x's vCPU can request a powerdown state after
> > > saving the context by invoking __cpu_suspend_enter() whereas VM-y's vCPU is
> > > requesting a shallower than powerdown state. The hypervisor aggregates to a
> > > non powerdown state for pCPU. A wakeup event for VM-x's vCPU may want to
> > > resume the execution at the same place instead of jumping to cpu_resume()
> > > as the HW never reached till powerdown state which would have lost the
> > > context.
> > > 
> > 
> > Though I don't fully understand the intention/use-case for presenting the
> > VMs with powerdown states ....
> > 
> > > While the vCPU of VM-x had latency impact of saving the context in suspend
> > > entry path but having the return to same place saves the latency to restore
> > > the context in resume path.
> > > 
> > 
> > I understand the exit-latency aspect, though the register set involved is not
> > very large unless the driver notifier list is sizable on some platforms. This
> > is typically the case in Platform Coordinated mode.
> > 
> > > consider another scenario,
> > > 
> > > Newer CPUs include a feature called “powerdown abandon”. The feature is
> > > based on the observation that events like GIC wakeups have a high
> > > likelihood of happening while the CPU is in the middle of its powerdown
> > > sequence (at wfi). Older CPUs will powerdown and immediately power back
> > > up when this happens. The newer CPUs will “give up” mid way through if
> > > no context has been lost yet. This is possible as the powerdown operation
> > > is lengthy and a large part of it does not lose context [1].
> > > 
> > 
> > When you say "large part" above, do you mean that none of the CPU context, as
> > visible to software, is lost? Otherwise, we would need to discuss that "large
> > part" in more detail. From the kernel point of view, this is a simple boolean:
> > context is either lost or retained. Anything in between is not valid, as we do
> > not support partial context loss.
> > 
> > > As the wakeup arrived after SW powerdown is done but before HW is fully
> > > powered down. From SW view this is still a successful entry to suspend
> > > and since the HW did not loose the context there is no reason to return at
> > > entry address cpu_resume() to restore the context.
> > > 
> > 
> > Yes, that may be worth considering from an optimization perspective. However,
> > if the hardware aborts the transition, then returning success regardless of the
> > software state should still be counted as a failure. That would keep the
> > cpuidle entry statistics more accurate than returning success. And it is
> > a failure as the OS expected to enter that powerdown state but there was
> > as H/W abort.
> > 
> > > Remove forcing the failure at kernel if the execution does not resume at
> > > cpu_resume() as kernel has no reason to treat such returns as failures
> > > when the firmware has already filled in return as success.
> > > 
> > 
> > This is not possible with the current PSCI spec:
> > "Powerdown states do not return on success because restart is through the
> > entry point address at wakeup."
> > 
> 
> OK, my bad. Sorry for that.
> For some reason, I read "do not return" as "must not return".
> 
> The spec allows this:
>   |  The caller must not assume that a powerdown request will return using the
>   |  specified entry point address. The powerdown request might not complete due,
>   |  for example, to pending interrupts. It is also possible that, because of
>   |  coordination with other cores, the actual state entered is shallower
>   |  than the one requested. Because of this it is possible for an
>   |  implementation to downgrade the powerdown state request to a standby
>   |  state. In the case of a downgrade to standby, the implementation
>   |  returns at the instruction following the PSCI call, at the Exception
>   |  level of the caller, instead of returning by the specified entry point
>   |  address. The return code in this case is SUCCESS. In the case of an
>   |  early return due to a pending wakeup event, the implementation can
>   |  return at the next instruction, with a return code of SUCCESS, or
>   |  resume at the specified entry point address
> 
> So we need to dig and check if there was any reason for returning "NOT
> SUPPORTED" when the call returned success.

Because we have no clue whatsoever about what happened. We need to get
back to the cpu_suspend() caller and either say "we entered state X instead
of Y" or report a failure (because an interrupted power down sequence *is* a
failure for Linux - unless we want to make things up), we just can't know so
to me the code seems good as it is (we can debate about the error code, yes
but the gist does not change).

Is that we want to tell CPUidle that entering the state was successful even if
the power down sequence was interrupted or the state demoted ? That's
tantamount to lying IMO and would skew the power stats, no ?

Let me know, I am just trying to understand this patch's goal.

Thanks,
Lorenzo


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm64: suspend: Remove forcing error from suspend finisher
  2026-03-20 10:13     ` Lorenzo Pieralisi
@ 2026-03-20 11:10       ` Sudeep Holla
  2026-03-23  9:19         ` Maulik Shah (mkshah)
  2026-03-23 10:30       ` Maulik Shah (mkshah)
  1 sibling, 1 reply; 8+ messages in thread
From: Sudeep Holla @ 2026-03-20 11:10 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Maulik Shah, Sudeep Holla, Catalin Marinas, Will Deacon,
	linux-arm-kernel, linux-kernel, linux-arm-msm

On Fri, Mar 20, 2026 at 11:13:08AM +0100, Lorenzo Pieralisi wrote:
> On Thu, Mar 19, 2026 at 04:52:18PM +0000, Sudeep Holla wrote:
> > On Thu, Mar 19, 2026 at 03:14:01PM +0000, Sudeep Holla wrote:
> > > On Mon, Mar 16, 2026 at 02:18:18PM +0530, Maulik Shah wrote:
> > > > Successful cpu_suspend() may not always want to return to cpu_resume() to
> > > > save the work and latency involved.
> > > > 
> > > > consider a scenario,
> > > > 
> > > > when single physical CPU (pCPU) is used on different virtual machines (VMs)
> > > > as virtual CPUs (vCPUs). VM-x's vCPU can request a powerdown state after
> > > > saving the context by invoking __cpu_suspend_enter() whereas VM-y's vCPU is
> > > > requesting a shallower than powerdown state. The hypervisor aggregates to a
> > > > non powerdown state for pCPU. A wakeup event for VM-x's vCPU may want to
> > > > resume the execution at the same place instead of jumping to cpu_resume()
> > > > as the HW never reached till powerdown state which would have lost the
> > > > context.
> > > > 
> > > 
> > > Though I don't fully understand the intention/use-case for presenting the
> > > VMs with powerdown states ....
> > > 
> > > > While the vCPU of VM-x had latency impact of saving the context in suspend
> > > > entry path but having the return to same place saves the latency to restore
> > > > the context in resume path.
> > > > 
> > > 
> > > I understand the exit-latency aspect, though the register set involved is not
> > > very large unless the driver notifier list is sizable on some platforms. This
> > > is typically the case in Platform Coordinated mode.
> > > 
> > > > consider another scenario,
> > > > 
> > > > Newer CPUs include a feature called “powerdown abandon”. The feature is
> > > > based on the observation that events like GIC wakeups have a high
> > > > likelihood of happening while the CPU is in the middle of its powerdown
> > > > sequence (at wfi). Older CPUs will powerdown and immediately power back
> > > > up when this happens. The newer CPUs will “give up” mid way through if
> > > > no context has been lost yet. This is possible as the powerdown operation
> > > > is lengthy and a large part of it does not lose context [1].
> > > > 
> > > 
> > > When you say "large part" above, do you mean that none of the CPU context, as
> > > visible to software, is lost? Otherwise, we would need to discuss that "large
> > > part" in more detail. From the kernel point of view, this is a simple boolean:
> > > context is either lost or retained. Anything in between is not valid, as we do
> > > not support partial context loss.
> > > 
> > > > As the wakeup arrived after SW powerdown is done but before HW is fully
> > > > powered down. From SW view this is still a successful entry to suspend
> > > > and since the HW did not loose the context there is no reason to return at
> > > > entry address cpu_resume() to restore the context.
> > > > 
> > > 
> > > Yes, that may be worth considering from an optimization perspective. However,
> > > if the hardware aborts the transition, then returning success regardless of the
> > > software state should still be counted as a failure. That would keep the
> > > cpuidle entry statistics more accurate than returning success. And it is
> > > a failure as the OS expected to enter that powerdown state but there was
> > > as H/W abort.
> > > 
> > > > Remove forcing the failure at kernel if the execution does not resume at
> > > > cpu_resume() as kernel has no reason to treat such returns as failures
> > > > when the firmware has already filled in return as success.
> > > > 
> > > 
> > > This is not possible with the current PSCI spec:
> > > "Powerdown states do not return on success because restart is through the
> > > entry point address at wakeup."
> > > 
> > 
> > OK, my bad. Sorry for that.
> > For some reason, I read "do not return" as "must not return".
> > 
> > The spec allows this:
> >   |  The caller must not assume that a powerdown request will return using the
> >   |  specified entry point address. The powerdown request might not complete due,
> >   |  for example, to pending interrupts. It is also possible that, because of
> >   |  coordination with other cores, the actual state entered is shallower
> >   |  than the one requested. Because of this it is possible for an
> >   |  implementation to downgrade the powerdown state request to a standby
> >   |  state. In the case of a downgrade to standby, the implementation
> >   |  returns at the instruction following the PSCI call, at the Exception
> >   |  level of the caller, instead of returning by the specified entry point
> >   |  address. The return code in this case is SUCCESS. In the case of an
> >   |  early return due to a pending wakeup event, the implementation can
> >   |  return at the next instruction, with a return code of SUCCESS, or
> >   |  resume at the specified entry point address
> > 
> > So we need to dig and check if there was any reason for returning "NOT
> > SUPPORTED" when the call returned success.
> 
> Because we have no clue whatsoever about what happened. We need to get
> back to the cpu_suspend() caller and either say "we entered state X instead
> of Y" or report a failure (because an interrupted power down sequence *is* a
> failure for Linux - unless we want to make things up), we just can't know so
> to me the code seems good as it is (we can debate about the error code, yes
> but the gist does not change).
>

Completely agreed. The current code clearly survives a successful return, so,
in my opinion, nothing is broken. It is really just a matter of exploring
whether there is a better way to express this error condition. I doubt that is
possible, since it is either success or error; it cannot be both.

Just to clarify, my earlier comment was purely about the error code, not about
success versus error.

> Is that we want to tell CPUidle that entering the state was successful even if
> the power down sequence was interrupted or the state demoted ?

That sounds like the ask to me, but that would be incorrect.

> That's tantamount to lying IMO and would skew the power stats, no ?
>

Yes, I agree.

> Let me know, I am just trying to understand this patch's goal.
>

I will let Maulik present his opinion, as I am aligned with you, Lorenzo.

-- 
Regards,
Sudeep


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm64: suspend: Remove forcing error from suspend finisher
  2026-03-20 11:10       ` Sudeep Holla
@ 2026-03-23  9:19         ` Maulik Shah (mkshah)
  0 siblings, 0 replies; 8+ messages in thread
From: Maulik Shah (mkshah) @ 2026-03-23  9:19 UTC (permalink / raw)
  To: Sudeep Holla, Lorenzo Pieralisi
  Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel,
	linux-arm-msm



On 3/20/2026 4:40 PM, Sudeep Holla wrote:
> On Fri, Mar 20, 2026 at 11:13:08AM +0100, Lorenzo Pieralisi wrote:
>> On Thu, Mar 19, 2026 at 04:52:18PM +0000, Sudeep Holla wrote:
>>> On Thu, Mar 19, 2026 at 03:14:01PM +0000, Sudeep Holla wrote:
>>>> On Mon, Mar 16, 2026 at 02:18:18PM +0530, Maulik Shah wrote:
>>>>> Successful cpu_suspend() may not always want to return to cpu_resume() to
>>>>> save the work and latency involved.
>>>>>
>>>>> consider a scenario,
>>>>>
>>>>> when single physical CPU (pCPU) is used on different virtual machines (VMs)
>>>>> as virtual CPUs (vCPUs). VM-x's vCPU can request a powerdown state after
>>>>> saving the context by invoking __cpu_suspend_enter() whereas VM-y's vCPU is
>>>>> requesting a shallower than powerdown state. The hypervisor aggregates to a
>>>>> non powerdown state for pCPU. A wakeup event for VM-x's vCPU may want to
>>>>> resume the execution at the same place instead of jumping to cpu_resume()
>>>>> as the HW never reached till powerdown state which would have lost the
>>>>> context.
>>>>>
>>>>
>>>> Though I don't fully understand the intention/use-case for presenting the
>>>> VMs with powerdown states ....
>>>>
>>>>> While the vCPU of VM-x had latency impact of saving the context in suspend
>>>>> entry path but having the return to same place saves the latency to restore
>>>>> the context in resume path.
>>>>>
>>>>
>>>> I understand the exit-latency aspect, though the register set involved is not
>>>> very large unless the driver notifier list is sizable on some platforms. This
>>>> is typically the case in Platform Coordinated mode.
>>>>
>>>>> consider another scenario,
>>>>>
>>>>> Newer CPUs include a feature called “powerdown abandon”. The feature is
>>>>> based on the observation that events like GIC wakeups have a high
>>>>> likelihood of happening while the CPU is in the middle of its powerdown
>>>>> sequence (at wfi). Older CPUs will powerdown and immediately power back
>>>>> up when this happens. The newer CPUs will “give up” mid way through if
>>>>> no context has been lost yet. This is possible as the powerdown operation
>>>>> is lengthy and a large part of it does not lose context [1].
>>>>>
>>>>
>>>> When you say "large part" above, do you mean that none of the CPU context, as
>>>> visible to software, is lost? Otherwise, we would need to discuss that "large
>>>> part" in more detail. From the kernel point of view, this is a simple boolean:
>>>> context is either lost or retained. Anything in between is not valid, as we do
>>>> not support partial context loss.
>>>>
>>>>> As the wakeup arrived after SW powerdown is done but before HW is fully
>>>>> powered down. From SW view this is still a successful entry to suspend
>>>>> and since the HW did not loose the context there is no reason to return at
>>>>> entry address cpu_resume() to restore the context.
>>>>>
>>>>
>>>> Yes, that may be worth considering from an optimization perspective. However,
>>>> if the hardware aborts the transition, then returning success regardless of the
>>>> software state should still be counted as a failure. That would keep the
>>>> cpuidle entry statistics more accurate than returning success. And it is
>>>> a failure as the OS expected to enter that powerdown state but there was
>>>> as H/W abort.
>>>>
>>>>> Remove forcing the failure at kernel if the execution does not resume at
>>>>> cpu_resume() as kernel has no reason to treat such returns as failures
>>>>> when the firmware has already filled in return as success.
>>>>>
>>>>
>>>> This is not possible with the current PSCI spec:
>>>> "Powerdown states do not return on success because restart is through the
>>>> entry point address at wakeup."
>>>>
>>>
>>> OK, my bad. Sorry for that.
>>> For some reason, I read "do not return" as "must not return".
>>>
>>> The spec allows this:
>>>   |  The caller must not assume that a powerdown request will return using the
>>>   |  specified entry point address. The powerdown request might not complete due,
>>>   |  for example, to pending interrupts. It is also possible that, because of
>>>   |  coordination with other cores, the actual state entered is shallower
>>>   |  than the one requested. Because of this it is possible for an
>>>   |  implementation to downgrade the powerdown state request to a standby
>>>   |  state. In the case of a downgrade to standby, the implementation
>>>   |  returns at the instruction following the PSCI call, at the Exception
>>>   |  level of the caller, instead of returning by the specified entry point
>>>   |  address. The return code in this case is SUCCESS. In the case of an
>>>   |  early return due to a pending wakeup event, the implementation can
>>>   |  return at the next instruction, with a return code of SUCCESS, or
>>>   |  resume at the specified entry point address
>>>
>>> So we need to dig and check if there was any reason for returning "NOT
>>> SUPPORTED" when the call returned success.
>>
>> Because we have no clue whatsoever about what happened. We need to get
>> back to the cpu_suspend() caller and either say "we entered state X instead
>> of Y" or report a failure (because an interrupted power down sequence *is* a
>> failure for Linux - unless we want to make things up), we just can't know so
>> to me the code seems good as it is (we can debate about the error code, yes
>> but the gist does not change).
>>
> 
> Completely agreed. The current code clearly survives a successful return, so,
> in my opinion, nothing is broken. It is really just a matter of exploring
> whether there is a better way to express this error condition. I doubt that is
> possible, since it is either success or error; it cannot be both.
> 
> Just to clarify, my earlier comment was purely about the error code, not about
> success versus error.
> 
>> Is that we want to tell CPUidle that entering the state was successful even if
>> the power down sequence was interrupted or the state demoted ?
> 
> That sounds like the ask to me, but that would be incorrect.
> 
>> That's tantamount to lying IMO and would skew the power stats, no ?
>>
> 
> Yes, I agree.
> 
>> Let me know, I am just trying to understand this patch's goal.
>>
> 
> I will let Maulik present his opinion, as I am aligned with you, Lorenzo.
> 

Thank you for the review.

The goal is to optimize the exit latency even if register set involved to restore
may not be of considerable size, saving scales up with multiple CPUs and multiple
VMs running.

This is achieved when the cpu resumes the execution at the same place,
however with this, i am seeing that idle states are rejected because of forcing the
error from Linux.

Consider an example scenario listed in commit text for multiple virtual machines,

VM‑X: Power‑Down ─┐
                  ├─ Hypervisor Aggregation to Retention
VM‑Y: Retention ──┘

A resume to the same place is today treated as failure in VM-X's cpuidle statistics,
A cpuidle governor may then become less aggressive towards next entry to power down state
as it has seen a error for the previously entered mode.

Such a scenario should not impact VM-X to re-select a power down state for CPU,
for this reason the VM-X need to treat the resume at the same place as "success",
when the firmware says so.

Thanks,
Maulik


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] arm64: suspend: Remove forcing error from suspend finisher
  2026-03-20 10:13     ` Lorenzo Pieralisi
  2026-03-20 11:10       ` Sudeep Holla
@ 2026-03-23 10:30       ` Maulik Shah (mkshah)
  1 sibling, 0 replies; 8+ messages in thread
From: Maulik Shah (mkshah) @ 2026-03-23 10:30 UTC (permalink / raw)
  To: Lorenzo Pieralisi, Sudeep Holla
  Cc: Catalin Marinas, Will Deacon, linux-arm-kernel, linux-kernel,
	linux-arm-msm



On 3/20/2026 3:43 PM, Lorenzo Pieralisi wrote:
> On Thu, Mar 19, 2026 at 04:52:18PM +0000, Sudeep Holla wrote:
>> On Thu, Mar 19, 2026 at 03:14:01PM +0000, Sudeep Holla wrote:
>>> On Mon, Mar 16, 2026 at 02:18:18PM +0530, Maulik Shah wrote:
>>>> Successful cpu_suspend() may not always want to return to cpu_resume() to
>>>> save the work and latency involved.
>>>>
>>>> consider a scenario,
>>>>
>>>> when single physical CPU (pCPU) is used on different virtual machines (VMs)
>>>> as virtual CPUs (vCPUs). VM-x's vCPU can request a powerdown state after
>>>> saving the context by invoking __cpu_suspend_enter() whereas VM-y's vCPU is
>>>> requesting a shallower than powerdown state. The hypervisor aggregates to a
>>>> non powerdown state for pCPU. A wakeup event for VM-x's vCPU may want to
>>>> resume the execution at the same place instead of jumping to cpu_resume()
>>>> as the HW never reached till powerdown state which would have lost the
>>>> context.
>>>>
>>>
>>> Though I don't fully understand the intention/use-case for presenting the
>>> VMs with powerdown states ....
>>>
>>>> While the vCPU of VM-x had latency impact of saving the context in suspend
>>>> entry path but having the return to same place saves the latency to restore
>>>> the context in resume path.
>>>>
>>>
>>> I understand the exit-latency aspect, though the register set involved is not
>>> very large unless the driver notifier list is sizable on some platforms. This
>>> is typically the case in Platform Coordinated mode.
>>>
>>>> consider another scenario,
>>>>
>>>> Newer CPUs include a feature called “powerdown abandon”. The feature is
>>>> based on the observation that events like GIC wakeups have a high
>>>> likelihood of happening while the CPU is in the middle of its powerdown
>>>> sequence (at wfi). Older CPUs will powerdown and immediately power back
>>>> up when this happens. The newer CPUs will “give up” mid way through if
>>>> no context has been lost yet. This is possible as the powerdown operation
>>>> is lengthy and a large part of it does not lose context [1].
>>>>
>>>
>>> When you say "large part" above, do you mean that none of the CPU context, as
>>> visible to software, is lost? Otherwise, we would need to discuss that "large
>>> part" in more detail. From the kernel point of view, this is a simple boolean:
>>> context is either lost or retained. Anything in between is not valid, as we do
>>> not support partial context loss.
>>>
>>>> As the wakeup arrived after SW powerdown is done but before HW is fully
>>>> powered down. From SW view this is still a successful entry to suspend
>>>> and since the HW did not loose the context there is no reason to return at
>>>> entry address cpu_resume() to restore the context.
>>>>
>>>
>>> Yes, that may be worth considering from an optimization perspective. However,
>>> if the hardware aborts the transition, then returning success regardless of the
>>> software state should still be counted as a failure. That would keep the
>>> cpuidle entry statistics more accurate than returning success. And it is
>>> a failure as the OS expected to enter that powerdown state but there was
>>> as H/W abort.
>>>
>>>> Remove forcing the failure at kernel if the execution does not resume at
>>>> cpu_resume() as kernel has no reason to treat such returns as failures
>>>> when the firmware has already filled in return as success.
>>>>
>>>
>>> This is not possible with the current PSCI spec:
>>> "Powerdown states do not return on success because restart is through the
>>> entry point address at wakeup."
>>>
>>
>> OK, my bad. Sorry for that.
>> For some reason, I read "do not return" as "must not return".
>>
>> The spec allows this:
>>   |  The caller must not assume that a powerdown request will return using the
>>   |  specified entry point address. The powerdown request might not complete due,
>>   |  for example, to pending interrupts. It is also possible that, because of
>>   |  coordination with other cores, the actual state entered is shallower
>>   |  than the one requested. Because of this it is possible for an
>>   |  implementation to downgrade the powerdown state request to a standby
>>   |  state. In the case of a downgrade to standby, the implementation
>>   |  returns at the instruction following the PSCI call, at the Exception
>>   |  level of the caller, instead of returning by the specified entry point
>>   |  address. The return code in this case is SUCCESS. In the case of an
>>   |  early return due to a pending wakeup event, the implementation can
>>   |  return at the next instruction, with a return code of SUCCESS, or
>>   |  resume at the specified entry point address
>>
>> So we need to dig and check if there was any reason for returning "NOT
>> SUPPORTED" when the call returned success.
> 
> Because we have no clue whatsoever about what happened. We need to get
> back to the cpu_suspend() caller and either say "we entered state X instead
> of Y" or report a failure (because an interrupted power down sequence *is* a
> failure for Linux - unless we want to make things up), we just can't know so
> to me the code seems good as it is (we can debate about the error code, yes
> but the gist does not change).
> 
> Is that we want to tell CPUidle that entering the state was successful even if
> the power down sequence was interrupted or the state demoted ? That's
> tantamount to lying IMO and would skew the power stats, no ?

In this case, The power down sequence is interrupted in HW, no SW involved here.

when the SW is at "wfi" instruction (last instruction) and the core power down sequence is in progress
in HW, the wakeup interrupt arrives at the this point (before power down is completed).

Older core would power down and immediately power up back, since power down is completed
(without staying at power down state) and resume started from reset vector, firmware makes
jump at entry address and Linux accounted this as "success" and CPUidle increments the "above"
count seeing the less residency in selected state.

Newer cpus with "powerdown abandon" feature, same is today accounted as failure by Linux and
"rejected" count in CPUidle would increment just because core did not finish the power down in HW
and did not resume execution from reset vector / entry address.

In both older/newer cores case, SW reached till last point of execution (wfi() instruction) without
any aborts/wake ups but one is telling CPUidle a success and other is telling a failure.

For the state demoted part, 

The success filled by firmware along with the return to the same place (instead of entry address) is
kind of indication that we entered demoted state-X instead of state-Y.

From the spec, platform-coordinated mode power state parameter point,

 | In platform-coordinated mode, the semantic expressed by the caller through the power state parameter is
 | not a mandatory requirement to enter the specific state. Instead, the power state parameter indicates
 | the deepest state the caller can tolerate.

Assuming CPU requested deepest power down state-Y which got demoted to shallower retention state-X, such
demotion are not really a failure as it was not mandatory to enter state-Y, it was only a indication of
tolerance till state-Y to firmware.

Thanks,
Maulik

> 
> Let me know, I am just trying to understand this patch's goal.
> 
> Thanks,
> Lorenzo



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-03-23 10:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-16  8:48 [PATCH] arm64: suspend: Remove forcing error from suspend finisher Maulik Shah
2026-03-19 14:11 ` Will Deacon
2026-03-19 15:14 ` Sudeep Holla
2026-03-19 16:52   ` Sudeep Holla
2026-03-20 10:13     ` Lorenzo Pieralisi
2026-03-20 11:10       ` Sudeep Holla
2026-03-23  9:19         ` Maulik Shah (mkshah)
2026-03-23 10:30       ` Maulik Shah (mkshah)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox