Devicetree
* [PATCH RFC] arm64: dts: qcom: hamoa: Drop cluster_cl5 idle state from CPU clusters
@ 2026-06-04 17:40 Jens Glathe via B4 Relay
From: Jens Glathe via B4 Relay @ 2026-06-04 17:40 UTC (permalink / raw)
  To: Bjorn Andersson, Konrad Dybcio, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley
  Cc: linux-arm-msm, devicetree, linux-kernel, Steev Klimaszewski,
	Icecream95, Marc Zyngier, Jens Glathe

From: Jens Glathe <jens.glathe@oldschoolsolutions.biz>

The cluster_cl5 idle state triggers DC ZVA misbehavior that resets
X1 SoCs. Remove it from cluster_pd0/1/2 domain-idle-states for now.

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Jens Glathe <jens.glathe@oldschoolsolutions.biz>
---
This is an RFC for a mitigation of a stability issue observed on
Snapdragon X1-based SoCs (Hamoa and Purwa).

Affected systems experience spontaneous resets under the following
conditions:
 - During intensive `git fetch` / `git pull` activity
 - During mostly idle periods (Bitburner and similar workloads were
   frequently mentioned)

Steev Klimaszewski first connected the crashes to git operations.
Subsequent discussion in #aarch64-laptops led icecream95 to isolate
DC ZVA as the triggering instruction and to create a reliable
reproducer [1].

Further debugging showed that the issue is strongly related to deep
cluster idle states. Marc Zyngier suggested removing the deepest
cluster state (`cluster_cl5`), which resolved the problem on all tested
consumer hardware.

This patch implements that change by removing `&cluster_cl5` from the
`domain-idle-states` of `cluster_pd0`, `cluster_pd1`, and `cluster_pd2`.

Testing:
 - Lenovo ThinkPad T14s G6 (X1E-78-100, Hamoa)
 - Lenovo ThinkBook 16 G7 QOY (X1P-42-100, Purwa)
 - Lenovo IdeaPad 5 2-in-1 14Q8X9 (X1P-42-100, Purwa)
 - Lenovo IdeaPad Slim 3x 15Q8X10 (X1-26-100, Purwa)

All consumer devices became stable with this change.

On the Snapdragon Dev Kit (X1E-001-DE, Hamoa) the situation is
different: the firmware does not advertise OSI mode, and even with this
patch the device still crashes with the x1e-crash reproducer. Stability
is only achieved by booting with `cpuidle.off=1`, which increases power
consumption but also makes the devkit slightly faster.

The different behaviour correlates with the PSCI mode:
- Consumer firmware enables OSI mode
- Devkit firmware stays in platform-coordinated mode

This patch is therefore only a band-aid. All evidence points to a
firmware/microcode issue where DC ZVA can hit caches that have been
powered down by PSCI idle states. A proper fix would be either a
Qualcomm firmware update or a kernel erratum workaround that disables
DZE on these SoCs.

[1] https://github.com/icecream95/x1e-crash
---
 arch/arm64/boot/dts/qcom/hamoa.dtsi | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/hamoa.dtsi b/arch/arm64/boot/dts/qcom/hamoa.dtsi
index 4ba751a65142b..8ec39ba621946 100644
--- a/arch/arm64/boot/dts/qcom/hamoa.dtsi
+++ b/arch/arm64/boot/dts/qcom/hamoa.dtsi
@@ -442,19 +442,19 @@ cpu_pd11: power-domain-cpu11 {
 
 		cluster_pd0: power-domain-cpu-cluster0 {
 			#power-domain-cells = <0>;
-			domain-idle-states = <&cluster_cl4>, <&cluster_cl5>;
+			domain-idle-states = <&cluster_cl4>;
 			power-domains = <&system_pd>;
 		};
 
 		cluster_pd1: power-domain-cpu-cluster1 {
 			#power-domain-cells = <0>;
-			domain-idle-states = <&cluster_cl4>, <&cluster_cl5>;
+			domain-idle-states = <&cluster_cl4>;
 			power-domains = <&system_pd>;
 		};
 
 		cluster_pd2: power-domain-cpu-cluster2 {
 			#power-domain-cells = <0>;
-			domain-idle-states = <&cluster_cl4>, <&cluster_cl5>;
+			domain-idle-states = <&cluster_cl4>;
 			power-domains = <&system_pd>;
 		};
 

---
base-commit: a225caacc36546a09586e3ece36c0313146e7da9
change-id: 20260604-dc_zva_mitigation-245ecd5d797f

Best regards,
-- 
Jens Glathe <jens.glathe@oldschoolsolutions.biz>




* Re: [PATCH RFC] arm64: dts: qcom: hamoa: Drop cluster_cl5 idle state from CPU clusters
From: Marc Zyngier @ 2026-06-05  8:09 UTC (permalink / raw)
  To: jens.glathe
  Cc: Bjorn Andersson, Konrad Dybcio, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, linux-arm-msm, devicetree, linux-kernel,
	Steev Klimaszewski, Icecream95

Hi Jens,

Thanks for sending this.

On Thu, 04 Jun 2026 18:40:14 +0100,
Jens Glathe via B4 Relay <devnull+jens.glathe.oldschoolsolutions.biz@kernel.org> wrote:
> 
> From: Jens Glathe <jens.glathe@oldschoolsolutions.biz>
> 
> The cluster_cl5 idle state triggers DC ZVA misbehavior that resets
> X1 SoCs. Remove it from cluster_pd0/1/2 domain-idle-states for now.
> 
> Suggested-by: Marc Zyngier <maz@kernel.org>
> Signed-off-by: Jens Glathe <jens.glathe@oldschoolsolutions.biz>
> ---
> This is an RFC for a mitigation of a stability issue observed on
> Snapdragon X1-based SoCs (Hamoa and Purwa).
> 
> Affected systems experience spontaneous resets under the following
> conditions:
>  - During intensive `git fetch` / `git pull` activity
>  - During mostly idle periods (Bitburner and similar workloads were
>    frequently mentioned)
> 
> Steev Klimaszewski first connected the crashes to git operations.
> Subsequent discussion in #aarch64-laptops led icecream95 to isolate
> DC ZVA as the triggering instruction and to create a reliable
> reproducer [1].
> 
> Further debugging showed that the issue is strongly related to deep
> cluster idle states. Marc Zyngier suggested removing the deepest
> cluster state (`cluster_cl5`), which resolved the problem on all tested
> consumer hardware.
> 
> This patch implements that change by removing `&cluster_cl5` from the
> `domain-idle-states` of `cluster_pd0`, `cluster_pd1`, and `cluster_pd2`.
> 
> Testing:
>  - Lenovo ThinkPad T14s G6 (X1E-78-100, Hamoa)
>  - Lenovo ThinkBook 16 G7 QOY (X1P-42-100, Purwa)
>  - Lenovo IdeaPad 5 2-in-1 14Q8X9 (X1P-42-100, Purwa)
>  - Lenovo IdeaPad Slim 3x 15Q8X10 (X1-26-100, Purwa)
> 
> All consumer devices became stable with this change.
> 
> On the Snapdragon Dev Kit (X1E-001-DE, Hamoa) the situation is
> different: the firmware does not advertise OSI mode, and even with this
> patch the device still crashes with the x1e-crash reproducer. Stability
> is only achieved by booting with `cpuidle.off=1`, which increases power
> consumption but also makes the devkit slightly faster.
> 
> The different behaviour correlates with the PSCI mode:
> - Consumer firmware enables OSI mode
> - Devkit firmware stays in platform-coordinated mode
> 
> This patch is therefore only a band-aid. All evidence points to a
> firmware/microcode issue where DC ZVA can hit caches that have been
> powered down by PSCI idle states. A proper fix would be either a
> Qualcomm firmware update or a kernel erratum workaround that disables
> DZE on these SoCs.
> 
> [1] https://github.com/icecream95/x1e-crash
> ---
>  arch/arm64/boot/dts/qcom/hamoa.dtsi | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/boot/dts/qcom/hamoa.dtsi b/arch/arm64/boot/dts/qcom/hamoa.dtsi
> index 4ba751a65142b..8ec39ba621946 100644
> --- a/arch/arm64/boot/dts/qcom/hamoa.dtsi
> +++ b/arch/arm64/boot/dts/qcom/hamoa.dtsi
> @@ -442,19 +442,19 @@ cpu_pd11: power-domain-cpu11 {
>  
>  		cluster_pd0: power-domain-cpu-cluster0 {
>  			#power-domain-cells = <0>;
> -			domain-idle-states = <&cluster_cl4>, <&cluster_cl5>;
> +			domain-idle-states = <&cluster_cl4>;
>  			power-domains = <&system_pd>;
>  		};
>  
>  		cluster_pd1: power-domain-cpu-cluster1 {
>  			#power-domain-cells = <0>;
> -			domain-idle-states = <&cluster_cl4>, <&cluster_cl5>;
> +			domain-idle-states = <&cluster_cl4>;
>  			power-domains = <&system_pd>;
>  		};
>  
>  		cluster_pd2: power-domain-cpu-cluster2 {
>  			#power-domain-cells = <0>;
> -			domain-idle-states = <&cluster_cl4>, <&cluster_cl5>;
> +			domain-idle-states = <&cluster_cl4>;
>  			power-domains = <&system_pd>;
>  		};
>  
> 

It may be worth adding a comment somewhere in the DTS file, as
cluster_cl5 is not referenced anymore.
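
As a rough sketch of what such a comment could look like (the wording
is only a suggestion and is not part of the posted patch; the node
contents are copied from the hunk for cluster_pd0, and the same note
would apply to cluster_pd1 and cluster_pd2):

```dts
cluster_pd0: power-domain-cpu-cluster0 {
	#power-domain-cells = <0>;
	/*
	 * Hypothetical comment text: cluster_cl5 is deliberately not
	 * listed here. Entering it has been observed to make DC ZVA
	 * reset the SoC, so only cluster_cl4 is used for now.
	 */
	domain-idle-states = <&cluster_cl4>;
	power-domains = <&system_pd>;
};
```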

Ideally we'd simply mark cluster-sleep-1 with 'status = "disabled"',
but I'm not sure Linux (and other OSs that consume this) actively
parse this property.
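
For illustration only, disabling the state through its label would
look roughly like the fragment below. It assumes that cluster_cl5 is
the label on the cluster-sleep-1 node, and it does not answer the open
question of whether consumers actually skip disabled idle states:

```dts
&cluster_cl5 {
	/* Assumption: consumers of domain-idle-states skip disabled nodes. */
	status = "disabled";
};
```

With this approach the domain-idle-states lists could stay as they
are, but it would only help if every consumer of the DT honours the
property.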

Overall, I'd like clarity from the vendor on what can be done to
better mitigate issues like this. So far, we have been randomly
disabling features and CPU capabilities each and every time we find
something broken on these machines, and the list is getting long.

I don't think such a course of action is sustainable, and maybe we
should simply consider marking the full X1 platform as BROKEN so that
people know what to expect.

Thanks,

	M.

-- 
Jazz isn't dead. It just smells funny.
