Devicetree
From: Marc Zyngier <maz@kernel.org>
To: jens.glathe@oldschoolsolutions.biz
Cc: Bjorn Andersson <andersson@kernel.org>,
	Konrad Dybcio <konradybcio@kernel.org>,
	Rob Herring <robh@kernel.org>,
	Krzysztof Kozlowski <krzk+dt@kernel.org>,
	Conor Dooley <conor+dt@kernel.org>,
	linux-arm-msm@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Steev Klimaszewski <threeway@gmail.com>,
	Icecream95 <ixn@disroot.org>
Subject: Re: [PATCH RFC] arm64: dts: qcom: hamoa: Drop cluster_cl5 idle state from CPU clusters
Date: Fri, 05 Jun 2026 09:09:23 +0100
Message-ID: <87bjdp9znw.wl-maz@kernel.org>
In-Reply-To: <20260604-dc_zva_mitigation-v1-1-d1148c1c0259@oldschoolsolutions.biz>

Hi Jens,

Thanks for sending this.

On Thu, 04 Jun 2026 18:40:14 +0100,
Jens Glathe via B4 Relay <devnull+jens.glathe.oldschoolsolutions.biz@kernel.org> wrote:
> 
> From: Jens Glathe <jens.glathe@oldschoolsolutions.biz>
> 
> The cluster_cl5 idle state triggers DC ZVA misbehavior that resets
> X1 SoCs. Remove it from cluster_pd0/1/2 domain-idle-states for now.
> 
> Suggested-by: Marc Zyngier <maz@kernel.org>
> Signed-off-by: Jens Glathe <jens.glathe@oldschoolsolutions.biz>
> ---
> This is an RFC for a mitigation of a stability issue observed on
> Snapdragon X1-based SoCs (Hamoa and Purwa).
> 
> Affected systems experience spontaneous resets under the following
> conditions:
>  - During intensive `git fetch` / `git pull` activity
>  - During mostly idle periods (Bitburner and similar workloads were
>    frequently mentioned)
> 
> Steev Klimaszewski first connected the crashes to git operations.
> Subsequent discussion in #aarch64-laptops led icecream95 to isolate
> DC ZVA as the triggering instruction and to create a reliable
> reproducer [1].
> 
> Further debugging showed that the issue is strongly related to deep
> cluster idle states. Marc Zyngier suggested removing the deepest
> cluster state (`cluster_cl5`), which resolved the problem on all tested
> consumer hardware.
> 
> This patch implements that change by removing `&cluster_cl5` from the
> `domain-idle-states` of `cluster_pd0`, `cluster_pd1`, and `cluster_pd2`.
> 
> Testing:
>  - Lenovo ThinkPad T14s G6 (X1E-78-100, Hamoa)
>  - Lenovo ThinkBook 16 G7 QOY (X1P-42-100, Purwa)
>  - Lenovo IdeaPad 5 2-in-1 14Q8X9 (X1P-42-100, Purwa)
>  - Lenovo IdeaPad Slim 3x 15Q8X10 (X1-26-100, Purwa)
> 
> All consumer devices became stable with this change.
> 
> On the Snapdragon Dev Kit (X1E-001-DE, Hamoa) the situation is
> different: the firmware does not advertise OSI mode. Even with this
> patch the device still crashes with the x1e-crash reproducer. Stability
> is only achieved by passing `cpuidle.off=1`, which of course increases
> power consumption but makes the devkit a bit faster, so there's that.
> 
> The different behaviour correlates with PSCI mode:
> - Consumer firmwares enable OSI mode
> - Devkit firmware stays in platform-coordinated mode
> 
> This patch is therefore only a band-aid. All evidence points to a
> firmware/microcode issue where DC ZVA can hit caches that have been
> powered down by PSCI idle states. A proper fix would be either a
> Qualcomm firmware update or a kernel erratum that disables DZE on
> these SoCs.
> 
> [1] https://github.com/icecream95/x1e-crash
> ---
>  arch/arm64/boot/dts/qcom/hamoa.dtsi | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/boot/dts/qcom/hamoa.dtsi b/arch/arm64/boot/dts/qcom/hamoa.dtsi
> index 4ba751a65142b..8ec39ba621946 100644
> --- a/arch/arm64/boot/dts/qcom/hamoa.dtsi
> +++ b/arch/arm64/boot/dts/qcom/hamoa.dtsi
> @@ -442,19 +442,19 @@ cpu_pd11: power-domain-cpu11 {
>  
>  		cluster_pd0: power-domain-cpu-cluster0 {
>  			#power-domain-cells = <0>;
> -			domain-idle-states = <&cluster_cl4>, <&cluster_cl5>;
> +			domain-idle-states = <&cluster_cl4>;
>  			power-domains = <&system_pd>;
>  		};
>  
>  		cluster_pd1: power-domain-cpu-cluster1 {
>  			#power-domain-cells = <0>;
> -			domain-idle-states = <&cluster_cl4>, <&cluster_cl5>;
> +			domain-idle-states = <&cluster_cl4>;
>  			power-domains = <&system_pd>;
>  		};
>  
>  		cluster_pd2: power-domain-cpu-cluster2 {
>  			#power-domain-cells = <0>;
> -			domain-idle-states = <&cluster_cl4>, <&cluster_cl5>;
> +			domain-idle-states = <&cluster_cl4>;
>  			power-domains = <&system_pd>;
>  		};
>  
> 

It may be worth adding a comment somewhere in the DTS file, as
cluster_cl5 is not referenced anymore.
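
Something along these lines would do, purely as a sketch: the node is
taken from the hunk above, while the comment wording is only a
suggestion and would need adjusting to match the final commit message.

```dts
cluster_pd0: power-domain-cpu-cluster0 {
	#power-domain-cells = <0>;
	/*
	 * cluster_cl5 is intentionally not listed here: entering it
	 * has been seen to trigger resets when DC ZVA is used on
	 * X1 SoCs. Re-add it once the underlying issue is fixed.
	 */
	domain-idle-states = <&cluster_cl4>;
	power-domains = <&system_pd>;
};
```

The same note would apply to cluster_pd1 and cluster_pd2.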

Ideally we'd simply mark cluster-sleep-1 with 'status = "disabled"',
but I'm not sure Linux (and other OSs that consume this) actively
parse this property.
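
For reference, that variant would look roughly like the fragment
below. It assumes the cluster_cl5 label is the one attached to
cluster-sleep-1, and whether consumers honour 'status' on an
idle-state node is exactly the open question, so treat it as untested.

```dts
/* Untested sketch: assumes cluster_cl5 labels the cluster-sleep-1 node. */
&cluster_cl5 {
	status = "disabled";
};
```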

Overall, I'd like clarity from the vendor on what can be done to
better mitigate issues like this. So far, we have been randomly
disabling features and CPU capabilities each and every time we find
something broken on these machines, and the list is getting long.

I don't think such a course of action is sustainable, and maybe we
should simply consider marking the full X1 platform as BROKEN so that
people know what to expect.

Thanks,

	M.

-- 
Jazz isn't dead. It just smells funny.
