Questions About SM8550 Support

Linux ARM-MSM sub-architecture

* Questions About SM8550 Support
@ 2026-01-27 22:48 Aaron Kling
  2026-01-28  8:50 ` Neil Armstrong
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Aaron Kling @ 2026-01-27 22:48 UTC
  To: linux-arm-msm

I am working on the AYN Odin 2 qcs8550 series of devices, specifically
for Android, using mainline kernel drivers. I have come across some
missing functionality and failures that I would like to inquire about.

* ABL fails to load a dtbo using a baseline dtb unmodified from
mainline. Using changes described in the gunyah watchdog thread [0], a
dtbo loads and the devices boot as expected. If any of the changes in
that post don't exist in the base dtb, abl will fail to load the dtbo
and go to the bootloader menu. This appears to be an issue in the
baseline abl code, affecting all devices of that generation. Would it
be allowable to merge a change adding those changes to the sm8550
dtsi, allowing an unmodified mainline dtb to work with overlays?

* SM8550 does not have cpu opp tables, thus cpufreq does not work. I
have locally copied the commits from sm8650 and adapted for sm8550,
and that seems to work okay. But no measuring of bandwidth was done,
so the numbers are likely not entirely correct. Is there any plan to
generate correct tables for sm8550?

* As part of a series to support the original Odin 2, a patch to
update sm8550 EAS values was submitted [1]. But that series stalled
and this was never merged. If this change is valid, which per that
discussion it appears to be, can it be resubmitted by itself and
merged?

* Per the mainline kernel device trees and audio topology provided by
the OEM, these devices use primary i2s for the speakers path. There
was a commit adding clock support for that as part of an hdmi series
[2], but that seems to have stalled. Is this going to be picked back
up?

* Inline crypto fails to detect hwkm support. I see other logs
online, such as for the sm8550 qrd, that show the same messages as my
device. I traced the issue to the check for wrapped key support [3]. On
my devices, the derive call is supported, but the other three calls
are not. I was pointed at the downstream headers for sm8550 support
and only derive is listed there, the other three don't appear to be
used in the downstream driver. Is this expected? And if so, will this
case be added to the mainline drivers?

* Some gpu related clocks complain about being stuck off during boot,
causing stack traces, but the gpu does work. I tried to do some
research into this, but quickly got lost in the weeds and I have no
idea where to even look.
[    0.367278] gpu_cc_cxo_clk status stuck at 'off'
[    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
[    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
[    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'

* Sometimes when starting rendering, a bandwidth submission times out,
then the driver immediately complains that said id was left on the
queue. I have tried increasing the timeout, but the same sequence
still happens: the timeout is immediately followed by a matching
unexpected response, implying that this isn't actually a delay /
timeout issue.
[ 1848.517020] platform 3d6a000.gmu:
[drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
[ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
*ERROR* Unexpected message id 1015 on the response queue

* Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
unsure if this is a kernel problem or userspace, so I'm submitting
here first. If the consensus is that it's a userspace issue, I'll
submit it to mesa.
[ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
00000001512E9000/003d ib2 00000001512E7000/0000
[ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
[msm]] *ERROR* 67.5.10.1: hangcheck recover!
[ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
[msm]] *ERROR* 67.5.10.1: offending task: Thread-23
(com.futuremark.dmandroid.application)
[ 1860.258126] revision: 0 (67.5.10.1)
[ 1860.258132] rb 0: fence:    2884/2884
[ 1860.258133] rptr:     36
[ 1860.258134] rb wptr:  36
[ 1860.258135] rb 1: fence:    -256/-256
[ 1860.258138] rptr:     0
[ 1860.258138] rb wptr:  0
[ 1860.258139] rb 2: fence:    41563/41569
[ 1860.258140] rptr:     1752
[ 1860.258140] rb wptr:  2319
[ 1860.258141] rb 3: fence:    -256/-256
[ 1860.258141] rptr:     0
[ 1860.258142] rb wptr:  0
[ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
[ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
[ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
CP_SCRATCH_REG2: 41562
[ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
[ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
CP_SCRATCH_REG4: 3736059565
[ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
CP_SCRATCH_REG5: 3736059565
[ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
CP_SCRATCH_REG6: 3736059565
[ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
CP_SCRATCH_REG7: 3736059565
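
The timeout / late-reply pairing described in the HFI item above can be
checked mechanically. A minimal sketch, assuming a Python 3 host and that
the dmesg lines have been unwrapped to one message per line; the function
and pattern names are illustrative and not part of any kernel tooling:

```python
import re

# Patterns for the two GMU HFI messages quoted above (one message per line).
TIMEOUT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\].*id (\d+) timed out waiting for response")
UNEXPECTED_RE = re.compile(
    r"\[\s*(\d+\.\d+)\].*Unexpected message id (\d+) on the response queue")


def pair_hfi_timeouts(lines):
    """Return (msg_id, gap_seconds) for each timeout that is later
    followed by an 'unexpected message' line carrying the same id."""
    pending = {}  # msg_id -> timestamp of the timeout line
    pairs = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            pending[int(m.group(2))] = float(m.group(1))
            continue
        m = UNEXPECTED_RE.search(line)
        if m and int(m.group(2)) in pending:
            msg_id = int(m.group(2))
            gap = float(m.group(1)) - pending.pop(msg_id)
            pairs.append((msg_id, round(gap, 6)))
    return pairs
```

Run against the two HFI lines quoted above, this should pair id 1015 with
a gap of about 1 ms, which would be consistent with the reply arriving
just after the wait gave up rather than never arriving.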

Aaron

[0] https://lore.kernel.org/all/91002189-9d9e-48a2-8424-c42705fed3f8@quicinc.com/
[1] https://lore.kernel.org/all/20240424-ayn-odin2-initial-v1-7-e0aa05c991fd@gmail.com/
[2] https://lore.kernel.org/all/20251008-topic-sm8x50-next-hdk-i2s-v2-3-6b7d38d4ad5e@linaro.org/
[3] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/firmware/qcom/qcom_scm.c?h=v6.18#n1285

* Re: Questions About SM8550 Support
  2026-01-27 22:48 Questions About SM8550 Support Aaron Kling
@ 2026-01-28  8:50 ` Neil Armstrong
  2026-01-28 14:46   ` Rob Clark
  2026-01-28 18:42   ` Aaron Kling
  2026-01-28 14:03 ` Konrad Dybcio
  2026-01-30 11:01 ` Konrad Dybcio
  2 siblings, 2 replies; 31+ messages in thread
From: Neil Armstrong @ 2026-01-28  8:50 UTC
  To: Aaron Kling, linux-arm-msm

Hi,

On 1/27/26 23:48, Aaron Kling wrote:
> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> for Android, using mainline kernel drivers. I have come across some
> missing functionality and failures that I would like to inquire about.
> 
> * ABL fails to load a dtbo using a baseline dtb unmodified from
> mainline. Using changes described in the gunyah watchdog thread [0], a
> dtbo loads and the devices boot as expected. If any of the changes in
> that post don't exist in the base dtb, abl will fail to load the dtbo
> and go to the bootloader menu. This appears to be an issue in the
> baseline abl code, affecting all devices of that generation. Would it
> be allowable to merge a change adding those changes to the sm8550
> dtsi, allowing an unmodified mainline dtb to work with overlays?

Any addition to the DT must be documented in dt-bindings, so if the changes
are needed for boot, they should be documented and added for sure.

> 
> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
> have locally copied the commits from sm8650 and adapted for sm8550,
> and that seems to work okay. But no measuring of bandwidth was done,
> so the numbers are likely not entirely correct. Is there any plan to
> generate correct tables for sm8550?

Cpufreq works but not the interconnect scaling, so doing the same as sm8650
is fine but since the values were calculated from downstream DT tables,
the same should be done for sm8550.

> 
> * As part of a series to support the original Odin 2, a patch to
> update sm8550 EAS values was submitted [1]. But that series stalled
> and this was never merged. If this change is valid, which per that
> discussion it appears to be, can it be resubmitted by itself and
> merged?

I missed this patch, please re-submit, I also need to update the ones
for SM8650.

> 
> * Per the mainline kernel device trees and audio topology provided by
> the oem, these devices use primary i2s for the speakers path. There
> was a commit adding clock support for that as part of an hdmi series
> [2], but that seems to have stalled. Is this going to be picked back
> up?

No, I do not plan to do this work. It required adding callbacks in the
code to handle the clocks, as was done for the pre-audioreach firmwares.

> 
> * Inline crypto fails to detect hwkm support. And I see other logs
> online, such as for the sm8550 qrd, that logs the same way my device
> does. I traced the issue to the check for wrapped key support [3]. On
> my devices, the derive call is supported, but the other three calls
> are not. I was pointed at the downstream headers for sm8550 support
> and only derive is listed there, the other three don't appear to be
> used in the downstream driver. Is this expected? And if so, will this
> case be added to the mainline drivers?

Does hwkm work if you remove the last 3 calls?

> 
> * Some gpu related clocks complain about being stuck off during boot,
> causing stack traces, but the gpu does work. I tried to do some
> research into this, but quickly got lost in the weeds and I have no
> idea where to even look.
> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'

This may be related to the display handoff from ABL. Did you add the
plat region to the reserved memories?

> 
> * Sometimes when starting rendering, a bandwidth submission times out,
> then the driver immediately complains that said id was left on the
> queue. I have tried increasing the timeout, but the same sequence
> still happens. Timeout happens, immediately followed by a matching
> unexpected response. Implying that this isn't actually a delay /
> timeout issue.
> [ 1848.517020] platform 3d6a000.gmu:
> [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
> HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
> [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
> *ERROR* Unexpected message id 1015 on the response queue

Weird, the timeout was extended for this very purpose.

> 
> * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
> unsure if this is a kernel problem or userspace, so I'm submitting
> here first. If the consensus is that it's a userspace issue, I'll
> submit it to mesa.
> [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
> fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
> 00000001512E9000/003d ib2 00000001512E7000/0000
> [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
> [msm]] *ERROR* 67.5.10.1: hangcheck recover!
> [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
> [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
> (com.futuremark.dmandroid.application)
> [ 1860.258126] revision: 0 (67.5.10.1)
> [ 1860.258132] rb 0: fence:    2884/2884
> [ 1860.258133] rptr:     36
> [ 1860.258134] rb wptr:  36
> [ 1860.258135] rb 1: fence:    -256/-256
> [ 1860.258138] rptr:     0
> [ 1860.258138] rb wptr:  0
> [ 1860.258139] rb 2: fence:    41563/41569
> [ 1860.258140] rptr:     1752
> [ 1860.258140] rb wptr:  2319
> [ 1860.258141] rb 3: fence:    -256/-256
> [ 1860.258141] rptr:     0
> [ 1860.258142] rb wptr:  0
> [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
> [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
> [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> CP_SCRATCH_REG2: 41562
> [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
> [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> CP_SCRATCH_REG4: 3736059565
> [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> CP_SCRATCH_REG5: 3736059565
> [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> CP_SCRATCH_REG6: 3736059565
> [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> CP_SCRATCH_REG7: 3736059565

@rob do you have any idea how to solve this crash on a740 ?

Neil

> 
> Aaron
> 
> [0] https://lore.kernel.org/all/91002189-9d9e-48a2-8424-c42705fed3f8@quicinc.com/
> [1] https://lore.kernel.org/all/20240424-ayn-odin2-initial-v1-7-e0aa05c991fd@gmail.com/
> [2] https://lore.kernel.org/all/20251008-topic-sm8x50-next-hdk-i2s-v2-3-6b7d38d4ad5e@linaro.org/
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/firmware/qcom/qcom_scm.c?h=v6.18#n1285


* Re: Questions About SM8550 Support
  2026-01-27 22:48 Questions About SM8550 Support Aaron Kling
  2026-01-28  8:50 ` Neil Armstrong
@ 2026-01-28 14:03 ` Konrad Dybcio
  2026-01-28 18:20   ` Aaron Kling
  2026-02-02  9:35   ` Taniya Das
  2026-01-30 11:01 ` Konrad Dybcio
  2 siblings, 2 replies; 31+ messages in thread
From: Konrad Dybcio @ 2026-01-28 14:03 UTC
  To: Aaron Kling, linux-arm-msm, Taniya Das

On 1/27/26 11:48 PM, Aaron Kling wrote:
> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> for Android, using mainline kernel drivers. I have come across some
> missing functionality and failures that I would like to inquire about.

[...]

> * Some gpu related clocks complain about being stuck off during boot,
> causing stack traces, but the gpu does work. I tried to do some
> research into this, but quickly got lost in the weeds and I have no
> idea where to even look.
> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'

I'm really scratching my head trying to understand how these GPU
clocks could not turn on, they barely have any dependencies other than
"the chip shouldn't be offline"

+Taniya ideas? This is 8550

Konrad

* Re: Questions About SM8550 Support
  2026-01-28  8:50 ` Neil Armstrong
@ 2026-01-28 14:46   ` Rob Clark
  2026-01-28 17:54     ` Aaron Kling
  2026-01-28 18:42   ` Aaron Kling
  1 sibling, 1 reply; 31+ messages in thread
From: Rob Clark @ 2026-01-28 14:46 UTC
  To: Neil Armstrong; +Cc: Aaron Kling, linux-arm-msm, Akhil P Oommen

On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
<neil.armstrong@linaro.org> wrote:
>
> Hi,
>
> On 1/27/26 23:48, Aaron Kling wrote:
> > I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> > for Android, using mainline kernel drivers. I have come across some
> > missing functionality and failures that I would like to inquire about.
> >
> > * ABL fails to load a dtbo using a baseline dtb unmodified from
> > mainline. Using changes described in the gunyah watchdog thread [0], a
> > dtbo loads and the devices boot as expected. If any of the changes in
> > that post don't exist in the base dtb, abl will fail to load the dtbo
> > and go to the bootloader menu. This appears to be an issue in the
> > baseline abl code, affecting all devices of that generation. Would it
> > be allowable to merge a change adding those changes to the sm8550
> > dtsi, allowing an unmodified mainline dtb to work with overlays?
>
> Any addition to the DT must be documented in dt-bindings, so if it's needed
> for boot they should be documented and added for sure.
>
> >
> > * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
> > have locally copied the commits from sm8650 and adapted for sm8550,
> > and that seems to work okay. But no measuring of bandwidth was done,
> > so the numbers are likely not entirely correct. Is there any plan to
> > generate correct tables for sm8550?
>
> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
> is fine but since the values were calculated from downstream DT tables,
> the same should be done for sm8550.
>
> >
> > * As part of a series to support the original Odin 2, a patch to
> > update sm8550 EAS values was submitted [1]. But that series stalled
> > and this was never merged. If this change is valid, which per that
> > discussion it appears to be, can it be resubmitted by itself and
> > merged?
>
> I missed this patch, please re-submit, I also need to update the ones
> for SM8650.
>
> >
> > * Per the mainline kernel device trees and audio topology provided by
> > the oem, these devices use primary i2s for the speakers path. There
> > was a commit adding clock support for that as part of an hdmi series
> > [2], but that seems to have stalled. Is this going to be picked back
> > up?
>
> No, I do not plan to do this work, it required adding callbacks in the
> code to handle the clocks like done for the pre-audioreach firmwares.
>
> >
> > * Inline crypto fails to detect hwkm support. And I see other logs
> > online, such as for the sm8550 qrd, that logs the same way my device
> > does. I traced the issue to the check for wrapped key support [3]. On
> > my devices, the derive call is supported, but the other three calls
> > are not. I was pointed at the downstream headers for sm8550 support
> > and only derive is listed there, the other three don't appear to be
> > used in the downstream driver. Is this expected? And if so, will this
> > case be added to the mainline drivers?
>
> Does hwkm work if you remove the last 3 calls?
>
> >
> > * Some gpu related clocks complain about being stuck off during boot,
> > causing stack traces, but the gpu does work. I tried to do some
> > research into this, but quickly got lost in the weeds and I have no
> > idea where to even look.
> > [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> > [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> > [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> > [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
>
> This may be related with the display handoff from ABL, did you add the
> plat region to the reserved memories ?
>
> >
> > * Sometimes when starting rendering, a bandwidth submission times out,
> > then the driver immediately complains that said id was left on the
> > queue. I have tried increasing the timeout, but the same sequence
> > still happens. Timeout happens, immediately followed by a matching
> > unexpected response. Implying that this isn't actually a delay /
> > timeout issue.
> > [ 1848.517020] platform 3d6a000.gmu:
> > [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
> > HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
> > [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
> > *ERROR* Unexpected message id 1015 on the response queue
>
> Weird the timeout was extended for this very purpose
>
> >
> > * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
> > unsure if this is a kernel problem or userspace, so I'm submitting
> > here first. If the consensus is that it's a userspace issue, I'll
> > submit it to mesa.
> > [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
> > fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
> > 00000001512E9000/003d ib2 00000001512E7000/0000
> > [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
> > [msm]] *ERROR* 67.5.10.1: hangcheck recover!
> > [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
> > [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
> > (com.futuremark.dmandroid.application)
> > [ 1860.258126] revision: 0 (67.5.10.1)
> > [ 1860.258132] rb 0: fence:    2884/2884
> > [ 1860.258133] rptr:     36
> > [ 1860.258134] rb wptr:  36
> > [ 1860.258135] rb 1: fence:    -256/-256
> > [ 1860.258138] rptr:     0
> > [ 1860.258138] rb wptr:  0
> > [ 1860.258139] rb 2: fence:    41563/41569
> > [ 1860.258140] rptr:     1752
> > [ 1860.258140] rb wptr:  2319
> > [ 1860.258141] rb 3: fence:    -256/-256
> > [ 1860.258141] rptr:     0
> > [ 1860.258142] rb wptr:  0
> > [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
> > [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
> > [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > CP_SCRATCH_REG2: 41562
> > [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
> > [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > CP_SCRATCH_REG4: 3736059565
> > [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > CP_SCRATCH_REG5: 3736059565
> > [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > CP_SCRATCH_REG6: 3736059565
> > [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > CP_SCRATCH_REG7: 3736059565
>
> @rob do you have any idea how to solve this crash on a740 ?

The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
something is unhappy about gpu pm.  I'd focus on that first, since
that is almost certainly the cause of the later issues.  If things
_sorta_ work (rendering UI, etc) you could try removing all but the
lowest gpu OPP as an experiment.  Could be that power related problems
surface when the GPU ramps up to higher OPPs.

BR,
-R

> Neil
>
> >
> > Aaron
> >
> > [0] https://lore.kernel.org/all/91002189-9d9e-48a2-8424-c42705fed3f8@quicinc.com/
> > [1] https://lore.kernel.org/all/20240424-ayn-odin2-initial-v1-7-e0aa05c991fd@gmail.com/
> > [2] https://lore.kernel.org/all/20251008-topic-sm8x50-next-hdk-i2s-v2-3-6b7d38d4ad5e@linaro.org/
> > [3] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/firmware/qcom/qcom_scm.c?h=v6.18#n1285
>
>

* Re: Questions About SM8550 Support
  2026-01-28 14:46   ` Rob Clark
@ 2026-01-28 17:54     ` Aaron Kling
  2026-01-29 23:11       ` Akhil P Oommen
  0 siblings, 1 reply; 31+ messages in thread
From: Aaron Kling @ 2026-01-28 17:54 UTC
  To: rob.clark; +Cc: Neil Armstrong, linux-arm-msm, Akhil P Oommen

On Wed, Jan 28, 2026 at 8:46 AM Rob Clark <rob.clark@oss.qualcomm.com> wrote:
>
> On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
> <neil.armstrong@linaro.org> wrote:
> >
> > Hi,
> >
> > On 1/27/26 23:48, Aaron Kling wrote:
> > > I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> > > for Android, using mainline kernel drivers. I have come across some
> > > missing functionality and failures that I would like to inquire about.
> > >
> > > * ABL fails to load a dtbo using a baseline dtb unmodified from
> > > mainline. Using changes described in the gunyah watchdog thread [0], a
> > > dtbo loads and the devices boot as expected. If any of the changes in
> > > that post don't exist in the base dtb, abl will fail to load the dtbo
> > > and go to the bootloader menu. This appears to be an issue in the
> > > baseline abl code, affecting all devices of that generation. Would it
> > > be allowable to merge a change adding those changes to the sm8550
> > > dtsi, allowing an unmodified mainline dtb to work with overlays?
> >
> > Any addition to the DT must be documented in dt-bindings, so if it's needed
> > for boot they should be documented and added for sure.
> >
> > >
> > > * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
> > > have locally copied the commits from sm8650 and adapted for sm8550,
> > > and that seems to work okay. But no measuring of bandwidth was done,
> > > so the numbers are likely not entirely correct. Is there any plan to
> > > generate correct tables for sm8550?
> >
> > Cpufreq works but not the interconnect scaling, so doing the same as sm8650
> > is fine but since the values were calculated from downstream DT tables,
> > the same should be done for sm8550.
> >
> > >
> > > * As part of a series to support the original Odin 2, a patch to
> > > update sm8550 EAS values was submitted [1]. But that series stalled
> > > and this was never merged. If this change is valid, which per that
> > > discussion it appears to be, can it be resubmitted by itself and
> > > merged?
> >
> > I missed this patch, please re-submit, I also need to update the ones
> > for SM8650.
> >
> > >
> > > * Per the mainline kernel device trees and audio topology provided by
> > > the oem, these devices use primary i2s for the speakers path. There
> > > was a commit adding clock support for that as part of an hdmi series
> > > [2], but that seems to have stalled. Is this going to be picked back
> > > up?
> >
> > No, I do not plan to do this work, it required adding callbacks in the
> > code to handle the clocks like done for the pre-audioreach firmwares.
> >
> > >
> > > * Inline crypto fails to detect hwkm support. And I see other logs
> > > online, such as for the sm8550 qrd, that logs the same way my device
> > > does. I traced the issue to the check for wrapped key support [3]. On
> > > my devices, the derive call is supported, but the other three calls
> > > are not. I was pointed at the downstream headers for sm8550 support
> > > and only derive is listed there, the other three don't appear to be
> > > used in the downstream driver. Is this expected? And if so, will this
> > > case be added to the mainline drivers?
> >
> > Does hwkm work if you remove the last 3 calls?
> >
> > >
> > > * Some gpu related clocks complain about being stuck off during boot,
> > > causing stack traces, but the gpu does work. I tried to do some
> > > research into this, but quickly got lost in the weeds and I have no
> > > idea where to even look.
> > > [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> > > [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> > > [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> > > [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
> >
> > This may be related with the display handoff from ABL, did you add the
> > plat region to the reserved memories ?
> >
> > >
> > > * Sometimes when starting rendering, a bandwidth submission times out,
> > > then the driver immediately complains that said id was left on the
> > > queue. I have tried increasing the timeout, but the same sequence
> > > still happens. Timeout happens, immediately followed by a matching
> > > unexpected response. Implying that this isn't actually a delay /
> > > timeout issue.
> > > [ 1848.517020] platform 3d6a000.gmu:
> > > [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
> > > HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
> > > [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
> > > *ERROR* Unexpected message id 1015 on the response queue
> >
> > Weird the timeout was extended for this very purpose
> >
> > >
> > > * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
> > > unsure if this is a kernel problem or userspace, so I'm submitting
> > > here first. If the consensus is that it's a userspace issue, I'll
> > > submit it to mesa.
> > > [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
> > > fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
> > > 00000001512E9000/003d ib2 00000001512E7000/0000
> > > [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
> > > [msm]] *ERROR* 67.5.10.1: hangcheck recover!
> > > [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
> > > [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
> > > (com.futuremark.dmandroid.application)
> > > [ 1860.258126] revision: 0 (67.5.10.1)
> > > [ 1860.258132] rb 0: fence:    2884/2884
> > > [ 1860.258133] rptr:     36
> > > [ 1860.258134] rb wptr:  36
> > > [ 1860.258135] rb 1: fence:    -256/-256
> > > [ 1860.258138] rptr:     0
> > > [ 1860.258138] rb wptr:  0
> > > [ 1860.258139] rb 2: fence:    41563/41569
> > > [ 1860.258140] rptr:     1752
> > > [ 1860.258140] rb wptr:  2319
> > > [ 1860.258141] rb 3: fence:    -256/-256
> > > [ 1860.258141] rptr:     0
> > > [ 1860.258142] rb wptr:  0
> > > [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
> > > [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
> > > [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > > CP_SCRATCH_REG2: 41562
> > > [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
> > > [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > > CP_SCRATCH_REG4: 3736059565
> > > [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > > CP_SCRATCH_REG5: 3736059565
> > > [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > > CP_SCRATCH_REG6: 3736059565
> > > [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > > CP_SCRATCH_REG7: 3736059565
> >
> > @rob do you have any idea how to solve this crash on a740 ?
>
> The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
> something is unhappy about gpu pm.  I'd focus on that first, since
> that is almost certainly the cause of the later issues.  If things
> _sorta_ work (rendering UI, etc) you could try removing all but the
> lowest gpu OPP as an experiment.  Could be that power related problems
> surface when the GPU ramps up to higher OPPs.

Things work amazingly well compared to what I was expecting. Using
mesa staging 26.0 as of yesterday, I'm getting roughly 80% of stock
Android performance in the benchmarks that do run, and rendering is
correct everywhere that I've seen so far. Mesa 25.3.3 gives about 89%
of stock, but there are graphical glitches in some of the benchmarks.

I set gpu max_freq via devfreq to the minimum available frequency and
ran the failing benchmark again. It completed once, but failed with a
similar stack trace on the second run. And per sysfs, the gpu did stay
at that minimum. Of note, the crash causes the benchmark to fail, but
rendering does recover and the unit is still usable afterwards.
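
The devfreq experiment described above can be sketched as follows. This is
a minimal sketch, assuming the GPU is exposed under /sys/class/devfreq/
with the standard available_frequencies, max_freq and cur_freq attributes.
The device directory name (for example 3d00000.gpu) is an assumption, the
helper names are illustrative, and writing max_freq needs root:

```python
from pathlib import Path


def cap_devfreq_to_min(devfreq_dir):
    """Write the lowest entry of available_frequencies to max_freq.

    devfreq_dir is the device's devfreq directory; the exact name
    (e.g. /sys/class/devfreq/3d00000.gpu) is an assumption here.
    Returns the frequency that was written, in Hz.
    """
    dev = Path(devfreq_dir)
    freqs = [int(f) for f in (dev / "available_frequencies").read_text().split()]
    lowest = min(freqs)
    (dev / "max_freq").write_text(str(lowest))
    return lowest


def read_cur_freq(devfreq_dir):
    """Return the current frequency reported by devfreq, in Hz."""
    return int((Path(devfreq_dir) / "cur_freq").read_text().strip())
```

Polling read_cur_freq() while the benchmark runs is one way to confirm
that the GPU really stays at the capped frequency.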

Aaron

* Re: Questions About SM8550 Support
  2026-01-28 14:03 ` Konrad Dybcio
@ 2026-01-28 18:20   ` Aaron Kling
  2026-02-02  9:35   ` Taniya Das
  1 sibling, 0 replies; 31+ messages in thread
From: Aaron Kling @ 2026-01-28 18:20 UTC
  To: Konrad Dybcio; +Cc: linux-arm-msm, Taniya Das

On Wed, Jan 28, 2026 at 8:03 AM Konrad Dybcio
<konrad.dybcio@oss.qualcomm.com> wrote:
>
> On 1/27/26 11:48 PM, Aaron Kling wrote:
> > I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> > for Android, using mainline kernel drivers. I have come across some
> > missing functionality and failures that I would like to inquire about.
>
> [...]
>
> > * Some gpu related clocks complain about being stuck off during boot,
> > causing stack traces, but the gpu does work. I tried to do some
> > research into this, but quickly got lost in the weeds and I have no
> > idea where to even look.
> > [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> > [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> > [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> > [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
>
> I'm really scratching my head trying to understand how these GPU
> clocks could not turn on, they barely have any dependencies than
> "the chip shouldn't be offline"
>
> +Taniya ideas? This is 8550

Mmm, mentioning dependencies reminds me that Android does manual
module loading, so it's possible that some dependency isn't mapped
properly in the drivers and I'm loading something in the wrong order.
The *cc modules are loaded in this order:

gcc-sm8550
tcsrcc-sm8550
dispcc-sm8550
gpucc-sm8550

The full list of modules loaded in my ramdisk can be seen here [0], if
more context is needed.
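
A load order like the one above can be checked against a dependency map
with a few lines of script. This is a minimal sketch: the dependency map
below is a hypothetical illustration, not taken from modules.dep or from
the drivers, and would need to be replaced with the real relationships:

```python
def find_order_violations(load_order, depends_on):
    """Return (module, dependency) pairs where the dependency is either
    missing from load_order or loaded after the module that needs it."""
    position = {name: i for i, name in enumerate(load_order)}
    violations = []
    for module, deps in depends_on.items():
        if module not in position:
            continue  # module is not loaded at all, nothing to check
        for dep in deps:
            if dep not in position or position[dep] > position[module]:
                violations.append((module, dep))
    return violations


# Load order as listed in the message above.
LOAD_ORDER = ["gcc-sm8550", "tcsrcc-sm8550", "dispcc-sm8550", "gpucc-sm8550"]

# HYPOTHETICAL dependency map, for illustration only.
ASSUMED_DEPS = {
    "dispcc-sm8550": ["gcc-sm8550"],
    "gpucc-sm8550": ["gcc-sm8550"],
}

print(find_order_violations(LOAD_ORDER, ASSUMED_DEPS))
```

An empty result only means the order is consistent with the assumed map;
it says nothing about dependencies that the map leaves out.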

Aaron

[0] https://gitlab.incom.co/cm-ayn/android_device_ayn_qcs8550-ack/-/blob/0d3b3fe8dd8f4671062f3f3a7698e444e1700658/modules.mk#L139

* Re: Questions About SM8550 Support
  2026-01-28  8:50 ` Neil Armstrong
  2026-01-28 14:46   ` Rob Clark
@ 2026-01-28 18:42   ` Aaron Kling
  2026-02-06 15:04     ` Neil Armstrong
  1 sibling, 1 reply; 31+ messages in thread
From: Aaron Kling @ 2026-01-28 18:42 UTC (permalink / raw)
  To: Neil Armstrong; +Cc: linux-arm-msm

On Wed, Jan 28, 2026 at 2:50 AM Neil Armstrong
<neil.armstrong@linaro.org> wrote:
>
> Hi,
>
> On 1/27/26 23:48, Aaron Kling wrote:
> > I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> > for Android, using mainline kernel drivers. I have come across some
> > missing functionality and failures that I would like to inquire about.
> >
> > * ABL fails to load a dtbo using a baseline dtb unmodified from
> > mainline. Using changes described in the gunyah watchdog thread [0], a
> > dtbo loads and the devices boot as expected. If any of the changes in
> > that post don't exist in the base dtb, abl will fail to load the dtbo
> > and go to the bootloader menu. This appears to be an issue in the
> > baseline abl code, affecting all devices of that generation. Would it
> > be allowable to merge a change adding those changes to the sm8550
> > dtsi, allowing an unmodified mainline dtb to work with overlays?
>
> Any addition to the DT must be documented in dt-bindings, so if it's needed
> for boot they should be documented and added for sure.

I can make the change and see if the bindings check reports any new
issues before submitting.

> > * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
> > have locally copied the commits from sm8650 and adapted for sm8550,
> > and that seems to work okay. But no measuring of bandwidth was done,
> > so the numbers are likely not entirely correct. Is there any plan to
> > generate correct tables for sm8550?
>
> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
> is fine but since the values were calculated from downstream DT tables,
> the same should be done for sm8550.

What am I looking for in the downstream dt? I'm not greatly familiar
with that layout. But if I get pointed at the right stuff, I can do
the legwork.

> > * As part of a series to support the original Odin 2, a patch to
> > update sm8550 EAS values was submitted [1]. But that series stalled
> > and this was never merged. If this change is valid, which per that
> > discussion it appears to be, can it be resubmitted by itself and
> > merged?
>
> I missed this patch, please re-submit, I also need to update the ones
> for SM8650.

Ack.

> > * Per the mainline kernel device trees and audio topology provide by
> > the oem, these devices use primary i2s for the speakers path. There
> > was a commit adding clock support for that as part of an hdmi series
> > [2], but that seems to have stalled. Is this going to be picked back
> > up?
>
> No, I do not plan to do this work, it required adding callbacks in the
> code to handle the clocks like done for the pre-audioreach firmwares.
>
> >
> > * Inline crypto fails to detect hwkm support. And I see other logs
> > online, such as for the sm8550 qrd, that logs the same way my device
> > does. I traced the issue to the check for wrapped key support [3]. On
> > my devices, the derive call is supported, but the other three calls
> > are not. I was pointed at the downstream headers for sm8550 support
> > and only derive is listed there, the other three don't appear to be
> > used in the downstream driver. Is this expected? And if so, will this
> > case be added to the mainline drivers?
>
> Does hwkm work with you remove the last 3 calls ?

I would assume not, since the ufs driver [0] references all four. And
the plumbing doesn't do any further existence checks and just makes
the smc calls.

> > * Some gpu related clocks complain about being stuck off during boot,
> > causing stack traces, but the gpu does work. I tried to do some
> > research into this, but quickly got lost in the weeds and I have no
> > idea where to even look.
> > [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> > [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> > [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> > [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
>
> This may be related with the display handoff from ABL, did you add the
> plat region to the reserved memories ?
>

I did not, for these logs. Earlier in bringup, I did try to make abl
leave the display on by adding the splash region, but that just caused
display corruption before the kernel reset the display controller, so
I pulled that back out. And I saw a comment somewhere stating that
seamless handoff is not supported. Is that still the case, or should
seamless handoff work now? It would be a much nicer user experience if
it did.

Aaron

[0] https://github.com/torvalds/linux/blob/master/drivers/ufs/host/ufs-qcom.c#L311-L314
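
The wrapped-key detection discussed above can be modelled as an all-or-nothing check. This is a simplified sketch, not the kernel code. The four call labels are illustrative names rather than the kernel's identifiers, and the firmware set reflects only what the thread reports for these devices (derive available, the other three not).

```python
# ASSUMPTION: illustrative labels for the four firmware calls, not the
# identifiers used by the kernel.
REQUIRED_CALLS = ("derive_sw_secret", "generate_key", "prepare_key", "import_key")


def has_wrapped_key_support(is_call_available):
    """Wrapped keys are reported as supported only if every call is available."""
    return all(is_call_available(call) for call in REQUIRED_CALLS)


def missing_calls(is_call_available):
    """List the required calls that the firmware does not provide."""
    return [call for call in REQUIRED_CALLS if not is_call_available(call)]


# Firmware as described in the thread: only the derive call is available.
firmware_calls = {"derive_sw_secret"}

supported = has_wrapped_key_support(firmware_calls.__contains__)
missing = missing_calls(firmware_calls.__contains__)
```

Under this model a firmware that offers only the derive call is reported as unsupported, which matches the behaviour described for the sm8550 devices.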

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Questions About SM8550 Support
  2026-01-28 17:54     ` Aaron Kling
@ 2026-01-29 23:11       ` Akhil P Oommen
  2026-01-30  2:35         ` Aaron Kling
  0 siblings, 1 reply; 31+ messages in thread
From: Akhil P Oommen @ 2026-01-29 23:11 UTC (permalink / raw)
  To: Aaron Kling, rob.clark; +Cc: Neil Armstrong, linux-arm-msm

On 1/28/2026 11:24 PM, Aaron Kling wrote:
> On Wed, Jan 28, 2026 at 8:46 AM Rob Clark <rob.clark@oss.qualcomm.com> wrote:
>>
>> On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
>> <neil.armstrong@linaro.org> wrote:
>>>
>>> Hi,
>>>
>>> On 1/27/26 23:48, Aaron Kling wrote:
>>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
>>>> for Android, using mainline kernel drivers. I have come across some
>>>> missing functionality and failures that I would like to inquire about.
>>>>
>>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
>>>> mainline. Using changes described in the gunyah watchdog thread [0], a
>>>> dtbo loads and the devices boot as expected. If any of the changes in
>>>> that post don't exist in the base dtb, abl will fail to load the dtbo
>>>> and go to the bootloader menu. This appears to be an issue in the
>>>> baseline abl code, affecting all devices of that generation. Would it
>>>> be allowable to merge a change adding those changes to the sm8550
>>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
>>>
>>> Any addition to the DT must be documented in dt-bindings, so if it's needed
>>> for boot they should be documented and added for sure.
>>>
>>>>
>>>> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
>>>> have locally copied the commits from sm8650 and adapted for sm8550,
>>>> and that seems to work okay. But no measuring of bandwidth was done,
>>>> so the numbers are likely not entirely correct. Is there any plan to
>>>> generate correct tables for sm8550?
>>>
>>> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
>>> is fine but since the values were calculated from downstream DT tables,
>>> the same should be done for sm8550.
>>>
>>>>
>>>> * As part of a series to support the original Odin 2, a patch to
>>>> update sm8550 EAS values was submitted [1]. But that series stalled
>>>> and this was never merged. If this change is valid, which per that
>>>> discussion it appears to be, can it be resubmitted by itself and
>>>> merged?
>>>
>>> I missed this patch, please re-submit, I also need to update the ones
>>> for SM8650.
>>>
>>>>
>>>> * Per the mainline kernel device trees and audio topology provide by
>>>> the oem, these devices use primary i2s for the speakers path. There
>>>> was a commit adding clock support for that as part of an hdmi series
>>>> [2], but that seems to have stalled. Is this going to be picked back
>>>> up?
>>>
>>> No, I do not plan to do this work, it required adding callbacks in the
>>> code to handle the clocks like done for the pre-audioreach firmwares.
>>>
>>>>
>>>> * Inline crypto fails to detect hwkm support. And I see other logs
>>>> online, such as for the sm8550 qrd, that logs the same way my device
>>>> does. I traced the issue to the check for wrapped key support [3]. On
>>>> my devices, the derive call is supported, but the other three calls
>>>> are not. I was pointed at the downstream headers for sm8550 support
>>>> and only derive is listed there, the other three don't appear to be
>>>> used in the downstream driver. Is this expected? And if so, will this
>>>> case be added to the mainline drivers?
>>>
>>> Does hwkm work with you remove the last 3 calls ?
>>>
>>>>
>>>> * Some gpu related clocks complain about being stuck off during boot,
>>>> causing stack traces, but the gpu does work. I tried to do some
>>>> research into this, but quickly got lost in the weeds and I have no
>>>> idea where to even look.
>>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
>>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
>>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
>>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
>>>
>>> This may be related with the display handoff from ABL, did you add the
>>> plat region to the reserved memories ?
>>>
>>>>
>>>> * Sometimes when starting rendering, a bandwidth submission times out,
>>>> then the driver immediately complains that said id was left on the
>>>> queue. I have tried increasing the timeout, but the same sequence
>>>> still happens. Timeout happens, immediately followed by a matching
>>>> unexpected response. Implying that this isn't actually a delay /
>>>> timeout issue.
>>>> [ 1848.517020] platform 3d6a000.gmu:
>>>> [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
>>>> HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
>>>> [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
>>>> *ERROR* Unexpected message id 1015 on the response queue
>>>
>>> Weird the timeout was extended for this very purpose
>>>
>>>>
>>>> * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
>>>> unsure if this is a kernel problem or userspace, so I'm submitting
>>>> here first. If the consensus is that it's a userspace issue, I'll
>>>> submit it to mesa.
>>>> [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
>>>> fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
>>>> 00000001512E9000/003d ib2 00000001512E7000/0000
>>>> [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
>>>> [msm]] *ERROR* 67.5.10.1: hangcheck recover!
>>>> [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
>>>> [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
>>>> (com.futuremark.dmandroid.application)
>>>> [ 1860.258126] revision: 0 (67.5.10.1)
>>>> [ 1860.258132] rb 0: fence:    2884/2884
>>>> [ 1860.258133] rptr:     36
>>>> [ 1860.258134] rb wptr:  36
>>>> [ 1860.258135] rb 1: fence:    -256/-256
>>>> [ 1860.258138] rptr:     0
>>>> [ 1860.258138] rb wptr:  0
>>>> [ 1860.258139] rb 2: fence:    41563/41569
>>>> [ 1860.258140] rptr:     1752
>>>> [ 1860.258140] rb wptr:  2319
>>>> [ 1860.258141] rb 3: fence:    -256/-256
>>>> [ 1860.258141] rptr:     0
>>>> [ 1860.258142] rb wptr:  0
>>>> [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
>>>> [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
>>>> [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>> CP_SCRATCH_REG2: 41562
>>>> [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
>>>> [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>> CP_SCRATCH_REG4: 3736059565
>>>> [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>> CP_SCRATCH_REG5: 3736059565
>>>> [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>> CP_SCRATCH_REG6: 3736059565
>>>> [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>> CP_SCRATCH_REG7: 3736059565
>>>
>>> @rob do you have any idea how to solve this crash on a740 ?
>>
>> The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
>> something is unhappy about gpu pm.  I'd focus on that first, since
>> that is almost certainly the cause of the later issues.  If things
>> _sorta_ work (rendering UI, etc) you could try removing all but the
>> lowest gpu OPP as an experiment.  Could be that power related problems
>> surface when the GPU ramps up to higher OPPs.
> 
> Things work amazingly well compared to what I was expecting. Using
> mesa staging 26.0 as of yesterday, I'm getting roughly 80% performance
> in the benchmarks that do run, compared to the stock Android. And
> rendering is correct everywhere that I've seen so far. Mesa 25.3.3
> gives about 89% compared to stock, but there are graphical glitches in
> some of the benchmarks.
> 
> I set gpu max_freq via devfreq to the minimum available frequency and
> ran the failing benchmark again. It completed once, but failed with a
> similar stack trace on the second run. And per sysfs, the gpu did stay
> at that minimum. Of note, that causes the benchmark to fail, but
> rendering does recover and the unit is still usable afterwards.
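
The devfreq experiment quoted above can be reproduced with a short script. This is a minimal sketch, assuming the standard devfreq sysfs attributes `available_frequencies`, `max_freq` and `cur_freq`. The function names are made up for this example, and the directory passed in must be looked up on the target device.

```python
from pathlib import Path


def pin_to_min_frequency(devfreq_dir):
    """Cap a devfreq device at its lowest advertised frequency.

    Returns the frequency (in Hz) that was written to max_freq.
    """
    devfreq_dir = Path(devfreq_dir)
    listed = (devfreq_dir / "available_frequencies").read_text().split()
    lowest = min(int(value) for value in listed)
    (devfreq_dir / "max_freq").write_text(f"{lowest}\n")
    return lowest


def current_frequency(devfreq_dir):
    """Read back the frequency the device is currently running at, in Hz."""
    return int((Path(devfreq_dir) / "cur_freq").read_text())
```

On a device this would be called with the GPU's devfreq directory under `/sys/class/devfreq/` (the exact node name is an assumption to verify locally) and needs root. Comparing `current_frequency()` against the returned value during the benchmark confirms whether the cap held.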

In sm8550.dtsi, I see that ACD values are not specified in the GPU OPP
table. Can we add those (from downstream dt) and try again?

Also, are you using the latest stable release from Linus?

-Akhil.
> 
> Aaron


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Questions About SM8550 Support
  2026-01-29 23:11       ` Akhil P Oommen
@ 2026-01-30  2:35         ` Aaron Kling
  2026-02-05  8:01           ` Aaron Kling
  0 siblings, 1 reply; 31+ messages in thread
From: Aaron Kling @ 2026-01-30  2:35 UTC (permalink / raw)
  To: Akhil P Oommen; +Cc: rob.clark, Neil Armstrong, linux-arm-msm

On Thu, Jan 29, 2026 at 5:11 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>
> On 1/28/2026 11:24 PM, Aaron Kling wrote:
> > On Wed, Jan 28, 2026 at 8:46 AM Rob Clark <rob.clark@oss.qualcomm.com> wrote:
> >>
> >> On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
> >> <neil.armstrong@linaro.org> wrote:
> >>>
> >>> Hi,
> >>>
> >>> On 1/27/26 23:48, Aaron Kling wrote:
> >>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> >>>> for Android, using mainline kernel drivers. I have come across some
> >>>> missing functionality and failures that I would like to inquire about.
> >>>>
> >>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
> >>>> mainline. Using changes described in the gunyah watchdog thread [0], a
> >>>> dtbo loads and the devices boot as expected. If any of the changes in
> >>>> that post don't exist in the base dtb, abl will fail to load the dtbo
> >>>> and go to the bootloader menu. This appears to be an issue in the
> >>>> baseline abl code, affecting all devices of that generation. Would it
> >>>> be allowable to merge a change adding those changes to the sm8550
> >>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
> >>>
> >>> Any addition to the DT must be documented in dt-bindings, so if it's needed
> >>> for boot they should be documented and added for sure.
> >>>
> >>>>
> >>>> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
> >>>> have locally copied the commits from sm8650 and adapted for sm8550,
> >>>> and that seems to work okay. But no measuring of bandwidth was done,
> >>>> so the numbers are likely not entirely correct. Is there any plan to
> >>>> generate correct tables for sm8550?
> >>>
> >>> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
> >>> is fine but since the values were calculated from downstream DT tables,
> >>> the same should be done for sm8550.
> >>>
> >>>>
> >>>> * As part of a series to support the original Odin 2, a patch to
> >>>> update sm8550 EAS values was submitted [1]. But that series stalled
> >>>> and this was never merged. If this change is valid, which per that
> >>>> discussion it appears to be, can it be resubmitted by itself and
> >>>> merged?
> >>>
> >>> I missed this patch, please re-submit, I also need to update the ones
> >>> for SM8650.
> >>>
> >>>>
> >>>> * Per the mainline kernel device trees and audio topology provide by
> >>>> the oem, these devices use primary i2s for the speakers path. There
> >>>> was a commit adding clock support for that as part of an hdmi series
> >>>> [2], but that seems to have stalled. Is this going to be picked back
> >>>> up?
> >>>
> >>> No, I do not plan to do this work, it required adding callbacks in the
> >>> code to handle the clocks like done for the pre-audioreach firmwares.
> >>>
> >>>>
> >>>> * Inline crypto fails to detect hwkm support. And I see other logs
> >>>> online, such as for the sm8550 qrd, that logs the same way my device
> >>>> does. I traced the issue to the check for wrapped key support [3]. On
> >>>> my devices, the derive call is supported, but the other three calls
> >>>> are not. I was pointed at the downstream headers for sm8550 support
> >>>> and only derive is listed there, the other three don't appear to be
> >>>> used in the downstream driver. Is this expected? And if so, will this
> >>>> case be added to the mainline drivers?
> >>>
> >>> Does hwkm work with you remove the last 3 calls ?
> >>>
> >>>>
> >>>> * Some gpu related clocks complain about being stuck off during boot,
> >>>> causing stack traces, but the gpu does work. I tried to do some
> >>>> research into this, but quickly got lost in the weeds and I have no
> >>>> idea where to even look.
> >>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> >>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> >>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> >>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
> >>>
> >>> This may be related with the display handoff from ABL, did you add the
> >>> plat region to the reserved memories ?
> >>>
> >>>>
> >>>> * Sometimes when starting rendering, a bandwidth submission times out,
> >>>> then the driver immediately complains that said id was left on the
> >>>> queue. I have tried increasing the timeout, but the same sequence
> >>>> still happens. Timeout happens, immediately followed by a matching
> >>>> unexpected response. Implying that this isn't actually a delay /
> >>>> timeout issue.
> >>>> [ 1848.517020] platform 3d6a000.gmu:
> >>>> [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
> >>>> HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
> >>>> [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
> >>>> *ERROR* Unexpected message id 1015 on the response queue
> >>>
> >>> Weird the timeout was extended for this very purpose
> >>>
> >>>>
> >>>> * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
> >>>> unsure if this is a kernel problem or userspace, so I'm submitting
> >>>> here first. If the consensus is that it's a userspace issue, I'll
> >>>> submit it to mesa.
> >>>> [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
> >>>> fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
> >>>> 00000001512E9000/003d ib2 00000001512E7000/0000
> >>>> [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
> >>>> [msm]] *ERROR* 67.5.10.1: hangcheck recover!
> >>>> [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
> >>>> [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
> >>>> (com.futuremark.dmandroid.application)
> >>>> [ 1860.258126] revision: 0 (67.5.10.1)
> >>>> [ 1860.258132] rb 0: fence:    2884/2884
> >>>> [ 1860.258133] rptr:     36
> >>>> [ 1860.258134] rb wptr:  36
> >>>> [ 1860.258135] rb 1: fence:    -256/-256
> >>>> [ 1860.258138] rptr:     0
> >>>> [ 1860.258138] rb wptr:  0
> >>>> [ 1860.258139] rb 2: fence:    41563/41569
> >>>> [ 1860.258140] rptr:     1752
> >>>> [ 1860.258140] rb wptr:  2319
> >>>> [ 1860.258141] rb 3: fence:    -256/-256
> >>>> [ 1860.258141] rptr:     0
> >>>> [ 1860.258142] rb wptr:  0
> >>>> [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
> >>>> [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
> >>>> [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>> CP_SCRATCH_REG2: 41562
> >>>> [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
> >>>> [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>> CP_SCRATCH_REG4: 3736059565
> >>>> [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>> CP_SCRATCH_REG5: 3736059565
> >>>> [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>> CP_SCRATCH_REG6: 3736059565
> >>>> [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>> CP_SCRATCH_REG7: 3736059565
> >>>
> >>> @rob do you have any idea how to solve this crash on a740 ?
> >>
> >> The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
> >> something is unhappy about gpu pm.  I'd focus on that first, since
> >> that is almost certainly the cause of the later issues.  If things
> >> _sorta_ work (rendering UI, etc) you could try removing all but the
> >> lowest gpu OPP as an experiment.  Could be that power related problems
> >> surface when the GPU ramps up to higher OPPs.
> >
> > Things work amazingly well compared to what I was expecting. Using
> > mesa staging 26.0 as of yesterday, I'm getting roughly 80% performance
> > in the benchmarks that do run, compared to the stock Android. And
> > rendering is correct everywhere that I've seen so far. Mesa 25.3.3
> > gives about 89% compared to stock, but there are graphical glitches in
> > some of the benchmarks.
> >
> > I set gpu max_freq via devfreq to the minimum available frequency and
> > ran the failing benchmark again. It completed once, but failed with a
> > similar stack trace on the second run. And per sysfs, the gpu did stay
> > at that minimum. Of note, that causes the benchmark to fail, but
> > rendering does recover and the unit is still usable afterwards.
>
> In sm8550.dtsi, I see that ACD values are not specified in the GPU OPP
> table. Can we add those (from downstream dt) and try again?

I don't know what I'm looking for in the downstream dt. But if such a
change gets pushed to lkml, I can grab that and verify.

> Also, are you using the latest stable release from Linus?

I'm not using stable as-is, no. I am using the Google Android Common
Kernel, based on stable 6.18.7 for my last set of tests. I
unfortunately don't have any straightforward way to boot pure
mainline. No uart serial or devkit or anything similar, like I use for
Tegra, which is what I am more familiar with.

Aaron

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Questions About SM8550 Support
  2026-01-27 22:48 Questions About SM8550 Support Aaron Kling
  2026-01-28  8:50 ` Neil Armstrong
  2026-01-28 14:03 ` Konrad Dybcio
@ 2026-01-30 11:01 ` Konrad Dybcio
  2026-01-30 17:13   ` Aaron Kling
  2 siblings, 1 reply; 31+ messages in thread
From: Konrad Dybcio @ 2026-01-30 11:01 UTC (permalink / raw)
  To: Aaron Kling, linux-arm-msm

On 1/27/26 11:48 PM, Aaron Kling wrote:
> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> for Android, using mainline kernel drivers. I have come across some
> missing functionality and failures that I would like to inquire about.
> 
> * ABL fails to load a dtbo using a baseline dtb unmodified from
> mainline. Using changes described in the gunyah watchdog thread [0], a
> dtbo loads and the devices boot as expected. If any of the changes in
> that post don't exist in the base dtb, abl will fail to load the dtbo
> and go to the bootloader menu. This appears to be an issue in the
> baseline abl code, affecting all devices of that generation. Would it
> be allowable to merge a change adding those changes to the sm8550
> dtsi, allowing an unmodified mainline dtb to work with overlays?

ABL is.. picky.. to say the least

Could you please try to check if what once worked for me on a
8550-based Sony phone would happen to work for you too?

39c596304e44 ("arm64: dts: qcom: Add SM8550 Xperia 1 V")

Konrad

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Questions About SM8550 Support
  2026-01-30 11:01 ` Konrad Dybcio
@ 2026-01-30 17:13   ` Aaron Kling
  2026-02-02 10:36     ` Konrad Dybcio
  0 siblings, 1 reply; 31+ messages in thread
From: Aaron Kling @ 2026-01-30 17:13 UTC (permalink / raw)
  To: Konrad Dybcio; +Cc: linux-arm-msm

On Fri, Jan 30, 2026 at 5:01 AM Konrad Dybcio
<konrad.dybcio@oss.qualcomm.com> wrote:
>
> On 1/27/26 11:48 PM, Aaron Kling wrote:
> > I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> > for Android, using mainline kernel drivers. I have come across some
> > missing functionality and failures that I would like to inquire about.
> >
> > * ABL fails to load a dtbo using a baseline dtb unmodified from
> > mainline. Using changes described in the gunyah watchdog thread [0], a
> > dtbo loads and the devices boot as expected. If any of the changes in
> > that post don't exist in the base dtb, abl will fail to load the dtbo
> > and go to the bootloader menu. This appears to be an issue in the
> > baseline abl code, affecting all devices of that generation. Would it
> > be allowable to merge a change adding those changes to the sm8550
> > dtsi, allowing an unmodified mainline dtb to work with overlays?
>
> ABL is.. picky.. to say the least
>
> Could you please try to check if what once worked for me on a
> 8550-based Sony phone would happen to work for you too?
>
> 39c596304e44 ("arm64: dts: qcom: Add SM8550 Xperia 1 V")

Is the question whether the devices boot without a dtbo? Yes, that works.
And fastboot erase even works too, though that may be because I'm not
using appended dtb, I'm using dtb in vendor_boot. The setup I'm trying
to use is a base dtb that has all the common nodes for the AYN qcs8550
devices, then a device specific dtbo for the diverging parts of the
four devices.

Aaron

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Questions About SM8550 Support
  2026-01-28 14:03 ` Konrad Dybcio
  2026-01-28 18:20   ` Aaron Kling
@ 2026-02-02  9:35   ` Taniya Das
  2026-02-02 23:01     ` Aaron Kling
  1 sibling, 1 reply; 31+ messages in thread
From: Taniya Das @ 2026-02-02  9:35 UTC (permalink / raw)
  To: Konrad Dybcio, Aaron Kling, linux-arm-msm, Jagadeesh Kona



On 1/28/2026 7:33 PM, Konrad Dybcio wrote:
> On 1/27/26 11:48 PM, Aaron Kling wrote:
>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
>> for Android, using mainline kernel drivers. I have come across some
>> missing functionality and failures that I would like to inquire about.
> 
> [...]
> 
>> * Some gpu related clocks complain about being stuck off during boot,
>> causing stack traces, but the gpu does work. I tried to do some
>> research into this, but quickly got lost in the weeds and I have no
>> idea where to even look.
>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
> 

Aaron, if you could share the boot up logs or stack traces it would be
helpful to see what is leading to stuck at 'off'.

-- 
Thanks,
Taniya Das


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Questions About SM8550 Support
  2026-01-30 17:13   ` Aaron Kling
@ 2026-02-02 10:36     ` Konrad Dybcio
  2026-02-02 23:12       ` Aaron Kling
  0 siblings, 1 reply; 31+ messages in thread
From: Konrad Dybcio @ 2026-02-02 10:36 UTC (permalink / raw)
  To: Aaron Kling; +Cc: linux-arm-msm

On 1/30/26 6:13 PM, Aaron Kling wrote:
> On Fri, Jan 30, 2026 at 5:01 AM Konrad Dybcio
> <konrad.dybcio@oss.qualcomm.com> wrote:
>>
>> On 1/27/26 11:48 PM, Aaron Kling wrote:
>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
>>> for Android, using mainline kernel drivers. I have come across some
>>> missing functionality and failures that I would like to inquire about.
>>>
>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
>>> mainline. Using changes described in the gunyah watchdog thread [0], a
>>> dtbo loads and the devices boot as expected. If any of the changes in
>>> that post don't exist in the base dtb, abl will fail to load the dtbo
>>> and go to the bootloader menu. This appears to be an issue in the
>>> baseline abl code, affecting all devices of that generation. Would it
>>> be allowable to merge a change adding those changes to the sm8550
>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
>>
>> ABL is.. picky.. to say the least
>>
>> Could you please try to check if what once worked for me on a
>> 8550-based Sony phone would happen to work for you too?
>>
>> 39c596304e44 ("arm64: dts: qcom: Add SM8550 Xperia 1 V")
> 
> Is the question if the devices boots without dtbo? Yes, that works.

That's nice!

> And fastboot erase even works too, though that may be because I'm not
> using appended dtb, I'm using dtb in vendor_boot. The setup I'm trying
> to use is a base dtb that has all the common nodes for the AYN qcs8550
> devices, then a device specific dtbo for the diverging parts of the
> four devices.

I'm not sure if that's a good idea if the bootloader is (effectively)
broken

I'd consider building full DTBs for each device

Konrad

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Questions About SM8550 Support
  2026-02-02  9:35   ` Taniya Das
@ 2026-02-02 23:01     ` Aaron Kling
  2026-02-03  6:34       ` Jagadeesh Kona
  0 siblings, 1 reply; 31+ messages in thread
From: Aaron Kling @ 2026-02-02 23:01 UTC (permalink / raw)
  To: Taniya Das; +Cc: Konrad Dybcio, linux-arm-msm, Jagadeesh Kona

On Mon, Feb 2, 2026 at 3:35 AM Taniya Das <taniya.das@oss.qualcomm.com> wrote:
>
>
>
> On 1/28/2026 7:33 PM, Konrad Dybcio wrote:
> > On 1/27/26 11:48 PM, Aaron Kling wrote:
> >> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> >> for Android, using mainline kernel drivers. I have come across some
> >> missing functionality and failures that I would like to inquire about.
> >
> > [...]
> >
> >> * Some gpu related clocks complain about being stuck off during boot,
> >> causing stack traces, but the gpu does work. I tried to do some
> >> research into this, but quickly got lost in the weeds and I have no
> >> idea where to even look.
> >> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> >> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> >> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> >> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
> >
>
> Aaron, if you could share the boot up logs or stack traces it would be
> helpful to see what is leading to stuck at 'off'.

Sure. Here [0] is a kernel boot log booting to Android launcher.

Aaron

[0] http://0x0.st/PbLh.txt

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Questions About SM8550 Support
  2026-02-02 10:36     ` Konrad Dybcio
@ 2026-02-02 23:12       ` Aaron Kling
  2026-02-03 10:31         ` Konrad Dybcio
  0 siblings, 1 reply; 31+ messages in thread
From: Aaron Kling @ 2026-02-02 23:12 UTC (permalink / raw)
  To: Konrad Dybcio; +Cc: linux-arm-msm

On Mon, Feb 2, 2026 at 4:36 AM Konrad Dybcio
<konrad.dybcio@oss.qualcomm.com> wrote:
>
> On 1/30/26 6:13 PM, Aaron Kling wrote:
> > On Fri, Jan 30, 2026 at 5:01 AM Konrad Dybcio
> > <konrad.dybcio@oss.qualcomm.com> wrote:
> >>
> >> On 1/27/26 11:48 PM, Aaron Kling wrote:
> >>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> >>> for Android, using mainline kernel drivers. I have come across some
> >>> missing functionality and failures that I would like to inquire about.
> >>>
> >>> * ABL fails to load a dtbo using a baseline dtb unmodified from
> >>> mainline. Using changes described in the gunyah watchdog thread [0], a
> >>> dtbo loads and the devices boot as expected. If any of the changes in
> >>> that post don't exist in the base dtb, abl will fail to load the dtbo
> >>> and go to the bootloader menu. This appears to be an issue in the
> >>> baseline abl code, affecting all devices of that generation. Would it
> >>> be allowable to merge a change adding those changes to the sm8550
> >>> dtsi, allowing an unmodified mainline dtb to work with overlays?
> >>
> >> ABL is.. picky.. to say the least
> >>
> >> Could you please try to check if what once worked for me on a
> >> 8550-based Sony phone would happen to work for you too?
> >>
> >> 39c596304e44 ("arm64: dts: qcom: Add SM8550 Xperia 1 V")
> >
> > Is the question if the devices boots without dtbo? Yes, that works.
>
> That's nice!
>
> > And fastboot erase even works too, though that may be because I'm not
> > using appended dtb, I'm using dtb in vendor_boot. The setup I'm trying
> > to use is a base dtb that has all the common nodes for the AYN qcs8550
> > devices, then a device specific dtbo for the diverging parts of the
> > four devices.
>
> I'm not sure if that's a good idea if the bootloader is (effectively)
> broken
>
> I'd consider building full DTBs for each device

My end goal makes that difficult. I'm working on LineageOS, an open
source AOSP fork. I am attempting to make a single build target that
supports all four AYN qcs8550 devices. Android puts the base dtb in
vendor_boot. The concept supports multiple dtb's, but the ids the
bootloader uses to fetch said dtb match across all four devices.
Even more unfortunately, this is true for the dtbo id as well; the
vendor did not set unique board ids for the different devices.
However, I can pull some tricks to use a variant dtbo image per
device. That concept isn't feasible for the vendor_boot partition. So
I'm taking every reasonable effort to support dtbo's.

And to be fair, beyond these node name and label requirements, I have
not seen any breakage. Once the bootloader is convinced to actually
apply the dtbo, it works as expected.
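
For illustration, a minimal sketch of the kind of per-device overlay
described above. The label "common_panel" and the property override are
hypothetical placeholders, not taken from the actual AYN device trees.
The only point is that an overlay's &label references get resolved
against the symbols exported by the base dtb, so the base has to be
built with symbols and carry the labels the overlay refers to.

```dts
/dts-v1/;
/plugin/;

/*
 * Hypothetical device-specific overlay. "common_panel" is an assumed
 * label that the shared base dtb would have to export (base built
 * with dtc -@ so that __symbols__ is present).
 */
&common_panel {
	/* Placeholder override for one device variant. */
	status = "okay";
};
```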

Aaron

* Re: Questions About SM8550 Support
  2026-02-02 23:01     ` Aaron Kling
@ 2026-02-03  6:34       ` Jagadeesh Kona
  2026-02-03 23:21         ` Aaron Kling
  0 siblings, 1 reply; 31+ messages in thread
From: Jagadeesh Kona @ 2026-02-03  6:34 UTC (permalink / raw)
  To: Aaron Kling, Taniya Das; +Cc: Konrad Dybcio, linux-arm-msm



On 2/3/2026 4:31 AM, Aaron Kling wrote:
> On Mon, Feb 2, 2026 at 3:35 AM Taniya Das <taniya.das@oss.qualcomm.com> wrote:
>>
>>
>>
>> On 1/28/2026 7:33 PM, Konrad Dybcio wrote:
>>> On 1/27/26 11:48 PM, Aaron Kling wrote:
>>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
>>>> for Android, using mainline kernel drivers. I have come across some
>>>> missing functionality and failures that I would like to inquire about.
>>>
>>> [...]
>>>
>>>> * Some gpu related clocks complain about being stuck off during boot,
>>>> causing stack traces, but the gpu does work. I tried to do some
>>>> research into this, but quickly got lost in the weeds and I have no
>>>> idea where to even look.
>>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
>>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
>>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
>>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
>>>
>>
>> Aaron, if you could share the boot up logs or stack traces it would be
>> helpful to see what is leading to stuck at 'off'.
> 
> Sure. Here [0] is a kernel boot log booting to Android launcher.
> 
> Aaron
> 
> [0]  http://0x0.st/PbLh.txt

Thanks Aaron for sharing the logs!

These warnings seem to be due to the ACK change below, which tries to proxy
vote on the boot-time enabled clocks. That change is not part of Linux mainline.

https://android-review.googlesource.com/c/kernel/common/+/1164507

Can you please check whether dropping the above change avoids the warnings?
Alternatively, please apply the patch below and see if it avoids the warnings.

diff --git a/drivers/clk/qcom/dispcc-sm8550.c b/drivers/clk/qcom/dispcc-sm8550.c
index f27140c649f5..55dee8a96e74 100644
--- a/drivers/clk/qcom/dispcc-sm8550.c
+++ b/drivers/clk/qcom/dispcc-sm8550.c
@@ -825,7 +825,7 @@ static struct clk_branch disp_cc_mdss_ahb1_clk = {
                                &disp_cc_mdss_ahb_clk_src.clkr.hw,
                        },
                        .num_parents = 1,
-                       .flags = CLK_SET_RATE_PARENT,
+                       .flags = CLK_SET_RATE_PARENT | CLK_DONT_HOLD_STATE,
                        .ops = &clk_branch2_ops,
                },
        },
diff --git a/drivers/clk/qcom/gpucc-sm8550.c b/drivers/clk/qcom/gpucc-sm8550.c
index 7486edf56160..2cd27cb835f9 100644
--- a/drivers/clk/qcom/gpucc-sm8550.c
+++ b/drivers/clk/qcom/gpucc-sm8550.c
@@ -330,7 +330,7 @@ static struct clk_branch gpu_cc_cx_gmu_clk = {
                                &gpu_cc_gmu_clk_src.clkr.hw,
                        },
                        .num_parents = 1,
-                       .flags = CLK_SET_RATE_PARENT,
+                       .flags = CLK_SET_RATE_PARENT | CLK_DONT_HOLD_STATE,
                        .ops = &clk_branch2_aon_ops,
                },
        },
@@ -348,7 +348,7 @@ static struct clk_branch gpu_cc_cxo_clk = {
                                &gpu_cc_xo_clk_src.clkr.hw,
                        },
                        .num_parents = 1,
-                       .flags = CLK_SET_RATE_PARENT,
+                       .flags = CLK_SET_RATE_PARENT | CLK_DONT_HOLD_STATE,
                        .ops = &clk_branch2_ops,
                },
        },
@@ -415,7 +415,7 @@ static struct clk_branch gpu_cc_hub_cx_int_clk = {
                                &gpu_cc_hub_clk_src.clkr.hw,
                        },
                        .num_parents = 1,
-                       .flags = CLK_SET_RATE_PARENT,
+                       .flags = CLK_SET_RATE_PARENT | CLK_DONT_HOLD_STATE,
                        .ops = &clk_branch2_aon_ops,
                },
        },

Thanks,
Jagadeesh 

* Re: Questions About SM8550 Support
  2026-02-02 23:12       ` Aaron Kling
@ 2026-02-03 10:31         ` Konrad Dybcio
  2026-02-03 17:31           ` Aaron Kling
  0 siblings, 1 reply; 31+ messages in thread
From: Konrad Dybcio @ 2026-02-03 10:31 UTC (permalink / raw)
  To: Aaron Kling; +Cc: linux-arm-msm

On 2/3/26 12:12 AM, Aaron Kling wrote:
> On Mon, Feb 2, 2026 at 4:36 AM Konrad Dybcio
> <konrad.dybcio@oss.qualcomm.com> wrote:
>>
>> On 1/30/26 6:13 PM, Aaron Kling wrote:
>>> On Fri, Jan 30, 2026 at 5:01 AM Konrad Dybcio
>>> <konrad.dybcio@oss.qualcomm.com> wrote:
>>>>
>>>> On 1/27/26 11:48 PM, Aaron Kling wrote:
>>>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
>>>>> for Android, using mainline kernel drivers. I have come across some
>>>>> missing functionality and failures that I would like to inquire about.
>>>>>
>>>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
>>>>> mainline. Using changes described in the gunyah watchdog thread [0], a
>>>>> dtbo loads and the devices boot as expected. If any of the changes in
>>>>> that post don't exist in the base dtb, abl will fail to load the dtbo
>>>>> and go to the bootloader menu. This appears to be an issue in the
>>>>> baseline abl code, affecting all devices of that generation. Would it
>>>>> be allowable to merge a change adding those changes to the sm8550
>>>>> dtsi, allowing an unmodified mainline dtb to work with overlays?

[...]

>> I'd consider building full DTBs for each device
> 
> My end goal makes that difficult. I'm working on LineageOS, an open
> source AOSP fork. I am attempting to make a single build target that
> supports all four AYN qcs8550 devices. Android puts the base dtb in
> vendor_boot. The concept supports multiple dtb's, but the ids the
> bootloader uses to fetch said dtb matches across all four devices.
> Even more unfortunately, this is true for the dtbo id as well; the
> vendor did not set unique board ids for the different devices.
> However, I can pull some tricks to use a variant dtbo image per
> device. That concept isn't feasible for the vendor_boot partition. So
> I'm taking every reasonable effort to support dtbo's.
> 
> And to be fair, beyond these node name and label requirements, I have
> not seen any breakage. Once the bootloader is convinced to actually
> apply the dtbo, it works as expected.

I see

It may be that I lost some context across various threads, but IIUC
we're down to just label changes, which I suppose are generally fine..
Am I following?

Konrad

* Re: Questions About SM8550 Support
  2026-02-03 10:31         ` Konrad Dybcio
@ 2026-02-03 17:31           ` Aaron Kling
  0 siblings, 0 replies; 31+ messages in thread
From: Aaron Kling @ 2026-02-03 17:31 UTC (permalink / raw)
  To: Konrad Dybcio; +Cc: linux-arm-msm

On Tue, Feb 3, 2026 at 4:31 AM Konrad Dybcio
<konrad.dybcio@oss.qualcomm.com> wrote:
>
> On 2/3/26 12:12 AM, Aaron Kling wrote:
> > On Mon, Feb 2, 2026 at 4:36 AM Konrad Dybcio
> > <konrad.dybcio@oss.qualcomm.com> wrote:
> >>
> >> On 1/30/26 6:13 PM, Aaron Kling wrote:
> >>> On Fri, Jan 30, 2026 at 5:01 AM Konrad Dybcio
> >>> <konrad.dybcio@oss.qualcomm.com> wrote:
> >>>>
> >>>> On 1/27/26 11:48 PM, Aaron Kling wrote:
> >>>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> >>>>> for Android, using mainline kernel drivers. I have come across some
> >>>>> missing functionality and failures that I would like to inquire about.
> >>>>>
> >>>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
> >>>>> mainline. Using changes described in the gunyah watchdog thread [0], a
> >>>>> dtbo loads and the devices boot as expected. If any of the changes in
> >>>>> that post don't exist in the base dtb, abl will fail to load the dtbo
> >>>>> and go to the bootloader menu. This appears to be an issue in the
> >>>>> baseline abl code, affecting all devices of that generation. Would it
> >>>>> be allowable to merge a change adding those changes to the sm8550
> >>>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
>
> [...]
>
> >> I'd consider building full DTBs for each device
> >
> > My end goal makes that difficult. I'm working on LineageOS, an open
> > source AOSP fork. I am attempting to make a single build target that
> > supports all four AYN qcs8550 devices. Android puts the base dtb in
> > vendor_boot. The concept supports multiple dtb's, but the ids the
> > bootloader uses to fetch said dtb matches across all four devices.
> > Even more unfortunately, this is true for the dtbo id as well; the
> > vendor did not set unique board ids for the different devices.
> > However, I can pull some tricks to use a variant dtbo image per
> > device. That concept isn't feasible for the vendor_boot partition. So
> > I'm taking every reasonable effort to support dtbo's.
> >
> > And to be fair, beyond these node name and label requirements, I have
> > not seen any breakage. Once the bootloader is convinced to actually
> > apply the dtbo, it works as expected.
>
> I see
>
> It may be that I lost some context across various threads, but IIUC
> we're down to just label changes, which I suppose are generally fine..
> Am I following?

That is correct. My post on the series yesterday [0] is my current
understanding of how things stand.

Aaron

[0] https://lore.kernel.org/linux-arm-msm/CALHNRZ8j4XWg_oVdPTTp+RPhsEtYrjR3iGusACgoa76dGL0U3A@mail.gmail.com/

* Re: Questions About SM8550 Support
  2026-02-03  6:34       ` Jagadeesh Kona
@ 2026-02-03 23:21         ` Aaron Kling
  2026-02-04 16:53           ` Taniya Das
  0 siblings, 1 reply; 31+ messages in thread
From: Aaron Kling @ 2026-02-03 23:21 UTC (permalink / raw)
  To: Jagadeesh Kona; +Cc: Taniya Das, Konrad Dybcio, linux-arm-msm

On Tue, Feb 3, 2026 at 12:35 AM Jagadeesh Kona
<jagadeesh.kona@oss.qualcomm.com> wrote:
>
>
>
> On 2/3/2026 4:31 AM, Aaron Kling wrote:
> > On Mon, Feb 2, 2026 at 3:35 AM Taniya Das <taniya.das@oss.qualcomm.com> wrote:
> >>
> >>
> >>
> >> On 1/28/2026 7:33 PM, Konrad Dybcio wrote:
> >>> On 1/27/26 11:48 PM, Aaron Kling wrote:
> >>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> >>>> for Android, using mainline kernel drivers. I have come across some
> >>>> missing functionality and failures that I would like to inquire about.
> >>>
> >>> [...]
> >>>
> >>>> * Some gpu related clocks complain about being stuck off during boot,
> >>>> causing stack traces, but the gpu does work. I tried to do some
> >>>> research into this, but quickly got lost in the weeds and I have no
> >>>> idea where to even look.
> >>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> >>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> >>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> >>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
> >>>
> >>
> >> Aaron, if you could share the boot up logs or stack traces it would be
> >> helpful to see what is leading to stuck at 'off'.
> >
> > Sure. Here [0] is a kernel boot log booting to Android launcher.
> >
> > Aaron
> >
> > [0]  http://0x0.st/PbLh.txt
>
> Thanks Aaron for sharing the logs!
>
> This warnings seems to be due to below ACK change which tries to proxy vote on
> the boot time enabled clocks. And the same change is not part of Linux mainline.
>
> https://android-review.googlesource.com/c/kernel/common/+/1164507
>
> Can you please see if dropping above change is helping to avoid the warnings? or
> please apply below patch and see if helps to avoid the warnings.
>
> diff --git a/drivers/clk/qcom/dispcc-sm8550.c b/drivers/clk/qcom/dispcc-sm8550.c
> index f27140c649f5..55dee8a96e74 100644
> --- a/drivers/clk/qcom/dispcc-sm8550.c
> +++ b/drivers/clk/qcom/dispcc-sm8550.c
> @@ -825,7 +825,7 @@ static struct clk_branch disp_cc_mdss_ahb1_clk = {
>                                 &disp_cc_mdss_ahb_clk_src.clkr.hw,
>                         },
>                         .num_parents = 1,
> -                       .flags = CLK_SET_RATE_PARENT,
> +                       .flags = CLK_SET_RATE_PARENT | CLK_DONT_HOLD_STATE,
>                         .ops = &clk_branch2_ops,
>                 },
>         },
> diff --git a/drivers/clk/qcom/gpucc-sm8550.c b/drivers/clk/qcom/gpucc-sm8550.c
> index 7486edf56160..2cd27cb835f9 100644
> --- a/drivers/clk/qcom/gpucc-sm8550.c
> +++ b/drivers/clk/qcom/gpucc-sm8550.c
> @@ -330,7 +330,7 @@ static struct clk_branch gpu_cc_cx_gmu_clk = {
>                                 &gpu_cc_gmu_clk_src.clkr.hw,
>                         },
>                         .num_parents = 1,
> -                       .flags = CLK_SET_RATE_PARENT,
> +                       .flags = CLK_SET_RATE_PARENT | CLK_DONT_HOLD_STATE,
>                         .ops = &clk_branch2_aon_ops,
>                 },
>         },
> @@ -348,7 +348,7 @@ static struct clk_branch gpu_cc_cxo_clk = {
>                                 &gpu_cc_xo_clk_src.clkr.hw,
>                         },
>                         .num_parents = 1,
> -                       .flags = CLK_SET_RATE_PARENT,
> +                       .flags = CLK_SET_RATE_PARENT | CLK_DONT_HOLD_STATE,
>                         .ops = &clk_branch2_ops,
>                 },
>         },
> @@ -415,7 +415,7 @@ static struct clk_branch gpu_cc_hub_cx_int_clk = {
>                                 &gpu_cc_hub_clk_src.clkr.hw,
>                         },
>                         .num_parents = 1,
> -                       .flags = CLK_SET_RATE_PARENT,
> +                       .flags = CLK_SET_RATE_PARENT | CLK_DONT_HOLD_STATE,
>                         .ops = &clk_branch2_aon_ops,
>                 },
>         },

Both reverting the clock sync state support and setting the
CLK_DONT_HOLD_STATE flag on the affected clocks do independently cause
the warnings to stop.

So this is an ACK issue and not related to mainline at all. Sorry for
the hassle. But while the topic is here, is this something that should
be sent to the aosp gerrit? I'd be willing to spearhead that if no one
else is planning to. But I don't know much about the underlying issues
at play there, so if anyone that does know more about that is willing
to, it'd be more efficient for them to do so.

Aaron

* Re: Questions About SM8550 Support
  2026-02-03 23:21         ` Aaron Kling
@ 2026-02-04 16:53           ` Taniya Das
  2026-02-04 18:18             ` Aaron Kling
  0 siblings, 1 reply; 31+ messages in thread
From: Taniya Das @ 2026-02-04 16:53 UTC (permalink / raw)
  To: Aaron Kling, Jagadeesh Kona; +Cc: Konrad Dybcio, linux-arm-msm



On 2/4/2026 4:51 AM, Aaron Kling wrote:
>> +                       .flags = CLK_SET_RATE_PARENT | CLK_DONT_HOLD_STATE,
>>                         .ops = &clk_branch2_ops,
>>                 },
>>         },
>> @@ -415,7 +415,7 @@ static struct clk_branch gpu_cc_hub_cx_int_clk = {
>>                                 &gpu_cc_hub_clk_src.clkr.hw,
>>                         },
>>                         .num_parents = 1,
>> -                       .flags = CLK_SET_RATE_PARENT,
>> +                       .flags = CLK_SET_RATE_PARENT | CLK_DONT_HOLD_STATE,
>>                         .ops = &clk_branch2_aon_ops,
>>                 },
>>         },
> Both reverting the clock sync state support and setting the
> CLK_DONT_HOLD_STATE flag on the affected clocks do independently cause
> the warnings to stop.
> 

Thanks for confirming that Aaron.

> So this is an ACK issue and not related to mainline at all. Sorry for
> the hassle. But while the topic is here, is this something that should
> be sent to the aosp gerrit? I'd be willing to spearhead that if no one
> else is planning to. But I don't know much about the underlying issues
> at play there, so if anyone that does know more about that is willing
> to, it'd be more efficient for them to do so.

For AOSP, there should be separate code altogether that includes all
the required support. Is there any reason we want to use the
upstreamed version of the driver?

-- 
Thanks,
Taniya Das


* Re: Questions About SM8550 Support
  2026-02-04 16:53           ` Taniya Das
@ 2026-02-04 18:18             ` Aaron Kling
  0 siblings, 0 replies; 31+ messages in thread
From: Aaron Kling @ 2026-02-04 18:18 UTC (permalink / raw)
  To: Taniya Das; +Cc: Jagadeesh Kona, Konrad Dybcio, linux-arm-msm

On Wed, Feb 4, 2026 at 10:53 AM Taniya Das <taniya.das@oss.qualcomm.com> wrote:
>
>
>
> On 2/4/2026 4:51 AM, Aaron Kling wrote:
> >> +                       .flags = CLK_SET_RATE_PARENT | CLK_DONT_HOLD_STATE,
> >>                         .ops = &clk_branch2_ops,
> >>                 },
> >>         },
> >> @@ -415,7 +415,7 @@ static struct clk_branch gpu_cc_hub_cx_int_clk = {
> >>                                 &gpu_cc_hub_clk_src.clkr.hw,
> >>                         },
> >>                         .num_parents = 1,
> >> -                       .flags = CLK_SET_RATE_PARENT,
> >> +                       .flags = CLK_SET_RATE_PARENT | CLK_DONT_HOLD_STATE,
> >>                         .ops = &clk_branch2_aon_ops,
> >>                 },
> >>         },
> > Both reverting the clock sync state support and setting the
> > CLK_DONT_HOLD_STATE flag on the affected clocks do independently cause
> > the warnings to stop.
> >
>
> Thanks for confirming that Aaron.
>
> > So this is an ACK issue and not related to mainline at all. Sorry for
> > the hassle. But while the topic is here, is this something that should
> > be sent to the aosp gerrit? I'd be willing to spearhead that if no one
> > else is planning to. But I don't know much about the underlying issues
> > at play there, so if anyone that does know more about that is willing
> > to, it'd be more efficient for them to do so.
>
> On the AOSP, we should be having another code altogether which would
> include all the required support. Is there any reason we want to use the
> upstreamed version of the driver?

There are several reasons I wish to use mainline drivers. The primary
ones include:

* Using as close to fully open source software as possible. Using the
mainline drivers with mesa, drm_hwcomposer, etc. The qcom downstream
kernel is only compatible with the downstream userspace, to my
knowledge, much of which is closed source. And those closed source
components by definition are not freely available for open source
projects to use.

* Flexibility. The downstream qcom kernel is designed to work only
within the qcom android build system. It is difficult to use that
anywhere else. By contrast, a mainline based target can take a pure
google kernel-platform working tree, do a small set of build files,
and build.

* Longevity. The downstream qcom sm8550 support uses kernel 5.15,
which is already aging. I am currently targeting ACK 6.18 using
mainline drivers. And when 7.x hits lts and google forks it, I will
upgrade to that. Using properly mainlined support makes that simple.
And such support should be able to continue until the hardware is no
longer powerful enough to run the newer AOSP releases. A much longer
runway than being locked into the downstream kernel and closed source
userspace.

Aaron

* Re: Questions About SM8550 Support
  2026-01-30  2:35         ` Aaron Kling
@ 2026-02-05  8:01           ` Aaron Kling
  2026-02-05 10:54             ` Konrad Dybcio
                               ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Aaron Kling @ 2026-02-05  8:01 UTC (permalink / raw)
  To: Akhil P Oommen; +Cc: rob.clark, Neil Armstrong, linux-arm-msm

On Thu, Jan 29, 2026 at 8:35 PM Aaron Kling <webgeek1234@gmail.com> wrote:
>
> On Thu, Jan 29, 2026 at 5:11 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
> >
> > On 1/28/2026 11:24 PM, Aaron Kling wrote:
> > > On Wed, Jan 28, 2026 at 8:46 AM Rob Clark <rob.clark@oss.qualcomm.com> wrote:
> > >>
> > >> On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
> > >> <neil.armstrong@linaro.org> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> On 1/27/26 23:48, Aaron Kling wrote:
> > >>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> > >>>> for Android, using mainline kernel drivers. I have come across some
> > >>>> missing functionality and failures that I would like to inquire about.
> > >>>>
> > >>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
> > >>>> mainline. Using changes described in the gunyah watchdog thread [0], a
> > >>>> dtbo loads and the devices boot as expected. If any of the changes in
> > >>>> that post don't exist in the base dtb, abl will fail to load the dtbo
> > >>>> and go to the bootloader menu. This appears to be an issue in the
> > >>>> baseline abl code, affecting all devices of that generation. Would it
> > >>>> be allowable to merge a change adding those changes to the sm8550
> > >>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
> > >>>
> > >>> Any addition to the DT must be documented in dt-bindings, so if it's needed
> > >>> for boot they should be documented and added for sure.
> > >>>
> > >>>>
> > >>>> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
> > >>>> have locally copied the commits from sm8650 and adapted for sm8550,
> > >>>> and that seems to work okay. But no measuring of bandwidth was done,
> > >>>> so the numbers are likely not entirely correct. Is there any plan to
> > >>>> generate correct tables for sm8550?
> > >>>
> > >>> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
> > >>> is fine but since the values were calculated from downstream DT tables,
> > >>> the same should be done for sm8550.
> > >>>
> > >>>>
> > >>>> * As part of a series to support the original Odin 2, a patch to
> > >>>> update sm8550 EAS values was submitted [1]. But that series stalled
> > >>>> and this was never merged. If this change is valid, which per that
> > >>>> discussion it appears to be, can it be resubmitted by itself and
> > >>>> merged?
> > >>>
> > >>> I missed this patch, please re-submit, I also need to update the ones
> > >>> for SM8650.
> > >>>
> > >>>>
> > >>>> * Per the mainline kernel device trees and audio topology provide by
> > >>>> the oem, these devices use primary i2s for the speakers path. There
> > >>>> was a commit adding clock support for that as part of an hdmi series
> > >>>> [2], but that seems to have stalled. Is this going to be picked back
> > >>>> up?
> > >>>
> > >>> No, I do not plan to do this work, it required adding callbacks in the
> > >>> code to handle the clocks like done for the pre-audioreach firmwares.
> > >>>
> > >>>>
> > >>>> * Inline crypto fails to detect hwkm support. And I see other logs
> > >>>> online, such as for the sm8550 qrd, that logs the same way my device
> > >>>> does. I traced the issue to the check for wrapped key support [3]. On
> > >>>> my devices, the derive call is supported, but the other three calls
> > >>>> are not. I was pointed at the downstream headers for sm8550 support
> > >>>> and only derive is listed there, the other three don't appear to be
> > >>>> used in the downstream driver. Is this expected? And if so, will this
> > >>>> case be added to the mainline drivers?
> > >>>
> > >>> Does hwkm work with you remove the last 3 calls ?
> > >>>
> > >>>>
> > >>>> * Some gpu related clocks complain about being stuck off during boot,
> > >>>> causing stack traces, but the gpu does work. I tried to do some
> > >>>> research into this, but quickly got lost in the weeds and I have no
> > >>>> idea where to even look.
> > >>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> > >>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> > >>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> > >>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
> > >>>
> > >>> This may be related with the display handoff from ABL, did you add the
> > >>> plat region to the reserved memories ?
> > >>>
> > >>>>
> > >>>> * Sometimes when starting rendering, a bandwidth submission times out,
> > >>>> then the driver immediately complains that said id was left on the
> > >>>> queue. I have tried increasing the timeout, but the same sequence
> > >>>> still happens. Timeout happens, immediately followed by a matching
> > >>>> unexpected response. Implying that this isn't actually a delay /
> > >>>> timeout issue.
> > >>>> [ 1848.517020] platform 3d6a000.gmu:
> > >>>> [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
> > >>>> HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
> > >>>> [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
> > >>>> *ERROR* Unexpected message id 1015 on the response queue
> > >>>
> > >>> Weird the timeout was extended for this very purpose
> > >>>
> > >>>>
> > >>>> * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
> > >>>> unsure if this is a kernel problem or userspace, so I'm submitting
> > >>>> here first. If the consensus is that it's a userspace issue, I'll
> > >>>> submit it to mesa.
> > >>>> [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
> > >>>> fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
> > >>>> 00000001512E9000/003d ib2 00000001512E7000/0000
> > >>>> [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
> > >>>> [msm]] *ERROR* 67.5.10.1: hangcheck recover!
> > >>>> [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
> > >>>> [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
> > >>>> (com.futuremark.dmandroid.application)
> > >>>> [ 1860.258126] revision: 0 (67.5.10.1)
> > >>>> [ 1860.258132] rb 0: fence:    2884/2884
> > >>>> [ 1860.258133] rptr:     36
> > >>>> [ 1860.258134] rb wptr:  36
> > >>>> [ 1860.258135] rb 1: fence:    -256/-256
> > >>>> [ 1860.258138] rptr:     0
> > >>>> [ 1860.258138] rb wptr:  0
> > >>>> [ 1860.258139] rb 2: fence:    41563/41569
> > >>>> [ 1860.258140] rptr:     1752
> > >>>> [ 1860.258140] rb wptr:  2319
> > >>>> [ 1860.258141] rb 3: fence:    -256/-256
> > >>>> [ 1860.258141] rptr:     0
> > >>>> [ 1860.258142] rb wptr:  0
> > >>>> [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
> > >>>> [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
> > >>>> [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > >>>> CP_SCRATCH_REG2: 41562
> > >>>> [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
> > >>>> [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > >>>> CP_SCRATCH_REG4: 3736059565
> > >>>> [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > >>>> CP_SCRATCH_REG5: 3736059565
> > >>>> [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > >>>> CP_SCRATCH_REG6: 3736059565
> > >>>> [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > >>>> CP_SCRATCH_REG7: 3736059565
> > >>>
> > >>> @rob do you have any idea how to solve this crash on a740 ?
> > >>
> > >> The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
> > >> something is unhappy about gpu pm.  I'd focus on that first, since
> > >> that is almost certainly the cause of the later issues.  If things
> > >> _sorta_ work (rendering UI, etc) you could try removing all but the
> > >> lowest gpu OPP as an experiment.  Could be that power related problems
> > >> surface when the GPU ramps up to higher OPPs.
> > >
> > > Things work amazingly well compared to what I was expecting. Using
> > > mesa staging 26.0 as of yesterday, I'm getting roughly 80% performance
> > > in the benchmarks that do run, compared to the stock Android. And
> > > rendering is correct everywhere that I've seen so far. Mesa 25.3.3
> > > gives about 89% compared to stock, but there are graphical glitches in
> > > some of the benchmarks.
> > >
> > > I set gpu max_freq via devfreq to the minimum available frequency and
> > > ran the failing benchmark again. It completed once, but failed with a
> > > similar stack trace on the second run. And per sysfs, the gpu did stay
> > > at that minimum. Of note, that causes the benchmark to fail, but
> > > rendering does recover and the unit is still usable afterwards.
> >
> > In sm8550.dtsi, I see that ACD values are not specified in the GPU OPP
> > table. Can we add those (from downstream dt) and try again?
>
> I don't know what I'm looking for in the downstream dt. But if such a
> change gets pushed to lkml, I can grab that and verify.

I took a look at the downstream dt and took a guess at importing the
acd values. I'm not sure if the gpu here is the baseline kalama or
kalama v2. I guessed the former. There were a couple values missing
however, that I had to extrapolate based on other frequencies. This
however changed nothing about my test results. Still getting crashes.
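
For reference, a rough sketch of the shape such an import takes,
assuming the mainline "qcom,opp-acd-level" OPP property and a
"gpu_opp_table" label in sm8550.dtsi. The OPP node name and the value
shown are placeholders, not the numbers from the downstream dt.

```dts
&gpu_opp_table {
	/* Placeholder OPP node name; one such property per OPP entry. */
	opp-680000000 {
		/* Placeholder, NOT a real ACD value from downstream. */
		qcom,opp-acd-level = <0x0>;
	};
};
```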

From my perspective, this part does not appear to be a PM or frequency
related issue. Some of the 3dmark benchmarks, like Wild Life Extreme,
I have never seen crash. I can run the stress variant of that and it
beats the unit for 20 minutes at full clocks with a screaming fan and
that runs perfectly stable. Solar Bay Extreme also runs completely
stable in all of its glorious 3 fps. The two problems are the standard
non-extreme Solar Bay and Steel Nomad Light. Both of these
intermittently crash with similar traces to what I posted before.
There doesn't seem to be consistency in the faults, sometimes it will
be almost immediately after starting the benchmark, other times it
will get 90% through and then fail. But they virtually always fail to
complete. For another point of data, I have never seen GravityMark
cause a fault either.

Aaron

* Re: Questions About SM8550 Support
  2026-02-05  8:01           ` Aaron Kling
@ 2026-02-05 10:54             ` Konrad Dybcio
  2026-02-05 13:29             ` Akhil P Oommen
  2026-02-05 14:43             ` Dmitry Baryshkov
  2 siblings, 0 replies; 31+ messages in thread
From: Konrad Dybcio @ 2026-02-05 10:54 UTC (permalink / raw)
  To: Aaron Kling, Akhil P Oommen; +Cc: rob.clark, Neil Armstrong, linux-arm-msm

On 2/5/26 9:01 AM, Aaron Kling wrote:
> On Thu, Jan 29, 2026 at 8:35 PM Aaron Kling <webgeek1234@gmail.com> wrote:
>>
>> On Thu, Jan 29, 2026 at 5:11 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>>>
>>> On 1/28/2026 11:24 PM, Aaron Kling wrote:
>>>> On Wed, Jan 28, 2026 at 8:46 AM Rob Clark <rob.clark@oss.qualcomm.com> wrote:
>>>>>
>>>>> On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
>>>>> <neil.armstrong@linaro.org> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 1/27/26 23:48, Aaron Kling wrote:
>>>>>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
>>>>>>> for Android, using mainline kernel drivers. I have come across some
>>>>>>> missing functionality and failures that I would like to inquire about.
>>>>>>>
>>>>>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
>>>>>>> mainline. Using changes described in the gunyah watchdog thread [0], a
>>>>>>> dtbo loads and the devices boot as expected. If any of the changes in
>>>>>>> that post don't exist in the base dtb, abl will fail to load the dtbo
>>>>>>> and go to the bootloader menu. This appears to be an issue in the
>>>>>>> baseline abl code, affecting all devices of that generation. Would it
>>>>>>> be allowable to merge a change adding those changes to the sm8550
>>>>>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
>>>>>>
>>>>>> Any addition to the DT must be documented in dt-bindings, so if it's needed
>>>>>> for boot they should be documented and added for sure.
>>>>>>
>>>>>>>
>>>>>>> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
>>>>>>> have locally copied the commits from sm8650 and adapted for sm8550,
>>>>>>> and that seems to work okay. But no measuring of bandwidth was done,
>>>>>>> so the numbers are likely not entirely correct. Is there any plan to
>>>>>>> generate correct tables for sm8550?
>>>>>>
>>>>>> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
>>>>>> is fine but since the values were calculated from downstream DT tables,
>>>>>> the same should be done for sm8550.
>>>>>>
>>>>>>>
>>>>>>> * As part of a series to support the original Odin 2, a patch to
>>>>>>> update sm8550 EAS values was submitted [1]. But that series stalled
>>>>>>> and this was never merged. If this change is valid, which per that
>>>>>>> discussion it appears to be, can it be resubmitted by itself and
>>>>>>> merged?
>>>>>>
>>>>>> I missed this patch, please re-submit, I also need to update the ones
>>>>>> for SM8650.
>>>>>>
>>>>>>>
>>>>>>> * Per the mainline kernel device trees and audio topology provided by
>>>>>>> the oem, these devices use primary i2s for the speakers path. There
>>>>>>> was a commit adding clock support for that as part of an hdmi series
>>>>>>> [2], but that seems to have stalled. Is this going to be picked back
>>>>>>> up?
>>>>>>
>>>>>> No, I do not plan to do this work, it required adding callbacks in the
>>>>>> code to handle the clocks like done for the pre-audioreach firmwares.
>>>>>>
>>>>>>>
>>>>>>> * Inline crypto fails to detect hwkm support. And I see other logs
>>>>>>> online, such as for the sm8550 qrd, that logs the same way my device
>>>>>>> does. I traced the issue to the check for wrapped key support [3]. On
>>>>>>> my devices, the derive call is supported, but the other three calls
>>>>>>> are not. I was pointed at the downstream headers for sm8550 support
>>>>>>> and only derive is listed there, the other three don't appear to be
>>>>>>> used in the downstream driver. Is this expected? And if so, will this
>>>>>>> case be added to the mainline drivers?
>>>>>>
>>>>>> Does hwkm work if you remove the last 3 calls?
>>>>>>
>>>>>>>
>>>>>>> * Some gpu related clocks complain about being stuck off during boot,
>>>>>>> causing stack traces, but the gpu does work. I tried to do some
>>>>>>> research into this, but quickly got lost in the weeds and I have no
>>>>>>> idea where to even look.
>>>>>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
>>>>>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
>>>>>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
>>>>>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
>>>>>>
>>>>>> This may be related with the display handoff from ABL, did you add the
>>>>>> plat region to the reserved memories ?
>>>>>>
>>>>>>>
>>>>>>> * Sometimes when starting rendering, a bandwidth submission times out,
>>>>>>> then the driver immediately complains that said id was left on the
>>>>>>> queue. I have tried increasing the timeout, but the same sequence
>>>>>>> still happens. Timeout happens, immediately followed by a matching
>>>>>>> unexpected response. Implying that this isn't actually a delay /
>>>>>>> timeout issue.
>>>>>>> [ 1848.517020] platform 3d6a000.gmu:
>>>>>>> [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
>>>>>>> HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
>>>>>>> [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
>>>>>>> *ERROR* Unexpected message id 1015 on the response queue
>>>>>>
>>>>>> Weird the timeout was extended for this very purpose
>>>>>>
>>>>>>>
>>>>>>> * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
>>>>>>> unsure if this is a kernel problem or userspace, so I'm submitting
>>>>>>> here first. If the consensus is that it's a userspace issue, I'll
>>>>>>> submit it to mesa.
>>>>>>> [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
>>>>>>> fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
>>>>>>> 00000001512E9000/003d ib2 00000001512E7000/0000
>>>>>>> [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
>>>>>>> [msm]] *ERROR* 67.5.10.1: hangcheck recover!
>>>>>>> [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
>>>>>>> [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
>>>>>>> (com.futuremark.dmandroid.application)
>>>>>>> [ 1860.258126] revision: 0 (67.5.10.1)
>>>>>>> [ 1860.258132] rb 0: fence:    2884/2884
>>>>>>> [ 1860.258133] rptr:     36
>>>>>>> [ 1860.258134] rb wptr:  36
>>>>>>> [ 1860.258135] rb 1: fence:    -256/-256
>>>>>>> [ 1860.258138] rptr:     0
>>>>>>> [ 1860.258138] rb wptr:  0
>>>>>>> [ 1860.258139] rb 2: fence:    41563/41569
>>>>>>> [ 1860.258140] rptr:     1752
>>>>>>> [ 1860.258140] rb wptr:  2319
>>>>>>> [ 1860.258141] rb 3: fence:    -256/-256
>>>>>>> [ 1860.258141] rptr:     0
>>>>>>> [ 1860.258142] rb wptr:  0
>>>>>>> [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
>>>>>>> [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
>>>>>>> [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>> CP_SCRATCH_REG2: 41562
>>>>>>> [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
>>>>>>> [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>> CP_SCRATCH_REG4: 3736059565
>>>>>>> [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>> CP_SCRATCH_REG5: 3736059565
>>>>>>> [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>> CP_SCRATCH_REG6: 3736059565
>>>>>>> [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>> CP_SCRATCH_REG7: 3736059565
>>>>>>
>>>>>> @rob do you have any idea how to solve this crash on a740 ?
>>>>>
>>>>> The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
>>>>> something is unhappy about gpu pm.  I'd focus on that first, since
>>>>> that is almost certainly the cause of the later issues.  If things
>>>>> _sorta_ work (rendering UI, etc) you could try removing all but the
>>>>> lowest gpu OPP as an experiment.  Could be that power related problems
>>>>> surface when the GPU ramps up to higher OPPs.
>>>>
>>>> Things work amazingly well compared to what I was expecting. Using
>>>> mesa staging 26.0 as of yesterday, I'm getting roughly 80% performance
>>>> in the benchmarks that do run, compared to the stock Android. And
>>>> rendering is correct everywhere that I've seen so far. Mesa 25.3.3
>>>> gives about 89% compared to stock, but there are graphical glitches in
>>>> some of the benchmarks.
>>>>
>>>> I set gpu max_freq via devfreq to the minimum available frequency and
>>>> ran the failing benchmark again. It completed once, but failed with a
>>>> similar stack trace on the second run. And per sysfs, the gpu did stay
>>>> at that minimum. Of note, that causes the benchmark to fail, but
>>>> rendering does recover and the unit is still usable afterwards.
>>>
>>> In sm8550.dtsi, I see that ACD values are not specified in the GPU OPP
>>> table. Can we add those (from downstream dt) and try again?
>>
>> I don't know what I'm looking for in the downstream dt. But if such a
>> change gets pushed to lkml, I can grab that and verify.
> 
> I took a look at the downstream dt and took a guess at importing the
> acd values. I'm not sure if the gpu here is the baseline kalama or
> kalama v2. I guessed the former. There were a couple values missing
> however, that I had to extrapolate based on other frequencies. This
> however changed nothing about my test results. Still getting crashes.

FYI if there's a chip and chip-v2, then the former was a prototype
revision. Try the values present in the runtime DT of the original
software, i.e.

cat /sys/firmware/fdt > my.dtb
dtc -I dtb my.dtb -O dts > my.dts
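
Once my.dts exists, the ACD values still have to be picked out of the
GPU power-level nodes. A minimal Python sketch of that step follows. It
assumes the downstream layout uses qcom,gpu-pwrlevel@N nodes carrying
qcom,gpu-freq and qcom,acd-level properties; those names, the helper
names, and the sample values are assumptions for illustration only, so
check them against the actual my.dts.

```python
import re

# Hypothetical excerpt in the shape dtc might emit; the numbers are
# placeholders, not real SM8550 values.
SAMPLE_DTS = """
qcom,gpu-pwrlevels {
    qcom,gpu-pwrlevel@0 {
        reg = <0x0>;
        qcom,gpu-freq = <0x64>;
        qcom,acd-level = <0xabcd>;
    };
    qcom,gpu-pwrlevel@1 {
        reg = <0x1>;
        qcom,gpu-freq = <0xc8>;
    };
};
"""

# Matches one power-level node and captures its index and body.
NODE_RE = re.compile(r"qcom,gpu-pwrlevel@(\d+)\s*\{(.*?)\};", re.S)


def prop(body, name):
    """Return the single-cell integer value of property `name`, or None."""
    m = re.search(re.escape(name) + r"\s*=\s*<\s*(0x[0-9a-fA-F]+|\d+)\s*>", body)
    return int(m.group(1), 0) if m else None


def gpu_levels(dts_text):
    """List every power level with its frequency and ACD value (None if absent)."""
    levels = []
    for m in NODE_RE.finditer(dts_text):
        body = m.group(2)
        levels.append({
            "index": int(m.group(1)),
            "freq": prop(body, "qcom,gpu-freq"),
            "acd": prop(body, "qcom,acd-level"),
        })
    return levels


if __name__ == "__main__":
    for lvl in gpu_levels(SAMPLE_DTS):
        acd = "none" if lvl["acd"] is None else hex(lvl["acd"])
        print(lvl["index"], lvl["freq"], acd)
```

For the real file, gpu_levels(open("my.dts").read()) would replace the
sample string. A level that reports no ACD value is reported as such
rather than filled in.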

Konrad



* Re: Questions About SM8550 Support
  2026-02-05  8:01           ` Aaron Kling
  2026-02-05 10:54             ` Konrad Dybcio
@ 2026-02-05 13:29             ` Akhil P Oommen
  2026-02-05 17:40               ` Aaron Kling
  2026-02-05 14:43             ` Dmitry Baryshkov
  2 siblings, 1 reply; 31+ messages in thread
From: Akhil P Oommen @ 2026-02-05 13:29 UTC (permalink / raw)
  To: Aaron Kling; +Cc: rob.clark, Neil Armstrong, linux-arm-msm

On 2/5/2026 1:31 PM, Aaron Kling wrote:
> On Thu, Jan 29, 2026 at 8:35 PM Aaron Kling <webgeek1234@gmail.com> wrote:
>>
>> On Thu, Jan 29, 2026 at 5:11 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>>>
>>> On 1/28/2026 11:24 PM, Aaron Kling wrote:
>>>> On Wed, Jan 28, 2026 at 8:46 AM Rob Clark <rob.clark@oss.qualcomm.com> wrote:
>>>>>
>>>>> On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
>>>>> <neil.armstrong@linaro.org> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 1/27/26 23:48, Aaron Kling wrote:
>>>>>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
>>>>>>> for Android, using mainline kernel drivers. I have come across some
>>>>>>> missing functionality and failures that I would like to inquire about.
>>>>>>>
>>>>>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
>>>>>>> mainline. Using changes described in the gunyah watchdog thread [0], a
>>>>>>> dtbo loads and the devices boot as expected. If any of the changes in
>>>>>>> that post don't exist in the base dtb, abl will fail to load the dtbo
>>>>>>> and go to the bootloader menu. This appears to be an issue in the
>>>>>>> baseline abl code, affecting all devices of that generation. Would it
>>>>>>> be allowable to merge a change adding those changes to the sm8550
>>>>>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
>>>>>>
>>>>>> Any addition to the DT must be documented in dt-bindings, so if it's needed
>>>>>> for boot they should be documented and added for sure.
>>>>>>
>>>>>>>
>>>>>>> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
>>>>>>> have locally copied the commits from sm8650 and adapted for sm8550,
>>>>>>> and that seems to work okay. But no measuring of bandwidth was done,
>>>>>>> so the numbers are likely not entirely correct. Is there any plan to
>>>>>>> generate correct tables for sm8550?
>>>>>>
>>>>>> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
>>>>>> is fine but since the values were calculated from downstream DT tables,
>>>>>> the same should be done for sm8550.
>>>>>>
>>>>>>>
>>>>>>> * As part of a series to support the original Odin 2, a patch to
>>>>>>> update sm8550 EAS values was submitted [1]. But that series stalled
>>>>>>> and this was never merged. If this change is valid, which per that
>>>>>>> discussion it appears to be, can it be resubmitted by itself and
>>>>>>> merged?
>>>>>>
>>>>>> I missed this patch, please re-submit, I also need to update the ones
>>>>>> for SM8650.
>>>>>>
>>>>>>>
>>>>>>> * Per the mainline kernel device trees and audio topology provided by
>>>>>>> the oem, these devices use primary i2s for the speakers path. There
>>>>>>> was a commit adding clock support for that as part of an hdmi series
>>>>>>> [2], but that seems to have stalled. Is this going to be picked back
>>>>>>> up?
>>>>>>
>>>>>> No, I do not plan to do this work, it required adding callbacks in the
>>>>>> code to handle the clocks like done for the pre-audioreach firmwares.
>>>>>>
>>>>>>>
>>>>>>> * Inline crypto fails to detect hwkm support. And I see other logs
>>>>>>> online, such as for the sm8550 qrd, that logs the same way my device
>>>>>>> does. I traced the issue to the check for wrapped key support [3]. On
>>>>>>> my devices, the derive call is supported, but the other three calls
>>>>>>> are not. I was pointed at the downstream headers for sm8550 support
>>>>>>> and only derive is listed there, the other three don't appear to be
>>>>>>> used in the downstream driver. Is this expected? And if so, will this
>>>>>>> case be added to the mainline drivers?
>>>>>>
>>>>>> Does hwkm work if you remove the last 3 calls?
>>>>>>
>>>>>>>
>>>>>>> * Some gpu related clocks complain about being stuck off during boot,
>>>>>>> causing stack traces, but the gpu does work. I tried to do some
>>>>>>> research into this, but quickly got lost in the weeds and I have no
>>>>>>> idea where to even look.
>>>>>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
>>>>>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
>>>>>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
>>>>>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
>>>>>>
>>>>>> This may be related with the display handoff from ABL, did you add the
>>>>>> plat region to the reserved memories ?
>>>>>>
>>>>>>>
>>>>>>> * Sometimes when starting rendering, a bandwidth submission times out,
>>>>>>> then the driver immediately complains that said id was left on the
>>>>>>> queue. I have tried increasing the timeout, but the same sequence
>>>>>>> still happens. Timeout happens, immediately followed by a matching
>>>>>>> unexpected response. Implying that this isn't actually a delay /
>>>>>>> timeout issue.
>>>>>>> [ 1848.517020] platform 3d6a000.gmu:
>>>>>>> [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
>>>>>>> HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
>>>>>>> [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
>>>>>>> *ERROR* Unexpected message id 1015 on the response queue
>>>>>>
>>>>>> Weird the timeout was extended for this very purpose
>>>>>>
>>>>>>>
>>>>>>> * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
>>>>>>> unsure if this is a kernel problem or userspace, so I'm submitting
>>>>>>> here first. If the consensus is that it's a userspace issue, I'll
>>>>>>> submit it to mesa.
>>>>>>> [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
>>>>>>> fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
>>>>>>> 00000001512E9000/003d ib2 00000001512E7000/0000
>>>>>>> [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
>>>>>>> [msm]] *ERROR* 67.5.10.1: hangcheck recover!
>>>>>>> [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
>>>>>>> [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
>>>>>>> (com.futuremark.dmandroid.application)
>>>>>>> [ 1860.258126] revision: 0 (67.5.10.1)
>>>>>>> [ 1860.258132] rb 0: fence:    2884/2884
>>>>>>> [ 1860.258133] rptr:     36
>>>>>>> [ 1860.258134] rb wptr:  36
>>>>>>> [ 1860.258135] rb 1: fence:    -256/-256
>>>>>>> [ 1860.258138] rptr:     0
>>>>>>> [ 1860.258138] rb wptr:  0
>>>>>>> [ 1860.258139] rb 2: fence:    41563/41569
>>>>>>> [ 1860.258140] rptr:     1752
>>>>>>> [ 1860.258140] rb wptr:  2319
>>>>>>> [ 1860.258141] rb 3: fence:    -256/-256
>>>>>>> [ 1860.258141] rptr:     0
>>>>>>> [ 1860.258142] rb wptr:  0
>>>>>>> [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
>>>>>>> [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
>>>>>>> [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>> CP_SCRATCH_REG2: 41562
>>>>>>> [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
>>>>>>> [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>> CP_SCRATCH_REG4: 3736059565
>>>>>>> [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>> CP_SCRATCH_REG5: 3736059565
>>>>>>> [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>> CP_SCRATCH_REG6: 3736059565
>>>>>>> [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>> CP_SCRATCH_REG7: 3736059565
>>>>>>
>>>>>> @rob do you have any idea how to solve this crash on a740 ?
>>>>>
>>>>> The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
>>>>> something is unhappy about gpu pm.  I'd focus on that first, since
>>>>> that is almost certainly the cause of the later issues.  If things
>>>>> _sorta_ work (rendering UI, etc) you could try removing all but the
>>>>> lowest gpu OPP as an experiment.  Could be that power related problems
>>>>> surface when the GPU ramps up to higher OPPs.
>>>>
>>>> Things work amazingly well compared to what I was expecting. Using
>>>> mesa staging 26.0 as of yesterday, I'm getting roughly 80% performance
>>>> in the benchmarks that do run, compared to the stock Android. And
>>>> rendering is correct everywhere that I've seen so far. Mesa 25.3.3
>>>> gives about 89% compared to stock, but there are graphical glitches in
>>>> some of the benchmarks.
>>>>
>>>> I set gpu max_freq via devfreq to the minimum available frequency and
>>>> ran the failing benchmark again. It completed once, but failed with a
>>>> similar stack trace on the second run. And per sysfs, the gpu did stay
>>>> at that minimum. Of note, that causes the benchmark to fail, but
>>>> rendering does recover and the unit is still usable afterwards.
>>>
>>> In sm8550.dtsi, I see that ACD values are not specified in the GPU OPP
>>> table. Can we add those (from downstream dt) and try again?
>>
>> I don't know what I'm looking for in the downstream dt. But if such a
>> change gets pushed to lkml, I can grab that and verify.
> 
> I took a look at the downstream dt and took a guess at importing the
> acd values. I'm not sure if the gpu here is the baseline kalama or
> kalama v2. I guessed the former. There were a couple values missing
> however, that I had to extrapolate based on other frequencies. This
> however changed nothing about my test results. Still getting crashes.

Please use the values from kalama v2 dtsi. And if the acd property is
missing in any OPP node, that is a hint to the driver+gmu-fw that
ACD should be kept disabled for that freq corner. So, please follow the
same.
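
To illustrate the rule above (an OPP without an ACD value gets no ACD
property, rather than an extrapolated one), here is a small Python
sketch that turns (frequency, ACD) pairs into OPP node text. The
function name and the sample numbers are hypothetical, and
qcom,opp-acd-level is assumed to be the mainline property name; verify
it against the opp-v2-qcom-adreno binding before use.

```python
def opp_acd_lines(levels):
    """Build OPP node text from (freq_hz, acd_or_None) pairs.

    A pair whose ACD value is None produces a node with no ACD property,
    which leaves ACD disabled for that corner instead of guessing a value.
    """
    out = []
    for freq_hz, acd in levels:
        out.append(f"opp-{freq_hz} {{")
        out.append(f"\topp-hz = /bits/ 64 <{freq_hz}>;")
        if acd is not None:
            out.append(f"\tqcom,opp-acd-level = <{acd:#x}>;")
        out.append("};")
    return out


if __name__ == "__main__":
    # Placeholder numbers, not real SM8550 values.
    sample = [(100, 0xabcd), (200, None)]
    print("\n".join(opp_acd_lines(sample)))
```

The output is only a fragment: the other per-OPP properties (level,
bandwidth) are left out here and would still need to come from the
existing table.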

ACD configurations are required to meet the hw specifications. We can't
predict how the hw fails in case of a spec violation. I don't know if
this issue is ACD related, but we should ensure that all power related
configurations are accurate first.

Also, could you please try the latest firmwares (sqe and gmu) from here:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/qcom?id=30979b116b5c5857b72c4332db8db0ff1ca2dc08

> 
> From my perspective, this part does not appear to be a PM or frequency
> related issue. Some of the 3dmark benchmarks I have never seen crash.
> Like Wild Life Extreme. I can run the stress variant of that and it
> beats the unit for 20 minutes at full clocks with a screaming fan and
> that runs perfectly stable. Solar Bay Extreme also runs completely
> stable in all of its glorious 3 fps. The two problems are the standard
> non-extreme Solar Bay and Steel Nomad Light. Both of these
> intermittently crash with similar traces to what I posted before.
> There doesn't seem to be consistency in the faults, sometimes it will
> be almost immediately after starting the benchmark, other times it
> will get 90% through and then fail. But they virtually always fail to
> complete. For another point of data, I have never seen GravityMark
> cause a fault either.

The peak current draw can vary between benchmarks. So we can't rule out
power issues. And are you able to reproduce the same issue on another
device?

-Akhil.

> 
> Aaron



* Re: Questions About SM8550 Support
  2026-02-05  8:01           ` Aaron Kling
  2026-02-05 10:54             ` Konrad Dybcio
  2026-02-05 13:29             ` Akhil P Oommen
@ 2026-02-05 14:43             ` Dmitry Baryshkov
  2 siblings, 0 replies; 31+ messages in thread
From: Dmitry Baryshkov @ 2026-02-05 14:43 UTC (permalink / raw)
  To: Aaron Kling; +Cc: Akhil P Oommen, rob.clark, Neil Armstrong, linux-arm-msm

On Thu, Feb 05, 2026 at 02:01:01AM -0600, Aaron Kling wrote:
> On Thu, Jan 29, 2026 at 8:35 PM Aaron Kling <webgeek1234@gmail.com> wrote:
> >
> > On Thu, Jan 29, 2026 at 5:11 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
> > >
> > > On 1/28/2026 11:24 PM, Aaron Kling wrote:
> > > > On Wed, Jan 28, 2026 at 8:46 AM Rob Clark <rob.clark@oss.qualcomm.com> wrote:
> > > >>
> > > >> On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
> > > >> <neil.armstrong@linaro.org> wrote:
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> On 1/27/26 23:48, Aaron Kling wrote:
> > > >>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> > > >>>> for Android, using mainline kernel drivers. I have come across some
> > > >>>> missing functionality and failures that I would like to inquire about.
> > > >>>>
> > > >>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
> > > >>>> mainline. Using changes described in the gunyah watchdog thread [0], a
> > > >>>> dtbo loads and the devices boot as expected. If any of the changes in
> > > >>>> that post don't exist in the base dtb, abl will fail to load the dtbo
> > > >>>> and go to the bootloader menu. This appears to be an issue in the
> > > >>>> baseline abl code, affecting all devices of that generation. Would it
> > > >>>> be allowable to merge a change adding those changes to the sm8550
> > > >>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
> > > >>>
> > > >>> Any addition to the DT must be documented in dt-bindings, so if it's needed
> > > >>> for boot they should be documented and added for sure.
> > > >>>
> > > >>>>
> > > >>>> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
> > > >>>> have locally copied the commits from sm8650 and adapted for sm8550,
> > > >>>> and that seems to work okay. But no measuring of bandwidth was done,
> > > >>>> so the numbers are likely not entirely correct. Is there any plan to
> > > >>>> generate correct tables for sm8550?
> > > >>>
> > > >>> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
> > > >>> is fine but since the values were calculated from downstream DT tables,
> > > >>> the same should be done for sm8550.
> > > >>>
> > > >>>>
> > > >>>> * As part of a series to support the original Odin 2, a patch to
> > > >>>> update sm8550 EAS values was submitted [1]. But that series stalled
> > > >>>> and this was never merged. If this change is valid, which per that
> > > >>>> discussion it appears to be, can it be resubmitted by itself and
> > > >>>> merged?
> > > >>>
> > > >>> I missed this patch, please re-submit, I also need to update the ones
> > > >>> for SM8650.
> > > >>>
> > > >>>>
> > > >>>> * Per the mainline kernel device trees and audio topology provided by
> > > >>>> the oem, these devices use primary i2s for the speakers path. There
> > > >>>> was a commit adding clock support for that as part of an hdmi series
> > > >>>> [2], but that seems to have stalled. Is this going to be picked back
> > > >>>> up?
> > > >>>
> > > >>> No, I do not plan to do this work, it required adding callbacks in the
> > > >>> code to handle the clocks like done for the pre-audioreach firmwares.
> > > >>>
> > > >>>>
> > > >>>> * Inline crypto fails to detect hwkm support. And I see other logs
> > > >>>> online, such as for the sm8550 qrd, that logs the same way my device
> > > >>>> does. I traced the issue to the check for wrapped key support [3]. On
> > > >>>> my devices, the derive call is supported, but the other three calls
> > > >>>> are not. I was pointed at the downstream headers for sm8550 support
> > > >>>> and only derive is listed there, the other three don't appear to be
> > > >>>> used in the downstream driver. Is this expected? And if so, will this
> > > >>>> case be added to the mainline drivers?
> > > >>>
> > > >>> Does hwkm work if you remove the last 3 calls?
> > > >>>
> > > >>>>
> > > >>>> * Some gpu related clocks complain about being stuck off during boot,
> > > >>>> causing stack traces, but the gpu does work. I tried to do some
> > > >>>> research into this, but quickly got lost in the weeds and I have no
> > > >>>> idea where to even look.
> > > >>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> > > >>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> > > >>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> > > >>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
> > > >>>
> > > >>> This may be related with the display handoff from ABL, did you add the
> > > >>> plat region to the reserved memories ?
> > > >>>
> > > >>>>
> > > >>>> * Sometimes when starting rendering, a bandwidth submission times out,
> > > >>>> then the driver immediately complains that said id was left on the
> > > >>>> queue. I have tried increasing the timeout, but the same sequence
> > > >>>> still happens. Timeout happens, immediately followed by a matching
> > > >>>> unexpected response. Implying that this isn't actually a delay /
> > > >>>> timeout issue.
> > > >>>> [ 1848.517020] platform 3d6a000.gmu:
> > > >>>> [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
> > > >>>> HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
> > > >>>> [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
> > > >>>> *ERROR* Unexpected message id 1015 on the response queue
> > > >>>
> > > >>> Weird the timeout was extended for this very purpose
> > > >>>
> > > >>>>
> > > >>>> * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
> > > >>>> unsure if this is a kernel problem or userspace, so I'm submitting
> > > >>>> here first. If the consensus is that it's a userspace issue, I'll
> > > >>>> submit it to mesa.
> > > >>>> [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
> > > >>>> fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
> > > >>>> 00000001512E9000/003d ib2 00000001512E7000/0000
> > > >>>> [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
> > > >>>> [msm]] *ERROR* 67.5.10.1: hangcheck recover!
> > > >>>> [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
> > > >>>> [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
> > > >>>> (com.futuremark.dmandroid.application)
> > > >>>> [ 1860.258126] revision: 0 (67.5.10.1)
> > > >>>> [ 1860.258132] rb 0: fence:    2884/2884
> > > >>>> [ 1860.258133] rptr:     36
> > > >>>> [ 1860.258134] rb wptr:  36
> > > >>>> [ 1860.258135] rb 1: fence:    -256/-256
> > > >>>> [ 1860.258138] rptr:     0
> > > >>>> [ 1860.258138] rb wptr:  0
> > > >>>> [ 1860.258139] rb 2: fence:    41563/41569
> > > >>>> [ 1860.258140] rptr:     1752
> > > >>>> [ 1860.258140] rb wptr:  2319
> > > >>>> [ 1860.258141] rb 3: fence:    -256/-256
> > > >>>> [ 1860.258141] rptr:     0
> > > >>>> [ 1860.258142] rb wptr:  0
> > > >>>> [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
> > > >>>> [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
> > > >>>> [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > > >>>> CP_SCRATCH_REG2: 41562
> > > >>>> [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
> > > >>>> [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > > >>>> CP_SCRATCH_REG4: 3736059565
> > > >>>> [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > > >>>> CP_SCRATCH_REG5: 3736059565
> > > >>>> [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > > >>>> CP_SCRATCH_REG6: 3736059565
> > > >>>> [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> > > >>>> CP_SCRATCH_REG7: 3736059565
> > > >>>
> > > >>> @rob do you have any idea how to solve this crash on a740 ?
> > > >>
> > > >> The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
> > > >> something is unhappy about gpu pm.  I'd focus on that first, since
> > > >> that is almost certainly the cause of the later issues.  If things
> > > >> _sorta_ work (rendering UI, etc) you could try removing all but the
> > > >> lowest gpu OPP as an experiment.  Could be that power related problems
> > > >> surface when the GPU ramps up to higher OPPs.
> > > >
> > > > Things work amazingly well compared to what I was expecting. Using
> > > > mesa staging 26.0 as of yesterday, I'm getting roughly 80% performance
> > > > in the benchmarks that do run, compared to the stock Android. And
> > > > rendering is correct everywhere that I've seen so far. Mesa 25.3.3
> > > > gives about 89% compared to stock, but there are graphical glitches in
> > > > some of the benchmarks.
> > > >
> > > > I set gpu max_freq via devfreq to the minimum available frequency and
> > > > ran the failing benchmark again. It completed once, but failed with a
> > > > similar stack trace on the second run. And per sysfs, the gpu did stay
> > > > at that minimum. Of note, that causes the benchmark to fail, but
> > > > rendering does recover and the unit is still usable afterwards.
> > >
> > > In sm8550.dtsi, I see that ACD values are not specified in the GPU OPP
> > > table. Can we add those (from downstream dt) and try again?
> >
> > I don't know what I'm looking for in the downstream dt. But if such a
> > change gets pushed to lkml, I can grab that and verify.
> 
> I took a look at the downstream dt and took a guess at importing the
> acd values. I'm not sure if the gpu here is the baseline kalama or
> kalama v2. I guessed the former. There were a couple values missing
> however, that I had to extrapolate based on other frequencies. This
> however changed nothing about my test results. Still getting crashes.

Usually it's the latter. You can check it in /sys/kernel/debug/qcom_socinfo/raw_version

> 
> From my perspective, this part does not appear to be a PM or frequency
> related issue. Some of the 3dmark benchmarks I have never seen crash.
> Like Wild Life Extreme. I can run the stress variant of that and it
> beats the unit for 20 minutes at full clocks with a screaming fan and
> that runs perfectly stable. Solar Bay Extreme also runs completely
> stable in all of its glorious 3 fps. The two problems are the standard
> non-extreme Solar Bay and Steel Nomad Light. Both of these
> intermittently crash with similar traces to what I posted before.
> There doesn't seem to be consistency in the faults, sometimes it will
> be almost immediately after starting the benchmark, other times it
> will get 90% through and then fail. But they virtually always fail to
> complete. For another point of data, I have never seen GravityMark
> cause a fault either.
> 
> Aaron

-- 
With best wishes
Dmitry

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Questions About SM8550 Support
  2026-02-05 13:29             ` Akhil P Oommen
@ 2026-02-05 17:40               ` Aaron Kling
  2026-03-10 21:33                 ` Akhil P Oommen
  0 siblings, 1 reply; 31+ messages in thread
From: Aaron Kling @ 2026-02-05 17:40 UTC (permalink / raw)
  To: Akhil P Oommen; +Cc: rob.clark, Neil Armstrong, linux-arm-msm

On Thu, Feb 5, 2026 at 7:29 AM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>
> On 2/5/2026 1:31 PM, Aaron Kling wrote:
> > On Thu, Jan 29, 2026 at 8:35 PM Aaron Kling <webgeek1234@gmail.com> wrote:
> >>
> >> On Thu, Jan 29, 2026 at 5:11 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
> >>>
> >>> On 1/28/2026 11:24 PM, Aaron Kling wrote:
> >>>> On Wed, Jan 28, 2026 at 8:46 AM Rob Clark <rob.clark@oss.qualcomm.com> wrote:
> >>>>>
> >>>>> On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
> >>>>> <neil.armstrong@linaro.org> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 1/27/26 23:48, Aaron Kling wrote:
> >>>>>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> >>>>>>> for Android, using mainline kernel drivers. I have come across some
> >>>>>>> missing functionality and failures that I would like to inquire about.
> >>>>>>>
> >>>>>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
> >>>>>>> mainline. Using changes described in the gunyah watchdog thread [0], a
> >>>>>>> dtbo loads and the devices boot as expected. If any of the changes in
> >>>>>>> that post don't exist in the base dtb, abl will fail to load the dtbo
> >>>>>>> and go to the bootloader menu. This appears to be an issue in the
> >>>>>>> baseline abl code, affecting all devices of that generation. Would it
> >>>>>>> be allowable to merge a change adding those changes to the sm8550
> >>>>>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
> >>>>>>
> >>>>>> Any addition to the DT must be documented in dt-bindings, so if it's needed
> >>>>>> for boot they should be documented and added for sure.
> >>>>>>
> >>>>>>>
> >>>>>>> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
> >>>>>>> have locally copied the commits from sm8650 and adapted for sm8550,
> >>>>>>> and that seems to work okay. But no measuring of bandwidth was done,
> >>>>>>> so the numbers are likely not entirely correct. Is there any plan to
> >>>>>>> generate correct tables for sm8550?
> >>>>>>
> >>>>>> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
> >>>>>> is fine but since the values were calculated from downstream DT tables,
> >>>>>> the same should be done for sm8550.
> >>>>>>
> >>>>>>>
> >>>>>>> * As part of a series to support the original Odin 2, a patch to
> >>>>>>> update sm8550 EAS values was submitted [1]. But that series stalled
> >>>>>>> and this was never merged. If this change is valid, which per that
> >>>>>>> discussion it appears to be, can it be resubmitted by itself and
> >>>>>>> merged?
> >>>>>>
> >>>>>> I missed this patch, please re-submit, I also need to update the ones
> >>>>>> for SM8650.
> >>>>>>
> >>>>>>>
> >>>>>>> * Per the mainline kernel device trees and audio topology provide by
> >>>>>>> the oem, these devices use primary i2s for the speakers path. There
> >>>>>>> was a commit adding clock support for that as part of an hdmi series
> >>>>>>> [2], but that seems to have stalled. Is this going to be picked back
> >>>>>>> up?
> >>>>>>
> >>>>>> No, I do not plan to do this work, it required adding callbacks in the
> >>>>>> code to handle the clocks like done for the pre-audioreach firmwares.
> >>>>>>
> >>>>>>>
> >>>>>>> * Inline crypto fails to detect hwkm support. And I see other logs
> >>>>>>> online, such as for the sm8550 qrd, that logs the same way my device
> >>>>>>> does. I traced the issue to the check for wrapped key support [3]. On
> >>>>>>> my devices, the derive call is supported, but the other three calls
> >>>>>>> are not. I was pointed at the downstream headers for sm8550 support
> >>>>>>> and only derive is listed there, the other three don't appear to be
> >>>>>>> used in the downstream driver. Is this expected? And if so, will this
> >>>>>>> case be added to the mainline drivers?
> >>>>>>
> >>>>>> Does hwkm work with you remove the last 3 calls ?
> >>>>>>
> >>>>>>>
> >>>>>>> * Some gpu related clocks complain about being stuck off during boot,
> >>>>>>> causing stack traces, but the gpu does work. I tried to do some
> >>>>>>> research into this, but quickly got lost in the weeds and I have no
> >>>>>>> idea where to even look.
> >>>>>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> >>>>>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> >>>>>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> >>>>>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
> >>>>>>
> >>>>>> This may be related with the display handoff from ABL, did you add the
> >>>>>> plat region to the reserved memories ?
> >>>>>>
> >>>>>>>
> >>>>>>> * Sometimes when starting rendering, a bandwidth submission times out,
> >>>>>>> then the driver immediately complains that said id was left on the
> >>>>>>> queue. I have tried increasing the timeout, but the same sequence
> >>>>>>> still happens. Timeout happens, immediately followed by a matching
> >>>>>>> unexpected response. Implying that this isn't actually a delay /
> >>>>>>> timeout issue.
> >>>>>>> [ 1848.517020] platform 3d6a000.gmu:
> >>>>>>> [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
> >>>>>>> HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
> >>>>>>> [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
> >>>>>>> *ERROR* Unexpected message id 1015 on the response queue
> >>>>>>
> >>>>>> Weird the timeout was extended for this very purpose
> >>>>>>
> >>>>>>>
> >>>>>>> * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
> >>>>>>> unsure if this is a kernel problem or userspace, so I'm submitting
> >>>>>>> here first. If the consensus is that it's a userspace issue, I'll
> >>>>>>> submit it to mesa.
> >>>>>>> [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
> >>>>>>> fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
> >>>>>>> 00000001512E9000/003d ib2 00000001512E7000/0000
> >>>>>>> [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
> >>>>>>> [msm]] *ERROR* 67.5.10.1: hangcheck recover!
> >>>>>>> [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
> >>>>>>> [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
> >>>>>>> (com.futuremark.dmandroid.application)
> >>>>>>> [ 1860.258126] revision: 0 (67.5.10.1)
> >>>>>>> [ 1860.258132] rb 0: fence:    2884/2884
> >>>>>>> [ 1860.258133] rptr:     36
> >>>>>>> [ 1860.258134] rb wptr:  36
> >>>>>>> [ 1860.258135] rb 1: fence:    -256/-256
> >>>>>>> [ 1860.258138] rptr:     0
> >>>>>>> [ 1860.258138] rb wptr:  0
> >>>>>>> [ 1860.258139] rb 2: fence:    41563/41569
> >>>>>>> [ 1860.258140] rptr:     1752
> >>>>>>> [ 1860.258140] rb wptr:  2319
> >>>>>>> [ 1860.258141] rb 3: fence:    -256/-256
> >>>>>>> [ 1860.258141] rptr:     0
> >>>>>>> [ 1860.258142] rb wptr:  0
> >>>>>>> [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
> >>>>>>> [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
> >>>>>>> [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>> CP_SCRATCH_REG2: 41562
> >>>>>>> [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
> >>>>>>> [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>> CP_SCRATCH_REG4: 3736059565
> >>>>>>> [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>> CP_SCRATCH_REG5: 3736059565
> >>>>>>> [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>> CP_SCRATCH_REG6: 3736059565
> >>>>>>> [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>> CP_SCRATCH_REG7: 3736059565
> >>>>>>
> >>>>>> @rob do you have any idea how to solve this crash on a740 ?
> >>>>>
> >>>>> The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
> >>>>> something is unhappy about gpu pm.  I'd focus on that first, since
> >>>>> that is almost certainly the cause of the later issues.  If things
> >>>>> _sorta_ work (rendering UI, etc) you could try removing all but the
> >>>>> lowest gpu OPP as an experiment.  Could be that power related problems
> >>>>> surface when the GPU ramps up to higher OPPs.
> >>>>
> >>>> Things work amazingly well compared to what I was expecting. Using
> >>>> mesa staging 26.0 as of yesterday, I'm getting roughly 80% performance
> >>>> in the benchmarks that do run, compared to the stock Android. And
> >>>> rendering is correct everywhere that I've seen so far. Mesa 25.3.3
> >>>> gives about 89% compared to stock, but there are graphical glitches in
> >>>> some of the benchmarks.
> >>>>
> >>>> I set gpu max_freq via devfreq to the minimum available frequency and
> >>>> ran the failing benchmark again. It completed once, but failed with a
> >>>> similar stack trace on the second run. And per sysfs, the gpu did stay
> >>>> at that minimum. Of note, that causes the benchmark to fail, but
> >>>> rendering does recover and the unit is still usable afterwards.
> >>>
> >>> In sm8550.dtsi, I see that ACD values are not specified in the GPU OPP
> >>> table. Can we add those (from downstream dt) and try again?
> >>
> >> I don't know what I'm looking for in the downstream dt. But if such a
> >> change gets pushed to lkml, I can grab that and verify.
> >
> > I took at look at the downstream dt and took a guess at importing the
> > acd values. I'm not sure if the gpu here is the baseline kalama or
> > kalama v2. I guessed the former. There were a couple values missing
> > however, that I had to extrapolate based on other frequencies. This
> > however changed nothing about my test results. Still getting crashes.
>
> Please use the values from kalama v2 dtsi. And if the acd property is
> missing in any OPP node, that is a hint to the the driver+gmu-fw that
> ACD should be kept disabled for that freq corner. So, please follow the
> same.

Alright, I updated the change using values from the downstream v2
dtsi. Still getting the same results. Since it's needed regardless,
would you like me to submit the ACD patch?

> ACD configurations are required to meet the hw specifications. We can't
> predict how the hw fails in case of a spec violation. I don't know if
> this issue is ACD related, but we should ensure that all power related
> configurations are accurate first.
>
> Also, could you please try the latest firmwares (sqe and gmu) from here:
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/qcom?id=30979b116b5c5857b72c4332db8db0ff1ca2dc08

These are what I'm already using.

> >
> > From my perspective, this part does not appear to be a PM or frequency
> > related issue. Some of the 3dmark benchmarks I have never seen crash.
> > Like Wild Life Extreme. I can run the stress variant of that and it
> > beats the unit for 20 minutes at full clocks with a screaming fan and
> > that runs perfectly stable. Solar Bay Extreme also runs completely
> > stable in all of its glorious 3 fps. The two problems are the standard
> > non-extreme Solar Bay and Steel Nomad Light. Both of these
> > intermittently crash with similar traces to what I posted before.
> > There doesn't seem to be consistency in the faults, sometimes it will
> > be almost immediately after starting the benchmark, other times it
> > will get 90% through and then fail. But they virtually always fail to
> > complete. For another point of data, I have never seen GravityMark
> > cause a fault either.
>
> The peak current draw can vary between benchmarks. So we can't rule out
> power issues. And are you able to reproduce the same issue on another
> device?

The only relevant devices I have are two of the AYN qcs8550 devices, a
Thor and an Odin 2 Mini. The issue happens on both, yes. But I don't
have anything like a phone or devkit with sm8550.

Aaron

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Questions About SM8550 Support
  2026-01-28 18:42   ` Aaron Kling
@ 2026-02-06 15:04     ` Neil Armstrong
  0 siblings, 0 replies; 31+ messages in thread
From: Neil Armstrong @ 2026-02-06 15:04 UTC (permalink / raw)
  To: Aaron Kling; +Cc: linux-arm-msm

On 1/28/26 19:42, Aaron Kling wrote:
> On Wed, Jan 28, 2026 at 2:50 AM Neil Armstrong
> <neil.armstrong@linaro.org> wrote:
>>
>> Hi,
>>
>> On 1/27/26 23:48, Aaron Kling wrote:
>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
>>> for Android, using mainline kernel drivers. I have come across some
>>> missing functionality and failures that I would like to inquire about.
>>>
>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
>>> mainline. Using changes described in the gunyah watchdog thread [0], a
>>> dtbo loads and the devices boot as expected. If any of the changes in
>>> that post don't exist in the base dtb, abl will fail to load the dtbo
>>> and go to the bootloader menu. This appears to be an issue in the
>>> baseline abl code, affecting all devices of that generation. Would it
>>> be allowable to merge a change adding those changes to the sm8550
>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
>>
>> Any addition to the DT must be documented in dt-bindings, so if it's needed
>> for boot they should be documented and added for sure.
> 
> I can make the change and see if bindings check reports any new issues
> before uploading.
> 
>>> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
>>> have locally copied the commits from sm8650 and adapted for sm8550,
>>> and that seems to work okay. But no measuring of bandwidth was done,
>>> so the numbers are likely not entirely correct. Is there any plan to
>>> generate correct tables for sm8550?
>>
>> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
>> is fine but since the values were calculated from downstream DT tables,
>> the same should be done for sm8550.
> 
> What am I looking for in the downstream dt? I'm not greatly familiar
> with that layout. But if I get pointed at the right stuff, I can do
> the legwork.

As [1] points out, I used the downstream memlat ddr, llcc & l3 tables
for each cluster type together with the actual EPSS cpufreq LUT tables.

So first, run your SM8550 device and get the EPSS cpufreq LUT table,
then for each OPP of each cluster find the corresponding memlat ddr, llcc
& l3 tables and add an entry like:

+	cpuX_opp_table: opp-table-cpuX {
+		compatible = "operating-points-v2";
+		opp-shared;
+
+		opp-FREQ {
+			opp-hz = /bits/ 64 <FREQ>;
+			opp-peak-kBps = <(llcc * 16) (ddr * 4) (l3 * 32)>;
+		};
+	};

In downstream you'll find tables like this:
	qcom,cpufreq-memfreq-tbl =
		< 1132800  547000 >,
		< 1574400  768000 >,
		< 2073600 1555000 >;

Looking at the code, this is handled as:

freq <= 1132800 then value is 547000
freq <= 1574400 then value is 768000
freq <= 2073600 then value is 1555000
and the last entry also applies for freq > 2073600.

Look for the qcom_memlat subnodes: ddr, llcc & l3. Look at qcom,cpulist
to see which cpu cluster each one applies to.

The "* 4" comes from the qcom,bus-width in qcom_ddr_dcvs_hw,
the "* 16" from qcom_llcc_dcvs_hw and "* 32" from qcom_l3_dcvs_hw.

Note: add a _separate_ opp table for each cluster, otherwise the kernel
won't use them properly.
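
To make the lookup rule above concrete, here is a minimal Python sketch.
It is not taken from any kernel tree. SAMPLE_TBL is just the example
table quoted above, memfreq_for() and opp_peak_kbps() are hypothetical
helper names, and it assumes the table values are in kHz and the
multipliers are bus widths in bytes, so the products come out in kBps.

```python
from bisect import bisect_left

# Example table quoted above: (cpu freq, memory freq), both assumed to be kHz.
SAMPLE_TBL = [
    (1132800, 547000),
    (1574400, 768000),
    (2073600, 1555000),
]

# Multipliers quoted above (qcom,bus-width of each *_dcvs_hw node).
LLCC_WIDTH, DDR_WIDTH, L3_WIDTH = 16, 4, 32


def memfreq_for(cpu_khz, table):
    """Pick the first row whose cpu freq is >= cpu_khz; clamp to the last row."""
    cpu_freqs = [row[0] for row in table]
    idx = min(bisect_left(cpu_freqs, cpu_khz), len(table) - 1)
    return table[idx][1]


def opp_peak_kbps(cpu_khz, llcc_tbl, ddr_tbl, l3_tbl):
    """Return the (llcc, ddr, l3) triple in the order used by opp-peak-kBps."""
    return (
        memfreq_for(cpu_khz, llcc_tbl) * LLCC_WIDTH,
        memfreq_for(cpu_khz, ddr_tbl) * DDR_WIDTH,
        memfreq_for(cpu_khz, l3_tbl) * L3_WIDTH,
    )


if __name__ == "__main__":
    # Placeholder only: reuses the same sample table for all three paths.
    for khz in (1132800, 1574400, 2073600):
        print(khz, opp_peak_kbps(khz, SAMPLE_TBL, SAMPLE_TBL, SAMPLE_TBL))
```

On a real device each of the three paths would be fed its own downstream
table for the cluster in question, and cpu_khz would come from the EPSS
LUT dump rather than from the sample rows.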

[1] https://lore.kernel.org/all/20250211-topic-sm8650-ddr-bw-scaling-v2-3-a0c950540e68@linaro.org/

> 
>>> * As part of a series to support the original Odin 2, a patch to
>>> update sm8550 EAS values was submitted [1]. But that series stalled
>>> and this was never merged. If this change is valid, which per that
>>> discussion it appears to be, can it be resubmitted by itself and
>>> merged?
>>
>> I missed this patch, please re-submit, I also need to update the ones
>> for SM8650.
> 
> Ack.
> 
>>> * Per the mainline kernel device trees and audio topology provide by
>>> the oem, these devices use primary i2s for the speakers path. There
>>> was a commit adding clock support for that as part of an hdmi series
>>> [2], but that seems to have stalled. Is this going to be picked back
>>> up?
>>
>> No, I do not plan to do this work, it required adding callbacks in the
>> code to handle the clocks like done for the pre-audioreach firmwares.
>>
>>>
>>> * Inline crypto fails to detect hwkm support. And I see other logs
>>> online, such as for the sm8550 qrd, that logs the same way my device
>>> does. I traced the issue to the check for wrapped key support [3]. On
>>> my devices, the derive call is supported, but the other three calls
>>> are not. I was pointed at the downstream headers for sm8550 support
>>> and only derive is listed there, the other three don't appear to be
>>> used in the downstream driver. Is this expected? And if so, will this
>>> case be added to the mainline drivers?
>>
>> Does hwkm work with you remove the last 3 calls ?
> 
> I would assume not, since the ufs driver [0] references all four. And
> the plumbing doesn't do any further existence checks and just makes
> the smc calls.
> 
>>> * Some gpu related clocks complain about being stuck off during boot,
>>> causing stack traces, but the gpu does work. I tried to do some
>>> research into this, but quickly got lost in the weeds and I have no
>>> idea where to even look.
>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
>>
>> This may be related with the display handoff from ABL, did you add the
>> plat region to the reserved memories ?
>>
> 
> I did not, for these logs. Earlier in bringup, I did try to make abl
> leave the display on by adding the splash region, but that just caused
> display corruption before the kernel reset the display controller, so
> I pulled that back out. And I saw a comment somewhere stating that
> seamless handoff is not supported. Is that still the case, or should
> seamless handoff work now? It would be a much nicer user experience if
> it did.
> 
> Aaron
> 
> [0] https://github.com/torvalds/linux/blob/master/drivers/ufs/host/ufs-qcom.c#L311-L314


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Questions About SM8550 Support
  2026-02-05 17:40               ` Aaron Kling
@ 2026-03-10 21:33                 ` Akhil P Oommen
  2026-03-10 21:53                   ` Aaron Kling
  0 siblings, 1 reply; 31+ messages in thread
From: Akhil P Oommen @ 2026-03-10 21:33 UTC (permalink / raw)
  To: Aaron Kling; +Cc: rob.clark, Neil Armstrong, linux-arm-msm

On 2/5/2026 11:10 PM, Aaron Kling wrote:
> On Thu, Feb 5, 2026 at 7:29 AM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>>
>> On 2/5/2026 1:31 PM, Aaron Kling wrote:
>>> On Thu, Jan 29, 2026 at 8:35 PM Aaron Kling <webgeek1234@gmail.com> wrote:
>>>>
>>>> On Thu, Jan 29, 2026 at 5:11 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>>>>>
>>>>> On 1/28/2026 11:24 PM, Aaron Kling wrote:
>>>>>> On Wed, Jan 28, 2026 at 8:46 AM Rob Clark <rob.clark@oss.qualcomm.com> wrote:
>>>>>>>
>>>>>>> On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
>>>>>>> <neil.armstrong@linaro.org> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 1/27/26 23:48, Aaron Kling wrote:
>>>>>>>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
>>>>>>>>> for Android, using mainline kernel drivers. I have come across some
>>>>>>>>> missing functionality and failures that I would like to inquire about.
>>>>>>>>>
>>>>>>>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
>>>>>>>>> mainline. Using changes described in the gunyah watchdog thread [0], a
>>>>>>>>> dtbo loads and the devices boot as expected. If any of the changes in
>>>>>>>>> that post don't exist in the base dtb, abl will fail to load the dtbo
>>>>>>>>> and go to the bootloader menu. This appears to be an issue in the
>>>>>>>>> baseline abl code, affecting all devices of that generation. Would it
>>>>>>>>> be allowable to merge a change adding those changes to the sm8550
>>>>>>>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
>>>>>>>>
>>>>>>>> Any addition to the DT must be documented in dt-bindings, so if it's needed
>>>>>>>> for boot they should be documented and added for sure.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
>>>>>>>>> have locally copied the commits from sm8650 and adapted for sm8550,
>>>>>>>>> and that seems to work okay. But no measuring of bandwidth was done,
>>>>>>>>> so the numbers are likely not entirely correct. Is there any plan to
>>>>>>>>> generate correct tables for sm8550?
>>>>>>>>
>>>>>>>> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
>>>>>>>> is fine but since the values were calculated from downstream DT tables,
>>>>>>>> the same should be done for sm8550.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * As part of a series to support the original Odin 2, a patch to
>>>>>>>>> update sm8550 EAS values was submitted [1]. But that series stalled
>>>>>>>>> and this was never merged. If this change is valid, which per that
>>>>>>>>> discussion it appears to be, can it be resubmitted by itself and
>>>>>>>>> merged?
>>>>>>>>
>>>>>>>> I missed this patch, please re-submit, I also need to update the ones
>>>>>>>> for SM8650.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Per the mainline kernel device trees and audio topology provide by
>>>>>>>>> the oem, these devices use primary i2s for the speakers path. There
>>>>>>>>> was a commit adding clock support for that as part of an hdmi series
>>>>>>>>> [2], but that seems to have stalled. Is this going to be picked back
>>>>>>>>> up?
>>>>>>>>
>>>>>>>> No, I do not plan to do this work, it required adding callbacks in the
>>>>>>>> code to handle the clocks like done for the pre-audioreach firmwares.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Inline crypto fails to detect hwkm support. And I see other logs
>>>>>>>>> online, such as for the sm8550 qrd, that logs the same way my device
>>>>>>>>> does. I traced the issue to the check for wrapped key support [3]. On
>>>>>>>>> my devices, the derive call is supported, but the other three calls
>>>>>>>>> are not. I was pointed at the downstream headers for sm8550 support
>>>>>>>>> and only derive is listed there, the other three don't appear to be
>>>>>>>>> used in the downstream driver. Is this expected? And if so, will this
>>>>>>>>> case be added to the mainline drivers?
>>>>>>>>
>>>>>>>> Does hwkm work with you remove the last 3 calls ?
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Some gpu related clocks complain about being stuck off during boot,
>>>>>>>>> causing stack traces, but the gpu does work. I tried to do some
>>>>>>>>> research into this, but quickly got lost in the weeds and I have no
>>>>>>>>> idea where to even look.
>>>>>>>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
>>>>>>>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
>>>>>>>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
>>>>>>>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
>>>>>>>>
>>>>>>>> This may be related with the display handoff from ABL, did you add the
>>>>>>>> plat region to the reserved memories ?
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Sometimes when starting rendering, a bandwidth submission times out,
>>>>>>>>> then the driver immediately complains that said id was left on the
>>>>>>>>> queue. I have tried increasing the timeout, but the same sequence
>>>>>>>>> still happens. Timeout happens, immediately followed by a matching
>>>>>>>>> unexpected response. Implying that this isn't actually a delay /
>>>>>>>>> timeout issue.
>>>>>>>>> [ 1848.517020] platform 3d6a000.gmu:
>>>>>>>>> [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
>>>>>>>>> HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
>>>>>>>>> [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
>>>>>>>>> *ERROR* Unexpected message id 1015 on the response queue
>>>>>>>>
>>>>>>>> Weird the timeout was extended for this very purpose
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
>>>>>>>>> unsure if this is a kernel problem or userspace, so I'm submitting
>>>>>>>>> here first. If the consensus is that it's a userspace issue, I'll
>>>>>>>>> submit it to mesa.
>>>>>>>>> [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
>>>>>>>>> fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
>>>>>>>>> 00000001512E9000/003d ib2 00000001512E7000/0000
>>>>>>>>> [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
>>>>>>>>> [msm]] *ERROR* 67.5.10.1: hangcheck recover!
>>>>>>>>> [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
>>>>>>>>> [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
>>>>>>>>> (com.futuremark.dmandroid.application)
>>>>>>>>> [ 1860.258126] revision: 0 (67.5.10.1)
>>>>>>>>> [ 1860.258132] rb 0: fence:    2884/2884
>>>>>>>>> [ 1860.258133] rptr:     36
>>>>>>>>> [ 1860.258134] rb wptr:  36
>>>>>>>>> [ 1860.258135] rb 1: fence:    -256/-256
>>>>>>>>> [ 1860.258138] rptr:     0
>>>>>>>>> [ 1860.258138] rb wptr:  0
>>>>>>>>> [ 1860.258139] rb 2: fence:    41563/41569
>>>>>>>>> [ 1860.258140] rptr:     1752
>>>>>>>>> [ 1860.258140] rb wptr:  2319
>>>>>>>>> [ 1860.258141] rb 3: fence:    -256/-256
>>>>>>>>> [ 1860.258141] rptr:     0
>>>>>>>>> [ 1860.258142] rb wptr:  0
>>>>>>>>> [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
>>>>>>>>> [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
>>>>>>>>> [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>>>> CP_SCRATCH_REG2: 41562
>>>>>>>>> [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
>>>>>>>>> [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>>>> CP_SCRATCH_REG4: 3736059565
>>>>>>>>> [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>>>> CP_SCRATCH_REG5: 3736059565
>>>>>>>>> [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>>>> CP_SCRATCH_REG6: 3736059565
>>>>>>>>> [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>>>> CP_SCRATCH_REG7: 3736059565
>>>>>>>>
>>>>>>>> @rob do you have any idea how to solve this crash on a740 ?
>>>>>>>
>>>>>>> The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
>>>>>>> something is unhappy about gpu pm.  I'd focus on that first, since
>>>>>>> that is almost certainly the cause of the later issues.  If things
>>>>>>> _sorta_ work (rendering UI, etc) you could try removing all but the
>>>>>>> lowest gpu OPP as an experiment.  Could be that power related problems
>>>>>>> surface when the GPU ramps up to higher OPPs.
>>>>>>
>>>>>> Things work amazingly well compared to what I was expecting. Using
>>>>>> mesa staging 26.0 as of yesterday, I'm getting roughly 80% performance
>>>>>> in the benchmarks that do run, compared to the stock Android. And
>>>>>> rendering is correct everywhere that I've seen so far. Mesa 25.3.3
>>>>>> gives about 89% compared to stock, but there are graphical glitches in
>>>>>> some of the benchmarks.
>>>>>>
>>>>>> I set gpu max_freq via devfreq to the minimum available frequency and
>>>>>> ran the failing benchmark again. It completed once, but failed with a
>>>>>> similar stack trace on the second run. And per sysfs, the gpu did stay
>>>>>> at that minimum. Of note, that causes the benchmark to fail, but
>>>>>> rendering does recover and the unit is still usable afterwards.
>>>>>
>>>>> In sm8550.dtsi, I see that ACD values are not specified in the GPU OPP
>>>>> table. Can we add those (from downstream dt) and try again?
>>>>
>>>> I don't know what I'm looking for in the downstream dt. But if such a
>>>> change gets pushed to lkml, I can grab that and verify.
>>>
>>> I took at look at the downstream dt and took a guess at importing the
>>> acd values. I'm not sure if the gpu here is the baseline kalama or
>>> kalama v2. I guessed the former. There were a couple values missing
>>> however, that I had to extrapolate based on other frequencies. This
>>> however changed nothing about my test results. Still getting crashes.
>>
>> Please use the values from kalama v2 dtsi. And if the acd property is
>> missing in any OPP node, that is a hint to the the driver+gmu-fw that
>> ACD should be kept disabled for that freq corner. So, please follow the
>> same.
> 
> Alright, I updated the change using values from the downstream v2
> dtsi. Still getting the same results. Since it's needed regardless,
> would you like me to submit the ACD patch?

Sorry for the super delayed response.

Please go ahead and post the patch.

> 
>> ACD configurations are required to meet the hw specifications. We can't
>> predict how the hw fails in case of a spec violation. I don't know if
>> this issue is ACD related, but we should ensure that all power related
>> configurations are accurate first.
>>
>> Also, could you please try the latest firmwares (sqe and gmu) from here:
>> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/qcom?id=30979b116b5c5857b72c4332db8db0ff1ca2dc08
> 
> These are what I'm already using.
> 
>>>
>>> From my perspective, this part does not appear to be a PM or frequency
>>> related issue. Some of the 3dmark benchmarks I have never seen crash.
>>> Like Wild Life Extreme. I can run the stress variant of that and it
>>> beats the unit for 20 minutes at full clocks with a screaming fan and
>>> that runs perfectly stable. Solar Bay Extreme also runs completely
>>> stable in all of its glorious 3 fps. The two problems are the standard
>>> non-extreme Solar Bay and Steel Nomad Light. Both of these
>>> intermittently crash with similar traces to what I posted before.
>>> There doesn't seem to be consistency in the faults, sometimes it will
>>> be almost immediately after starting the benchmark, other times it
>>> will get 90% through and then fail. But they virtually always fail to
>>> complete. For another point of data, I have never seen GravityMark
>>> cause a fault either.
>>
>> The peak current draw can vary between benchmarks. So we can't rule out
>> power issues. And are you able to reproduce the same issue on another
>> device?
> 
> The only relevant devices I have are two of the AYN qcs8550 devices, a
> Thor and an Odin 2 Mini. The issue happens on both, yes. But I don't
> have anything like a phone or devkit with sm8550.

I will post a few fixes in the next few days. With those, at least, a
coredump should be generated for hfi errors. Please share that.

IIRC, you are using upstream drivers with a downstream kernel (ACK?). Any
chance you can try a pure upstream kernel?

-Akhil.


> 
> Aaron


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Questions About SM8550 Support
  2026-03-10 21:33                 ` Akhil P Oommen
@ 2026-03-10 21:53                   ` Aaron Kling
  2026-03-11  8:47                     ` Konrad Dybcio
  0 siblings, 1 reply; 31+ messages in thread
From: Aaron Kling @ 2026-03-10 21:53 UTC (permalink / raw)
  To: Akhil P Oommen; +Cc: rob.clark, Neil Armstrong, linux-arm-msm

On Tue, Mar 10, 2026 at 4:33 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>
> On 2/5/2026 11:10 PM, Aaron Kling wrote:
> > On Thu, Feb 5, 2026 at 7:29 AM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
> >>
> >> On 2/5/2026 1:31 PM, Aaron Kling wrote:
> >>> On Thu, Jan 29, 2026 at 8:35 PM Aaron Kling <webgeek1234@gmail.com> wrote:
> >>>>
> >>>> On Thu, Jan 29, 2026 at 5:11 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
> >>>>>
> >>>>> On 1/28/2026 11:24 PM, Aaron Kling wrote:
> >>>>>> On Wed, Jan 28, 2026 at 8:46 AM Rob Clark <rob.clark@oss.qualcomm.com> wrote:
> >>>>>>>
> >>>>>>> On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
> >>>>>>> <neil.armstrong@linaro.org> wrote:
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> On 1/27/26 23:48, Aaron Kling wrote:
> >>>>>>>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> >>>>>>>>> for Android, using mainline kernel drivers. I have come across some
> >>>>>>>>> missing functionality and failures that I would like to inquire about.
> >>>>>>>>>
> >>>>>>>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
> >>>>>>>>> mainline. Using changes described in the gunyah watchdog thread [0], a
> >>>>>>>>> dtbo loads and the devices boot as expected. If any of the changes in
> >>>>>>>>> that post don't exist in the base dtb, abl will fail to load the dtbo
> >>>>>>>>> and go to the bootloader menu. This appears to be an issue in the
> >>>>>>>>> baseline abl code, affecting all devices of that generation. Would it
> >>>>>>>>> be allowable to merge a change adding those changes to the sm8550
> >>>>>>>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
> >>>>>>>>
> >>>>>>>> Any addition to the DT must be documented in dt-bindings, so if it's needed
> >>>>>>>> for boot they should be documented and added for sure.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
> >>>>>>>>> have locally copied the commits from sm8650 and adapted for sm8550,
> >>>>>>>>> and that seems to work okay. But no measuring of bandwidth was done,
> >>>>>>>>> so the numbers are likely not entirely correct. Is there any plan to
> >>>>>>>>> generate correct tables for sm8550?
> >>>>>>>>
> >>>>>>>> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
> >>>>>>>> is fine but since the values were calculated from downstream DT tables,
> >>>>>>>> the same should be done for sm8550.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> * As part of a series to support the original Odin 2, a patch to
> >>>>>>>>> update sm8550 EAS values was submitted [1]. But that series stalled
> >>>>>>>>> and this was never merged. If this change is valid, which per that
> >>>>>>>>> discussion it appears to be, can it be resubmitted by itself and
> >>>>>>>>> merged?
> >>>>>>>>
> >>>>>>>> I missed this patch, please re-submit, I also need to update the ones
> >>>>>>>> for SM8650.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> * Per the mainline kernel device trees and audio topology provided by
> >>>>>>>>> the oem, these devices use primary i2s for the speakers path. There
> >>>>>>>>> was a commit adding clock support for that as part of an hdmi series
> >>>>>>>>> [2], but that seems to have stalled. Is this going to be picked back
> >>>>>>>>> up?
> >>>>>>>>
> >>>>>>>> No, I do not plan to do this work, it required adding callbacks in the
> >>>>>>>> code to handle the clocks like done for the pre-audioreach firmwares.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> * Inline crypto fails to detect hwkm support. And I see other logs
> >>>>>>>>> online, such as for the sm8550 qrd, that logs the same way my device
> >>>>>>>>> does. I traced the issue to the check for wrapped key support [3]. On
> >>>>>>>>> my devices, the derive call is supported, but the other three calls
> >>>>>>>>> are not. I was pointed at the downstream headers for sm8550 support
> >>>>>>>>> and only derive is listed there, the other three don't appear to be
> >>>>>>>>> used in the downstream driver. Is this expected? And if so, will this
> >>>>>>>>> case be added to the mainline drivers?
> >>>>>>>>
> >>>>>>>> Does hwkm work if you remove the last 3 calls?
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> * Some gpu related clocks complain about being stuck off during boot,
> >>>>>>>>> causing stack traces, but the gpu does work. I tried to do some
> >>>>>>>>> research into this, but quickly got lost in the weeds and I have no
> >>>>>>>>> idea where to even look.
> >>>>>>>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> >>>>>>>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> >>>>>>>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> >>>>>>>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
> >>>>>>>>
> >>>>>>>> This may be related to the display handoff from ABL, did you add the
> >>>>>>>> plat region to the reserved memories ?
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> * Sometimes when starting rendering, a bandwidth submission times out,
> >>>>>>>>> then the driver immediately complains that said id was left on the
> >>>>>>>>> queue. I have tried increasing the timeout, but the same sequence
> >>>>>>>>> still happens. Timeout happens, immediately followed by a matching
> >>>>>>>>> unexpected response. Implying that this isn't actually a delay /
> >>>>>>>>> timeout issue.
> >>>>>>>>> [ 1848.517020] platform 3d6a000.gmu:
> >>>>>>>>> [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
> >>>>>>>>> HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
> >>>>>>>>> [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
> >>>>>>>>> *ERROR* Unexpected message id 1015 on the response queue
> >>>>>>>>
> >>>>>>>> Weird, the timeout was extended for this very purpose.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
> >>>>>>>>> unsure if this is a kernel problem or userspace, so I'm submitting
> >>>>>>>>> here first. If the consensus is that it's a userspace issue, I'll
> >>>>>>>>> submit it to mesa.
> >>>>>>>>> [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
> >>>>>>>>> fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
> >>>>>>>>> 00000001512E9000/003d ib2 00000001512E7000/0000
> >>>>>>>>> [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
> >>>>>>>>> [msm]] *ERROR* 67.5.10.1: hangcheck recover!
> >>>>>>>>> [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
> >>>>>>>>> [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
> >>>>>>>>> (com.futuremark.dmandroid.application)
> >>>>>>>>> [ 1860.258126] revision: 0 (67.5.10.1)
> >>>>>>>>> [ 1860.258132] rb 0: fence:    2884/2884
> >>>>>>>>> [ 1860.258133] rptr:     36
> >>>>>>>>> [ 1860.258134] rb wptr:  36
> >>>>>>>>> [ 1860.258135] rb 1: fence:    -256/-256
> >>>>>>>>> [ 1860.258138] rptr:     0
> >>>>>>>>> [ 1860.258138] rb wptr:  0
> >>>>>>>>> [ 1860.258139] rb 2: fence:    41563/41569
> >>>>>>>>> [ 1860.258140] rptr:     1752
> >>>>>>>>> [ 1860.258140] rb wptr:  2319
> >>>>>>>>> [ 1860.258141] rb 3: fence:    -256/-256
> >>>>>>>>> [ 1860.258141] rptr:     0
> >>>>>>>>> [ 1860.258142] rb wptr:  0
> >>>>>>>>> [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
> >>>>>>>>> [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
> >>>>>>>>> [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>>>> CP_SCRATCH_REG2: 41562
> >>>>>>>>> [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
> >>>>>>>>> [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>>>> CP_SCRATCH_REG4: 3736059565
> >>>>>>>>> [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>>>> CP_SCRATCH_REG5: 3736059565
> >>>>>>>>> [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>>>> CP_SCRATCH_REG6: 3736059565
> >>>>>>>>> [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>>>> CP_SCRATCH_REG7: 3736059565
> >>>>>>>>
> >>>>>>>> @rob do you have any idea how to solve this crash on a740 ?
> >>>>>>>
> >>>>>>> The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
> >>>>>>> something is unhappy about gpu pm.  I'd focus on that first, since
> >>>>>>> that is almost certainly the cause of the later issues.  If things
> >>>>>>> _sorta_ work (rendering UI, etc) you could try removing all but the
> >>>>>>> lowest gpu OPP as an experiment.  Could be that power related problems
> >>>>>>> surface when the GPU ramps up to higher OPPs.
> >>>>>>
> >>>>>> Things work amazingly well compared to what I was expecting. Using
> >>>>>> mesa staging 26.0 as of yesterday, I'm getting roughly 80% performance
> >>>>>> in the benchmarks that do run, compared to the stock Android. And
> >>>>>> rendering is correct everywhere that I've seen so far. Mesa 25.3.3
> >>>>>> gives about 89% compared to stock, but there are graphical glitches in
> >>>>>> some of the benchmarks.
> >>>>>>
> >>>>>> I set gpu max_freq via devfreq to the minimum available frequency and
> >>>>>> ran the failing benchmark again. It completed once, but failed with a
> >>>>>> similar stack trace on the second run. And per sysfs, the gpu did stay
> >>>>>> at that minimum. Of note, that causes the benchmark to fail, but
> >>>>>> rendering does recover and the unit is still usable afterwards.
> >>>>>
> >>>>> In sm8550.dtsi, I see that ACD values are not specified in the GPU OPP
> >>>>> table. Can we add those (from downstream dt) and try again?
> >>>>
> >>>> I don't know what I'm looking for in the downstream dt. But if such a
> >>>> change gets pushed to lkml, I can grab that and verify.
> >>>
> >>> I took a look at the downstream dt and took a guess at importing the
> >>> acd values. I'm not sure if the gpu here is the baseline kalama or
> >>> kalama v2. I guessed the former. There were a couple values missing
> >>> however, that I had to extrapolate based on other frequencies. This
> >>> however changed nothing about my test results. Still getting crashes.
> >>
> >> Please use the values from kalama v2 dtsi. And if the acd property is
> >> missing in any OPP node, that is a hint to the driver+gmu-fw that
> >> ACD should be kept disabled for that freq corner. So, please follow the
> >> same.
> >
> > Alright, I updated the change using values from the downstream v2
> > dtsi. Still getting the same results. Since it's needed regardless,
> > would you like me to submit the ACD patch?
>
> Sorry for the super delayed response.
>
> Please go ahead and post the patch.

I sent it here [0].

> >
> >> ACD configurations are required to meet the hw specifications. We can't
> >> predict how the hw fails in case of a spec violation. I don't know if
> >> this issue is ACD related, but we should ensure that all power related
> >> configurations are accurate first.
> >>
> >> Also, could you please try the latest firmwares (sqe and gmu) from here:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/qcom?id=30979b116b5c5857b72c4332db8db0ff1ca2dc08
> >
> > These are what I'm already using.
> >
> >>>
> >>> From my perspective, this part does not appear to be a PM or frequency
> >>> related issue. Some of the 3dmark benchmarks I have never seen crash.
> >>> Like Wild Life Extreme. I can run the stress variant of that and it
> >>> beats the unit for 20 minutes at full clocks with a screaming fan and
> >>> that runs perfectly stable. Solar Bay Extreme also runs completely
> >>> stable in all of its glorious 3 fps. The two problems are the standard
> >>> non-extreme Solar Bay and Steel Nomad Light. Both of these
> >>> intermittently crash with similar traces to what I posted before.
> >>> There doesn't seem to be consistency in the faults, sometimes it will
> >>> be almost immediately after starting the benchmark, other times it
> >>> will get 90% through and then fail. But they virtually always fail to
> >>> complete. For another point of data, I have never seen GravityMark
> >>> cause a fault either.
> >>
> >> The peak current draw can vary between benchmarks. So we can't rule out
> >> power issues. And are you able to reproduce the same issue on another
> >> device?
> >
> > The only relevant devices I have are two of the AYN qcs8550 devices, a
> > Thor and an Odin 2 Mini. The issue happens on both, yes. But I don't
> > have anything like a phone or devkit with sm8550.
>
> I will post a few fixes in the next few days. At least, with that there
> should be a coredump generated for hfi errors. Please share that.

I posted an issue on the mesa tracker here [1] and attached some
devcoredumps to one of my replies. I can add more when the new patches
are available.
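
For anyone else wanting to gather the same data, here is a minimal
sketch of copying devcoredumps off a device. It assumes the standard
devcoredump class layout (/sys/class/devcoredump/devcdN/data), root
access, and a Python interpreter on the target, none of which is a
given on Android. The function name and output directory are made up
for illustration.

```python
import pathlib
import shutil


def collect_devcoredumps(sysfs_root="/sys/class/devcoredump", out_dir="coredumps"):
    """Copy every devcdN/data file under sysfs_root into out_dir.

    Returns the list of files written. The kernel drops a dump after a
    timeout or once something is written to its data file, so copy
    promptly. This sketch only reads; it never clears a dump.
    """
    root = pathlib.Path(sysfs_root)
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    if not root.is_dir():
        return saved
    for entry in sorted(root.glob("devcd*")):
        data = entry / "data"
        if not data.is_file():
            continue
        target = out / f"{entry.name}.bin"
        with data.open("rb") as src, target.open("wb") as dst:
            shutil.copyfileobj(src, dst)
        saved.append(target)
    return saved
```

Calling collect_devcoredumps() right after a fault should leave one
devcdN.bin per pending dump in ./coredumps.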

> iirc, you are using upstream drivers with a downstream kernel (ACK?). Any
> chance you can try a pure upstream kernel?

Yes, that is correct. My current setup is ACK 6.18.13. I unfortunately
do not have a pure Linux setup. If I had uart access on these devices,
I could use the minimal busybox setup like I do for tegra, but I do
not have such access here. As far as I can tell, no closed case debug
setup is available either. Google does have a mainline tracking branch
which I could use to get closer to -next for verification, but it's
still not unmodified upstream.

Aaron

[0] https://lore.kernel.org/linux-arm-msm/20260207-sm8550-acd-v1-1-53d084c58c9a@gmail.com/
[1] https://gitlab.freedesktop.org/mesa/mesa/-/issues/14919


* Re: Questions About SM8550 Support
  2026-03-10 21:53                   ` Aaron Kling
@ 2026-03-11  8:47                     ` Konrad Dybcio
  2026-03-11 23:33                       ` Aaron Kling
  0 siblings, 1 reply; 31+ messages in thread
From: Konrad Dybcio @ 2026-03-11  8:47 UTC (permalink / raw)
  To: Aaron Kling, Akhil P Oommen; +Cc: rob.clark, Neil Armstrong, linux-arm-msm

On 3/10/26 10:53 PM, Aaron Kling wrote:
> On Tue, Mar 10, 2026 at 4:33 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>>
>> On 2/5/2026 11:10 PM, Aaron Kling wrote:
>>> On Thu, Feb 5, 2026 at 7:29 AM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>>>>
>>>> On 2/5/2026 1:31 PM, Aaron Kling wrote:
>>>>> On Thu, Jan 29, 2026 at 8:35 PM Aaron Kling <webgeek1234@gmail.com> wrote:
>>>>>>
>>>>>> On Thu, Jan 29, 2026 at 5:11 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
>>>>>>>
>>>>>>> On 1/28/2026 11:24 PM, Aaron Kling wrote:
>>>>>>>> On Wed, Jan 28, 2026 at 8:46 AM Rob Clark <rob.clark@oss.qualcomm.com> wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
>>>>>>>>> <neil.armstrong@linaro.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 1/27/26 23:48, Aaron Kling wrote:
>>>>>>>>>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
>>>>>>>>>>> for Android, using mainline kernel drivers. I have come across some
>>>>>>>>>>> missing functionality and failures that I would like to inquire about.
>>>>>>>>>>>
>>>>>>>>>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
>>>>>>>>>>> mainline. Using changes described in the gunyah watchdog thread [0], a
>>>>>>>>>>> dtbo loads and the devices boot as expected. If any of the changes in
>>>>>>>>>>> that post don't exist in the base dtb, abl will fail to load the dtbo
>>>>>>>>>>> and go to the bootloader menu. This appears to be an issue in the
>>>>>>>>>>> baseline abl code, affecting all devices of that generation. Would it
>>>>>>>>>>> be allowable to merge a change adding those changes to the sm8550
>>>>>>>>>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
>>>>>>>>>>
>>>>>>>>>> Any addition to the DT must be documented in dt-bindings, so if it's needed
>>>>>>>>>> for boot they should be documented and added for sure.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
>>>>>>>>>>> have locally copied the commits from sm8650 and adapted for sm8550,
>>>>>>>>>>> and that seems to work okay. But no measuring of bandwidth was done,
>>>>>>>>>>> so the numbers are likely not entirely correct. Is there any plan to
>>>>>>>>>>> generate correct tables for sm8550?
>>>>>>>>>>
>>>>>>>>>> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
>>>>>>>>>> is fine but since the values were calculated from downstream DT tables,
>>>>>>>>>> the same should be done for sm8550.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> * As part of a series to support the original Odin 2, a patch to
>>>>>>>>>>> update sm8550 EAS values was submitted [1]. But that series stalled
>>>>>>>>>>> and this was never merged. If this change is valid, which per that
>>>>>>>>>>> discussion it appears to be, can it be resubmitted by itself and
>>>>>>>>>>> merged?
>>>>>>>>>>
>>>>>>>>>> I missed this patch, please re-submit, I also need to update the ones
>>>>>>>>>> for SM8650.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> * Per the mainline kernel device trees and audio topology provided by
>>>>>>>>>>> the oem, these devices use primary i2s for the speakers path. There
>>>>>>>>>>> was a commit adding clock support for that as part of an hdmi series
>>>>>>>>>>> [2], but that seems to have stalled. Is this going to be picked back
>>>>>>>>>>> up?
>>>>>>>>>>
>>>>>>>>>> No, I do not plan to do this work, it required adding callbacks in the
>>>>>>>>>> code to handle the clocks like done for the pre-audioreach firmwares.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> * Inline crypto fails to detect hwkm support. And I see other logs
>>>>>>>>>>> online, such as for the sm8550 qrd, that logs the same way my device
>>>>>>>>>>> does. I traced the issue to the check for wrapped key support [3]. On
>>>>>>>>>>> my devices, the derive call is supported, but the other three calls
>>>>>>>>>>> are not. I was pointed at the downstream headers for sm8550 support
>>>>>>>>>>> and only derive is listed there, the other three don't appear to be
>>>>>>>>>>> used in the downstream driver. Is this expected? And if so, will this
>>>>>>>>>>> case be added to the mainline drivers?
>>>>>>>>>>
>>>>>>>>>> Does hwkm work if you remove the last 3 calls?
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> * Some gpu related clocks complain about being stuck off during boot,
>>>>>>>>>>> causing stack traces, but the gpu does work. I tried to do some
>>>>>>>>>>> research into this, but quickly got lost in the weeds and I have no
>>>>>>>>>>> idea where to even look.
>>>>>>>>>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
>>>>>>>>>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
>>>>>>>>>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
>>>>>>>>>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
>>>>>>>>>>
>>>>>>>>>> This may be related to the display handoff from ABL, did you add the
>>>>>>>>>> plat region to the reserved memories ?
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> * Sometimes when starting rendering, a bandwidth submission times out,
>>>>>>>>>>> then the driver immediately complains that said id was left on the
>>>>>>>>>>> queue. I have tried increasing the timeout, but the same sequence
>>>>>>>>>>> still happens. Timeout happens, immediately followed by a matching
>>>>>>>>>>> unexpected response. Implying that this isn't actually a delay /
>>>>>>>>>>> timeout issue.
>>>>>>>>>>> [ 1848.517020] platform 3d6a000.gmu:
>>>>>>>>>>> [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
>>>>>>>>>>> HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
>>>>>>>>>>> [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
>>>>>>>>>>> *ERROR* Unexpected message id 1015 on the response queue
>>>>>>>>>>
>>>>>>>>>> Weird, the timeout was extended for this very purpose.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
>>>>>>>>>>> unsure if this is a kernel problem or userspace, so I'm submitting
>>>>>>>>>>> here first. If the consensus is that it's a userspace issue, I'll
>>>>>>>>>>> submit it to mesa.
>>>>>>>>>>> [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
>>>>>>>>>>> fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
>>>>>>>>>>> 00000001512E9000/003d ib2 00000001512E7000/0000
>>>>>>>>>>> [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
>>>>>>>>>>> [msm]] *ERROR* 67.5.10.1: hangcheck recover!
>>>>>>>>>>> [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
>>>>>>>>>>> [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
>>>>>>>>>>> (com.futuremark.dmandroid.application)
>>>>>>>>>>> [ 1860.258126] revision: 0 (67.5.10.1)
>>>>>>>>>>> [ 1860.258132] rb 0: fence:    2884/2884
>>>>>>>>>>> [ 1860.258133] rptr:     36
>>>>>>>>>>> [ 1860.258134] rb wptr:  36
>>>>>>>>>>> [ 1860.258135] rb 1: fence:    -256/-256
>>>>>>>>>>> [ 1860.258138] rptr:     0
>>>>>>>>>>> [ 1860.258138] rb wptr:  0
>>>>>>>>>>> [ 1860.258139] rb 2: fence:    41563/41569
>>>>>>>>>>> [ 1860.258140] rptr:     1752
>>>>>>>>>>> [ 1860.258140] rb wptr:  2319
>>>>>>>>>>> [ 1860.258141] rb 3: fence:    -256/-256
>>>>>>>>>>> [ 1860.258141] rptr:     0
>>>>>>>>>>> [ 1860.258142] rb wptr:  0
>>>>>>>>>>> [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
>>>>>>>>>>> [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
>>>>>>>>>>> [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>>>>>> CP_SCRATCH_REG2: 41562
>>>>>>>>>>> [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
>>>>>>>>>>> [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>>>>>> CP_SCRATCH_REG4: 3736059565
>>>>>>>>>>> [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>>>>>> CP_SCRATCH_REG5: 3736059565
>>>>>>>>>>> [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>>>>>> CP_SCRATCH_REG6: 3736059565
>>>>>>>>>>> [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
>>>>>>>>>>> CP_SCRATCH_REG7: 3736059565
>>>>>>>>>>
>>>>>>>>>> @rob do you have any idea how to solve this crash on a740 ?
>>>>>>>>>
>>>>>>>>> The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
>>>>>>>>> something is unhappy about gpu pm.  I'd focus on that first, since
>>>>>>>>> that is almost certainly the cause of the later issues.  If things
>>>>>>>>> _sorta_ work (rendering UI, etc) you could try removing all but the
>>>>>>>>> lowest gpu OPP as an experiment.  Could be that power related problems
>>>>>>>>> surface when the GPU ramps up to higher OPPs.
>>>>>>>>
>>>>>>>> Things work amazingly well compared to what I was expecting. Using
>>>>>>>> mesa staging 26.0 as of yesterday, I'm getting roughly 80% performance
>>>>>>>> in the benchmarks that do run, compared to the stock Android. And
>>>>>>>> rendering is correct everywhere that I've seen so far. Mesa 25.3.3
>>>>>>>> gives about 89% compared to stock, but there are graphical glitches in
>>>>>>>> some of the benchmarks.
>>>>>>>>
>>>>>>>> I set gpu max_freq via devfreq to the minimum available frequency and
>>>>>>>> ran the failing benchmark again. It completed once, but failed with a
>>>>>>>> similar stack trace on the second run. And per sysfs, the gpu did stay
>>>>>>>> at that minimum. Of note, that causes the benchmark to fail, but
>>>>>>>> rendering does recover and the unit is still usable afterwards.
>>>>>>>
>>>>>>> In sm8550.dtsi, I see that ACD values are not specified in the GPU OPP
>>>>>>> table. Can we add those (from downstream dt) and try again?
>>>>>>
>>>>>> I don't know what I'm looking for in the downstream dt. But if such a
>>>>>> change gets pushed to lkml, I can grab that and verify.
>>>>>
>>>>> I took a look at the downstream dt and took a guess at importing the
>>>>> acd values. I'm not sure if the gpu here is the baseline kalama or
>>>>> kalama v2. I guessed the former. There were a couple values missing
>>>>> however, that I had to extrapolate based on other frequencies. This
>>>>> however changed nothing about my test results. Still getting crashes.
>>>>
>>>> Please use the values from kalama v2 dtsi. And if the acd property is
>>>> missing in any OPP node, that is a hint to the driver+gmu-fw that
>>>> ACD should be kept disabled for that freq corner. So, please follow the
>>>> same.
>>>
>>> Alright, I updated the change using values from the downstream v2
>>> dtsi. Still getting the same results. Since it's needed regardless,
>>> would you like me to submit the ACD patch?
>>
>> Sorry for the super delayed response.
>>
>> Please go ahead and post the patch.
> 
> I sent it here [0].
> 
>>>
>>>> ACD configurations are required to meet the hw specifications. We can't
>>>> predict how the hw fails in case of a spec violation. I don't know if
>>>> this issue is ACD related, but we should ensure that all power related
>>>> configurations are accurate first.
>>>>
>>>> Also, could you please try the latest firmwares (sqe and gmu) from here:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/qcom?id=30979b116b5c5857b72c4332db8db0ff1ca2dc08
>>>
>>> These are what I'm already using.
>>>
>>>>>
>>>>> From my perspective, this part does not appear to be a PM or frequency
>>>>> related issue. Some of the 3dmark benchmarks I have never seen crash.
>>>>> Like Wild Life Extreme. I can run the stress variant of that and it
>>>>> beats the unit for 20 minutes at full clocks with a screaming fan and
>>>>> that runs perfectly stable. Solar Bay Extreme also runs completely
>>>>> stable in all of its glorious 3 fps. The two problems are the standard
>>>>> non-extreme Solar Bay and Steel Nomad Light. Both of these
>>>>> intermittently crash with similar traces to what I posted before.
>>>>> There doesn't seem to be consistency in the faults, sometimes it will
>>>>> be almost immediately after starting the benchmark, other times it
>>>>> will get 90% through and then fail. But they virtually always fail to
>>>>> complete. For another point of data, I have never seen GravityMark
>>>>> cause a fault either.
>>>>
>>>> The peak current draw can vary between benchmarks. So we can't rule out
>>>> power issues. And are you able to reproduce the same issue on another
>>>> device?
>>>
>>> The only relevant devices I have are two of the AYN qcs8550 devices, a
>>> Thor and an Odin 2 Mini. The issue happens on both, yes. But I don't
>>> have anything like a phone or devkit with sm8550.
>>
>> I will post a few fixes in the next few days. At least, with that there
>> should be a coredump generated for hfi errors. Please share that.
> 
> I posted an issue on the mesa tracker here [1] and attached some
> devcoredumps to one of my replies. I can add more when the new patches
> are available.
> 
>> iirc, you are using upstream drivers with a downstream kernel (ACK?). Any
>> chance you can try a pure upstream kernel?
> 
> Yes, that is correct. My current setup is ACK 6.18.13. I unfortunately
> do not have a pure Linux setup. If I had uart access on these devices,
> I could use the minimal busybox setup like I do for tegra, but I do
> not have such access here. As far as I can tell, no closed case debug
> setup is available either. Google does have a mainline tracking branch
> which I could use to get closer to -next for verification, but it's
> still not unmodified upstream.

FWIW, you can run AOSP on pure upstream, perhaps not with all the features,
but you can. Try copying the config from ACK and give it a spin.
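
If a straight copy turns out too noisy, it may help to see first which
options ACK sets that the upstream config does not. A rough sketch,
assuming both files are ordinary .config text. The helper names are
hypothetical and this does not replace a proper olddefconfig pass.

```python
import re

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each CONFIG_ symbol in .config text to its value ('n' if 'is not set')."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = SET_RE.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = UNSET_RE.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def config_diff(ack_text, upstream_text):
    """Return {symbol: (upstream_value, ack_value)} for symbols where ACK differs.

    Symbols missing from the upstream config are treated as 'n'.
    """
    ack = parse_config(ack_text)
    upstream = parse_config(upstream_text)
    return {
        name: (upstream.get(name, "n"), value)
        for name, value in sorted(ack.items())
        if upstream.get(name, "n") != value
    }
```

The resulting dict gives a short list of symbols to review before
building the upstream tree with the ACK settings.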

Konrad


* Re: Questions About SM8550 Support
  2026-03-11  8:47                     ` Konrad Dybcio
@ 2026-03-11 23:33                       ` Aaron Kling
  0 siblings, 0 replies; 31+ messages in thread
From: Aaron Kling @ 2026-03-11 23:33 UTC (permalink / raw)
  To: Konrad Dybcio; +Cc: Akhil P Oommen, rob.clark, Neil Armstrong, linux-arm-msm

On Wed, Mar 11, 2026 at 3:47 AM Konrad Dybcio
<konrad.dybcio@oss.qualcomm.com> wrote:
>
> On 3/10/26 10:53 PM, Aaron Kling wrote:
> > On Tue, Mar 10, 2026 at 4:33 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
> >>
> >> On 2/5/2026 11:10 PM, Aaron Kling wrote:
> >>> On Thu, Feb 5, 2026 at 7:29 AM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
> >>>>
> >>>> On 2/5/2026 1:31 PM, Aaron Kling wrote:
> >>>>> On Thu, Jan 29, 2026 at 8:35 PM Aaron Kling <webgeek1234@gmail.com> wrote:
> >>>>>>
> >>>>>> On Thu, Jan 29, 2026 at 5:11 PM Akhil P Oommen <akhilpo@oss.qualcomm.com> wrote:
> >>>>>>>
> >>>>>>> On 1/28/2026 11:24 PM, Aaron Kling wrote:
> >>>>>>>> On Wed, Jan 28, 2026 at 8:46 AM Rob Clark <rob.clark@oss.qualcomm.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On Wed, Jan 28, 2026 at 12:54 AM Neil Armstrong
> >>>>>>>>> <neil.armstrong@linaro.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> On 1/27/26 23:48, Aaron Kling wrote:
> >>>>>>>>>>> I am working on the AYN Odin 2 qcs8550 series of devices, specifically
> >>>>>>>>>>> for Android, using mainline kernel drivers. I have come across some
> >>>>>>>>>>> missing functionality and failures that I would like to inquire about.
> >>>>>>>>>>>
> >>>>>>>>>>> * ABL fails to load a dtbo using a baseline dtb unmodified from
> >>>>>>>>>>> mainline. Using changes described in the gunyah watchdog thread [0], a
> >>>>>>>>>>> dtbo loads and the devices boot as expected. If any of the changes in
> >>>>>>>>>>> that post don't exist in the base dtb, abl will fail to load the dtbo
> >>>>>>>>>>> and go to the bootloader menu. This appears to be an issue in the
> >>>>>>>>>>> baseline abl code, affecting all devices of that generation. Would it
> >>>>>>>>>>> be allowable to merge a change adding those changes to the sm8550
> >>>>>>>>>>> dtsi, allowing an unmodified mainline dtb to work with overlays?
> >>>>>>>>>>
> >>>>>>>>>> Any addition to the DT must be documented in dt-bindings, so if it's needed
> >>>>>>>>>> for boot they should be documented and added for sure.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> * SM8550 does not have cpu opp tables, thus cpufreq does not work. I
> >>>>>>>>>>> have locally copied the commits from sm8650 and adapted for sm8550,
> >>>>>>>>>>> and that seems to work okay. But no measuring of bandwidth was done,
> >>>>>>>>>>> so the numbers are likely not entirely correct. Is there any plan to
> >>>>>>>>>>> generate correct tables for sm8550?
> >>>>>>>>>>
> >>>>>>>>>> Cpufreq works but not the interconnect scaling, so doing the same as sm8650
> >>>>>>>>>> is fine but since the values were calculated from downstream DT tables,
> >>>>>>>>>> the same should be done for sm8550.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> * As part of a series to support the original Odin 2, a patch to
> >>>>>>>>>>> update sm8550 EAS values was submitted [1]. But that series stalled
> >>>>>>>>>>> and this was never merged. If this change is valid, which per that
> >>>>>>>>>>> discussion it appears to be, can it be resubmitted by itself and
> >>>>>>>>>>> merged?
> >>>>>>>>>>
> >>>>>>>>>> I missed this patch, please re-submit, I also need to update the ones
> >>>>>>>>>> for SM8650.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> * Per the mainline kernel device trees and audio topology provided by
> >>>>>>>>>>> the oem, these devices use primary i2s for the speakers path. There
> >>>>>>>>>>> was a commit adding clock support for that as part of an hdmi series
> >>>>>>>>>>> [2], but that seems to have stalled. Is this going to be picked back
> >>>>>>>>>>> up?
> >>>>>>>>>>
> >>>>>>>>>> No, I do not plan to do this work, it required adding callbacks in the
> >>>>>>>>>> code to handle the clocks like done for the pre-audioreach firmwares.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> * Inline crypto fails to detect hwkm support. And I see other logs
> >>>>>>>>>>> online, such as for the sm8550 qrd, that logs the same way my device
> >>>>>>>>>>> does. I traced the issue to the check for wrapped key support [3]. On
> >>>>>>>>>>> my devices, the derive call is supported, but the other three calls
> >>>>>>>>>>> are not. I was pointed at the downstream headers for sm8550 support
> >>>>>>>>>>> and only derive is listed there, the other three don't appear to be
> >>>>>>>>>>> used in the downstream driver. Is this expected? And if so, will this
> >>>>>>>>>>> case be added to the mainline drivers?
> >>>>>>>>>>
> >>>>>>>>>> Does hwkm work if you remove the last 3 calls?
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> * Some gpu related clocks complain about being stuck off during boot,
> >>>>>>>>>>> causing stack traces, but the gpu does work. I tried to do some
> >>>>>>>>>>> research into this, but quickly got lost in the weeds and I have no
> >>>>>>>>>>> idea where to even look.
> >>>>>>>>>>> [    0.367278] gpu_cc_cxo_clk status stuck at 'off'
> >>>>>>>>>>> [    0.367962] gpu_cc_hub_cx_int_clk status stuck at 'off'
> >>>>>>>>>>> [    0.368595] gpu_cc_cx_gmu_clk status stuck at 'off'
> >>>>>>>>>>> [    0.369245] disp_cc_mdss_ahb1_clk status stuck at 'off'
> >>>>>>>>>>
> >>>>>>>>>> This may be related with the display handoff from ABL, did you add the
> >>>>>>>>>> plat region to the reserved memories ?
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> * Sometimes when starting rendering, a bandwidth submission times out,
> >>>>>>>>>>> then the driver immediately complains that said id was left on the
> >>>>>>>>>>> queue. I have tried increasing the timeout, but the same sequence
> >>>>>>>>>>> still happens. Timeout happens, immediately followed by a matching
> >>>>>>>>>>> unexpected response. Implying that this isn't actually a delay /
> >>>>>>>>>>> timeout issue.
> >>>>>>>>>>> [ 1848.517020] platform 3d6a000.gmu:
> >>>>>>>>>>> [drm:a6xx_hfi_wait_for_msg_interrupt [msm]] *ERROR* Message
> >>>>>>>>>>> HFI_H2F_MSG_GX_BW_PERF_VOTE id 1015 timed out waiting for response
> >>>>>>>>>>> [ 1848.518020] platform 3d6a000.gmu: [drm:a6xx_hfi_send_msg [msm]]
> >>>>>>>>>>> *ERROR* Unexpected message id 1015 on the response queue
> >>>>>>>>>>
> >>>>>>>>>> Weird, the timeout was extended for this very purpose.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> * Some 3dmark benchmarks such as solar bay cause a gpu crash. I am
> >>>>>>>>>>> unsure if this is a kernel problem or userspace, so I'm submitting
> >>>>>>>>>>> here first. If the consensus is that it's a userspace issue, I'll
> >>>>>>>>>>> submit it to mesa.
> >>>>>>>>>>> [ 1860.112008] adreno 3d00000.gpu: [drm:a6xx_irq [msm]] *ERROR* gpu
> >>>>>>>>>>> fault ring 2 fence a261 status 00EF0585 rb 06df/090f ib1
> >>>>>>>>>>> 00000001512E9000/003d ib2 00000001512E7000/0000
> >>>>>>>>>>> [ 1860.113122] msm_dpu ae01000.display-controller: [drm:recover_worker
> >>>>>>>>>>> [msm]] *ERROR* 67.5.10.1: hangcheck recover!
> >>>>>>>>>>> [ 1860.113238] msm_dpu ae01000.display-controller: [drm:recover_worker
> >>>>>>>>>>> [msm]] *ERROR* 67.5.10.1: offending task: Thread-23
> >>>>>>>>>>> (com.futuremark.dmandroid.application)
> >>>>>>>>>>> [ 1860.258126] revision: 0 (67.5.10.1)
> >>>>>>>>>>> [ 1860.258132] rb 0: fence:    2884/2884
> >>>>>>>>>>> [ 1860.258133] rptr:     36
> >>>>>>>>>>> [ 1860.258134] rb wptr:  36
> >>>>>>>>>>> [ 1860.258135] rb 1: fence:    -256/-256
> >>>>>>>>>>> [ 1860.258138] rptr:     0
> >>>>>>>>>>> [ 1860.258138] rb wptr:  0
> >>>>>>>>>>> [ 1860.258139] rb 2: fence:    41563/41569
> >>>>>>>>>>> [ 1860.258140] rptr:     1752
> >>>>>>>>>>> [ 1860.258140] rb wptr:  2319
> >>>>>>>>>>> [ 1860.258141] rb 3: fence:    -256/-256
> >>>>>>>>>>> [ 1860.258141] rptr:     0
> >>>>>>>>>>> [ 1860.258142] rb wptr:  0
> >>>>>>>>>>> [ 1860.258146] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG0: 0
> >>>>>>>>>>> [ 1860.258220] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG1: 0
> >>>>>>>>>>> [ 1860.258266] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>>>>>> CP_SCRATCH_REG2: 41562
> >>>>>>>>>>> [ 1860.258310] adreno 3d00000.gpu: [drm:a6xx_recover [msm]] CP_SCRATCH_REG3: 0
> >>>>>>>>>>> [ 1860.258354] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>>>>>> CP_SCRATCH_REG4: 3736059565
> >>>>>>>>>>> [ 1860.258399] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>>>>>> CP_SCRATCH_REG5: 3736059565
> >>>>>>>>>>> [ 1860.258443] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>>>>>> CP_SCRATCH_REG6: 3736059565
> >>>>>>>>>>> [ 1860.258487] adreno 3d00000.gpu: [drm:a6xx_recover [msm]]
> >>>>>>>>>>> CP_SCRATCH_REG7: 3736059565
> >>>>>>>>>>
> >>>>>>>>>> @rob do you have any idea how to solve this crash on a740 ?
> >>>>>>>>>
> >>>>>>>>> The clk and a6xx_hfi_wait_for_msg_interrupt errors indicate that
> >>>>>>>>> something is unhappy about gpu pm.  I'd focus on that first, since
> >>>>>>>>> that is almost certainly the cause of the later issues.  If things
> >>>>>>>>> _sorta_ work (rendering UI, etc) you could try removing all but the
> >>>>>>>>> lowest gpu OPP as an experiment.  Could be that power related problems
> >>>>>>>>> surface when the GPU ramps up to higher OPPs.
> >>>>>>>>
> >>>>>>>> Things work amazingly well compared to what I was expecting. Using
> >>>>>>>> mesa staging 26.0 as of yesterday, I'm getting roughly 80% performance
> >>>>>>>> in the benchmarks that do run, compared to the stock Android. And
> >>>>>>>> rendering is correct everywhere that I've seen so far. Mesa 25.3.3
> >>>>>>>> gives about 89% compared to stock, but there are graphical glitches in
> >>>>>>>> some of the benchmarks.
> >>>>>>>>
> >>>>>>>> I set gpu max_freq via devfreq to the minimum available frequency and
> >>>>>>>> ran the failing benchmark again. It completed once, but failed with a
> >>>>>>>> similar stack trace on the second run. And per sysfs, the gpu did stay
> >>>>>>>> at that minimum. Of note, that causes the benchmark to fail, but
> >>>>>>>> rendering does recover and the unit is still usable afterwards.
> >>>>>>>
> >>>>>>> In sm8550.dtsi, I see that ACD values are not specified in the GPU OPP
> >>>>>>> table. Can we add those (from downstream dt) and try again?
> >>>>>>
> >>>>>> I don't know what I'm looking for in the downstream dt. But if such a
> >>>>>> change gets pushed to lkml, I can grab that and verify.
> >>>>>
> >>>>> I took a look at the downstream dt and took a guess at importing the
> >>>>> acd values. I'm not sure if the gpu here is the baseline kalama or
> >>>>> kalama v2. I guessed the former. There were a couple values missing
> >>>>> however, that I had to extrapolate based on other frequencies. This
> >>>>> however changed nothing about my test results. Still getting crashes.
> >>>>
> >>>> Please use the values from kalama v2 dtsi. And if the acd property is
> >>>> missing in any OPP node, that is a hint to the driver+gmu-fw that
> >>>> ACD should be kept disabled for that freq corner. So, please follow the
> >>>> same.
> >>>
> >>> Alright, I updated the change using values from the downstream v2
> >>> dtsi. Still getting the same results. Since it's needed regardless,
> >>> would you like me to submit the ACD patch?
> >>
> >> Sorry for the super delayed response.
> >>
> >> Please go ahead and post the patch.
> >
> > I sent it here [0].
> >
> >>>
> >>>> ACD configurations are required to meet the hw specifications. We can't
> >>>> predict how the hw fails in case of a spec violation. I don't know if
> >>>> this issue is ACD related, but we should ensure that all power related
> >>>> configurations are accurate first.
> >>>>
> >>>> Also, could you please try the latest firmwares (sqe and gmu) from here:
> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/qcom?id=30979b116b5c5857b72c4332db8db0ff1ca2dc08
> >>>
> >>> These are what I'm already using.
> >>>
> >>>>>
> >>>>> From my perspective, this part does not appear to be a PM or frequency
> >>>>> related issue. Some of the 3dmark benchmarks I have never seen crash.
> >>>>> Like Wild Life Extreme. I can run the stress variant of that and it
> >>>>> beats the unit for 20 minutes at full clocks with a screaming fan and
> >>>>> that runs perfectly stable. Solar Bay Extreme also runs completely
> >>>>> stable in all of its glorious 3 fps. The two problems are the standard
> >>>>> non-extreme Solar Bay and Steel Nomad Light. Both of these
> >>>>> intermittently crash with similar traces to what I posted before.
> >>>>> There doesn't seem to be consistency in the faults, sometimes it will
> >>>>> be almost immediately after starting the benchmark, other times it
> >>>>> will get 90% through and then fail. But they virtually always fail to
> >>>>> complete. For another point of data, I have never seen GravityMark
> >>>>> cause a fault either.
> >>>>
> >>>> The peak current draw can vary between benchmarks. So we can't rule out
> >>>> power issues. And are you able to reproduce the same issue on another
> >>>> device?
> >>>
> >>> The only relevant devices I have are two of the AYN qcs8550 devices, a
> >>> Thor and an Odin 2 Mini. The issue happens on both, yes. But I don't
> >>> have anything like a phone or devkit with sm8550.
> >>
> >> I will post a few fixes in the next few days. At least, with that there
> >> should be a coredump generated for hfi errors. Please share that.
> >
> > I posted an issue on the mesa tracker here [1] and attached some
> > devcoredumps to one of my replies. I can add more when the new patches
> > are available.
> >
> >> iirc, you are using upstream drivers with downstream kernel (ACK?). Any
> >> chance you can try pure upstream kernel?
> >
> > Yes, that is correct. My current setup is ACK 6.18.13. I unfortunately
> > do not have a pure Linux setup. If I had uart access on these devices,
> > I could use the minimal busybox setup like I do for tegra, but I do
> > not have such access here. As far as I can tell, no closed case debug
> > setup is available either. Google does have a mainline tracking branch
> > which I could use to get closer to -next for verification, but it's
> > still not unmodified upstream.
>
> FWIW you can run AOSP on pure upstream, perhaps not with all the features,
> but you can. Try copying the config from ACK and give it a spin

I spent some time this afternoon trying to jam today's linux-next into
my build setup, aka Google's bazel kernel-platform, and did get it to
build. However, it doesn't boot, no display and no adb. Tried
android-mainline and... that also doesn't boot. Yay. I did have a
version of that booting before the ACK 6.18 fork. My extrapolation is
that either something broke the entire soc again, like that ifpc
thing in 6.18 that took me a week to figure out (and only then found a
patch had already been submitted), or something on that level. And the
only way I've been able to debug without uart or in-ram pstore is
fudgery with pstore-blk and swapping boot slots, and I'd really prefer
not to do that again. I will try to see if I can bisect
android-mainline to track down the issue, then apply the findings to
-next.

Aaron


end of thread, other threads:[~2026-03-11 23:33 UTC | newest]

Thread overview: 31+ messages
2026-01-27 22:48 Questions About SM8550 Support Aaron Kling
2026-01-28  8:50 ` Neil Armstrong
2026-01-28 14:46   ` Rob Clark
2026-01-28 17:54     ` Aaron Kling
2026-01-29 23:11       ` Akhil P Oommen
2026-01-30  2:35         ` Aaron Kling
2026-02-05  8:01           ` Aaron Kling
2026-02-05 10:54             ` Konrad Dybcio
2026-02-05 13:29             ` Akhil P Oommen
2026-02-05 17:40               ` Aaron Kling
2026-03-10 21:33                 ` Akhil P Oommen
2026-03-10 21:53                   ` Aaron Kling
2026-03-11  8:47                     ` Konrad Dybcio
2026-03-11 23:33                       ` Aaron Kling
2026-02-05 14:43             ` Dmitry Baryshkov
2026-01-28 18:42   ` Aaron Kling
2026-02-06 15:04     ` Neil Armstrong
2026-01-28 14:03 ` Konrad Dybcio
2026-01-28 18:20   ` Aaron Kling
2026-02-02  9:35   ` Taniya Das
2026-02-02 23:01     ` Aaron Kling
2026-02-03  6:34       ` Jagadeesh Kona
2026-02-03 23:21         ` Aaron Kling
2026-02-04 16:53           ` Taniya Das
2026-02-04 18:18             ` Aaron Kling
2026-01-30 11:01 ` Konrad Dybcio
2026-01-30 17:13   ` Aaron Kling
2026-02-02 10:36     ` Konrad Dybcio
2026-02-02 23:12       ` Aaron Kling
2026-02-03 10:31         ` Konrad Dybcio
2026-02-03 17:31           ` Aaron Kling
