[BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
@ 2025-01-24 14:52 Emanuele Rocca
  2025-01-30 11:43 ` Will Deacon
  2025-01-30 12:19 ` Mark Rutland
  0 siblings, 2 replies; 15+ messages in thread
From: Emanuele Rocca @ 2025-01-24 14:52 UTC (permalink / raw)
  To: linux-arm-kernel; +Cc: Mark Rutland

Hi,
longterm kernel 6.1.123 crashes early when booting on the Lenovo Thinkpad X13s
with the following error:

 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000264

 pc: arm_smccc_version_init+0x90/0x1ac

According to faddr2line, that is line 31 of smccc.c:

 arm_smccc_version_init+0x90/0x1ac:
 arm_smccc_version_init at debian/build/build_arm64_none_arm64/drivers/firmware/smccc/smccc.c:31

 22 void __init arm_smccc_version_init(u32 version, enum arm_smccc_conduit conduit)
 23 {
 24         struct arm_smccc_res res;
 25 
 26         smccc_version = version;
 27         smccc_conduit = conduit;
 28 
 29         smccc_trng_available = smccc_probe_trng();
 30 
 31         if ((smccc_version >= ARM_SMCCC_VERSION_1_2) &&

This is with kernel 6.1.123. The last known good kernel I have available right
now is 6.1.119. In the 6.1.120 changelog I see the following commit which seems
potentially related?

 https://lore.kernel.org/all/20241106160448.2712997-1-mark.rutland@arm.com/

That's stable commit [1].

The relevant upstream commit [2] is in linux 6.12, and that kernel version does
not crash. Comparing [1] vs [2] I see differences, but I can't tell if they can
help debug the issue further. 

Thanks,
  ema

[1] https://git.kernel.org/linus/bfcaffd4cc2d61ecb0571c5baf127c4089978ad4
[2] https://git.kernel.org/linus/8c462d56487e3abdbf8a61cedfe7c795a54f4a78


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-01-24 14:52 [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac Emanuele Rocca
@ 2025-01-30 11:43 ` Will Deacon
  2025-02-05 10:35   ` Emanuele Rocca
  2025-01-30 12:19 ` Mark Rutland
  1 sibling, 1 reply; 15+ messages in thread
From: Will Deacon @ 2025-01-30 11:43 UTC (permalink / raw)
  To: Emanuele Rocca; +Cc: linux-arm-kernel, Mark Rutland

On Fri, Jan 24, 2025 at 03:52:10PM +0100, Emanuele Rocca wrote:
> longterm kernel 6.1.123 crashes early when booting on the Lenovo Thinkpad X13s
> with the following error:
> 
>  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000264
> 
>  pc: arm_smccc_version_init+0x90/0x1ac

Please can you share the full crash log?

> According to faddr2line, that is line 31 of smccc.c:
> 
>  arm_smccc_version_init+0x90/0x1ac:
>  arm_smccc_version_init at debian/build/build_arm64_none_arm64/drivers/firmware/smccc/smccc.c:31
> 
>  22 void __init arm_smccc_version_init(u32 version, enum arm_smccc_conduit conduit)
>  23 {
>  24         struct arm_smccc_res res;
>  25 
>  26         smccc_version = version;
>  27         smccc_conduit = conduit;
>  28 
>  29         smccc_trng_available = smccc_probe_trng();
>  30 
>  31         if ((smccc_version >= ARM_SMCCC_VERSION_1_2) &&
> 
> This is with kernel 6.1.123. The last known good kernel I have available right
> now is 6.1.119. In the 6.1.120 changelog I see the following commit which seems
> potentially related?
> 
>  https://lore.kernel.org/all/20241106160448.2712997-1-mark.rutland@arm.com/
> 
> That's stable commit [1].

I can't see anything wrong with that patch but if you have reason to
suspect that it's broken can you try reverting it to see if the problem
disappears?

Will


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-01-24 14:52 [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac Emanuele Rocca
  2025-01-30 11:43 ` Will Deacon
@ 2025-01-30 12:19 ` Mark Rutland
  2025-01-30 14:56   ` Emanuele Rocca
  1 sibling, 1 reply; 15+ messages in thread
From: Mark Rutland @ 2025-01-30 12:19 UTC (permalink / raw)
  To: Emanuele Rocca; +Cc: linux-arm-kernel

On Fri, Jan 24, 2025 at 03:52:10PM +0100, Emanuele Rocca wrote:
> Hi,

Hi Emanuele,

> longterm kernel 6.1.123 crashes early when booting on the Lenovo Thinkpad X13s
> with the following error:
> 
>  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000264
> 
>  pc: arm_smccc_version_init+0x90/0x1ac
> 
> According to faddr2line, that is line 31 of smccc.c:
> 
>  arm_smccc_version_init+0x90/0x1ac:
>  arm_smccc_version_init at debian/build/build_arm64_none_arm64/drivers/firmware/smccc/smccc.c:31
> 
>  22 void __init arm_smccc_version_init(u32 version, enum arm_smccc_conduit conduit)
>  23 {
>  24         struct arm_smccc_res res;
>  25 
>  26         smccc_version = version;
>  27         smccc_conduit = conduit;
>  28 
>  29         smccc_trng_available = smccc_probe_trng();
>  30 
>  31         if ((smccc_version >= ARM_SMCCC_VERSION_1_2) &&

For the benefit others, when we looked into this a few days ago it
appeared that a GPR was being clobbered across an SMCCC call, resulting
in a later crash (as that GPR should hold the ADRP'd base address of
'smccc_version'). I didn't have the time to dig more into that (e.g.  to
figure out whether kernel/compiler/firmware was to blame).

Emanuele, could you please dump the result of:

  objdump --disassemble=arm_smccc_version_init vmlinux

... for this kernel? That'd make it possible for others to
perform/verify the analysis I mentioned above.

If you can share any more details from the crash, that'd be helpful. The
GPR dump would be *enormously* helpful in this case, and even a photo of
the crash log might be useful.

> This is with kernel 6.1.123. The last known good kernel I have available right
> now is 6.1.119. In the 6.1.120 changelog I see the following commit which seems
> potentially related?
> 
>  https://lore.kernel.org/all/20241106160448.2712997-1-mark.rutland@arm.com/

Last I looked, there was no obvious reason why that should have an
effect on this issue. It's possible that the differing asm constraints
have an effect on code generation, and happen to mask the issue.

From a quick scan, I note that the asm constraints *don't* include
clobber x17, and maybe that's getting clobbered by a veneer between the
BL and __arm_smccc_sve_check().

As above, it would really help to have the disassembly for
arm_smccc_version_init(), and the GPRs at the time of the crash.

Mark.

> 
> That's stable commit [1].
> 
> The relevant upstream commit [2] is in linux 6.12, and that kernel version does
> not crash. Comparing [1] vs [2] I see differences, but I can't tell if they can
> help debug the issue further. 
> 
> Thanks,
>   ema
> 
> [1] https://git.kernel.org/linus/bfcaffd4cc2d61ecb0571c5baf127c4089978ad4
> [2] https://git.kernel.org/linus/8c462d56487e3abdbf8a61cedfe7c795a54f4a78

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-01-30 12:19 ` Mark Rutland
@ 2025-01-30 14:56   ` Emanuele Rocca
  2025-01-31 12:41     ` Will Deacon
  0 siblings, 1 reply; 15+ messages in thread
From: Emanuele Rocca @ 2025-01-30 14:56 UTC (permalink / raw)
  To: Mark Rutland; +Cc: linux-arm-kernel

On 2025-01-30 12:19, Mark Rutland wrote:
> For the benefit others, when we looked into this a few days ago it
> appeared that a GPR was being clobbered across an SMCCC call, resulting
> in a later crash (as that GPR should hold the ADRP'd base address of
> 'smccc_version'). I didn't have the time to dig more into that (e.g.  to
> figure out whether kernel/compiler/firmware was to blame).
> 
> Emanuele, could you please dump the result of:
> 
>   objdump --disassemble=arm_smccc_version_init vmlinux
> 
> ... for this kernel? That'd make it possible for others to
> perform/verify the analysis I mentioned above.

Sure, here it is:
https://people.debian.org/~ema/Z5OpGluX6oX5NLxh@NH27D9T0LF.objdump.txt

> If you can share any more details from the crash, that'd be helpful. The
> GPR dump would be *enormously* helpful in this case, and even a photo of
> the crash log might be useful.

pc : arm_smccc_version_init+0x90/0x1ac
lr : psci_probe+0x1fc/0x2c0
sp : ffffa3d8eea33ca0
x29: ffffa3d8eea33ce0 x28: 0000000081000200 x27: ffffa3d8edd72a80
x26: ffffa3d8edd728b0 x25: ffffa3d8eec66248 x24: ffffa3d8edc528d8
x23: ffffa3d8eed7f4f8 x22: 0000000000000001 x21: ffffa3d8eea58410
x20: ffffa3d8eed7f000 x19: 0000000000010001 x18: 0000000000000006
x17: 0000000009f69730 x16: 00000008760968d0 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 00000000ffffefff x10: ffffa3d8eeac09c8 x9 : ffffa3d8eea687d8
x8 : ffffa3d8eea33cb8 x7 : 0000000000000000 x6 : 0000000000000000 
x5 : ffffa3d8eed7f000 x4 : ffffa3d8edeb0000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000000

Call trace:
 arm_smccc_version_init+0x90/0x1ac
 psci_0_2_init+0x20/0x2c
 psci_1_0_init+0x1c/0x58
 psci_dt_init+0x6c/0x98
 setup_arch+0x400/0x5ec
 start_kernel+0x90/0x788
 __primary_switched+0xbc/0xc4
Code: d29fffe1 eb01001f 1a9f97e0 f0ffd384 (b94264c2)

The above was transcribed by hand. I double-checked, but there may be
mistakes.

Photo here: https://people.debian.org/~ema/Z5OpGluX6oX5NLxh@NH27D9T0LF.jpg
 
Thanks,
  ema


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-01-30 14:56   ` Emanuele Rocca
@ 2025-01-31 12:41     ` Will Deacon
  2025-01-31 13:54       ` Mark Rutland
  0 siblings, 1 reply; 15+ messages in thread
From: Will Deacon @ 2025-01-31 12:41 UTC (permalink / raw)
  To: Emanuele Rocca; +Cc: Mark Rutland, linux-arm-kernel

On Thu, Jan 30, 2025 at 03:56:10PM +0100, Emanuele Rocca wrote:
> On 2025-01-30 12:19, Mark Rutland wrote:
> > For the benefit others, when we looked into this a few days ago it
> > appeared that a GPR was being clobbered across an SMCCC call, resulting
> > in a later crash (as that GPR should hold the ADRP'd base address of
> > 'smccc_version'). I didn't have the time to dig more into that (e.g.  to
> > figure out whether kernel/compiler/firmware was to blame).
> > 
> > Emanuele, could you please dump the result of:
> > 
> >   objdump --disassemble=arm_smccc_version_init vmlinux
> > 
> > ... for this kernel? That'd make it possible for others to
> > perform/verify the analysis I mentioned above.
> 
> Sure, here it is:
> https://people.debian.org/~ema/Z5OpGluX6oX5NLxh@NH27D9T0LF.objdump.txt
> 
> > If you can share any more details from the crash, that'd be helpful. The
> > GPR dump would be *enormously* helpful in this case, and even a photo of
> > the crash log might be useful.
> 
> pc : arm_smccc_version_init+0x90/0x1ac
> lr : psci_probe+0x1fc/0x2c0
> sp : ffffa3d8eea33ca0
> x29: ffffa3d8eea33ce0 x28: 0000000081000200 x27: ffffa3d8edd72a80
> x26: ffffa3d8edd728b0 x25: ffffa3d8eec66248 x24: ffffa3d8edc528d8
> x23: ffffa3d8eed7f4f8 x22: 0000000000000001 x21: ffffa3d8eea58410
> x20: ffffa3d8eed7f000 x19: 0000000000010001 x18: 0000000000000006
> x17: 0000000009f69730 x16: 00000008760968d0 x15: 0000000000000000
> x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> x11: 00000000ffffefff x10: ffffa3d8eeac09c8 x9 : ffffa3d8eea687d8
> x8 : ffffa3d8eea33cb8 x7 : 0000000000000000 x6 : 0000000000000000 
> x5 : ffffa3d8eed7f000 x4 : ffffa3d8edeb0000 x3 : 0000000000000000
> x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000000
> 
> Call trace:
>  arm_smccc_version_init+0x90/0x1ac
>  psci_0_2_init+0x20/0x2c
>  psci_1_0_init+0x1c/0x58
>  psci_dt_init+0x6c/0x98
>  setup_arch+0x400/0x5ec
>  start_kernel+0x90/0x788
>  __primary_switched+0xbc/0xc4
> Code: d29fffe1 eb01001f 1a9f97e0 f0ffd384 (b94264c2)
> 
> The above was transcribed by hand. I double-checked, but there may be
> mistakes.
> 
> Photo here: https://people.debian.org/~ema/Z5OpGluX6oX5NLxh@NH27D9T0LF.jpg

Crikey, you transcribed all that junk by _hand_?! Thanks for the effort,
but next time please don't waste your time! The screenshot is fine.

Anyway, it looks like x6 gets trashed across the SMC. I think that's fine
per the SMCCC spec:

  | Unused result and scratch registers can leak information after an
  | SMC or HVC call. An implementation can mitigate this risk by either
  | preserving the register state over the call, or returning a constant
  | value,such as zero, in each register.

so the question is whether or not the compiler should be relying on the
register being preserved. It looks to me like the arm_smccc_1_1_smc()
macro is actually broken in this regard: it constructs the input list
based on the number of arguments passed and doesn't take the above into
account at all.

So I think we should change that so unused arguments are clobbered
instead. Mark -- what do you think?

Will


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-01-31 12:41     ` Will Deacon
@ 2025-01-31 13:54       ` Mark Rutland
  2025-02-04 10:00         ` Will Deacon
  0 siblings, 1 reply; 15+ messages in thread
From: Mark Rutland @ 2025-01-31 13:54 UTC (permalink / raw)
  To: Will Deacon; +Cc: Emanuele Rocca, linux-arm-kernel

On Fri, Jan 31, 2025 at 12:41:04PM +0000, Will Deacon wrote:
> On Thu, Jan 30, 2025 at 03:56:10PM +0100, Emanuele Rocca wrote:
> > On 2025-01-30 12:19, Mark Rutland wrote:
> > > For the benefit others, when we looked into this a few days ago it
> > > appeared that a GPR was being clobbered across an SMCCC call, resulting
> > > in a later crash (as that GPR should hold the ADRP'd base address of
> > > 'smccc_version'). I didn't have the time to dig more into that (e.g.  to
> > > figure out whether kernel/compiler/firmware was to blame).
> > > 
> > > Emanuele, could you please dump the result of:
> > > 
> > >   objdump --disassemble=arm_smccc_version_init vmlinux
> > > 
> > > ... for this kernel? That'd make it possible for others to
> > > perform/verify the analysis I mentioned above.
> > 
> > Sure, here it is:
> > https://people.debian.org/~ema/Z5OpGluX6oX5NLxh@NH27D9T0LF.objdump.txt
> > 
> > > If you can share any more details from the crash, that'd be helpful. The
> > > GPR dump would be *enormously* helpful in this case, and even a photo of
> > > the crash log might be useful.
> > 
> > pc : arm_smccc_version_init+0x90/0x1ac
> > lr : psci_probe+0x1fc/0x2c0
> > sp : ffffa3d8eea33ca0
> > x29: ffffa3d8eea33ce0 x28: 0000000081000200 x27: ffffa3d8edd72a80
> > x26: ffffa3d8edd728b0 x25: ffffa3d8eec66248 x24: ffffa3d8edc528d8
> > x23: ffffa3d8eed7f4f8 x22: 0000000000000001 x21: ffffa3d8eea58410
> > x20: ffffa3d8eed7f000 x19: 0000000000010001 x18: 0000000000000006
> > x17: 0000000009f69730 x16: 00000008760968d0 x15: 0000000000000000
> > x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> > x11: 00000000ffffefff x10: ffffa3d8eeac09c8 x9 : ffffa3d8eea687d8
> > x8 : ffffa3d8eea33cb8 x7 : 0000000000000000 x6 : 0000000000000000 
> > x5 : ffffa3d8eed7f000 x4 : ffffa3d8edeb0000 x3 : 0000000000000000
> > x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000000
> > 
> > Call trace:
> >  arm_smccc_version_init+0x90/0x1ac
> >  psci_0_2_init+0x20/0x2c
> >  psci_1_0_init+0x1c/0x58
> >  psci_dt_init+0x6c/0x98
> >  setup_arch+0x400/0x5ec
> >  start_kernel+0x90/0x788
> >  __primary_switched+0xbc/0xc4
> > Code: d29fffe1 eb01001f 1a9f97e0 f0ffd384 (b94264c2)
> > 
> > The above was transcribed by hand. I double-checked, but there may be
> > mistakes.
> > 
> > Photo here: https://people.debian.org/~ema/Z5OpGluX6oX5NLxh@NH27D9T0LF.jpg
> 
> Crikey, you transcribed all that junk by _hand_?! Thanks for the effort,
> but next time please don't waste your time! The screenshot is fine.
> 
> Anyway, it looks like x6 gets trashed across the SMC. I think that's fine
> per the SMCCC spec:
> 
>   | Unused result and scratch registers can leak information after an
>   | SMC or HVC call. An implementation can mitigate this risk by either
>   | preserving the register state over the call, or returning a constant
>   | value,such as zero, in each register.

I think that's misleading/stale and x6 specifically is not supposed to
be clobbered. Shortly before that there's wording about preserving the
registers, summary/history below:

* SMCCC v1.0 allowed x0 to x17 to be clobbered over a call; all other
  GPRs had to be preserved.

  That was defined in ARM DEN 0028B, AKA "SMC CALLING CONVENTION":
  https://developer.arm.com/documentation/den0028/b/?lang=en

* SMCCC v1.1 mandated that x4 to x17 were preserved. This was critical
  for some of the ARCH_WORKAROUND_* calls made from kernel/kvm entry
  asm.

  That was defined in ARM DEN 0070, AKA "Firmware interfaces for
  mitigating cache speculation vulnerabilities":

  https://developer.arm.com/cache-speculation-vulnerability-firmware-specification

* SMCCC v1.2 relaxed things such that x4 to x17 must be preserved
  *unless* they contained results of the function.

  That was defined in ARM DEN 0028C, AKA "SMC CALLING CONVENTION":
  https://developer.arm.com/documentation/den0028/c/?lang=en

  .. which folded inthe requirements from ARM DEN 0070, with the
  relaxation above, but also kept the stale not from ARM DEN 0028B.

In Linux we depend on x4 to x17 not being clobbered *unless* they're
being used to return a result.

The current wording in the spec is messed up, and suggests that SMC32
calls could clobber the upper 32 bits of x4 to x17, which is inccorect
and would break the ARCH_WORKAROUND_* calls mentioend above (given those
are SMC32). I'll go chase the SMCCC architects about that; it's clearly
an unintentional error when reformatting the document.

The *intent* is that x4 to x17 are preserved here. AFAICT, nothing in
arm_smccc_version_init() returns a value in x4 to x17. If any of those
are clobbered for an SMCCCv1.1 call, it's definitely a firmware bug.

> so the question is whether or not the compiler should be relying on the
> register being preserved. It looks to me like the arm_smccc_1_1_smc()
> macro is actually broken in this regard: it constructs the input list
> based on the number of arguments passed and doesn't take the above into
> account at all.

As above, I think that arm_smccc_1_1_smc() is correct; it *always*
clobbers x0 to x3, and cannot be used for SMCCCv1.2+ calls that return
values in x4 to x17.

That doesn't change the fact that we'll need a workaround, though.

> So I think we should change that so unused arguments are clobbered
> instead. Mark -- what do you think?

As above, if FW gets this wrong for calls made from entry asm, the
results could be catastrophic, so it'd be good if we could detect this
and log a FW BUG the first time we hit it.

We'd have to assume that any of x0 to x17 could be clobbered, and at
that point I suspect it'd be better to move this out-of-line for almost
all callers, which we effectively already have in the form of the
arm_smccc_1_2_{smc,hvc}() trampolines. We could bounce arm_smccc_1_1_*()
calls through those.

Mark.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-01-31 13:54       ` Mark Rutland
@ 2025-02-04 10:00         ` Will Deacon
  2025-02-05 16:50           ` Emanuele Rocca
  0 siblings, 1 reply; 15+ messages in thread
From: Will Deacon @ 2025-02-04 10:00 UTC (permalink / raw)
  To: Mark Rutland; +Cc: Emanuele Rocca, linux-arm-kernel

On Fri, Jan 31, 2025 at 01:54:57PM +0000, Mark Rutland wrote:
> On Fri, Jan 31, 2025 at 12:41:04PM +0000, Will Deacon wrote:
> > Anyway, it looks like x6 gets trashed across the SMC. I think that's fine
> > per the SMCCC spec:
> > 
> >   | Unused result and scratch registers can leak information after an
> >   | SMC or HVC call. An implementation can mitigate this risk by either
> >   | preserving the register state over the call, or returning a constant
> >   | value,such as zero, in each register.
> 
> I think that's misleading/stale and x6 specifically is not supposed to
> be clobbered. Shortly before that there's wording about preserving the
> registers, summary/history below:
> 
> * SMCCC v1.0 allowed x0 to x17 to be clobbered over a call; all other
>   GPRs had to be preserved.
> 
>   That was defined in ARM DEN 0028B, AKA "SMC CALLING CONVENTION":
>   https://developer.arm.com/documentation/den0028/b/?lang=en
> 
> * SMCCC v1.1 mandated that x4 to x17 were preserved. This was critical
>   for some of the ARCH_WORKAROUND_* calls made from kernel/kvm entry
>   asm.
> 
>   That was defined in ARM DEN 0070, AKA "Firmware interfaces for
>   mitigating cache speculation vulnerabilities":
> 
>   https://developer.arm.com/cache-speculation-vulnerability-firmware-specification

I can't find "DEN 0070" anywhere there...

> * SMCCC v1.2 relaxed things such that x4 to x17 must be preserved
>   *unless* they contained results of the function.
> 
>   That was defined in ARM DEN 0028C, AKA "SMC CALLING CONVENTION":
>   https://developer.arm.com/documentation/den0028/c/?lang=en
> 
>   .. which folded inthe requirements from ARM DEN 0070, with the
>   relaxation above, but also kept the stale not from ARM DEN 0028B.
> 
> In Linux we depend on x4 to x17 not being clobbered *unless* they're
> being used to return a result.
> 
> The current wording in the spec is messed up, and suggests that SMC32
> calls could clobber the upper 32 bits of x4 to x17, which is inccorect
> and would break the ARCH_WORKAROUND_* calls mentioend above (given those
> are SMC32). I'll go chase the SMCCC architects about that; it's clearly
> an unintentional error when reformatting the document.
> 
> The *intent* is that x4 to x17 are preserved here. AFAICT, nothing in
> arm_smccc_version_init() returns a value in x4 to x17. If any of those
> are clobbered for an SMCCCv1.1 call, it's definitely a firmware bug.

Perhaps, but I must admit that I have sympathy for the f/w developers
here given that the latest spec on the Arm website says that they can do
this and the spec that apparently requires the registers to be preserved
is either missing or hard to find. How are they supposed to know what
the intention was when the documentation definitively says something
else?

Emanuele -- could you hack the code to poison the other unused result
registers () and see if they are also cleared? ARM_SMCCC_TRNG_VERSION
looks like a 32-bit call, so that would be W1-W7 afaict.

Will


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-01-30 11:43 ` Will Deacon
@ 2025-02-05 10:35   ` Emanuele Rocca
  2025-02-06 13:08     ` Will Deacon
  0 siblings, 1 reply; 15+ messages in thread
From: Emanuele Rocca @ 2025-02-05 10:35 UTC (permalink / raw)
  To: Will Deacon; +Cc: linux-arm-kernel, Mark Rutland

Hello Will and Mark,

On 2025-01-30 11:43, Will Deacon wrote:
> On Fri, Jan 24, 2025 at 03:52:10PM +0100, Emanuele Rocca wrote:
> > This is with kernel 6.1.123. The last known good kernel I have available right
> > now is 6.1.119. In the 6.1.120 changelog I see the following commit which seems
> > potentially related?
> > 
> >  https://lore.kernel.org/all/20241106160448.2712997-1-mark.rutland@arm.com/
> > 
> > That's stable commit [1].
> 
> I can't see anything wrong with that patch but if you have reason to
> suspect that it's broken can you try reverting it to see if the problem
> disappears?

Confirmed: reverting the patch fixes the issue.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-02-04 10:00         ` Will Deacon
@ 2025-02-05 16:50           ` Emanuele Rocca
  2025-02-06 13:11             ` Will Deacon
  0 siblings, 1 reply; 15+ messages in thread
From: Emanuele Rocca @ 2025-02-05 16:50 UTC (permalink / raw)
  To: Will Deacon; +Cc: Mark Rutland, linux-arm-kernel

Hello Will,

On 2025-02-04 10:00, Will Deacon wrote:
> Emanuele -- could you hack the code to poison the other unused result
> registers () and see if they are also cleared? ARM_SMCCC_TRNG_VERSION
> looks like a 32-bit call, so that would be W1-W7 afaict.

Not sure if this is exactly what you are asking for, but right before
the call to smccc_probe_trng():
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/firmware/smccc/smccc.c?h=v6.1.124#n29

I did the following to write to all registers. First I tried W1-W7, then
I went for W1-W17 too:

register unsigned long w1 asm("w1");
[...]
register unsigned long w17 asm("w17");

asm volatile(
  "mov w1, #0x1234\n"
  "mov w2, #0x2234\n"
  [...]
  "mov w16, #0x0234\n"
  "mov w17, #0x1234\n"
);

The values I wrote were not overwritten, see https://people.debian.org/~ema/w1-w17.jpg


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-02-05 10:35   ` Emanuele Rocca
@ 2025-02-06 13:08     ` Will Deacon
  0 siblings, 0 replies; 15+ messages in thread
From: Will Deacon @ 2025-02-06 13:08 UTC (permalink / raw)
  To: Emanuele Rocca; +Cc: linux-arm-kernel, Mark Rutland

On Wed, Feb 05, 2025 at 11:35:18AM +0100, Emanuele Rocca wrote:
> On 2025-01-30 11:43, Will Deacon wrote:
> > On Fri, Jan 24, 2025 at 03:52:10PM +0100, Emanuele Rocca wrote:
> > > This is with kernel 6.1.123. The last known good kernel I have available right
> > > now is 6.1.119. In the 6.1.120 changelog I see the following commit which seems
> > > potentially related?
> > > 
> > >  https://lore.kernel.org/all/20241106160448.2712997-1-mark.rutland@arm.com/
> > > 
> > > That's stable commit [1].
> > 
> > I can't see anything wrong with that patch but if you have reason to
> > suspect that it's broken can you try reverting it to see if the problem
> > disappears?
> 
> Confirmed: reverting the patch fixes the issue.

I guess that's just a happy accident thanks to the clobber list changing
on the smc asm with the sve check.

Will


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-02-05 16:50           ` Emanuele Rocca
@ 2025-02-06 13:11             ` Will Deacon
  2025-02-06 16:57               ` Emanuele Rocca
  0 siblings, 1 reply; 15+ messages in thread
From: Will Deacon @ 2025-02-06 13:11 UTC (permalink / raw)
  To: Emanuele Rocca; +Cc: Mark Rutland, linux-arm-kernel

On Wed, Feb 05, 2025 at 05:50:13PM +0100, Emanuele Rocca wrote:
> Hello Will,
> 
> On 2025-02-04 10:00, Will Deacon wrote:
> > Emanuele -- could you hack the code to poison the other unused result
> > registers () and see if they are also cleared? ARM_SMCCC_TRNG_VERSION
> > looks like a 32-bit call, so that would be W1-W7 afaict.
> 
> Not sure if this is exactly what you are asking for, but right before
> the call to smccc_probe_trng():
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/firmware/smccc/smccc.c?h=v6.1.124#n29
> 
> I did the following to write to all registers. First I tried W1-W7, then
> I went for W1-W17 too:
> 
> register unsigned long w1 asm("w1");
> [...]
> register unsigned long w17 asm("w17");
> 
> asm volatile(
>   "mov w1, #0x1234\n"
>   "mov w2, #0x2234\n"
>   [...]
>   "mov w16, #0x0234\n"
>   "mov w17, #0x1234\n"
> );
> 
> The values I wrote were not overwritten, see https://people.debian.org/~ema/w1-w17.jpg

Hrm, now I'm confused :/

In your screenshot, x6 looks like it's retained its poison value, but
that was the register being corrupted in the initial report. Maybe you
could share the diff you made?

Will


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-02-06 13:11             ` Will Deacon
@ 2025-02-06 16:57               ` Emanuele Rocca
  2025-03-13 22:08                 ` Will Deacon
  0 siblings, 1 reply; 15+ messages in thread
From: Emanuele Rocca @ 2025-02-06 16:57 UTC (permalink / raw)
  To: Will Deacon; +Cc: Mark Rutland, linux-arm-kernel

On 2025-02-06 01:11, Will Deacon wrote:
> In your screenshot, x6 looks like it's retained its poison value, but
> that was the register being corrupted in the initial report. Maybe you
> could share the diff you made?

Sure, please see [1]. A kernel built with that patch crashes and all
registers retain their poison values.

I now also tried moving the poisoning before the smccc_version /
smccc_conduit assignments, and the resulting kernel does *not* crash.
See [2].

[1] https://people.debian.org/~ema/w1-w17-crash.diff
[2] https://people.debian.org/~ema/w1-w17-nocrash.diff


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-02-06 16:57               ` Emanuele Rocca
@ 2025-03-13 22:08                 ` Will Deacon
  2025-09-26 10:36                   ` Dan Carpenter
  0 siblings, 1 reply; 15+ messages in thread
From: Will Deacon @ 2025-03-13 22:08 UTC (permalink / raw)
  To: Emanuele Rocca; +Cc: Mark Rutland, linux-arm-kernel

On Thu, Feb 06, 2025 at 05:57:41PM +0100, Emanuele Rocca wrote:
> On 2025-02-06 01:11, Will Deacon wrote:
> > In your screenshot, x6 looks like it's retained its poison value, but
> > that was the register being corrupted in the initial report. Maybe you
> > could share the diff you made?
> 
> Sure, please see [1]. A kernel built with that patch crashes and all
> registers retain their poison values.

Okey doke, at least it sounds like the firmware isn't taking advantage
of the broken spec, then.

In any case, I think the right way forward here is to use the
out-of-line 1.2 helpers whenever we can (the spectre mitigations can
stay as they are).

Will


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-03-13 22:08                 ` Will Deacon
@ 2025-09-26 10:36                   ` Dan Carpenter
  2025-09-26 11:03                     ` Will Deacon
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Carpenter @ 2025-09-26 10:36 UTC (permalink / raw)
  To: Will Deacon; +Cc: Emanuele Rocca, Mark Rutland, linux-arm-kernel

On Thu, Mar 13, 2025 at 10:08:36PM +0000, Will Deacon wrote:
> On Thu, Feb 06, 2025 at 05:57:41PM +0100, Emanuele Rocca wrote:
> > On 2025-02-06 01:11, Will Deacon wrote:
> > > In your screenshot, x6 looks like it's retained its poison value, but
> > > that was the register being corrupted in the initial report. Maybe you
> > > could share the diff you made?
> > 
> > Sure, please see [1]. A kernel built with that patch crashes and all
> > registers retain their poison values.
> 
> Okey doke, at least it sounds like the firmware isn't taking advantage
> of the broken spec, then.
> 
> In any case, I think the right way forward here is to use the
> out-of-line 1.2 helpers whenever we can (the spectre mitigations can
> stay as they are).
> 
> Will
> 

Any updates on this?  I guess it only affects gcc13?  Clang seems to boot
okay.

regards,
dan carpenter



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac
  2025-09-26 10:36                   ` Dan Carpenter
@ 2025-09-26 11:03                     ` Will Deacon
  0 siblings, 0 replies; 15+ messages in thread
From: Will Deacon @ 2025-09-26 11:03 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: Emanuele Rocca, Mark Rutland, linux-arm-kernel

On Fri, Sep 26, 2025 at 01:36:34PM +0300, Dan Carpenter wrote:
> On Thu, Mar 13, 2025 at 10:08:36PM +0000, Will Deacon wrote:
> > On Thu, Feb 06, 2025 at 05:57:41PM +0100, Emanuele Rocca wrote:
> > > On 2025-02-06 01:11, Will Deacon wrote:
> > > > In your screenshot, x6 looks like it's retained its poison value, but
> > > > that was the register being corrupted in the initial report. Maybe you
> > > > could share the diff you made?
> > > 
> > > Sure, please see [1]. A kernel built with that patch crashes and all
> > > registers retain their poison values.
> > 
> > Okey doke, at least it sounds like the firmware isn't taking advantage
> > of the broken spec, then.
> > 
> > In any case, I think the right way forward here is to use the
> > out-of-line 1.2 helpers whenever we can (the spectre mitigations can
> > stay as they are).
> > 
> > Will
> > 
> 
> Any updates on this?  I guess it only affects gcc13?  Clang seems to boot
> okay.

I don't recall seeing a patch but we should just move to the out-of-line
SMCCC helpers wherever we can. The inline macros are fiddly to use,
especially when you have to pad things out with zeroes at the caller to
prevent registers from getting corrupted.

Will


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-09-26 11:04 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-24 14:52 [BUG] ARM64 regression: NULL pointer dereference in arm_smccc_version_init+0x90/0x1ac Emanuele Rocca
2025-01-30 11:43 ` Will Deacon
2025-02-05 10:35   ` Emanuele Rocca
2025-02-06 13:08     ` Will Deacon
2025-01-30 12:19 ` Mark Rutland
2025-01-30 14:56   ` Emanuele Rocca
2025-01-31 12:41     ` Will Deacon
2025-01-31 13:54       ` Mark Rutland
2025-02-04 10:00         ` Will Deacon
2025-02-05 16:50           ` Emanuele Rocca
2025-02-06 13:11             ` Will Deacon
2025-02-06 16:57               ` Emanuele Rocca
2025-03-13 22:08                 ` Will Deacon
2025-09-26 10:36                   ` Dan Carpenter
2025-09-26 11:03                     ` Will Deacon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).