-next boot failures during KVM setup

Linux-ARM-Kernel Archive on lore.kernel.org

* -next boot failures during KVM setup
@ 2026-06-08 19:19 Mark Brown
  2026-06-08 20:18 ` Marc Zyngier
  0 siblings, 1 reply; 3+ messages in thread
From: Mark Brown @ 2026-06-08 19:19 UTC
  To: Marc Zyngier, Oliver Upton; +Cc: Aishwarya.TCV, linux-arm-kernel

I'm seeing boot failures on a range of physical arm64 platforms in
today's -next.  Turning on earlycon it looks like we're getting bad
pointer dereferences during KVM initialisation:

[    0.728923] kvm [1]: nv: 570 coarse grained trap handlers
[    0.735138] kvm [1]: nv: 710 fine grained trap handlers
[    0.741326] kvm [1]: IPA Size Limit: 40 bits
[    0.748840] Unable to handle kernel paging request at virtual address ffff00000478e000
[    0.757027] Mem abort info:
[    0.759917]   ESR = 0x0000000096000147
[    0.763772]   EC = 0x25: DABT (current EL), IL = 32 bits
[    0.851526] pc : dcache_clean_inval_poc+0x24/0x48
[    0.856367] lr : kvm_arm_init+0xbb0/0x13f0
...

[    0.937120] Call trace:
[    0.939628]  dcache_clean_inval_poc+0x24/0x48 (P)
[    0.944457]  do_one_initcall+0x60/0x1d4
[    0.948393]  kernel_init_freeable+0x250/0x2d8

   https://lava.sirena.org.uk/scheduler/job/2849583#L848

(with other platforms I've got earlycon logs showing basically the same
thing).  I have some bisects, but they seem to have been confused by
earlier driver core issues; I've tweaked them to try to avoid that and am
retrying.  FVP and qemu seem unaffected:

  https://lava.sirena.org.uk/scheduler/job/2848374#L888
  https://lava.sirena.org.uk/scheduler/job/2848966#L447

The affected platforms thus far are all SMP Cortex A53/5 systems, but
that's the vast majority of my lab.  They have both GICv3 and GICv2.


* Re: -next boot failures during KVM setup
  2026-06-08 19:19 -next boot failures during KVM setup Mark Brown
@ 2026-06-08 20:18 ` Marc Zyngier
  2026-06-08 20:56   ` Ard Biesheuvel
  0 siblings, 1 reply; 3+ messages in thread
From: Marc Zyngier @ 2026-06-08 20:18 UTC
  To: Mark Brown, Will Deacon, Catalin Marinas, Ard Biesheuvel
  Cc: Oliver Upton, Aishwarya.TCV, linux-arm-kernel

[+ Will, Catalin, Ard]

On Mon, 08 Jun 2026 20:19:37 +0100,
Mark Brown <broonie@kernel.org> wrote:
> 
> I'm seeing boot failures on a range of physical arm64 platforms in
> today's -next.  Turning on earlycon it looks like we're getting bad
> pointer dereferences during KVM initialisation:
> 
> [    0.728923] kvm [1]: nv: 570 coarse grained trap handlers
> [    0.735138] kvm [1]: nv: 710 fine grained trap handlers
> [    0.741326] kvm [1]: IPA Size Limit: 40 bits
> [    0.748840] Unable to handle kernel paging request at virtual address ffff00000478e000

That really doesn't look like a duff pointer.

> [    0.757027] Mem abort info:
> [    0.759917]   ESR = 0x0000000096000147

Translation fault, level 3. My take is that something is getting
unmapped.
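
For anyone decoding the ESR value by hand, here is a minimal sketch in plain C. It assumes the ESR_ELx data abort layout from the Arm architecture (EC in bits [31:26], IL in bit 25, CM in bit 8, WnR in bit 6, DFSC in bits [5:0]). The struct and function names are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>

/* Hypothetical container for the ESR_ELx fields of interest. */
struct esr_fields {
	unsigned int ec;	/* bits [31:26], exception class */
	unsigned int il;	/* bit 25, instruction length */
	unsigned int cm;	/* bit 8, fault came from a cache maintenance op */
	unsigned int wnr;	/* bit 6, write not read */
	unsigned int dfsc;	/* bits [5:0], data fault status code */
};

/* Split a raw ESR value into the fields above. */
static struct esr_fields esr_decode(uint64_t esr)
{
	struct esr_fields f = {
		.ec   = (unsigned int)((esr >> 26) & 0x3f),
		.il   = (unsigned int)((esr >> 25) & 0x1),
		.cm   = (unsigned int)((esr >> 8) & 0x1),
		.wnr  = (unsigned int)((esr >> 6) & 0x1),
		.dfsc = (unsigned int)(esr & 0x3f),
	};

	return f;
}

/*
 * DFSC values 0x04..0x07 encode translation faults at levels 0..3.
 * Returns the level, or -1 for any other fault type.
 */
static int esr_translation_fault_level(unsigned int dfsc)
{
	if (dfsc >= 0x04 && dfsc <= 0x07)
		return (int)(dfsc - 0x04);
	return -1;
}
```

Fed 0x96000147, this gives EC = 0x25, IL = 1, CM = 1, WnR = 1 and DFSC = 0x07, i.e. a level 3 translation fault raised by a cache maintenance instruction.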

> [    0.763772]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    0.851526] pc : dcache_clean_inval_poc+0x24/0x48
> [    0.856367] lr : kvm_arm_init+0xbb0/0x13f0
> ...
> 
> [    0.937120] Call trace:
> [    0.939628]  dcache_clean_inval_poc+0x24/0x48 (P)
> [    0.944457]  do_one_initcall+0x60/0x1d4
> [    0.948393]  kernel_init_freeable+0x250/0x2d8
> 
>    https://lava.sirena.org.uk/scheduler/job/2849583#L848
> 
> (with other platforms I've got earlycon logs showing basically the same
> thing).  I have some bisects but they seem to have been confused by
> earlier driver core issues, I've tweaked to try to avoid that and am
> retrying.  FVP and qemu seem unaffected:
> 
>   https://lava.sirena.org.uk/scheduler/job/2848374#L888
>   https://lava.sirena.org.uk/scheduler/job/2848966#L447
> 
> The affected platforms thus far are all SMP Cortex A53/5 systems, but
> that's the vast majority of my lab.  They have both GICv3 and GICv2.

I've reproduced this with -next on an A72 platform, but it doesn't happen
with kvmarm/next on its own. So it is likely something coming from
another tree that messes with CMOs.

The stack trace here is slightly better:

[    0.099138] Unable to handle kernel paging request at virtual address ffff0023d9ead000
[    0.099141] Mem abort info:
[    0.099142]   ESR = 0x0000000096000147
[    0.099144]   EC = 0x25: DABT (current EL), IL = 32 bits
[    0.099146]   SET = 0, FnV = 0
[    0.099148]   EA = 0, S1PTW = 0
[    0.099150]   FSC = 0x07: level 3 translation fault
[    0.099151] Data abort info:
[    0.099153]   ISV = 0, ISS = 0x00000147, ISS2 = 0x00000000
[    0.099155]   CM = 1, WnR = 1, TnD = 0, TagAccess = 0
[    0.099157]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    0.099159] swapper pgtable: 4k pages, 48-bit VAs, pgdp=000000245983b000
[    0.099162] [ffff0023d9ead000] pgd=18000027fffff403, p4d=18000027fffff403, pud=18000027ffffe403, pmd=18000027fffed403, pte=00e8002459eadf06
[    0.099173] Internal error: Oops: 0000000096000147 [#1]  SMP
[    0.582137] Freeing initrd memory: 29068K
[    2.025400] Modules linked in:
[    2.028447] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.1.0-rc7-next-20260608 #6265 PREEMPT 
[    2.037482] Hardware name: SolidRun Ltd. SolidRun CEX7 Platform, BIOS EDK II May 30 2024
[    2.045559] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    2.052510] pc : dcache_clean_inval_poc+0x24/0x48
[    2.057210] lr : kvm_hyp_init_symbols+0x370/0x388
[    2.061904] sp : ffff80008009bd00
[    2.065206] x29: ffff80008009bd00 x28: 0000000000000000 x27: 0000002022116000
[    2.072332] x26: ffff0020232967f0 x25: 00000020a2116000 x24: 00000000000038b0
[    2.079458] x23: 0000000000000030 x22: ffffc7dc575880c8 x21: ffffc7dc57948fb0
[    2.086584] x20: 0000000000000001 x19: 0000000001002222 x18: 00000000ffffffff
[    2.093709] x17: 000000007a3345b6 x16: 0000000073a611dd x15: 0000000000000000
[    2.100835] x14: 0000000000000000 x13: 0000000000000000 x12: fffffffffffff800
[    2.107960] x11: 00000000000007ff x10: 0000000000000000 x9 : fffffffffffff800
[    2.115086] x8 : 00000000000007ff x7 : 0000000000000000 x6 : ffffc7dc5740af58
[    2.122211] x5 : 0000000080000000 x4 : ffffc7b87de00000 x3 : 000000000000003f
[    2.129336] x2 : 0000000000000040 x1 : ffff0023d9eaf000 x0 : ffff0023d9ead000
[    2.136462] Call trace:
[    2.138896]  dcache_clean_inval_poc+0x24/0x48 (P)
[    2.143592]  init_hyp_mode+0x644/0x960
[    2.147333]  kvm_arm_init+0x128/0x280
[    2.150987]  do_one_initcall+0x4c/0x458
[    2.154813]  kernel_init_freeable+0x1f4/0x2a0
[    2.159161]  kernel_init+0x2c/0x150
[    2.162642]  ret_from_fork+0x10/0x20
[    2.166210] Code: 9ac32042 d1000443 8a230000 d503201f (d50b7e20) 
[    2.172292] ---[ end trace 0000000000000000 ]---
[    2.176958] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    2.184608] SMP: stopping secondary CPUs
[    2.188523] Kernel Offset: 0x47dbd5dc0000 from 0xffff800080000000
[    2.194604] PHYS_OFFSET: 0x80000000
[    2.198080] CPU features: 0x04000000,804b0008,00040001,0400421b
[    2.203988] Memory Limit: none
[    2.207031] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
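
The page table dump in this log can be cross-checked against the faulting address. The sketch below assumes 4k pages and 48-bit VAs as reported in the log, a linear map starting at 0xffff000000000000 for that configuration, and the usual descriptor layout (bit 0 is the valid bit, output address in bits [47:12]). The macro and helper names are hypothetical and are not the kernel's own.

```c
#include <stdint.h>

#define LINEAR_MAP_BASE	0xffff000000000000ULL	/* assumed start of the linear map (48-bit VAs) */
#define DESC_VALID	(1ULL << 0)		/* descriptor valid bit */
#define DESC_OA_MASK	0x0000fffffffff000ULL	/* output address, bits [47:12] */

/* Translate a linear-map VA to a PA, given the PHYS_OFFSET printed in the panic. */
static uint64_t linear_va_to_pa(uint64_t va, uint64_t phys_offset)
{
	return va - LINEAR_MAP_BASE + phys_offset;
}

/* Non-zero if the descriptor has its valid bit set. */
static int desc_is_valid(uint64_t desc)
{
	return (desc & DESC_VALID) != 0;
}

/* Extract the output address from a page or table descriptor. */
static uint64_t desc_output_addr(uint64_t desc)
{
	return desc & DESC_OA_MASK;
}
```

With PHYS_OFFSET 0x80000000, the faulting VA ffff0023d9ead000 corresponds to PA 0x2459ead000, which is the output address still present in pte=00e8002459eadf06, while that pte has its valid bit clear. That looks more like a linear-map alias that was deliberately invalidated than a stray pointer.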

This points to the following code in kvm_hyp_init_symbols():

<quote>
	/*
	 * Flush entire BSS since part of its data containing init symbols is read
	 * while the MMU is off.
	 */
	kvm_flush_dcache_to_poc(kvm_ksym_ref(__hyp_bss_start),
				kvm_ksym_ref(__hyp_bss_end) - kvm_ksym_ref(__hyp_bss_start));

</quote>

which I suspect is related to some of the new BSS related code in
arm64/for-next/mm.
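
To illustrate why the flush trips over an unmapped alias, here is a rough model of a clean+invalidate by VA. It is not the kernel's assembly routine. It only mimics the shape of the loop (align the start down to a cache line, then issue one maintenance operation per line), with a hypothetical callback standing in for the `dc civac` instruction. The 64-byte line size is an assumption read off x2 in the register dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one "dc civac" on the line containing addr. */
typedef void (*line_op_t)(uint64_t addr, void *ctx);

/*
 * Apply op to every cache line overlapping [start, start + size).
 * Returns the number of lines touched.
 */
static size_t for_each_cache_line(uint64_t start, uint64_t size,
				  uint64_t line_size, line_op_t op, void *ctx)
{
	uint64_t end = start + size;
	uint64_t addr;
	size_t lines = 0;

	/* The mask below only works for a power-of-two line size. */
	assert(line_size && (line_size & (line_size - 1)) == 0);

	/* Align the start down to a line boundary before walking the range. */
	for (addr = start & ~(line_size - 1); addr < end; addr += line_size) {
		op(addr, ctx);
		lines++;
	}

	return lines;
}

/* Example op: remember the first address that was touched. */
static void record_first(uint64_t addr, void *ctx)
{
	uint64_t *first = ctx;

	if (*first == 0)
		*first = addr;
}
```

Assuming x0 and x1 in the dump hold the start and end of the range (ffff0023d9ead000 and ffff0023d9eaf000), the flush covers 0x2000 bytes, or 128 lines, and the first operation already targets the faulting address. Every operation goes through the VA it is given, so an invalid pte for that VA is enough to produce the translation fault.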

Ard, does this ring a bell?

Thanks,

	M.

-- 
Jazz isn't dead. It just smells funny.


* Re: -next boot failures during KVM setup
  2026-06-08 20:18 ` Marc Zyngier
@ 2026-06-08 20:56   ` Ard Biesheuvel
  0 siblings, 0 replies; 3+ messages in thread
From: Ard Biesheuvel @ 2026-06-08 20:56 UTC
  To: Marc Zyngier, Mark Brown, Will Deacon, Catalin Marinas
  Cc: Oliver Upton, Aishwarya.TCV, linux-arm-kernel


On Mon, 8 Jun 2026, at 22:18, Marc Zyngier wrote:
> [+ Will, Catalin, Ard]
>
> On Mon, 08 Jun 2026 20:19:37 +0100,
> Mark Brown <broonie@kernel.org> wrote:
>> 
>> I'm seeing boot failures on a range of physical arm64 platforms in
>> today's -next.  Turning on earlycon it looks like we're getting bad
>> pointer dereferences during KVM initialisation:
>> 
>> [    0.728923] kvm [1]: nv: 570 coarse grained trap handlers
>> [    0.735138] kvm [1]: nv: 710 fine grained trap handlers
>> [    0.741326] kvm [1]: IPA Size Limit: 40 bits
>> [    0.748840] Unable to handle kernel paging request at virtual address ffff00000478e000
>
> That really doesn't look like a duff pointer.
>
>> [    0.757027] Mem abort info:
>> [    0.759917]   ESR = 0x0000000096000147
>
> Translation fault, level 3. My take is that something is getting
> unmapped.
>
...
> I've reproduced this with -next on an A72 platform, but it doesn't happen
> with kvmarm/next on its own. So it is likely something coming from
> another tree that messes with CMOs.
>
> The stack trace here is slightly better:
>
> [    0.099138] Unable to handle kernel paging request at virtual address ffff0023d9ead000
...
> [    2.136462] Call trace:
> [    2.138896]  dcache_clean_inval_poc+0x24/0x48 (P)
> [    2.143592]  init_hyp_mode+0x644/0x960
> [    2.147333]  kvm_arm_init+0x128/0x280
> [    2.150987]  do_one_initcall+0x4c/0x458
> [    2.154813]  kernel_init_freeable+0x1f4/0x2a0
> [    2.159161]  kernel_init+0x2c/0x150
> [    2.162642]  ret_from_fork+0x10/0x20
> [    2.166210] Code: 9ac32042 d1000443 8a230000 d503201f (d50b7e20) 
> [    2.172292] ---[ end trace 0000000000000000 ]---
> [    2.176958] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    2.184608] SMP: stopping secondary CPUs
> [    2.188523] Kernel Offset: 0x47dbd5dc0000 from 0xffff800080000000
> [    2.194604] PHYS_OFFSET: 0x80000000
> [    2.198080] CPU features: 0x04000000,804b0008,00040001,0400421b
> [    2.203988] Memory Limit: none
> [    2.207031] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
>
> This points to the following code in kvm_hyp_init_symbols():
>
> <quote>
> 	/*
> 	 * Flush entire BSS since part of its data containing init symbols is read
> 	 * while the MMU is off.
> 	 */
> 	kvm_flush_dcache_to_poc(kvm_ksym_ref(__hyp_bss_start),
> 				kvm_ksym_ref(__hyp_bss_end) - kvm_ksym_ref(__hyp_bss_start));
>
> </quote>
>
> which I suspect is related to some of the new BSS related code in
> arm64/for-next/mm.
>
> Ard, does this ring a bell?
>

Haven't seen this myself, surprisingly, but yeah, this is obviously related.

By now, I am wondering if unmapping that region entirely is really worth the
hassle, or whether we'd be better off just remapping it read-only.

Given we're at -rc7, I'd lean towards dropping the whole branch for now or,
alternatively, only dropping/reverting "arm64: mm: Unmap kernel data/bss
entirely from the linear map" (and its follow-up fix "arm64: mm: Defer remap
of linear alias of data/bss") so that the region always remains readable via
the linear map.