linux-arm-kernel.lists.infradead.org archive mirror
* CSD lockup during kexec due to unbounded busy-wait in pl011_console_write_atomic (arm64)
@ 2025-11-25 16:02 Breno Leitao
  2025-11-26 14:13 ` Breno Leitao
  2025-11-28 16:08 ` Petr Mladek
  0 siblings, 2 replies; 11+ messages in thread
From: Breno Leitao @ 2025-11-25 16:02 UTC
  To: john.ogness, pmladek, linux, paulmck
  Cc: usamaarif642, leo.yan, linux-arm-kernel, linux-kernel,
	kernel-team, rmikey

Hello,

I am reporting a CSD lockup issue that occurs during kexec on ARM64 hosts,
which I have traced to the amba-pl011 serial driver waiting for hardware with
IRQs disabled in the nbcon atomic write path.


PROBLEM SUMMARY:
================
During kexec, a CSD lockup occurs when pl011_console_write_atomic() performs
an unbounded busy-wait for hardware synchronization while IRQs are disabled.
This blocks other CPUs for extended periods (>11 seconds observed), triggering
CSD lock timeouts.


KERNEL VERSION:
===============
Observed on kernel 6.13, but the code path appears similar in upstream.


ERROR MESSAGE:
==============
  mlx5_core 0000:03:00.0: Shutdown was called
  kvm: exiting hardware virtualization
  arm-smmu-v3 arm-smmu-v3.10.auto: CMD_SYNC timeout at 0x00000103 [hwprod 0x00000104, hwcons 0x00000102]
  smp: csd: Detected non-responsive CSD lock (#1) on CPU#4, waiting 5000000032 ns for CPU#00 do_nothing (kernel/smp.c:1057)
  smp:     csd: CSD lock (#1) unresponsive.
  Sending NMI from CPU 4 to CPUs 0:
  NMI backtrace for cpu 0
  pstate: 03401009 (nzcv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
  pc : pl011_console_write_atomic (./arch/arm64/include/asm/vdso/processor.h:12 drivers/tty/serial/amba-pl011.c:2540)
  lr : pl011_console_write_atomic (drivers/tty/serial/amba-pl011.c:292 drivers/tty/serial/amba-pl011.c:298 drivers/tty/serial/amba-pl011.c:2539)
  sp : ffff80010e26fae0
  pmr: 000000c0
  x29: ffff80010e26fae0 x28: ffff800082ddb000 x27: 00000000000000e0
  x26: 0000000000000001 x25: ffff8000826a8de8 x24: 00000000000008eb
  x23: 0000000000000000 x22: 0000000000000001 x21: 0000000000000000
  x20: ffff00009c19c880 x19: ffff80010e26fb88 x18: 0000000000000018
  x17: 696f70646e452065 x16: 4943502032303830 x15: 3130783020737361
  x14: 6c63203030206570 x13: 746e696f70646e45 x12: 0000000000000000
  x11: 0000000000000008 x10: 0000000000000000 x9 : ffff800081888d80
  x8 : 0000000000000018 x7 : 205d313332363336 x6 : 362e31202020205b
  x5 : ffff000097d4700f x4 : ffff80010e26f99f x3 : ffff800081125220
  x2 : 0000000000000052 x1 : 000000000000000a x0 : ffff00009c19c880
  Call trace:
  pl011_console_write_atomic (./arch/arm64/include/asm/vdso/processor.h:12 drivers/tty/serial/amba-pl011.c:2540) (P)
  nbcon_emit_next_record (kernel/printk/nbcon.c:1049)
  __nbcon_atomic_flush_pending_con (kernel/printk/nbcon.c:1517)
  __nbcon_atomic_flush_pending.llvm.15488114865160659019 (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:192 kernel/printk/nbcon.c:1562 kernel/printk/nbcon.c:1612)
  nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1629)
  printk_kthreads_shutdown (kernel/printk/printk.c:?)
  syscore_shutdown (drivers/base/syscore.c:120)
  kernel_kexec (kernel/kexec_core.c:1045)
  __arm64_sys_reboot (kernel/reboot.c:794 kernel/reboot.c:722 kernel/reboot.c:722)
  invoke_syscall (arch/arm64/kernel/syscall.c:50)
  el0_svc_common.llvm.14158405452757855239 (arch/arm64/kernel/syscall.c:?)
  do_el0_svc (arch/arm64/kernel/syscall.c:152)
  el0_svc (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:73 arch/arm64/kernel/entry-common.c:169 arch/arm64/kernel/entry-common.c:182 arch/arm64/kernel/entry-common.c:749)
  el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:820)
  el0t_64_sync (arch/arm64/kernel/entry.S:600)
  smp: csd: Re-sending CSD lock (#1) IPI from CPU#04 to CPU#00
  Workqueue: events_unbound toggle_allocation_gate

  Call trace:
  show_stack (arch/arm64/kernel/stacktrace.c:503) (C)
  dump_stack_lvl (lib/dump_stack.c:122)
  smp_call_function_many_cond.llvm.3022501501692466737 (lib/dump_stack.c:? kernel/smp.c:305 kernel/smp.c:326 kernel/smp.c:336 kernel/smp.c:884)
  kick_all_cpus_sync (kernel/smp.c:1076)
  __jump_label_update (kernel/jump_label.c:522)
  jump_label_update (kernel/jump_label.c:921)
  static_key_enable_cpuslocked (kernel/jump_label.c:?)
  toggle_allocation_gate (kernel/jump_label.c:224 mm/kfence/core.c:849)
  process_scheduled_works (kernel/workqueue.c:3245 kernel/workqueue.c:3321)
  worker_thread (./include/linux/list.h:373 kernel/workqueue.c:950 kernel/workqueue.c:3403)
  kthread (kernel/kthread.c:391)
  ret_from_fork (arch/arm64/kernel/entry.S:863)
  smp: csd: CSD lock (#1) got unstuck on CPU#04, CPU#00 released the lock.
  kexec_core: Starting new kernel


ROOT CAUSE ANALYSIS:
====================
The issue occurs through the following sequence:

1. System initiates kexec shutdown on an ARM64 host
2. NBCON enters atomic mode during shutdown (printk_kthreads_shutdown)
3. NBCON calls pl011_console_write_atomic() with the following call path:

   local_irq_save()
     __nbcon_atomic_flush_pending_con()
       pl011_console_write_atomic()

4. Inside pl011_console_write_atomic(), the driver performs an unbounded
busy-wait for the hardware to become ready before leaving ->write_atomic():

   while ((pl011_read(uap, REG_FR) ^ uap->vendor->inv_fr) & uap->vendor->fr_busy)
       cpu_relax();            // drivers/tty/serial/amba-pl011.c:2540

5. With IRQs disabled, this busy-wait blocks the CPU for >11 seconds waiting
for the hardware to clear its busy state.

6. Meanwhile, kfence's toggle_allocation_gate() on another CPU (CPU#4 in the
log above) updates a static key, which ends up in kick_all_cpus_sync() and
waits for every CPU to respond to an IPI. This correctly triggers a CSD lock
timeout because CPU#0 is stuck in the busy loop with IRQs disabled.
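
For illustration only, the difference between the unbounded wait in step 4
and a bounded one can be sketched in userspace C. This is not a proposed
patch: struct fake_uart, fake_read_fr() and wait_not_busy_bounded() are
hypothetical names, the register is simulated, the vendor polarity mask
(inv_fr) is left out for brevity, and the bound is a simple poll count
rather than a time limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical BUSY bit for the simulated flag register. */
#define FAKE_FR_BUSY 0x08u

/* Userspace stand-in for the UART; not the driver's data structure. */
struct fake_uart {
	uint32_t fr;             /* simulated flag register contents */
	unsigned int busy_reads; /* reads that still report BUSY */
	bool stuck;              /* true: BUSY never clears */
};

/* Simulated register read: BUSY clears after busy_reads reads unless stuck. */
static uint32_t fake_read_fr(struct fake_uart *u)
{
	if (!u->stuck) {
		if (u->busy_reads > 0)
			u->busy_reads--;
		else
			u->fr &= ~FAKE_FR_BUSY;
	}
	return u->fr;
}

/*
 * Poll until BUSY clears or max_polls reads have been made.
 * Returns true if the device went idle, false if the bound was hit.
 */
static bool wait_not_busy_bounded(struct fake_uart *u, unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (!(fake_read_fr(u) & FAKE_FR_BUSY))
			return true;
	}
	return false;
}
```

With a device that clears BUSY after a few reads the function returns true;
with stuck set it returns false once the bound is reached instead of spinning
forever. What the caller should do when the bound expires is exactly the open
question raised below.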

NOTES:
======

This is somewhat similar to an issue I reported a while ago [1], which was
fixed by Petr's a7df4ed0af77 ("printk: Allow to use the printk kthread
immediately even for 1st nbcon").

[1] https://lore.kernel.org/all/aGVn%2FSnOvwWewkOW@gmail.com/

QUESTIONS:
==========

1) Should nbcon wait for hardware synchronization with IRQs disabled?
2) Can the hardware synchronization be moved out of the IRQ-disabled path?

Thanks
--breno

