From: Kiryl Shutsemau <kirill@shutemov.name>
To: Puranjay Mohan <puranjay12@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>, James Morse <james.morse@arm.com>,
Mark Rutland <mark.rutland@arm.com>,
Marc Zyngier <maz@kernel.org>,
Doug Anderson <dianders@chromium.org>,
Petr Mladek <pmladek@suse.com>,
Thomas Gleixner <tglx@linutronix.de>,
Andrew Morton <akpm@linux-foundation.org>,
Baoquan He <bhe@redhat.com>, Usama Arif <usama.arif@linux.dev>,
Breno Leitao <leitao@debian.org>,
Julien Thierry <julien.thierry.kdev@gmail.com>,
Lecopzer Chen <lecopzer@gmail.com>,
Sumit Garg <sumit.garg@kernel.org>,
kernel-team@meta.com, kexec@lists.infradead.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 3/3] arm64: escalate smp_send_stop() to an SDEI NMI as a last resort
Date: Mon, 15 Jun 2026 13:46:07 +0100 [thread overview]
Message-ID: <ai_yEtDjI0e8sBBD@thinkstation> (raw)
In-Reply-To: <CANk7y0irTcfVrKXBd1L7tLRyDpNJ1cRhwKG7x42XbKgJy+ACeA@mail.gmail.com>
On Mon, Jun 15, 2026 at 12:25:17PM +0200, Puranjay Mohan wrote:
> On Mon, Jun 15, 2026 at 4:36 AM Kiryl Shutsemau <kirill@shutemov.name> wrote:
> >
> > From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
> >
> > A CPU wedged with interrupts masked ignores the stop IPI, and without
> > pseudo-NMI there is no NMI IPI to escalate to: a reboot proceeds with
> > the CPU still running, and a kdump misses its registers.
> >
> > Add a third rung to smp_send_stop(): once the IPI (and pseudo-NMI IPI,
> > if enabled) rungs have run, signal SDEI event 0 at whatever stayed
> > online. Firmware delivers it regardless of the target's DAIF, so it
> > reaches a CPU a plain IPI cannot; the target acks by going offline,
> > which the caller already polls for.
> >
> > Fold the stop bookkeeping into one arm64_nmi_cpu_stop(regs,
> > die_on_crash), shared by the stop IPI handlers, panic_smp_self_stop()
> > and the SDEI handler, replacing the near-duplicate local_cpu_stop() and
> > ipi_cpu_crash_stop(). @die_on_crash is the only difference: the IPI
> > handlers pass true and PSCI CPU_OFF the CPU on a crash stop so a capture
> > kernel can reclaim it; the SDEI handler and self-stop pass false and
> > park. The SDEI park is required, not conservative -- its handler runs
> > inside an SDEI event that is never completed (completing it resumes the
> > wedged context), and a CPU_OFF from that unfinished-event context wedges
> > EL3 on some firmware (left as a follow-up). The dump is unaffected; only
> > re-onlining the CPU in an SMP capture kernel is lost.
> >
> > Suggested-by: Douglas Anderson <dianders@chromium.org>
> > Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> > ---
> > arch/arm64/include/asm/nmi.h | 24 +++++++
> > arch/arm64/kernel/smp.c | 109 +++++++++++++++++++++-----------
> > drivers/firmware/Kconfig | 2 +
> > drivers/firmware/arm_sdei_nmi.c | 75 ++++++++++++++++++++++
> > 4 files changed, 172 insertions(+), 38 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/nmi.h b/arch/arm64/include/asm/nmi.h
> > index 9366be419d18..2e8974ff8d63 100644
> > --- a/arch/arm64/include/asm/nmi.h
> > +++ b/arch/arm64/include/asm/nmi.h
> > @@ -4,21 +4,45 @@
> >
> > #include <linux/cpumask.h>
> >
> > +struct pt_regs;
> > +
> > /*
> > * Cross-CPU NMI provider hooks, consulted by the arm64 arch code before
> > * its regular-IRQ / pseudo-NMI IPI paths. The SDEI provider in
> > * drivers/firmware/arm_sdei_nmi.c implements them when active; a future
> > * FEAT_NMI provider could slot in here too. The stubs let callers stay
> > * unconditional when ARM_SDEI_NMI is off.
> > + *
> > + * sdei_nmi_active() lets a caller test for the service before committing
> > + * to (and waiting on) the SDEI stop rung; sdei_nmi_stop_cpus() then signals
> > + * the targets, which ack by going offline.
> > */
> > #ifdef CONFIG_ARM_SDEI_NMI
> > bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu);
> > +bool sdei_nmi_active(void);
> > +void sdei_nmi_stop_cpus(const cpumask_t *mask);
> > #else
> > static inline bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
> > int exclude_cpu)
> > {
> > return false;
> > }
> > +
> > +static inline bool sdei_nmi_active(void)
> > +{
> > + return false;
> > +}
> > +
> > +static inline void sdei_nmi_stop_cpus(const cpumask_t *mask) { }
> > #endif
> >
> > +/*
> > + * The common "stop this CPU" entry every arm64 stop path funnels through:
> > + * the regular/pseudo-NMI stop IPI handlers, panic_smp_self_stop(), and the
> > + * SDEI cross-CPU NMI handler. @die_on_crash powers the CPU off on the kdump
> > + * crash path (IPI handlers) instead of parking it (SDEI / self-stop).
> > + * Defined in arch/arm64/kernel/smp.c.
> > + */
> > +void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs, bool die_on_crash);
> > +
> > #endif /* __ASM_NMI_H */
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index a670434a8cae..e85a4ba18d5c 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -33,6 +33,7 @@
> > #include <linux/kernel_stat.h>
> > #include <linux/kexec.h>
> > #include <linux/kgdb.h>
> > +#include <linux/kprobes.h>
> > #include <linux/kvm_host.h>
> > #include <linux/nmi.h>
> >
> > @@ -862,14 +863,58 @@ void arch_irq_work_raise(void)
> > }
> > #endif
> >
> > -static void __noreturn local_cpu_stop(unsigned int cpu)
> > +/*
> > + * Bring the local CPU to a stop, saving its register state into the vmcore
> > + * on the kdump crash path first. The single point every arm64 stop path
> > + * funnels through, so the bookkeeping (mask interrupts, mark offline, mask
> > + * SDEI, optionally power off) lives in one place:
> > + *
> > + * - the regular IPI_CPU_STOP and pseudo-NMI IPI_CPU_STOP_NMI handlers;
> > + * - panic_smp_self_stop(), a CPU parking itself on a parallel panic();
> > + * - the SDEI cross-CPU NMI handler (drivers/firmware/arm_sdei_nmi.c),
> > + * which reaches CPUs the stop IPIs could not.
> > + *
> > + * @regs is the register state to record in the vmcore on a crash stop; NULL
> > + * means "capture the current context". @die_on_crash decides the kdump crash
> > + * path: the IPI stop handlers pass true and power the CPU off (PSCI CPU_OFF,
> > + * via __cpu_try_die()) so a capture kernel can reclaim it. The SDEI handler
> > + * and panic_smp_self_stop() pass false and only park. For SDEI that is
> > + * required, not just conservative: it runs inside an SDEI event that is
> > + * deliberately never completed (completing it has firmware resume the wedged
> > + * context), and a CPU_OFF from that not-yet-completed context wedges EL3 on
> > + * some firmware -- a documented follow-up. Parking also matches this path's
> > + * own fallback when CPU_OFF is unavailable.
> > + */
> > +void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs, bool die_on_crash)
> > {
> > + unsigned int cpu = smp_processor_id();
> > + bool crash = IS_ENABLED(CONFIG_KEXEC_CORE) && crash_stop;
> > +
> > + /*
> > + * Use local_daif_mask() instead of local_irq_disable() to make sure
> > + * that pseudo-NMIs are disabled. The "stop" code starts with an IRQ
> > + * and falls back to NMI (which might be pseudo). If the IRQ finally
> > + * goes through right as we're timing out then the NMI could interrupt
> > + * us. It's better to prevent the NMI and let the IRQ finish since the
> > + * pt_regs will be better.
> > + */
> > + local_daif_mask();
> > +
> > + if (crash)
> > + crash_save_cpu(regs, cpu);
> > +
> > + /* the ack a stop requester (e.g. smp_send_stop()) polls for */
> > set_cpu_online(cpu, false);
> >
> > - local_daif_mask();
> > sdei_mask_local_cpu();
> > +
> > + if (crash && die_on_crash)
> > + __cpu_try_die(cpu);
> > +
> > + /* just in case */
> > cpu_park_loop();
> > }
> > +NOKPROBE_SYMBOL(arm64_nmi_cpu_stop);
> >
> > /*
> > * We need to implement panic_smp_self_stop() for parallel panic() calls, so
> > @@ -878,36 +923,7 @@ static void __noreturn local_cpu_stop(unsigned int cpu)
> > */
> > void __noreturn panic_smp_self_stop(void)
> > {
> > - local_cpu_stop(smp_processor_id());
> > -}
> > -
> > -static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
> > -{
> > -#ifdef CONFIG_KEXEC_CORE
> > - /*
> > - * Use local_daif_mask() instead of local_irq_disable() to make sure
> > - * that pseudo-NMIs are disabled. The "crash stop" code starts with
> > - * an IRQ and falls back to NMI (which might be pseudo). If the IRQ
> > - * finally goes through right as we're timing out then the NMI could
> > - * interrupt us. It's better to prevent the NMI and let the IRQ
> > - * finish since the pt_regs will be better.
> > - */
> > - local_daif_mask();
> > -
> > - crash_save_cpu(regs, cpu);
> > -
> > - set_cpu_online(cpu, false);
> > -
> > - sdei_mask_local_cpu();
> > -
> > - if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
> > - __cpu_try_die(cpu);
> > -
> > - /* just in case */
> > - cpu_park_loop();
> > -#else
> > - BUG();
> > -#endif
> > + arm64_nmi_cpu_stop(NULL, false);
> > }
>
> panic_smp_self_stop() passes regs == NULL. If a second CPU panics
> after the primary has already set crash_stop, it loses the panic_cpu
> cmpxchg and calls panic_smp_self_stop(); if it was running with
> interrupts masked it never took the stop IPI, so it gets here with
> crash_stop == 1. crash is then true and we do crash_save_cpu(NULL,
> cpu), which ends up in elf_core_copy_regs(), and on arm64 that is just
>
> *(struct user_pt_regs *)&(dest) = (regs)->user_regs;
>
> so a straight NULL deref -> synchronous abort while we're in the
> middle of crashing. The old local_cpu_stop() never called
> crash_save_cpu(), so this is a new regression from the unification.
Good catch, you're right — that's a real NULL deref, and a regression
from folding ipi_cpu_crash_stop() in (the old code only ran on the IPI
path, which always has regs).
Will be fixed in v4.
>
> The comment above the function says NULL means "capture the current
> context", but crash_save_cpu() doesn't do that, it just dereferences
> regs. If that's the intent, materialise it:
>
> if (crash) {
> struct pt_regs local;
>
> if (!regs) {
> crash_setup_regs(&local, NULL);
> regs = &local;
> }
> crash_save_cpu(regs, cpu);
> }
>
> crash_setup_regs(..., NULL) is the existing "capture current" helper. Or just
> skip the save when regs is NULL if the self-stop registers aren't
> worth having.
I went with skipping it. No architecture saves registers for the
self-stopping CPU -- the generic, arm and powerpc panic_smp_self_stop()
all just mark offline and spin, and arm64 did the same via
local_cpu_stop() before this series. crash_save_cpu() is only ever fed
the interrupted context, from the crash shootdown or the crashing CPU
itself. So:
if (crash && regs)
crash_save_cpu(regs, cpu);
and I dropped the "NULL means capture current" wording from the comment
rather than make it true.
While fixing this I noticed the same block broke the CONFIG_KEXEC_CORE=n
build: the unification replaced the original ipi_cpu_crash_stop()'s
#ifdef CONFIG_KEXEC_CORE with an IS_ENABLED() check, but <linux/kexec.h>
only declares crash_save_cpu() under CONFIG_KEXEC_CORE (and pulling in
<linux/crash_core.h> directly collides with kexec.h's own stubs), so the
dead-but-compiled call was an implicit-declaration error. Restored the
#ifdef around the crash_save_cpu() call.
Thanks for the review!
--
Kiryl Shutsemau / Kirill A. Shutemov
prev parent reply other threads:[~2026-06-15 12:46 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-15 2:35 [PATCH v3 0/3] arm64: cross-CPU NMI via SDEI Kiryl Shutsemau
2026-06-15 2:35 ` [PATCH v3 1/3] firmware: arm_sdei: add SDEI_EVENT_SIGNAL support Kiryl Shutsemau
2026-06-15 2:35 ` [PATCH v3 2/3] drivers/firmware: add SDEI cross-CPU NMI service for arm64 Kiryl Shutsemau
2026-06-15 10:18 ` Puranjay Mohan
2026-06-15 13:15 ` Kiryl Shutsemau
2026-06-15 2:35 ` [PATCH v3 3/3] arm64: escalate smp_send_stop() to an SDEI NMI as a last resort Kiryl Shutsemau
2026-06-15 10:25 ` Puranjay Mohan
2026-06-15 12:46 ` Kiryl Shutsemau [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ai_yEtDjI0e8sBBD@thinkstation \
--to=kirill@shutemov.name \
--cc=akpm@linux-foundation.org \
--cc=bhe@redhat.com \
--cc=catalin.marinas@arm.com \
--cc=dianders@chromium.org \
--cc=james.morse@arm.com \
--cc=julien.thierry.kdev@gmail.com \
--cc=kernel-team@meta.com \
--cc=kexec@lists.infradead.org \
--cc=lecopzer@gmail.com \
--cc=leitao@debian.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mark.rutland@arm.com \
--cc=maz@kernel.org \
--cc=pmladek@suse.com \
--cc=puranjay12@gmail.com \
--cc=sumit.garg@kernel.org \
--cc=tglx@linutronix.de \
--cc=usama.arif@linux.dev \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox