From: Kiryl Shutsemau <kirill@shutemov.name>
To: Puranjay Mohan <puranjay12@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>, James Morse <james.morse@arm.com>,
Mark Rutland <mark.rutland@arm.com>,
Marc Zyngier <maz@kernel.org>,
Doug Anderson <dianders@chromium.org>,
Petr Mladek <pmladek@suse.com>,
Thomas Gleixner <tglx@linutronix.de>,
Andrew Morton <akpm@linux-foundation.org>,
Baoquan He <bhe@redhat.com>, Usama Arif <usama.arif@linux.dev>,
Breno Leitao <leitao@debian.org>,
Julien Thierry <julien.thierry.kdev@gmail.com>,
Lecopzer Chen <lecopzer@gmail.com>,
Sumit Garg <sumit.garg@kernel.org>,
kernel-team@meta.com, kexec@lists.infradead.org,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 3/3] arm64: escalate smp_send_stop() to an SDEI NMI as a last resort
Date: Mon, 15 Jun 2026 13:46:07 +0100 [thread overview]
Message-ID: <ai_yEtDjI0e8sBBD@thinkstation> (raw)
In-Reply-To: <CANk7y0irTcfVrKXBd1L7tLRyDpNJ1cRhwKG7x42XbKgJy+ACeA@mail.gmail.com>
On Mon, Jun 15, 2026 at 12:25:17PM +0200, Puranjay Mohan wrote:
> On Mon, Jun 15, 2026 at 4:36 AM Kiryl Shutsemau <kirill@shutemov.name> wrote:
> >
> > From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
> >
> > A CPU wedged with interrupts masked ignores the stop IPI, and without
> > pseudo-NMI there is no NMI IPI to escalate to: a reboot proceeds with
> > the CPU still running, and a kdump misses its registers.
> >
> > Add a third rung to smp_send_stop(): once the IPI (and pseudo-NMI IPI,
> > if enabled) rungs have run, signal SDEI event 0 at whatever stayed
> > online. Firmware delivers it regardless of the target's DAIF, so it
> > reaches a CPU a plain IPI cannot; the target acks by going offline,
> > which the caller already polls for.
> >
> > Fold the stop bookkeeping into one arm64_nmi_cpu_stop(regs,
> > die_on_crash), shared by the stop IPI handlers, panic_smp_self_stop()
> > and the SDEI handler, replacing the near-duplicate local_cpu_stop() and
> > ipi_cpu_crash_stop(). @die_on_crash is the only difference: the IPI
> > handlers pass true and PSCI CPU_OFF the CPU on a crash stop so a capture
> > kernel can reclaim it; the SDEI handler and self-stop pass false and
> > park. The SDEI park is required, not conservative -- its handler runs
> > inside an SDEI event that is never completed (completing it resumes the
> > wedged context), and a CPU_OFF from that unfinished-event context wedges
> > EL3 on some firmware (left as a follow-up). The dump is unaffected; only
> > re-onlining the CPU in an SMP capture kernel is lost.
> >
> > Suggested-by: Douglas Anderson <dianders@chromium.org>
> > Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> > ---
> > arch/arm64/include/asm/nmi.h | 24 +++++++
> > arch/arm64/kernel/smp.c | 109 +++++++++++++++++++++-----------
> > drivers/firmware/Kconfig | 2 +
> > drivers/firmware/arm_sdei_nmi.c | 75 ++++++++++++++++++++++
> > 4 files changed, 172 insertions(+), 38 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/nmi.h b/arch/arm64/include/asm/nmi.h
> > index 9366be419d18..2e8974ff8d63 100644
> > --- a/arch/arm64/include/asm/nmi.h
> > +++ b/arch/arm64/include/asm/nmi.h
> > @@ -4,21 +4,45 @@
> >
> > #include <linux/cpumask.h>
> >
> > +struct pt_regs;
> > +
> > /*
> > * Cross-CPU NMI provider hooks, consulted by the arm64 arch code before
> > * its regular-IRQ / pseudo-NMI IPI paths. The SDEI provider in
> > * drivers/firmware/arm_sdei_nmi.c implements them when active; a future
> > * FEAT_NMI provider could slot in here too. The stubs let callers stay
> > * unconditional when ARM_SDEI_NMI is off.
> > + *
> > + * sdei_nmi_active() lets a caller test for the service before committing
> > + * to (and waiting on) the SDEI stop rung; sdei_nmi_stop_cpus() then signals
> > + * the targets, which ack by going offline.
> > */
> > #ifdef CONFIG_ARM_SDEI_NMI
> > bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu);
> > +bool sdei_nmi_active(void);
> > +void sdei_nmi_stop_cpus(const cpumask_t *mask);
> > #else
> > static inline bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
> > int exclude_cpu)
> > {
> > return false;
> > }
> > +
> > +static inline bool sdei_nmi_active(void)
> > +{
> > + return false;
> > +}
> > +
> > +static inline void sdei_nmi_stop_cpus(const cpumask_t *mask) { }
> > #endif
> >
> > +/*
> > + * The common "stop this CPU" entry every arm64 stop path funnels through:
> > + * the regular/pseudo-NMI stop IPI handlers, panic_smp_self_stop(), and the
> > + * SDEI cross-CPU NMI handler. @die_on_crash powers the CPU off on the kdump
> > + * crash path (IPI handlers) instead of parking it (SDEI / self-stop).
> > + * Defined in arch/arm64/kernel/smp.c.
> > + */
> > +void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs, bool die_on_crash);
> > +
> > #endif /* __ASM_NMI_H */
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index a670434a8cae..e85a4ba18d5c 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -33,6 +33,7 @@
> > #include <linux/kernel_stat.h>
> > #include <linux/kexec.h>
> > #include <linux/kgdb.h>
> > +#include <linux/kprobes.h>
> > #include <linux/kvm_host.h>
> > #include <linux/nmi.h>
> >
> > @@ -862,14 +863,58 @@ void arch_irq_work_raise(void)
> > }
> > #endif
> >
> > -static void __noreturn local_cpu_stop(unsigned int cpu)
> > +/*
> > + * Bring the local CPU to a stop, saving its register state into the vmcore
> > + * on the kdump crash path first. The single point every arm64 stop path
> > + * funnels through, so the bookkeeping (mask interrupts, mark offline, mask
> > + * SDEI, optionally power off) lives in one place:
> > + *
> > + * - the regular IPI_CPU_STOP and pseudo-NMI IPI_CPU_STOP_NMI handlers;
> > + * - panic_smp_self_stop(), a CPU parking itself on a parallel panic();
> > + * - the SDEI cross-CPU NMI handler (drivers/firmware/arm_sdei_nmi.c),
> > + * which reaches CPUs the stop IPIs could not.
> > + *
> > + * @regs is the register state to record in the vmcore on a crash stop; NULL
> > + * means "capture the current context". @die_on_crash decides the kdump crash
> > + * path: the IPI stop handlers pass true and power the CPU off (PSCI CPU_OFF,
> > + * via __cpu_try_die()) so a capture kernel can reclaim it. The SDEI handler
> > + * and panic_smp_self_stop() pass false and only park. For SDEI that is
> > + * required, not just conservative: it runs inside an SDEI event that is
> > + * deliberately never completed (completing it has firmware resume the wedged
> > + * context), and a CPU_OFF from that not-yet-completed context wedges EL3 on
> > + * some firmware -- a documented follow-up. Parking also matches this path's
> > + * own fallback when CPU_OFF is unavailable.
> > + */
> > +void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs, bool die_on_crash)
> > {
> > + unsigned int cpu = smp_processor_id();
> > + bool crash = IS_ENABLED(CONFIG_KEXEC_CORE) && crash_stop;
> > +
> > + /*
> > + * Use local_daif_mask() instead of local_irq_disable() to make sure
> > + * that pseudo-NMIs are disabled. The "stop" code starts with an IRQ
> > + * and falls back to NMI (which might be pseudo). If the IRQ finally
> > + * goes through right as we're timing out then the NMI could interrupt
> > + * us. It's better to prevent the NMI and let the IRQ finish since the
> > + * pt_regs will be better.
> > + */
> > + local_daif_mask();
> > +
> > + if (crash)
> > + crash_save_cpu(regs, cpu);
> > +
> > + /* the ack a stop requester (e.g. smp_send_stop()) polls for */
> > set_cpu_online(cpu, false);
> >
> > - local_daif_mask();
> > sdei_mask_local_cpu();
> > +
> > + if (crash && die_on_crash)
> > + __cpu_try_die(cpu);
> > +
> > + /* just in case */
> > cpu_park_loop();
> > }
> > +NOKPROBE_SYMBOL(arm64_nmi_cpu_stop);
> >
> > /*
> > * We need to implement panic_smp_self_stop() for parallel panic() calls, so
> > @@ -878,36 +923,7 @@ static void __noreturn local_cpu_stop(unsigned int cpu)
> > */
> > void __noreturn panic_smp_self_stop(void)
> > {
> > - local_cpu_stop(smp_processor_id());
> > -}
> > -
> > -static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
> > -{
> > -#ifdef CONFIG_KEXEC_CORE
> > - /*
> > - * Use local_daif_mask() instead of local_irq_disable() to make sure
> > - * that pseudo-NMIs are disabled. The "crash stop" code starts with
> > - * an IRQ and falls back to NMI (which might be pseudo). If the IRQ
> > - * finally goes through right as we're timing out then the NMI could
> > - * interrupt us. It's better to prevent the NMI and let the IRQ
> > - * finish since the pt_regs will be better.
> > - */
> > - local_daif_mask();
> > -
> > - crash_save_cpu(regs, cpu);
> > -
> > - set_cpu_online(cpu, false);
> > -
> > - sdei_mask_local_cpu();
> > -
> > - if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
> > - __cpu_try_die(cpu);
> > -
> > - /* just in case */
> > - cpu_park_loop();
> > -#else
> > - BUG();
> > -#endif
> > + arm64_nmi_cpu_stop(NULL, false);
> > }
>
> panic_smp_self_stop() passes regs == NULL. If a second CPU panics
> after the primary has already set crash_stop, it loses the panic_cpu
> cmpxchg and calls panic_smp_self_stop(); if it was running with
> interrupts masked it never took the stop IPI, so it gets here with
> crash_stop == 1. crash is then true and we do crash_save_cpu(NULL,
> cpu), which ends up in elf_core_copy_regs(), and on arm64 that is just
>
> *(struct user_pt_regs *)&(dest) = (regs)->user_regs;
>
> so a straight NULL deref -> synchronous abort while we're in the
> middle of crashing. The old local_cpu_stop() never called
> crash_save_cpu(), so this is a new regression from the unification.
Good catch, you're right — that's a real NULL deref, and a regression
from folding ipi_cpu_crash_stop() in (the old code only ran on the IPI
path, which always has regs).
Will be fixed in v4.
>
> The comment above the function says NULL means "capture the current
> context", but crash_save_cpu() doesn't do that, it just dereferences
> regs. If that's the intent, materialise it:
>
> if (crash) {
> struct pt_regs local;
>
> if (!regs) {
> crash_setup_regs(&local, NULL);
> regs = &local;
> }
> crash_save_cpu(regs, cpu);
> }
>
> crash_setup_regs(..., NULL) is the existing "capture current" helper. Or just
> skip the save when regs is NULL if the self-stop registers aren't
> worth having.
I went with skipping it. No architecture saves registers for the
self-stopping CPU -- the generic, arm and powerpc panic_smp_self_stop()
all just mark offline and spin, and arm64 did the same via
local_cpu_stop() before this series. crash_save_cpu() is only ever fed
the interrupted context, from the crash shootdown or the crashing CPU
itself. So:
if (crash && regs)
crash_save_cpu(regs, cpu);
and I dropped the "NULL means capture current" wording from the comment
rather than make it true.
While fixing this I noticed the same block broke the CONFIG_KEXEC_CORE=n
build: the unification replaced the original ipi_cpu_crash_stop()'s
#ifdef CONFIG_KEXEC_CORE with an IS_ENABLED() check, but <linux/kexec.h>
only declares crash_save_cpu() under CONFIG_KEXEC_CORE (and pulling in
<linux/crash_core.h> directly collides with kexec.h's own stubs), so the
dead-but-compiled call was an implicit-declaration error. Restored the
#ifdef around the crash_save_cpu() call.
Thanks for the review!
--
Kiryl Shutsemau / Kirill A. Shutemov
prev parent reply other threads:[~2026-06-15 12:46 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-15 2:35 [PATCH v3 0/3] arm64: cross-CPU NMI via SDEI Kiryl Shutsemau
2026-06-15 2:35 ` [PATCH v3 1/3] firmware: arm_sdei: add SDEI_EVENT_SIGNAL support Kiryl Shutsemau
2026-06-15 2:35 ` [PATCH v3 2/3] drivers/firmware: add SDEI cross-CPU NMI service for arm64 Kiryl Shutsemau
2026-06-15 10:18 ` Puranjay Mohan
2026-06-15 13:15 ` Kiryl Shutsemau
2026-06-15 2:35 ` [PATCH v3 3/3] arm64: escalate smp_send_stop() to an SDEI NMI as a last resort Kiryl Shutsemau
2026-06-15 10:25 ` Puranjay Mohan
2026-06-15 12:46 ` Kiryl Shutsemau [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ai_yEtDjI0e8sBBD@thinkstation \
--to=kirill@shutemov.name \
--cc=akpm@linux-foundation.org \
--cc=bhe@redhat.com \
--cc=catalin.marinas@arm.com \
--cc=dianders@chromium.org \
--cc=james.morse@arm.com \
--cc=julien.thierry.kdev@gmail.com \
--cc=kernel-team@meta.com \
--cc=kexec@lists.infradead.org \
--cc=lecopzer@gmail.com \
--cc=leitao@debian.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mark.rutland@arm.com \
--cc=maz@kernel.org \
--cc=pmladek@suse.com \
--cc=puranjay12@gmail.com \
--cc=sumit.garg@kernel.org \
--cc=tglx@linutronix.de \
--cc=usama.arif@linux.dev \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.