Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Kiryl Shutsemau <kirill@shutemov.name>
To: Puranjay Mohan <puranjay12@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	 Will Deacon <will@kernel.org>, James Morse <james.morse@arm.com>,
	 Mark Rutland <mark.rutland@arm.com>,
	Marc Zyngier <maz@kernel.org>,
	 Doug Anderson <dianders@chromium.org>,
	Petr Mladek <pmladek@suse.com>,
	 Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Baoquan He <bhe@redhat.com>, Usama Arif <usama.arif@linux.dev>,
	 Breno Leitao <leitao@debian.org>,
	Julien Thierry <julien.thierry.kdev@gmail.com>,
	 Lecopzer Chen <lecopzer@gmail.com>,
	Sumit Garg <sumit.garg@kernel.org>,
	kernel-team@meta.com,  kexec@lists.infradead.org,
	linux-arm-kernel@lists.infradead.org,
	 linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 3/3] arm64: escalate smp_send_stop() to an SDEI NMI as a last resort
Date: Mon, 15 Jun 2026 13:46:07 +0100	[thread overview]
Message-ID: <ai_yEtDjI0e8sBBD@thinkstation> (raw)
In-Reply-To: <CANk7y0irTcfVrKXBd1L7tLRyDpNJ1cRhwKG7x42XbKgJy+ACeA@mail.gmail.com>

On Mon, Jun 15, 2026 at 12:25:17PM +0200, Puranjay Mohan wrote:
> On Mon, Jun 15, 2026 at 4:36 AM Kiryl Shutsemau <kirill@shutemov.name> wrote:
> >
> > From: "Kiryl Shutsemau (Meta)" <kas@kernel.org>
> >
> > A CPU wedged with interrupts masked ignores the stop IPI, and without
> > pseudo-NMI there is no NMI IPI to escalate to: a reboot proceeds with
> > the CPU still running, and a kdump misses its registers.
> >
> > Add a third rung to smp_send_stop(): once the IPI (and pseudo-NMI IPI,
> > if enabled) rungs have run, signal SDEI event 0 at whatever stayed
> > online. Firmware delivers it regardless of the target's DAIF, so it
> > reaches a CPU a plain IPI cannot; the target acks by going offline,
> > which the caller already polls for.
> >
> > Fold the stop bookkeeping into one arm64_nmi_cpu_stop(regs,
> > die_on_crash), shared by the stop IPI handlers, panic_smp_self_stop()
> > and the SDEI handler, replacing the near-duplicate local_cpu_stop() and
> > ipi_cpu_crash_stop(). @die_on_crash is the only difference: the IPI
> > handlers pass true and PSCI CPU_OFF the CPU on a crash stop so a capture
> > kernel can reclaim it; the SDEI handler and self-stop pass false and
> > park. The SDEI park is required, not conservative -- its handler runs
> > inside an SDEI event that is never completed (completing it resumes the
> > wedged context), and a CPU_OFF from that unfinished-event context wedges
> > EL3 on some firmware (left as a follow-up). The dump is unaffected; only
> > re-onlining the CPU in an SMP capture kernel is lost.
> >
> > Suggested-by: Douglas Anderson <dianders@chromium.org>
> > Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> > ---
> >  arch/arm64/include/asm/nmi.h    |  24 +++++++
> >  arch/arm64/kernel/smp.c         | 109 +++++++++++++++++++++-----------
> >  drivers/firmware/Kconfig        |   2 +
> >  drivers/firmware/arm_sdei_nmi.c |  75 ++++++++++++++++++++++
> >  4 files changed, 172 insertions(+), 38 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/nmi.h b/arch/arm64/include/asm/nmi.h
> > index 9366be419d18..2e8974ff8d63 100644
> > --- a/arch/arm64/include/asm/nmi.h
> > +++ b/arch/arm64/include/asm/nmi.h
> > @@ -4,21 +4,45 @@
> >
> >  #include <linux/cpumask.h>
> >
> > +struct pt_regs;
> > +
> >  /*
> >   * Cross-CPU NMI provider hooks, consulted by the arm64 arch code before
> >   * its regular-IRQ / pseudo-NMI IPI paths. The SDEI provider in
> >   * drivers/firmware/arm_sdei_nmi.c implements them when active; a future
> >   * FEAT_NMI provider could slot in here too. The stubs let callers stay
> >   * unconditional when ARM_SDEI_NMI is off.
> > + *
> > + * sdei_nmi_active() lets a caller test for the service before committing
> > + * to (and waiting on) the SDEI stop rung; sdei_nmi_stop_cpus() then signals
> > + * the targets, which ack by going offline.
> >   */
> >  #ifdef CONFIG_ARM_SDEI_NMI
> >  bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu);
> > +bool sdei_nmi_active(void);
> > +void sdei_nmi_stop_cpus(const cpumask_t *mask);
> >  #else
> >  static inline bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
> >                                                       int exclude_cpu)
> >  {
> >         return false;
> >  }
> > +
> > +static inline bool sdei_nmi_active(void)
> > +{
> > +       return false;
> > +}
> > +
> > +static inline void sdei_nmi_stop_cpus(const cpumask_t *mask) { }
> >  #endif
> >
> > +/*
> > + * The common "stop this CPU" entry every arm64 stop path funnels through:
> > + * the regular/pseudo-NMI stop IPI handlers, panic_smp_self_stop(), and the
> > + * SDEI cross-CPU NMI handler. @die_on_crash powers the CPU off on the kdump
> > + * crash path (IPI handlers) instead of parking it (SDEI / self-stop).
> > + * Defined in arch/arm64/kernel/smp.c.
> > + */
> > +void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs, bool die_on_crash);
> > +
> >  #endif /* __ASM_NMI_H */
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index a670434a8cae..e85a4ba18d5c 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -33,6 +33,7 @@
> >  #include <linux/kernel_stat.h>
> >  #include <linux/kexec.h>
> >  #include <linux/kgdb.h>
> > +#include <linux/kprobes.h>
> >  #include <linux/kvm_host.h>
> >  #include <linux/nmi.h>
> >
> > @@ -862,14 +863,58 @@ void arch_irq_work_raise(void)
> >  }
> >  #endif
> >
> > -static void __noreturn local_cpu_stop(unsigned int cpu)
> > +/*
> > + * Bring the local CPU to a stop, saving its register state into the vmcore
> > + * on the kdump crash path first. The single point every arm64 stop path
> > + * funnels through, so the bookkeeping (mask interrupts, mark offline, mask
> > + * SDEI, optionally power off) lives in one place:
> > + *
> > + *   - the regular IPI_CPU_STOP and pseudo-NMI IPI_CPU_STOP_NMI handlers;
> > + *   - panic_smp_self_stop(), a CPU parking itself on a parallel panic();
> > + *   - the SDEI cross-CPU NMI handler (drivers/firmware/arm_sdei_nmi.c),
> > + *     which reaches CPUs the stop IPIs could not.
> > + *
> > + * @regs is the register state to record in the vmcore on a crash stop; NULL
> > + * means "capture the current context". @die_on_crash decides the kdump crash
> > + * path: the IPI stop handlers pass true and power the CPU off (PSCI CPU_OFF,
> > + * via __cpu_try_die()) so a capture kernel can reclaim it. The SDEI handler
> > + * and panic_smp_self_stop() pass false and only park. For SDEI that is
> > + * required, not just conservative: it runs inside an SDEI event that is
> > + * deliberately never completed (completing it has firmware resume the wedged
> > + * context), and a CPU_OFF from that not-yet-completed context wedges EL3 on
> > + * some firmware -- a documented follow-up. Parking also matches this path's
> > + * own fallback when CPU_OFF is unavailable.
> > + */
> > +void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs, bool die_on_crash)
> >  {
> > +       unsigned int cpu = smp_processor_id();
> > +       bool crash = IS_ENABLED(CONFIG_KEXEC_CORE) && crash_stop;
> > +
> > +       /*
> > +        * Use local_daif_mask() instead of local_irq_disable() to make sure
> > +        * that pseudo-NMIs are disabled. The "stop" code starts with an IRQ
> > +        * and falls back to NMI (which might be pseudo). If the IRQ finally
> > +        * goes through right as we're timing out then the NMI could interrupt
> > +        * us. It's better to prevent the NMI and let the IRQ finish since the
> > +        * pt_regs will be better.
> > +        */
> > +       local_daif_mask();
> > +
> > +       if (crash)
> > +               crash_save_cpu(regs, cpu);
> > +
> > +       /* the ack a stop requester (e.g. smp_send_stop()) polls for */
> >         set_cpu_online(cpu, false);
> >
> > -       local_daif_mask();
> >         sdei_mask_local_cpu();
> > +
> > +       if (crash && die_on_crash)
> > +               __cpu_try_die(cpu);
> > +
> > +       /* just in case */
> >         cpu_park_loop();
> >  }
> > +NOKPROBE_SYMBOL(arm64_nmi_cpu_stop);
> >
> >  /*
> >   * We need to implement panic_smp_self_stop() for parallel panic() calls, so
> > @@ -878,36 +923,7 @@ static void __noreturn local_cpu_stop(unsigned int cpu)
> >   */
> >  void __noreturn panic_smp_self_stop(void)
> >  {
> > -       local_cpu_stop(smp_processor_id());
> > -}
> > -
> > -static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs)
> > -{
> > -#ifdef CONFIG_KEXEC_CORE
> > -       /*
> > -        * Use local_daif_mask() instead of local_irq_disable() to make sure
> > -        * that pseudo-NMIs are disabled. The "crash stop" code starts with
> > -        * an IRQ and falls back to NMI (which might be pseudo). If the IRQ
> > -        * finally goes through right as we're timing out then the NMI could
> > -        * interrupt us. It's better to prevent the NMI and let the IRQ
> > -        * finish since the pt_regs will be better.
> > -        */
> > -       local_daif_mask();
> > -
> > -       crash_save_cpu(regs, cpu);
> > -
> > -       set_cpu_online(cpu, false);
> > -
> > -       sdei_mask_local_cpu();
> > -
> > -       if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
> > -               __cpu_try_die(cpu);
> > -
> > -       /* just in case */
> > -       cpu_park_loop();
> > -#else
> > -       BUG();
> > -#endif
> > +       arm64_nmi_cpu_stop(NULL, false);
> >  }
> 
> panic_smp_self_stop() passes regs == NULL. If a second CPU panics
> after the primary has already set crash_stop, it loses the panic_cpu
> cmpxchg and calls panic_smp_self_stop(); if it was running with
> interrupts masked it never took the stop IPI, so it gets here with
> crash_stop == 1. crash is then true and we do crash_save_cpu(NULL,
> cpu), which ends up in elf_core_copy_regs(), and on arm64 that is just
> 
>         *(struct user_pt_regs *)&(dest) = (regs)->user_regs;
> 
> so a straight NULL deref -> synchronous abort while we're in the
> middle of crashing. The old local_cpu_stop() never called
> crash_save_cpu(), so this is a new regression from the unification.

Good catch, you're right — that's a real NULL deref, and a regression
from folding ipi_cpu_crash_stop() in (the old code only ran on the IPI
path, which always has regs).

Will be fixed in v4.

> 
> The comment above the function says NULL means "capture the current
> context", but crash_save_cpu() doesn't do that, it just dereferences
> regs. If that's the intent, materialise it:
> 
>         if (crash) {
>                 struct pt_regs local;
> 
>                 if (!regs) {
>                         crash_setup_regs(&local, NULL);
>                         regs = &local;
>                 }
>                 crash_save_cpu(regs, cpu);
>         }
> 
>   crash_setup_regs(..., NULL) is the existing "capture current" helper. Or just
>   skip the save when regs is NULL if the self-stop registers aren't
> worth having.

I went with skipping it. No architecture saves registers for the
self-stopping CPU -- the generic, arm and powerpc panic_smp_self_stop()
all just mark offline and spin, and arm64 did the same via
local_cpu_stop() before this series. crash_save_cpu() is only ever fed
the interrupted context, from the crash shootdown or the crashing CPU
itself. So:

	if (crash && regs)
		crash_save_cpu(regs, cpu);

and I dropped the "NULL means capture current" wording from the comment
rather than make it true.

While fixing this I noticed the same block broke the CONFIG_KEXEC_CORE=n
build: the unification replaced the original ipi_cpu_crash_stop()'s
#ifdef CONFIG_KEXEC_CORE with an IS_ENABLED() check, but <linux/kexec.h>
only declares crash_save_cpu() under CONFIG_KEXEC_CORE (and pulling in
<linux/crash_core.h> directly collides with kexec.h's own stubs), so the
dead-but-compiled call was an implicit-declaration error. Restored the
#ifdef around the crash_save_cpu() call.

Thanks for the review!

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


      reply	other threads:[~2026-06-15 12:46 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-15  2:35 [PATCH v3 0/3] arm64: cross-CPU NMI via SDEI Kiryl Shutsemau
2026-06-15  2:35 ` [PATCH v3 1/3] firmware: arm_sdei: add SDEI_EVENT_SIGNAL support Kiryl Shutsemau
2026-06-15  2:35 ` [PATCH v3 2/3] drivers/firmware: add SDEI cross-CPU NMI service for arm64 Kiryl Shutsemau
2026-06-15 10:18   ` Puranjay Mohan
2026-06-15 13:15     ` Kiryl Shutsemau
2026-06-15  2:35 ` [PATCH v3 3/3] arm64: escalate smp_send_stop() to an SDEI NMI as a last resort Kiryl Shutsemau
2026-06-15 10:25   ` Puranjay Mohan
2026-06-15 12:46     ` Kiryl Shutsemau [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ai_yEtDjI0e8sBBD@thinkstation \
    --to=kirill@shutemov.name \
    --cc=akpm@linux-foundation.org \
    --cc=bhe@redhat.com \
    --cc=catalin.marinas@arm.com \
    --cc=dianders@chromium.org \
    --cc=james.morse@arm.com \
    --cc=julien.thierry.kdev@gmail.com \
    --cc=kernel-team@meta.com \
    --cc=kexec@lists.infradead.org \
    --cc=lecopzer@gmail.com \
    --cc=leitao@debian.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=maz@kernel.org \
    --cc=pmladek@suse.com \
    --cc=puranjay12@gmail.com \
    --cc=sumit.garg@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=usama.arif@linux.dev \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox