From: Daniel Thompson <daniel.thompson@linaro.org>
To: Doug Anderson <dianders@chromium.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Marc Zyngier <maz@kernel.org>,
	Misono Tomohiro <misono.tomohiro@fujitsu.com>,
	Chen-Yu Tsai <wens@csie.org>, Stephen Boyd <swboyd@chromium.org>,
	Sumit Garg <sumit.garg@linaro.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	"Guilherme G. Piccoli" <gpiccoli@igalia.com>,
	Josh Poimboeuf <jpoimboe@kernel.org>,
	Kees Cook <keescook@chromium.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Tony Luck <tony.luck@intel.com>,
	Valentin Schneider <vschneid@redhat.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] arm64: smp: smp_send_stop() and crash_smp_send_stop() should try non-NMI first
Date: Wed, 28 Feb 2024 13:11:04 +0000
Message-ID: <20240228131104.GB22898@aspen.lan>
In-Reply-To: <CAD=FV=XMkrWmA1D6UjdTs8oZiXxKc1xiUoRqtNqAE-7GoPk8mA@mail.gmail.com>

On Tue, Feb 27, 2024 at 04:57:31PM -0800, Doug Anderson wrote:
> Hi,
>
> On Mon, Jan 8, 2024 at 4:54 PM Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Thu, Dec 7, 2023 at 5:03 PM Douglas Anderson <dianders@chromium.org> wrote:
> > >
> > > When testing hard lockup handling on my sc7180-trogdor-lazor device
> > > with pseudo-NMI enabled, with serial console enabled and with kgdb
> > > disabled, I found that the stack crawls printed to the serial console
> > > ended up as a jumbled mess. After rebooting, the pstore-based console
> > > looked fine though. Also, enabling kgdb to trap the panic made the
> > > console look fine and avoided the mess.
> > >
> > > After a bit of tracking down, I came to the conclusion that this was
> > > what was happening:
> > > 1. The panic path was stopping all other CPUs with
> > >    panic_other_cpus_shutdown().
> > > 2. At least one of those other CPUs was in the middle of printing to
> > >    the serial console and holding the console port's lock, which is
> > >    grabbed with "irqsave". ...but since we were stopping with an NMI
> > >    we didn't care about the "irqsave" and interrupted anyway.
> > > 3. Since we stopped the CPU while it was holding the lock it would
> > >    never release it.
> > > 4. All future calls to output to the console would end up failing to
> > >    get the lock in qcom_geni_serial_console_write(). This isn't
> > >    _totally_ unexpected at panic time but it's a code path that's not
> > >    well tested, hard to get right, and apparently doesn't work
> > >    terribly well on the Qualcomm geni serial driver.
> > >
> > > It would probably be a reasonable idea to try to make the Qualcomm
> > > geni serial driver work better, but also it's nice not to get into
> > > this situation in the first place.
> > >
> > > Taking a page from what x86 appears to do in native_stop_other_cpus(),
> > > let's do this:
> > > 1. First, we'll try to stop other CPUs with a normal IPI and wait a
> > >    second. This gives them a chance to leave critical sections.
> > > > 2. If CPUs fail to stop then we'll retry with an NMI, but give a much
> > > >    lower timeout since there's no good reason for a CPU not to react
> > > >    quickly to an NMI.
> > >
> > > This works well and avoids the corrupted console and (presumably)
> > > could help avoid other similar issues.
> > >
> > > In order to do this, we need to do a little re-organization of our
> > > IPIs since we don't have any more free IDs. We'll do what was
> > > suggested in previous conversations and combine "stop" and "crash
> > > stop". That frees up an IPI so now we can have a "stop" and "stop
> > > NMI".
> > >
> > > In order to do this we also need a slight change in the way we keep
> > > track of which CPUs still need to be stopped. We need to know
> > > specifically which CPUs haven't stopped yet when we fall back to NMI
> > > but in the "crash stop" case the "cpu_online_mask" isn't updated as
> > > CPUs go down. This is why that code path had an atomic of the number
> > > of CPUs left. We'll solve this by making the cpumask into a
> > > global. This has a potential memory implication--with NR_CPUs = 4096
> > > this is 4096/8 = 512 bytes of globals. On the upside in that same case
> > > we take 512 bytes off the stack which could potentially have made the
> > > stop code less reliable. It can be noted that the NMI backtrace code
> > > (lib/nmi_backtrace.c) uses the same approach and that use also
> > > confirms that updating the mask is safe from NMI.
> > >
> > > All of the above lets us combine the logic for "stop" and "crash stop"
> > > code, which appeared to have a bunch of arbitrary implementation
> > > differences. Possibly this could make up for some of the 512 wasted
> > > bytes. ;-)
> > >
> > > Aside from the above change where we try a normal IPI and then an NMI,
> > > the combined function has a few subtle differences:
> > > * In the normal smp_send_stop(), if we fail to stop one or more CPUs
> > >   then we won't include the current CPU (the one running
> > >   smp_send_stop()) in the error message.
> > > * In crash_smp_send_stop(), if we fail to stop some CPUs we'll print
> > >   the CPUs that we failed to stop instead of printing all _but_ the
> > >   current running CPU.
> > > * In crash_smp_send_stop(), we will now only print "SMP: stopping
> > >   secondary CPUs" if (system_state <= SYSTEM_RUNNING).
> > >
> > > Fixes: d7402513c935 ("arm64: smp: IPI_CPU_STOP and IPI_CPU_CRASH_STOP should try for NMI")
> > > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > > ---
> > > > I'm not set up to test crash_smp_send_stop(). I made sure it
> > > compiled and hacked the panic() method to call it, but I haven't
> > > actually run kexec. Hopefully others can confirm that it's working for
> > > them.
> > >
> > >  arch/arm64/kernel/smp.c | 115 +++++++++++++++++++---------------------
> > >  1 file changed, 54 insertions(+), 61 deletions(-)
> >
> > The sound of crickets is overwhelming. ;-) Does anyone have any
> > comments here? Is this a terrible idea? Is this the best idea you've
> > heard all year (it's only been 8 days, so maybe)? Is this great but
> > the implementation is lacking (at best)? Do you hate that this waits
> > for 1 second and wish it waited for 1 ms? 10 ms? 100 ms? 8192 ms?
> >
> > Aside from the weirdness of a processor being killed while holding the
> > console lock, it does seem beneficial to give IRQs at least a little
> > time to finish before killing a processor. I don't have any other
> > explicit examples, but I could just imagine that things might be a
> > little more orderly in such a case...
>
> I'm still hoping to get some sort of feedback here. If people think
> this is a terrible idea then I'll shut up now and leave well enough
> alone, but it would be nice to actively decide and get the patch out
> of limbo.

I've read the patch through a couple of times and was generally convinced
by the "do what x86 does" argument.

However, until now I've always kept my counsel since I wasn't familiar
with these code paths and I figured it was OK for me to have no opinion
because the first line of the description says that kgdb/kdb is 100% not
involved in causing the problem ;-) .

However, today I also took a look at the HAVE_NMI architectures and there
is no consensus between them about how to implement this: PowerPC uses
NMI and most of the others use IRQ only, while s390 special-cases the
panic code path and acts differently compared to a normal SMP shutdown.

FWIW x86 started out irq-only and then switched to irq-plus-nmi
(after a short trial with NMI-only that had problems with pstore
reliability[1]), and that approach has been in place for over
a decade now!

However, if we are talking ourselves into copying x86 then perhaps we
should copy x86 more accurately! Assuming I read the x86 code correctly,
crash_smp_send_stop() will (mostly) go straight to NMI rather
than trialling an IRQ first! That is not what is currently implemented
in the patch for arm64.
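
To make the difference between the two orderings concrete, here is a
rough user-space model. It is only a sketch and not kernel code: the
names model_cpu, stop_result and model_stop() are made up for
illustration, and it assumes that an NMI always stops its target and
that a CPU takes the stop IPI as soon as it re-enables IRQs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; nothing here is a kernel API. */
struct model_cpu {
	/* ms until this CPU re-enables IRQs; negative means it never does */
	int irqs_off_ms;
};

struct stop_result {
	uint64_t stopped_by_irq;	/* took the stop IPI with IRQs enabled */
	uint64_t stopped_by_nmi;	/* stopped wherever it was, maybe holding a lock */
};

/*
 * Model one shutdown from CPU "self". With try_irq_first set, CPUs that
 * re-enable IRQs within irq_timeout_ms stop via the plain IPI and only
 * the stragglers get the NMI. Without it, every other CPU gets the NMI
 * immediately. Assumption: the NMI never fails in this model.
 */
static struct stop_result model_stop(const struct model_cpu *cpus,
				     unsigned int ncpus, unsigned int self,
				     int irq_timeout_ms, bool try_irq_first)
{
	struct stop_result res = { 0, 0 };
	unsigned int cpu;

	/* The masks are single 64-bit words, so cap the model at 64 CPUs. */
	assert(ncpus <= 64 && self < ncpus);

	for (cpu = 0; cpu < ncpus; cpu++) {
		int off = cpus[cpu].irqs_off_ms;

		if (cpu == self)
			continue;

		if (try_irq_first && off >= 0 && off <= irq_timeout_ms)
			res.stopped_by_irq |= UINT64_C(1) << cpu;
		else
			res.stopped_by_nmi |= UINT64_C(1) << cpu;
	}

	return res;
}
```

With four CPUs whose irqs_off_ms values are { 0, 250, -1, 0 }, self = 0
and a 1000 ms timeout, the IRQ-first ordering stops CPUs 1 and 3 with the
plain IPI and needs the NMI only for CPU 2. Going straight to NMI stops
all three wherever they happen to be, including CPU 1 in the middle of
its IRQs-off section.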


Daniel.


[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7d007d21e539dbecb6942c5734e6649f720982cf
