* Re: [PATCH] lkdtm/bugs: add test for panic() with stuck secondary CPUs
2023-08-31 10:10 [PATCH] lkdtm/bugs: add test for panic() with stuck secondary CPUs Mark Rutland
@ 2023-08-31 12:45 ` Sumit Garg
2023-08-31 13:07 ` Mark Rutland
2023-08-31 16:16 ` Doug Anderson
2023-08-31 19:15 ` Kees Cook
2 siblings, 1 reply; 7+ messages in thread
From: Sumit Garg @ 2023-08-31 12:45 UTC (permalink / raw)
To: Mark Rutland; +Cc: linux-kernel, dianders, keescook, swboyd
Hi Mark,
Thanks for putting up a test case for this.
On Thu, 31 Aug 2023 at 15:40, Mark Rutland <mark.rutland@arm.com> wrote:
>
> Upon a panic() the kernel will use either smp_send_stop() or
> crash_smp_send_stop() to attempt to stop secondary CPUs via an IPI,
> which may or may not be an NMI. Generally it's preferable that this is an
> NMI so that CPUs can be stopped in as many situations as possible, but
> it's not always possible to provide an NMI, and there are cases where
> CPUs may be unable to handle the NMI regardless.
>
> This patch adds a test for panic() where all other CPUs are stuck with
> interrupts disabled, which can be used to check whether the kernel
> gracefully handles CPUs failing to respond to a stop, and whe NMIs stops
s/whe/when/
> work.
>
> For example, on arm64 *without* an NMI, this results in:
>
> | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> | Kernel panic - not syncing: panic stop irqoff test
> | CPU: 2 PID: 24 Comm: migration/2 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> | Call trace:
> | dump_backtrace+0x94/0xec
> | show_stack+0x18/0x24
> | dump_stack_lvl+0x74/0xc0
> | dump_stack+0x18/0x24
> | panic+0x358/0x3e8
> | lkdtm_PANIC+0x0/0x18
> | multi_cpu_stop+0x9c/0x1a0
> | cpu_stopper_thread+0x84/0x118
> | smpboot_thread_fn+0x224/0x248
> | kthread+0x114/0x118
> | ret_from_fork+0x10/0x20
> | SMP: stopping secondary CPUs
> | SMP: failed to stop secondary CPUs 0-3
> | Kernel Offset: 0x401cf3490000 from 0xffff800080000000
> | PHYS_OFFSET: 0x40000000
> | CPU features: 0x00000000,68c167a1,cce6773f
> | Memory Limit: none
> | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
>
> On arm64 *with* an NMI, this results in:
I suppose a more interesting test scenario to show difference among
NMI stop IPI and regular stop IPI would be:
- First put any CPU into hard lockup state via:
$ echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
- And then provoke following from other CPU:
$ echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
>
> | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> | Kernel panic - not syncing: panic stop irqoff test
> | CPU: 1 PID: 19 Comm: migration/1 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> | Call trace:
> | dump_backtrace+0x94/0xec
> | show_stack+0x18/0x24
> | dump_stack_lvl+0x74/0xc0
> | dump_stack+0x18/0x24
> | panic+0x358/0x3e8
> | lkdtm_PANIC+0x0/0x18
> | multi_cpu_stop+0x9c/0x1a0
> | cpu_stopper_thread+0x84/0x118
> | smpboot_thread_fn+0x224/0x248
> | kthread+0x114/0x118
> | ret_from_fork+0x10/0x20
> | SMP: stopping secondary CPUs
> | Kernel Offset: 0x55a9c0bc0000 from 0xffff800080000000
> | PHYS_OFFSET: 0x40000000
> | CPU features: 0x00000000,68c167a1,fce6773f
> | Memory Limit: none
> | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
>
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Douglas Anderson <dianders@chromium.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Stephen Boyd <swboyd@chromium.org
> Cc: Sumit Garg <sumit.garg@linaro.org>
> ---
> drivers/misc/lkdtm/bugs.c | 29 ++++++++++++++++++++++++++++-
> 1 file changed, 28 insertions(+), 1 deletion(-)
>
> I've tested this with the arm64 NMI IPI patches:
>
> https://lore.kernel.org/linux-arm-kernel/20230830191314.1618136-1-dianders@chromium.org/
>
> Specifically, with the patch that uses an NMI for IPI_CPU_STOP and
> IPI_CPU_CRASH_STOP:
>
> https://lore.kernel.org/linux-arm-kernel/20230830121115.v12.5.Ifadbfd45b22c52edcb499034dd4783d096343260@changeid/
>
> Mark.
>
> diff --git a/drivers/misc/lkdtm/bugs.c b/drivers/misc/lkdtm/bugs.c
> index 3c95600ab2f71..368da8b83cd1c 100644
> --- a/drivers/misc/lkdtm/bugs.c
> +++ b/drivers/misc/lkdtm/bugs.c
> @@ -6,12 +6,14 @@
> * test source files.
> */
> #include "lkdtm.h"
> +#include <linux/cpu.h>
> #include <linux/list.h>
> #include <linux/sched.h>
> #include <linux/sched/signal.h>
> #include <linux/sched/task_stack.h>
> -#include <linux/uaccess.h>
> #include <linux/slab.h>
> +#include <linux/stop_machine.h>
> +#include <linux/uaccess.h>
>
> #if IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_UML)
> #include <asm/desc.h>
> @@ -73,6 +75,30 @@ static void lkdtm_PANIC(void)
> panic("dumptest");
> }
>
> +static int panic_stop_irqoff_fn(void *arg)
> +{
> + atomic_t *v = arg;
> +
> + /*
> + * Trigger the panic after all other CPUs have entered this function,
> + * so that they are guaranteed to have IRQs disabled.
> + */
> + if (atomic_inc_return(v) == num_online_cpus())
> + panic("panic stop irqoff test");
> +
> + for (;;)
> + cpu_relax();
> +}
> +
> +static void lkdtm_PANIC_STOP_IRQOFF(void)
> +{
> + atomic_t v = ATOMIC_INIT(0);
> +
> + cpus_read_lock();
> + stop_machine(panic_stop_irqoff_fn, &v, cpu_online_mask);
> + cpus_read_unlock();
stop_machine() does internally use cpus_read_{lock/unlock}(), is there
any need to have them here as well?
-Sumit
> +}
> +
> static void lkdtm_BUG(void)
> {
> BUG();
> @@ -598,6 +624,7 @@ static noinline void lkdtm_CORRUPT_PAC(void)
>
> static struct crashtype crashtypes[] = {
> CRASHTYPE(PANIC),
> + CRASHTYPE(PANIC_STOP_IRQOFF),
> CRASHTYPE(BUG),
> CRASHTYPE(WARNING),
> CRASHTYPE(WARNING_MESSAGE),
> --
> 2.30.2
>
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [PATCH] lkdtm/bugs: add test for panic() with stuck secondary CPUs
2023-08-31 12:45 ` Sumit Garg
@ 2023-08-31 13:07 ` Mark Rutland
2023-08-31 13:16 ` Sumit Garg
0 siblings, 1 reply; 7+ messages in thread
From: Mark Rutland @ 2023-08-31 13:07 UTC (permalink / raw)
To: Sumit Garg; +Cc: linux-kernel, dianders, keescook, swboyd
On Thu, Aug 31, 2023 at 06:15:29PM +0530, Sumit Garg wrote:
> Hi Mark,
>
> Thanks for putting up a test case for this.
>
> On Thu, 31 Aug 2023 at 15:40, Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > Upon a panic() the kernel will use either smp_send_stop() or
> > crash_smp_send_stop() to attempt to stop secondary CPUs via an IPI,
> > which may or may not be an NMI. Generally it's preferable that this is an
> > NMI so that CPUs can be stopped in as many situations as possible, but
> > it's not always possible to provide an NMI, and there are cases where
> > CPUs may be unable to handle the NMI regardless.
> >
> > This patch adds a test for panic() where all other CPUs are stuck with
> > interrupts disabled, which can be used to check whether the kernel
> > gracefully handles CPUs failing to respond to a stop, and whe NMIs stops
>
> s/whe/when/
>
> > work.
> >
> > For example, on arm64 *without* an NMI, this results in:
> >
> > | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> > | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> > | Kernel panic - not syncing: panic stop irqoff test
> > | CPU: 2 PID: 24 Comm: migration/2 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> > | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> > | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> > | Call trace:
> > | dump_backtrace+0x94/0xec
> > | show_stack+0x18/0x24
> > | dump_stack_lvl+0x74/0xc0
> > | dump_stack+0x18/0x24
> > | panic+0x358/0x3e8
> > | lkdtm_PANIC+0x0/0x18
> > | multi_cpu_stop+0x9c/0x1a0
> > | cpu_stopper_thread+0x84/0x118
> > | smpboot_thread_fn+0x224/0x248
> > | kthread+0x114/0x118
> > | ret_from_fork+0x10/0x20
> > | SMP: stopping secondary CPUs
> > | SMP: failed to stop secondary CPUs 0-3
> > | Kernel Offset: 0x401cf3490000 from 0xffff800080000000
> > | PHYS_OFFSET: 0x40000000
> > | CPU features: 0x00000000,68c167a1,cce6773f
> > | Memory Limit: none
> > | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
> >
> > On arm64 *with* an NMI, this results in:
>
> I suppose a more interesting test scenario to show difference among
> NMI stop IPI and regular stop IPI would be:
>
> - First put any CPU into hard lockup state via:
> $ echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
>
> - And then provoke following from other CPU:
> $ echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
I don't follow. IIUC that's only going to test whether a HW watchdog can fire
and reset the system?
The PANIC_STOP_IRQOFF test has each CPU run panic_stop_irqoff_fn() with IRQs
disabled, and if one CPU is stuck in the HARDLOCKUP test, we'll never get all
CPUs into panic_stop_irqoff_fn(), and so all CPUs will be stuck with IRQs
disabled, spinning.
The PANIC_STOP_IRQOFF test itself tests the different between an NMI stop IPI
and regular stop IPI, as the results in the commit message shows. Look for the
line above that says:
| SMP: failed to stop secondary CPUs 0-3
... which is *not* present in the NMI case (though we don't have an explicit
"stoppped all CPUs" message).
> >
> > | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> > | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> > | Kernel panic - not syncing: panic stop irqoff test
> > | CPU: 1 PID: 19 Comm: migration/1 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> > | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> > | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> > | Call trace:
> > | dump_backtrace+0x94/0xec
> > | show_stack+0x18/0x24
> > | dump_stack_lvl+0x74/0xc0
> > | dump_stack+0x18/0x24
> > | panic+0x358/0x3e8
> > | lkdtm_PANIC+0x0/0x18
> > | multi_cpu_stop+0x9c/0x1a0
> > | cpu_stopper_thread+0x84/0x118
> > | smpboot_thread_fn+0x224/0x248
> > | kthread+0x114/0x118
> > | ret_from_fork+0x10/0x20
> > | SMP: stopping secondary CPUs
> > | Kernel Offset: 0x55a9c0bc0000 from 0xffff800080000000
> > | PHYS_OFFSET: 0x40000000
> > | CPU features: 0x00000000,68c167a1,fce6773f
> > | Memory Limit: none
> > | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
> >
> > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > Cc: Douglas Anderson <dianders@chromium.org>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Stephen Boyd <swboyd@chromium.org
> > Cc: Sumit Garg <sumit.garg@linaro.org>
> > ---
> > drivers/misc/lkdtm/bugs.c | 29 ++++++++++++++++++++++++++++-
> > 1 file changed, 28 insertions(+), 1 deletion(-)
> >
> > I've tested this with the arm64 NMI IPI patches:
> >
> > https://lore.kernel.org/linux-arm-kernel/20230830191314.1618136-1-dianders@chromium.org/
> >
> > Specifically, with the patch that uses an NMI for IPI_CPU_STOP and
> > IPI_CPU_CRASH_STOP:
> >
> > https://lore.kernel.org/linux-arm-kernel/20230830121115.v12.5.Ifadbfd45b22c52edcb499034dd4783d096343260@changeid/
> >
> > Mark.
> >
> > diff --git a/drivers/misc/lkdtm/bugs.c b/drivers/misc/lkdtm/bugs.c
> > index 3c95600ab2f71..368da8b83cd1c 100644
> > --- a/drivers/misc/lkdtm/bugs.c
> > +++ b/drivers/misc/lkdtm/bugs.c
> > @@ -6,12 +6,14 @@
> > * test source files.
> > */
> > #include "lkdtm.h"
> > +#include <linux/cpu.h>
> > #include <linux/list.h>
> > #include <linux/sched.h>
> > #include <linux/sched/signal.h>
> > #include <linux/sched/task_stack.h>
> > -#include <linux/uaccess.h>
> > #include <linux/slab.h>
> > +#include <linux/stop_machine.h>
> > +#include <linux/uaccess.h>
> >
> > #if IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_UML)
> > #include <asm/desc.h>
> > @@ -73,6 +75,30 @@ static void lkdtm_PANIC(void)
> > panic("dumptest");
> > }
> >
> > +static int panic_stop_irqoff_fn(void *arg)
> > +{
> > + atomic_t *v = arg;
> > +
> > + /*
> > + * Trigger the panic after all other CPUs have entered this function,
> > + * so that they are guaranteed to have IRQs disabled.
> > + */
> > + if (atomic_inc_return(v) == num_online_cpus())
> > + panic("panic stop irqoff test");
> > +
> > + for (;;)
> > + cpu_relax();
> > +}
> > +
> > +static void lkdtm_PANIC_STOP_IRQOFF(void)
> > +{
> > + atomic_t v = ATOMIC_INIT(0);
> > +
> > + cpus_read_lock();
> > + stop_machine(panic_stop_irqoff_fn, &v, cpu_online_mask);
> > + cpus_read_unlock();
>
> stop_machine() does internally use cpus_read_{lock/unlock}(), is there
> any need to have them here as well?
For some reason I thought that stop_machine() copied the mask prior to calling
cpus_read_lock(), and hance it was necessary to avoid the online mask changing,
but from taking a look just now that's not the case.
I'll drop the cpus_read_lock() .. cpus_read_unlock() here.
Thanks,
Mark.
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [PATCH] lkdtm/bugs: add test for panic() with stuck secondary CPUs
2023-08-31 13:07 ` Mark Rutland
@ 2023-08-31 13:16 ` Sumit Garg
0 siblings, 0 replies; 7+ messages in thread
From: Sumit Garg @ 2023-08-31 13:16 UTC (permalink / raw)
To: Mark Rutland; +Cc: linux-kernel, dianders, keescook, swboyd
On Thu, 31 Aug 2023 at 18:38, Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Thu, Aug 31, 2023 at 06:15:29PM +0530, Sumit Garg wrote:
> > Hi Mark,
> >
> > Thanks for putting up a test case for this.
> >
> > On Thu, 31 Aug 2023 at 15:40, Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > Upon a panic() the kernel will use either smp_send_stop() or
> > > crash_smp_send_stop() to attempt to stop secondary CPUs via an IPI,
> > > which may or may not be an NMI. Generally it's preferable that this is an
> > > NMI so that CPUs can be stopped in as many situations as possible, but
> > > it's not always possible to provide an NMI, and there are cases where
> > > CPUs may be unable to handle the NMI regardless.
> > >
> > > This patch adds a test for panic() where all other CPUs are stuck with
> > > interrupts disabled, which can be used to check whether the kernel
> > > gracefully handles CPUs failing to respond to a stop, and whe NMIs stops
> >
> > s/whe/when/
> >
> > > work.
> > >
> > > For example, on arm64 *without* an NMI, this results in:
> > >
> > > | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> > > | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> > > | Kernel panic - not syncing: panic stop irqoff test
> > > | CPU: 2 PID: 24 Comm: migration/2 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> > > | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> > > | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> > > | Call trace:
> > > | dump_backtrace+0x94/0xec
> > > | show_stack+0x18/0x24
> > > | dump_stack_lvl+0x74/0xc0
> > > | dump_stack+0x18/0x24
> > > | panic+0x358/0x3e8
> > > | lkdtm_PANIC+0x0/0x18
> > > | multi_cpu_stop+0x9c/0x1a0
> > > | cpu_stopper_thread+0x84/0x118
> > > | smpboot_thread_fn+0x224/0x248
> > > | kthread+0x114/0x118
> > > | ret_from_fork+0x10/0x20
> > > | SMP: stopping secondary CPUs
> > > | SMP: failed to stop secondary CPUs 0-3
> > > | Kernel Offset: 0x401cf3490000 from 0xffff800080000000
> > > | PHYS_OFFSET: 0x40000000
> > > | CPU features: 0x00000000,68c167a1,cce6773f
> > > | Memory Limit: none
> > > | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
> > >
> > > On arm64 *with* an NMI, this results in:
> >
> > I suppose a more interesting test scenario to show difference among
> > NMI stop IPI and regular stop IPI would be:
> >
> > - First put any CPU into hard lockup state via:
> > $ echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
> >
> > - And then provoke following from other CPU:
> > $ echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
>
> I don't follow. IIUC that's only going to test whether a HW watchdog can fire
> and reset the system?
>
> The PANIC_STOP_IRQOFF test has each CPU run panic_stop_irqoff_fn() with IRQs
> disabled, and if one CPU is stuck in the HARDLOCKUP test, we'll never get all
> CPUs into panic_stop_irqoff_fn(), and so all CPUs will be stuck with IRQs
> disabled, spinning.
>
> The PANIC_STOP_IRQOFF test itself tests the different between an NMI stop IPI
> and regular stop IPI, as the results in the commit message shows. Look for the
> line above that says:
>
> | SMP: failed to stop secondary CPUs 0-3
>
> ... which is *not* present in the NMI case (though we don't have an explicit
> "stoppped all CPUs" message).
Ah, I see your point as I missed that difference when I first looked
up the panic() logs. So it's the post panic() CPU stop behaviour that
we are testing here. Thanks for the explanation.
-Sumit
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] lkdtm/bugs: add test for panic() with stuck secondary CPUs
2023-08-31 10:10 [PATCH] lkdtm/bugs: add test for panic() with stuck secondary CPUs Mark Rutland
2023-08-31 12:45 ` Sumit Garg
@ 2023-08-31 16:16 ` Doug Anderson
2023-09-21 16:03 ` Mark Rutland
2023-08-31 19:15 ` Kees Cook
2 siblings, 1 reply; 7+ messages in thread
From: Doug Anderson @ 2023-08-31 16:16 UTC (permalink / raw)
To: Mark Rutland; +Cc: linux-kernel, keescook, sumit.garg, swboyd
Hi,
On Thu, Aug 31, 2023 at 3:10 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> Upon a panic() the kernel will use either smp_send_stop() or
> crash_smp_send_stop() to attempt to stop secondary CPUs via an IPI,
> which may or may not be an NMI. Generally it's preferable that this is an
> NMI so that CPUs can be stopped in as many situations as possible, but
> it's not always possible to provide an NMI, and there are cases where
> CPUs may be unable to handle the NMI regardless.
>
> This patch adds a test for panic() where all other CPUs are stuck with
> interrupts disabled, which can be used to check whether the kernel
> gracefully handles CPUs failing to respond to a stop, and whe NMIs stops
> work.
>
> For example, on arm64 *without* an NMI, this results in:
>
> | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> | Kernel panic - not syncing: panic stop irqoff test
> | CPU: 2 PID: 24 Comm: migration/2 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> | Call trace:
> | dump_backtrace+0x94/0xec
> | show_stack+0x18/0x24
> | dump_stack_lvl+0x74/0xc0
> | dump_stack+0x18/0x24
> | panic+0x358/0x3e8
> | lkdtm_PANIC+0x0/0x18
> | multi_cpu_stop+0x9c/0x1a0
> | cpu_stopper_thread+0x84/0x118
> | smpboot_thread_fn+0x224/0x248
> | kthread+0x114/0x118
> | ret_from_fork+0x10/0x20
> | SMP: stopping secondary CPUs
> | SMP: failed to stop secondary CPUs 0-3
> | Kernel Offset: 0x401cf3490000 from 0xffff800080000000
> | PHYS_OFFSET: 0x40000000
> | CPU features: 0x00000000,68c167a1,cce6773f
> | Memory Limit: none
> | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
>
> On arm64 *with* an NMI, this results in:
>
> | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> | Kernel panic - not syncing: panic stop irqoff test
> | CPU: 1 PID: 19 Comm: migration/1 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> | Call trace:
> | dump_backtrace+0x94/0xec
> | show_stack+0x18/0x24
> | dump_stack_lvl+0x74/0xc0
> | dump_stack+0x18/0x24
> | panic+0x358/0x3e8
> | lkdtm_PANIC+0x0/0x18
> | multi_cpu_stop+0x9c/0x1a0
> | cpu_stopper_thread+0x84/0x118
> | smpboot_thread_fn+0x224/0x248
> | kthread+0x114/0x118
> | ret_from_fork+0x10/0x20
> | SMP: stopping secondary CPUs
> | Kernel Offset: 0x55a9c0bc0000 from 0xffff800080000000
> | PHYS_OFFSET: 0x40000000
> | CPU features: 0x00000000,68c167a1,fce6773f
> | Memory Limit: none
> | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
Since Sumit didn't notice it and I had to use a "diff" tool to help me
spot the difference in your two examples, maybe telegraph for the
reader of the commit message that the important thing is the lack of
"SMP: failed to stop secondary CPUs" in your example.
> +static void lkdtm_PANIC_STOP_IRQOFF(void)
> +{
> + atomic_t v = ATOMIC_INIT(0);
> +
> + cpus_read_lock();
> + stop_machine(panic_stop_irqoff_fn, &v, cpu_online_mask);
It broke my brain just a little bit that stop_machine() isn't backed
by "IPI_CPU_STOP" since the point of this was to test the patch
modifying how "IPI_CPU_STOP" worked. ;-) Assuming I didn't mess up, I
guess stop_machine() is just used to get all the CPUs running with
interrupts disabled and then the last CPU does a panic. The _panic_
then tests "IPI_CPU_STOP" / "IPI_CPU_CRASH_STOP".
Probably this is obvious to everyone but me, so no action is needed
unless you agree that it's confusing and can think of a good comment /
commit message addition to help.
In any case, after Sumit's feedback is addressed:
Reviewed-by: Douglas Anderson <dianders@chromium.org>
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [PATCH] lkdtm/bugs: add test for panic() with stuck secondary CPUs
2023-08-31 16:16 ` Doug Anderson
@ 2023-09-21 16:03 ` Mark Rutland
0 siblings, 0 replies; 7+ messages in thread
From: Mark Rutland @ 2023-09-21 16:03 UTC (permalink / raw)
To: Doug Anderson; +Cc: linux-kernel, keescook, sumit.garg, swboyd
On Thu, Aug 31, 2023 at 09:16:31AM -0700, Doug Anderson wrote:
> Hi,
>
> On Thu, Aug 31, 2023 at 3:10 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > Upon a panic() the kernel will use either smp_send_stop() or
> > crash_smp_send_stop() to attempt to stop secondary CPUs via an IPI,
> > which may or may not be an NMI. Generally it's preferable that this is an
> > NMI so that CPUs can be stopped in as many situations as possible, but
> > it's not always possible to provide an NMI, and there are cases where
> > CPUs may be unable to handle the NMI regardless.
> >
> > This patch adds a test for panic() where all other CPUs are stuck with
> > interrupts disabled, which can be used to check whether the kernel
> > gracefully handles CPUs failing to respond to a stop, and whe NMIs stops
> > work.
> >
> > For example, on arm64 *without* an NMI, this results in:
> >
> > | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> > | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> > | Kernel panic - not syncing: panic stop irqoff test
> > | CPU: 2 PID: 24 Comm: migration/2 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> > | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> > | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> > | Call trace:
> > | dump_backtrace+0x94/0xec
> > | show_stack+0x18/0x24
> > | dump_stack_lvl+0x74/0xc0
> > | dump_stack+0x18/0x24
> > | panic+0x358/0x3e8
> > | lkdtm_PANIC+0x0/0x18
> > | multi_cpu_stop+0x9c/0x1a0
> > | cpu_stopper_thread+0x84/0x118
> > | smpboot_thread_fn+0x224/0x248
> > | kthread+0x114/0x118
> > | ret_from_fork+0x10/0x20
> > | SMP: stopping secondary CPUs
> > | SMP: failed to stop secondary CPUs 0-3
> > | Kernel Offset: 0x401cf3490000 from 0xffff800080000000
> > | PHYS_OFFSET: 0x40000000
> > | CPU features: 0x00000000,68c167a1,cce6773f
> > | Memory Limit: none
> > | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
> >
> > On arm64 *with* an NMI, this results in:
> >
> > | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> > | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> > | Kernel panic - not syncing: panic stop irqoff test
> > | CPU: 1 PID: 19 Comm: migration/1 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> > | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> > | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> > | Call trace:
> > | dump_backtrace+0x94/0xec
> > | show_stack+0x18/0x24
> > | dump_stack_lvl+0x74/0xc0
> > | dump_stack+0x18/0x24
> > | panic+0x358/0x3e8
> > | lkdtm_PANIC+0x0/0x18
> > | multi_cpu_stop+0x9c/0x1a0
> > | cpu_stopper_thread+0x84/0x118
> > | smpboot_thread_fn+0x224/0x248
> > | kthread+0x114/0x118
> > | ret_from_fork+0x10/0x20
> > | SMP: stopping secondary CPUs
> > | Kernel Offset: 0x55a9c0bc0000 from 0xffff800080000000
> > | PHYS_OFFSET: 0x40000000
> > | CPU features: 0x00000000,68c167a1,fce6773f
> > | Memory Limit: none
> > | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
>
> Since Sumit didn't notice it and I had to use a "diff" tool to help me
> spot the difference in your two examples, maybe telegraph for the
> reader of the commit message that the important thing is the lack of
> "SMP: failed to stop secondary CPUs" in your example.
Sure; I've added explicit "note" lines to v2 for that.
> > +static void lkdtm_PANIC_STOP_IRQOFF(void)
> > +{
> > + atomic_t v = ATOMIC_INIT(0);
> > +
> > + cpus_read_lock();
> > + stop_machine(panic_stop_irqoff_fn, &v, cpu_online_mask);
>
> It broke my brain just a little bit that stop_machine() isn't backed
> by "IPI_CPU_STOP" since the point of this was to test the patch
> modifying how "IPI_CPU_STOP" worked. ;-) Assuming I didn't mess up, I
> guess stop_machine() is just used to get all the CPUs running with
> interrupts disabled and then the last CPU does a panic. The _panic_
> then tests "IPI_CPU_STOP" / "IPI_CPU_CRASH_STOP".
Yup; I had hoped that the comment within panic_stop_irqoff_fn() was sufficient
for that:
/*
* Trigger the panic after all other CPUs have entered this function,
* so that they are guaranteed to have IRQs disabled.
*/
I've made that a bit more explicit now:
/*
* As stop_machine() disables interrupts, all CPUs within this function
* have interrupts disabled and cannot take a regular IPI.
*
* The last CPU which enters here will trigger a panic, and as all CPUs
* cannot take a regular IPI, we'll only be able to stop secondaries if
* smp_send_stop() or crash_smp_send_stop() uses an NMI.
*/
> Probably this is obvious to everyone but me, so no action is needed
> unless you agree that it's confusing and can think of a good comment /
> commit message addition to help.
Hopefully the above is good enough?
I agree the stop_machine() naming is also somewhat confusing here, but I can't
do much about that.
> In any case, after Sumit's feedback is addressed:
>
> Reviewed-by: Douglas Anderson <dianders@chromium.org>
Thanks!
Mark.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] lkdtm/bugs: add test for panic() with stuck secondary CPUs
2023-08-31 10:10 [PATCH] lkdtm/bugs: add test for panic() with stuck secondary CPUs Mark Rutland
2023-08-31 12:45 ` Sumit Garg
2023-08-31 16:16 ` Doug Anderson
@ 2023-08-31 19:15 ` Kees Cook
2 siblings, 0 replies; 7+ messages in thread
From: Kees Cook @ 2023-08-31 19:15 UTC (permalink / raw)
To: Mark Rutland; +Cc: linux-kernel, dianders, sumit.garg, swboyd
On Thu, Aug 31, 2023 at 11:10:26AM +0100, Mark Rutland wrote:
> Upon a panic() the kernel will use either smp_send_stop() or
> crash_smp_send_stop() to attempt to stop secondary CPUs via an IPI,
> which may or may not be an NMI. Generally it's preferable that this is an
> NMI so that CPUs can be stopped in as many situations as possible, but
> it's not always possible to provide an NMI, and there are cases where
> CPUs may be unable to handle the NMI regardless.
>
> This patch adds a test for panic() where all other CPUs are stuck with
> interrupts disabled, which can be used to check whether the kernel
> gracefully handles CPUs failing to respond to a stop, and whe NMIs stops
> work.
>
> For example, on arm64 *without* an NMI, this results in:
>
> | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> | Kernel panic - not syncing: panic stop irqoff test
> | CPU: 2 PID: 24 Comm: migration/2 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> | Call trace:
> | dump_backtrace+0x94/0xec
> | show_stack+0x18/0x24
> | dump_stack_lvl+0x74/0xc0
> | dump_stack+0x18/0x24
> | panic+0x358/0x3e8
> | lkdtm_PANIC+0x0/0x18
> | multi_cpu_stop+0x9c/0x1a0
> | cpu_stopper_thread+0x84/0x118
> | smpboot_thread_fn+0x224/0x248
> | kthread+0x114/0x118
> | ret_from_fork+0x10/0x20
> | SMP: stopping secondary CPUs
> | SMP: failed to stop secondary CPUs 0-3
> | Kernel Offset: 0x401cf3490000 from 0xffff800080000000
> | PHYS_OFFSET: 0x40000000
> | CPU features: 0x00000000,68c167a1,cce6773f
> | Memory Limit: none
> | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
>
> On arm64 *with* an NMI, this results in:
>
> | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> | Kernel panic - not syncing: panic stop irqoff test
> | CPU: 1 PID: 19 Comm: migration/1 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> | Call trace:
> | dump_backtrace+0x94/0xec
> | show_stack+0x18/0x24
> | dump_stack_lvl+0x74/0xc0
> | dump_stack+0x18/0x24
> | panic+0x358/0x3e8
> | lkdtm_PANIC+0x0/0x18
> | multi_cpu_stop+0x9c/0x1a0
> | cpu_stopper_thread+0x84/0x118
> | smpboot_thread_fn+0x224/0x248
> | kthread+0x114/0x118
> | ret_from_fork+0x10/0x20
> | SMP: stopping secondary CPUs
> | Kernel Offset: 0x55a9c0bc0000 from 0xffff800080000000
> | PHYS_OFFSET: 0x40000000
> | CPU features: 0x00000000,68c167a1,fce6773f
> | Memory Limit: none
> | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
>
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Douglas Anderson <dianders@chromium.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Stephen Boyd <swboyd@chromium.org
> Cc: Sumit Garg <sumit.garg@linaro.org>
> ---
> drivers/misc/lkdtm/bugs.c | 29 ++++++++++++++++++++++++++++-
> 1 file changed, 28 insertions(+), 1 deletion(-)
>
> I've tested this with the arm64 NMI IPI patches:
>
> https://lore.kernel.org/linux-arm-kernel/20230830191314.1618136-1-dianders@chromium.org/
>
> Specifically, with the patch that uses an NMI for IPI_CPU_STOP and
> IPI_CPU_CRASH_STOP:
>
> https://lore.kernel.org/linux-arm-kernel/20230830121115.v12.5.Ifadbfd45b22c52edcb499034dd4783d096343260@changeid/
>
> Mark.
>
> diff --git a/drivers/misc/lkdtm/bugs.c b/drivers/misc/lkdtm/bugs.c
> index 3c95600ab2f71..368da8b83cd1c 100644
> --- a/drivers/misc/lkdtm/bugs.c
> +++ b/drivers/misc/lkdtm/bugs.c
> @@ -6,12 +6,14 @@
> * test source files.
> */
> #include "lkdtm.h"
> +#include <linux/cpu.h>
> #include <linux/list.h>
> #include <linux/sched.h>
> #include <linux/sched/signal.h>
> #include <linux/sched/task_stack.h>
> -#include <linux/uaccess.h>
> #include <linux/slab.h>
> +#include <linux/stop_machine.h>
> +#include <linux/uaccess.h>
>
> #if IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_UML)
> #include <asm/desc.h>
> @@ -73,6 +75,30 @@ static void lkdtm_PANIC(void)
> panic("dumptest");
> }
>
> +static int panic_stop_irqoff_fn(void *arg)
> +{
> + atomic_t *v = arg;
> +
> + /*
> + * Trigger the panic after all other CPUs have entered this function,
> + * so that they are guaranteed to have IRQs disabled.
> + */
> + if (atomic_inc_return(v) == num_online_cpus())
> + panic("panic stop irqoff test");
> +
> + for (;;)
> + cpu_relax();
> +}
> +
> +static void lkdtm_PANIC_STOP_IRQOFF(void)
> +{
> + atomic_t v = ATOMIC_INIT(0);
> +
> + cpus_read_lock();
> + stop_machine(panic_stop_irqoff_fn, &v, cpu_online_mask);
> + cpus_read_unlock();
> +}
> +
> static void lkdtm_BUG(void)
> {
> BUG();
> @@ -598,6 +624,7 @@ static noinline void lkdtm_CORRUPT_PAC(void)
>
> static struct crashtype crashtypes[] = {
> CRASHTYPE(PANIC),
> + CRASHTYPE(PANIC_STOP_IRQOFF),
> CRASHTYPE(BUG),
> CRASHTYPE(WARNING),
> CRASHTYPE(WARNING_MESSAGE),
Modulo the other comments in the thread, this looks good to me. :)
Reviewed-by: Kees Cook <keescook@chromium.org>
--
Kees Cook
^ permalink raw reply [flat|nested] 7+ messages in thread