linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
From: Michael Ellerman <mpe@ellerman.id.au>
To: Hari Bathini <hbathini@linux.ibm.com>, linuxppc-dev@lists.ozlabs.org
Cc: Hari Bathini <hbathini@linux.ibm.com>,
	sourabhjain@linux.ibm.com, mahesh@linux.ibm.com,
	npiggin@gmail.com
Subject: Re: [PATCH] ppc64/fadump: fix inaccurate CPU state info in vmcore generated with panic
Date: Thu, 11 Nov 2021 17:14:03 +1100	[thread overview]
Message-ID: <87lf1vmd78.fsf@mpe.ellerman.id.au> (raw)
In-Reply-To: <20211110190143.186346-1-hbathini@linux.ibm.com>

Hari Bathini <hbathini@linux.ibm.com> writes:
> In panic path, fadump is triggered via a panic notifier function.
> Before calling panic notifier functions, smp_send_stop() gets called,
> which stops all CPUs except the panic'ing CPU. Commit 8389b37dffdc
> ("powerpc: stop_this_cpu: remove the cpu from the online map.") and
> again commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
> started marking CPUs as offline while stopping them. So, if a kernel
> has either of the above commits, vmcore captured with fadump via panic
> path would show only the panic'ing CPU as online. Sample output of
> crash-utility with such vmcore:
>
>   # crash vmlinux vmcore
>   ...
>         KERNEL: vmlinux
>       DUMPFILE: vmcore  [PARTIAL DUMP]
>           CPUS: 1
>           DATE: Wed Nov 10 09:56:34 EST 2021
>         UPTIME: 00:00:42
>   LOAD AVERAGE: 2.27, 0.69, 0.24
>          TASKS: 183
>       NODENAME: XXXXXXXXX
>        RELEASE: 5.15.0+
>        VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
>        MACHINE: ppc64le  (2500 Mhz)
>         MEMORY: 8 GB
>          PANIC: "Kernel panic - not syncing: sysrq triggered crash"
>            PID: 3394
>        COMMAND: "bash"
>           TASK: c0000000150a5f80  [THREAD_INFO: c0000000150a5f80]
>            CPU: 1
>          STATE: TASK_RUNNING (PANIC)
>
>   crash> p -x __cpu_online_mask
>   __cpu_online_mask = $1 = {
>     bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
>   }
>   crash>
>   crash>
>   crash> p -x __cpu_active_mask
>   __cpu_active_mask = $2 = {
>     bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
>   }
>   crash>
>
> While this has been the case since fadump was introduced, the issue
> was not identified for two probable reasons:
>
>   - In general, the bulk of the vmcores analyzed were from crash
>     due to exception.
>
>   - The above did change since commit 8341f2f222d7 ("sysrq: Use
>     panic() to force a crash") started using panic() instead of
>     deferencing NULL pointer to force a kernel crash. But then
>     commit de6e5d38417e ("powerpc: smp_send_stop do not offline
>     stopped CPUs") stopped marking CPUs as offline till kernel
>     commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
>     reverted that change.
>
> To avoid vmcore from showing only one CPU as online in panic path,
> skip marking non panic'ing CPUs as offline while stopping them
> during fadump crash.

Is this really worth the added complexity/bug surface?

Why does it matter if the vmcore shows only one CPU online?


> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index c23ee842c4c3..20555d5d5966 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -61,6 +61,7 @@
>  #include <asm/cpu_has_feature.h>
>  #include <asm/ftrace.h>
>  #include <asm/kup.h>
> +#include <asm/fadump.h>
>  
>  #ifdef DEBUG
>  #include <asm/udbg.h>
> @@ -626,7 +627,8 @@ static void nmi_stop_this_cpu(struct pt_regs *regs)
>  	/*
>  	 * IRQs are already hard disabled by the smp_handle_nmi_ipi.
>  	 */
> -	set_cpu_online(smp_processor_id(), false);
> +	if (!(oops_in_progress && should_fadump_crash()))
> +		set_cpu_online(smp_processor_id(), false);
>  
>  	spin_begin();
>  	while (1)
> @@ -650,7 +652,8 @@ static void stop_this_cpu(void *dummy)
>  	 * to know other CPUs are offline before it breaks locks to flush
>  	 * printk buffers, in case we panic()ed while holding the lock.
>  	 */
> -	set_cpu_online(smp_processor_id(), false);
> +	if (!(oops_in_progress && should_fadump_crash()))
> +		set_cpu_online(smp_processor_id(), false);

The comment talks about printk_safe_flush_on_panic(), and this change
would presumably break that. Except that printk_safe_flush_on_panic() no
longer exists.

So do we need to set the CPU online here at all?

ie. could we revert bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
now that printk_safe_flush_on_panic() no longer exists?

cheers

  reply	other threads:[~2021-11-11  6:14 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-10 19:01 [PATCH] ppc64/fadump: fix inaccurate CPU state info in vmcore generated with panic Hari Bathini
2021-11-11  6:14 ` Michael Ellerman [this message]
2021-11-11 12:06   ` Hari Bathini
2021-11-11 13:20     ` Nicholas Piggin
2021-11-17  6:09       ` Hari Bathini

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87lf1vmd78.fsf@mpe.ellerman.id.au \
    --to=mpe@ellerman.id.au \
    --cc=hbathini@linux.ibm.com \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mahesh@linux.ibm.com \
    --cc=npiggin@gmail.com \
    --cc=sourabhjain@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).