Linux-HyperV List
 help / color / mirror / Atom feed
* [PATCH v2] panic: call panic handlers before panic_other_cpus_shutdown()
@ 2025-02-21 21:30 Hamza Mahfooz
  2025-02-21 23:01 ` Michael Kelley
  0 siblings, 1 reply; 5+ messages in thread
From: Hamza Mahfooz @ 2025-02-21 21:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dexuan Cui, Wei Liu, linux-hyperv, Haiyang Zhang, Hamza Mahfooz,
	Petr Mladek, Andrew Morton, Greg Kroah-Hartman, John Ogness,
	Jani Nikula, Baoquan He, Thomas Gleixner, Ryo Takakura

Since, the panic handlers may require certain cpus to be online to panic
gracefully, we should call them before turning off SMP. Without this
re-ordering, on Hyper-V hv_panic_vmbus_unload() times out, because the
vmbus channel is bound to VMBUS_CONNECT_CPU and unless the crashing cpu
is the same as VMBUS_CONNECT_CPU, VMBUS_CONNECT_CPU will be offlined by
crash_smp_send_stop() before the vmbus channel can be deconstructed.

Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
---
v2: keep printk_legacy_allow_panic_sync() after
    panic_other_cpus_shutdown().
---
 kernel/panic.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/panic.c b/kernel/panic.c
index fbc59b3b64d0..433cf651e213 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -372,16 +372,16 @@ void panic(const char *fmt, ...)
 	if (!_crash_kexec_post_notifiers)
 		__crash_kexec(NULL);
 
-	panic_other_cpus_shutdown(_crash_kexec_post_notifiers);
-
-	printk_legacy_allow_panic_sync();
-
 	/*
 	 * Run any panic handlers, including those that might need to
 	 * add information to the kmsg dump output.
 	 */
 	atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
 
+	panic_other_cpus_shutdown(_crash_kexec_post_notifiers);
+
+	printk_legacy_allow_panic_sync();
+
 	panic_print_sys_info(false);
 
 	kmsg_dump_desc(KMSG_DUMP_PANIC, buf);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* RE: [PATCH v2] panic: call panic handlers before panic_other_cpus_shutdown()
  2025-02-21 21:30 [PATCH v2] panic: call panic handlers before panic_other_cpus_shutdown() Hamza Mahfooz
@ 2025-02-21 23:01 ` Michael Kelley
  2025-02-24 14:48   ` Hamza Mahfooz
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Kelley @ 2025-02-21 23:01 UTC (permalink / raw)
  To: Hamza Mahfooz, linux-kernel@vger.kernel.org
  Cc: Dexuan Cui, Wei Liu, linux-hyperv@vger.kernel.org, Haiyang Zhang,
	Petr Mladek, Andrew Morton, Greg Kroah-Hartman, John Ogness,
	Jani Nikula, Baoquan He, Thomas Gleixner, Ryo Takakura

From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Sent: Friday, February 21, 2025 1:31 PM
> 
> Since, the panic handlers may require certain cpus to be online to panic
> gracefully, we should call them before turning off SMP. Without this
> re-ordering, on Hyper-V hv_panic_vmbus_unload() times out, because the
> vmbus channel is bound to VMBUS_CONNECT_CPU and unless the crashing cpu
> is the same as VMBUS_CONNECT_CPU, VMBUS_CONNECT_CPU will be offlined by
> crash_smp_send_stop() before the vmbus channel can be deconstructed.

Hamza -- what specifically is the problem with the way vmbus_wait_for_unload()
works today? That code is aware of the problem that the unload response comes
only on the VMBUS_CONNECT_CPU, and that cpu may not be able to handle
the interrupt. So the code polls the message page of each CPU to try to get the
unload response message. Is there a scenario where that approach isn't working?

Note also that Hyper-V itself can take a long time (10's of seconds) to respond
to the unload request. See the comments in vmbus_wait_for_unload() about
flushing the Azure host disk cache. I worked on this code and did the
measurements, so I have some familiarity with the problems. :-)

Michael

> 
> Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> ---
> v2: keep printk_legacy_allow_panic_sync() after
>     panic_other_cpus_shutdown().
> ---
>  kernel/panic.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/panic.c b/kernel/panic.c
> index fbc59b3b64d0..433cf651e213 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -372,16 +372,16 @@ void panic(const char *fmt, ...)
>  	if (!_crash_kexec_post_notifiers)
>  		__crash_kexec(NULL);
> 
> -	panic_other_cpus_shutdown(_crash_kexec_post_notifiers);
> -
> -	printk_legacy_allow_panic_sync();
> -
>  	/*
>  	 * Run any panic handlers, including those that might need to
>  	 * add information to the kmsg dump output.
>  	 */
>  	atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
> 
> +	panic_other_cpus_shutdown(_crash_kexec_post_notifiers);
> +
> +	printk_legacy_allow_panic_sync();
> +
>  	panic_print_sys_info(false);
> 
>  	kmsg_dump_desc(KMSG_DUMP_PANIC, buf);
> --
> 2.47.1
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] panic: call panic handlers before panic_other_cpus_shutdown()
  2025-02-21 23:01 ` Michael Kelley
@ 2025-02-24 14:48   ` Hamza Mahfooz
  2025-02-24 19:59     ` Michael Kelley
  0 siblings, 1 reply; 5+ messages in thread
From: Hamza Mahfooz @ 2025-02-24 14:48 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-kernel@vger.kernel.org, Dexuan Cui, Wei Liu,
	linux-hyperv@vger.kernel.org, Haiyang Zhang, Petr Mladek,
	Andrew Morton, Greg Kroah-Hartman, John Ogness, Jani Nikula,
	Baoquan He, Thomas Gleixner, Ryo Takakura

On Fri, Feb 21, 2025 at 11:01:09PM +0000, Michael Kelley wrote:
> From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Sent: Friday, February 21, 2025 1:31 PM
> > 
> > Since, the panic handlers may require certain cpus to be online to panic
> > gracefully, we should call them before turning off SMP. Without this
> > re-ordering, on Hyper-V hv_panic_vmbus_unload() times out, because the
> > vmbus channel is bound to VMBUS_CONNECT_CPU and unless the crashing cpu
> > is the same as VMBUS_CONNECT_CPU, VMBUS_CONNECT_CPU will be offlined by
> > crash_smp_send_stop() before the vmbus channel can be deconstructed.
> 
> Hamza -- what specifically is the problem with the way vmbus_wait_for_unload()
> works today? That code is aware of the problem that the unload response comes
> only on the VMBUS_CONNECT_CPU, and that cpu may not be able to handle
> the interrupt. So the code polls the message page of each CPU to try to get the
> unload response message. Is there a scenario where that approach isn't working?
> 

It doesn't work on arm64 (if the crashing cpu isn't VMBUS_CONNECT_CPU), it
always ends up at "VMBus UNLOAD did not complete" without fail. It seems
like arm64's crash_smp_send_stop() is more aggressive than x86's.

> Note also that Hyper-V itself can take a long time (10's of seconds) to respond
> to the unload request. See the comments in vmbus_wait_for_unload() about
> flushing the Azure host disk cache. I worked on this code and did the
> measurements, so I have some familiarity with the problems. :-)
> 
> Michael
> 
> > 
> > Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
> > ---
> > v2: keep printk_legacy_allow_panic_sync() after
> >     panic_other_cpus_shutdown().
> > ---
> >  kernel/panic.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/panic.c b/kernel/panic.c
> > index fbc59b3b64d0..433cf651e213 100644
> > --- a/kernel/panic.c
> > +++ b/kernel/panic.c
> > @@ -372,16 +372,16 @@ void panic(const char *fmt, ...)
> >  	if (!_crash_kexec_post_notifiers)
> >  		__crash_kexec(NULL);
> > 
> > -	panic_other_cpus_shutdown(_crash_kexec_post_notifiers);
> > -
> > -	printk_legacy_allow_panic_sync();
> > -
> >  	/*
> >  	 * Run any panic handlers, including those that might need to
> >  	 * add information to the kmsg dump output.
> >  	 */
> >  	atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
> > 
> > +	panic_other_cpus_shutdown(_crash_kexec_post_notifiers);
> > +
> > +	printk_legacy_allow_panic_sync();
> > +
> >  	panic_print_sys_info(false);
> > 
> >  	kmsg_dump_desc(KMSG_DUMP_PANIC, buf);
> > --
> > 2.47.1
> > 
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH v2] panic: call panic handlers before panic_other_cpus_shutdown()
  2025-02-24 14:48   ` Hamza Mahfooz
@ 2025-02-24 19:59     ` Michael Kelley
  2025-02-24 20:57       ` Hamza Mahfooz
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Kelley @ 2025-02-24 19:59 UTC (permalink / raw)
  To: Hamza Mahfooz
  Cc: linux-kernel@vger.kernel.org, Dexuan Cui, Wei Liu,
	linux-hyperv@vger.kernel.org, Haiyang Zhang, Petr Mladek,
	Andrew Morton, Greg Kroah-Hartman, John Ogness, Jani Nikula,
	Baoquan He, Thomas Gleixner, Ryo Takakura

From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Sent: Monday, February 24, 2025 6:49 AM
> 
> On Fri, Feb 21, 2025 at 11:01:09PM +0000, Michael Kelley wrote:
> > From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Sent: Friday, February
> 21, 2025 1:31 PM
> > >
> > > Since, the panic handlers may require certain cpus to be online to panic
> > > gracefully, we should call them before turning off SMP. Without this
> > > re-ordering, on Hyper-V hv_panic_vmbus_unload() times out, because the
> > > vmbus channel is bound to VMBUS_CONNECT_CPU and unless the crashing cpu
> > > is the same as VMBUS_CONNECT_CPU, VMBUS_CONNECT_CPU will be offlined by
> > > crash_smp_send_stop() before the vmbus channel can be deconstructed.
> >
> > Hamza -- what specifically is the problem with the way vmbus_wait_for_unload()
> > works today? That code is aware of the problem that the unload response comes
> > only on the VMBUS_CONNECT_CPU, and that cpu may not be able to handle
> > the interrupt. So the code polls the message page of each CPU to try to get the
> > unload response message. Is there a scenario where that approach isn't working?
> >
> 
> It doesn't work on arm64 (if the crashing cpu isn't VMBUS_CONNECT_CPU), it
> always ends up at "VMBus UNLOAD did not complete" without fail. It seems
> like arm64's crash_smp_send_stop() is more aggressive than x86's.

FWIW, I tested on a D16plds_v6 arm64 VM in Azure, running Ubuntu 20.04 with
a linux-next20252021 kernel. I caused a panic using "echo c >/proc/sysrq-trigger"
using "taskset" to make sure the panic is triggered on a CPU other than CPU 0.
I didn't see any problem. The panic code path completely quickly, and there were
no messages from vmbus_wait_for_unload(), including none of the periodic
"Waiting for unload" messages . I tried initiating the panic on several different
CPUs (4, 7, and 15) with the same result. I tested with kdump disabled and with
kdump enabled, both with no problems.

So I think the current vmbus_wait_for_unload() code works on arm64, as least
in some ordinary scenarios. Any key differences in the configuration or test
environment when you see the "did not complete" message?

Michael

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] panic: call panic handlers before panic_other_cpus_shutdown()
  2025-02-24 19:59     ` Michael Kelley
@ 2025-02-24 20:57       ` Hamza Mahfooz
  0 siblings, 0 replies; 5+ messages in thread
From: Hamza Mahfooz @ 2025-02-24 20:57 UTC (permalink / raw)
  To: Michael Kelley
  Cc: linux-kernel@vger.kernel.org, Dexuan Cui, Wei Liu,
	linux-hyperv@vger.kernel.org, Haiyang Zhang, Petr Mladek,
	Andrew Morton, Greg Kroah-Hartman, John Ogness, Jani Nikula,
	Baoquan He, Thomas Gleixner, Ryo Takakura

On Mon, Feb 24, 2025 at 07:59:28PM +0000, Michael Kelley wrote:
> From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Sent: Monday, February 24, 2025 6:49 AM
> > 
> > On Fri, Feb 21, 2025 at 11:01:09PM +0000, Michael Kelley wrote:
> > > From: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Sent: Friday, February
> > 21, 2025 1:31 PM
> > > >
> > > > Since, the panic handlers may require certain cpus to be online to panic
> > > > gracefully, we should call them before turning off SMP. Without this
> > > > re-ordering, on Hyper-V hv_panic_vmbus_unload() times out, because the
> > > > vmbus channel is bound to VMBUS_CONNECT_CPU and unless the crashing cpu
> > > > is the same as VMBUS_CONNECT_CPU, VMBUS_CONNECT_CPU will be offlined by
> > > > crash_smp_send_stop() before the vmbus channel can be deconstructed.
> > >
> > > Hamza -- what specifically is the problem with the way vmbus_wait_for_unload()
> > > works today? That code is aware of the problem that the unload response comes
> > > only on the VMBUS_CONNECT_CPU, and that cpu may not be able to handle
> > > the interrupt. So the code polls the message page of each CPU to try to get the
> > > unload response message. Is there a scenario where that approach isn't working?
> > >
> > 
> > It doesn't work on arm64 (if the crashing cpu isn't VMBUS_CONNECT_CPU), it
> > always ends up at "VMBus UNLOAD did not complete" without fail. It seems
> > like arm64's crash_smp_send_stop() is more aggressive than x86's.
> 
> FWIW, I tested on a D16plds_v6 arm64 VM in Azure, running Ubuntu 20.04 with
> a linux-next20252021 kernel. I caused a panic using "echo c >/proc/sysrq-trigger"
> using "taskset" to make sure the panic is triggered on a CPU other than CPU 0.
> I didn't see any problem. The panic code path completely quickly, and there were
> no messages from vmbus_wait_for_unload(), including none of the periodic
> "Waiting for unload" messages . I tried initiating the panic on several different
> CPUs (4, 7, and 15) with the same result. I tested with kdump disabled and with
> kdump enabled, both with no problems.
> 
> So I think the current vmbus_wait_for_unload() code works on arm64, as least
> in some ordinary scenarios. Any key differences in the configuration or test
> environment when you see the "did not complete" message?

Can you try on a Standard_D16pls_v5 with the stock ubuntu image and
kernel crash dump (i.e. linux-crashdump) installed and setup?

> 
> Michael

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-02-24 20:57 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-21 21:30 [PATCH v2] panic: call panic handlers before panic_other_cpus_shutdown() Hamza Mahfooz
2025-02-21 23:01 ` Michael Kelley
2025-02-24 14:48   ` Hamza Mahfooz
2025-02-24 19:59     ` Michael Kelley
2025-02-24 20:57       ` Hamza Mahfooz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox