[PATCH] kdump: Fix crash_kexec - smp_send

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic
@ 2011-10-26 14:34 Michael Holzheu
  2011-10-27 17:40 ` Vivek Goyal
  2011-10-28 23:11 ` Andrew Morton
  0 siblings, 2 replies; 13+ messages in thread
From: Michael Holzheu @ 2011-10-26 14:34 UTC (permalink / raw)
  To: akpm
  Cc: Eric W. Biederman, Vivek Goyal, schwidefsky, heiko.carstens,
	kexec, linux-kernel

Hello Andrew,

After the discussion with Eric and Vivek the following patch
seems to be a good solution to me. Could you accept this patch?

When two CPUs call panic at the same time there is a
possible race condition that can stop kdump. The first
CPU calls crash_kexec() and the second CPU calls
smp_send_stop() in panic() before crash_kexec() finished
on the first CPU. So the second CPU stops the first CPU
and therefore kdump fails:

1st CPU:
panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump

2nd CPU:
panic()->crash_kexec()->kexec_mutex already held by 1st CPU
       ->smp_send_stop()-> stop 1st CPU (stop kdump)

This patch fixes the problem by introducing a spinlock in
panic that allows only one CPU to process crash_kexec() and
the subsequent panic code.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 kernel/panic.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -59,6 +59,7 @@ EXPORT_SYMBOL(panic_blink);
  */
 NORET_TYPE void panic(const char * fmt, ...)
 {
+	static DEFINE_SPINLOCK(panic_lock);
 	static char buf[1024];
 	va_list args;
 	long i, i_next = 0;
@@ -82,6 +83,13 @@ NORET_TYPE void panic(const char * fmt,
 #endif
 
 	/*
+	 * Only one CPU is allowed to execute the panic code from here. For
+	 * multiple parallel invocations of panic all other CPUs will wait on
+	 * the panic_lock. They are stopped afterwards by smp_send_stop().
+	 */
+	spin_lock(&panic_lock);
+
+	/*
 	 * If we have crashed and we have a crash kernel loaded let it handle
 	 * everything else.
 	 * Do we want to call this before we try to display a message?



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic
  2011-10-26 14:34 [PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic Michael Holzheu
@ 2011-10-27 17:40 ` Vivek Goyal
  2011-10-28 23:11 ` Andrew Morton
  1 sibling, 0 replies; 13+ messages in thread
From: Vivek Goyal @ 2011-10-27 17:40 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: akpm, Eric W. Biederman, schwidefsky, heiko.carstens, kexec,
	linux-kernel

On Wed, Oct 26, 2011 at 04:34:09PM +0200, Michael Holzheu wrote:
> Hello Andrew,
> 
> After the discussion with Eric and Vivek the following patch
> seems to be a good solution to me. Could you accept this patch?
> 
> When two CPUs call panic at the same time there is a
> possible race condition that can stop kdump. The first
> CPU calls crash_kexec() and the second CPU calls
> smp_send_stop() in panic() before crash_kexec() finished
> on the first CPU. So the second CPU stops the first CPU
> and therefore kdump fails:
> 
> 1st CPU:
> panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump
> 
> 2nd CPU:
> panic()->crash_kexec()->kexec_mutex already held by 1st CPU
>        ->smp_send_stop()-> stop 1st CPU (stop kdump)
> 
> This patch fixes the problem by introducing a spinlock in
> panic that allows only one CPU to process crash_kexec() and
> the subsequent panic code.
> 
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>

Sounds reasonable to me.

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Thanks
Vivek

> ---
>  kernel/panic.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -59,6 +59,7 @@ EXPORT_SYMBOL(panic_blink);
>   */
>  NORET_TYPE void panic(const char * fmt, ...)
>  {
> +	static DEFINE_SPINLOCK(panic_lock);
>  	static char buf[1024];
>  	va_list args;
>  	long i, i_next = 0;
> @@ -82,6 +83,13 @@ NORET_TYPE void panic(const char * fmt,
>  #endif
>  
>  	/*
> +	 * Only one CPU is allowed to execute the panic code from here. For
> +	 * multiple parallel invocations of panic all other CPUs will wait on
> +	 * the panic_lock. They are stopped afterwards by smp_send_stop().
> +	 */
> +	spin_lock(&panic_lock);
> +
> +	/*
>  	 * If we have crashed and we have a crash kernel loaded let it handle
>  	 * everything else.
>  	 * Do we want to call this before we try to display a message?
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic
  2011-10-26 14:34 [PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic Michael Holzheu
  2011-10-27 17:40 ` Vivek Goyal
@ 2011-10-28 23:11 ` Andrew Morton
  2011-10-31  9:57   ` Michael Holzheu
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2011-10-28 23:11 UTC (permalink / raw)
  To: holzheu
  Cc: Eric W. Biederman, Vivek Goyal, schwidefsky, heiko.carstens,
	kexec, linux-kernel

On Wed, 26 Oct 2011 16:34:09 +0200
Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:

> Hello Andrew,
> 
> After the discussion with Eric and Vivek the following patch
> seems to be a good solution to me. Could you accept this patch?
> 
> When two CPUs call panic at the same time there is a
> possible race condition that can stop kdump. The first
> CPU calls crash_kexec() and the second CPU calls
> smp_send_stop() in panic() before crash_kexec() finished
> on the first CPU. So the second CPU stops the first CPU
> and therefore kdump fails:
> 
> 1st CPU:
> panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump
> 
> 2nd CPU:
> panic()->crash_kexec()->kexec_mutex already held by 1st CPU
>        ->smp_send_stop()-> stop 1st CPU (stop kdump)
> 
> This patch fixes the problem by introducing a spinlock in
> panic that allows only one CPU to process crash_kexec() and
> the subsequent panic code.
> 
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> ---
>  kernel/panic.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -59,6 +59,7 @@ EXPORT_SYMBOL(panic_blink);
>   */
>  NORET_TYPE void panic(const char * fmt, ...)
>  {
> +	static DEFINE_SPINLOCK(panic_lock);
>  	static char buf[1024];
>  	va_list args;
>  	long i, i_next = 0;
> @@ -82,6 +83,13 @@ NORET_TYPE void panic(const char * fmt,
>  #endif
>  
>  	/*
> +	 * Only one CPU is allowed to execute the panic code from here. For
> +	 * multiple parallel invocations of panic all other CPUs will wait on
> +	 * the panic_lock. They are stopped afterwards by smp_send_stop().
> +	 */
> +	spin_lock(&panic_lock);
> +

hm.  Boy.  That'll stop 'em OK!

Should this be done earlier in the function?  As it stands we'll have
multiple CPUs scribbling on buf[] at the same time and all trying to
print the same thing at the same time, dumping their stacks, etc. 
Perhaps it would be better to single-thread all that stuff.

Also...  this patch affects all CPU architectures, all configs, etc. 
So we're expecting that every architecture's smp_send_stop() is able to
stop a CPU which is spinning in spin_lock(), possibly with local
interrupts disabled.  Will this work?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic
  2011-10-28 23:11 ` Andrew Morton
@ 2011-10-31  9:57   ` Michael Holzheu
  2011-10-31 10:39     ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Holzheu @ 2011-10-31  9:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: heiko.carstens, kexec, linux-kernel, Eric W. Biederman,
	schwidefsky, Vivek Goyal

Hello Andrew,

On Fri, 2011-10-28 at 16:11 -0700, Andrew Morton wrote:

[snip]

> Should this be done earlier in the function?  As it stands we'll have
> multiple CPUs scribbling on buf[] at the same time and all trying to
> print the same thing at the same time, dumping their stacks, etc. 
> Perhaps it would be better to single-thread all that stuff

My fist patch took the spinlock at the beginning of panic(). But then
Eric asked, if it wouldn't be better to get both panic printk's and I
agreed.

> Also...  this patch affects all CPU architectures, all configs, etc. 
> So we're expecting that every architecture's smp_send_stop() is able to
> stop a CPU which is spinning in spin_lock(), possibly with local
> interrupts disabled.  Will this work?

At least on s390 it will work. If there are architectures that can't
stop disabled CPUs then this problem is already there without this
patch.

Example:

1. 1st CPU gets lock X and panics
2. 2nd CPU is disabled and gets lock X
3. 1st CPU calls smp_send_stop()
   -> 2nd CPU loops disabled and can't be stopped

Michael



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic
  2011-10-31  9:57   ` Michael Holzheu
@ 2011-10-31 10:39     ` Andrew Morton
  2011-10-31 12:34       ` [PATCH v2] " Michael Holzheu
  2011-11-03 10:07       ` [PATCH] " Michael Holzheu
  0 siblings, 2 replies; 13+ messages in thread
From: Andrew Morton @ 2011-10-31 10:39 UTC (permalink / raw)
  To: holzheu
  Cc: heiko.carstens, kexec, linux-kernel, Eric W. Biederman,
	schwidefsky, Vivek Goyal

On Mon, 31 Oct 2011 10:57:16 +0100 Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:

> > Should this be done earlier in the function?  As it stands we'll have
> > multiple CPUs scribbling on buf[] at the same time and all trying to
> > print the same thing at the same time, dumping their stacks, etc. 
> > Perhaps it would be better to single-thread all that stuff
> 
> My fist patch took the spinlock at the beginning of panic(). But then
> Eric asked, if it wouldn't be better to get both panic printk's and I
> agreed.

Hm, why?  It will make a big mess.

> > Also...  this patch affects all CPU architectures, all configs, etc. 
> > So we're expecting that every architecture's smp_send_stop() is able to
> > stop a CPU which is spinning in spin_lock(), possibly with local
> > interrupts disabled.  Will this work?
> 
> At least on s390 it will work. If there are architectures that can't
> stop disabled CPUs then this problem is already there without this
> patch.
> 
> Example:
> 
> 1. 1st CPU gets lock X and panics
> 2. 2nd CPU is disabled and gets lock X

(irq-disabled)

> 3. 1st CPU calls smp_send_stop()
>    -> 2nd CPU loops disabled and can't be stopped

Well OK.  Maybe some architectures do have this problem - who would
notice?  If that is the case, we just made the failure cases much more
common.  Could you check, please?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2] kdump: Fix crash_kexec - smp_send_stop race in panic
  2011-10-31 10:39     ` Andrew Morton
@ 2011-10-31 12:34       ` Michael Holzheu
  2011-11-01 20:04         ` Don Zickus
  2011-11-03 10:07       ` [PATCH] " Michael Holzheu
  1 sibling, 1 reply; 13+ messages in thread
From: Michael Holzheu @ 2011-10-31 12:34 UTC (permalink / raw)
  To: Andrew Morton, linux-arch
  Cc: heiko.carstens, kexec, linux-kernel, Eric W. Biederman,
	schwidefsky, Vivek Goyal

Hello Andrew, hello linux-arch,

On Mon, 2011-10-31 at 03:39 -0700, Andrew Morton wrote:
> On Mon, 31 Oct 2011 10:57:16 +0100 Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:
> 
> > > Should this be done earlier in the function?  As it stands we'll have
> > > multiple CPUs scribbling on buf[] at the same time and all trying to
> > > print the same thing at the same time, dumping their stacks, etc. 
> > > Perhaps it would be better to single-thread all that stuff
> > 
> > My fist patch took the spinlock at the beginning of panic(). But then
> > Eric asked, if it wouldn't be better to get both panic printk's and I
> > agreed.
> 
> Hm, why?  It will make a big mess.

@Andrew:

I thought it would be good to have both messages and it would be good to
change the panic behavior as less as possible...

But ok, I have no problem with getting the lock at the beginning of
panic(). Below, I attached the updated patch.

> > > Also...  this patch affects all CPU architectures, all configs, etc. 
> > > So we're expecting that every architecture's smp_send_stop() is able to
> > > stop a CPU which is spinning in spin_lock(), possibly with local
> > > interrupts disabled.  Will this work?
> > 
> > At least on s390 it will work. If there are architectures that can't
> > stop disabled CPUs then this problem is already there without this
> > patch.
> > 
> > Example:
> > 
> > 1. 1st CPU gets lock X and panics
> > 2. 2nd CPU is disabled and gets lock X
> 
> (irq-disabled)
> 
> > 3. 1st CPU calls smp_send_stop()
> >    -> 2nd CPU loops disabled and can't be stopped
> 
> Well OK.  Maybe some architectures do have this problem - who would
> notice?  If that is the case, we just made the failure cases much more
> common.  Could you check, please?

@linux-arch: 

This patch introduces a spinlock to prevent parallel execution of the
panic code. Andrew pointed out that this might be a problem for
architectures that can't do smp_send_stop() on remote CPUs that have
interrupts disabled. When irq-disabled CPUs execute panic() in parallel,
we then would have looping CPUs.

So please speak up if somebody has a problem with this patch!

Michael
---
From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Subject: kdump: fix crash_kexec()/smp_send_stop() race in panic

When two CPUs call panic at the same time there is a possible race
condition that can stop kdump.  The first CPU calls crash_kexec() and the
second CPU calls smp_send_stop() in panic() before crash_kexec() finished
on the first CPU.  So the second CPU stops the first CPU and therefore
kdump fails:

1st CPU:
panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump

2nd CPU:
panic()->crash_kexec()->kexec_mutex already held by 1st CPU
       ->smp_send_stop()-> stop 1st CPU (stop kdump)

This patch fixes the problem by introducing a spinlock in panic that
allows only one CPU to process crash_kexec() and the subsequent panic
code.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 kernel/panic.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -59,6 +59,7 @@ EXPORT_SYMBOL(panic_blink);
  */
 NORET_TYPE void panic(const char * fmt, ...)
 {
+	static DEFINE_SPINLOCK(panic_lock);
 	static char buf[1024];
 	va_list args;
 	long i, i_next = 0;
@@ -68,8 +69,12 @@ NORET_TYPE void panic(const char * fmt,
 	 * It's possible to come here directly from a panic-assertion and
 	 * not have preempt disabled. Some functions called from here want
 	 * preempt to be disabled. No point enabling it later though...
+	 *
+	 * Only one CPU is allowed to execute the panic code from here. For
+	 * multiple parallel invocations of panic all other CPUs will wait on
+	 * the panic_lock. They are stopped afterwards by smp_send_stop().
 	 */
-	preempt_disable();
+	spin_lock_irq(&panic_lock);
 
 	console_verbose();
 	bust_spinlocks(1);



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] kdump: Fix crash_kexec - smp_send_stop race in panic
  2011-10-31 12:34       ` [PATCH v2] " Michael Holzheu
@ 2011-11-01 20:04         ` Don Zickus
  2011-11-02 10:03           ` Michael Holzheu
  0 siblings, 1 reply; 13+ messages in thread
From: Don Zickus @ 2011-11-01 20:04 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Andrew Morton, linux-arch, heiko.carstens, kexec, linux-kernel,
	Eric W. Biederman, schwidefsky, Vivek Goyal

On Mon, Oct 31, 2011 at 01:34:19PM +0100, Michael Holzheu wrote:
> Hello Andrew, hello linux-arch,
> 
> > Well OK.  Maybe some architectures do have this problem - who would
> > notice?  If that is the case, we just made the failure cases much more
> > common.  Could you check, please?
> 
> @linux-arch: 
> 
> This patch introduces a spinlock to prevent parallel execution of the
> panic code. Andrew pointed out that this might be a problem for
> architectures that can't do smp_send_stop() on remote CPUs that have
> interrupts disabled. When irq-disabled CPUs execute panic() in parallel,
> we then would have looping CPUs.

x86 has such problem and I posted a patch recently to fix it

https://lkml.org/lkml/2011/10/13/426

Cheers,
Don

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] kdump: Fix crash_kexec - smp_send_stop race in panic
  2011-11-01 20:04         ` Don Zickus
@ 2011-11-02 10:03           ` Michael Holzheu
  2011-11-02 20:57             ` Luck, Tony
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Holzheu @ 2011-11-02 10:03 UTC (permalink / raw)
  To: Don Zickus
  Cc: Andrew Morton, linux-arch, heiko.carstens, kexec, linux-kernel,
	Eric W. Biederman, schwidefsky, Vivek Goyal

On Tue, 2011-11-01 at 16:04 -0400, Don Zickus wrote:
> On Mon, Oct 31, 2011 at 01:34:19PM +0100, Michael Holzheu wrote:
> > Hello Andrew, hello linux-arch,
> > 
> > > Well OK.  Maybe some architectures do have this problem - who would
> > > notice?  If that is the case, we just made the failure cases much more
> > > common.  Could you check, please?
> > 
> > @linux-arch: 
> > 
> > This patch introduces a spinlock to prevent parallel execution of the
> > panic code. Andrew pointed out that this might be a problem for
> > architectures that can't do smp_send_stop() on remote CPUs that have
> > interrupts disabled. When irq-disabled CPUs execute panic() in parallel,
> > we then would have looping CPUs.
> 
> x86 has such problem and I posted a patch recently to fix it
> 
> https://lkml.org/lkml/2011/10/13/426

Ok good, so with this patch x86 has no problem with the panic spinlock.
Anybody else?

Instead of introducing the panic lock, as an alternative we could move
smp_send_stop() to the beginning of panic(). Eric told me that the
function is currently "insufficiently reliable" for that, but perhaps we
could make it more reliable.

Michael




^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH v2] kdump: Fix crash_kexec - smp_send_stop race in panic
  2011-11-02 10:03           ` Michael Holzheu
@ 2011-11-02 20:57             ` Luck, Tony
  0 siblings, 0 replies; 13+ messages in thread
From: Luck, Tony @ 2011-11-02 20:57 UTC (permalink / raw)
  To: holzheu@linux.vnet.ibm.com, Don Zickus
  Cc: Andrew Morton, linux-arch@vger.kernel.org,
	heiko.carstens@de.ibm.com, kexec@lists.infradead.org,
	linux-kernel@vger.kernel.org, Eric W. Biederman,
	schwidefsky@de.ibm.com, Vivek Goyal

> Instead of introducing the panic lock, as an alternative we could move
> smp_send_stop() to the beginning of panic(). Eric told me that the
> function is currently "insufficiently reliable" for that, but perhaps we
> could make it more reliable.

That's tough to do.  We are in panic because something went horribly
wrong somewhere in the kernel - so we can make few assumptions about
which subsystems are still working. In the worst case (for this example)
our panic was caused by a failure in the code that sends cross-processor
interrupts ... so calling that same code to stop the other cpus is
likely to run into the same problem - perhaps causing a nested panic.

So what looks like a good fix for some panic scenarios actually makes
others worse.

-Tony

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic
  2011-10-31 10:39     ` Andrew Morton
  2011-10-31 12:34       ` [PATCH v2] " Michael Holzheu
@ 2011-11-03 10:07       ` Michael Holzheu
  2011-11-10  0:04         ` Andrew Morton
  1 sibling, 1 reply; 13+ messages in thread
From: Michael Holzheu @ 2011-11-03 10:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: heiko.carstens, kexec, linux-kernel, Eric W. Biederman,
	schwidefsky, Vivek Goyal, Don Zickus, Luck, Tony, linux-arch

Hello Andrew,

On Mon, 2011-10-31 at 03:39 -0700, Andrew Morton wrote:
> On Mon, 31 Oct 2011 10:57:16 +0100 Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:
> 
> > > Should this be done earlier in the function?  As it stands we'll have
> > > multiple CPUs scribbling on buf[] at the same time and all trying to
> > > print the same thing at the same time, dumping their stacks, etc. 
> > > Perhaps it would be better to single-thread all that stuff
> > 
> > My fist patch took the spinlock at the beginning of panic(). But then
> > Eric asked, if it wouldn't be better to get both panic printk's and I
> > agreed.
> 
> Hm, why?  It will make a big mess.
> 
> > > Also...  this patch affects all CPU architectures, all configs, etc. 
> > > So we're expecting that every architecture's smp_send_stop() is able to
> > > stop a CPU which is spinning in spin_lock(), possibly with local
> > > interrupts disabled.  Will this work?
> > 
> > At least on s390 it will work. If there are architectures that can't
> > stop disabled CPUs then this problem is already there without this
> > patch.
> > 
> > Example:
> > 
> > 1. 1st CPU gets lock X and panics
> > 2. 2nd CPU is disabled and gets lock X
> 
> (irq-disabled)
> 
> > 3. 1st CPU calls smp_send_stop()
> >    -> 2nd CPU loops disabled and can't be stopped
> 
> Well OK.  Maybe some architectures do have this problem - who would
> notice?  If that is the case, we just made the failure cases much more
> common.

Ok, next idea: What, if the CPUs wait irq-enabled in panic until they
get stopped by smp_send_stop()?

See patch below:
---
From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Subject: kdump: fix crash_kexec()/smp_send_stop() race in panic

When two CPUs call panic at the same time there is a possible race
condition that can stop kdump.  The first CPU calls crash_kexec() and the
second CPU calls smp_send_stop() in panic() before crash_kexec() finished
on the first CPU.  So the second CPU stops the first CPU and therefore
kdump fails:

1st CPU:
panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump

2nd CPU:
panic()->crash_kexec()->kexec_mutex already held by 1st CPU
       ->smp_send_stop()-> stop 1st CPU (stop kdump)

This patch fixes the problem by introducing a spinlock in panic that
allows only one CPU to process crash_kexec() and the subsequent panic
code.

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
---
 kernel/panic.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -59,6 +59,7 @@ EXPORT_SYMBOL(panic_blink);
  */
 NORET_TYPE void panic(const char * fmt, ...)
 {
+	static DEFINE_SPINLOCK(panic_lock);
 	static char buf[1024];
 	va_list args;
 	long i, i_next = 0;
@@ -68,8 +69,16 @@ NORET_TYPE void panic(const char * fmt,
 	 * It's possible to come here directly from a panic-assertion and
 	 * not have preempt disabled. Some functions called from here want
 	 * preempt to be disabled. No point enabling it later though...
+	 *
+	 * Only one CPU is allowed to execute the panic code from here. For
+	 * multiple parallel invocations of panic, all other CPUs will wait
+	 * until they are stopped by the 1st CPU with smp_send_stop().
 	 */
-	preempt_disable();
+	if (!spin_trylock(&panic_lock)) {
+		local_irq_enable();
+		while (1)
+			cpu_relax();
+	}
 
 	console_verbose();
 	bust_spinlocks(1);





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic
  2011-11-03 10:07       ` [PATCH] " Michael Holzheu
@ 2011-11-10  0:04         ` Andrew Morton
  2011-11-10 14:17           ` Américo Wang
  2011-11-10 15:31           ` James Bottomley
  0 siblings, 2 replies; 13+ messages in thread
From: Andrew Morton @ 2011-11-10  0:04 UTC (permalink / raw)
  To: holzheu
  Cc: heiko.carstens, kexec, linux-kernel, Eric W. Biederman,
	schwidefsky, Vivek Goyal, Don Zickus, Luck, Tony, linux-arch

On Thu, 03 Nov 2011 11:07:24 +0100
Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:

> Hello Andrew,
> 
> On Mon, 2011-10-31 at 03:39 -0700, Andrew Morton wrote:
> > On Mon, 31 Oct 2011 10:57:16 +0100 Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:
> > 
> > > > Should this be done earlier in the function?  As it stands we'll have
> > > > multiple CPUs scribbling on buf[] at the same time and all trying to
> > > > print the same thing at the same time, dumping their stacks, etc. 
> > > > Perhaps it would be better to single-thread all that stuff
> > > 
> > > My fist patch took the spinlock at the beginning of panic(). But then
> > > Eric asked, if it wouldn't be better to get both panic printk's and I
> > > agreed.
> > 
> > Hm, why?  It will make a big mess.

This, please?

> > > > Also...  this patch affects all CPU architectures, all configs, etc. 
> > > > So we're expecting that every architecture's smp_send_stop() is able to
> > > > stop a CPU which is spinning in spin_lock(), possibly with local
> > > > interrupts disabled.  Will this work?
> > > 
> > > At least on s390 it will work. If there are architectures that can't
> > > stop disabled CPUs then this problem is already there without this
> > > patch.
> > > 
> > > Example:
> > > 
> > > 1. 1st CPU gets lock X and panics
> > > 2. 2nd CPU is disabled and gets lock X
> > 
> > (irq-disabled)
> > 
> > > 3. 1st CPU calls smp_send_stop()
> > >    -> 2nd CPU loops disabled and can't be stopped
> > 
> > Well OK.  Maybe some architectures do have this problem - who would
> > notice?  If that is the case, we just made the failure cases much more
> > common.
> 
> Ok, next idea: What, if the CPUs wait irq-enabled in panic until they
> get stopped by smp_send_stop()?
> 
> See patch below:
> ---
> From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> Subject: kdump: fix crash_kexec()/smp_send_stop() race in panic
> 
> When two CPUs call panic at the same time there is a possible race
> condition that can stop kdump.  The first CPU calls crash_kexec() and the
> second CPU calls smp_send_stop() in panic() before crash_kexec() finished
> on the first CPU.  So the second CPU stops the first CPU and therefore
> kdump fails:
> 
> 1st CPU:
> panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump
> 
> 2nd CPU:
> panic()->crash_kexec()->kexec_mutex already held by 1st CPU
>        ->smp_send_stop()-> stop 1st CPU (stop kdump)
> 
> This patch fixes the problem by introducing a spinlock in panic that
> allows only one CPU to process crash_kexec() and the subsequent panic
> code.
> 
> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> ---
>  kernel/panic.c |   11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -59,6 +59,7 @@ EXPORT_SYMBOL(panic_blink);
>   */
>  NORET_TYPE void panic(const char * fmt, ...)
>  {
> +	static DEFINE_SPINLOCK(panic_lock);
>  	static char buf[1024];
>  	va_list args;
>  	long i, i_next = 0;
> @@ -68,8 +69,16 @@ NORET_TYPE void panic(const char * fmt,
>  	 * It's possible to come here directly from a panic-assertion and
>  	 * not have preempt disabled. Some functions called from here want
>  	 * preempt to be disabled. No point enabling it later though...
> +	 *
> +	 * Only one CPU is allowed to execute the panic code from here. For
> +	 * multiple parallel invocations of panic, all other CPUs will wait
> +	 * until they are stopped by the 1st CPU with smp_send_stop().
>  	 */
> -	preempt_disable();
> +	if (!spin_trylock(&panic_lock)) {
> +		local_irq_enable();
> +		while (1)
> +			cpu_relax();
> +	}

Looks worse ;) Unconditionally enabling interrupts in a panic situation
could cause all sorts of havoc, with other interrupts being taken or
same interrupts being retaken, etc.

Ho hum, I guess we stick with the original patch.  It *should* work, as
long as all archtectures are doing the expected thing.  But in this
situation it is bad of us to just hope that the architectures are doing
this.  We should go and find out, rather than waiting for bug reports
to come in.  Especially because in this case, bugs will take a very
long time indeed to even be noticed.

One way to resolve this would be to ask the various arch maintainers!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic
  2011-11-10  0:04         ` Andrew Morton
@ 2011-11-10 14:17           ` Américo Wang
  2011-11-10 15:31           ` James Bottomley
  1 sibling, 0 replies; 13+ messages in thread
From: Américo Wang @ 2011-11-10 14:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: holzheu, heiko.carstens, kexec, linux-kernel, Eric W. Biederman,
	schwidefsky, Vivek Goyal, Don Zickus, Luck, Tony, linux-arch

On Thu, Nov 10, 2011 at 8:04 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> Ho hum, I guess we stick with the original patch.  It *should* work, as
> long as all archtectures are doing the expected thing.  But in this
> situation it is bad of us to just hope that the architectures are doing
> this.  We should go and find out, rather than waiting for bug reports
> to come in.  Especially because in this case, bugs will take a very
> long time indeed to even be noticed.
>
> One way to resolve this would be to ask the various arch maintainers!

At least we can add the spin_lock just for x86 and s390, and leave
a big comment saying we are waiting for the other arch to be fixed...

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic
  2011-11-10  0:04         ` Andrew Morton
  2011-11-10 14:17           ` Américo Wang
@ 2011-11-10 15:31           ` James Bottomley
  1 sibling, 0 replies; 13+ messages in thread
From: James Bottomley @ 2011-11-10 15:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: holzheu, heiko.carstens, kexec, linux-kernel, Eric W. Biederman,
	schwidefsky, Vivek Goyal, Don Zickus, Luck, Tony, linux-arch

On Wed, 2011-11-09 at 16:04 -0800, Andrew Morton wrote:
> On Thu, 03 Nov 2011 11:07:24 +0100
> Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:
> 
> > Hello Andrew,
> > 
> > On Mon, 2011-10-31 at 03:39 -0700, Andrew Morton wrote:
> > > On Mon, 31 Oct 2011 10:57:16 +0100 Michael Holzheu <holzheu@linux.vnet.ibm.com> wrote:
> > > 
> > > > > Should this be done earlier in the function?  As it stands we'll have
> > > > > multiple CPUs scribbling on buf[] at the same time and all trying to
> > > > > print the same thing at the same time, dumping their stacks, etc. 
> > > > > Perhaps it would be better to single-thread all that stuff
> > > > 
> > > > My fist patch took the spinlock at the beginning of panic(). But then
> > > > Eric asked, if it wouldn't be better to get both panic printk's and I
> > > > agreed.
> > > 
> > > Hm, why?  It will make a big mess.
> 
> This, please?
> 
> > > > > Also...  this patch affects all CPU architectures, all configs, etc. 
> > > > > So we're expecting that every architecture's smp_send_stop() is able to
> > > > > stop a CPU which is spinning in spin_lock(), possibly with local
> > > > > interrupts disabled.  Will this work?
> > > > 
> > > > At least on s390 it will work. If there are architectures that can't
> > > > stop disabled CPUs then this problem is already there without this
> > > > patch.
> > > > 
> > > > Example:
> > > > 
> > > > 1. 1st CPU gets lock X and panics
> > > > 2. 2nd CPU is disabled and gets lock X
> > > 
> > > (irq-disabled)
> > > 
> > > > 3. 1st CPU calls smp_send_stop()
> > > >    -> 2nd CPU loops disabled and can't be stopped
> > > 
> > > Well OK.  Maybe some architectures do have this problem - who would
> > > notice?  If that is the case, we just made the failure cases much more
> > > common.
> > 
> > Ok, next idea: What, if the CPUs wait irq-enabled in panic until they
> > get stopped by smp_send_stop()?
> > 
> > See patch below:
> > ---
> > From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> > Subject: kdump: fix crash_kexec()/smp_send_stop() race in panic
> > 
> > When two CPUs call panic at the same time there is a possible race
> > condition that can stop kdump.  The first CPU calls crash_kexec() and the
> > second CPU calls smp_send_stop() in panic() before crash_kexec() finished
> > on the first CPU.  So the second CPU stops the first CPU and therefore
> > kdump fails:
> > 
> > 1st CPU:
> > panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump
> > 
> > 2nd CPU:
> > panic()->crash_kexec()->kexec_mutex already held by 1st CPU
> >        ->smp_send_stop()-> stop 1st CPU (stop kdump)
> > 
> > This patch fixes the problem by introducing a spinlock in panic that
> > allows only one CPU to process crash_kexec() and the subsequent panic
> > code.
> > 
> > Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
> > ---
> >  kernel/panic.c |   11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > --- a/kernel/panic.c
> > +++ b/kernel/panic.c
> > @@ -59,6 +59,7 @@ EXPORT_SYMBOL(panic_blink);
> >   */
> >  NORET_TYPE void panic(const char * fmt, ...)
> >  {
> > +	static DEFINE_SPINLOCK(panic_lock);
> >  	static char buf[1024];
> >  	va_list args;
> >  	long i, i_next = 0;
> > @@ -68,8 +69,16 @@ NORET_TYPE void panic(const char * fmt,
> >  	 * It's possible to come here directly from a panic-assertion and
> >  	 * not have preempt disabled. Some functions called from here want
> >  	 * preempt to be disabled. No point enabling it later though...
> > +	 *
> > +	 * Only one CPU is allowed to execute the panic code from here. For
> > +	 * multiple parallel invocations of panic, all other CPUs will wait
> > +	 * until they are stopped by the 1st CPU with smp_send_stop().
> >  	 */
> > -	preempt_disable();
> > +	if (!spin_trylock(&panic_lock)) {
> > +		local_irq_enable();
> > +		while (1)
> > +			cpu_relax();
> > +	}
> 
> Looks worse ;) Unconditionally enabling interrupts in a panic situation
> could cause all sorts of havoc, with other interrupts being taken or
> same interrupts being retaken, etc.
> 
> Ho hum, I guess we stick with the original patch. 

By original patch you mean the smp_send_stop() at the beginning of the
panic one (which isn't on linux-arch)?

>  It *should* work, as
> long as all archtectures are doing the expected thing.  But in this
> situation it is bad of us to just hope that the architectures are doing
> this.  We should go and find out, rather than waiting for bug reports
> to come in.  Especially because in this case, bugs will take a very
> long time indeed to even be noticed.
> 
> One way to resolve this would be to ask the various arch maintainers!

You're assuming we actually know.  On parisc, the IPI_STOP_CPU is
implemented, it's just it's not something we ever use (not even in the
shutdown path), so while I can tell you it *should* work (the code in
the IPI handler looks sane) ... we've never tested it.

James



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-11-10 15:31 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-10-26 14:34 [PATCH] kdump: Fix crash_kexec - smp_send_stop race in panic Michael Holzheu
2011-10-27 17:40 ` Vivek Goyal
2011-10-28 23:11 ` Andrew Morton
2011-10-31  9:57   ` Michael Holzheu
2011-10-31 10:39     ` Andrew Morton
2011-10-31 12:34       ` [PATCH v2] " Michael Holzheu
2011-11-01 20:04         ` Don Zickus
2011-11-02 10:03           ` Michael Holzheu
2011-11-02 20:57             ` Luck, Tony
2011-11-03 10:07       ` [PATCH] " Michael Holzheu
2011-11-10  0:04         ` Andrew Morton
2011-11-10 14:17           ` Américo Wang
2011-11-10 15:31           ` James Bottomley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).