linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] panic: release stale console lock to always get the logbuf printed out
@ 2015-10-07 17:02 Vitaly Kuznetsov
  2015-10-07 22:34 ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-10-07 17:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: HATAYAMA Daisuke, Masami Hiramatsu, Jiri Kosina, Baoquan He,
	Prarit Bhargava, Xie XiuQi, Seth Jennings, linux-kernel,
	K. Y. Srinivasan

In some cases we may end up killing the CPU holding the console lock
while still having valuable data in logbuf. E.g. I'm observing the
following:
- A crash is happening on one CPU and console_unlock() is being called on
  some other.
- console_unlock() tries to print out the buffer before releasing the lock
  and on slow console it takes time.
- in the meanwhile crashing CPU does lots of printk()-s with valuable data
  (which go to the logbuf) and sends IPIs to all other CPUs.
- console_unlock() finishes printing previous chunk and enables interrupts
  before trying to print out the rest, the CPU catches the IPI and never
  releases console lock.
This is not the only possible case: in VT/fb subsystems we have many other
console_lock()/console_unlock() users. Non-masked interrupts (or receiving
NMI in case of extreme slowness) will have the same result. Getting the
whole console buffer printed out on crash should be top priority.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 kernel/panic.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/panic.c b/kernel/panic.c
index 04e91ff..f94525f 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -23,6 +23,7 @@
 #include <linux/sysrq.h>
 #include <linux/init.h>
 #include <linux/nmi.h>
+#include <linux/console.h>
 
 #define PANIC_TIMER_STEP 100
 #define PANIC_BLINK_SPD 18
@@ -147,6 +148,15 @@ void panic(const char *fmt, ...)
 
 	bust_spinlocks(0);
 
+	/*
+	 * We may have ended up killing the CPU holding the lock and still have
+	 * some valuable data in console buffer. Try to acquire the lock and
+	 * release it regardless of the result. The release will also print the
+	 * buffers out.
+	 */
+	console_trylock();
+	console_unlock();
+
 	if (!panic_blink)
 		panic_blink = no_blink;
 
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] panic: release stale console lock to always get the logbuf printed out
  2015-10-07 17:02 [PATCH] panic: release stale console lock to always get the logbuf printed out Vitaly Kuznetsov
@ 2015-10-07 22:34 ` Andrew Morton
  2015-10-08  9:01   ` Jan Kara
  2015-10-08  9:51   ` Vitaly Kuznetsov
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2015-10-07 22:34 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: HATAYAMA Daisuke, Masami Hiramatsu, Jiri Kosina, Baoquan He,
	Prarit Bhargava, Xie XiuQi, Seth Jennings, linux-kernel,
	K. Y. Srinivasan, Jan Kara

(cc Jan)

On Wed,  7 Oct 2015 19:02:22 +0200 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:

> In some cases we may end up killing the CPU holding the console lock
> while still having valuable data in logbuf. E.g. I'm observing the
> following:
> - A crash is happening on one CPU and console_unlock() is being called on
>   some other.
> - console_unlock() tries to print out the buffer before releasing the lock
>   and on slow console it takes time.
> - in the meanwhile crashing CPU does lots of printk()-s with valuable data
>   (which go to the logbuf) and sends IPIs to all other CPUs.
> - console_unlock() finishes printing previous chunk and enables interrupts
>   before trying to print out the rest, the CPU catches the IPI and never
>   releases console lock.

Why doesn't the lock-owning CPU release the console lock?  Because it
was stopped by smp_send_stop() in panic()?

I don't recall why we stop CPUs in panic(), and of course we didn't
document the reason.  I guess it makes sense from the "what else can we
do" point of view, but I wonder if we can just do it later on - that
would fix this problem?

(dumb aside: why doesn't smp_send_stop() stop the calling CPU?)

> This is not the only possible case: in VT/fb subsystems we have many other
> console_lock()/console_unlock() users. Non-masked interrupts (or receiving
> NMI in case of extreme slowness) will have the same result. Getting the
> whole console buffer printed out on crash should be top priority.

Yes, this is a pretty big hole in the logic.

> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -23,6 +23,7 @@
>  #include <linux/sysrq.h>
>  #include <linux/init.h>
>  #include <linux/nmi.h>
> +#include <linux/console.h>
>  
>  #define PANIC_TIMER_STEP 100
>  #define PANIC_BLINK_SPD 18
> @@ -147,6 +148,15 @@ void panic(const char *fmt, ...)
>  
>  	bust_spinlocks(0);
>  
> +	/*
> +	 * We may have ended up killing the CPU holding the lock and still have
> +	 * some valuable data in console buffer. Try to acquire the lock and
> +	 * release it regardless of the result. The release will also print the
> +	 * buffers out.
> +	 */
> +	console_trylock();
> +	console_unlock();
> +

"killing the CPU" is a bit vague.  How's this look?

--- a/kernel/panic.c~panic-release-stale-console-lock-to-always-get-the-logbuf-printed-out-fix
+++ a/kernel/panic.c
@@ -149,10 +149,10 @@ void panic(const char *fmt, ...)
 	bust_spinlocks(0);
 
 	/*
-	 * We may have ended up killing the CPU holding the lock and still have
-	 * some valuable data in console buffer. Try to acquire the lock and
-	 * release it regardless of the result. The release will also print the
-	 * buffers out.
+	 * We may have ended up stopping the CPU holding the lock (in
+	 * smp_send_stop()) while still having some valuable data in the console
+	 * buffer.  Try to acquire the lock then release it regardless of the
+	 * result.  The release will also print the buffers out.
 	 */
 	console_trylock();
 	console_unlock();
_


Does the console_trylock() guarantee that the console lock is now held?
If the console_lock-holding CPU is still running then there's a window
where the above code could enter console_unlock() when nobody's holding
console_lock.  If smp_send_stop() always works (synchronously) then
that won't happen.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] panic: release stale console lock to always get the logbuf printed out
  2015-10-07 22:34 ` Andrew Morton
@ 2015-10-08  9:01   ` Jan Kara
  2015-10-08 10:03     ` Vitaly Kuznetsov
  2015-10-12  3:02     ` kbuild test robot
  2015-10-08  9:51   ` Vitaly Kuznetsov
  1 sibling, 2 replies; 11+ messages in thread
From: Jan Kara @ 2015-10-08  9:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vitaly Kuznetsov, HATAYAMA Daisuke, Masami Hiramatsu, Jiri Kosina,
	Baoquan He, Prarit Bhargava, Xie XiuQi, Seth Jennings,
	linux-kernel, K. Y. Srinivasan, Jan Kara

[-- Attachment #1: Type: text/plain, Size: 4183 bytes --]

On Wed 07-10-15 15:34:08, Andrew Morton wrote:
> (cc Jan)
> 
> On Wed,  7 Oct 2015 19:02:22 +0200 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> 
> > In some cases we may end up killing the CPU holding the console lock
> > while still having valuable data in logbuf. E.g. I'm observing the
> > following:
> > - A crash is happening on one CPU and console_unlock() is being called on
> >   some other.
> > - console_unlock() tries to print out the buffer before releasing the lock
> >   and on slow console it takes time.
> > - in the meanwhile crashing CPU does lots of printk()-s with valuable data
> >   (which go to the logbuf) and sends IPIs to all other CPUs.
> > - console_unlock() finishes printing previous chunk and enables interrupts
> >   before trying to print out the rest, the CPU catches the IPI and never
> >   releases console lock.
> 
> Why doesn't the lock-owning CPU release the console lock?  Because it
> was stopped by smp_send_stop() in panic()?
> 
> I don't recall why we stop CPUs in panic(), and of course we didn't
> document the reason.  I guess it makes sense from the "what else can we
> do" point of view, but I wonder if we can just do it later on - that
> would fix this problem?
> 
> (dumb aside: why doesn't smp_send_stop() stop the calling CPU?)
> 
> > This is not the only possible case: in VT/fb subsystems we have many other
> > console_lock()/console_unlock() users. Non-masked interrupts (or receiving
> > NMI in case of extreme slowness) will have the same result. Getting the
> > whole console buffer printed out on crash should be top priority.
> 
> Yes, this is a pretty big hole in the logic.
> 
> > --- a/kernel/panic.c
> > +++ b/kernel/panic.c
> > @@ -23,6 +23,7 @@
> >  #include <linux/sysrq.h>
> >  #include <linux/init.h>
> >  #include <linux/nmi.h>
> > +#include <linux/console.h>
> >  
> >  #define PANIC_TIMER_STEP 100
> >  #define PANIC_BLINK_SPD 18
> > @@ -147,6 +148,15 @@ void panic(const char *fmt, ...)
> >  
> >  	bust_spinlocks(0);
> >  
> > +	/*
> > +	 * We may have ended up killing the CPU holding the lock and still have
> > +	 * some valuable data in console buffer. Try to acquire the lock and
> > +	 * release it regardless of the result. The release will also print the
> > +	 * buffers out.
> > +	 */
> > +	console_trylock();
> > +	console_unlock();
> > +
> 
> "killing the CPU" is a bit vague.  How's this look?
> 
> --- a/kernel/panic.c~panic-release-stale-console-lock-to-always-get-the-logbuf-printed-out-fix
> +++ a/kernel/panic.c
> @@ -149,10 +149,10 @@ void panic(const char *fmt, ...)
>  	bust_spinlocks(0);
>  
>  	/*
> -	 * We may have ended up killing the CPU holding the lock and still have
> -	 * some valuable data in console buffer. Try to acquire the lock and
> -	 * release it regardless of the result. The release will also print the
> -	 * buffers out.
> +	 * We may have ended up stopping the CPU holding the lock (in
> +	 * smp_send_stop()) while still having some valuable data in the console
> +	 * buffer.  Try to acquire the lock then release it regardless of the
> +	 * result.  The release will also print the buffers out.
>  	 */
>  	console_trylock();
>  	console_unlock();
> _
> 
> 
> Does the console_trylock() guarantee that the console lock is now held?
> If the console_lock-holding CPU is still running then there's a window
> where the above code could enter console_unlock() when nobody's holding
> console_lock.  If smp_send_stop() always works (synchronously) then
> that won't happen.

We have this mechanism using zap_locks() in kernel/printk/printk.c when
crash happens on the CPU holding console_sem. Can't we use the same
mechanism for this case? Something like adding:

	zap_locks();
	console_lock();
	console_unlock();

to panic? If we picked up patch "kernel: Avoid softlockups in
stop_machine() during heavy printing" from my series (it's completely
independent, I've attached the latest version), the result would look less
hacky to me (attached). Thoughts?

Warning, the result is untested... I can do some testing and official
posting of the two patches if people agree we want to got down this path.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0003-kernel-Avoid-softlockups-in-stop_machine-during-heav.patch --]
[-- Type: text/x-patch, Size: 4074 bytes --]

>From 950e36435598090c6b03b4659922d981c53aa75b Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 13 Mar 2014 15:07:49 +0100
Subject: [PATCH 3/5] kernel: Avoid softlockups in stop_machine() during heavy
 printing

When there are lots of messages accumulated in printk buffer, printing
them (especially over serial console) can take a long time (tens of
seconds). stop_machine() will effectively make all cpus spin in
multi_cpu_stop() waiting for the CPU doing printing to print all the
messages which triggers NMI softlockup watchdog and RCU stall detector
which add even more to the messages to print. Since machine doesn't do
anything (except serving interrupts) during this time, also network
connections are dropped and other disturbances may happen.

Paper over the problem by waiting for printk buffer to be empty before
starting to stop CPUs. In theory a burst of new messages can be appended
to the printk buffer before CPUs enter multi_cpu_stop() so this isn't a 100%
solution but it works OK in practice and I'm not aware of a reasonably
simple better solution.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/console.h | 11 +++++++++++
 kernel/printk/printk.c  | 25 +++++++++++++++++++++++++
 kernel/stop_machine.c   |  9 +++++++++
 3 files changed, 45 insertions(+)

diff --git a/include/linux/console.h b/include/linux/console.h
index bd194343c346..96da462cdfeb 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -150,6 +150,17 @@ extern int console_trylock(void);
 extern void console_unlock(void);
 extern void console_conditional_schedule(void);
 extern void console_unblank(void);
+#ifdef CONFIG_SMP
+extern void printk_log_buf_drain(void);
+#else
+/*
+ * In non-SMP kernels there won't be much to drain so save some code for tiny
+ * kernels.
+ */
+static inline void printk_log_buf_drain(void)
+{
+}
+#endif
 extern struct tty_driver *console_device(int *);
 extern void console_stop(struct console *);
 extern void console_start(struct console *);
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index b249866260d7..b03431709965 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2488,6 +2488,31 @@ struct tty_driver *console_device(int *index)
 	return driver;
 }
 
+/* For non-SMP kernels this function isn't used and would be pointless anyway */
+#ifdef CONFIG_SMP
+/*
+ * Wait until all messages accumulated in the printk buffer are printed to
+ * console. Note that as soon as this function returns, new messages may be
+ * added to the printk buffer by other CPUs.
+ */
+void printk_log_buf_drain(void)
+{
+	bool retry;
+	unsigned long flags;
+
+	while (1) {
+		raw_spin_lock_irqsave(&logbuf_lock, flags);
+		retry = console_seq != log_next_seq;
+		raw_spin_unlock_irqrestore(&logbuf_lock, flags);
+		if (!retry || console_suspended)
+			break;
+		/* Cycle console_sem to wait for outstanding printing */
+		console_lock();
+		console_unlock();
+	}
+}
+#endif
+
 /*
  * Prevent further output on the passed console device so that (for example)
  * serial drivers can disable console output before suspending a port, and can
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 12484e5d5c88..e9496b4a3825 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -21,6 +21,7 @@
 #include <linux/smpboot.h>
 #include <linux/atomic.h>
 #include <linux/lglock.h>
+#include <linux/console.h>
 
 /*
  * Structure to determine completion condition and record errors.  May
@@ -543,6 +544,14 @@ static int __stop_machine(cpu_stop_fn_t fn, void *data, const struct cpumask *cp
 		return ret;
 	}
 
+	/*
+	 * If there are lots of outstanding messages, printing them can take a
+	 * long time and all cpus would be spinning waiting for the printing to
+	 * finish thus triggering NMI watchdog, RCU lockups etc. Wait for the
+	 * printing here to avoid these.
+	 */
+	printk_log_buf_drain();
+
 	/* Set the initial state and stop all online cpus. */
 	set_state(&msdata, MULTI_STOP_PREPARE);
 	return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);
-- 
2.1.4


[-- Attachment #3: 0001-panic-release-stale-console-lock-to-always-get-the-l.patch --]
[-- Type: text/x-patch, Size: 3935 bytes --]

>From 5bba47bcb62b52a0a1eb78fbea4ac25a5a2852bf Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.com>
Date: Thu, 8 Oct 2015 10:51:42 +0200
Subject: [PATCH] panic: release stale console lock to always get the logbuf
 printed out

In some cases we may end up killing the CPU holding the console lock
while still having valuable data in logbuf. E.g. Vitaly is observing the
following:

- A crash is happening on one CPU and console_unlock() is being called
  on some other.

- console_unlock() tries to print out the buffer before releasing the
  lock and on slow console it takes time.

- in the meanwhile crashing CPU does lots of printk()-s with valuable
  data (which go to the logbuf) and sends IPIs to all other CPUs.

- console_unlock() finishes printing previous chunk and enables
  interrupts before trying to print out the rest, the CPU catches the
  IPI and never releases console lock.

This is not the only possible case: in VT/fb subsystems we have many
other console_lock()/console_unlock() users. Non-masked interrupts (or
receiving NMI in case of extreme slowness) will have the same result.
Getting the whole console buffer printed out on crash should be top
priority.

Base on original patch by Vitaly Kuznetsov.

Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Jan Kara <jack@suse.com>
---
 include/linux/console.h | 4 ++--
 kernel/panic.c          | 8 ++++++++
 kernel/printk/printk.c  | 5 ++++-
 kernel/stop_machine.c   | 2 +-
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/linux/console.h b/include/linux/console.h
index 96da462cdfeb..f40084802f3f 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -151,13 +151,13 @@ extern void console_unlock(void);
 extern void console_conditional_schedule(void);
 extern void console_unblank(void);
 #ifdef CONFIG_SMP
-extern void printk_log_buf_drain(void);
+extern void printk_log_buf_drain(bool panic);
 #else
 /*
  * In non-SMP kernels there won't be much to drain so save some code for tiny
  * kernels.
  */
-static inline void printk_log_buf_drain(void)
+static inline void printk_log_buf_drain(bool panic)
 {
 }
 #endif
diff --git a/kernel/panic.c b/kernel/panic.c
index 04e91ff7560b..d07ed830a9fb 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -23,6 +23,7 @@
 #include <linux/sysrq.h>
 #include <linux/init.h>
 #include <linux/nmi.h>
+#include <linux/console.h>
 
 #define PANIC_TIMER_STEP 100
 #define PANIC_BLINK_SPD 18
@@ -147,6 +148,13 @@ void panic(const char *fmt, ...)
 
 	bust_spinlocks(0);
 
+	/*
+	 * We may have ended up stopping the CPU doing printing (in
+	 * smp_send_stop()) while still having some valuable data in the
+	 * console buffer.  Flush it out.
+	 */
+	printk_log_buf_drain(true);
+
 	if (!panic_blink)
 		panic_blink = no_blink;
 
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 8e125e98f523..62be2890e6a0 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2401,11 +2401,14 @@ struct tty_driver *console_device(int *index)
  * console. Note that as soon as this function returns, new messages may be
  * added to the printk buffer by other CPUs.
  */
-void printk_log_buf_drain(void)
+void printk_log_buf_drain(bool panic)
 {
 	bool retry;
 	unsigned long flags;
 
+	if (panic)
+		zap_locks();
+
 	while (1) {
 		raw_spin_lock_irqsave(&logbuf_lock, flags);
 		retry = console_seq != log_next_seq;
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index e9496b4a3825..50a03735893e 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -550,7 +550,7 @@ static int __stop_machine(cpu_stop_fn_t fn, void *data, const struct cpumask *cp
 	 * finish thus triggering NMI watchdog, RCU lockups etc. Wait for the
 	 * printing here to avoid these.
 	 */
-	printk_log_buf_drain();
+	printk_log_buf_drain(false);
 
 	/* Set the initial state and stop all online cpus. */
 	set_state(&msdata, MULTI_STOP_PREPARE);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] panic: release stale console lock to always get the logbuf printed out
  2015-10-07 22:34 ` Andrew Morton
  2015-10-08  9:01   ` Jan Kara
@ 2015-10-08  9:51   ` Vitaly Kuznetsov
  2015-10-08 20:54     ` Andrew Morton
  1 sibling, 1 reply; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-10-08  9:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: HATAYAMA Daisuke, Masami Hiramatsu, Jiri Kosina, Baoquan He,
	Prarit Bhargava, Xie XiuQi, Seth Jennings, linux-kernel,
	K. Y. Srinivasan, Jan Kara

Andrew Morton <akpm@linux-foundation.org> writes:

> (cc Jan)
>
> On Wed,  7 Oct 2015 19:02:22 +0200 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
>> In some cases we may end up killing the CPU holding the console lock
>> while still having valuable data in logbuf. E.g. I'm observing the
>> following:
>> - A crash is happening on one CPU and console_unlock() is being called on
>>   some other.
>> - console_unlock() tries to print out the buffer before releasing the lock
>>   and on slow console it takes time.
>> - in the meanwhile crashing CPU does lots of printk()-s with valuable data
>>   (which go to the logbuf) and sends IPIs to all other CPUs.
>> - console_unlock() finishes printing previous chunk and enables interrupts
>>   before trying to print out the rest, the CPU catches the IPI and never
>>   releases console lock.
>
> Why doesn't the lock-owning CPU release the console lock?  Because it
> was stopped by smp_send_stop() in panic()?
>
> I don't recall why we stop CPUs in panic(), and of course we didn't
> document the reason.  I guess it makes sense from the "what else can we
> do" point of view, but I wonder if we can just do it later on - that
> would fix this problem?

We don't know for how long should we wait for the other CPU to finish
the output and it can take some time. In case we're rebooting after a
short timeout we can still end up with something in the logbuf.

>
> (dumb aside: why doesn't smp_send_stop() stop the calling CPU?)
>

Because we do send_IPI_allbutself() ? :-)

>> This is not the only possible case: in VT/fb subsystems we have many other
>> console_lock()/console_unlock() users. Non-masked interrupts (or receiving
>> NMI in case of extreme slowness) will have the same result. Getting the
>> whole console buffer printed out on crash should be top priority.
>
> Yes, this is a pretty big hole in the logic.
>
>> --- a/kernel/panic.c
>> +++ b/kernel/panic.c
>> @@ -23,6 +23,7 @@
>>  #include <linux/sysrq.h>
>>  #include <linux/init.h>
>>  #include <linux/nmi.h>
>> +#include <linux/console.h>
>>  
>>  #define PANIC_TIMER_STEP 100
>>  #define PANIC_BLINK_SPD 18
>> @@ -147,6 +148,15 @@ void panic(const char *fmt, ...)
>>  
>>  	bust_spinlocks(0);
>>  
>> +	/*
>> +	 * We may have ended up killing the CPU holding the lock and still have
>> +	 * some valuable data in console buffer. Try to acquire the lock and
>> +	 * release it regardless of the result. The release will also print the
>> +	 * buffers out.
>> +	 */
>> +	console_trylock();
>> +	console_unlock();
>> +
>
> "killing the CPU" is a bit vague.  How's this look?
>
> --- a/kernel/panic.c~panic-release-stale-console-lock-to-always-get-the-logbuf-printed-out-fix
> +++ a/kernel/panic.c
> @@ -149,10 +149,10 @@ void panic(const char *fmt, ...)
>  	bust_spinlocks(0);
>
>  	/*
> -	 * We may have ended up killing the CPU holding the lock and still have
> -	 * some valuable data in console buffer. Try to acquire the lock and
> -	 * release it regardless of the result. The release will also print the
> -	 * buffers out.
> +	 * We may have ended up stopping the CPU holding the lock (in
> +	 * smp_send_stop()) while still having some valuable data in the console
> +	 * buffer.  Try to acquire the lock then release it regardless of the
> +	 * result.  The release will also print the buffers out.

Thanks, looks better)

>  	 */
>  	console_trylock();
>  	console_unlock();
> _
>
> Does the console_trylock() guarantee that the console lock is now held?
> If the console_lock-holding CPU is still running
> then there's a window
> where the above code could enter console_unlock() when nobody's holding
> console_lock.  If smp_send_stop() always works (synchronously) then
> that won't happen.

Well, in smp_send_stop() we send IPIs and if some CPU is still running
we send NMIs... In case someone ignored that I'm not sure what can be
done...

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] panic: release stale console lock to always get the logbuf printed out
  2015-10-08  9:01   ` Jan Kara
@ 2015-10-08 10:03     ` Vitaly Kuznetsov
  2015-10-08 20:56       ` Andrew Morton
  2015-10-12  3:02     ` kbuild test robot
  1 sibling, 1 reply; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-10-08 10:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, HATAYAMA Daisuke, Masami Hiramatsu, Jiri Kosina,
	Baoquan He, Prarit Bhargava, Xie XiuQi, Seth Jennings,
	linux-kernel, K. Y. Srinivasan

Jan Kara <jack@suse.cz> writes:

>> --- a/kernel/panic.c~panic-release-stale-console-lock-to-always-get-the-logbuf-printed-out-fix
>> +++ a/kernel/panic.c
>> @@ -149,10 +149,10 @@ void panic(const char *fmt, ...)
>>  	bust_spinlocks(0);
>>  
>>  	/*
>> -	 * We may have ended up killing the CPU holding the lock and still have
>> -	 * some valuable data in console buffer. Try to acquire the lock and
>> -	 * release it regardless of the result. The release will also print the
>> -	 * buffers out.
>> +	 * We may have ended up stopping the CPU holding the lock (in
>> +	 * smp_send_stop()) while still having some valuable data in the console
>> +	 * buffer.  Try to acquire the lock then release it regardless of the
>> +	 * result.  The release will also print the buffers out.
>>  	 */
>>  	console_trylock();
>>  	console_unlock();
>> _
>> 
>> 
>> Does the console_trylock() guarantee that the console lock is now held?
>> If the console_lock-holding CPU is still running then there's a window
>> where the above code could enter console_unlock() when nobody's holding
>> console_lock.  If smp_send_stop() always works (synchronously) then
>> that won't happen.
>
> We have this mechanism using zap_locks() in kernel/printk/printk.c when
> crash happens on the CPU holding console_sem. Can't we use the same
> mechanism for this case? Something like adding:
>
> 	zap_locks();
> 	console_lock();
> 	console_unlock();
>
> to panic?

I thought it doesn't really matter who holds the lock at this point --
all other cpus were already stopped with IPIs/NMIs.

> If we picked up patch "kernel: Avoid softlockups in
> stop_machine() during heavy printing" from my series (it's completely
> independent, I've attached the latest version), the result would look less
> hacky to me (attached). Thoughts?

Haven't tested it but should also work...

Thanks,

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] panic: release stale console lock to always get the logbuf printed out
  2015-10-08  9:51   ` Vitaly Kuznetsov
@ 2015-10-08 20:54     ` Andrew Morton
  2015-10-09 10:09       ` Vitaly Kuznetsov
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2015-10-08 20:54 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: HATAYAMA Daisuke, Masami Hiramatsu, Jiri Kosina, Baoquan He,
	Prarit Bhargava, Xie XiuQi, Seth Jennings, linux-kernel,
	K. Y. Srinivasan, Jan Kara

On Thu, 08 Oct 2015 11:51:13 +0200 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:

> > On Wed,  7 Oct 2015 19:02:22 +0200 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> >
> >> In some cases we may end up killing the CPU holding the console lock
> >> while still having valuable data in logbuf. E.g. I'm observing the
> >> following:
> >> - A crash is happening on one CPU and console_unlock() is being called on
> >>   some other.
> >> - console_unlock() tries to print out the buffer before releasing the lock
> >>   and on slow console it takes time.
> >> - in the meanwhile crashing CPU does lots of printk()-s with valuable data
> >>   (which go to the logbuf) and sends IPIs to all other CPUs.
> >> - console_unlock() finishes printing previous chunk and enables interrupts
> >>   before trying to print out the rest, the CPU catches the IPI and never
> >>   releases console lock.
> >
> > Why doesn't the lock-owning CPU release the console lock?  Because it
> > was stopped by smp_send_stop() in panic()?
> >
> > I don't recall why we stop CPUs in panic(), and of course we didn't
> > document the reason.  I guess it makes sense from the "what else can we
> > do" point of view, but I wonder if we can just do it later on - that
> > would fix this problem?
> 
> We don't know for how long should we wait for the other CPU to finish
> the output and it can take some time. In case we're rebooting after a
> short timeout we can still end up with something in the logbuf.

I don't understand what you're saying here.

If we move panic()'s call to smp_send_stop() so it occurs later in
panic(), won't this result in this CPU's messages being properly
displayed?  The currently-printing CPU will still be running and all
the printks will proceed in the normal fashion?


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] panic: release stale console lock to always get the logbuf printed out
  2015-10-08 10:03     ` Vitaly Kuznetsov
@ 2015-10-08 20:56       ` Andrew Morton
  2015-10-09 10:10         ` Vitaly Kuznetsov
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2015-10-08 20:56 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Jan Kara, HATAYAMA Daisuke, Masami Hiramatsu, Jiri Kosina,
	Baoquan He, Prarit Bhargava, Xie XiuQi, Seth Jennings,
	linux-kernel, K. Y. Srinivasan

On Thu, 08 Oct 2015 12:03:25 +0200 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:

> > If we picked up patch "kernel: Avoid softlockups in
> > stop_machine() during heavy printing" from my series (it's completely
> > independent, I've attached the latest version), the result would look less
> > hacky to me (attached). Thoughts?
> 
> Haven't tested it but should also work...

Well, I expect we'll be merging Jan's lockup-avoidance code fairly
soon.  There's no point in solving the same problem twice.

Sp please, do take a closer look at the proposed patches and if
possible, runtime test them?


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] panic: release stale console lock to always get the logbuf printed out
  2015-10-08 20:54     ` Andrew Morton
@ 2015-10-09 10:09       ` Vitaly Kuznetsov
  0 siblings, 0 replies; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-10-09 10:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: HATAYAMA Daisuke, Masami Hiramatsu, Jiri Kosina, Baoquan He,
	Prarit Bhargava, Xie XiuQi, Seth Jennings, linux-kernel,
	K. Y. Srinivasan, Jan Kara

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 08 Oct 2015 11:51:13 +0200 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
>> > On Wed,  7 Oct 2015 19:02:22 +0200 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>> >
>> >> In some cases we may end up killing the CPU holding the console lock
>> >> while still having valuable data in logbuf. E.g. I'm observing the
>> >> following:
>> >> - A crash is happening on one CPU and console_unlock() is being called on
>> >>   some other.
>> >> - console_unlock() tries to print out the buffer before releasing the lock
>> >>   and on slow console it takes time.
>> >> - in the meanwhile crashing CPU does lots of printk()-s with valuable data
>> >>   (which go to the logbuf) and sends IPIs to all other CPUs.
>> >> - console_unlock() finishes printing previous chunk and enables interrupts
>> >>   before trying to print out the rest, the CPU catches the IPI and never
>> >>   releases console lock.
>> >
>> > Why doesn't the lock-owning CPU release the console lock?  Because it
>> > was stopped by smp_send_stop() in panic()?
>> >
>> > I don't recall why we stop CPUs in panic(), and of course we didn't
>> > document the reason.  I guess it makes sense from the "what else can we
>> > do" point of view, but I wonder if we can just do it later on - that
>> > would fix this problem?
>> 
>> We don't know for how long should we wait for the other CPU to finish
>> the output and it can take some time. In case we're rebooting after a
>> short timeout we can still end up with something in the logbuf.
>
> I don't understand what you're saying here.
>
> If we move panic()'s call to smp_send_stop() so it occurs later in
> panic(), won't this result in this CPU's messages being properly
> displayed?

If some other CPU is printing, for how long do we need to wait before we
try to stop it? It can take *any* amount of time to print out the buffer
-- we can even reboot the host earlier.

> The currently-printing CPU will still be running and all
> the printks will proceed in the normal fashion?

It will be running till we reboot the host, and we need to make sure
there is nothing in the buffer when we do that. I see only two viable
options: make sure the crashing cpu prints out the buffer before we
reboot (natural serialization) or some sort of lock-waiting to make sure
the printing CPU is done with its job.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] panic: release stale console lock to always get the logbuf printed out
  2015-10-08 20:56       ` Andrew Morton
@ 2015-10-09 10:10         ` Vitaly Kuznetsov
  2015-10-09 12:44           ` Vitaly Kuznetsov
  0 siblings, 1 reply; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-10-09 10:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, HATAYAMA Daisuke, Masami Hiramatsu, Jiri Kosina,
	Baoquan He, Prarit Bhargava, Xie XiuQi, Seth Jennings,
	linux-kernel, K. Y. Srinivasan

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 08 Oct 2015 12:03:25 +0200 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
>> > If we picked up patch "kernel: Avoid softlockups in
>> > stop_machine() during heavy printing" from my series (it's completely
>> > independent, I've attached the latest version), the result would look less
>> > hacky to me (attached). Thoughts?
>> 
>> Haven't tested it but should also work...
>
> Well, I expect we'll be merging Jan's lockup-avoidance code fairly
> soon.  There's no point in solving the same problem twice.
>
> Sp please, do take a closer look at the proposed patches and if
> possible, runtime test them?

Sure, I'll take a look.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] panic: release stale console lock to always get the logbuf printed out
  2015-10-09 10:10         ` Vitaly Kuznetsov
@ 2015-10-09 12:44           ` Vitaly Kuznetsov
  0 siblings, 0 replies; 11+ messages in thread
From: Vitaly Kuznetsov @ 2015-10-09 12:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, HATAYAMA Daisuke, Masami Hiramatsu, Jiri Kosina,
	Baoquan He, Prarit Bhargava, Xie XiuQi, Seth Jennings,
	linux-kernel, K. Y. Srinivasan

Vitaly Kuznetsov <vkuznets@redhat.com> writes:

> Andrew Morton <akpm@linux-foundation.org> writes:
>
>> On Thu, 08 Oct 2015 12:03:25 +0200 Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>>
>>> > If we picked up patch "kernel: Avoid softlockups in
>>> > stop_machine() during heavy printing" from my series (it's completely
>>> > independent, I've attached the latest version), the result would look less
>>> > hacky to me (attached). Thoughts?
>>> 
>>> Haven't tested it but should also work...
>>
>> Well, I expect we'll be merging Jan's lockup-avoidance code fairly
>> soon.  There's no point in solving the same problem twice.
>>
>> Sp please, do take a closer look at the proposed patches and if
>> possible, runtime test them?
>
> Sure, I'll take a look.

Jan,

I've just tested patches you suggested and it seems they also solve the
issue I observe. Feel free to add:

Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>

Thanks!

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: [PATCH] panic: release stale console lock to always get the logbuf printed out
  2015-10-08  9:01   ` Jan Kara
  2015-10-08 10:03     ` Vitaly Kuznetsov
@ 2015-10-12  3:02     ` kbuild test robot
  1 sibling, 0 replies; 11+ messages in thread
From: kbuild test robot @ 2015-10-12  3:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: kbuild-all, Andrew Morton, Vitaly Kuznetsov, HATAYAMA Daisuke,
	Masami Hiramatsu, Jiri Kosina, Baoquan He, Prarit Bhargava,
	Xie XiuQi, Seth Jennings, linux-kernel, K. Y. Srinivasan,
	Jan Kara

[-- Attachment #1: Type: text/plain, Size: 1132 bytes --]

Hi Jan,

[auto build test ERROR on v4.3-rc4 -- if it's inappropriate base, please ignore]

config: arm64-allnoconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm64 

All errors (new ones prefixed by >>):

   kernel/printk/printk.c: In function 'printk_log_buf_drain':
>> kernel/printk/printk.c:2410:3: error: implicit declaration of function 'zap_locks' [-Werror=implicit-function-declaration]
      zap_locks();
      ^
   cc1: some warnings being treated as errors

vim +/zap_locks +2410 kernel/printk/printk.c

  2404	void printk_log_buf_drain(bool panic)
  2405	{
  2406		bool retry;
  2407		unsigned long flags;
  2408	
  2409		if (panic)
> 2410			zap_locks();
  2411	
  2412		while (1) {
  2413			raw_spin_lock_irqsave(&logbuf_lock, flags);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 5474 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-10-12  3:03 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-10-07 17:02 [PATCH] panic: release stale console lock to always get the logbuf printed out Vitaly Kuznetsov
2015-10-07 22:34 ` Andrew Morton
2015-10-08  9:01   ` Jan Kara
2015-10-08 10:03     ` Vitaly Kuznetsov
2015-10-08 20:56       ` Andrew Morton
2015-10-09 10:10         ` Vitaly Kuznetsov
2015-10-09 12:44           ` Vitaly Kuznetsov
2015-10-12  3:02     ` kbuild test robot
2015-10-08  9:51   ` Vitaly Kuznetsov
2015-10-08 20:54     ` Andrew Morton
2015-10-09 10:09       ` Vitaly Kuznetsov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).