* [PATCH 1/3] introduce for_each_process_thread_{break,continue}() helpers
2016-07-25 16:23 [PATCH 0/3] introduce for_each_process_thread_{break,continue}() helpers Oleg Nesterov
@ 2016-07-25 16:23 ` Oleg Nesterov
2016-08-02 11:58 ` Peter Zijlstra
2016-07-25 16:23 ` [PATCH 2/3] hung_task.c: change rcu_lock_break() code to use for_each_process_thread_break/continue Oleg Nesterov
` (2 subsequent siblings)
3 siblings, 1 reply; 11+ messages in thread
From: Oleg Nesterov @ 2016-07-25 16:23 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Anderson, Ingo Molnar, Peter Zijlstra, Paul E. McKenney,
Wang Shu, linux-kernel
Example:

	rcu_read_lock();
	for_each_process_thread(p, t) {
		do_something_slow(p, t);
		if (SPENT_TOO_MANY_TIME) {
			for_each_process_thread_break(p, t);
			rcu_read_unlock();
			schedule();
			rcu_read_lock();
			for_each_process_thread_continue(&p, &t);
		}
	}
	rcu_read_unlock();
This looks similar to rcu_lock_break(), but works much better, and the next patch
changes check_hung_uninterruptible_tasks() to use these new helpers. My real
target, however, is show_state_filter(), which can trivially lead to a lockup.
Compared to rcu_lock_break(), for_each_process_thread_continue() never gives
up; it relies on the fact that both the process and thread lists are sorted by
the task->start_time key. So, for example, even if both the leader and the
thread are already dead, we can find the next alive process and continue.
TODO: This is another indication that we should rename pid_alive(tsk) (and
probably change it); most users call it to ensure that rcu_read_lock() still
protects this tsk.
NOTE: it seems that, contrary to the comment, task_struct->start_time is not
really monotonic, and this should probably be fixed. Until then, _continue()
might skip more threads with the same ->start_time than necessary.
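To illustrate the idea outside the kernel: the helper relies only on being able to resume a key-sorted list walk from the last element whose key does not exceed a saved one. A minimal userspace sketch, with a hypothetical struct node standing in for the task lists:

```c
#include <stddef.h>

/* Hypothetical stand-in for a task list kept sorted by ->start_time. */
struct node {
	unsigned long long start_time;
	struct node *next;
};

/*
 * Mirror of the "prev" logic in for_each_process_thread_continue():
 * return the last node whose start_time is <= the saved key, so the
 * walk can resume there even if the saved node itself has gone away.
 * Falls back to the list head when every node is newer than the key.
 */
static struct node *resume_after(struct node *head, unsigned long long key)
{
	struct node *prev = head, *n;

	for (n = head; n; n = n->next) {
		if (n->start_time > key)
			break;
		prev = n;
	}
	return prev;
}
```

Note that with duplicate keys resume_after() lands on the last duplicate, which is exactly why a non-monotonic ->start_time can make _continue() skip threads, as the NOTE above says.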
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/sched.h | 10 ++++++++++
kernel/exit.c | 42 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 253538f..004c9d5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2859,6 +2859,16 @@ extern bool current_is_single_threaded(void);
#define for_each_process_thread(p, t) \
for_each_process(p) for_each_thread(p, t)
+static inline void
+for_each_process_thread_break(struct task_struct *p, struct task_struct *t)
+{
+	get_task_struct(p);
+	get_task_struct(t);
+}
+
+extern void
+for_each_process_thread_continue(struct task_struct **, struct task_struct **);
+
static inline int get_nr_threads(struct task_struct *tsk)
{
return tsk->signal->nr_threads;
diff --git a/kernel/exit.c b/kernel/exit.c
index 9e6e135..08f49d4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -210,6 +210,48 @@ repeat:
 	goto repeat;
 }
 
+void for_each_process_thread_continue(struct task_struct **p_leader,
+				      struct task_struct **p_thread)
+{
+	struct task_struct *leader = *p_leader, *thread = *p_thread;
+	struct task_struct *prev, *next;
+	u64 start_time;
+
+	if (pid_alive(thread)) {
+		/* mt exec could change the leader */
+		*p_leader = thread->group_leader;
+	} else if (pid_alive(leader)) {
+		start_time = thread->start_time;
+		prev = leader;
+
+		for_each_thread(leader, next) {
+			if (next->start_time > start_time)
+				break;
+			prev = next;
+		}
+
+		*p_thread = prev;
+	} else {
+		start_time = leader->start_time;
+		prev = &init_task;
+
+		for_each_process(next) {
+			if (next->start_time > start_time)
+				break;
+			prev = next;
+		}
+
+		*p_leader = prev;
+		/* a new thread can come after that, but this is fine */
+		*p_thread = list_last_entry(&prev->signal->thread_head,
+					    struct task_struct,
+					    thread_node);
+	}
+
+	put_task_struct(leader);
+	put_task_struct(thread);
+}
+
 /*
  * Determine if a process group is "orphaned", according to the POSIX
  * definition in 2.2.2.52. Orphaned process groups are not to be affected
--
2.5.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 1/3] introduce for_each_process_thread_{break,continue}() helpers
2016-07-25 16:23 ` [PATCH 1/3] " Oleg Nesterov
@ 2016-08-02 11:58 ` Peter Zijlstra
2016-08-03 20:26 ` Oleg Nesterov
0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2016-08-02 11:58 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Dave Anderson, Ingo Molnar, Paul E. McKenney,
Wang Shu, linux-kernel
On Mon, Jul 25, 2016 at 06:23:48PM +0200, Oleg Nesterov wrote:
> +void for_each_process_thread_continue(struct task_struct **p_leader,
> + struct task_struct **p_thread)
> +{
> + struct task_struct *leader = *p_leader, *thread = *p_thread;
> + struct task_struct *prev, *next;
> + u64 start_time;
> +
> + if (pid_alive(thread)) {
> + /* mt exec could change the leader */
> + *p_leader = thread->group_leader;
> + } else if (pid_alive(leader)) {
> + start_time = thread->start_time;
> + prev = leader;
> +
> + for_each_thread(leader, next) {
> + if (next->start_time > start_time)
> + break;
> + prev = next;
> + }
This,
> + *p_thread = prev;
> + } else {
> + start_time = leader->start_time;
> + prev = &init_task;
> +
> + for_each_process(next) {
> + if (next->start_time > start_time)
> + break;
> + prev = next;
> + }
and this, could be 'SPEND_TOO_MUCH_TIME' all by themselves.
Unlikely though, nor do I really have a better suggestion :/
> +
> + *p_leader = prev;
> + /* a new thread can come after that, but this is fine */
> + *p_thread = list_last_entry(&prev->signal->thread_head,
> + struct task_struct,
> + thread_node);
> + }
> +
> + put_task_struct(leader);
> + put_task_struct(thread);
> +}
* Re: [PATCH 1/3] introduce for_each_process_thread_{break,continue}() helpers
2016-08-02 11:58 ` Peter Zijlstra
@ 2016-08-03 20:26 ` Oleg Nesterov
2016-08-08 12:20 ` Peter Zijlstra
0 siblings, 1 reply; 11+ messages in thread
From: Oleg Nesterov @ 2016-08-03 20:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Dave Anderson, Ingo Molnar, Paul E. McKenney,
Wang Shu, linux-kernel
Thanks for the review, and sorry for the delay; I am travelling till the end of this week.
On 08/02, Peter Zijlstra wrote:
>
> On Mon, Jul 25, 2016 at 06:23:48PM +0200, Oleg Nesterov wrote:
> > +void for_each_process_thread_continue(struct task_struct **p_leader,
> > + struct task_struct **p_thread)
> > +{
> > + struct task_struct *leader = *p_leader, *thread = *p_thread;
> > + struct task_struct *prev, *next;
> > + u64 start_time;
> > +
> > + if (pid_alive(thread)) {
> > + /* mt exec could change the leader */
> > + *p_leader = thread->group_leader;
> > + } else if (pid_alive(leader)) {
> > + start_time = thread->start_time;
> > + prev = leader;
> > +
> > + for_each_thread(leader, next) {
> > + if (next->start_time > start_time)
> > + break;
> > + prev = next;
> > + }
>
> This,
>
> > + *p_thread = prev;
> > + } else {
> > + start_time = leader->start_time;
> > + prev = &init_task;
> > +
> > + for_each_process(next) {
> > + if (next->start_time > start_time)
> > + break;
> > + prev = next;
> > + }
>
> and this, could be 'SPEND_TOO_MUCH_TIME' all by themselves.
>
> Unlikely though, nor do I really have a better suggestion :/
Yeees, I don't think this can actually hurt "in practice", but I agree, compared
to rcu_lock_break() this is only bounded by PID_MAX_DEFAULT in theory.
Will you agree if, for a start, I just add an "int max_scan" argument and make it
return a boolean? The caller will then need to abort the for_each_process_thread()
loop if _continue() fails.
Probably this is not what we actually want for show_state_filter(); we can make it
better later. We can "embed" the rcu_lock_break() logic into _continue(), or change
break/continue to record the state (leader_start_time, thread_start_time) so that
a "false" return from _continue() means that the caller needs another schedule():
	struct state state;

	rcu_read_lock();
	for_each_process_thread(p, t) {
		do_something_slow(p, t);
		if (SPENT_TOO_MANY_TIME) {
			for_each_process_thread_break(p, t, &state);
 another_break:
			rcu_read_unlock();
			schedule();
			rcu_read_lock();
			if (!for_each_process_thread_continue(&p, &t, LIMIT, &state))
				goto another_break;
		}
	}
	rcu_read_unlock();
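The helper side of such a bounded variant might look roughly like this userspace sketch (hypothetical names, a plain sorted list standing in for the task lists; not the actual proposed kernel interface):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task list kept sorted by ->start_time. */
struct node {
	unsigned long long start_time;
	struct node *next;
};

/*
 * Bounded resume: scan at most max_scan nodes looking for the last
 * one whose start_time is <= the saved key. On success store the
 * resume point in *out and return true; return false if the budget
 * runs out, in which case the caller should abort the walk (and,
 * say, schedule() before retrying).
 */
static bool resume_bounded(struct node *head, unsigned long long key,
			   int max_scan, struct node **out)
{
	struct node *prev = head, *n;

	for (n = head; n; n = n->next) {
		if (max_scan-- <= 0)
			return false;
		if (n->start_time > key)
			break;
		prev = n;
	}
	*out = prev;
	return true;
}
```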
Not sure. I'd like to do something simple for a start. We need to make
show_state_filter() "preemptible" in any case. And even killable, I think.
Not only can it trivially trigger soft-lockups (at least), it can simply
never finish.
Oleg.
* Re: [PATCH 1/3] introduce for_each_process_thread_{break,continue}() helpers
2016-08-03 20:26 ` Oleg Nesterov
@ 2016-08-08 12:20 ` Peter Zijlstra
0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2016-08-08 12:20 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Dave Anderson, Ingo Molnar, Paul E. McKenney,
Wang Shu, linux-kernel
On Wed, Aug 03, 2016 at 10:26:09PM +0200, Oleg Nesterov wrote:
> Not sure. I'd like to do something simple for the start. We need to make
> show_state_filter() "preemptible" in any case. And even killable, I think.
> Not only it can trivially trigger the soft-lockups (at least), it can simply
> never finish.
Yeah, no objection really. Just make sure to put a comment explaining
the limitations/issues of whatever you decide to go for.
* [PATCH 2/3] hung_task.c: change rcu_lock_break() code to use for_each_process_thread_break/continue
2016-07-25 16:23 [PATCH 0/3] introduce for_each_process_thread_{break,continue}() helpers Oleg Nesterov
2016-07-25 16:23 ` [PATCH 1/3] " Oleg Nesterov
@ 2016-07-25 16:23 ` Oleg Nesterov
2016-07-25 16:23 ` [PATCH 3/3] hung_task.c: change the "max_count" " Oleg Nesterov
2016-07-26 18:57 ` [PATCH 0/1] (Was: introduce for_each_process_thread_{break,continue}() helpers) Oleg Nesterov
3 siblings, 0 replies; 11+ messages in thread
From: Oleg Nesterov @ 2016-07-25 16:23 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Anderson, Ingo Molnar, Peter Zijlstra, Paul E. McKenney,
Wang Shu, linux-kernel
Change check_hung_uninterruptible_tasks() to use the new helpers and remove
the "can_cont" logic from rcu_lock_break(), which we will probably export
later for other users (show_state_filter).
We could add for_each_process_thread_break/continue right into rcu_lock_break()
but see the next patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/hung_task.c | 25 +++++++++----------------
1 file changed, 9 insertions(+), 16 deletions(-)
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index d234022..517f52e 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -134,20 +134,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
* For preemptible RCU it is sufficient to call rcu_read_unlock in order
* to exit the grace period. For classic RCU, a reschedule is required.
*/
-static bool rcu_lock_break(struct task_struct *g, struct task_struct *t)
+static void rcu_lock_break(void)
 {
-	bool can_cont;
-
-	get_task_struct(g);
-	get_task_struct(t);
 	rcu_read_unlock();
 	cond_resched();
 	rcu_read_lock();
-	can_cont = pid_alive(g) && pid_alive(t);
-	put_task_struct(t);
-	put_task_struct(g);
-
-	return can_cont;
 }
/*
@@ -170,16 +161,18 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 	rcu_read_lock();
 	for_each_process_thread(g, t) {
-		if (!max_count--)
+		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
+		if (t->state == TASK_UNINTERRUPTIBLE)
+			check_hung_task(t, timeout);
+
+		if (!--max_count)
 			goto unlock;
 		if (!--batch_count) {
 			batch_count = HUNG_TASK_BATCHING;
-			if (!rcu_lock_break(g, t))
-				goto unlock;
+			for_each_process_thread_break(g, t);
+			rcu_lock_break();
+			for_each_process_thread_continue(&g, &t);
 		}
-		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
-		if (t->state == TASK_UNINTERRUPTIBLE)
-			check_hung_task(t, timeout);
 	}
 unlock:
 	rcu_read_unlock();
--
2.5.0
* [PATCH 3/3] hung_task.c: change the "max_count" code to use for_each_process_thread_break/continue
2016-07-25 16:23 [PATCH 0/3] introduce for_each_process_thread_{break,continue}() helpers Oleg Nesterov
2016-07-25 16:23 ` [PATCH 1/3] " Oleg Nesterov
2016-07-25 16:23 ` [PATCH 2/3] hung_task.c: change rcu_lock_break() code to use for_each_process_thread_break/continue Oleg Nesterov
@ 2016-07-25 16:23 ` Oleg Nesterov
2016-07-26 18:57 ` [PATCH 0/1] (Was: introduce for_each_process_thread_{break,continue}() helpers) Oleg Nesterov
3 siblings, 0 replies; 11+ messages in thread
From: Oleg Nesterov @ 2016-07-25 16:23 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Anderson, Ingo Molnar, Peter Zijlstra, Paul E. McKenney,
Wang Shu, linux-kernel
If sysctl_hung_task_check_count is low and the first "max_count" threads do
not exit, check_hung_uninterruptible_tasks() will never check the other
tasks; this is just ugly.
Now that we have for_each_process_thread_continue(), we can save g/t before
returning and restart from there on the next call.
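The "save the cursor, resume on the next invocation" pattern can be sketched in plain C (hypothetical names; a static cursor is fine here because, like the watchdog, there is only one caller — and this sketch deliberately ignores the liveness problem that for_each_process_thread_break/continue solve):

```c
#include <stddef.h>

struct node { int val; struct node *next; };

/*
 * Process at most "budget" nodes per call, remember where we stopped
 * in a static cursor, and resume from there on the next call. The
 * cursor is NULL once the walk completes, so the call after that
 * starts over from the head. Returns the number of nodes processed.
 */
static int scan_some(struct node *head, int budget, int *sum)
{
	static struct node *saved;
	struct node *n = saved ? saved : head;
	int done = 0;

	saved = NULL;
	for (; n && done < budget; n = n->next) {
		*sum += n->val;
		done++;
	}
	saved = n;	/* NULL when the walk finished */
	return done;
}
```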
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/hung_task.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 517f52e..e6e516d 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -148,6 +148,9 @@ static void rcu_lock_break(void)
*/
static void check_hung_uninterruptible_tasks(unsigned long timeout)
{
+	/* we have only one watchdog() thread */
+	static struct task_struct *g_saved, *t_saved;
+
int max_count = sysctl_hung_task_check_count;
int batch_count = HUNG_TASK_BATCHING;
struct task_struct *g, *t;
@@ -160,18 +163,29 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 		return;
 
 	rcu_read_lock();
+	if (g_saved) {
+		g = g_saved;
+		t = t_saved;
+		g_saved = NULL;
+		goto resume;
+	}
+
 	for_each_process_thread(g, t) {
 		/* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
 		if (t->state == TASK_UNINTERRUPTIBLE)
 			check_hung_task(t, timeout);
 
-		if (!--max_count)
-			goto unlock;
-		if (!--batch_count) {
-			batch_count = HUNG_TASK_BATCHING;
+		if (!--max_count || !--batch_count) {
 			for_each_process_thread_break(g, t);
+			if (!max_count) {
+				g_saved = g;
+				t_saved = t;
+				goto unlock;
+			}
 			rcu_lock_break();
+ resume:
 			for_each_process_thread_continue(&g, &t);
+			batch_count = HUNG_TASK_BATCHING;
 		}
 	}
 unlock:
--
2.5.0
* [PATCH 0/1] (Was: introduce for_each_process_thread_{break,continue}() helpers)
2016-07-25 16:23 [PATCH 0/3] introduce for_each_process_thread_{break,continue}() helpers Oleg Nesterov
` (2 preceding siblings ...)
2016-07-25 16:23 ` [PATCH 3/3] hung_task.c: change the "max_count" " Oleg Nesterov
@ 2016-07-26 18:57 ` Oleg Nesterov
2016-07-26 18:57 ` [PATCH 1/1] stop_machine: touch_nmi_watchdog() after MULTI_STOP_PREPARE Oleg Nesterov
3 siblings, 1 reply; 11+ messages in thread
From: Oleg Nesterov @ 2016-07-26 18:57 UTC (permalink / raw)
To: Andrew Morton, Ingo Molnar
Cc: Dave Anderson, Peter Zijlstra, Paul E. McKenney, Wang Shu,
linux-kernel, Tejun Heo, Thomas Gleixner
On 07/25, Oleg Nesterov wrote:
>
> IMHO this makes sense in any case, but mostly this is preparation for
> another change: show_state_filter() should be preemptible. But this needs
> more discussion, I'll write another email/patch when I fully understand
> the hard-lockup caused by sysrq-t.
Yes, we need to do something with show_state_filter() anyway, I think.
OTOH, I believe this simple change in multi_cpu_stop() makes sense regardless.
Oleg.
* [PATCH 1/1] stop_machine: touch_nmi_watchdog() after MULTI_STOP_PREPARE
2016-07-26 18:57 ` [PATCH 0/1] (Was: introduce for_each_process_thread_{break,continue}() helpers) Oleg Nesterov
@ 2016-07-26 18:57 ` Oleg Nesterov
2016-07-27 8:06 ` Thomas Gleixner
2016-07-27 10:42 ` [tip:core/urgent] stop_machine: Touch_nmi_watchdog() " tip-bot for Oleg Nesterov
0 siblings, 2 replies; 11+ messages in thread
From: Oleg Nesterov @ 2016-07-26 18:57 UTC (permalink / raw)
To: Andrew Morton, Ingo Molnar
Cc: Dave Anderson, Peter Zijlstra, Paul E. McKenney, Wang Shu,
linux-kernel, Tejun Heo, Thomas Gleixner
Suppose that stop_machine(fn) hangs because fn() hangs. In this case an NMI
hard-lockup can be triggered on another CPU which does nothing wrong, and
the trace from nmi_panic() won't help to investigate the problem.
And this change "fixes" the problem we (seem to) hit in practice.
- stop_two_cpus(0, 1) races with show_state_filter() running on CPU_0.
- CPU_1 already spins in MULTI_STOP_PREPARE state, it detects the soft
lockup and tries to report the problem.
- show_state_filter() enables preemption, CPU_0 calls multi_cpu_stop()
which goes to MULTI_STOP_DISABLE_IRQ state and disables interrupts.
- CPU_1 spends more than 10 seconds trying to flush the log buffer to
the slow serial console.
- NMI interrupt on CPU_0 (which now waits for CPU_1) calls nmi_panic().
Reported-by: Wang Shu <shuwang@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/stop_machine.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index a467e6c..4a1ca5f 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -21,6 +21,7 @@
 #include <linux/smpboot.h>
 #include <linux/atomic.h>
 #include <linux/lglock.h>
+#include <linux/nmi.h>
/*
* Structure to determine completion condition and record errors. May
@@ -209,6 +210,13 @@ static int multi_cpu_stop(void *data)
 				break;
 			}
 			ack_state(msdata);
+		} else if (curstate > MULTI_STOP_PREPARE) {
+			/*
+			 * At this stage all other CPUs we depend on must spin
+			 * in the same loop. Any reason for hard-lockup should
+			 * be detected and reported on their side.
+			 */
+			touch_nmi_watchdog();
 		}
 	} while (curstate != MULTI_STOP_EXIT);
--
2.5.0
* Re: [PATCH 1/1] stop_machine: touch_nmi_watchdog() after MULTI_STOP_PREPARE
2016-07-26 18:57 ` [PATCH 1/1] stop_machine: touch_nmi_watchdog() after MULTI_STOP_PREPARE Oleg Nesterov
@ 2016-07-27 8:06 ` Thomas Gleixner
2016-07-27 10:42 ` [tip:core/urgent] stop_machine: Touch_nmi_watchdog() " tip-bot for Oleg Nesterov
1 sibling, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2016-07-27 8:06 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Ingo Molnar, Dave Anderson, Peter Zijlstra,
Paul E. McKenney, Wang Shu, linux-kernel, Tejun Heo
On Tue, 26 Jul 2016, Oleg Nesterov wrote:
> Suppose that stop_machine(fn) hangs because fn() hangs. In this case NMI
> hard-lockup can be triggered on another CPU which does nothing wrong and
> the trace from nmi_panic() won't help to investigate the problem.
>
> And this change "fixes" the problem we (seem to) hit in practice.
>
> - stop_two_cpus(0, 1) races with show_state_filter() running on CPU_0.
>
> - CPU_1 already spins in MULTI_STOP_PREPARE state, it detects the soft
> lockup and tries to report the problem.
>
> - show_state_filter() enables preemption, CPU_0 calls multi_cpu_stop()
> which goes to MULTI_STOP_DISABLE_IRQ state and disables interrupts.
>
> - CPU_1 spends more than 10 seconds trying to flush the log buffer to
> the slow serial console.
>
> - NMI interrupt on CPU_0 (which now waits for CPU_1) calls nmi_panic().
>
> Reported-by: Wang Shu <shuwang@redhat.com>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> kernel/stop_machine.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index a467e6c..4a1ca5f 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -21,6 +21,7 @@
> #include <linux/smpboot.h>
> #include <linux/atomic.h>
> #include <linux/lglock.h>
> +#include <linux/nmi.h>
>
> /*
> * Structure to determine completion condition and record errors. May
> @@ -209,6 +210,13 @@ static int multi_cpu_stop(void *data)
> break;
> }
> ack_state(msdata);
> + } else if (curstate > MULTI_STOP_PREPARE) {
> + /*
> + * At this stage all other CPUs we depend on must spin
> + * in the same loop. Any reason for hard-lockup should
> + * be detected and reported on their side.
> + */
> + touch_nmi_watchdog();
> }
> } while (curstate != MULTI_STOP_EXIT);
>
> --
> 2.5.0
>
>
>
* [tip:core/urgent] stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE
2016-07-26 18:57 ` [PATCH 1/1] stop_machine: touch_nmi_watchdog() after MULTI_STOP_PREPARE Oleg Nesterov
2016-07-27 8:06 ` Thomas Gleixner
@ 2016-07-27 10:42 ` tip-bot for Oleg Nesterov
1 sibling, 0 replies; 11+ messages in thread
From: tip-bot for Oleg Nesterov @ 2016-07-27 10:42 UTC (permalink / raw)
To: linux-tip-commits
Cc: paulmck, oleg, tj, linux-kernel, akpm, shuwang, peterz, tglx,
mingo, hpa, anderson, torvalds
Commit-ID: ce4f06dcbb5d6d04d202f1b81ac72d5679dcdfc0
Gitweb: http://git.kernel.org/tip/ce4f06dcbb5d6d04d202f1b81ac72d5679dcdfc0
Author: Oleg Nesterov <oleg@redhat.com>
AuthorDate: Tue, 26 Jul 2016 20:57:36 +0200
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 27 Jul 2016 11:12:11 +0200
stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE
Suppose that stop_machine(fn) hangs because fn() hangs. In this case an NMI
hard-lockup can be triggered on another CPU which does nothing wrong, and
the trace from nmi_panic() won't help to investigate the problem.
And this change "fixes" the problem we (seem to) hit in practice.
- stop_two_cpus(0, 1) races with show_state_filter() running on CPU_0.
- CPU_1 already spins in MULTI_STOP_PREPARE state, it detects the soft
lockup and tries to report the problem.
- show_state_filter() enables preemption, CPU_0 calls multi_cpu_stop()
which goes to MULTI_STOP_DISABLE_IRQ state and disables interrupts.
- CPU_1 spends more than 10 seconds trying to flush the log buffer to
the slow serial console.
- NMI interrupt on CPU_0 (which now waits for CPU_1) calls nmi_panic().
Reported-by: Wang Shu <shuwang@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20160726185736.GB4088@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/stop_machine.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index a467e6c..4a1ca5f 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -21,6 +21,7 @@
 #include <linux/smpboot.h>
 #include <linux/atomic.h>
 #include <linux/lglock.h>
+#include <linux/nmi.h>
/*
* Structure to determine completion condition and record errors. May
@@ -209,6 +210,13 @@ static int multi_cpu_stop(void *data)
 				break;
 			}
 			ack_state(msdata);
+		} else if (curstate > MULTI_STOP_PREPARE) {
+			/*
+			 * At this stage all other CPUs we depend on must spin
+			 * in the same loop. Any reason for hard-lockup should
+			 * be detected and reported on their side.
+			 */
+			touch_nmi_watchdog();
 		}
 	} while (curstate != MULTI_STOP_EXIT);