All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/1] hung_task debugging: Add tracepoint to report the hang
@ 2013-10-19 16:18 Oleg Nesterov
  2013-10-19 16:18 ` [PATCH 1/1] " Oleg Nesterov
  0 siblings, 1 reply; 5+ messages in thread
From: Oleg Nesterov @ 2013-10-19 16:18 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Steven Rostedt; +Cc: Dave Sullivan, linux-kernel

Hi,

We have a feature request, the customer needs something more hookable
than just printk's from check_hung_task() to implement the user-space
watchdog which can potentially resolve the problems which caused the
hang.

The patch simply adds a tracepoint into check_hung_task(), do you
think we can (ab)use tp this way?


And I am just curious, perhaps another patch (below) makes sense too?

Oleg.

------------------------------------------------------------------------------
[PATCH] hung_task debugging: Do not report the same task twice

If a task hangs, check_hung_task() will blame it again and again
until it wakes up or sysctl_hung_task_warnings becomes zero. IMO,
this just adds the unnecessary noise and if another task hangs
after sysctl_hung_task_timeout_secs * sysctl_hung_task_warnings
it won't be reported.

With this patch check_hung_task() simply sets the most significant
bit in ->last_switch_count to mark this task as "already reported".
This bit is cleared if we notice the change in nvcsw/nivcsw, so we
do not skip this task if it hangs again later.

Note that we also ignore the MSB in switch_count, we need this to
avoid the false-positive "already reported". This means that
->last_switch_count is not necessarily equal to ->nvcsw + ->nivcsw
but we do not care, we have enough bits to notice the change.

And this allows to remove the special switch_count == 0 case, just
we need to initialize ->last_switch_count = LONG_MIN.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/fork.c      |    2 +-
 kernel/hung_task.c |   19 +++++++++----------
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 8531609..aea397b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -866,7 +866,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	tsk->min_flt = tsk->maj_flt = 0;
 	tsk->nvcsw = tsk->nivcsw = 0;
 #ifdef CONFIG_DETECT_HUNG_TASK
-	tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
+	tsk->last_switch_count = LONG_MIN; /* see check_hung_task() */
 #endif
 
 	tsk->mm = NULL;
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 3952ab1..0f6233c 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -72,7 +72,7 @@ static struct notifier_block panic_block = {
 
 static void check_hung_task(struct task_struct *t, unsigned long timeout)
 {
-	unsigned long switch_count = t->nvcsw + t->nivcsw;
+	unsigned long switch_count;
 
 	/*
 	 * Ensure the task is not frozen.
@@ -81,19 +81,18 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 	if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
 	    return;
 
-	/*
-	 * When a freshly created task is scheduled once, changes its state to
-	 * TASK_UNINTERRUPTIBLE without having ever been switched out once, it
-	 * musn't be checked.
-	 */
-	if (unlikely(!switch_count))
-		return;
-
-	if (switch_count != t->last_switch_count) {
+	/* Ignore MSB, see below. LONG_MAX = ~LONG_MIN. */
+	switch_count = (t->nvcsw + t->nivcsw) & LONG_MAX;
+	if (switch_count != (LONG_MAX & t->last_switch_count)) {
 		t->last_switch_count = switch_count;
 		return;
 	}
 
+	/* We use MSB to mark this task as already reported. */
+	if (t->last_switch_count & LONG_MIN)
+		return;
+	t->last_switch_count |= LONG_MIN;
+
 	trace_sched_process_hang(t);
 
 	if (!sysctl_hung_task_warnings)
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2013-10-31 10:54 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-10-19 16:18 [PATCH 0/1] hung_task debugging: Add tracepoint to report the hang Oleg Nesterov
2013-10-19 16:18 ` [PATCH 1/1] " Oleg Nesterov
2013-10-20  8:48   ` Ingo Molnar
2013-10-29 19:10     ` Oleg Nesterov
2013-10-31 10:53   ` [tip:core/locking] " tip-bot for Oleg Nesterov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.