* [PATCH v3] hung_task: deduplicate identical hang reports
@ 2026-06-21 21:37 Aaron Tomlin
  2026-06-22  0:18 ` Masami Hiramatsu
  2026-06-22 15:58 ` Petr Mladek
  0 siblings, 2 replies; 4+ messages in thread
From: Aaron Tomlin @ 2026-06-21 21:37 UTC
  To: akpm, lance.yang, mhiramat, pmladek
  Cc: linux-kernel, david.laight.linux, atomlin, neelx, sean, chjohnst,
	steve, mproche, nick.lange

Currently, during severe lock contention, multiple tasks can hang while
waiting on the exact same resource. The khungtaskd kthread
indiscriminately reports every single instance with a stack trace.
This can roll the kernel ring buffer and prematurely exhaust the
kernel.hung_task_warnings budget. Consequently, the kernel is left
entirely blind to subsequent, unrelated deadlocks.

To preserve the warning budget and ring buffer without sacrificing
observability, introduce a Wait Channel (wchan) and task-state based
deduplicator:

    1. Implement a lightweight, stack-allocated 64-slot Wait Channel
       (wchan) hash map. Tasks blocked on the exact same wchan during a
       single scan are recognised as sharing the same bottleneck,
       successfully deduplicating contentions even when the callers
       possess entirely disparate call stacks.

    2. Introduce a hung_task_reported bit-field in task_struct. If a task
       remains hung across multiple intervals, khungtaskd recognises it
       has already been reported. The bit is safely cleared without
       locks or atomics the moment the task's context switch counter
       increments.

    3. For duplicate tasks, we still print the single-line
       "INFO: task ..." message and trigger tracepoint
       trace_sched_process_hang(). It merely skips calling
       sched_show_task() and debug_show_blocker(), printing a concise
       suppression notice instead.
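
The per-scan lookup described in point 1 can be sketched in userspace C as
follows. This is only an illustration under stated assumptions, not the
patch itself: wchan_slot() is a stand-in for the kernel's hash_long() on a
64-bit build, and struct wchan_table, wchan_table_init() and wchan_seen()
are hypothetical names introduced here.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WCHAN_BITS  6                   /* 2^6 = 64 slots, as in the patch */
#define WCHAN_SLOTS (1u << WCHAN_BITS)

/* Assumed stand-in for hash_long(): multiplicative hash, top bits kept. */
static unsigned int wchan_slot(uint64_t wchan)
{
	return (unsigned int)((wchan * 0x61C8864680B583EBull) >>
			      (64 - WCHAN_BITS));
}

/* Hypothetical wrapper around the on-stack array; 0 marks an empty slot. */
struct wchan_table {
	uint64_t slot[WCHAN_SLOTS];
};

static void wchan_table_init(struct wchan_table *tbl)
{
	memset(tbl, 0, sizeof(*tbl));
}

/*
 * Return true if @wchan was already recorded during this scan.
 * A wchan of 0 is never treated as a duplicate. When two different
 * wchans land in the same slot, the newer one overwrites the older.
 */
static bool wchan_seen(struct wchan_table *tbl, uint64_t wchan)
{
	unsigned int idx;

	if (!wchan)
		return false;

	idx = wchan_slot(wchan);
	if (tbl->slot[idx] == wchan)
		return true;

	tbl->slot[idx] = wchan;
	return false;
}
```

Because the full wchan value is compared, a slot collision can only cause
a missed duplicate (the older entry is overwritten); it cannot mark an
unrelated wchan as a duplicate.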

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
--
Changes since v2:

 - Replaced the per-round cache flush with a task_struct bit-field for
   persistent cross-scan tracking, mitigating delayed budget exhaustion

 - Abandoned exact-stack hashing in favour of Wait Channel hashing

 - Transitioned from jhash() to hash_long() to optimise single-pointer
   hashing, and relocated the hash map to the local stack

 - Linked to v2: https://lore.kernel.org/lkml/20260620013559.1537893-1-atomlin@atomlin.com/

Changes since v1:

 - Preserve "INFO:" headers for all hung tasks; suppress only the stack
   dumps for duplicates (Masami Hiramatsu)

 - Print a clear notification when a trace is explicitly suppressed

 - Add #ifdef CONFIG_STACKTRACE guards to prevent Kconfig build errors

 - Optimise overhead by unwinding the stack only if a warning is
   actually going to be printed

 - Linked to v1: https://lore.kernel.org/lkml/20260617184841.1447955-1-atomlin@atomlin.com/
---
 include/linux/sched.h |  3 +++
 kernel/hung_task.c    | 32 ++++++++++++++++++++++++++++----
 2 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b3204a15d512..e76cf221cc78 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1046,6 +1046,9 @@ struct task_struct {
 	/* Used by page_owner=on to detect recursion in page tracking. */
 	unsigned			in_page_owner:1;
 #endif
+#ifdef CONFIG_DETECT_HUNG_TASK
+	unsigned			hung_task_reported:1;
+#endif
 #ifdef CONFIG_EVENTFD
 	/* Recursion prevention for eventfd_signal() */
 	unsigned			in_eventfd:1;
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 6fcc94ce4ca9..5dcce0e7041b 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -25,6 +25,7 @@
 #include <linux/hung_task.h>
 #include <linux/rwsem.h>
 #include <linux/sys_info.h>
+#include <linux/hash.h>
 
 #include <trace/events/sched.h>
 
@@ -125,6 +126,7 @@ static bool task_is_hung(struct task_struct *t, unsigned long timeout)
 	if (switch_count != t->last_switch_count) {
 		t->last_switch_count = switch_count;
 		t->last_switch_time = jiffies;
+		t->hung_task_reported = 0;
 		return false;
 	}
 	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
@@ -228,12 +230,14 @@ static inline void debug_show_blocker(struct task_struct *task, unsigned long ti
  * @t: Pointer to the detected hung task.
  * @timeout: Timeout threshold for detecting hung tasks
  * @this_round_count: Count of hung tasks detected in the current iteration
+ * @skip_show_task: Indicating if stack trace should be skipped
  *
  * Print structured information about the specified hung task, if warnings
  * are enabled or if the panic batch threshold is exceeded.
  */
 static void hung_task_info(struct task_struct *t, unsigned long timeout,
-			   unsigned long this_round_count)
+			   unsigned long this_round_count,
+			   unsigned int skip_show_task)
 {
 	trace_sched_process_hang(t);
 
@@ -261,8 +265,12 @@ static void hung_task_info(struct task_struct *t, unsigned long timeout,
 			pr_err("      Blocked by coredump.\n");
 		pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
 			" disables this message.\n");
-		sched_show_task(t);
-		debug_show_blocker(t, timeout);
+		if (!skip_show_task) {
+			sched_show_task(t);
+			debug_show_blocker(t, timeout);
+		} else {
+			pr_err("      Stack trace suppressed. Already reported or duplicate wchan\n");
+		}
 
 		if (!sysctl_hung_task_warnings)
 			pr_info("Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings\n");
@@ -306,6 +314,9 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 	unsigned long this_round_count;
 	int need_warning = sysctl_hung_task_warnings;
 	unsigned long si_mask = hung_task_si_mask;
+	unsigned long wchan, wchan_hash[64] = { 0 };
+	unsigned int hash;
+	unsigned int skip_show_task;
 
 	/*
 	 * If the system crashed already then all bets are off,
@@ -326,6 +337,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 		}
 
 		if (task_is_hung(t, timeout)) {
+			skip_show_task = t->hung_task_reported;
 			/*
 			 * Increment the global counter so that userspace could
 			 * start migrating tasks ASAP. But count the current
@@ -334,7 +346,19 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 			 */
 			atomic_long_inc(&sysctl_hung_task_detect_count);
 			this_round_count++;
-			hung_task_info(t, timeout, this_round_count);
+
+			wchan = get_wchan(t);
+			if (wchan) {
+				hash = hash_long(wchan, 6);
+				if (wchan_hash[hash] == wchan)
+					skip_show_task = 1;
+				else
+					wchan_hash[hash] = wchan;
+			}
+
+			hung_task_info(t, timeout, this_round_count,
+				       skip_show_task);
+			t->hung_task_reported = 1;
 		}
 	}
  unlock:
-- 
2.51.0



* Re: [PATCH v3] hung_task: deduplicate identical hang reports
  2026-06-21 21:37 [PATCH v3] hung_task: deduplicate identical hang reports Aaron Tomlin
@ 2026-06-22  0:18 ` Masami Hiramatsu
  2026-06-22 15:58 ` Petr Mladek
  1 sibling, 0 replies; 4+ messages in thread
From: Masami Hiramatsu @ 2026-06-22  0:18 UTC
  To: Aaron Tomlin
  Cc: akpm, lance.yang, pmladek, linux-kernel, david.laight.linux,
	neelx, sean, chjohnst, steve, mproche, nick.lange

On Sun, 21 Jun 2026 17:37:56 -0400
Aaron Tomlin <atomlin@atomlin.com> wrote:

> Currently, during severe lock contention, multiple tasks can hang while
> waiting on the exact same resource. The khungtaskd kthread
> indiscriminately reports every single instance with a stack trace.
> This can roll the kernel ring buffer and prematurely exhaust the
> kernel.hung_task_warnings budget. Consequently, the kernel is left
> entirely blind to subsequent, unrelated deadlocks.
> 
> To preserve the warning budget and ring buffer without sacrificing
> observability, introduce a Wait Channel (wchan) and task-state based
> deduplicator:
> 
>     1. Implement a lightweight, stack-allocated 64-slot Wait Channel
>        (wchan) hash map. Tasks blocked on the exact same wchan during a
>        single scan are recognised as sharing the same bottleneck,
>        successfully deduplicating contentions even when the callers
>        possess entirely disparate call stacks.

Hmm, wouldn't this essentially hide everything we would normally expect
to see for tasks blocked on an ordinary lock?
Ideally, we would sort the waiters by the time they first blocked
and display only the oldest stack.
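
A rough userspace sketch of that selection, for illustration only. The
struct hung_waiter type, its fields and oldest_waiter() are hypothetical
names, and the sketch assumes the waiters have already been collected
into an array, which the posted patch does not do (it reports during a
single walk of the task list). Timestamp wraparound is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one hung task, for illustration only. */
struct hung_waiter {
	int      pid;
	uint64_t wchan;
	uint64_t blocked_since;  /* smaller value: blocked for longer */
};

/*
 * Return the index of the waiter that has been blocked on @wchan the
 * longest, or -1 if no waiter in w[0..n) is blocked there.
 */
static int oldest_waiter(const struct hung_waiter *w, size_t n,
			 uint64_t wchan)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (w[i].wchan != wchan)
			continue;
		if (best < 0 || w[i].blocked_since < w[best].blocked_since)
			best = (int)i;
	}
	return best;
}
```

Only the task at the returned index would get a full stack dump; the
others sharing the wchan would get the one-line notice.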

> 
>     2. Introduce a hung_task_reported bit-field in task_struct. If a task
>        remains hung across multiple intervals, khungtaskd recognises it
>        has already been reported. The bit is safely cleared without
>        locks or atomics the moment the task's context switch counter
>        increments.
> 
>     3. For duplicate tasks, we still print the single-line
>        "INFO: task ..." message and trigger tracepoint
>        trace_sched_process_hang(). It merely skips calling
>        sched_show_task() and debug_show_blocker(), printing a concise
>        suppression notice instead.

Ah, OK. So if we need more information, we can record it in the trace
ring buffer.

> 
> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
> --
> Changes since v2:
> 
>  - Replaced the per-round cache flush with a task_struct bit-field for
>    persistent cross-scan tracking, mitigating delayed budget exhaustion
> 
>  - Abandoned exact-stack hashing in favour of Wait Channel hashing
> 
>  - Transitioned from jhash() to hash_long() to optimise single-pointer
>    hashing, and relocated the hash map to the local stack
> 
>  - Linked to v2: https://lore.kernel.org/lkml/20260620013559.1537893-1-atomlin@atomlin.com/
> 
> Changes since v1:
> 
>  - Preserve "INFO:" headers for all hung tasks; suppress only the stack
>    dumps for duplicates (Masami Hiramatsu)
> 
>  - Print a clear notification when a trace is explicitly suppressed
> 
>  - Add #ifdef CONFIG_STACKTRACE guards to prevent Kconfig build errors
> 
>  - Optimise overhead by unwinding the stack only if a warning is
>    actually going to be printed
> 
>  - Linked to v1: https://lore.kernel.org/lkml/20260617184841.1447955-1-atomlin@atomlin.com/
> ---
>  include/linux/sched.h |  3 +++
>  kernel/hung_task.c    | 32 ++++++++++++++++++++++++++++----
>  2 files changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b3204a15d512..e76cf221cc78 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1046,6 +1046,9 @@ struct task_struct {
>  	/* Used by page_owner=on to detect recursion in page tracking. */
>  	unsigned			in_page_owner:1;
>  #endif
> +#ifdef CONFIG_DETECT_HUNG_TASK
> +	unsigned			hung_task_reported:1;
> +#endif
>  #ifdef CONFIG_EVENTFD
>  	/* Recursion prevention for eventfd_signal() */
>  	unsigned			in_eventfd:1;
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 6fcc94ce4ca9..5dcce0e7041b 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -25,6 +25,7 @@
>  #include <linux/hung_task.h>
>  #include <linux/rwsem.h>
>  #include <linux/sys_info.h>
> +#include <linux/hash.h>
>  
>  #include <trace/events/sched.h>
>  
> @@ -125,6 +126,7 @@ static bool task_is_hung(struct task_struct *t, unsigned long timeout)
>  	if (switch_count != t->last_switch_count) {
>  		t->last_switch_count = switch_count;
>  		t->last_switch_time = jiffies;
> +		t->hung_task_reported = 0;
>  		return false;
>  	}
>  	if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
> @@ -228,12 +230,14 @@ static inline void debug_show_blocker(struct task_struct *task, unsigned long ti
>   * @t: Pointer to the detected hung task.
>   * @timeout: Timeout threshold for detecting hung tasks
>   * @this_round_count: Count of hung tasks detected in the current iteration
> + * @skip_show_task: Indicating if stack trace should be skipped
>   *
>   * Print structured information about the specified hung task, if warnings
>   * are enabled or if the panic batch threshold is exceeded.
>   */
>  static void hung_task_info(struct task_struct *t, unsigned long timeout,
> -			   unsigned long this_round_count)
> +			   unsigned long this_round_count,
> +			   unsigned int skip_show_task)
>  {
>  	trace_sched_process_hang(t);
>  
> @@ -261,8 +265,12 @@ static void hung_task_info(struct task_struct *t, unsigned long timeout,
>  			pr_err("      Blocked by coredump.\n");
>  		pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
>  			" disables this message.\n");
> -		sched_show_task(t);
> -		debug_show_blocker(t, timeout);
> +		if (!skip_show_task) {
> +			sched_show_task(t);
> +			debug_show_blocker(t, timeout);
> +		} else {
> +			pr_err("      Stack trace suppressed. Already reported or duplicate wchan\n");

Can we show the wchan hash for each task, so that we can see which
tasks are waiting on the same wchan?

Thanks,

> +		}
>  
>  		if (!sysctl_hung_task_warnings)
>  			pr_info("Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings\n");
> @@ -306,6 +314,9 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>  	unsigned long this_round_count;
>  	int need_warning = sysctl_hung_task_warnings;
>  	unsigned long si_mask = hung_task_si_mask;
> +	unsigned long wchan, wchan_hash[64] = { 0 };
> +	unsigned int hash;
> +	unsigned int skip_show_task;
>  
>  	/*
>  	 * If the system crashed already then all bets are off,
> @@ -326,6 +337,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>  		}
>  
>  		if (task_is_hung(t, timeout)) {
> +			skip_show_task = t->hung_task_reported;
>  			/*
>  			 * Increment the global counter so that userspace could
>  			 * start migrating tasks ASAP. But count the current
> @@ -334,7 +346,19 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>  			 */
>  			atomic_long_inc(&sysctl_hung_task_detect_count);
>  			this_round_count++;
> -			hung_task_info(t, timeout, this_round_count);
> +
> +			wchan = get_wchan(t);
> +			if (wchan) {
> +				hash = hash_long(wchan, 6);
> +				if (wchan_hash[hash] == wchan)
> +					skip_show_task = 1;
> +				else
> +					wchan_hash[hash] = wchan;
> +			}
> +
> +			hung_task_info(t, timeout, this_round_count,
> +				       skip_show_task);
> +			t->hung_task_reported = 1;
>  		}
>  	}
>   unlock:
> -- 
> 2.51.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH v3] hung_task: deduplicate identical hang reports
  2026-06-21 21:37 [PATCH v3] hung_task: deduplicate identical hang reports Aaron Tomlin
  2026-06-22  0:18 ` Masami Hiramatsu
@ 2026-06-22 15:58 ` Petr Mladek
  2026-06-22 16:56   ` David Laight
  1 sibling, 1 reply; 4+ messages in thread
From: Petr Mladek @ 2026-06-22 15:58 UTC
  To: Aaron Tomlin
  Cc: akpm, lance.yang, mhiramat, linux-kernel, david.laight.linux,
	neelx, sean, chjohnst, steve, mproche, nick.lange

On Sun 2026-06-21 17:37:56, Aaron Tomlin wrote:
> Currently, during severe lock contention, multiple tasks can hang while
> waiting on the exact same resource. The khungtaskd kthread
> indiscriminately reports every single instance with a stack trace.
> This can roll the kernel ring buffer and prematurely exhaust the
> kernel.hung_task_warnings budget. Consequently, the kernel is left
> entirely blind to subsequent, unrelated deadlocks.
> 
> To preserve the warning budget and ring buffer without sacrificing
> observability, introduce a Wait Channel (wchan) and task-state based
> deduplicator:
> 
>     1. Implement a lightweight, stack-allocated 64-slot Wait Channel
>        (wchan) hash map. Tasks blocked on the exact same wchan during a
>        single scan are recognised as sharing the same bottleneck,
>        successfully deduplicating contentions even when the callers
>        possess entirely disparate call stacks.

I am sorry but I do not like this. It would show one random task blocked
using a locking/wait API (mutex, semaphore, wait). But it will not be
able to distinguish whether they are waiting for the same lock or
event.

It might easily skip the lock/event which is the root of the problem.

In other words, the motivation for this patch is to avoid duplicated
backtraces because the global limit of shown backtraces is too low
and it hides too much. But this would hide even more backtraces.
As a result, administrators and developers will be even more blind.

Honestly, the previous version looked more acceptable to me. The
problem with backtraces that are not exactly the same might be solved by
comparing (hashing) only the top N backtrace levels, e.g. the top 10.
Anyway, we should compare the callers of the locking/waiting API.
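
To illustrate the idea of hashing only the top N levels, here is a
minimal userspace sketch. It is not taken from any posted version of the
patch: stack_prefix_hash() and TOP_FRAMES are hypothetical names, the
entries array is assumed to hold return addresses with the innermost
frame first, and FNV-1a is used only to keep the example self-contained
(the v2 changelog mentions jhash()).

```c
#include <stddef.h>
#include <stdint.h>

#define TOP_FRAMES 10   /* assumed depth, taken from the "e.g." above */

/*
 * FNV-1a over the innermost min(nr, TOP_FRAMES) return addresses.
 * Frames beyond TOP_FRAMES do not influence the result, so two tasks
 * that reach the same wait through different outer callers hash alike.
 */
static uint64_t stack_prefix_hash(const uint64_t *entries, size_t nr)
{
	uint64_t h = 0xcbf29ce484222325ull;     /* FNV-1a offset basis */
	size_t limit = nr < TOP_FRAMES ? nr : TOP_FRAMES;
	size_t i;
	int b;

	for (i = 0; i < limit; i++) {
		for (b = 0; b < 8; b++) {
			h ^= (entries[i] >> (8 * b)) & 0xff;
			h *= 0x100000001b3ull;  /* FNV-1a prime */
		}
	}
	return h;
}
```

Two hung tasks would then be treated as duplicates when their prefix
hashes match, which compares the callers of the locking/waiting API
rather than only the wait channel.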

IMHO, we should always print backtraces of all hung tasks when
a hung_task is detected for the 1st time, because we do not
know which of the hung tasks is pointing to the root of the problem
and which is a secondary victim.

Also, I would primarily try to increase the ring buffer size when
backtraces get lost.

>     2. Introduce a hung_task_reported bit-field in task_struct. If a task
>        remains hung across multiple intervals, khungtaskd recognises it
>        has already been reported. The bit is safely cleared without
>        locks or atomics the moment the task's context switch counter
>        increments.

This also looks like an interesting optimization which might help
to reduce printing the same backtrace again and again. It looks
much better than the global limit of printed backtraces.
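
The life cycle of that flag across scans can be modelled in userspace
roughly as below. This is a simplified sketch: struct task_sim,
enum scan_result and scan_task() are hypothetical, and the timeout check
done by task_is_hung() is left out, so any task whose switch count did
not move is treated as hung.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant task_struct fields. */
struct task_sim {
	unsigned long switch_count;       /* advances whenever the task runs */
	unsigned long last_switch_count;  /* value seen by the previous scan */
	unsigned int  reported:1;         /* models hung_task_reported */
};

enum scan_result { NOT_HUNG, HUNG_FULL_REPORT, HUNG_SUPPRESSED };

/* One khungtaskd-style visit to a single task. */
static enum scan_result scan_task(struct task_sim *t)
{
	bool already;

	if (t->switch_count != t->last_switch_count) {
		/* The task ran since the last scan: forget the old report. */
		t->last_switch_count = t->switch_count;
		t->reported = 0;
		return NOT_HUNG;
	}

	already = t->reported;
	t->reported = 1;
	return already ? HUNG_SUPPRESSED : HUNG_FULL_REPORT;
}
```

A task that stays blocked gets one full report and is then suppressed on
every later scan until it has been scheduled again.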

>     3. For duplicate tasks, we still print the single-line
>        "INFO: task ..." message and trigger tracepoint
>        trace_sched_process_hang(). It merely skips calling
>        sched_show_task() and debug_show_blocker(), printing a concise
>        suppression notice instead.

Yes, this is important as well.

Best Regards,
Petr


* Re: [PATCH v3] hung_task: deduplicate identical hang reports
  2026-06-22 15:58 ` Petr Mladek
@ 2026-06-22 16:56   ` David Laight
  0 siblings, 0 replies; 4+ messages in thread
From: David Laight @ 2026-06-22 16:56 UTC
  To: Petr Mladek
  Cc: Aaron Tomlin, akpm, lance.yang, mhiramat, linux-kernel, neelx,
	sean, chjohnst, steve, mproche, nick.lange

On Mon, 22 Jun 2026 17:58:54 +0200
Petr Mladek <pmladek@suse.com> wrote:

> On Sun 2026-06-21 17:37:56, Aaron Tomlin wrote:
> > Currently, during severe lock contention, multiple tasks can hang while
> > waiting on the exact same resource. The khungtaskd kthread
> > indiscriminately reports every single instance with a stack trace.
> > This can roll the kernel ring buffer and prematurely exhaust the
> > kernel.hung_task_warnings budget. Consequently, the kernel is left
> > entirely blind to subsequent, unrelated deadlocks.
> > 
> > To preserve the warning budget and ring buffer without sacrificing
> > observability, introduce a Wait Channel (wchan) and task-state based
> > deduplicator:
> > 
> >     1. Implement a lightweight, stack-allocated 64-slot Wait Channel
> >        (wchan) hash map. Tasks blocked on the exact same wchan during a
> >        single scan are recognised as sharing the same bottleneck,
> >        successfully deduplicating contentions even when the callers
> >        possess entirely disparate call stacks.  
> 
> I am sorry but I do not like this. It would show one random task blocked
> using a locking/wait API (mutex, semaphore, wait). But it will not be
> able to distinguish whether they are waiting for the same lock or
> event.
> 
> It might easily skip the lock/event which is the root of the problem.
> 
> In other words, the motivation for this patch is to avoid duplicated
> backtraces because the global limit of shown backtraces is too low
> and it hides too much. But this would hide even more backtraces.
> As a result, administrators and developers will be even more blind.
> 
> Honestly, the previous version looked more acceptable to me. The
> problem with backtraces that are not exactly the same might be solved by
> comparing (hashing) only the top N backtrace levels, e.g. the top 10.
> Anyway, we should compare the callers of the locking/waiting API.
> 
> IMHO, we should always print backtraces of all hung tasks when
> a hung_task is detected for the 1st time, because we do not
> know which of the hung tasks is pointing to the root of the problem
> and which is a secondary victim.
> 
> Also, I would primarily try to increase the ring buffer size when
> backtraces get lost.

Mostly the traces won't be seen until they get written to a file by syslogd
(assuming it can run).
So why not write them slowly enough that it keeps up?

> 
> >     2. Introduce a hung_task_reported bit-field in task_struct. If a task
> >        remains hung across multiple intervals, khungtaskd recognises it
> >        has already been reported. The bit is safely cleared without
> >        locks or atomics the moment the task's context switch counter
> >        increments.  
> 
> This also looks like an interesting optimization which might help
> to reduce printing the same backtrace again and again. It looks
> much better than the global limit of printed backtraces.
> 
> >     3. For duplicate tasks, we still print the single-line
> >        "INFO: task ..." message and trigger tracepoint
> >        trace_sched_process_hang(). It merely skips calling
> >        sched_show_task() and debug_show_blocker(), printing a concise
> >        suppression notice instead.  
> 
> Yes, this is important as well.

And it would need to include the pid of the duplicate trace so you can see
which one it is.
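
Something along these lines, as a userspace sketch only. The names
struct dedup_slot, dedup_first_pid() and format_notice(), the return
value of -1 for "first task seen", and the wording of the notice are all
made up for illustration; the slot index uses a multiplicative hash as a
stand-in for hash_long(wchan, 6).

```c
#include <stdint.h>
#include <stdio.h>

#define DEDUP_SLOTS 64

/* Hypothetical slot: remembers which pid got the full stack dump. */
struct dedup_slot {
	uint64_t wchan;      /* 0 marks an empty slot */
	int      first_pid;
};

/*
 * Return the pid of the task already reported for @wchan in this scan,
 * or -1 if @pid is the first one (it is then recorded in the table).
 */
static int dedup_first_pid(struct dedup_slot *tbl, uint64_t wchan, int pid)
{
	struct dedup_slot *s;

	if (!wchan)
		return -1;

	/* Top 6 bits of the product select one of the 64 slots. */
	s = &tbl[(wchan * 0x61C8864680B583EBull) >> 58];
	if (s->wchan == wchan)
		return s->first_pid;

	s->wchan = wchan;
	s->first_pid = pid;
	return -1;
}

/* Build the suppression line naming the task that carries the trace. */
static int format_notice(char *buf, size_t len, int first_pid)
{
	return snprintf(buf, len,
			"      Stack trace suppressed. Duplicate of pid %d\n",
			first_pid);
}
```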

	David

> 
> Best Regards,
> Petr


