From: Lance Yang <lance.yang@linux.dev>
To: atomlin@atomlin.com
Cc: lance.yang@linux.dev, akpm@linux-foundation.org,
mhiramat@kernel.org, pmladek@suse.com,
linux-kernel@vger.kernel.org, david.laight.linux@gmail.com,
neelx@suse.com, sean@ashe.io, chjohnst@gmail.com, steve@abita.co,
mproche@gmail.com, nick.lange@gmail.com
Subject: Re: [PATCH v2] hung_task: Add per-round stack trace deduplication
Date: Sun, 21 Jun 2026 13:17:18 +0800
Message-ID: <20260621051718.64919-1-lance.yang@linux.dev>
In-Reply-To: <ou2kjpz7ojgu7xb2bv6hwkzpr7mqodh5oxji5fl4zdkw775zko@aaskidk2i6ka>

On Sat, Jun 20, 2026 at 01:54:56PM -0400, Aaron Tomlin wrote:
[...]
>On Sat, Jun 20, 2026 at 11:37:15AM +0800, Lance Yang wrote:
>> Hi Aaron,
>>
>> On Fri, Jun 19, 2026 at 09:35:59PM -0400, Aaron Tomlin wrote:
>> >Currently, when multiple tasks hang in the exact same location (e.g.,
>> >such as severe contention for a mutex), khungtaskd indiscriminately
>> >reports every single instance. This wastes ring buffer space with
>> >identical stack traces up to the defined warning limit (i.e.,
>> >kernel.hung_task_warnings), obscuring the root cause without providing
>> >any additional diagnostic value.
>> >
>> >Introduce a lightweight, hash-based stack trace deduplicator for
>> >khungtaskd to ensure only unique stack traces are reported during
>> >a single detection interval.
>> >
>> >Technical details of the implementation:
>> > - Uses a 12-bit hash table (4096 slots), consuming just 16 KB of
>> > static memory to prevent cache thrashing during massive hangs.
>> >
>> > - Operates purely serially within the single khungtaskd thread,
>> > requiring zero atomic operations or concurrent locking overhead.
>> >
>> > - Flushes the lossy cache via memset() at the beginning of each
>> > detection round. This ensures the immediate "thundering herd" of
>> > duplicates is suppressed, but guarantees the system will not
>> > permanently suppress identical hangs that occur in future rounds.
>> >
>> > - Introduces a new sysctl, kernel.hung_task_dedup, which defaults to 1
>> > (enabled). The sysctl is locally cached at the outset of each
>> > interval to prevent tearing caused by concurrent userspace toggling.
>> >
>>
>> Thanks for working on this, but ... guess I'll be the bad guy here, not
>> convinced this should go in ...
>>
>> When khungtaskd fires, something is already wrong, no? I don't see why it
>> should grow a new sysctl, a stack hash table, and extra filtering logic
>> just to hide part of the report ...
>>
>> Emm ... do you have real cases where duplicate hung-task stacks caused
>> serious pain?
>>
>> If many tasks hang at once, usually one root cause, not a bunch of
>> different bugs. At least from what I've seen, any one of those stacks is
>> enough to start debugging ...
>>
>> We already have hung_task_detect_count and trace_sched_process_hang() for
>> basic counting/observability. Even if hung_task_warnings is finite and
>> the warning budget runs out, we still don't lose detections: counter gets
>> bumped and tracepoint fires before printk output is gated :)
>>
>> If someone wants stack grouping, I'd rather leave that to a tool than add
>> another policy knob to khungtaskd. Once it lands, maintainers have to
>> carry it forever. Not every nice-to-have feature is worth that cost, IMHO
>>
>> And if someone really wants more hung-task stacks in the log, we already
>> have hung_task_warnings for that. Raise it, or set it to -1.
>>
>> Also, looking at the v1 thread, I don't think the concerns there have
>> really settled yet ... If nobody replies, maybe give it a week before
>> sending a new version.
>
>Hi Lance,
>
>Thank you for taking the time to review the patch and for your candour.

I think your reply still misses most of my concerns ...

>You raise an entirely fair point regarding maintainability; every new
>control knob indeed carries a permanent cost for the maintainers, and I
>respect your caution.

Yeah, that matters, IMHO.

>To answer your question regarding real world pain: the primary issue is not
>merely visual clutter, but the premature exhaustion of the warning budget
>and the preservation of the kernel ring buffer during cascading failures.

Right, but that still sounds like a very specific case.

When khungtaskd fires, something is already wrong, no?

Even with one identical stack, per-round dedup only helps inside one
scan. The same stack can still come back in later rounds and burn through
hung_task_warnings anyway.

And under heavy contention, I would not expect only one stack anyway.
Different tasks can hang behind different locks or different callers, and
those stacks can still burn through the warning budget.

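To make the per-round point concrete, here is a rough user-space sketch, not the patch's code. The table size follows the patch description (12 bits, 4096 slots, 16 KB); the FNV-1a hash, the names stack_hash(), seen_this_round() and simulate(), and the fake return addresses are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEDUP_BITS  12
#define DEDUP_SLOTS (1u << DEDUP_BITS)	/* 4096 slots, per the patch description */

static uint32_t seen[DEDUP_SLOTS];	/* 4096 * 4 bytes = 16 KB */

/* Assumed hash: FNV-1a over the return addresses (the patch may differ). */
static uint32_t stack_hash(const uintptr_t *frames, size_t n)
{
	uint32_t h = 2166136261u;

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		for (size_t b = 0; b < sizeof(frames[i]); b++) {
			h ^= (uint8_t)(frames[i] >> (8 * b));
			h *= 16777619u;
		}
	}
	return h ? h : 1;	/* 0 is reserved for "empty slot" */
}

/* Returns 1 if this hash was already recorded in the current round. */
static int seen_this_round(uint32_t h)
{
	uint32_t *slot = &seen[h & (DEDUP_SLOTS - 1)];

	if (*slot == h)
		return 1;
	*slot = h;		/* lossy: a colliding entry is overwritten */
	return 0;
}

/*
 * Simulate `rounds` scans in which `tasks` tasks all hang on one
 * identical stack. Returns the warning budget that is left.
 */
static int simulate(int rounds, int tasks, int warnings)
{
	const uintptr_t stack[] = { 0x1000, 0x2000, 0x3000 };
	uint32_t h = stack_hash(stack, 3);

	for (int r = 0; r < rounds; r++) {
		memset(seen, 0, sizeof(seen));	/* flush at the start of each round */
		for (int t = 0; t < tasks; t++) {
			if (seen_this_round(h))
				continue;	/* duplicate within this round */
			if (warnings > 0)
				warnings--;	/* one report, one unit of budget */
		}
	}
	return warnings;
}
```

With a budget of 10, simulate(1, 10, 10) leaves 9 warnings, but simulate(10, 10, 10) leaves 0: one identical stack still drains the budget, just one round at a time.
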
>In our production environments, we typically leave
>kernel.hung_task_warnings at its default value of 10. If a severe lock
>contention occurs, a single bottleneck can easily cause 10 tasks to hang
>simultaneously with the exact same stack trace. Under the current logic,

Not sure I buy this premise :)

Same bottleneck does not necessarily mean exactly the same stack.
Different callers can block on the same lock, and exact-stack dedup won't
help there.

At least from cases I've looked at, I can't really recall seeing this
exact pattern often enough to justify a new khungtaskd knob.

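A small user-space sketch of the different-callers case, under assumptions: the FNV-1a hash, the helper names stack_hash() and unique_reports(), the three-frame depth and the fake addresses are all illustrative and not taken from the patch. Two stacks share the lock-path frames and differ only in the caller frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAMES     3	/* illustrative stack depth */
#define MAX_STACKS 16	/* illustrative per-round limit */

/* Assumed hash: FNV-1a over the return addresses. */
static uint32_t stack_hash(const uintptr_t *frames, size_t n)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < n; i++) {
		for (size_t b = 0; b < sizeof(frames[i]); b++) {
			h ^= (uint8_t)(frames[i] >> (8 * b));
			h *= 16777619u;
		}
	}
	return h;
}

/* Number of stacks that exact-stack dedup would still print in one round. */
static size_t unique_reports(const uintptr_t stacks[][FRAMES], size_t count)
{
	uint32_t seen[MAX_STACKS];
	size_t nseen = 0;

	assert(count <= MAX_STACKS);
	for (size_t i = 0; i < count; i++) {
		uint32_t h = stack_hash(stacks[i], FRAMES);
		int dup = 0;

		for (size_t j = 0; j < nseen; j++)
			if (seen[j] == h)
				dup = 1;
		if (!dup)
			seen[nseen++] = h;
	}
	return nseen;
}
```

Three tasks with the identical stack collapse to one report, but two tasks blocked on the same lock from different callers (same first two frames, different third frame) still produce two reports, so each of them costs a warning.
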
>those 10 identical traces will completely exhaust the warning budget.
>Consequently, the kernel is left entirely blind to any subsequent or
>completely unrelated deadlocks that might be occurring concurrently, as all
>further reports are silenced.

I don't think "entirely blind" is accurate.

hung_task_warnings *only* gates printk. We still bump
hung_task_detect_count and hit trace_sched_process_hang() before that
gate.

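The ordering can be modelled in a few lines of user-space C. This is only a sketch of the behaviour described above, not the kernel source: struct hung_state and report_hang() are invented names, and the plain counters merely stand in for hung_task_detect_count, trace_sched_process_hang() and the printk output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the khungtaskd reporting state. */
struct hung_state {
	unsigned long detect_count;	/* stands in for hung_task_detect_count */
	unsigned long trace_events;	/* stands in for trace_sched_process_hang() */
	unsigned long printed;		/* reports that reached the log */
	int warnings;			/* stands in for hung_task_warnings, -1 = unlimited */
};

static void report_hang(struct hung_state *s)
{
	assert(s != NULL);

	/* Counting and tracing happen first, regardless of the budget. */
	s->detect_count++;
	s->trace_events++;

	/* Only the log output is gated by the warning budget. */
	if (s->warnings == 0)
		return;
	if (s->warnings > 0)
		s->warnings--;
	s->printed++;
}
```

Feeding 15 hangs into a state with warnings = 10 prints 10 reports, while both counters still reach 15. With warnings = -1 all 15 are printed and the budget is never consumed.
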
>Furthermore, dumping a full stack trace for every duplicate rapidly injects
>several of lines of identical noise into dmesg. We have found that this
>sudden burst frequently rolls the circular ring buffer.
>
>Userspace tooling is unfortunately unable to group or analyse logs that
>have already been evicted before the tool could read them, nor can it
>recover traces the kernel silently dropped due to an exhausted budget.
>
>The deduplicator acts as a telemetry filter, ensuring that the limited
>warning budget is spent strictly on unique traces rather than redundant
>noise, thereby preserving the history of the crash and ensuring secondary
>failures are not obscured.
>
>I wanted to clarify the exact operational context and the limitation of
>relying on userspace. Please let me know if this operational context alters
>your perspective at all.
[...]

Aaron, you've done good work in khungtaskd, and some of it is upstream
already. I do appreciate that!

But this one feels different. Useful locally, maybe, but not something
the kernel should carry forever.

Anyway, I'll stop here. Still a nack from my side.

Thanks,
Lance