From: Lance Yang
To: atomlin@atomlin.com
Cc: lance.yang@linux.dev, akpm@linux-foundation.org, mhiramat@kernel.org,
    pmladek@suse.com, linux-kernel@vger.kernel.org,
    david.laight.linux@gmail.com, neelx@suse.com, sean@ashe.io,
    chjohnst@gmail.com, steve@abita.co, mproche@gmail.com,
    nick.lange@gmail.com
Subject: Re: [PATCH v2] hung_task: Add per-round stack trace deduplication
Date: Sun, 21 Jun 2026 13:17:18 +0800
Message-Id: <20260621051718.64919-1-lance.yang@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Jun 20, 2026 at 01:54:56PM -0400, Aaron Tomlin wrote:
[...]
>On Sat, Jun 20, 2026 at 11:37:15AM +0800, Lance Yang wrote:
>> Hi Aaron,
>>
>> On Fri, Jun 19, 2026 at 09:35:59PM -0400, Aaron Tomlin wrote:
>> >Currently, when multiple tasks hang in the exact same location (e.g.,
>> >such as severe contention for a mutex), khungtaskd indiscriminately
>> >reports every single instance. This wastes ring buffer space with
>> >identical stack traces up to the defined warning limit (i.e.,
>> >kernel.hung_task_warnings), obscuring the root cause without providing
>> >any additional diagnostic value.
>> >
>> >Introduce a lightweight, hash-based stack trace deduplicator for
>> >khungtaskd to ensure only unique stack traces are reported during
>> >a single detection interval.
>> >
>> >Technical details of the implementation:
>> > - Uses a 12-bit hash table (4096 slots), consuming just 16 KB of
>> >   static memory to prevent cache thrashing during massive hangs.
>> >
>> > - Operates purely serially within the single khungtaskd thread,
>> >   requiring zero atomic operations or concurrent locking overhead.
>> >
>> > - Flushes the lossy cache via memset() at the beginning of each
>> >   detection round. This ensures the immediate "thundering herd" of
>> >   duplicates is suppressed, but guarantees the system will not
>> >   permanently suppress identical hangs that occur in future rounds.
>> >
>> > - Introduces a new sysctl, kernel.hung_task_dedup, which defaults to 1
>> >   (enabled). The sysctl is locally cached at the outset of each
>> >   interval to prevent tearing caused by concurrent userspace toggling.
>> >
>>
>> Thanks for working on this, but ... guess I'll be the bad guy here, not
>> convinced this should go in ...
>>
>> When khungtaskd fires, something is already wrong, no? I don't see why it
>> should grow a new sysctl, a stack hash table, and extra filtering logic
>> just to hide part of the report ...
>>
>> Emm ... do you have real cases where duplicate hung-task stacks caused
>> serious pain?
>>
>> If many tasks hang at once, usually one root cause, not a bunch of
>> different bugs. At least from what I've seen, any one of those stacks is
>> enough to start debugging ...
>>
>> We already have hung_task_detect_count and trace_sched_process_hang() for
>> basic counting/observability.
>> Even if hung_task_warnings is finite and
>> the warning budget runs out, we still don't lose detections: counter gets
>> bumped and tracepoint fires before printk output is gated :)
>>
>> If someone wants stack grouping, I'd rather leave that to a tool than add
>> another policy knob to khungtaskd. Once it lands, maintainers have to
>> carry it forever. Not every nice-to-have feature is worth that cost, IMHO
>>
>> And if someone really wants more hung-task stacks in the log, we already
>> have hung_task_warnings for that. Raise it, or set it to -1.
>>
>> Also, looking at the v1 thread, I don't think the concerns there have
>> really settled yet ... If nobody replies, maybe give it a week before
>> sending a new version.
>
>Hi Lance,
>
>Thank you for taking the time to review the patch and for your candour.

I think your reply still misses most of my concerns ...

>You raise an entirely fair point regarding maintainability; every new
>control knob indeed carries a permanent cost for the maintainers, and I
>respect your caution.

Yeah, that matters, IMHO.

>To answer your question regarding real world pain: the primary issue is not
>merely visual clutter, but the premature exhaustion of the warning budget
>and the preservation of the kernel ring buffer during cascading failures.

Right, but that still sounds like a very specific case. When khungtaskd
fires, something is already wrong, no?

Even with one identical stack, per-round dedup only helps inside one scan.
The same stack can still come back in later rounds and burn through
hung_task_warnings anyway.

And under heavy contention, I would not expect only one stack anyway.
Different tasks can hang behind different locks or different callers, and
those stacks can still burn through the warning budget.

>In our production environments, we typically leave
>kernel.hung_task_warnings at its default value of 10.
>If a severe lock
>contention occurs, a single bottleneck can easily cause 10 tasks to hang
>simultaneously with the exact same stack trace. Under the current logic,

Not sure I buy this premise :)

Same bottleneck does not necessarily mean exactly the same stack. Different
callers can block on the same lock, and exact-stack dedup won't help there.

At least from cases I've looked at, I can't really recall seeing this exact
pattern often enough to justify a new khungtaskd knob.

>those 10 identical traces will completely exhaust the warning budget.
>Consequently, the kernel is left entirely blind to any subsequent or
>completely unrelated deadlocks that might be occurring concurrently, as all
>further reports are silenced.

I don't think "entirely blind" is accurate. hung_task_warnings *only* gates
printk. We still bump hung_task_detect_count and hit
trace_sched_process_hang() before that gate.

>Furthermore, dumping a full stack trace for every duplicate rapidly injects
>several of lines of identical noise into dmesg. We have found that this
>sudden burst frequently rolls the circular ring buffer.
>
>Userspace tooling is unfortunately unable to group or analyse logs that
>have already been evicted before the tool could read them, nor can it
>recover traces the kernel silently dropped due to an exhausted budget.
>
>The deduplicator acts as a telemetry filter, ensuring that the limited
>warning budget is spent strictly on unique traces rather than redundant
>noise, thereby preserving the history of the crash and ensuring secondary
>failures are not obscured.
>
>I wanted to clarify the exact operational context and the limitation of
>relying on userspace. Please let me know if this operational context alters
>your perspective at all.

[...]

Aaron, you've done good work in khungtaskd, and some of it is upstream
already. I do appreciate that!

But this one feels different. Useful locally, maybe, but not something the
kernel should carry forever.

Anyway, I'll stop here.
Still a nack from my side.

Thanks,
Lance