* [PATCH] hung_task: Panic after fixed number of hung tasks
@ 2025-09-25 6:06 lirongqing
2025-09-25 10:26 ` Lance Yang
2025-09-27 2:39 ` Lance Yang
0 siblings, 2 replies; 9+ messages in thread
From: lirongqing @ 2025-09-25 6:06 UTC (permalink / raw)
To: corbet, akpm, lance.yang, mhiramat, paulmck, pawan.kumar.gupta,
mingo, dave.hansen, rostedt, kees, arnd, lirongqing, feng.tang,
pauld, joel.granados, linux-doc, linux-kernel
From: Li RongQing <lirongqing@baidu.com>
Currently, when hung_task_panic is enabled, the kernel panics immediately
upon detecting the first hung task. However, some hung tasks are transient
and the system can recover fully, while others are unrecoverable and
trigger consecutive hung task reports, for which a panic is expected.
This commit adds a new sysctl parameter, hung_task_count_to_panic, that
allows specifying the number of consecutive hung tasks that must be
detected before triggering a kernel panic. This provides finer control for
environments where transient hangs may happen but persistent hangs should
still be fatal.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
Documentation/admin-guide/sysctl/kernel.rst | 6 ++++++
kernel/hung_task.c | 14 +++++++++++++-
2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 8b49eab..4240e7b 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -405,6 +405,12 @@ This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
1 Panic immediately.
= =================================================
+hung_task_count_to_panic
+========================
+
+When set to a non-zero value, the kernel triggers a panic once this number
+of consecutive hung tasks has been detected.
+
hung_task_check_count
=====================
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 8708a12..87a6421 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -83,6 +83,8 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
static unsigned int __read_mostly sysctl_hung_task_panic =
IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
+static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
+
static int
hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
{
@@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
trace_sched_process_hang(t);
- if (sysctl_hung_task_panic) {
+ if (sysctl_hung_task_panic ||
+ (sysctl_hung_task_count_to_panic &&
+ (sysctl_hung_task_detect_count >= sysctl_hung_task_count_to_panic))) {
console_verbose();
hung_task_show_lock = true;
hung_task_call_panic = true;
@@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] = {
.extra2 = SYSCTL_ONE,
},
{
+ .procname = "hung_task_count_to_panic",
+ .data = &sysctl_hung_task_count_to_panic,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ },
+ {
.procname = "hung_task_check_count",
.data = &sysctl_hung_task_check_count,
.maxlen = sizeof(int),
--
2.9.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] hung_task: Panic after fixed number of hung tasks
2025-09-25 6:06 [PATCH] hung_task: Panic after fixed number of hung tasks lirongqing
@ 2025-09-25 10:26 ` Lance Yang
2025-09-26 18:02 ` Paul E. McKenney
2025-09-27 2:39 ` Lance Yang
1 sibling, 1 reply; 9+ messages in thread
From: Lance Yang @ 2025-09-25 10:26 UTC (permalink / raw)
To: lirongqing
Cc: linux-kernel, linux-doc, arnd, feng.tang, joel.granados, kees,
rostedt, pauld, pawan.kumar.gupta, mhiramat, dave.hansen, corbet,
akpm, paulmck, mingo
Thanks for the patch!
On 2025/9/25 14:06, lirongqing wrote:
> From: Li RongQing <lirongqing@baidu.com>
>
> Currently, when hung_task_panic is enabled, kernel will panic immediately
> upon detecting the first hung task. However, some hung tasks are transient
> and the system can recover fully, while others are unrecoverable and
> trigger consecutive hung task reports, and a panic is expected.
The new hung_task_count_to_panic relies on an absolute count, but I
assume the real indicator you're trying to capture is the trend or
rate of increase over a time window (e.g., "panic if count increases
by 5 in 10 minutes").
IMHO, this kind of time-windowed, trend-based logic seems much more
flexible and better suited for a userspace monitoring agent :)
In other words, why is this the right place for this feature?
Please sell it to us ;)
Lance
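
For comparison, the time-windowed check described above ("panic if count increases by 5 in 10 minutes") could be sketched in userspace roughly as follows. This is only an illustration under stated assumptions: `struct sample` and `rate_exceeded()` are hypothetical names, and the samples are assumed to come from periodically reading a cumulative hung-task counter such as the one behind `sysctl_hung_task_detect_count`. How the agent collects samples and what it does on a hit is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* One observation of the cumulative hung-task counter (hypothetical type). */
struct sample {
	long t_sec;          /* sample time, in seconds */
	unsigned long count; /* cumulative hung-task count at that time */
};

/*
 * Return true if the counter grew by at least `delta` between the newest
 * sample and the oldest sample that still lies within the last
 * `window_sec` seconds. Samples must be ordered oldest to newest, and the
 * counter is assumed to be monotonically non-decreasing.
 */
static bool rate_exceeded(const struct sample *s, size_t n,
			  long window_sec, unsigned long delta)
{
	if (n == 0)
		return false;

	const struct sample *newest = &s[n - 1];

	/* The first sample inside the window is the oldest one inside it. */
	for (size_t i = 0; i < n; i++) {
		if (newest->t_sec - s[i].t_sec <= window_sec)
			return newest->count - s[i].count >= delta;
	}
	return false;
}
```

With samples taken at 0 s, 300 s and 700 s and a 600 s window, only the last two samples are compared, so a burst that ended before the window opened does not count against the threshold.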
>
> This commit adds a new sysctl parameter hung_task_count_to_panic to allows
> specifying the number of consecutive hung tasks that must be detected
> before triggering a kernel panic. This provides finer control for
> environments where transient hangs maybe happen but persistent hangs should
> still be fatal.
>
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> ---
> Documentation/admin-guide/sysctl/kernel.rst | 6 ++++++
> kernel/hung_task.c | 14 +++++++++++++-
> 2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index 8b49eab..4240e7b 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -405,6 +405,12 @@ This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
> 1 Panic immediately.
> = =================================================
>
> +hung_task_count_to_panic
> +=====================
> +
> +When set to a non-zero value, after the number of consecutive hung task
> +occur, the kernel will triggers a panic
> +
>
> hung_task_check_count
> =====================
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 8708a12..87a6421 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -83,6 +83,8 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
> static unsigned int __read_mostly sysctl_hung_task_panic =
> IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
>
> +static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
> +
> static int
> hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> {
> @@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
>
> trace_sched_process_hang(t);
>
> - if (sysctl_hung_task_panic) {
> + if (sysctl_hung_task_panic ||
> + (sysctl_hung_task_count_to_panic &&
> + (sysctl_hung_task_detect_count >= sysctl_hung_task_count_to_panic))) {
> console_verbose();
> hung_task_show_lock = true;
> hung_task_call_panic = true;
> @@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] = {
> .extra2 = SYSCTL_ONE,
> },
> {
> + .procname = "hung_task_count_to_panic",
> + .data = &sysctl_hung_task_count_to_panic,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + },
> + {
> .procname = "hung_task_check_count",
> .data = &sysctl_hung_task_check_count,
> .maxlen = sizeof(int),
* Re: [PATCH] hung_task: Panic after fixed number of hung tasks
2025-09-25 10:26 ` Lance Yang
@ 2025-09-26 18:02 ` Paul E. McKenney
2025-09-27 2:18 ` Lance Yang
2025-09-28 1:51 ` [External Mail] " Li,Rongqing
0 siblings, 2 replies; 9+ messages in thread
From: Paul E. McKenney @ 2025-09-26 18:02 UTC (permalink / raw)
To: Lance Yang
Cc: lirongqing, linux-kernel, linux-doc, arnd, feng.tang,
joel.granados, kees, rostedt, pauld, pawan.kumar.gupta, mhiramat,
dave.hansen, corbet, akpm, mingo
On Thu, Sep 25, 2025 at 06:26:00PM +0800, Lance Yang wrote:
>
> Thanks for the patch!
>
> On 2025/9/25 14:06, lirongqing wrote:
> > From: Li RongQing <lirongqing@baidu.com>
> >
> > Currently, when hung_task_panic is enabled, kernel will panic immediately
> > upon detecting the first hung task. However, some hung tasks are transient
> > and the system can recover fully, while others are unrecoverable and
> > trigger consecutive hung task reports, and a panic is expected.
>
> The new hung_task_count_to_panic relies on an absolute count, but I
> assume the real indicator you're trying to capture is the trend or
> rate of increase over a time window (e.g., "panic if count increases
> by 5 in 10 minutes").
>
> IMHO, this kind of time-windowed, trend-based logic seems much more
> flexible and better suited for a userspace monitoring agent :)
>
> In other words, why is this the right place for this feature?
A possibly related question is "why are RCU CPU stall warnings implemented
in the kernel instead of in userspace?" One reason is that by the
time that things get bad enough to trigger an RCU CPU stall warning,
userspace might not be capable of doing much of anything. Thus, there
is an uncomfortably high probability that orchestrating RCU CPU stall
warnings from userspace would cause these warnings to be lost entirely.
Similar reasoning might (or might not) apply to the hung-task mechanism.
Thanx, Paul
> Please sell it to us ;)
> Lance
>
> >
> > This commit adds a new sysctl parameter hung_task_count_to_panic to allows
> > specifying the number of consecutive hung tasks that must be detected
> > before triggering a kernel panic. This provides finer control for
> > environments where transient hangs maybe happen but persistent hangs should
> > still be fatal.
> >
> > Signed-off-by: Li RongQing <lirongqing@baidu.com>
> > ---
> > Documentation/admin-guide/sysctl/kernel.rst | 6 ++++++
> > kernel/hung_task.c | 14 +++++++++++++-
> > 2 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> > index 8b49eab..4240e7b 100644
> > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > @@ -405,6 +405,12 @@ This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
> > 1 Panic immediately.
> > = =================================================
> > +hung_task_count_to_panic
> > +=====================
> > +
> > +When set to a non-zero value, after the number of consecutive hung task
> > +occur, the kernel will triggers a panic
> > +
> > hung_task_check_count
> > =====================
> > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > index 8708a12..87a6421 100644
> > --- a/kernel/hung_task.c
> > +++ b/kernel/hung_task.c
> > @@ -83,6 +83,8 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
> > static unsigned int __read_mostly sysctl_hung_task_panic =
> > IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> > +static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
> > +
> > static int
> > hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> > {
> > @@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > trace_sched_process_hang(t);
> > - if (sysctl_hung_task_panic) {
> > + if (sysctl_hung_task_panic ||
> > + (sysctl_hung_task_count_to_panic &&
> > + (sysctl_hung_task_detect_count >= sysctl_hung_task_count_to_panic))) {
> > console_verbose();
> > hung_task_show_lock = true;
> > hung_task_call_panic = true;
> > @@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] = {
> > .extra2 = SYSCTL_ONE,
> > },
> > {
> > + .procname = "hung_task_count_to_panic",
> > + .data = &sysctl_hung_task_count_to_panic,
> > + .maxlen = sizeof(int),
> > + .mode = 0644,
> > + .proc_handler = proc_dointvec_minmax,
> > + .extra1 = SYSCTL_ZERO,
> > + },
> > + {
> > .procname = "hung_task_check_count",
> > .data = &sysctl_hung_task_check_count,
> > .maxlen = sizeof(int),
>
* Re: [PATCH] hung_task: Panic after fixed number of hung tasks
2025-09-26 18:02 ` Paul E. McKenney
@ 2025-09-27 2:18 ` Lance Yang
2025-09-28 1:51 ` [External Mail] " Li,Rongqing
1 sibling, 0 replies; 9+ messages in thread
From: Lance Yang @ 2025-09-27 2:18 UTC (permalink / raw)
To: paulmck
Cc: lirongqing, linux-kernel, linux-doc, arnd, feng.tang,
joel.granados, kees, rostedt, pauld, pawan.kumar.gupta, mhiramat,
dave.hansen, corbet, akpm, mingo
On 2025/9/27 02:02, Paul E. McKenney wrote:
> On Thu, Sep 25, 2025 at 06:26:00PM +0800, Lance Yang wrote:
>>
>> Thanks for the patch!
>>
>> On 2025/9/25 14:06, lirongqing wrote:
>>> From: Li RongQing <lirongqing@baidu.com>
>>>
>>> Currently, when hung_task_panic is enabled, kernel will panic immediately
>>> upon detecting the first hung task. However, some hung tasks are transient
>>> and the system can recover fully, while others are unrecoverable and
>>> trigger consecutive hung task reports, and a panic is expected.
>>
>> The new hung_task_count_to_panic relies on an absolute count, but I
>> assume the real indicator you're trying to capture is the trend or
>> rate of increase over a time window (e.g., "panic if count increases
>> by 5 in 10 minutes").
>>
>> IMHO, this kind of time-windowed, trend-based logic seems much more
>> flexible and better suited for a userspace monitoring agent :)
>>
>> In other words, why is this the right place for this feature?
>
> A possibly related question is "why are RCU CPU stall warnings implemented
> in the kernel instead of in userspace?" One reason is that by the
Fair point. I was initially leaning towards the "let userspace
handle it" camp ...
> time that things get bad enough to trigger an RCU CPU stall warning,
> userspace might not be capable of doing much of anything. Thus, there
> is an uncomfortably high probability that orchestrating RCU CPU stall
> warnings from userspace would cause these warnings to be lost entirely.
But you're right. When things really go sideways, userspace is likely
dead in the water.
>
> Similar reasoning might (or might not) apply to the hung-task mechanism.
Yes. No objection from me ;)
Thanks,
Lance
>
> Thanx, Paul
>
>> Please sell it to us ;)
>> Lance
>>
>>>
>>> This commit adds a new sysctl parameter hung_task_count_to_panic to allows
>>> specifying the number of consecutive hung tasks that must be detected
>>> before triggering a kernel panic. This provides finer control for
>>> environments where transient hangs maybe happen but persistent hangs should
>>> still be fatal.
>>>
>>> Signed-off-by: Li RongQing <lirongqing@baidu.com>
>>> ---
>>> Documentation/admin-guide/sysctl/kernel.rst | 6 ++++++
>>> kernel/hung_task.c | 14 +++++++++++++-
>>> 2 files changed, 19 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
>>> index 8b49eab..4240e7b 100644
>>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>>> @@ -405,6 +405,12 @@ This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
>>> 1 Panic immediately.
>>> = =================================================
>>> +hung_task_count_to_panic
>>> +=====================
>>> +
>>> +When set to a non-zero value, after the number of consecutive hung task
>>> +occur, the kernel will triggers a panic
>>> +
>>> hung_task_check_count
>>> =====================
>>> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
>>> index 8708a12..87a6421 100644
>>> --- a/kernel/hung_task.c
>>> +++ b/kernel/hung_task.c
>>> @@ -83,6 +83,8 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
>>> static unsigned int __read_mostly sysctl_hung_task_panic =
>>> IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
>>> +static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
>>> +
>>> static int
>>> hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
>>> {
>>> @@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
>>> trace_sched_process_hang(t);
>>> - if (sysctl_hung_task_panic) {
>>> + if (sysctl_hung_task_panic ||
>>> + (sysctl_hung_task_count_to_panic &&
>>> + (sysctl_hung_task_detect_count >= sysctl_hung_task_count_to_panic))) {
>>> console_verbose();
>>> hung_task_show_lock = true;
>>> hung_task_call_panic = true;
>>> @@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] = {
>>> .extra2 = SYSCTL_ONE,
>>> },
>>> {
>>> + .procname = "hung_task_count_to_panic",
>>> + .data = &sysctl_hung_task_count_to_panic,
>>> + .maxlen = sizeof(int),
>>> + .mode = 0644,
>>> + .proc_handler = proc_dointvec_minmax,
>>> + .extra1 = SYSCTL_ZERO,
>>> + },
>>> + {
>>> .procname = "hung_task_check_count",
>>> .data = &sysctl_hung_task_check_count,
>>> .maxlen = sizeof(int),
>>
* Re: [PATCH] hung_task: Panic after fixed number of hung tasks
2025-09-25 6:06 [PATCH] hung_task: Panic after fixed number of hung tasks lirongqing
2025-09-25 10:26 ` Lance Yang
@ 2025-09-27 2:39 ` Lance Yang
2025-09-28 1:54 ` [External Mail] " Li,Rongqing
2025-09-28 3:19 ` Li,Rongqing
1 sibling, 2 replies; 9+ messages in thread
From: Lance Yang @ 2025-09-27 2:39 UTC (permalink / raw)
To: lirongqing
Cc: linux-doc, linux-kernel, arnd, joel.granados, feng.tang, pauld,
kees, rostedt, pawan.kumar.gupta, akpm, dave.hansen, mingo,
paulmck, corbet, mhiramat
On 2025/9/25 14:06, lirongqing wrote:
> From: Li RongQing <lirongqing@baidu.com>
>
> Currently, when hung_task_panic is enabled, kernel will panic immediately
> upon detecting the first hung task. However, some hung tasks are transient
> and the system can recover fully, while others are unrecoverable and
> trigger consecutive hung task reports, and a panic is expected.
>
> This commit adds a new sysctl parameter hung_task_count_to_panic to allows
> specifying the number of consecutive hung tasks that must be detected
> before triggering a kernel panic. This provides finer control for
> environments where transient hangs maybe happen but persistent hangs should
> still be fatal.
>
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> ---
> Documentation/admin-guide/sysctl/kernel.rst | 6 ++++++
> kernel/hung_task.c | 14 +++++++++++++-
> 2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index 8b49eab..4240e7b 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -405,6 +405,12 @@ This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
> 1 Panic immediately.
> = =================================================
>
> +hung_task_count_to_panic
> +=====================
> +
> +When set to a non-zero value, after the number of consecutive hung task
> +occur, the kernel will triggers a panic
Hmm... the documentation here seems a bit misleading.
hung_task_panic=1 will always cause an immediate panic, regardless of
the hung_task_count_to_panic setting, right?
Perhaps something like this would be more accurate?
```
hung_task_count_to_panic
========================
When set to a non-zero value, a kernel panic will be triggered if
the number of detected hung tasks reaches this value.
Note that setting hung_task_panic=1 will still cause an immediate
panic on the first hung task, overriding this setting.
```
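
The precedence described in the suggested text can be modeled as a small userspace predicate. This is only a sketch that mirrors the condition in the quoted `check_hung_task()` hunk; `should_panic()` is a hypothetical helper, not a kernel function, and its parameters stand in for the kernel's sysctl variables.

```c
#include <stdbool.h>

/*
 * Userspace model of the panic condition added by the patch.
 * should_panic() is a hypothetical helper for illustration only.
 */
static bool should_panic(unsigned int hung_task_panic,
			 unsigned int count_to_panic,
			 unsigned long detect_count)
{
	/* hung_task_panic=1 always wins: panic on the first hung task. */
	if (hung_task_panic)
		return true;

	/* A threshold of 0 disables the count-based panic entirely. */
	return count_to_panic && detect_count >= count_to_panic;
}
```

Note that the comparison is against the running total of detected hung tasks, so the threshold is reached by accumulation rather than by any notion of a time window.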
>
> hung_task_check_count
> =====================
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 8708a12..87a6421 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -83,6 +83,8 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
> static unsigned int __read_mostly sysctl_hung_task_panic =
> IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
>
> +static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
Nit: while static variables are guaranteed to be zero-initialized, it's
a good practice and clearer for readers to initialize them explicitly.
static unsigned int __read_mostly sysctl_hung_task_count_to_panic = 0;
Otherwise, this patch looks good to me!
Acked-by: Lance Yang <lance.yang@linux.dev>
> +
> static int
> hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> {
> @@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
>
> trace_sched_process_hang(t);
>
> - if (sysctl_hung_task_panic) {
> + if (sysctl_hung_task_panic ||
> + (sysctl_hung_task_count_to_panic &&
> + (sysctl_hung_task_detect_count >= sysctl_hung_task_count_to_panic))) {
> console_verbose();
> hung_task_show_lock = true;
> hung_task_call_panic = true;
> @@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] = {
> .extra2 = SYSCTL_ONE,
> },
> {
> + .procname = "hung_task_count_to_panic",
> + .data = &sysctl_hung_task_count_to_panic,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + },
> + {
> .procname = "hung_task_check_count",
> .data = &sysctl_hung_task_check_count,
> .maxlen = sizeof(int),
* RE: [External Mail] Re: [PATCH] hung_task: Panic after fixed number of hung tasks
2025-09-26 18:02 ` Paul E. McKenney
2025-09-27 2:18 ` Lance Yang
@ 2025-09-28 1:51 ` Li,Rongqing
1 sibling, 0 replies; 9+ messages in thread
From: Li,Rongqing @ 2025-09-28 1:51 UTC (permalink / raw)
To: paulmck@kernel.org, Lance Yang
Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
arnd@arndb.de, feng.tang@linux.alibaba.com,
joel.granados@kernel.org, kees@kernel.org, rostedt@goodmis.org,
pauld@redhat.com, pawan.kumar.gupta@linux.intel.com,
mhiramat@kernel.org, dave.hansen@linux.intel.com, corbet@lwn.net,
akpm@linux-foundation.org, mingo@kernel.org
>
> > On 2025/9/25 14:06, lirongqing wrote:
> > > From: Li RongQing <lirongqing@baidu.com>
> > >
> > > Currently, when hung_task_panic is enabled, kernel will panic
> > > immediately upon detecting the first hung task. However, some hung
> > > tasks are transient and the system can recover fully, while others
> > > are unrecoverable and trigger consecutive hung task reports, and a panic is
> expected.
> >
> > The new hung_task_count_to_panic relies on an absolute count, but I
> > assume the real indicator you're trying to capture is the trend or
> > rate of increase over a time window (e.g., "panic if count increases
> > by 5 in 10 minutes").
> >
> > IMHO, this kind of time-windowed, trend-based logic seems much more
> > flexible and better suited for a userspace monitoring agent :)
> >
> > In other words, why is this the right place for this feature?
>
> A possibly related question is "why are RCU CPU stall warnings implemented in
> the kernel instead of in userspace?" One reason is that by the time that
> things get bad enough to trigger an RCU CPU stall warning, userspace might
> not be capable of doing much of anything. Thus, there is an uncomfortably
> high probability that orchestrating RCU CPU stall warnings from userspace
> would cause these warnings to be lost entirely.
>
Thank you, I think so too.
-Li
> Similar reasoning might (or might not) apply to the hung-task mechanism.
>
> Thanx, Paul
>
> > Please sell it to us ;)
> > Lance
> >
> > >
> > > This commit adds a new sysctl parameter hung_task_count_to_panic to
> > > allows specifying the number of consecutive hung tasks that must be
> > > detected before triggering a kernel panic. This provides finer
> > > control for environments where transient hangs maybe happen but
> > > persistent hangs should still be fatal.
> > >
> > > Signed-off-by: Li RongQing <lirongqing@baidu.com>
> > > ---
> > > Documentation/admin-guide/sysctl/kernel.rst | 6 ++++++
> > > kernel/hung_task.c | 14 +++++++++++++-
> > > 2 files changed, 19 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst
> > > b/Documentation/admin-guide/sysctl/kernel.rst
> > > index 8b49eab..4240e7b 100644
> > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > @@ -405,6 +405,12 @@ This file shows up if
> ``CONFIG_DETECT_HUNG_TASK`` is enabled.
> > > 1 Panic immediately.
> > > = =================================================
> > > +hung_task_count_to_panic
> > > +=====================
> > > +
> > > +When set to a non-zero value, after the number of consecutive hung
> > > +task occur, the kernel will triggers a panic
> > > +
> > > hung_task_check_count
> > > =====================
> > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c index
> > > 8708a12..87a6421 100644
> > > --- a/kernel/hung_task.c
> > > +++ b/kernel/hung_task.c
> > > @@ -83,6 +83,8 @@ static unsigned int __read_mostly
> sysctl_hung_task_all_cpu_backtrace;
> > > static unsigned int __read_mostly sysctl_hung_task_panic =
> > > IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> > > +static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
> > > +
> > > static int
> > > hung_task_panic(struct notifier_block *this, unsigned long event, void
> *ptr)
> > > {
> > > @@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t,
> unsigned long timeout)
> > > trace_sched_process_hang(t);
> > > - if (sysctl_hung_task_panic) {
> > > + if (sysctl_hung_task_panic ||
> > > + (sysctl_hung_task_count_to_panic &&
> > > + (sysctl_hung_task_detect_count >=
> > > +sysctl_hung_task_count_to_panic))) {
> > > console_verbose();
> > > hung_task_show_lock = true;
> > > hung_task_call_panic = true;
> > > @@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] =
> {
> > > .extra2 = SYSCTL_ONE,
> > > },
> > > {
> > > + .procname = "hung_task_count_to_panic",
> > > + .data = &sysctl_hung_task_count_to_panic,
> > > + .maxlen = sizeof(int),
> > > + .mode = 0644,
> > > + .proc_handler = proc_dointvec_minmax,
> > > + .extra1 = SYSCTL_ZERO,
> > > + },
> > > + {
> > > .procname = "hung_task_check_count",
> > > .data = &sysctl_hung_task_check_count,
> > > .maxlen = sizeof(int),
> >
* RE: [External Mail] Re: [PATCH] hung_task: Panic after fixed number of hung tasks
2025-09-27 2:39 ` Lance Yang
@ 2025-09-28 1:54 ` Li,Rongqing
2025-09-28 3:19 ` Li,Rongqing
1 sibling, 0 replies; 9+ messages in thread
From: Li,Rongqing @ 2025-09-28 1:54 UTC (permalink / raw)
To: Lance Yang
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
arnd@arndb.de, joel.granados@kernel.org,
feng.tang@linux.alibaba.com, pauld@redhat.com, kees@kernel.org,
rostedt@goodmis.org, pawan.kumar.gupta@linux.intel.com,
akpm@linux-foundation.org, dave.hansen@linux.intel.com,
mingo@kernel.org, paulmck@kernel.org, corbet@lwn.net,
mhiramat@kernel.org
> > Signed-off-by: Li RongQing <lirongqing@baidu.com>
> > ---
> > Documentation/admin-guide/sysctl/kernel.rst | 6 ++++++
> > kernel/hung_task.c | 14 +++++++++++++-
> > 2 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/admin-guide/sysctl/kernel.rst
> > b/Documentation/admin-guide/sysctl/kernel.rst
> > index 8b49eab..4240e7b 100644
> > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > @@ -405,6 +405,12 @@ This file shows up if
> ``CONFIG_DETECT_HUNG_TASK`` is enabled.
> > 1 Panic immediately.
> > = =================================================
> >
> > +hung_task_count_to_panic
> > +=====================
> > +
> > +When set to a non-zero value, after the number of consecutive hung
> > +task occur, the kernel will triggers a panic
>
> Hmm... the documentation here seems a bit misleading.
>
> hung_task_panic=1 will always cause an immediate panic, regardless of the
> hung_task_count_to_panic setting, right?
>
> Perhaps something like this would be more accurate?
>
> ```
> hung_task_count_to_panic
> ========================
>
> When set to a non-zero value, a kernel panic will be triggered if the number of
> detected hung tasks reaches this value.
>
> Note that setting hung_task_panic=1 will still cause an immediate panic on the
> first hung task, overriding this setting.
> ```
I will rewrite this documentation as you suggested.
>
> >
> > hung_task_check_count
> > =====================
> > diff --git a/kernel/hung_task.c b/kernel/hung_task.c index
> > 8708a12..87a6421 100644
> > --- a/kernel/hung_task.c
> > +++ b/kernel/hung_task.c
> > @@ -83,6 +83,8 @@ static unsigned int __read_mostly
> sysctl_hung_task_all_cpu_backtrace;
> > static unsigned int __read_mostly sysctl_hung_task_panic =
> > IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> >
> > +static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
>
> Nit: while static variables are guaranteed to be zero-initialized, it's a good
> practice and clearer for readers to initialize them explicitly.
>
> static unsigned int __read_mostly sysctl_hung_task_count_to_panic = 0;
>
Ok, I will change it
Thanks
-Li
>
> Otherwise, this patch looks good to me!
> Acked-by: Lance Yang <lance.yang@linux.dev>
>
> > +
> > static int
> > hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> > {
> > @@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t,
> > unsigned long timeout)
> >
> > trace_sched_process_hang(t);
> >
> > - if (sysctl_hung_task_panic) {
> > + if (sysctl_hung_task_panic ||
> > + (sysctl_hung_task_count_to_panic &&
> > + (sysctl_hung_task_detect_count >=
> > +sysctl_hung_task_count_to_panic))) {
> > console_verbose();
> > hung_task_show_lock = true;
> > hung_task_call_panic = true;
> > @@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] = {
> > .extra2 = SYSCTL_ONE,
> > },
> > {
> > + .procname = "hung_task_count_to_panic",
> > + .data = &sysctl_hung_task_count_to_panic,
> > + .maxlen = sizeof(int),
> > + .mode = 0644,
> > + .proc_handler = proc_dointvec_minmax,
> > + .extra1 = SYSCTL_ZERO,
> > + },
> > + {
> > .procname = "hung_task_check_count",
> > .data = &sysctl_hung_task_check_count,
> > .maxlen = sizeof(int),
* RE: [External Mail] Re: [PATCH] hung_task: Panic after fixed number of hung tasks
2025-09-27 2:39 ` Lance Yang
2025-09-28 1:54 ` [External Mail] " Li,Rongqing
@ 2025-09-28 3:19 ` Li,Rongqing
2025-09-28 3:29 ` Lance Yang
1 sibling, 1 reply; 9+ messages in thread
From: Li,Rongqing @ 2025-09-28 3:19 UTC (permalink / raw)
To: Lance Yang
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
arnd@arndb.de, joel.granados@kernel.org,
feng.tang@linux.alibaba.com, pauld@redhat.com, kees@kernel.org,
rostedt@goodmis.org, pawan.kumar.gupta@linux.intel.com,
akpm@linux-foundation.org, dave.hansen@linux.intel.com,
mingo@kernel.org, paulmck@kernel.org, corbet@lwn.net,
mhiramat@kernel.org
> -----Original Message-----
> From: Lance Yang <lance.yang@linux.dev>
> Sent: September 27, 2025 10:39
> To: Li,Rongqing <lirongqing@baidu.com>
> Cc: linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; arnd@arndb.de;
> joel.granados@kernel.org; feng.tang@linux.alibaba.com; pauld@redhat.com;
> kees@kernel.org; rostedt@goodmis.org; pawan.kumar.gupta@linux.intel.com;
> akpm@linux-foundation.org; dave.hansen@linux.intel.com; mingo@kernel.org;
> paulmck@kernel.org; corbet@lwn.net; mhiramat@kernel.org
> Subject: [External Mail] Re: [PATCH] hung_task: Panic after fixed number of hung
> tasks
>
>
>
> On 2025/9/25 14:06, lirongqing wrote:
> > From: Li RongQing <lirongqing@baidu.com>
> >
> > Currently, when hung_task_panic is enabled, kernel will panic
> > immediately upon detecting the first hung task. However, some hung
> > tasks are transient and the system can recover fully, while others are
> > unrecoverable and trigger consecutive hung task reports, and a panic is
> expected.
> >
> > This commit adds a new sysctl parameter hung_task_count_to_panic to
> > allows specifying the number of consecutive hung tasks that must be
> > detected before triggering a kernel panic. This provides finer control
> > for environments where transient hangs maybe happen but persistent
> > hangs should still be fatal.
> >
> > Signed-off-by: Li RongQing <lirongqing@baidu.com>
> > ---
> > Documentation/admin-guide/sysctl/kernel.rst | 6 ++++++
> > kernel/hung_task.c | 14 +++++++++++++-
> > 2 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/admin-guide/sysctl/kernel.rst
> > b/Documentation/admin-guide/sysctl/kernel.rst
> > index 8b49eab..4240e7b 100644
> > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > @@ -405,6 +405,12 @@ This file shows up if
> ``CONFIG_DETECT_HUNG_TASK`` is enabled.
> > 1 Panic immediately.
> > = =================================================
> >
> > +hung_task_count_to_panic
> > +=====================
> > +
> > +When set to a non-zero value, after the number of consecutive hung
> > +task occur, the kernel will triggers a panic
>
> Hmm... the documentation here seems a bit misleading.
>
> hung_task_panic=1 will always cause an immediate panic, regardless of the
> hung_task_count_to_panic setting, right?
>
> Perhaps something like this would be more accurate?
>
> ```
> hung_task_count_to_panic
> ========================
>
> When set to a non-zero value, a kernel panic will be triggered if the number of
> detected hung tasks reaches this value.
>
> Note that setting hung_task_panic=1 will still cause an immediate panic on the
> first hung task, overriding this setting.
> ```
>
> >
> > hung_task_check_count
> > =====================
> > diff --git a/kernel/hung_task.c b/kernel/hung_task.c index
> > 8708a12..87a6421 100644
> > --- a/kernel/hung_task.c
> > +++ b/kernel/hung_task.c
> > @@ -83,6 +83,8 @@ static unsigned int __read_mostly
> sysctl_hung_task_all_cpu_backtrace;
> > static unsigned int __read_mostly sysctl_hung_task_panic =
> > IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> >
> > +static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
>
> Nit: while static variables are guaranteed to be zero-initialized, it's a good
> practice and clearer for readers to initialize them explicitly.
>
> static unsigned int __read_mostly sysctl_hung_task_count_to_panic = 0;
>
>
./scripts/checkpatch.pl reports an error when a static is initialised to 0, so I will leave it without an explicit initializer:
ERROR: do not initialise statics to 0
#51: FILE: kernel/hung_task.c:86:
+static unsigned int __read_mostly sysctl_hung_task_count_to_panic = 0;
Thanks,
-Li
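For reference, a minimal userspace sketch (not kernel code) of why the explicit "= 0" adds nothing: C zero-initializes objects with static storage duration when no initializer is given. The names count_to_panic and get_count_to_panic() are hypothetical stand-ins for illustration only.

```c
/*
 * Hypothetical stand-in for sysctl_hung_task_count_to_panic.
 * No initializer on purpose: objects with static storage duration
 * start out as zero in C.
 */
static unsigned int count_to_panic;

/* Returns the current threshold; 0 means the count-based panic is off. */
unsigned int get_count_to_panic(void)
{
	return count_to_panic;
}
```

Calling get_count_to_panic() before anything writes to the variable returns 0, which matches the "disabled by default" behaviour the patch relies on.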
> Otherwise, this patch looks good to me!
> Acked-by: Lance Yang <lance.yang@linux.dev>
>
> > +
> > static int
> > hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> > {
> > @@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t,
> > unsigned long timeout)
> >
> > trace_sched_process_hang(t);
> >
> > - if (sysctl_hung_task_panic) {
> > + if (sysctl_hung_task_panic ||
> > + (sysctl_hung_task_count_to_panic &&
> > + (sysctl_hung_task_detect_count >=
> > +sysctl_hung_task_count_to_panic))) {
> > console_verbose();
> > hung_task_show_lock = true;
> > hung_task_call_panic = true;
> > @@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] = {
> > .extra2 = SYSCTL_ONE,
> > },
> > {
> > + .procname = "hung_task_count_to_panic",
> > + .data = &sysctl_hung_task_count_to_panic,
> > + .maxlen = sizeof(int),
> > + .mode = 0644,
> > + .proc_handler = proc_dointvec_minmax,
> > + .extra1 = SYSCTL_ZERO,
> > + },
> > + {
> > .procname = "hung_task_check_count",
> > .data = &sysctl_hung_task_check_count,
> > .maxlen = sizeof(int),
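The condition discussed above can be modelled in userspace as a small predicate. This is only a sketch under stated assumptions: the function name and parameter names are hypothetical, the three arguments stand in for sysctl_hung_task_panic, sysctl_hung_task_count_to_panic and sysctl_hung_task_detect_count, and the type of the detect counter is assumed to be unsigned long.

```c
#include <stdbool.h>

/*
 * Userspace model of the check added to check_hung_task().
 * All names here are illustrative, not the kernel's.
 */
bool hung_task_should_panic(unsigned int panic_enabled,
			    unsigned int count_to_panic,
			    unsigned long detect_count)
{
	/* hung_task_panic=1 overrides the threshold: panic on the first hang. */
	if (panic_enabled)
		return true;

	/* A threshold of 0 disables the count-based panic. */
	if (!count_to_panic)
		return false;

	/* Otherwise panic once the detected count reaches the threshold. */
	return detect_count >= count_to_panic;
}
```

With panic_enabled = 0 and count_to_panic = 3, the predicate stays false for detect_count 1 and 2 and becomes true at 3. With panic_enabled = 1 it is true regardless of the threshold, which is the override the proposed documentation wording calls out.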
* Re: [External Mail] Re: [PATCH] hung_task: Panic after fixed number of hung tasks
2025-09-28 3:19 ` Li,Rongqing
@ 2025-09-28 3:29 ` Lance Yang
0 siblings, 0 replies; 9+ messages in thread
From: Lance Yang @ 2025-09-28 3:29 UTC (permalink / raw)
To: Li,Rongqing
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
arnd@arndb.de, joel.granados@kernel.org,
feng.tang@linux.alibaba.com, pauld@redhat.com, kees@kernel.org,
rostedt@goodmis.org, pawan.kumar.gupta@linux.intel.com,
akpm@linux-foundation.org, dave.hansen@linux.intel.com,
mingo@kernel.org, paulmck@kernel.org, corbet@lwn.net,
mhiramat@kernel.org
On 2025/9/28 11:19, Li,Rongqing wrote:
>
>
>> -----Original Message-----
>> From: Lance Yang <lance.yang@linux.dev>
>> Sent: September 27, 2025 10:39
>> To: Li,Rongqing <lirongqing@baidu.com>
>> Cc: linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org; arnd@arndb.de;
>> joel.granados@kernel.org; feng.tang@linux.alibaba.com; pauld@redhat.com;
>> kees@kernel.org; rostedt@goodmis.org; pawan.kumar.gupta@linux.intel.com;
>> akpm@linux-foundation.org; dave.hansen@linux.intel.com; mingo@kernel.org;
>> paulmck@kernel.org; corbet@lwn.net; mhiramat@kernel.org
>> Subject: [External Mail] Re: [PATCH] hung_task: Panic after fixed number of hung
>> tasks
>>
>>
>>
>> On 2025/9/25 14:06, lirongqing wrote:
>>> From: Li RongQing <lirongqing@baidu.com>
>>>
>>> Currently, when hung_task_panic is enabled, kernel will panic
>>> immediately upon detecting the first hung task. However, some hung
>>> tasks are transient and the system can recover fully, while others are
>>> unrecoverable and trigger consecutive hung task reports, and a panic is
>> expected.
>>>
>>> This commit adds a new sysctl parameter hung_task_count_to_panic to
>>> allows specifying the number of consecutive hung tasks that must be
>>> detected before triggering a kernel panic. This provides finer control
>>> for environments where transient hangs maybe happen but persistent
>>> hangs should still be fatal.
>>>
>>> Signed-off-by: Li RongQing <lirongqing@baidu.com>
>>> ---
>>> Documentation/admin-guide/sysctl/kernel.rst | 6 ++++++
>>> kernel/hung_task.c | 14 +++++++++++++-
>>> 2 files changed, 19 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst
>>> b/Documentation/admin-guide/sysctl/kernel.rst
>>> index 8b49eab..4240e7b 100644
>>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>>> @@ -405,6 +405,12 @@ This file shows up if
>> ``CONFIG_DETECT_HUNG_TASK`` is enabled.
>>> 1 Panic immediately.
>>> = =================================================
>>>
>>> +hung_task_count_to_panic
>>> +=====================
>>> +
>>> +When set to a non-zero value, after the number of consecutive hung
>>> +task occur, the kernel will triggers a panic
>>
>> Hmm... the documentation here seems a bit misleading.
>>
>> hung_task_panic=1 will always cause an immediate panic, regardless of the
>> hung_task_count_to_panic setting, right?
>>
>> Perhaps something like this would be more accurate?
>>
>> ```
>> hung_task_count_to_panic
>> ========================
>>
>> When set to a non-zero value, a kernel panic will be triggered if the number of
>> detected hung tasks reaches this value.
>>
>> Note that setting hung_task_panic=1 will still cause an immediate panic on the
>> first hung task, overriding this setting.
>> ```
>>
>>>
>>> hung_task_check_count
>>> =====================
>>> diff --git a/kernel/hung_task.c b/kernel/hung_task.c index
>>> 8708a12..87a6421 100644
>>> --- a/kernel/hung_task.c
>>> +++ b/kernel/hung_task.c
>>> @@ -83,6 +83,8 @@ static unsigned int __read_mostly
>> sysctl_hung_task_all_cpu_backtrace;
>>> static unsigned int __read_mostly sysctl_hung_task_panic =
>>> IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
>>>
>>> +static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
>>
>> Nit: while static variables are guaranteed to be zero-initialized, it's a good
>> practice and clearer for readers to initialize them explicitly.
>>
>> static unsigned int __read_mostly sysctl_hung_task_count_to_panic = 0;
>>
>>
>
> ./scripts/checkpatch.pl reports error when initialise statics to 0, so I will keep it uninitialized
>
> ERROR: do not initialise statics to 0
> #51: FILE: kernel/hung_task.c:86:
> +static unsigned int __read_mostly sysctl_hung_task_count_to_panic = 0;
Ah, good spot! Let’s leave it as is ;)
Cheers,
Lance
>
>
> thanks
>
> -Li
>
>> Otherwise, this patch looks good to me!
>> Acked-by: Lance Yang <lance.yang@linux.dev>
>>
>>> +
>>> static int
>>> hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
>>> {
>>> @@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t,
>>> unsigned long timeout)
>>>
>>> trace_sched_process_hang(t);
>>>
>>> - if (sysctl_hung_task_panic) {
>>> + if (sysctl_hung_task_panic ||
>>> + (sysctl_hung_task_count_to_panic &&
>>> + (sysctl_hung_task_detect_count >=
>>> +sysctl_hung_task_count_to_panic))) {
>>> console_verbose();
>>> hung_task_show_lock = true;
>>> hung_task_call_panic = true;
>>> @@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] = {
>>> .extra2 = SYSCTL_ONE,
>>> },
>>> {
>>> + .procname = "hung_task_count_to_panic",
>>> + .data = &sysctl_hung_task_count_to_panic,
>>> + .maxlen = sizeof(int),
>>> + .mode = 0644,
>>> + .proc_handler = proc_dointvec_minmax,
>>> + .extra1 = SYSCTL_ZERO,
>>> + },
>>> + {
>>> .procname = "hung_task_check_count",
>>> .data = &sysctl_hung_task_check_count,
>>> .maxlen = sizeof(int),
>
Thread overview: 9+ messages
2025-09-25 6:06 [PATCH] hung_task: Panic after fixed number of hung tasks lirongqing
2025-09-25 10:26 ` Lance Yang
2025-09-26 18:02 ` Paul E. McKenney
2025-09-27 2:18 ` Lance Yang
2025-09-28 1:51 ` [External Mail] " Li,Rongqing
2025-09-27 2:39 ` Lance Yang
2025-09-28 1:54 ` [External Mail] " Li,Rongqing
2025-09-28 3:19 ` Li,Rongqing
2025-09-28 3:29 ` Lance Yang