[PATCH][v3] hung_task: Panic after fixed number of hung tasks

* [PATCH][v3] hung_task: Panic after fixed number of hung tasks
@ 2025-10-12 11:50 lirongqing
  2025-10-12 13:26 ` [PATCH v3] " Markus Elfring
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: lirongqing @ 2025-10-12 11:50 UTC
  To: Jonathan Corbet, Russell King, Joel Stanley, Andrew Jeffery,
	Andrew Morton, Lance Yang, Masami Hiramatsu, Jason A . Donenfeld,
	Shuah Khan, Paul E . McKenney, Petr Mladek, Randy Dunlap,
	Steven Rostedt, Feng Tang, Pawan Gupta, Kees Cook, Arnd Bergmann,
	Li RongQing, Phil Auld, Joel Granados, Jakub Kicinski,
	Simon Horman, Anshuman Khandual, Stanislav Fomichev,
	Liam R . Howlett, Lorenzo Stoakes, David Hildenbrand,
	Florian Westphal, linux-doc, linux-kernel, linux-arm-kernel,
	linux-aspeed, wireguard, netdev, linux-kselftest

From: Li RongQing <lirongqing@baidu.com>

Currently, when 'hung_task_panic' is enabled, the kernel panics
immediately upon detecting the first hung task. However, some hung
tasks are transient and the system can recover, while others are
persistent and may accumulate progressively.

This patch extends the 'hung_task_panic' sysctl to allow specifying
the number of hung tasks that must be detected before triggering
a kernel panic. This provides finer control for environments where
transient hangs may occur but persistent hangs should still be fatal.

The sysctl can be set to:
- 0: disabled (never panic)
- 1: original behavior (panic on first hung task)
- N: panic when N hung tasks are detected

This maintains backward compatibility while providing more flexibility
for handling different hang scenarios.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
Diff with v2: not add new sysctl, extend hung_task_panic

 Documentation/admin-guide/kernel-parameters.txt      | 20 +++++++++++++-------
 Documentation/admin-guide/sysctl/kernel.rst          |  3 ++-
 arch/arm/configs/aspeed_g5_defconfig                 |  2 +-
 kernel/configs/debug.config                          |  2 +-
 kernel/hung_task.c                                   | 16 +++++++++++-----
 lib/Kconfig.debug                                    | 10 ++++++----
 tools/testing/selftests/wireguard/qemu/kernel.config |  2 +-
 7 files changed, 35 insertions(+), 20 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a51ab46..7d9a8ee 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1992,14 +1992,20 @@
 			the added memory block itself do not be affected.
 
 	hung_task_panic=
-			[KNL] Should the hung task detector generate panics.
-			Format: 0 | 1
+			[KNL] Number of hung tasks to trigger kernel panic.
+			Format: <int>
+
+			Set this to the number of hung tasks that must be
+			detected before triggering a kernel panic.
+
+			0: don't panic
+			1: panic immediately on first hung task
+			N: panic after N hung tasks are detect
 
-			A value of 1 instructs the kernel to panic when a
-			hung task is detected. The default value is controlled
-			by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
-			option. The value selected by this boot parameter can
-			be changed later by the kernel.hung_task_panic sysctl.
+			The default value is controlled by the
+			CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. The value
+			selected by this boot parameter can be changed later by the
+			kernel.hung_task_panic sysctl.
 
 	hvc_iucv=	[S390]	Number of z/VM IUCV hypervisor console (HVC)
 				terminal devices. Valid values: 0..8
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index f3ee807..0a8dfab 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -397,7 +397,8 @@ a hung task is detected.
 hung_task_panic
 ===============
 
-Controls the kernel's behavior when a hung task is detected.
+When set to a non-zero value, a kernel panic will be triggered if the
+number of detected hung tasks reaches this value
 This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
 
 = =================================================
diff --git a/arch/arm/configs/aspeed_g5_defconfig b/arch/arm/configs/aspeed_g5_defconfig
index 61cee1e..c3b0d5f 100644
--- a/arch/arm/configs/aspeed_g5_defconfig
+++ b/arch/arm/configs/aspeed_g5_defconfig
@@ -308,7 +308,7 @@ CONFIG_PANIC_ON_OOPS=y
 CONFIG_PANIC_TIMEOUT=-1
 CONFIG_SOFTLOCKUP_DETECTOR=y
 CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
-CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y
+CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1
 CONFIG_WQ_WATCHDOG=y
 # CONFIG_SCHED_DEBUG is not set
 CONFIG_FUNCTION_TRACER=y
diff --git a/kernel/configs/debug.config b/kernel/configs/debug.config
index e81327d..9f6ab7d 100644
--- a/kernel/configs/debug.config
+++ b/kernel/configs/debug.config
@@ -83,7 +83,7 @@ CONFIG_SLUB_DEBUG_ON=y
 #
 # Debug Oops, Lockups and Hangs
 #
-# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
+CONFIG_BOOTPARAM_HUNG_TASK_PANIC=0
 # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
 CONFIG_DEBUG_ATOMIC_SLEEP=y
 CONFIG_DETECT_HUNG_TASK=y
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index b2c1f14..3929ed9 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -81,7 +81,7 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
  * hung task is detected:
  */
 static unsigned int __read_mostly sysctl_hung_task_panic =
-	IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
+	CONFIG_BOOTPARAM_HUNG_TASK_PANIC;
 
 static int
 hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
@@ -218,8 +218,11 @@ static inline void debug_show_blocker(struct task_struct *task, unsigned long ti
 }
 #endif
 
-static void check_hung_task(struct task_struct *t, unsigned long timeout)
+static void check_hung_task(struct task_struct *t, unsigned long timeout,
+		unsigned long prev_detect_count)
 {
+	unsigned long total_hung_task;
+
 	if (!task_is_hung(t, timeout))
 		return;
 
@@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 	 */
 	sysctl_hung_task_detect_count++;
 
+	total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
 	trace_sched_process_hang(t);
 
-	if (sysctl_hung_task_panic) {
+	if (sysctl_hung_task_panic &&
+			(total_hung_task >= sysctl_hung_task_panic)) {
 		console_verbose();
 		hung_task_show_lock = true;
 		hung_task_call_panic = true;
@@ -300,6 +305,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 	int max_count = sysctl_hung_task_check_count;
 	unsigned long last_break = jiffies;
 	struct task_struct *g, *t;
+	unsigned long prev_detect_count = sysctl_hung_task_detect_count;
 
 	/*
 	 * If the system crashed already then all bets are off,
@@ -320,7 +326,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 			last_break = jiffies;
 		}
 
-		check_hung_task(t, timeout);
+		check_hung_task(t, timeout, prev_detect_count);
 	}
  unlock:
 	rcu_read_unlock();
@@ -389,7 +395,7 @@ static const struct ctl_table hung_task_sysctls[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
-		.extra2		= SYSCTL_ONE,
+		.extra2		= SYSCTL_INT_MAX,
 	},
 	{
 		.procname	= "hung_task_check_count",
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3034e294..077b9e4 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1258,12 +1258,14 @@ config DEFAULT_HUNG_TASK_TIMEOUT
 	  Keeping the default should be fine in most cases.
 
 config BOOTPARAM_HUNG_TASK_PANIC
-	bool "Panic (Reboot) On Hung Tasks"
+	int "Number of hung tasks to trigger kernel panic"
 	depends on DETECT_HUNG_TASK
+	default 0
 	help
-	  Say Y here to enable the kernel to panic on "hung tasks",
-	  which are bugs that cause the kernel to leave a task stuck
-	  in uninterruptible "D" state.
+	  The number of hung tasks must be detected to trigger kernel panic.
+
+	  - 0: Don't trigger panic
+	  - N: Panic when N hung tasks are detected
 
 	  The panic can be used in combination with panic_timeout,
 	  to cause the system to reboot automatically after a
diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config
index 936b18b..0504c11 100644
--- a/tools/testing/selftests/wireguard/qemu/kernel.config
+++ b/tools/testing/selftests/wireguard/qemu/kernel.config
@@ -81,7 +81,7 @@ CONFIG_WQ_WATCHDOG=y
 CONFIG_DETECT_HUNG_TASK=y
 CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
 CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
-CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y
+CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1
 CONFIG_PANIC_TIMEOUT=-1
 CONFIG_STACKTRACE=y
 CONFIG_EARLY_PRINTK=y
-- 
2.9.4


* Re: [PATCH v3] hung_task: Panic after fixed number of hung tasks
  2025-10-12 11:50 [PATCH][v3] hung_task: Panic after fixed number of hung tasks lirongqing
@ 2025-10-12 13:26 ` Markus Elfring
  2025-10-13  2:14   ` [External Mail] " Li,Rongqing
  2025-10-14  1:37 ` [PATCH][v3] " Randy Dunlap
  2025-10-14  5:23 ` Lance Yang
  2 siblings, 1 reply; 12+ messages in thread
From: Markus Elfring @ 2025-10-12 13:26 UTC
  To: Li RongQing, linux-doc, linux-kselftest, netdev, linux-arm-kernel,
	linux-aspeed, wireguard, Andrew Jeffery, Andrew Morton,
	Anshuman Khandual, Arnd Bergmann, David Hildenbrand, Feng Tang,
	Florian Westphal, Jakub Kicinski, Jason A . Donenfeld,
	Joel Granados, Joel Stanley, Jonathan Corbet, Kees Cook,
	Lance Yang, Liam R . Howlett, Lorenzo Stoakes, Masami Hiramatsu,
	Paul E . McKenney, Pawan Gupta, Petr Mladek, Phil Auld,
	Randy Dunlap, Russell King, Shuah Khan, Simon Horman,
	Stanislav Fomichev, Steven Rostedt
  Cc: LKML, kernel-janitors

…
> This patch extends the …

Will another imperative wording approach become more helpful for an improved
change description?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.17#n94


…
> +++ b/kernel/hung_task.c
…
@@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
…
>  	trace_sched_process_hang(t);
>  
> -	if (sysctl_hung_task_panic) {
> +	if (sysctl_hung_task_panic &&
> +			(total_hung_task >= sysctl_hung_task_panic)) {
…

I suggest to use the following source code variant instead.

	if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) {


Regards,
Markus

* RE: [External Mail] Re: [PATCH v3] hung_task: Panic after fixed number of hung tasks
  2025-10-12 13:26 ` [PATCH v3] " Markus Elfring
@ 2025-10-13  2:14   ` Li,Rongqing
  0 siblings, 0 replies; 12+ messages in thread
From: Li,Rongqing @ 2025-10-13  2:14 UTC
  To: Markus Elfring, linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org, netdev@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-aspeed@lists.ozlabs.org, wireguard@lists.zx2c4.com,
	Andrew Jeffery, Andrew Morton, Anshuman Khandual, Arnd Bergmann,
	David Hildenbrand, Feng Tang, Florian Westphal, Jakub Kicinski,
	Jason A . Donenfeld, Joel Granados, Joel Stanley, Jonathan Corbet,
	Kees Cook, Lance Yang, Liam R . Howlett, Lorenzo Stoakes,
	Masami Hiramatsu, Paul E . McKenney, Pawan Gupta, Petr Mladek,
	Phil Auld, Randy Dunlap, Russell King, Shuah Khan, Simon Horman,
	Stanislav Fomichev, Steven Rostedt
  Cc: LKML, kernel-janitors@vger.kernel.org

> …
> > This patch extends the …
> 
> Will another imperative wording approach become more helpful for an
> improved change description?
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.17#n94
> 

Will fix in the next version.
> 
> …
> > +++ b/kernel/hung_task.c
> …
> @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> …
> >  	trace_sched_process_hang(t);
> >
> > -	if (sysctl_hung_task_panic) {
> > +	if (sysctl_hung_task_panic &&
> > +			(total_hung_task >= sysctl_hung_task_panic)) {
> …
> 
> I suggest to use the following source code variant instead.
> 
> 	if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) {
> 

Will fix in the next version.

thanks

-Li

> 
> Regards,
> Markus


* Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks
  2025-10-12 11:50 [PATCH][v3] hung_task: Panic after fixed number of hung tasks lirongqing
  2025-10-12 13:26 ` [PATCH v3] " Markus Elfring
@ 2025-10-14  1:37 ` Randy Dunlap
  2025-10-14  5:23 ` Lance Yang
  2 siblings, 0 replies; 12+ messages in thread
From: Randy Dunlap @ 2025-10-14  1:37 UTC
  To: lirongqing, Jonathan Corbet, Russell King, Joel Stanley,
	Andrew Jeffery, Andrew Morton, Lance Yang, Masami Hiramatsu,
	Jason A . Donenfeld, Shuah Khan, Paul E . McKenney, Petr Mladek,
	Steven Rostedt, Feng Tang, Pawan Gupta, Kees Cook, Arnd Bergmann,
	Phil Auld, Joel Granados, Jakub Kicinski, Simon Horman,
	Anshuman Khandual, Stanislav Fomichev, Liam R . Howlett,
	Lorenzo Stoakes, David Hildenbrand, Florian Westphal, linux-doc,
	linux-kernel, linux-arm-kernel, linux-aspeed, wireguard, netdev,
	linux-kselftest

Hi--

On 10/12/25 4:50 AM, lirongqing wrote:
> From: Li RongQing <lirongqing@baidu.com>
> 

> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a51ab46..7d9a8ee 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1992,14 +1992,20 @@
>  			the added memory block itself do not be affected.
>  
>  	hung_task_panic=
> -			[KNL] Should the hung task detector generate panics.
> -			Format: 0 | 1
> +			[KNL] Number of hung tasks to trigger kernel panic.
> +			Format: <int>
> +
> +			Set this to the number of hung tasks that must be
> +			detected before triggering a kernel panic.
> +
> +			0: don't panic
> +			1: panic immediately on first hung task
> +			N: panic after N hung tasks are detect

			                            are detected

>  
> -			A value of 1 instructs the kernel to panic when a
> -			hung task is detected. The default value is controlled
> -			by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
> -			option. The value selected by this boot parameter can
> -			be changed later by the kernel.hung_task_panic sysctl.
> +			The default value is controlled by the
> +			CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. The value
> +			selected by this boot parameter can be changed later by the
> +			kernel.hung_task_panic sysctl.
>  
>  	hvc_iucv=	[S390]	Number of z/VM IUCV hypervisor console (HVC)
>  				terminal devices. Valid values: 0..8


-- 
~Randy


* Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks
  2025-10-12 11:50 [PATCH][v3] hung_task: Panic after fixed number of hung tasks lirongqing
  2025-10-12 13:26 ` [PATCH v3] " Markus Elfring
  2025-10-14  1:37 ` [PATCH][v3] " Randy Dunlap
@ 2025-10-14  5:23 ` Lance Yang
  2025-10-14  9:45   ` Petr Mladek
  2 siblings, 1 reply; 12+ messages in thread
From: Lance Yang @ 2025-10-14  5:23 UTC
  To: lirongqing
  Cc: wireguard, linux-arm-kernel, Liam R . Howlett, linux-doc,
	David Hildenbrand, Randy Dunlap, Stanislav Fomichev, linux-aspeed,
	Andrew Jeffery, Joel Stanley, Russell King, Lorenzo Stoakes,
	Shuah Khan, Steven Rostedt, Jonathan Corbet, Petr Mladek,
	Joel Granados, Andrew Morton, Phil Auld, linux-kernel,
	linux-kselftest, Masami Hiramatsu, Jakub Kicinski, Pawan Gupta,
	Simon Horman, Anshuman Khandual, Florian Westphal, netdev,
	Kees Cook, Arnd Bergmann, Paul E . McKenney, Feng Tang,
	Jason A . Donenfeld

Thanks for the patch!

I noticed the implementation panics only when N tasks are detected
within a single scan, because total_hung_task is reset for each
check_hung_uninterruptible_tasks() run.
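
To make the per-scan semantics concrete, here is a minimal user-space
sketch, not kernel code. The names scan_once(), detect_count and
panic_threshold are hypothetical stand-ins for one
check_hung_uninterruptible_tasks() pass, sysctl_hung_task_detect_count
and sysctl_hung_task_panic; the sketch assumes the scan simply reports
how many hung tasks it found.

```c
#include <stdbool.h>

/* Hypothetical user-space model; these names do not exist in the kernel. */
static unsigned long detect_count;   /* models sysctl_hung_task_detect_count */
static unsigned int panic_threshold; /* models sysctl_hung_task_panic */

/*
 * Model one scan that finds 'hung' hung tasks.
 * Returns true if this scan would request a panic.
 */
static bool scan_once(unsigned int hung)
{
	/* Snapshot taken at scan start, like prev_detect_count in the patch. */
	unsigned long prev = detect_count;
	bool call_panic = false;

	for (unsigned int i = 0; i < hung; i++) {
		detect_count++;
		/* Only detections made during this scan count toward the threshold. */
		if (panic_threshold && detect_count - prev >= panic_threshold)
			call_panic = true;
	}
	return call_panic;
}
```

With panic_threshold set to 3, two scans that each find two hung tasks
do not request a panic, while a single scan that finds three does.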

So some suggestions to align the documentation with the code's
behavior below :)

On 2025/10/12 19:50, lirongqing wrote:
> From: Li RongQing <lirongqing@baidu.com>
> 
> Currently, when 'hung_task_panic' is enabled, the kernel panics
> immediately upon detecting the first hung task. However, some hung
> tasks are transient and the system can recover, while others are
> persistent and may accumulate progressively.
> 
> This patch extends the 'hung_task_panic' sysctl to allow specifying
> the number of hung tasks that must be detected before triggering
> a kernel panic. This provides finer control for environments where
> transient hangs may occur but persistent hangs should still be fatal.
> 
> The sysctl can be set to:
> - 0: disabled (never panic)
> - 1: original behavior (panic on first hung task)
> - N: panic when N hung tasks are detected
> 
> This maintains backward compatibility while providing more flexibility
> for handling different hang scenarios.
> 
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> ---
> Diff with v2: not add new sysctl, extend hung_task_panic
> 
>   Documentation/admin-guide/kernel-parameters.txt      | 20 +++++++++++++-------
>   Documentation/admin-guide/sysctl/kernel.rst          |  3 ++-
>   arch/arm/configs/aspeed_g5_defconfig                 |  2 +-
>   kernel/configs/debug.config                          |  2 +-
>   kernel/hung_task.c                                   | 16 +++++++++++-----
>   lib/Kconfig.debug                                    | 10 ++++++----
>   tools/testing/selftests/wireguard/qemu/kernel.config |  2 +-
>   7 files changed, 35 insertions(+), 20 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a51ab46..7d9a8ee 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1992,14 +1992,20 @@
>   			the added memory block itself do not be affected.
>   
>   	hung_task_panic=
> -			[KNL] Should the hung task detector generate panics.
> -			Format: 0 | 1
> +			[KNL] Number of hung tasks to trigger kernel panic.
> +			Format: <int>
> +
> +			Set this to the number of hung tasks that must be
> +			detected before triggering a kernel panic.
> +
> +			0: don't panic
> +			1: panic immediately on first hung task
> +			N: panic after N hung tasks are detect

The description should be more specific :)

N: panic after N hung tasks are detected in a single scan

Would it be better and cleaner?

>   
> -			A value of 1 instructs the kernel to panic when a
> -			hung task is detected. The default value is controlled
> -			by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
> -			option. The value selected by this boot parameter can
> -			be changed later by the kernel.hung_task_panic sysctl.
> +			The default value is controlled by the
> +			CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. The value
> +			selected by this boot parameter can be changed later by the
> +			kernel.hung_task_panic sysctl.
>   
>   	hvc_iucv=	[S390]	Number of z/VM IUCV hypervisor console (HVC)
>   				terminal devices. Valid values: 0..8
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index f3ee807..0a8dfab 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -397,7 +397,8 @@ a hung task is detected.
>   hung_task_panic
>   ===============
>   
> -Controls the kernel's behavior when a hung task is detected.
> +When set to a non-zero value, a kernel panic will be triggered if the
> +number of detected hung tasks reaches this value

Hmm... that is also ambiguous ...

+When set to a non-zero value, a kernel panic will be triggered if the
+number of hung tasks found during a single scan reaches this value.

>   This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
>   
>   = =================================================
> diff --git a/arch/arm/configs/aspeed_g5_defconfig b/arch/arm/configs/aspeed_g5_defconfig
> index 61cee1e..c3b0d5f 100644
> --- a/arch/arm/configs/aspeed_g5_defconfig
> +++ b/arch/arm/configs/aspeed_g5_defconfig
> @@ -308,7 +308,7 @@ CONFIG_PANIC_ON_OOPS=y
>   CONFIG_PANIC_TIMEOUT=-1
>   CONFIG_SOFTLOCKUP_DETECTOR=y
>   CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
> -CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y
> +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1
>   CONFIG_WQ_WATCHDOG=y
>   # CONFIG_SCHED_DEBUG is not set
>   CONFIG_FUNCTION_TRACER=y
> diff --git a/kernel/configs/debug.config b/kernel/configs/debug.config
> index e81327d..9f6ab7d 100644
> --- a/kernel/configs/debug.config
> +++ b/kernel/configs/debug.config
> @@ -83,7 +83,7 @@ CONFIG_SLUB_DEBUG_ON=y
>   #
>   # Debug Oops, Lockups and Hangs
>   #
> -# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
> +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=0
>   # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
>   CONFIG_DEBUG_ATOMIC_SLEEP=y
>   CONFIG_DETECT_HUNG_TASK=y
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index b2c1f14..3929ed9 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -81,7 +81,7 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
>    * hung task is detected:
>    */
>   static unsigned int __read_mostly sysctl_hung_task_panic =
> -	IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> +	CONFIG_BOOTPARAM_HUNG_TASK_PANIC;
>   
>   static int
>   hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> @@ -218,8 +218,11 @@ static inline void debug_show_blocker(struct task_struct *task, unsigned long ti
>   }
>   #endif
>   
> -static void check_hung_task(struct task_struct *t, unsigned long timeout)
> +static void check_hung_task(struct task_struct *t, unsigned long timeout,
> +		unsigned long prev_detect_count)
>   {
> +	unsigned long total_hung_task;
> +
>   	if (!task_is_hung(t, timeout))
>   		return;
>   
> @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
>   	 */
>   	sysctl_hung_task_detect_count++;
>   
> +	total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
>   	trace_sched_process_hang(t);
>   
> -	if (sysctl_hung_task_panic) {
> +	if (sysctl_hung_task_panic &&
> +			(total_hung_task >= sysctl_hung_task_panic)) {
>   		console_verbose();
>   		hung_task_show_lock = true;
>   		hung_task_call_panic = true;
> @@ -300,6 +305,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>   	int max_count = sysctl_hung_task_check_count;
>   	unsigned long last_break = jiffies;
>   	struct task_struct *g, *t;
> +	unsigned long prev_detect_count = sysctl_hung_task_detect_count;
>   
>   	/*
>   	 * If the system crashed already then all bets are off,
> @@ -320,7 +326,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>   			last_break = jiffies;
>   		}
>   
> -		check_hung_task(t, timeout);
> +		check_hung_task(t, timeout, prev_detect_count);
>   	}
>    unlock:
>   	rcu_read_unlock();
> @@ -389,7 +395,7 @@ static const struct ctl_table hung_task_sysctls[] = {
>   		.mode		= 0644,
>   		.proc_handler	= proc_dointvec_minmax,
>   		.extra1		= SYSCTL_ZERO,
> -		.extra2		= SYSCTL_ONE,
> +		.extra2		= SYSCTL_INT_MAX,
>   	},
>   	{
>   		.procname	= "hung_task_check_count",
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 3034e294..077b9e4 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1258,12 +1258,14 @@ config DEFAULT_HUNG_TASK_TIMEOUT
>   	  Keeping the default should be fine in most cases.
>   
>   config BOOTPARAM_HUNG_TASK_PANIC
> -	bool "Panic (Reboot) On Hung Tasks"
> +	int "Number of hung tasks to trigger kernel panic"
>   	depends on DETECT_HUNG_TASK
> +	default 0
>   	help
> -	  Say Y here to enable the kernel to panic on "hung tasks",
> -	  which are bugs that cause the kernel to leave a task stuck
> -	  in uninterruptible "D" state.
> +	  The number of hung tasks must be detected to trigger kernel panic.
> +
> +	  - 0: Don't trigger panic
> +	  - N: Panic when N hung tasks are detected

+	  - N: Panic when N hung tasks are detected in a single scan

With these documentation changes, this patch would accurately describe 
its behavior, IMHO.

>   
>   	  The panic can be used in combination with panic_timeout,
>   	  to cause the system to reboot automatically after a
> diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config
> index 936b18b..0504c11 100644
> --- a/tools/testing/selftests/wireguard/qemu/kernel.config
> +++ b/tools/testing/selftests/wireguard/qemu/kernel.config
> @@ -81,7 +81,7 @@ CONFIG_WQ_WATCHDOG=y
>   CONFIG_DETECT_HUNG_TASK=y
>   CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
>   CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
> -CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y
> +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1
>   CONFIG_PANIC_TIMEOUT=-1
>   CONFIG_STACKTRACE=y
>   CONFIG_EARLY_PRINTK=y


* Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks
  2025-10-14  5:23 ` Lance Yang
@ 2025-10-14  9:45   ` Petr Mladek
  2025-10-14 10:49     ` [External Mail] " Li,Rongqing
  2025-10-14 10:59     ` Lance Yang
  0 siblings, 2 replies; 12+ messages in thread
From: Petr Mladek @ 2025-10-14  9:45 UTC
  To: Lance Yang
  Cc: lirongqing, wireguard, linux-arm-kernel, Liam R . Howlett,
	linux-doc, David Hildenbrand, Randy Dunlap, Stanislav Fomichev,
	linux-aspeed, Andrew Jeffery, Joel Stanley, Russell King,
	Lorenzo Stoakes, Shuah Khan, Steven Rostedt, Jonathan Corbet,
	Joel Granados, Andrew Morton, Phil Auld, linux-kernel,
	linux-kselftest, Masami Hiramatsu, Jakub Kicinski, Pawan Gupta,
	Simon Horman, Anshuman Khandual, Florian Westphal, netdev,
	Kees Cook, Arnd Bergmann, Paul E . McKenney, Feng Tang,
	Jason A . Donenfeld

On Tue 2025-10-14 13:23:58, Lance Yang wrote:
> Thanks for the patch!
> 
> I noticed the implementation panics only when N tasks are detected
> within a single scan, because total_hung_task is reset for each
> check_hung_uninterruptible_tasks() run.

Great catch!

Does it make sense?
Is it the intended behavior, please?

> So some suggestions to align the documentation with the code's
> behavior below :)

> On 2025/10/12 19:50, lirongqing wrote:
> > From: Li RongQing <lirongqing@baidu.com>
> > 
> > Currently, when 'hung_task_panic' is enabled, the kernel panics
> > immediately upon detecting the first hung task. However, some hung
> > tasks are transient and the system can recover, while others are
> > persistent and may accumulate progressively.

My understanding is that this patch wanted to do:

   + report even temporary stalls
   + panic only when the stall was much longer and likely persistent

Which might make some sense. But the code does something else.

> > --- a/kernel/hung_task.c
> > +++ b/kernel/hung_task.c
> > @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> >   	 */
> >   	sysctl_hung_task_detect_count++;
> > +	total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
> >   	trace_sched_process_hang(t);
> > -	if (sysctl_hung_task_panic) {
> > +	if (sysctl_hung_task_panic &&
> > +			(total_hung_task >= sysctl_hung_task_panic)) {
> >   		console_verbose();
> >   		hung_task_show_lock = true;
> >   		hung_task_call_panic = true;

I would expect that this patch added another counter, similar to
sysctl_hung_task_detect_count. It would be incremented only
once per check when a hung task was detected. And it would
be cleared (reset) when no hung task was found.
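
For comparison, the alternative described above could be modeled
roughly as follows. This is only a user-space sketch under assumptions:
finish_scan(), consecutive_hung_scans and panic_after_scans are
hypothetical names, and the sketch assumes each scan reports how many
hung tasks it found.

```c
#include <stdbool.h>

/* Hypothetical user-space model; these names do not exist in the kernel. */
static unsigned int consecutive_hung_scans; /* scans in a row that found a hung task */
static unsigned int panic_after_scans;      /* 0 disables the panic */

/*
 * Call once at the end of each scan with the number of hung tasks found.
 * Returns true if a panic would be requested.
 */
static bool finish_scan(unsigned int hung_found)
{
	if (hung_found)
		consecutive_hung_scans++;   /* at most one increment per scan */
	else
		consecutive_hung_scans = 0; /* a clean scan resets the streak */

	return panic_after_scans &&
	       consecutive_hung_scans >= panic_after_scans;
}
```

In this model the threshold measures how long a hang persists across
scans, not how many tasks hang at the same moment, which is the
difference between the two approaches.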

Best Regards,
Petr

* RE: [External Mail] Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks
  2025-10-14  9:45   ` Petr Mladek
@ 2025-10-14 10:49     ` Li,Rongqing
  2025-10-14 13:09       ` Petr Mladek
  2025-10-14 10:59     ` Lance Yang
  1 sibling, 1 reply; 12+ messages in thread
From: Li,Rongqing @ 2025-10-14 10:49 UTC
  To: Petr Mladek, Lance Yang
  Cc: wireguard@lists.zx2c4.com, linux-arm-kernel@lists.infradead.org,
	Liam R . Howlett, linux-doc@vger.kernel.org, David Hildenbrand,
	Randy Dunlap, Stanislav Fomichev, linux-aspeed@lists.ozlabs.org,
	Andrew Jeffery, Joel Stanley, Russell King, Lorenzo Stoakes,
	Shuah Khan, Steven Rostedt, Jonathan Corbet, Joel Granados,
	Andrew Morton, Phil Auld, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Masami Hiramatsu, Jakub Kicinski,
	Pawan Gupta, Simon Horman, Anshuman Khandual, Florian Westphal,
	netdev@vger.kernel.org, Kees Cook, Arnd Bergmann,
	Paul E . McKenney, Feng Tang, Jason A . Donenfeld


> On Tue 2025-10-14 13:23:58, Lance Yang wrote:
> > Thanks for the patch!
> >
> > I noticed the implementation panics only when N tasks are detected
> > within a single scan, because total_hung_task is reset for each
> > check_hung_uninterruptible_tasks() run.
> 
> Great catch!
> 
> Does it make sense?
> Is it the intended behavior, please?
> 

Yes, this is the intended behavior.

> > So some suggestions to align the documentation with the code's
> > behavior below :)
> 
> > On 2025/10/12 19:50, lirongqing wrote:
> > > From: Li RongQing <lirongqing@baidu.com>
> > >
> > > Currently, when 'hung_task_panic' is enabled, the kernel panics
> > > immediately upon detecting the first hung task. However, some hung
> > > tasks are transient and the system can recover, while others are
> > > persistent and may accumulate progressively.
> 
> My understanding is that this patch wanted to do:
> 
>    + report even temporary stalls
>    + panic only when the stall was much longer and likely persistent
> 
> Which might make some sense. But the code does something else.
> 

A single task hanging for an extended period may not be a critical
issue, as users might still log into the system to investigate.
However, if multiple tasks hang simultaneously, such as in cases of
I/O hangs caused by disk failures, it could prevent users from logging
in and become a serious problem, and a panic is expected.


> > > --- a/kernel/hung_task.c
> > > +++ b/kernel/hung_task.c
> > > @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > >   	 */
> > >   	sysctl_hung_task_detect_count++;
> > > +	total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
> > >   	trace_sched_process_hang(t);
> > > -	if (sysctl_hung_task_panic) {
> > > +	if (sysctl_hung_task_panic &&
> > > +			(total_hung_task >= sysctl_hung_task_panic)) {
> > >   		console_verbose();
> > >   		hung_task_show_lock = true;
> > >   		hung_task_call_panic = true;
> 
> I would expect that this patch added another counter, similar to
> sysctl_hung_task_detect_count. It would be incremented only once per check
> when a hung task was detected. And it would be cleared (reset) when no
> hung task was found.
> 
> Best Regards,
> Petr

* Re: [External Mail] Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks
  2025-10-14 10:49     ` [External Mail] " Li,Rongqing
@ 2025-10-14 13:09       ` Petr Mladek
  2025-10-15  2:04         ` [External Mail] " Li,Rongqing
  0 siblings, 1 reply; 12+ messages in thread
From: Petr Mladek @ 2025-10-14 13:09 UTC
  To: Li,Rongqing
  Cc: Lance Yang, wireguard@lists.zx2c4.com,
	linux-arm-kernel@lists.infradead.org, Liam R . Howlett,
	linux-doc@vger.kernel.org, David Hildenbrand, Randy Dunlap,
	Stanislav Fomichev, linux-aspeed@lists.ozlabs.org, Andrew Jeffery,
	Joel Stanley, Russell King, Lorenzo Stoakes, Shuah Khan,
	Steven Rostedt, Jonathan Corbet, Joel Granados, Andrew Morton,
	Phil Auld, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Masami Hiramatsu, Jakub Kicinski,
	Pawan Gupta, Simon Horman, Anshuman Khandual, Florian Westphal,
	netdev@vger.kernel.org, Kees Cook, Arnd Bergmann,
	Paul E . McKenney, Feng Tang, Jason A . Donenfeld

On Tue 2025-10-14 10:49:53, Li,Rongqing wrote:
> 
> > On Tue 2025-10-14 13:23:58, Lance Yang wrote:
> > > Thanks for the patch!
> > >
> > > I noticed the implementation panics only when N tasks are detected
> > > within a single scan, because total_hung_task is reset for each
> > > check_hung_uninterruptible_tasks() run.
> > 
> > Great catch!
> > 
> > Does it make sense?
> > > Is it the intended behavior, please?
> > 
> 
> Yes, this is the intended behavior.
> 
> > > So some suggestions to align the documentation with the code's
> > > behavior below :)
> > 
> > > On 2025/10/12 19:50, lirongqing wrote:
> > > > From: Li RongQing <lirongqing@baidu.com>
> > > >
> > > > Currently, when 'hung_task_panic' is enabled, the kernel panics
> > > > immediately upon detecting the first hung task. However, some hung
> > > > tasks are transient and the system can recover, while others are
> > > > persistent and may accumulate progressively.
> > 
> > My understanding is that this patch wanted to do:
> > 
> >    + report even temporary stalls
> >    + panic only when the stall was much longer and likely persistent
> > 
> > Which might make some sense. But the code does something else.
> > 
> 
> A single task hanging for an extended period may not be a critical
> issue, as users might still log into the system to investigate.
> However, if multiple tasks hang simultaneously, such as in cases
> of I/O hangs caused by disk failures, it could prevent users from
> logging in and become a serious problem, and a panic is expected.

I see. This is another approach and it makes sense as well.
And this is a much clearer description than the original text.

I would also update the subject to something like:

    hung_task: Panic when there are more than N hung tasks at the same time



That said, I think that both approaches make sense.

Your approach would trigger the panic when many processes are stuck.
Note that it still might be a transient state. But I agree that
the more stuck processes there are, the more serious the problem
likely is for the health of the system.

My approach would trigger the panic when a single process hangs
for a long time. It would more likely trigger only when the problem
is persistent. The seriousness depends on which particular process
gets stuck.
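
For reference, the per-scan semantics discussed here (panic once N hung
tasks are seen within a single scan) can be modelled in a few lines of
userspace C. This is only a sketch under assumptions: struct
hung_scan_state, scan_begin() and scan_report_hung() are hypothetical
names invented for illustration, and the real code in kernel/hung_task.c
uses the globals shown in the quoted diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the patch's counting; not kernel code. */
struct hung_scan_state {
	unsigned long detect_count;      /* lifetime total, like sysctl_hung_task_detect_count */
	unsigned long prev_detect_count; /* snapshot taken when the current scan started */
	unsigned int panic_threshold;    /* like hung_task_panic: 0 means never panic */
};

/* Start a new scan: remember how many hung tasks had been counted so far. */
static void scan_begin(struct hung_scan_state *s)
{
	assert(s != NULL);
	s->prev_detect_count = s->detect_count;
}

/*
 * Record one hung task found during the current scan.
 * Returns true when the number found in this scan reaches the threshold.
 */
static bool scan_report_hung(struct hung_scan_state *s)
{
	unsigned long total_hung_task;

	assert(s != NULL);
	s->detect_count++;
	total_hung_task = s->detect_count - s->prev_detect_count;
	return s->panic_threshold && total_hung_task >= s->panic_threshold;
}
```

With panic_threshold set to 3, two hung tasks in one scan followed by two
more in the next scan would not trigger, because the per-scan total starts
again from zero at each scan_begin().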

I am fine with your approach. Just please make it clearer that
the number means the number of tasks hung at the same time.
And mention the login problems, ...

Best Regards,
Petr


* RE: [External Mail] Re: [External Mail] Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks
  2025-10-14 13:09       ` Petr Mladek
@ 2025-10-15  2:04         ` Li,Rongqing
  0 siblings, 0 replies; 12+ messages in thread
From: Li,Rongqing @ 2025-10-15  2:04 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Lance Yang, wireguard@lists.zx2c4.com,
	linux-arm-kernel@lists.infradead.org, Liam R . Howlett,
	linux-doc@vger.kernel.org, David Hildenbrand, Randy Dunlap,
	Stanislav Fomichev, linux-aspeed@lists.ozlabs.org, Andrew Jeffery,
	Joel Stanley, Russell King, Lorenzo Stoakes, Shuah Khan,
	Steven Rostedt, Jonathan Corbet, Joel Granados, Andrew Morton,
	Phil Auld, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Masami Hiramatsu, Jakub Kicinski,
	Pawan Gupta, Simon Horman, Anshuman Khandual, Florian Westphal,
	netdev@vger.kernel.org, Kees Cook, Arnd Bergmann,
	Paul E . McKenney, Feng Tang, Jason A . Donenfeld

> I would also update the subject to something like:
> 
>     hung_task: Panic when there are more than N hung tasks at the same time
> 

OK, I will update the subject.

> 
> 
> That said, I think that both approaches make sense.
> 
> Your approach would trigger the panic when many processes are stuck.
> Note that it still might be a transient state. But I agree that the more stuck
> processes there are, the more serious the problem likely is for the health of
> the system.
> 
> My approach would trigger the panic when a single process hangs for a long
> time. It would more likely trigger only when the problem is persistent. The
> seriousness depends on which particular process gets stuck.
> 
Yes, both are reasonable requirements, and I will leave the second one to you or anyone else interested in implementing it.

Thanks

-Li.


> I am fine with your approach. Just please make it clearer that the number
> means the number of tasks hung at the same time.
> And mention the login problems, ...
> 
> Best Regards,
> Petr


* Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks
  2025-10-14  9:45   ` Petr Mladek
  2025-10-14 10:49     ` [External Mail] " Li,Rongqing
@ 2025-10-14 10:59     ` Lance Yang
  2025-10-14 11:18       ` [External Mail] " Li,Rongqing
  1 sibling, 1 reply; 12+ messages in thread
From: Lance Yang @ 2025-10-14 10:59 UTC (permalink / raw)
  To: lirongqing, Petr Mladek
  Cc: wireguard, linux-arm-kernel, Liam R . Howlett, linux-doc,
	David Hildenbrand, Randy Dunlap, Stanislav Fomichev, linux-aspeed,
	Andrew Jeffery, Joel Stanley, Russell King, Lorenzo Stoakes,
	Shuah Khan, Steven Rostedt, Jonathan Corbet, Joel Granados,
	Andrew Morton, Phil Auld, linux-kernel, linux-kselftest,
	Masami Hiramatsu, Jakub Kicinski, Pawan Gupta, Simon Horman,
	Anshuman Khandual, Florian Westphal, netdev, Kees Cook,
	Arnd Bergmann, Paul E . McKenney, Feng Tang, Jason A . Donenfeld



On 2025/10/14 17:45, Petr Mladek wrote:
> On Tue 2025-10-14 13:23:58, Lance Yang wrote:
>> Thanks for the patch!
>>
>> I noticed the implementation panics only when N tasks are detected
>> within a single scan, because total_hung_task is reset for each
>> check_hung_uninterruptible_tasks() run.
> 
> Great catch!
> 
> Does it make sense?
> Is it the intended behavior, please?
> 
>> So some suggestions to align the documentation with the code's
>> behavior below :)
> 
>> On 2025/10/12 19:50, lirongqing wrote:
>>> From: Li RongQing <lirongqing@baidu.com>
>>>
>>> Currently, when 'hung_task_panic' is enabled, the kernel panics
>>> immediately upon detecting the first hung task. However, some hung
>>> tasks are transient and the system can recover, while others are
>>> persistent and may accumulate progressively.
> 
> My understanding is that this patch wanted to do:
> 
>     + report even temporary stalls
>     + panic only when the stall was much longer and likely persistent
> 
> Which might make some sense. But the code does something else.

Cool. Sounds good to me!

> 
>>> --- a/kernel/hung_task.c
>>> +++ b/kernel/hung_task.c
>>> @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
>>>    	 */
>>>    	sysctl_hung_task_detect_count++;
>>> +	total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
>>>    	trace_sched_process_hang(t);
>>> -	if (sysctl_hung_task_panic) {
>>> +	if (sysctl_hung_task_panic &&
>>> +			(total_hung_task >= sysctl_hung_task_panic)) {
>>>    		console_verbose();
>>>    		hung_task_show_lock = true;
>>>    		hung_task_call_panic = true;
> 
> I would expect that this patch added another counter, similar to
> sysctl_hung_task_detect_count. It would be incremented only
> once per check when a hung task was detected. And it would
> be cleared (reset) when no hung task was found.

Much cleaner. We could add an internal counter for that, yeah. No need
to expose it to userspace ;)

Petr's suggestion seems to align better with the goal of panicking on
persistent hangs, IMHO. Panic after N consecutive checks with hung tasks.

@RongQing does that work for you?
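
A minimal userspace sketch of that alternative (panic after N consecutive
checks that each found at least one hung task) might look like the
following. struct hung_streak_state and scan_end() are hypothetical names,
and reusing the same threshold value for the streak length is an
assumption, not something this thread settled on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a "consecutive hung scans" policy; not kernel code. */
struct hung_streak_state {
	unsigned int consecutive_hung_scans; /* internal counter, not exposed to userspace */
	unsigned int panic_threshold;        /* 0 means never panic */
};

/*
 * Call once at the end of each scan, saying whether any hung task was found.
 * The counter goes up by at most one per scan and is cleared by a clean scan.
 * Returns true when the streak of hung scans reaches the threshold.
 */
static bool scan_end(struct hung_streak_state *s, bool found_hung_task)
{
	assert(s != NULL);
	if (found_hung_task)
		s->consecutive_hung_scans++;
	else
		s->consecutive_hung_scans = 0;
	return s->panic_threshold &&
	       s->consecutive_hung_scans >= s->panic_threshold;
}
```

With a threshold of 3, the sequence hung, hung, clean, hung, hung, hung
would only trigger on the last scan, since the clean scan resets the
streak.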


* RE: [External Mail] Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks
  2025-10-14 10:59     ` Lance Yang
@ 2025-10-14 11:18       ` Li,Rongqing
  2025-10-14 11:40         ` Lance Yang
  0 siblings, 1 reply; 12+ messages in thread
From: Li,Rongqing @ 2025-10-14 11:18 UTC (permalink / raw)
  To: Lance Yang, Petr Mladek
  Cc: wireguard@lists.zx2c4.com, linux-arm-kernel@lists.infradead.org,
	Liam R . Howlett, linux-doc@vger.kernel.org, David Hildenbrand,
	Randy Dunlap, Stanislav Fomichev, linux-aspeed@lists.ozlabs.org,
	Andrew Jeffery, Joel Stanley, Russell King, Lorenzo Stoakes,
	Shuah Khan, Steven Rostedt, Jonathan Corbet, Joel Granados,
	Andrew Morton, Phil Auld, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Masami Hiramatsu, Jakub Kicinski,
	Pawan Gupta, Simon Horman, Anshuman Khandual, Florian Westphal,
	netdev@vger.kernel.org, Kees Cook, Arnd Bergmann,
	Paul E . McKenney, Feng Tang, Jason A . Donenfeld

> >>> Currently, when 'hung_task_panic' is enabled, the kernel panics
> >>> immediately upon detecting the first hung task. However, some hung
> >>> tasks are transient and the system can recover, while others are
> >>> persistent and may accumulate progressively.
> >
> > My understanding is that this patch wanted to do:
> >
> >     + report even temporary stalls
> >     + panic only when the stall was much longer and likely persistent
> >
> > Which might make some sense. But the code does something else.
> 
> Cool. Sounds good to me!
> 
> >
> >>> --- a/kernel/hung_task.c
> >>> +++ b/kernel/hung_task.c
> >>> @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> >>>    	 */
> >>>    	sysctl_hung_task_detect_count++;
> >>> +	total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
> >>>    	trace_sched_process_hang(t);
> >>> -	if (sysctl_hung_task_panic) {
> >>> +	if (sysctl_hung_task_panic &&
> >>> +			(total_hung_task >= sysctl_hung_task_panic)) {
> >>>    		console_verbose();
> >>>    		hung_task_show_lock = true;
> >>>    		hung_task_call_panic = true;
> >
> > I would expect that this patch added another counter, similar to
> > sysctl_hung_task_detect_count. It would be incremented only once per
> > check when a hung task was detected. And it would be cleared (reset)
> > when no hung task was found.
> 
> Much cleaner. We could add an internal counter for that, yeah. No need to
> expose it to userspace ;)
> 
> Petr's suggestion seems to align better with the goal of panicking on
> persistent hangs, IMHO. Panic after N consecutive checks with hung tasks.
> 
> @RongQing does that work for you?


In my opinion, a single task hang is not a critical issue. Fatal hangs, such as those caused by I/O hangs, network card failures, or hangs while holding locks, will inevitably lead to multiple tasks being hung. In such scenarios, users cannot even log in to the machine, which makes it extremely difficult to investigate the root cause. Therefore, I believe the current approach is sound. What's your opinion?

-Li



* Re: [External Mail] Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks
  2025-10-14 11:18       ` [External Mail] " Li,Rongqing
@ 2025-10-14 11:40         ` Lance Yang
  0 siblings, 0 replies; 12+ messages in thread
From: Lance Yang @ 2025-10-14 11:40 UTC (permalink / raw)
  To: Li,Rongqing, Petr Mladek
  Cc: wireguard@lists.zx2c4.com, linux-arm-kernel@lists.infradead.org,
	Liam R . Howlett, linux-doc@vger.kernel.org, David Hildenbrand,
	Randy Dunlap, Stanislav Fomichev, linux-aspeed@lists.ozlabs.org,
	Andrew Jeffery, Joel Stanley, Russell King, Lorenzo Stoakes,
	Shuah Khan, Steven Rostedt, Jonathan Corbet, Joel Granados,
	Andrew Morton, Phil Auld, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, Masami Hiramatsu, Jakub Kicinski,
	Pawan Gupta, Simon Horman, Anshuman Khandual, Florian Westphal,
	netdev@vger.kernel.org, Kees Cook, Arnd Bergmann,
	Paul E . McKenney, Feng Tang, Jason A . Donenfeld



On 2025/10/14 19:18, Li,Rongqing wrote:
>>>>> Currently, when 'hung_task_panic' is enabled, the kernel panics
>>>>> immediately upon detecting the first hung task. However, some hung
>>>>> tasks are transient and the system can recover, while others are
>>>>> persistent and may accumulate progressively.
>>>
>>> My understanding is that this patch wanted to do:
>>>
>>>      + report even temporary stalls
>>>      + panic only when the stall was much longer and likely persistent
>>>
>>> Which might make some sense. But the code does something else.
>>
>> Cool. Sounds good to me!
>>
>>>
>>>>> --- a/kernel/hung_task.c
>>>>> +++ b/kernel/hung_task.c
>>>>> @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
>>>>>     	 */
>>>>>     	sysctl_hung_task_detect_count++;
>>>>> +	total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
>>>>>     	trace_sched_process_hang(t);
>>>>> -	if (sysctl_hung_task_panic) {
>>>>> +	if (sysctl_hung_task_panic &&
>>>>> +			(total_hung_task >= sysctl_hung_task_panic)) {
>>>>>     		console_verbose();
>>>>>     		hung_task_show_lock = true;
>>>>>     		hung_task_call_panic = true;
>>>
>>> I would expect that this patch added another counter, similar to
>>> sysctl_hung_task_detect_count. It would be incremented only once per
>>> check when a hung task was detected. And it would be cleared (reset)
>>> when no hung task was found.
>>
>> Much cleaner. We could add an internal counter for that, yeah. No need to
>> expose it to userspace ;)
>>
>> Petr's suggestion seems to align better with the goal of panicking on
>> persistent hangs, IMHO. Panic after N consecutive checks with hung tasks.
>>
>> @RongQing does that work for you?
> 
> 
> In my opinion, a single task hang is not a critical issue. Fatal hangs, such as those caused by I/O hangs, network card failures, or hangs while holding locks, will inevitably lead to multiple tasks being hung. In such scenarios, users cannot even log in to the machine, which makes it extremely difficult to investigate the root cause. Therefore, I believe the current approach is sound. What's your opinion?

Thanks! I'm fine with either approach. Let's hear what the other folks 
think ;)



end of thread, other threads:[~2025-10-15  2:09 UTC | newest]

Thread overview: 12+ messages
2025-10-12 11:50 [PATCH][v3] hung_task: Panic after fixed number of hung tasks lirongqing
2025-10-12 13:26 ` [PATCH v3] " Markus Elfring
2025-10-13  2:14   ` [External Mail] " Li,Rongqing
2025-10-14  1:37 ` [PATCH][v3] " Randy Dunlap
2025-10-14  5:23 ` Lance Yang
2025-10-14  9:45   ` Petr Mladek
2025-10-14 10:49     ` [External Mail] " Li,Rongqing
2025-10-14 13:09       ` Petr Mladek
2025-10-15  2:04         ` [External Mail] " Li,Rongqing
2025-10-14 10:59     ` Lance Yang
2025-10-14 11:18       ` [External Mail] " Li,Rongqing
2025-10-14 11:40         ` Lance Yang
