From: "Li,Rongqing"
To: "paulmck@kernel.org", Lance Yang
CC: "linux-kernel@vger.kernel.org", "linux-doc@vger.kernel.org", "arnd@arndb.de", "feng.tang@linux.alibaba.com", "joel.granados@kernel.org", "kees@kernel.org", "rostedt@goodmis.org", "pauld@redhat.com", "pawan.kumar.gupta@linux.intel.com", "mhiramat@kernel.org", "dave.hansen@linux.intel.com", "corbet@lwn.net", "akpm@linux-foundation.org", "mingo@kernel.org"
Subject: RE: [????] Re: [PATCH] hung_task: Panic after fixed number of hung tasks
Date: Sun, 28 Sep 2025 01:51:52 +0000
Message-ID: <8828a890f37048da8b9846b08c321c2b@baidu.com>
References: <20250925060605.2659-1-lirongqing@baidu.com> <8c4cd66c-9c3f-411a-82df-0130b78e889c@linux.dev> <81514e1d-4a10-4466-8a87-2d4b0927195b@paulmck-laptop>
In-Reply-To: <81514e1d-4a10-4466-8a87-2d4b0927195b@paulmck-laptop>

> > On 2025/9/25 14:06, lirongqing wrote:
> > > From: Li RongQing
> > >
> > > Currently, when hung_task_panic is enabled, kernel will panic
> > > immediately upon detecting the first hung task. However, some hung
> > > tasks are transient and the system can recover fully, while others
> > > are unrecoverable and trigger consecutive hung task reports, and a
> > > panic is expected.
> >
> > The new hung_task_count_to_panic relies on an absolute count, but I
> > assume the real indicator you're trying to capture is the trend or
> > rate of increase over a time window (e.g., "panic if count increases
> > by 5 in 10 minutes").
> >
> > IMHO, this kind of time-windowed, trend-based logic seems much more
> > flexible and better suited for a userspace monitoring agent :)
> >
> > In other words, why is this the right place for this feature?
>
> A possibly related question is "why are RCU CPU stall warnings
> implemented in the kernel instead of in userspace?" One reason is that
> by the time that things get bad enough to trigger an RCU CPU stall
> warning, userspace might not be capable of doing much of anything.
> Thus, there is an uncomfortably high probability that orchestrating
> RCU CPU stall warnings from userspace would cause these warnings to be
> lost entirely.
>

Thank you, I think so too.

-Li

> Similar reasoning might (or might not) apply to the hung-task mechanism.
>
>                                                         Thanx, Paul
>
> > Please sell it to us ;)
> > Lance
> >
> > >
> > > This commit adds a new sysctl parameter hung_task_count_to_panic to
> > > allow specifying the number of consecutive hung tasks that must be
> > > detected before triggering a kernel panic. This provides finer
> > > control for environments where transient hangs may happen but
> > > persistent hangs should still be fatal.
> > >
> > > Signed-off-by: Li RongQing
> > > ---
> > >  Documentation/admin-guide/sysctl/kernel.rst |  6 ++++++
> > >  kernel/hung_task.c                          | 14 +++++++++++++-
> > >  2 files changed, 19 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> > > index 8b49eab..4240e7b 100644
> > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > @@ -405,6 +405,12 @@ This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
> > >  1 Panic immediately.
> > >  = =================================================
> > > +hung_task_count_to_panic
> > > +=====================
> > > +
> > > +When set to a non-zero value, after the number of consecutive hung
> > > +task occur, the kernel will triggers a panic
> > > +
> > >  hung_task_check_count
> > >  =====================
> > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > > index 8708a12..87a6421 100644
> > > --- a/kernel/hung_task.c
> > > +++ b/kernel/hung_task.c
> > > @@ -83,6 +83,8 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
> > >  static unsigned int __read_mostly sysctl_hung_task_panic =
> > >          IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> > > +static unsigned int __read_mostly sysctl_hung_task_count_to_panic;
> > > +
> > >  static int
> > >  hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> > >  {
> > > @@ -219,7 +221,9 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
> > >          trace_sched_process_hang(t);
> > > -        if (sysctl_hung_task_panic) {
> > > +        if (sysctl_hung_task_panic ||
> > > +            (sysctl_hung_task_count_to_panic &&
> > > +            (sysctl_hung_task_detect_count >= sysctl_hung_task_count_to_panic))) {
> > >                  console_verbose();
> > >                  hung_task_show_lock = true;
> > >                  hung_task_call_panic = true;
> > > @@ -388,6 +392,14 @@ static const struct ctl_table hung_task_sysctls[] = {
> > >                  .extra2         = SYSCTL_ONE,
> > >          },
> > >          {
> > > +                .procname       = "hung_task_count_to_panic",
> > > +                .data           = &sysctl_hung_task_count_to_panic,
> > > +                .maxlen         = sizeof(int),
> > > +                .mode           = 0644,
> > > +                .proc_handler   = proc_dointvec_minmax,
> > > +                .extra1         = SYSCTL_ZERO,
> > > +        },
> > > +        {
> > >                  .procname       = "hung_task_check_count",
> > >                  .data           = &sysctl_hung_task_check_count,
> > >                  .maxlen         = sizeof(int),
> >