From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from out-181.mta0.migadu.com (out-181.mta0.migadu.com [91.218.175.181])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 29DCB336ED1
	for <linux-kernel@vger.kernel.org>; Fri,  9 Jan 2026 13:50:55 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.181
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1767966660; cv=none; b=lNHmDMnnmHZ0mSnyYdz91oji6Qp0WMgIpQ61aXZntgeAibrBgtqbjOa/8q/JLY9rf/HlCFwYCHE7GqGYR/J4BceuQ8NLf21bgs/emNH5B7VOXI0khGtssReYrUK4vVt0noA0IgavtGmzjCk14UWjJ4srDazjxOdO2MqfIgpuCdQ=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1767966660; c=relaxed/simple;
	bh=n6tajQiijja81e9+jqPLCzmg1Yxz4d6OyyhLlFlvjXg=;
	h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From:
	 In-Reply-To:Content-Type; b=K9uQ6wFp/YkxA7seWzni7zxHqHjem1oDB5yEm7fMZIDAqs0s8e+D6zkAJy/cZ2eETZSKh/c2IXFyavJI8nG+Ml3DEcNhtiY/U3a1x10mL/lN4YaBX89Iq4JFTGtBhHJ3kbRvyGu49YeB4d3zA0KpZkW5R2tsu8P3tLERMfE9r5Q=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=VLqFaJeQ; arc=none smtp.client-ip=91.218.175.181
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="VLqFaJeQ"
Message-ID: <e4d9a366-597f-4b62-8bbd-4cba33968a6e@linux.dev>
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1767966653;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=Pl70/ffeCKWshxsz4iVspYDfraTgJE1+JWagmUpjgBc=;
	b=VLqFaJeQ9VTAB3dvBcheGVbFqSoxe0OxLEF62PDvkFOggb70ytrKA1H3sQFFP6883Efx0u
	HL4vkLodcSzYet1qdJTPui2Tf9Yq4+LNqeKTD616YgFN981lTo+48V3noqGPfimMOZth0l
	QL0NoQjScBLeyn4K74pR+iYfs/T500k=
Date: Fri, 9 Jan 2026 21:50:20 +0800
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Subject: Re: [v5 PATCH 2/2] hung_task: Enable runtime reset of
 hung_task_detect_count
Content-Language: en-US
To: Petr Mladek <pmladek@suse.com>, Aaron Tomlin <atomlin@atomlin.com>
Cc: akpm@linux-foundation.org, mhiramat@kernel.org,
 gregkh@linuxfoundation.org, joel.granados@kernel.org, sean@ashe.io,
 linux-kernel@vger.kernel.org
References: <20251231004125.2380105-1-atomlin@atomlin.com>
 <20251231004125.2380105-3-atomlin@atomlin.com>
 <aV_CGOl_rXziqdHZ@pathway.suse.cz>
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: Lance Yang <lance.yang@linux.dev>
In-Reply-To: <aV_CGOl_rXziqdHZ@pathway.suse.cz>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Migadu-Flow: FLOW_OUT


On 2026/1/8 22:41, Petr Mladek wrote:
> On Tue 2025-12-30 19:41:25, Aaron Tomlin wrote:
>> Introduce support for writing to /proc/sys/kernel/hung_task_detect_count.
>>
>> Writing a value of zero to this file atomically resets the counter of
>> detected hung tasks. This grants system administrators the ability to
>> clear the cumulative diagnostic history after resolving an incident,
>> simplifying monitoring without requiring a system restart.
> 
>> --- a/kernel/hung_task.c
>> +++ b/kernel/hung_task.c
>> @@ -36,7 +37,7 @@ static int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
>>   /*
>>    * Total number of tasks detected as hung since boot:
>>    */
>> -static unsigned long __read_mostly sysctl_hung_task_detect_count;
>> +static atomic_long_t sysctl_hung_task_detect_count = ATOMIC_LONG_INIT(0);
>>   
>>   /*
>>    * Limit number of tasks checked in a batch.
>> @@ -246,20 +247,26 @@ static inline void hung_task_diagnostics(struct task_struct *t)
>>   }
>>   
>>   static void check_hung_task(struct task_struct *t, unsigned long timeout,
>> -		unsigned long prev_detect_count)
>> +			    unsigned long prev_detect_count)
>>   {
>> -	unsigned long total_hung_task;
>> +	unsigned long total_hung_task, cur_detect_count;
>>   
>>   	if (!task_is_hung(t, timeout))
>>   		return;
>>   
>>   	/*
>>   	 * This counter tracks the total number of tasks detected as hung
>> -	 * since boot.
>> +	 * since boot. If a reset occurred during the scan, we treat the
>> +	 * current count as the new delta to avoid an underflow error.
>> +	 * Ensure hang details are globally visible before the counter
>> +	 * update.
>>   	 */
>> -	sysctl_hung_task_detect_count++;
>> +	cur_detect_count = atomic_long_inc_return_release(&sysctl_hung_task_detect_count);
> 
> The _release() feels a bit weird here because the counter might
> get incremented more times during one scan.
> 
> IMHO, it should be perfectly fine to use the _relaxed version here
> because it is in the middle of the acquire/release, see below.
> The important thing here is that the load/modify/store operation
> is done atomically.

Right, we only need atomicity here, not the ordering guarantee :)

> 
>> +	if (cur_detect_count >= prev_detect_count)
>> +		total_hung_task = cur_detect_count - prev_detect_count;
>> +	else
>> +		total_hung_task = cur_detect_count;
>>   
>> -	total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
>>   	trace_sched_process_hang(t);
>>   
>>   	if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) {
>> @@ -318,10 +325,12 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>>   	int max_count = sysctl_hung_task_check_count;
>>   	unsigned long last_break = jiffies;
>>   	struct task_struct *g, *t;
>> -	unsigned long prev_detect_count = sysctl_hung_task_detect_count;
>> +	unsigned long cur_detect_count, prev_detect_count, delta;
>>   	int need_warning = sysctl_hung_task_warnings;
>>   	unsigned long si_mask = hung_task_si_mask;
>>   
>> +	/* Acquire prevents reordering task checks before this point. */
>> +	prev_detect_count = atomic_long_read_acquire(&sysctl_hung_task_detect_count);
> 
> This value is read before the scan started => _acquire
> semantic/barrier fits here.
> 
>>   	/*
>>   	 * If the system crashed already then all bets are off,
>>   	 * do not report extra hung tasks:
>> @@ -346,7 +355,14 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>>    unlock:
>>   	rcu_read_unlock();
>>   
>> -	if (!(sysctl_hung_task_detect_count - prev_detect_count))
>> +	/* Ensures we see all hang details recorded during the scan. */
>> +	cur_detect_count = atomic_long_read_acquire(&sysctl_hung_task_detect_count);
> 
> This value is read at the end of the scan => _release
> semantic/barrier should be here.

Seems like _acquire is still correct here, because it is a load.

_release semantics apply to stores, while _acquire on a load
ensures subsequent memory accesses are not reordered before it.

Or smp_mb()?

In the same thread, atomic operations on the same variable are not
reordered with respect to each other, even the _relaxed variant
preserves program order for that variable, IIRC.

So the increment will always complete before the final read in
program order, and the read will see the updated value (unless
another CPU resets it concurrently, which is a logical race, not
a reordering issue).

So, it would be:

   prev = atomic_long_read_acquire(&counter);     // scan start
   ...
   cur = atomic_long_inc_return_relaxed(&counter); // during scan
   ...
   cur = atomic_long_read_acquire(&counter);      // scan end

The first _acquire ensures no task-checking code is reordered
before the start read, the middle increment is just atomic
without extra barriers, and the final _acquire makes sure we
observe all hang details before computing the delta.

That said, I also see the value in using _release or smp_mb()
here to pair with the _acquire at the start. Making the
ordering semantics clearer to readers.

Cheers,
Lance

> 
>> +	if (cur_detect_count < prev_detect_count)
>> +		delta = cur_detect_count;
>> +	else
>> +		delta = cur_detect_count - prev_detect_count;
>> +
>> +	if (!delta)
>>   		return;
>>   
>>   	if (need_warning || hung_task_call_panic) {
> 
> Otherwise, I do not have anything more to add. I agree with the other
> proposals, for example:
> 
>     + remove 1st patch
>     + split 2nd patch into two
>     + changes in the sysctl code proposed by Joel
> 
> Best Regards,
> Petr