Date: Fri, 13 Mar 2026 05:24:54 -0700
From: Breno Leitao
To: Petr Mladek
Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, linux-kernel@vger.kernel.org,
 Omar Sandoval, Song Liu, Danielle Costantino, kasan-dev@googlegroups.com,
 kernel-team@meta.com
Subject: Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers
References: <20260305-wqstall_start-at-v2-0-b60863ee0899@debian.org>

Hello Petr,

On Thu, Mar 12, 2026 at 05:38:26PM +0100, Petr Mladek wrote:
> On Thu 2026-03-05 08:15:36, Breno Leitao wrote:
> > There is a blind spot in the workqueue stall detector (aka
> > show_cpu_pool_hog()). It only prints workers whose task_is_running()
> > is true, so a busy worker that is sleeping (e.g. in wait_event_idle())
> > produces an empty backtrace section even though it is the cause of
> > the stall.
> >
> > Additionally, when the watchdog does report stalled pools, the output
> > doesn't show how long each in-flight work item has been running,
> > making it harder to identify which specific worker is stuck.
> >
> > Example output from the sample code:
> >
> >   BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
> >   Showing busy workqueues and worker pools:
> >   workqueue events: flags=0x100
> >     pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
> >       in-flight: 178:stall_work1_fn [wq_stall]
> >       pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
> >   ...
> >   Showing backtraces of running workers in stalled CPU-bound worker pools:
> >
> > I see it happening on real machines, causing some stalls that don't
> > have any backtrace. This is one of the code paths:
> >
> > 1) kfence executes toggle_allocation_gate() as a delayed workqueue
> >    item (kfence_timer) on the system WQ.
> >
> > 2) toggle_allocation_gate() enables a static key, which IPIs every
> >    CPU to patch code:
> >
> >      static_branch_enable(&kfence_allocation_key);
> >
> > 3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
> >    kfence allocation to occur:
> >
> >      wait_event_idle(allocation_wait,
> >                      atomic_read(&kfence_allocation_gate) > 0 || ...);
> >
> >    This can last indefinitely if no allocation goes through the
> >    kfence path (or if IPIing all the CPUs takes a long time, which is
> >    common on platforms that do not have NMI).
> >
> >    The worker remains in the pool's busy_hash (in-flight) but is no
> >    longer task_is_running().
> >
> > 4) The workqueue watchdog detects the stall and calls
> >    show_cpu_pool_hog(), which only prints backtraces for workers
> >    that are actively running on a CPU:
> >
> >      static void show_cpu_pool_hog(struct worker_pool *pool)
> >      {
> >              ...
> >              if (task_is_running(worker->task))
> >                      sched_show_task(worker->task);
> >      }
> >
> > 5) Nothing is printed because the offending worker is in TASK_IDLE
> >    state. The output shows "Showing backtraces of running workers in
> >    stalled CPU-bound worker pools:" followed by nothing, effectively
> >    hiding the actual culprit.
>
> I am trying to better understand the situation.
> There was a reason why only the worker in the running state was shown.
>
> Normally, a sleeping worker should not cause a stall. The scheduler calls
> wq_worker_sleeping(), which should wake up another idle worker. There is
> always at least one idle worker in the pool. It should start processing
> the next pending work, or it should fork another worker when it was
> the last idle one.

Right, but let's look at this case:

  BUG: workqueue lockup - pool 55 cpu 13 curr 0 (swapper/13) stack ffff800085640000 cpus=13 node=0 flags=0x0 nice=-20 stuck for 679s!
      work func=blk_mq_timeout_work data=0xffff0000ad7e3a05
  Showing busy workqueues and worker pools:
  workqueue events_unbound: flags=0x2
    pwq 288: cpus=0-71 flags=0x4 nice=0 active=1 refcnt=2
      in-flight: 4083734:btrfs_extent_map_shrinker_worker
  workqueue mm_percpu_wq: flags=0x8
    pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=1 refcnt=2
      pending: vmstat_update
  pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17
    idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 1715676 4086805 3860852 3587585 4065550 4014041 3944711 3744484
  Showing backtraces of running workers in stalled CPU-bound worker pools:
  # Nothing in here

It seems CPU 13 is idle (curr = 0) and blk_mq_timeout_work has been
pending for 679s?

> I wonder what blocked the idle worker from waking or forking
> a new worker. Was it caused by the IPIs?

Not sure. Keep in mind that these hosts (arm64) do not have NMI, so
IPIs are just regular interrupts that could take a long time to be
handled. toggle_allocation_gate() was a good example, given it was
sending IPIs very frequently, and I took it as the example for the
cover letter, but this problem also shows up in different places
(more examples later).

> Did printing the sleeping workers help to analyze the problem?

That is my hope. I don't have a reproducer other than the one in this
patchset. I am currently rolling this patchset out to production, and I
can report back once I get more information.
> I wonder if we could do better in this case. For example, warn
> that the scheduler failed to wake up another idle worker when
> no worker is in the running state. And maybe print the backtrace
> of the currently running process on the given CPU, because it
> likely blocks waking/scheduling the idle worker.

I am happy to improve this, given this has been a hard issue. Let me
give more instances of the "empty" stalls I am seeing, all with empty
backtraces:

# Instance 1

BUG: workqueue lockup - pool cpus=33 node=0 flags=0x0 nice=0 stuck for 33s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=3 refcnt=4
    pending: 3*psi_avgs_work
  pwq 218: cpus=54 node=0 flags=0x0 nice=0 active=1 refcnt=2
    in-flight: 842:key_garbage_collector
workqueue mm_percpu_wq: flags=0x8
  pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=1 refcnt=2
    pending: vmstat_update
pool 218: cpus=54 node=0 flags=0x0 nice=0 hung=0s workers=3
  idle: 11200 524627
Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 2

BUG: workqueue lockup - pool cpus=53 node=0 flags=0x0 nice=0 stuck for 459s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
    pending: psi_avgs_work
  pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=4 refcnt=5
    pending: 2*psi_avgs_work, drain_local_memcg_stock, iova_depot_work_func
workqueue events_freezable: flags=0x4
  pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
    pending: pci_pme_list_scan
workqueue slub_flushwq: flags=0x8
  pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=3
    pending: flush_cpu_slab BAR(7520)
workqueue mm_percpu_wq: flags=0x8
  pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=2
    pending: vmstat_update
workqueue mlx5_cmd_0002:03:00.1: flags=0x6000a
  pwq 576: cpus=0-143 flags=0x4 nice=0 active=1 refcnt=146
    pending: cmd_work_handler
Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 3

BUG: workqueue lockup - pool cpus=74 node=1 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue mm_percpu_wq: flags=0x8
  pwq 298: cpus=74 node=1 flags=0x0 nice=0 active=1 refcnt=2
    pending: vmstat_update
Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 4

BUG: workqueue lockup - pool cpus=71 node=0 flags=0x0 nice=0 stuck for 32s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=2 refcnt=3
    pending: psi_avgs_work, fuse_check_timeout
workqueue events_freezable: flags=0x4
  pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
    pending: pci_pme_list_scan
workqueue mm_percpu_wq: flags=0x8
  pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=1 refcnt=2
    pending: vmstat_update
Showing backtraces of running workers in stalled CPU-bound worker pools:

Thanks for your help,
--breno