public inbox for linux-kernel@vger.kernel.org
From: Breno Leitao <leitao@debian.org>
To: Petr Mladek <pmladek@suse.com>
Cc: Tejun Heo <tj@kernel.org>, Lai Jiangshan <jiangshanlai@gmail.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	 Omar Sandoval <osandov@osandov.com>, Song Liu <song@kernel.org>,
	 Danielle Costantino <dcostantino@meta.com>,
	kasan-dev@googlegroups.com, kernel-team@meta.com
Subject: Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers
Date: Fri, 13 Mar 2026 05:24:54 -0700	[thread overview]
Message-ID: <abP8wDhYWwk3ufmA@gmail.com> (raw)
In-Reply-To: <abLsAi7_fU5FrYiF@pathway.suse.cz>

Hello Petr,

On Thu, Mar 12, 2026 at 05:38:26PM +0100, Petr Mladek wrote:
> On Thu 2026-03-05 08:15:36, Breno Leitao wrote:
> > There is a blind spot in the workqueue stall detector (aka
> > show_cpu_pool_hog()). It only prints workers whose task_is_running() is
> > true, so a busy worker that is sleeping (e.g. in wait_event_idle())
> > produces an empty backtrace section even though it is the cause of the
> > stall.
> > 
> > Additionally, when the watchdog does report stalled pools, the output
> > doesn't show how long each in-flight work item has been running, making
> > it harder to identify which specific worker is stuck.
> > 
> > Example of the sample code:
> > 
> >     BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
> >     Showing busy workqueues and worker pools:
> >     workqueue events: flags=0x100
> >         pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
> >         in-flight: 178:stall_work1_fn [wq_stall]
> >         pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
> > 	...
> >     Showing backtraces of running workers in stalled
> >     CPU-bound worker pools:
> >         <nothing here>
> > 
> > I see this happening on real machines, causing some stalls that don't
> > have any backtrace. This is one of the code paths:
> > 
> >   1) kfence executes toggle_allocation_gate() as a delayed workqueue
> >      item (kfence_timer) on the system WQ.
> > 
> >   2) toggle_allocation_gate() enables a static key, which IPIs every
> >      CPU to patch code:
> >           static_branch_enable(&kfence_allocation_key);
> > 
> >   3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
> >      kfence allocation to occur:
> >           wait_event_idle(allocation_wait,
> >                   atomic_read(&kfence_allocation_gate) > 0 || ...);
> > 
> >      This can last indefinitely if no allocation goes through the
> >      kfence path (or if IPIing all the CPUs takes longer, which is
> >      common on platforms that do not have NMI).
> > 
> >      The worker remains in the pool's busy_hash
> >      (in-flight) but is no longer task_is_running().
> >
> >   4) The workqueue watchdog detects the stall and calls
> >      show_cpu_pool_hog(), which only prints backtraces for workers
> >      that are actively running on CPU:
> > 
> >           static void show_cpu_pool_hog(struct worker_pool *pool) {
> >                   ...
> >                   if (task_is_running(worker->task))
> >                           sched_show_task(worker->task);
> >           }
> > 
> >   5) Nothing is printed because the offending worker is in TASK_IDLE
> >      state. The output shows "Showing backtraces of running workers in
> >      stalled CPU-bound worker pools:" followed by nothing, effectively
> >      hiding the actual culprit.
> 
> I am trying to better understand the situation. There was a reason
> why only the worker in the running state was shown.
> 
> Normally, a sleeping worker should not cause a stall. The scheduler calls
> wq_worker_sleeping() which should wake up another idle worker. There is
> always at least one idle worker in the pool. It should start processing
> the next pending work. Or it should fork another worker when it was
> the last idle one.

Right, but let's look at this case:

	 BUG: workqueue lockup - pool 55 cpu 13 curr 0 (swapper/13) stack ffff800085640000 cpus=13 node=0 flags=0x0 nice=-20 stuck for 679s!
	  work func=blk_mq_timeout_work data=0xffff0000ad7e3a05
	  Showing busy workqueues and worker pools:
	  workqueue events_unbound: flags=0x2
	    pwq 288: cpus=0-71 flags=0x4 nice=0 active=1 refcnt=2
	      in-flight: 4083734:btrfs_extent_map_shrinker_worker
	  workqueue mm_percpu_wq: flags=0x8
	    pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=1 refcnt=2
	      pending: vmstat_update
	  pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 1715676 4086805 3860852 3587585 4065550 4014041 3944711 3744484
	  Showing backtraces of running workers in stalled CPU-bound worker pools:
		# Nothing in here

It seems CPU 13 is idle (curr = 0), and blk_mq_timeout_work has been
pending for 679s?

> I wonder what blocked the idle worker from waking or forking
> a new worker. Was it caused by the IPIs?

Not sure. Keep in mind that these hosts (arm64) do not have NMI, so
IPIs are just regular interrupts that can take a long time to be handled.
toggle_allocation_gate() was a good example, given it was sending IPIs very
frequently, so I took it as the example for the cover letter, but this
problem also shows up in different places. (more examples later)

> Did printing the sleeping workers help to analyze the problem?

That is my hope. I don't have a reproducer other than the one in this
patchset.

I am currently rolling this patchset to production, and I can report once
I get more information.

> I wonder if we could do better in this case. For example, warn
> that the scheduler failed to wake up another idle worker when
> no worker is in the running state. And maybe, print backtrace
> of the currently running process on the given CPU because it
> likely blocks waking/scheduling the idle worker.

I am happy to improve this, given this has been a hard issue. Let me give
more instances of the "empty" stalls I am seeing, all with empty backtraces:

# Instance 1
	 BUG: workqueue lockup - pool cpus=33 node=0 flags=0x0 nice=0 stuck for 33s!
	 Showing busy workqueues and worker pools:
	 workqueue events: flags=0x0
	   pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=3 refcnt=4
	     pending: 3*psi_avgs_work
	   pwq 218: cpus=54 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     in-flight: 842:key_garbage_collector
	 workqueue mm_percpu_wq: flags=0x8
	   pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_update
	 pool 218: cpus=54 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 11200 524627
	 Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 2
	 BUG: workqueue lockup - pool cpus=53 node=0 flags=0x0 nice=0 stuck for 459s!
	 Showing busy workqueues and worker pools:
	 workqueue events: flags=0x0
	   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: psi_avgs_work
	   pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=4 refcnt=5
	     pending: 2*psi_avgs_work, drain_local_memcg_stock, iova_depot_work_func
	 workqueue events_freezable: flags=0x4
	   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: pci_pme_list_scan
	 workqueue slub_flushwq: flags=0x8
	   pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=3
	     pending: flush_cpu_slab BAR(7520)
	 workqueue mm_percpu_wq: flags=0x8
	   pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_update
	 workqueue mlx5_cmd_0002:03:00.1: flags=0x6000a
	   pwq 576: cpus=0-143 flags=0x4 nice=0 active=1 refcnt=146
	     pending: cmd_work_handler
	 Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 3
	 BUG: workqueue lockup - pool cpus=74 node=1 flags=0x0 nice=0 stuck for 31s!
	 Showing busy workqueues and worker pools:
	 workqueue mm_percpu_wq: flags=0x8
	   pwq 298: cpus=74 node=1 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_update
	 Showing backtraces of running workers in stalled CPU-bound worker pools:	

# Instance 4
	 BUG: workqueue lockup - pool cpus=71 node=0 flags=0x0 nice=0 stuck for 32s!
	 Showing busy workqueues and worker pools:
	 workqueue events: flags=0x0
	   pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=2 refcnt=3
	     pending: psi_avgs_work, fuse_check_timeout
	 workqueue events_freezable: flags=0x4
	   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: pci_pme_list_scan
	 workqueue mm_percpu_wq: flags=0x8
	   pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_update
	 Showing backtraces of running workers in stalled CPU-bound worker pools:

Thanks for your help,
--breno


Thread overview: 24+ messages
2026-03-05 16:15 [PATCH v2 0/5] workqueue: Detect stalled in-flight workers Breno Leitao
2026-03-05 16:15 ` [PATCH v2 1/5] workqueue: Use POOL_BH instead of WQ_BH when checking pool flags Breno Leitao
2026-03-05 17:13   ` Song Liu
2026-03-05 16:15 ` [PATCH v2 2/5] workqueue: Rename pool->watchdog_ts to pool->last_progress_ts Breno Leitao
2026-03-05 17:16   ` Song Liu
2026-03-05 16:15 ` [PATCH v2 3/5] workqueue: Show in-flight work item duration in stall diagnostics Breno Leitao
2026-03-05 17:17   ` Song Liu
2026-03-05 16:15 ` [PATCH v2 4/5] workqueue: Show all busy workers " Breno Leitao
2026-03-05 17:17   ` Song Liu
2026-03-12 17:03   ` Petr Mladek
2026-03-13 12:57     ` Breno Leitao
2026-03-13 16:27       ` Petr Mladek
2026-03-18 11:31         ` Breno Leitao
2026-03-18 15:11           ` Petr Mladek
2026-03-20 10:41             ` Breno Leitao
2026-03-05 16:15 ` [PATCH v2 5/5] workqueue: Add stall detector sample module Breno Leitao
2026-03-05 17:25   ` Song Liu
2026-03-05 17:39 ` [PATCH v2 0/5] workqueue: Improve stall diagnostics Tejun Heo
2026-03-12 16:38 ` [PATCH v2 0/5] workqueue: Detect stalled in-flight workers Petr Mladek
2026-03-13 12:24   ` Breno Leitao [this message]
2026-03-13 14:38     ` Petr Mladek
2026-03-13 17:36       ` Breno Leitao
2026-03-18 16:46         ` Petr Mladek
2026-03-20 10:44           ` Breno Leitao
