public inbox for linux-kernel@vger.kernel.org
From: Breno Leitao <leitao@debian.org>
To: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	 Omar Sandoval <osandov@osandov.com>,
	kernel-team@meta.com
Subject: Re: [PATCH 0/4] workqueue: Detect stalled in-flight workers
Date: Wed, 4 Mar 2026 07:40:49 -0800	[thread overview]
Message-ID: <aag4tTyeiZyw0jID@gmail.com> (raw)
In-Reply-To: <aYzQy4H0IRA5s6MP@slm.duckdns.org>

Hello Tejun,

On Wed, Feb 11, 2026 at 08:56:11AM -1000, Tejun Heo wrote:
> On Wed, Feb 11, 2026 at 04:29:14AM -0800, Breno Leitao wrote:
> > The workqueue watchdog detects pools that haven't made forward progress
> > by checking whether pending work items on the worklist have been waiting
> > too long. However, this approach has a blind spot: if a pool has only
> > one work item and that item has already been dequeued and is executing on
> > a worker, the worklist is empty and the watchdog skips the pool entirely.
> > This means a single hogged worker with no other pending work is invisible
> > to the stall detector.
> > 
> > I was able to come up with the following example that shows this blind
> > spot:
> > 
> > 	static void stall_work_fn(struct work_struct *work)
> > 	{
> > 		for (;;) {
> > 			mdelay(1000);
> > 			cond_resched();
> > 		}
> > 	}
> 
> Workqueue doesn't require users to limit execution time. As long as there is
> enough supply of concurrency to avoid stalling of pending work items, work
> items can run as long as they want, including indefinitely. Workqueue stall
> is there to indicate that there is insufficient supply of concurrency.

Thank you for the clarification. Let me share more context about the
actual problem I am observing so we can think through it together.

On some production hosts, I am seeing a workqueue stall where no
backtraces are printed:

	BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
	Showing busy workqueues and worker pools:
	workqueue events: flags=0x100
		pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
		in-flight: 178:stall_work1_fn [wq_stall]
		pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
	workqueue mm_percpu_wq: flags=0x108
		pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=1 refcnt=2
		pending: vmstat_update
		pool 18: cpus=4 node=0 flags=0x0 nice=0 hung=132s workers=2 idle: 45
	Showing backtraces of running workers in stalled
	CPU-bound worker pools:
		<nothing here>

We initially suspected a TOCTOU issue, and Omar put together a patch to
check for that, but it did not turn up anything.

After digging deeper, I believe I have found the root cause along with
a reproducer[1]:

  1) kfence executes toggle_allocation_gate() as a delayed workqueue
     item (kfence_timer) on the system WQ.

  2) toggle_allocation_gate() enables a static key, which IPIs every
     CPU to patch code:
          static_branch_enable(&kfence_allocation_key);

  3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
     kfence allocation to occur:
          wait_event_idle(allocation_wait,
                  atomic_read(&kfence_allocation_gate) > 0 || ...);

     This can last indefinitely if no allocation goes through the
     kfence path. The worker remains in the pool's busy_hash
     (in-flight) but is no longer task_is_running().

  4) The workqueue watchdog detects the stall and calls
     show_cpu_pool_hog(), which only prints backtraces for workers
     that are actively running on CPU:

          static void show_cpu_pool_hog(struct worker_pool *pool) {
                  ...
                  if (task_is_running(worker->task))
                          sched_show_task(worker->task);
          }

  5) Nothing is printed because the offending worker is in TASK_IDLE
     state. The output shows "Showing backtraces of running workers in
     stalled CPU-bound worker pools:" followed by nothing, effectively
     hiding the actual culprit.

The fix I am considering is to remove the task_is_running() filter in
show_cpu_pool_hog() so that all in-flight workers in stalled pools have
their backtraces printed, regardless of whether they are running or
sleeping. This would make sleeping culprits like toggle_allocation_gate()
visible in the watchdog output.

When I test without the task_is_running() check, the culprit's backtrace
is printed.

Fix I am testing:

	diff --git a/kernel/workqueue.c b/kernel/workqueue.c
	index aeaec79bc09c4..3f5ee08f99313 100644
	--- a/kernel/workqueue.c
	+++ b/kernel/workqueue.c
	@@ -7593,19 +7593,17 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
		raw_spin_lock_irqsave(&pool->lock, irq_flags);

		hash_for_each(pool->busy_hash, bkt, worker, hentry) {
	-               if (task_is_running(worker->task)) {
	-                       /*
	-                        * Defer printing to avoid deadlocks in console
	-                        * drivers that queue work while holding locks
	-                        * also taken in their write paths.
	-                        */
	-                       printk_deferred_enter();
	+               /*
	+                * Defer printing to avoid deadlocks in console
	+                * drivers that queue work while holding locks
	+                * also taken in their write paths.
	+                */
	+               printk_deferred_enter();

	-                       pr_info("pool %d:\n", pool->id);
	-                       sched_show_task(worker->task);
	+               pr_info("pool %d:\n", pool->id);
	+               sched_show_task(worker->task);

	-                       printk_deferred_exit();
	-               }
	+               printk_deferred_exit();
		}

		raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
	@@ -7616,7 +7614,7 @@ static void show_cpu_pools_hogs(void)
		struct worker_pool *pool;
		int pi;

	-       pr_info("Showing backtraces of running workers in stalled CPU-bound worker pools:\n");
	+       pr_info("Showing backtraces of in-flight workers in stalled CPU-bound worker pools:\n");

		rcu_read_lock();

Then I see:

	BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 34s!
	 Showing busy workqueues and worker pools:
	 workqueue events: flags=0x100
	   pwq 26: cpus=6 node=0 flags=0x0 nice=0 active=3 refcnt=4
	     in-flight: 161:stall_work1_fn [wq_stall]
	     pending: stall_work2_fn [wq_stall], psi_avgs_work
	 workqueue mm_percpu_wq: flags=0x108
	   pwq 26: cpus=6 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_update
	 pool 26: cpus=6 node=0 flags=0x0 nice=0 hung=34s workers=3 idle: 210 57
	 Showing backtraces of in-flight workers in stalled CPU-bound worker pools:
	 pool 26:
	 task:kworker/6:1     state:I stack:0     pid:161   tgid:161   ppid:2      task_flags:0x4208040 flags:0x00080000
	 Call Trace:
	  <TASK>
	  __schedule+0x1521/0x5360
	  ? console_trylock+0x40/0x40
	  ? preempt_count_add+0x92/0x1a0
	  ? do_raw_spin_lock+0x12c/0x2f0
	  ? is_mmconf_reserved+0x390/0x390
	  ? schedule+0x91/0x350
	  ? schedule+0x91/0x350
	  schedule+0x165/0x350
	  stall_work1_fn+0x17f/0x250 [wq_stall]


Link: https://github.com/leitao/debug/blob/main/workqueue_stall/wq_stall.c [1]

Thread overview: 8+ messages
2026-02-11 12:29 [PATCH 0/4] workqueue: Detect stalled in-flight workers Breno Leitao
2026-02-11 12:29 ` [PATCH 1/4] workqueue: Use POOL_BH instead of WQ_BH when checking pool flags Breno Leitao
2026-02-11 12:29 ` [PATCH 2/4] workqueue: Rename pool->watchdog_ts to pool->last_progress_ts Breno Leitao
2026-02-11 12:29 ` [PATCH 3/4] workqueue: Show in-flight work item duration in stall diagnostics Breno Leitao
2026-02-11 12:29 ` [PATCH 4/4] workqueue: Detect stalled in-flight work items with empty worklist Breno Leitao
2026-02-11 18:56 ` [PATCH 0/4] workqueue: Detect stalled in-flight workers Tejun Heo
2026-03-04 15:40   ` Breno Leitao [this message]
2026-03-04 16:40     ` Tejun Heo
