From: Breno Leitao <leitao@debian.org>
To: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	 Omar Sandoval <osandov@osandov.com>,
	kernel-team@meta.com
Subject: Re: [PATCH 0/4] workqueue: Detect stalled in-flight workers
Date: Wed, 4 Mar 2026 07:40:49 -0800
Message-ID: <aag4tTyeiZyw0jID@gmail.com>
In-Reply-To: <aYzQy4H0IRA5s6MP@slm.duckdns.org>

Hello Tejun,

On Wed, Feb 11, 2026 at 08:56:11AM -1000, Tejun Heo wrote:
> On Wed, Feb 11, 2026 at 04:29:14AM -0800, Breno Leitao wrote:
> > The workqueue watchdog detects pools that haven't made forward progress
> > by checking whether pending work items on the worklist have been waiting
> > too long. However, this approach has a blind spot: if a pool has only
> > one work item and that item has already been dequeued and is executing on
> > a worker, the worklist is empty and the watchdog skips the pool entirely.
> > This means a single hogged worker with no other pending work is invisible
> > to the stall detector.
> > 
> > I was able to come up with the following example that shows this blind
> > spot:
> > 
> > 	static void stall_work_fn(struct work_struct *work)
> > 	{
> > 		for (;;) {
> > 			mdelay(1000);
> > 			cond_resched();
> > 		}
> > 	}
> 
> Workqueue doesn't require users to limit execution time. As long as there is
> enough supply of concurrency to avoid stalling of pending work items, work
> items can run as long as they want, including indefinitely. Workqueue stall
> is there to indicate that there is insufficient supply of concurrency.

Thank you for the clarification. Let me share more context about the
actual problem I am observing so we can think through it together.

On some production hosts, I am seeing a workqueue stall where no
backtraces are printed:

	BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
	Showing busy workqueues and worker pools:
	workqueue events: flags=0x100
		pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
		in-flight: 178:stall_work1_fn [wq_stall]
		pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
	workqueue mm_percpu_wq: flags=0x108
		pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=1 refcnt=2
		pending: vmstat_update
		pool 18: cpus=4 node=0 flags=0x0 nice=0 hung=132s workers=2 idle: 45
	Showing backtraces of running workers in stalled
	CPU-bound worker pools:
		<nothing here>

We initially suspected a TOCTOU issue, and Omar put together a patch to
address that, but it did not identify anything.

After digging deeper, I believe I have found the root cause along with
a reproducer[1]:

  1) kfence executes toggle_allocation_gate() as a delayed workqueue
     item (kfence_timer) on the system WQ.

  2) toggle_allocation_gate() enables a static key, which IPIs every
     CPU to patch code:
          static_branch_enable(&kfence_allocation_key);

  3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
     kfence allocation to occur:
          wait_event_idle(allocation_wait,
                  atomic_read(&kfence_allocation_gate) > 0 || ...);

     This can last indefinitely if no allocation goes through the
     kfence path. The worker remains in the pool's busy_hash
     (in-flight) but is no longer task_is_running().

  4) The workqueue watchdog detects the stall and calls
     show_cpu_pool_hog(), which only prints backtraces for workers
     that are actively running on CPU:

          static void show_cpu_pool_hog(struct worker_pool *pool) {
                  ...
                  if (task_is_running(worker->task))
                          sched_show_task(worker->task);
          }

  5) Nothing is printed because the offending worker is in TASK_IDLE
     state. The output shows "Showing backtraces of running workers in
     stalled CPU-bound worker pools:" followed by nothing, effectively
     hiding the actual culprit.

The fix I am considering is to remove the task_is_running() filter in
show_cpu_pool_hog() so that all in-flight workers in stalled pools have
their backtraces printed, regardless of whether they are running or
sleeping. This would make sleeping culprits like toggle_allocation_gate()
visible in the watchdog output.
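
To make the effect of that filter concrete, here is a minimal userspace
sketch, not kernel code. The types and names (model_worker,
model_task_state, model_show_pool_hog, the running_only flag) are
hypothetical stand-ins made up for illustration; only the shape of the
loop mirrors show_cpu_pool_hog(), and printf() stands in for
sched_show_task().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins; these are not the kernel's types. */
enum model_task_state { MODEL_TASK_RUNNING, MODEL_TASK_IDLE };

struct model_worker {
	int pid;
	enum model_task_state state;
	const char *fn;		/* work function currently in flight */
};

/*
 * Model of the loop in show_cpu_pool_hog(): walk the in-flight workers
 * and report each one that passes the filter. printf() stands in for
 * the backtrace dump. Returns the number of workers reported.
 *
 * running_only == true models the current task_is_running() check;
 * running_only == false models the loop with that check removed.
 */
int model_show_pool_hog(const struct model_worker *busy, size_t n,
			bool running_only)
{
	int shown = 0;
	size_t i;

	assert(busy != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* A sleeping in-flight worker is skipped by the filter. */
		if (running_only && busy[i].state != MODEL_TASK_RUNNING)
			continue;
		printf("pid %d: %s\n", busy[i].pid, busy[i].fn);
		shown++;
	}
	return shown;
}
```

With a single in-flight worker in the idle state, as in the lockup
report above, calling the model with running_only set reports nothing,
and calling it with running_only cleared reports that worker.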

When I test without the task_is_running() check, I see the culprit.

Fix I am testing:

	diff --git a/kernel/workqueue.c b/kernel/workqueue.c
	index aeaec79bc09c4..3f5ee08f99313 100644
	--- a/kernel/workqueue.c
	+++ b/kernel/workqueue.c
	@@ -7593,19 +7593,17 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
		raw_spin_lock_irqsave(&pool->lock, irq_flags);

		hash_for_each(pool->busy_hash, bkt, worker, hentry) {
	-               if (task_is_running(worker->task)) {
	-                       /*
	-                        * Defer printing to avoid deadlocks in console
	-                        * drivers that queue work while holding locks
	-                        * also taken in their write paths.
	-                        */
	-                       printk_deferred_enter();
	+               /*
	+                * Defer printing to avoid deadlocks in console
	+                * drivers that queue work while holding locks
	+                * also taken in their write paths.
	+                */
	+               printk_deferred_enter();

	-                       pr_info("pool %d:\n", pool->id);
	-                       sched_show_task(worker->task);
	+               pr_info("pool %d:\n", pool->id);
	+               sched_show_task(worker->task);

	-                       printk_deferred_exit();
	-               }
	+               printk_deferred_exit();
		}

		raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
	@@ -7616,7 +7614,7 @@ static void show_cpu_pools_hogs(void)
		struct worker_pool *pool;
		int pi;

	-       pr_info("Showing backtraces of running workers in stalled CPU-bound worker pools:\n");
	+       pr_info("Showing backtraces of in-flight workers in stalled CPU-bound worker pools:\n");

		rcu_read_lock();

Then I see:

	BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 34s!
	 Showing busy workqueues and worker pools:
	 workqueue events: flags=0x100
	   pwq 26: cpus=6 node=0 flags=0x0 nice=0 active=3 refcnt=4
	     in-flight: 161:stall_work1_fn [wq_stall]
	     pending: stall_work2_fn [wq_stall], psi_avgs_work
	 workqueue mm_percpu_wq: flags=0x108
	   pwq 26: cpus=6 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_update
	 pool 26: cpus=6 node=0 flags=0x0 nice=0 hung=34s workers=3 idle: 210 57
	 Showing backtraces of in-flight workers in stalled CPU-bound worker pools:
	 pool 26:
	 task:kworker/6:1     state:I stack:0     pid:161   tgid:161   ppid:2      task_flags:0x4208040 flags:0x00080000
	 Call Trace:
	  <TASK>
	  __schedule+0x1521/0x5360
	  ? console_trylock+0x40/0x40
	  ? preempt_count_add+0x92/0x1a0
	  ? do_raw_spin_lock+0x12c/0x2f0
	  ? is_mmconf_reserved+0x390/0x390
	  ? schedule+0x91/0x350
	  ? schedule+0x91/0x350
	  schedule+0x165/0x350
	  stall_work1_fn+0x17f/0x250 [wq_stall]


Link: https://github.com/leitao/debug/blob/main/workqueue_stall/wq_stall.c [1]

Thread overview: 8+ messages
2026-02-11 12:29 [PATCH 0/4] workqueue: Detect stalled in-flight workers Breno Leitao
2026-02-11 12:29 ` [PATCH 1/4] workqueue: Use POOL_BH instead of WQ_BH when checking pool flags Breno Leitao
2026-02-11 12:29 ` [PATCH 2/4] workqueue: Rename pool->watchdog_ts to pool->last_progress_ts Breno Leitao
2026-02-11 12:29 ` [PATCH 3/4] workqueue: Show in-flight work item duration in stall diagnostics Breno Leitao
2026-02-11 12:29 ` [PATCH 4/4] workqueue: Detect stalled in-flight work items with empty worklist Breno Leitao
2026-02-11 18:56 ` [PATCH 0/4] workqueue: Detect stalled in-flight workers Tejun Heo
2026-03-04 15:40   ` Breno Leitao [this message]
2026-03-04 16:40     ` Tejun Heo
