[PATCH 0/4] workqueue: Detect stalled in-flight workers

public inbox for linux-kernel@vger.kernel.org

* [PATCH 0/4] workqueue: Detect stalled in-flight workers
From: Breno Leitao @ 2026-02-11 12:29 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan, Andrew Morton
  Cc: linux-kernel, Omar Sandoval, kernel-team, Breno Leitao

The workqueue watchdog detects pools that haven't made forward progress
by checking whether pending work items on the worklist have been waiting
too long. However, this approach has a blind spot: if a pool has only
one work item and that item has already been dequeued and is executing on
a worker, the worklist is empty and the watchdog skips the pool entirely.
This means a single hogged worker with no other pending work is invisible
to the stall detector.
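
As a rough userspace model of the pool-level check described above (a
sketch only: the ex_-prefixed names are hypothetical, and the real
watchdog also folds in the "touched" timestamps, which are left out
here), the empty-worklist early return is what hides a busy worker:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified pool state; not the kernel's struct worker_pool. */
struct ex_pool {
	size_t nr_pending;		/* work items waiting on the worklist */
	unsigned long last_progress_ts;	/* tick count of last forward progress */
};

/*
 * Model of the pool-level check: a pool counts as stalled only when
 * something is pending and no progress was made for more than @thresh ticks.
 */
static bool ex_pool_stalled(const struct ex_pool *pool,
			    unsigned long now, unsigned long thresh)
{
	assert(pool);

	/* Empty worklist: the pool is skipped, even if a worker is stuck. */
	if (pool->nr_pending == 0)
		return false;

	/* Wraparound-safe "now is after last_progress_ts + thresh". */
	return (long)(pool->last_progress_ts + thresh - now) < 0;
}
```

With nr_pending == 0 the function returns false no matter how old
last_progress_ts is, which is the blind spot.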

I was able to come up with the following example that shows this blind
spot:

	static void stall_work_fn(struct work_struct *work)
	{
		for (;;) {
			mdelay(1000);
			cond_resched();
		}
	}

Additionally, when the watchdog does report stalled pools, the output
doesn't show how long each in-flight work item has been running, making
it harder to identify which specific worker is stuck.

This series addresses both issues:

Patch 1 fixes a minor semantic inconsistency where pool flags were
checked against a workqueue-level constant (WQ_BH instead of POOL_BH).
No behavioral change since both constants have the same value.

Patch 2 renames pool->watchdog_ts to pool->last_progress_ts to better
describe what the timestamp actually tracks.

Patch 3 adds a current_start timestamp to struct worker, recording when
a work item began executing. This is printed in show_pwq() as elapsed
wall-clock time (e.g., "in-flight: 165:stall_work_fn [wq_stall] for
100s"), giving immediate visibility into how long each worker has been
busy.

Patch 4 introduces report_stalled_workers(), which scans all workers in
a pool's busy_hash for any whose current work item has been running
longer than the watchdog threshold (based on current_start). This is
called unconditionally for every pool, independent of worklist state, so
a stuck worker is always detected. The feature is gated behind a new
CONFIG_WQ_WATCHDOG_WORKERS option (disabled by default) under
CONFIG_WQ_WATCHDOG.

One option is to drop CONFIG_WQ_WATCHDOG_WORKERS entirely and always
enable the check. I've been running this change on some hosts with
workloads (mainly stress-ng) and haven't found any false positives.

With this series applied, a stall like the one above is reported as
follows:

	 BUG: workqueue lockup - worker 365:stall_work_fn [wq_stall] stuck in pool cpus=9 node=0 flags=0x0 nice=0 for 2570s!
	 Showing busy workqueues and worker pools:
	  workqueue events: flags=0x100
	  pwq 38: cpus=9 node=0 flags=0x0 nice=0 active=2 refcnt=3
	  workqueue stall_wq: flags=0x0

---
Breno Leitao (4):
      workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
      workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
      workqueue: Show in-flight work item duration in stall diagnostics
      workqueue: Detect stalled in-flight work items with empty worklist

 kernel/workqueue.c          | 71 ++++++++++++++++++++++++++++++++++++++-------
 kernel/workqueue_internal.h |  1 +
 lib/Kconfig.debug           | 12 ++++++++
 3 files changed, 74 insertions(+), 10 deletions(-)
---
base-commit: 9cb8b0f289560728dbb8b88158e7a957e2e90a14
change-id: 20260210-wqstall_start-at-e7319a005ab4

Best regards,
--  
Breno Leitao <leitao@debian.org>

* [PATCH 1/4] workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
From: Breno Leitao @ 2026-02-11 12:29 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan, Andrew Morton
  Cc: linux-kernel, Omar Sandoval, kernel-team, Breno Leitao

pr_cont_worker_id() checks pool->flags against WQ_BH, which is a
workqueue-level flag (defined in workqueue.h). Pool flags use a
separate namespace with POOL_* constants (defined in workqueue.c).
The correct constant is POOL_BH. Both WQ_BH and POOL_BH are defined
as (1 << 0) so this has no behavioral impact, but it is semantically
wrong and inconsistent with every other pool-level BH check in the
file.
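
To illustrate why the mix-up was harmless yet still worth fixing, here
is a minimal userspace sketch. The EX_/ex_-prefixed names are
hypothetical stand-ins, not the kernel definitions; only the shared
(1 << 0) value is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two flag namespaces; not kernel code. */
enum ex_wq_flags   { EX_WQ_BH   = 1 << 0 };	/* workqueue-level flag */
enum ex_pool_flags { EX_POOL_BH = 1 << 0 };	/* pool-level flag */

/* The old check only works because the two values happen to coincide. */
static_assert((int)EX_WQ_BH == (int)EX_POOL_BH,
	      "old and new checks agree only while the bits match");

/* Mirrors the old check: pool flags tested against the workqueue constant. */
static bool ex_pool_is_bh_old(unsigned int pool_flags)
{
	return pool_flags & EX_WQ_BH;
}

/* Mirrors the fixed check: pool flags tested against the pool constant. */
static bool ex_pool_is_bh_new(unsigned int pool_flags)
{
	return pool_flags & EX_POOL_BH;
}
```

If either namespace were ever renumbered, the old variant would start
testing an unrelated pool flag, while the new one would keep working.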

Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e6c249f2fb46b..265d841e1b81c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6274,7 +6274,7 @@ static void pr_cont_worker_id(struct worker *worker)
 {
 	struct worker_pool *pool = worker->pool;
 
-	if (pool->flags & WQ_BH)
+	if (pool->flags & POOL_BH)
 		pr_cont("bh%s",
 			pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
 	else

-- 
2.47.3


* [PATCH 2/4] workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
From: Breno Leitao @ 2026-02-11 12:29 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan, Andrew Morton
  Cc: linux-kernel, Omar Sandoval, kernel-team, Breno Leitao

The watchdog_ts name doesn't convey what the timestamp actually tracks.
This field records the last time the pool made forward progress.

Rename it to last_progress_ts to make it clear that it records when the
pool last made forward progress (started processing new work items).

No functional change.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 265d841e1b81c..b3ba739cf493a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -190,7 +190,7 @@ struct worker_pool {
 	int			id;		/* I: pool ID */
 	unsigned int		flags;		/* L: flags */
 
-	unsigned long		watchdog_ts;	/* L: watchdog timestamp */
+	unsigned long		last_progress_ts;	/* L: last forward progress timestamp */
 	bool			cpu_stall;	/* WD: stalled cpu bound pool */
 
 	/*
@@ -1697,7 +1697,7 @@ static void __pwq_activate_work(struct pool_workqueue *pwq,
 	WARN_ON_ONCE(!(*wdb & WORK_STRUCT_INACTIVE));
 	trace_workqueue_activate_work(work);
 	if (list_empty(&pwq->pool->worklist))
-		pwq->pool->watchdog_ts = jiffies;
+		pwq->pool->last_progress_ts = jiffies;
 	move_linked_works(work, &pwq->pool->worklist, NULL);
 	__clear_bit(WORK_STRUCT_INACTIVE_BIT, wdb);
 }
@@ -2348,7 +2348,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 	 */
 	if (list_empty(&pwq->inactive_works) && pwq_tryinc_nr_active(pwq, false)) {
 		if (list_empty(&pool->worklist))
-			pool->watchdog_ts = jiffies;
+			pool->last_progress_ts = jiffies;
 
 		trace_workqueue_activate_work(work);
 		insert_work(pwq, work, &pool->worklist, work_flags);
@@ -3352,7 +3352,7 @@ static void process_scheduled_works(struct worker *worker)
 	while ((work = list_first_entry_or_null(&worker->scheduled,
 						struct work_struct, entry))) {
 		if (first) {
-			worker->pool->watchdog_ts = jiffies;
+			worker->pool->last_progress_ts = jiffies;
 			first = false;
 		}
 		process_one_work(worker, work);
@@ -4850,7 +4850,7 @@ static int init_worker_pool(struct worker_pool *pool)
 	pool->cpu = -1;
 	pool->node = NUMA_NO_NODE;
 	pool->flags |= POOL_DISASSOCIATED;
-	pool->watchdog_ts = jiffies;
+	pool->last_progress_ts = jiffies;
 	INIT_LIST_HEAD(&pool->worklist);
 	INIT_LIST_HEAD(&pool->idle_list);
 	hash_init(pool->busy_hash);
@@ -6462,7 +6462,7 @@ static void show_one_worker_pool(struct worker_pool *pool)
 
 	/* How long the first pending work is waiting for a worker. */
 	if (!list_empty(&pool->worklist))
-		hung = jiffies_to_msecs(jiffies - pool->watchdog_ts) / 1000;
+		hung = jiffies_to_msecs(jiffies - pool->last_progress_ts) / 1000;
 
 	/*
 	 * Defer printing to avoid deadlocks in console drivers that
@@ -7688,7 +7688,7 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 			touched = READ_ONCE(per_cpu(wq_watchdog_touched_cpu, pool->cpu));
 		else
 			touched = READ_ONCE(wq_watchdog_touched);
-		pool_ts = READ_ONCE(pool->watchdog_ts);
+		pool_ts = READ_ONCE(pool->last_progress_ts);
 
 		if (time_after(pool_ts, touched))
 			ts = pool_ts;

-- 
2.47.3


* [PATCH 3/4] workqueue: Show in-flight work item duration in stall diagnostics
From: Breno Leitao @ 2026-02-11 12:29 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan, Andrew Morton
  Cc: linux-kernel, Omar Sandoval, kernel-team, Breno Leitao

When diagnosing workqueue stalls, knowing how long each in-flight work
item has been executing is valuable. Add a current_start timestamp
(jiffies) to struct worker, set it when a work item begins execution in
process_one_work(), and print the elapsed wall-clock time in show_pwq().

Unlike current_at (which tracks CPU runtime and resets on wakeup for
CPU-intensive detection), current_start is never reset because the
diagnostic cares about total wall-clock time including sleeps.
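
The elapsed-time arithmetic can be sketched in userspace as below. This
is only an illustration: EX_HZ and ex_elapsed_secs() are hypothetical
names, and the fixed tick rate is an assumption (the patch itself goes
through jiffies_to_msecs()).

```c
#include <assert.h>

/* Assumed tick rate for this sketch; the kernel's HZ is a build-time choice. */
#define EX_HZ 1000UL

static_assert(EX_HZ > 0, "tick rate must be non-zero");

/*
 * Elapsed whole seconds between two tick counts. Unsigned subtraction
 * is well defined modulo ULONG_MAX + 1, so the result stays correct
 * even if the counter wrapped between @start and @now.
 */
static unsigned long ex_elapsed_secs(unsigned long start, unsigned long now)
{
	return (now - start) / EX_HZ;
}
```

Because the start value is never reset while the item runs, time spent
sleeping is included in the result.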

Before: in-flight: 165:stall_work_fn [wq_stall]
After:  in-flight: 165:stall_work_fn [wq_stall] for 100s

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c          | 3 +++
 kernel/workqueue_internal.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b3ba739cf493a..e527e763162e6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3204,6 +3204,7 @@ __acquires(&pool->lock)
 	worker->current_pwq = pwq;
 	if (worker->task)
 		worker->current_at = worker->task->se.sum_exec_runtime;
+	worker->current_start = jiffies;
 	work_data = *work_data_bits(work);
 	worker->current_color = get_work_color(work_data);
 
@@ -6359,6 +6360,8 @@ static void show_pwq(struct pool_workqueue *pwq)
 			pr_cont(" %s", comma ? "," : "");
 			pr_cont_worker_id(worker);
 			pr_cont(":%ps", worker->current_func);
+			pr_cont(" for %us",
+				jiffies_to_msecs(jiffies - worker->current_start) / 1000);
 			list_for_each_entry(work, &worker->scheduled, entry)
 				pr_cont_work(false, work, &pcws);
 			pr_cont_work_flush(comma, (work_func_t)-1L, &pcws);
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index f6275944ada77..8def1ddc5a1bf 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -32,6 +32,7 @@ struct worker {
 	work_func_t		current_func;	/* K: function */
 	struct pool_workqueue	*current_pwq;	/* K: pwq */
 	u64			current_at;	/* K: runtime at start or last wakeup */
+	unsigned long		current_start;	/* K: start time of current work item */
 	unsigned int		current_color;	/* K: color */
 
 	int			sleeping;	/* S: is worker sleeping? */

-- 
2.47.3


* [PATCH 4/4] workqueue: Detect stalled in-flight work items with empty worklist
From: Breno Leitao @ 2026-02-11 12:29 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan, Andrew Morton
  Cc: linux-kernel, Omar Sandoval, kernel-team, Breno Leitao

The workqueue watchdog skips pools with an empty worklist, assuming no
work is pending. However, a single work item that was dequeued and is
now executing on a worker will leave the worklist empty while the worker
is stuck. This means a pool with one hogged worker and no pending work
is invisible to the watchdog.

For example:

  static void stall_work_fn(struct work_struct *work)
  {
        for (;;) {
                mdelay(1000);
                cond_resched();
        }
  }

Fix this by scanning the pool's busy_hash for workers whose current
work item has been executing for longer than the watchdog threshold
(measured from current_start), independent of worklist state. The new
report_stalled_workers() function iterates over all in-flight workers in
a pool and reports each one that has exceeded the threshold, running as
a separate detection path alongside the existing pool-level
last_progress_ts check.
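
A minimal userspace sketch of the per-worker scan, assuming a plain
array in place of busy_hash. The ex_-prefixed names are hypothetical;
ex_time_after() follows the same idea as the kernel's time_after().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified worker record; not the kernel's struct worker. */
struct ex_worker {
	unsigned long current_start;	/* tick count when the item started */
};

/* Wraparound-safe "a is after b". */
static bool ex_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Count busy workers whose current item has run longer than @thresh ticks. */
static size_t ex_count_stalled(const struct ex_worker *busy, size_t n,
			       unsigned long now, unsigned long thresh)
{
	size_t i, stalled = 0;

	assert(busy || n == 0);
	for (i = 0; i < n; i++)
		if (ex_time_after(now, busy[i].current_start + thresh))
			stalled++;
	return stalled;
}
```

Note that the scan never looks at the worklist, so a lone long-running
worker is counted even when nothing else is pending.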

This is an example of the report:

  BUG: workqueue lockup - worker 365:stall_work_fn [wq_stall] stuck in pool cpus=9 node=0 flags=0x0 nice=0 for 33s!
   Showing busy workqueues and worker pools:
   ...

The feature is gated behind a new CONFIG_WQ_WATCHDOG_WORKERS option
(disabled by default) under CONFIG_WQ_WATCHDOG.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 lib/Kconfig.debug  | 12 ++++++++++++
 2 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e527e763162e6..719e14aa4ac56 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7659,6 +7659,49 @@ static void wq_watchdog_reset_touched(void)
 		per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
 }
 
+#ifdef CONFIG_WQ_WATCHDOG_WORKERS
+/*
+ * Scan all in-flight workers in @pool for stalls. A worker is considered
+ * stalled if its current work item has been executing for longer than @thresh
+ * based on its current_start timestamp. This catches workers that are stuck
+ * regardless of the pool's worklist state or last_progress_ts.
+ */
+static bool report_stalled_workers(struct worker_pool *pool,
+				   unsigned long now,
+				   unsigned long thresh)
+{
+	struct worker *worker;
+	bool stall = false;
+	int bkt;
+
+	/*
+	 * Iterate busy_hash without pool->lock. This is intentionally
+	 * lockless to avoid contention in the watchdog timer path.
+	 * Workers that have been stalled for thresh (typically 30s) are
+	 * unlikely to be transitioning in/out of busy_hash concurrently.
+	 */
+	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
+		if (time_after(now, worker->current_start + thresh)) {
+			pr_emerg("BUG: workqueue lockup - worker ");
+			pr_cont_worker_id(worker);
+			pr_cont(":%ps stuck in pool",
+				worker->current_func);
+			pr_cont_pool_info(pool);
+			pr_cont(" for %us!\n",
+				jiffies_to_msecs(now - worker->current_start) / 1000);
+			stall = true;
+		}
+	}
+	return stall;
+}
+#else
+static bool report_stalled_workers(struct worker_pool *pool,
+				   unsigned long now, unsigned long thresh)
+{
+	return false;
+}
+#endif /* CONFIG_WQ_WATCHDOG_WORKERS */
+
 static void wq_watchdog_timer_fn(struct timer_list *unused)
 {
 	unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
@@ -7677,8 +7720,6 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 		unsigned long pool_ts, touched, ts;
 
 		pool->cpu_stall = false;
-		if (list_empty(&pool->worklist))
-			continue;
 
 		/*
 		 * If a virtual machine is stopped by the host it can look to
@@ -7686,6 +7727,13 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 		 */
 		kvm_check_and_clear_guest_paused();
 
+		/* Check for individual stalled workers in this pool. */
+		if (report_stalled_workers(pool, now, thresh))
+			lockup_detected = true;
+
+		if (list_empty(&pool->worklist))
+			continue;
+
 		/* get the latest of pool and touched timestamps */
 		if (pool->cpu >= 0)
 			touched = READ_ONCE(per_cpu(wq_watchdog_touched_cpu, pool->cpu));
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ce25a8faf6e9e..dc4bb546b2033 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1320,6 +1320,18 @@ config BOOTPARAM_WQ_STALL_PANIC
 	  This setting can be overridden at runtime via the
 	  workqueue.panic_on_stall kernel parameter.
 
+config WQ_WATCHDOG_WORKERS
+	bool "Detect individual stalled workqueue workers"
+	depends on WQ_WATCHDOG
+	default n
+	help
+	  Say Y here to enable per-worker stall detection. When enabled,
+	  the workqueue watchdog scans all in-flight workers in each pool
+	  and reports any whose current work item has been executing for
+	  longer than the watchdog threshold. This catches stalled workers
+	  even when the pool's worklist is empty or the pool has recently
+	  made forward progress on other work items.
+
 config WQ_CPU_INTENSIVE_REPORT
 	bool "Report per-cpu work items which hog CPU for too long"
 	depends on DEBUG_KERNEL

-- 
2.47.3


* Re: [PATCH 0/4] workqueue: Detect stalled in-flight workers
From: Tejun Heo @ 2026-02-11 18:56 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Lai Jiangshan, Andrew Morton, linux-kernel, Omar Sandoval,
	kernel-team

Hello,

On Wed, Feb 11, 2026 at 04:29:14AM -0800, Breno Leitao wrote:
> The workqueue watchdog detects pools that haven't made forward progress
> by checking whether pending work items on the worklist have been waiting
> too long. However, this approach has a blind spot: if a pool has only
> one work item and that item has already been dequeued and is executing on
> a worker, the worklist is empty and the watchdog skips the pool entirely.
> This means a single hogged worker with no other pending work is invisible
> to the stall detector.
> 
> I was able to come up with the following example that shows this blind
> spot:
> 
> 	static void stall_work_fn(struct work_struct *work)
> 	{
> 		for (;;) {
> 			mdelay(1000);
> 			cond_resched();
> 		}
> 	}

Workqueue doesn't require users to limit execution time. As long as there is
enough supply of concurrency to avoid stalling of pending work items, work
items can run as long as they want, including indefinitely. Workqueue stall
is there to indicate that there is insufficient supply of concurrency.

Thanks.

-- 
tejun

* Re: [PATCH 0/4] workqueue: Detect stalled in-flight workers
From: Breno Leitao @ 2026-03-04 15:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Lai Jiangshan, Andrew Morton, linux-kernel, Omar Sandoval,
	kernel-team

Hello Tejun,

On Wed, Feb 11, 2026 at 08:56:11AM -1000, Tejun Heo wrote:
> On Wed, Feb 11, 2026 at 04:29:14AM -0800, Breno Leitao wrote:
> > The workqueue watchdog detects pools that haven't made forward progress
> > by checking whether pending work items on the worklist have been waiting
> > too long. However, this approach has a blind spot: if a pool has only
> > one work item and that item has already been dequeued and is executing on
> > a worker, the worklist is empty and the watchdog skips the pool entirely.
> > This means a single hogged worker with no other pending work is invisible
> > to the stall detector.
> > 
> > I was able to come up with the following example that shows this blind
> > spot:
> > 
> > 	static void stall_work_fn(struct work_struct *work)
> > 	{
> > 		for (;;) {
> > 			mdelay(1000);
> > 			cond_resched();
> > 		}
> > 	}
> 
> Workqueue doesn't require users to limit execution time. As long as there is
> enough supply of concurrency to avoid stalling of pending work items, work
> items can run as long as they want, including indefinitely. Workqueue stall
> is there to indicate that there is insufficient supply of concurrency.

Thank you for the clarification. Let me share more context about the
actual problem I am observing so we can think through it together.

On some production hosts, I am seeing a workqueue stall where no
backtraces are printed:

	BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
	Showing busy workqueues and worker pools:
	workqueue events: flags=0x100
		pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
		in-flight: 178:stall_work1_fn [wq_stall]
		pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
	workqueue mm_percpu_wq: flags=0x108
		pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=1 refcnt=2
		pending: vmstat_update
		pool 18: cpus=4 node=0 flags=0x0 nice=0 hung=132s workers=2 idle: 45
	Showing backtraces of running workers in stalled CPU-bound worker pools:
		<nothing here>

We initially suspected a TOCTOU issue, and Omar put together a patch to
address that, but it did not identify anything.

After digging deeper, I believe I have found the root cause along with
a reproducer[1]:

  1) kfence executes toggle_allocation_gate() as a delayed workqueue
     item (kfence_timer) on the system WQ.

  2) toggle_allocation_gate() enables a static key, which IPIs every
     CPU to patch code:
          static_branch_enable(&kfence_allocation_key);

  3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
     kfence allocation to occur:
          wait_event_idle(allocation_wait,
                  atomic_read(&kfence_allocation_gate) > 0 || ...);

     This can last indefinitely if no allocation goes through the
     kfence path. The worker remains in the pool's busy_hash
     (in-flight) but is no longer task_is_running().

  4) The workqueue watchdog detects the stall and calls
     show_cpu_pool_hog(), which only prints backtraces for workers
     that are actively running on CPU:

          static void show_cpu_pool_hog(struct worker_pool *pool) {
                  ...
                  if (task_is_running(worker->task))
                          sched_show_task(worker->task);
          }

  5) Nothing is printed because the offending worker is in TASK_IDLE
     state. The output shows "Showing backtraces of running workers in
     stalled CPU-bound worker pools:" followed by nothing, effectively
     hiding the actual culprit.
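
Steps 4 and 5 can be modelled with a small userspace sketch. The
ex_-prefixed types are hypothetical stand-ins (the real code calls
task_is_running() and sched_show_task()); the function only counts how
many backtraces would be printed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task states; the kernel has more than these two. */
enum ex_task_state { EX_TASK_RUNNING, EX_TASK_IDLE };

/* Hypothetical in-flight worker record; not the kernel's struct worker. */
struct ex_busy_worker {
	int pid;
	enum ex_task_state state;
};

/*
 * Count the in-flight workers whose backtrace would be printed.
 * @running_only models the task_is_running() filter.
 */
static size_t ex_count_backtraces(const struct ex_busy_worker *busy, size_t n,
				  bool running_only)
{
	size_t i, shown = 0;

	assert(busy || n == 0);
	for (i = 0; i < n; i++) {
		if (running_only && busy[i].state != EX_TASK_RUNNING)
			continue;	/* sleeping worker is silently skipped */
		shown++;
	}
	return shown;
}
```

For a single in-flight worker in the idle state, the filtered variant
prints nothing and the unfiltered variant prints one backtrace.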

The fix I am considering is to remove the task_is_running() filter in
show_cpu_pool_hog() so that all in-flight workers in stalled pools have
their backtraces printed, regardless of whether they are running or
sleeping. This would make sleeping culprits like toggle_allocation_gate()
visible in the watchdog output.

When I test without the task_is_running() check, I see the culprit.

Fix I am testing:

	diff --git a/kernel/workqueue.c b/kernel/workqueue.c
	index aeaec79bc09c4..3f5ee08f99313 100644
	--- a/kernel/workqueue.c
	+++ b/kernel/workqueue.c
	@@ -7593,19 +7593,17 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
		raw_spin_lock_irqsave(&pool->lock, irq_flags);

		hash_for_each(pool->busy_hash, bkt, worker, hentry) {
	-               if (task_is_running(worker->task)) {
	-                       /*
	-                        * Defer printing to avoid deadlocks in console
	-                        * drivers that queue work while holding locks
	-                        * also taken in their write paths.
	-                        */
	-                       printk_deferred_enter();
	+               /*
	+                * Defer printing to avoid deadlocks in console
	+                * drivers that queue work while holding locks
	+                * also taken in their write paths.
	+                */
	+               printk_deferred_enter();

	-                       pr_info("pool %d:\n", pool->id);
	-                       sched_show_task(worker->task);
	+               pr_info("pool %d:\n", pool->id);
	+               sched_show_task(worker->task);

	-                       printk_deferred_exit();
	-               }
	+               printk_deferred_exit();
		}

		raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
	@@ -7616,7 +7614,7 @@ static void show_cpu_pools_hogs(void)
		struct worker_pool *pool;
		int pi;

	-       pr_info("Showing backtraces of running workers in stalled CPU-bound worker pools:\n");
	+       pr_info("Showing backtraces of in-flight workers in stalled CPU-bound worker pools:\n");

		rcu_read_lock();

Then I see:

	BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 34s!
	 Showing busy workqueues and worker pools:
	 workqueue events: flags=0x100
	   pwq 26: cpus=6 node=0 flags=0x0 nice=0 active=3 refcnt=4
	     in-flight: 161:stall_work1_fn [wq_stall]
	     pending: stall_work2_fn [wq_stall], psi_avgs_work
	 workqueue mm_percpu_wq: flags=0x108
	   pwq 26: cpus=6 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_update
	 pool 26: cpus=6 node=0 flags=0x0 nice=0 hung=34s workers=3 idle: 210 57
	 Showing backtraces of in-flight workers in stalled CPU-bound worker pools:
	 pool 26:
	 task:kworker/6:1     state:I stack:0     pid:161   tgid:161   ppid:2      task_flags:0x4208040 flags:0x00080000
	 Call Trace:
	  <TASK>
	  __schedule+0x1521/0x5360
	  ? console_trylock+0x40/0x40
	  ? preempt_count_add+0x92/0x1a0
	  ? do_raw_spin_lock+0x12c/0x2f0
	  ? is_mmconf_reserved+0x390/0x390
	  ? schedule+0x91/0x350
	  ? schedule+0x91/0x350
	  schedule+0x165/0x350
	  stall_work1_fn+0x17f/0x250 [wq_stall]


Link: https://github.com/leitao/debug/blob/main/workqueue_stall/wq_stall.c [1]

* Re: [PATCH 0/4] workqueue: Detect stalled in-flight workers
From: Tejun Heo @ 2026-03-04 16:40 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Lai Jiangshan, Andrew Morton, linux-kernel, Omar Sandoval,
	kernel-team

Hello,

On Wed, Mar 04, 2026 at 07:40:49AM -0800, Breno Leitao wrote:
> The fix I am considering is to remove the task_is_running() filter in
> show_cpu_pool_hog() so that all in-flight workers in stalled pools have
> their backtraces printed, regardless of whether they are running or
> sleeping. This would make sleeping culprits like toggle_allocation_gate()
> visible in the watchdog output.

Yeah, that makes sense to me.

Thanks.

-- 
tejun
