public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] workqueue: Detect stalled in-flight workers
@ 2026-02-11 12:29 Breno Leitao
  2026-02-11 12:29 ` [PATCH 1/4] workqueue: Use POOL_BH instead of WQ_BH when checking pool flags Breno Leitao
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Breno Leitao @ 2026-02-11 12:29 UTC (permalink / raw)
  To: Tejun Heo, Lai Jiangshan, Andrew Morton
  Cc: linux-kernel, Omar Sandoval, kernel-team, Breno Leitao

The workqueue watchdog detects pools that haven't made forward progress
by checking whether pending work items on the worklist have been waiting
too long. However, this approach has a blind spot: if a pool has only
one work item and that item has already been dequeued and is executing on
a worker, the worklist is empty and the watchdog skips the pool entirely.
This means a single hogged worker with no other pending work is invisible
to the stall detector.

I was able to come up with the following example that shows this blind
spot:

	static void stall_work_fn(struct work_struct *work)
	{
		for (;;) {
			mdelay(1000);
			cond_resched();
		}
	}

Additionally, when the watchdog does report stalled pools, the output
doesn't show how long each in-flight work item has been running, making
it harder to identify which specific worker is stuck.

This series addresses both issues:

Patch 1 fixes a minor semantic inconsistency where pool flags were
checked against a workqueue-level constant (WQ_BH instead of POOL_BH).
No behavioral change since both constants have the same value.

Patch 2 renames pool->watchdog_ts to pool->last_progress_ts to better
describe what the timestamp actually tracks.

Patch 3 adds a current_start timestamp to struct worker, recording when
a work item began executing. This is printed in show_pwq() as elapsed
wall-clock time (e.g., "in-flight: 165:stall_work_fn [wq_stall] for
100s"), giving immediate visibility into how long each worker has been
busy.

Patch 4 introduces pool_has_stalled_worker(), which scans all workers in
a pool's busy_hash for any whose current_start timestamp exceeds the
watchdog threshold. This is called unconditionally for every pool,
independent of worklist state, so a stuck worker is always detected. The
feature is gated behind a new CONFIG_WQ_WATCHDOG_WORKERS option
(disabled by default) under CONFIG_WQ_WATCHDOG.

An option is to get rid of CONFIG_WQ_WATCHDOG_WORKERS completely. I've
been running this change on some hosts with workloads (mainly stress-ng)
and I haven't found any false positive.

With this series applied, we will be able to see a stall like the one
above:

	 BUG: workqueue lockup - worker365:stall_work_fn [wq_stall] stuck in pool cpus=9 node=0 flags=0x0 nice=0 for 2570s!
	 Showing busy workqueues and worker pools:
	  workqueue events: flags=0x100
	  pwq 38: cpus=9 node=0 flags=0x0 nice=0 active=2 refcnt=3
	  workqueue stall_wq: flags=0x0

---
Breno Leitao (4):
      workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
      workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
      workqueue: Show in-flight work item duration in stall diagnostics
      workqueue: Detect stalled in-flight work items with empty worklist

 kernel/workqueue.c          | 71 ++++++++++++++++++++++++++++++++++++++-------
 kernel/workqueue_internal.h |  1 +
 lib/Kconfig.debug           | 12 ++++++++
 3 files changed, 74 insertions(+), 10 deletions(-)
---
base-commit: 9cb8b0f289560728dbb8b88158e7a957e2e90a14
change-id: 20260210-wqstall_start-at-e7319a005ab4

Best regards,
--  
Breno Leitao <leitao@debian.org>


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-03-04 16:40 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-11 12:29 [PATCH 0/4] workqueue: Detect stalled in-flight workers Breno Leitao
2026-02-11 12:29 ` [PATCH 1/4] workqueue: Use POOL_BH instead of WQ_BH when checking pool flags Breno Leitao
2026-02-11 12:29 ` [PATCH 2/4] workqueue: Rename pool->watchdog_ts to pool->last_progress_ts Breno Leitao
2026-02-11 12:29 ` [PATCH 3/4] workqueue: Show in-flight work item duration in stall diagnostics Breno Leitao
2026-02-11 12:29 ` [PATCH 4/4] workqueue: Detect stalled in-flight work items with empty worklist Breno Leitao
2026-02-11 18:56 ` [PATCH 0/4] workqueue: Detect stalled in-flight workers Tejun Heo
2026-03-04 15:40   ` Breno Leitao
2026-03-04 16:40     ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox