public inbox for linux-kernel@vger.kernel.org
* [PATCH v2] workqueue: Fix false positive stall reports
@ 2026-03-22  3:30 Song Liu
  2026-03-22  4:41 ` Tejun Heo
  2026-03-23 14:09 ` Petr Mladek
  0 siblings, 2 replies; 11+ messages in thread
From: Song Liu @ 2026-03-22  3:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: tj, jiangshanlai, leitao, pmladek, kernel-team, puranjay,
	Song Liu

On weakly ordered architectures (e.g., arm64), the lockless check in
wq_watchdog_timer_fn() can observe a reordering between the worklist
insertion and the last_progress_ts update. Specifically, the watchdog
can see a non-empty worklist (from a list_add) while reading a stale
last_progress_ts value, causing a false positive stall report.

This was confirmed by reading pool->last_progress_ts again after holding
pool->lock in wq_watchdog_timer_fn():

  workqueue watchdog: pool 7 false positive detected!
    lockless_ts=4784580465 locked_ts=4785033728
    diff=453263ms worklist_empty=0

To avoid slowing down the hot path (queue_work(), etc.), recheck
last_progress_ts with pool->lock held only when the lockless check
indicates a stall. This eliminates the false positives with minimal
overhead.

While at it, remove two extra empty lines in wq_watchdog_timer_fn().

Assisted-by: claude-code:claude-opus-4-6
Signed-off-by: Song Liu <song@kernel.org>
---
v1 -> v2:
- Use scoped_guard() instead of manual raw_spin_lock/unlock (Tejun)
- Drop READ_ONCE() for pool->last_progress_ts under pool->lock (Tejun)
- Expand comment with reordering scenario and function names (Tejun)
---
 kernel/workqueue.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b77119d71641..ff97b705f25e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7699,8 +7699,28 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 		else
 			ts = touched;
 
-		/* did we stall? */
+		/*
+		 * Did we stall?
+		 *
+		 * Do a lockless check first. On weakly ordered
+		 * architectures, the lockless check can observe a
+		 * reordering between worklist insert_work() and
+		 * last_progress_ts update from __queue_work(). Since
+		 * __queue_work() is a much hotter path than the timer
+		 * function, we handle false positive here by reading
+		 * last_progress_ts again with pool->lock held.
+		 */
 		if (time_after(now, ts + thresh)) {
+			scoped_guard(raw_spinlock_irqsave, &pool->lock) {
+				pool_ts = pool->last_progress_ts;
+				if (time_after(pool_ts, touched))
+					ts = pool_ts;
+				else
+					ts = touched;
+			}
+			if (!time_after(now, ts + thresh))
+				continue;
+
 			lockup_detected = true;
 			stall_time = jiffies_to_msecs(now - pool_ts) / 1000;
 			max_stall_time = max(max_stall_time, stall_time);
@@ -7712,8 +7732,6 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 			pr_cont_pool_info(pool);
 			pr_cont(" stuck for %us!\n", stall_time);
 		}
-
-
 	}
 
 	if (lockup_detected)
-- 
2.52.0




Thread overview: 11+ messages
2026-03-22  3:30 [PATCH v2] workqueue: Fix false positive stall reports Song Liu
2026-03-22  4:41 ` Tejun Heo
2026-03-23 14:09 ` Petr Mladek
2026-03-23 18:31   ` Song Liu
2026-03-24 10:01     ` Petr Mladek
2026-03-24 14:15       ` Tejun Heo
2026-03-24 18:22       ` Song Liu
2026-03-25 13:19         ` Petr Mladek
2026-03-25 14:12           ` Petr Mladek
2026-03-25 15:36             ` Song Liu
2026-03-25 15:53             ` [PATCH v2] workqueue: Better describe stall check Tejun Heo
