* [PATCH] workqueue: Fix false positive stall reports
@ 2026-03-20 19:23 Song Liu
2026-03-21 18:27 ` Tejun Heo
2026-03-23 9:35 ` Breno Leitao
From: Song Liu @ 2026-03-20 19:23 UTC (permalink / raw)
To: linux-kernel
Cc: tj, jiangshanlai, leitao, pmladek, kernel-team, puranjay,
Song Liu
On weakly ordered architectures (e.g., arm64), the lockless check in
wq_watchdog_timer_fn() can observe a reordering between the worklist
insertion and the last_progress_ts update. Specifically, the watchdog
can see a non-empty worklist (from a list_add) while reading a stale
last_progress_ts value, causing a false positive stall report.
This was confirmed by reading pool->last_progress_ts again after holding
pool->lock in wq_watchdog_timer_fn():
workqueue watchdog: pool 7 false positive detected!
lockless_ts=4784580465 locked_ts=4785033728
diff=453263ms worklist_empty=0
To avoid slowing down the hot path (queue_work, etc.), recheck
last_progress_ts with pool->lock held. This will eliminate the false
positive with minimal overhead.
Remove two extra empty lines in wq_watchdog_timer_fn() while we are at it.
Assisted-by: claude-code:claude-opus-4-6
Signed-off-by: Song Liu <song@kernel.org>
---
kernel/workqueue.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b77119d71641..5b501ff1223a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7701,6 +7701,23 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
/* did we stall? */
if (time_after(now, ts + thresh)) {
+ unsigned long irq_flags;
+
+ raw_spin_lock_irqsave(&pool->lock, irq_flags);
+ /*
+ * Recheck last_progress_ts with pool->lock held;
+ * this eliminates false positives where we would
+ * report a wq stall for newly queued work.
+ */
+ pool_ts = READ_ONCE(pool->last_progress_ts);
+ if (time_after(pool_ts, touched))
+ ts = pool_ts;
+ else
+ ts = touched;
+ raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
+ if (!time_after(now, ts + thresh))
+ continue;
+
lockup_detected = true;
stall_time = jiffies_to_msecs(now - pool_ts) / 1000;
max_stall_time = max(max_stall_time, stall_time);
@@ -7712,8 +7729,6 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
pr_cont_pool_info(pool);
pr_cont(" stuck for %us!\n", stall_time);
}
-
-
}
if (lockup_detected)
--
2.52.0
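For readers not steeped in the jiffies helpers involved, here is a minimal userspace model of the locked recheck the patch adds. time_after() is copied from the kernel's wraparound-safe jiffies comparison; stall_after_recheck() and its parameter names are hypothetical, made up for this sketch to mirror the logic above, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Kernel-style jiffies comparison: true if a is later than b.
 * The signed subtraction keeps it correct across counter wraparound.
 */
#define time_after(a, b) ((long)((b) - (a)) < 0)

/*
 * Model of the recheck done under pool->lock: pick the newer of the
 * (now trustworthy) progress timestamp and the global touch time,
 * then re-test the stall condition before reporting.
 */
static bool stall_after_recheck(unsigned long now, unsigned long thresh,
				unsigned long locked_progress_ts,
				unsigned long touched)
{
	unsigned long ts = time_after(locked_progress_ts, touched) ?
			   locked_progress_ts : touched;
	return time_after(now, ts + thresh);
}
```

Because last_progress_ts is written under pool->lock, the value read in the locked recheck cannot be stale relative to a worklist insertion, which is what makes the second test decisive where the lockless one was not.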
* Re: [PATCH] workqueue: Fix false positive stall reports
From: Tejun Heo @ 2026-03-21 18:27 UTC (permalink / raw)
To: Song Liu; +Cc: linux-kernel, jiangshanlai, leitao, pmladek, kernel-team,
puranjay
Hello,
On Fri, Mar 20, 2026 at 12:23:32PM -0700, Song Liu wrote:
> index b77119d71641..5b501ff1223a 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -7701,6 +7701,23 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
>
> /* did we stall? */
> if (time_after(now, ts + thresh)) {
> + unsigned long irq_flags;
> +
> + raw_spin_lock_irqsave(&pool->lock, irq_flags);
Can you use guard() instead? I don't think we can just keep the lock in the
block.
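[For context: guard() in the kernel is built on the scope-based cleanup helpers in <linux/cleanup.h>, which rely on the compiler's cleanup attribute (a GCC/Clang extension). A rough userspace illustration of that mechanism follows; the fake_lock/fake_unlock names and the guard_demo() macro are invented for this sketch, not kernel API:]

```c
#include <assert.h>

static int lock_depth;	/* models whether the lock is held */

static void fake_lock(void) { lock_depth++; }
static void fake_unlock(int *token) { (void)token; lock_depth--; }

/*
 * Rough userspace analogue of the kernel's guard(): the cleanup
 * attribute runs fake_unlock() when _guard leaves scope, so the
 * lock is always released when the block exits, on any path.
 */
#define guard_demo() \
	__attribute__((cleanup(fake_unlock))) int _guard = (fake_lock(), 0)

static int recheck_under_guard(void)
{
	{
		guard_demo();
		assert(lock_depth == 1);	/* held inside the block */
	}
	return lock_depth;			/* released automatically */
}
```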
> + /*
> + * Recheck last_progress_ts with pool->lock held;
> + * this eliminates false positives where we would
> + * report a wq stall for newly queued work.
> + */
And I think this deserves a bit more explanation. Can you move this comment
outside the block where /* did we stall? */ is and give the scenario where
it can go wrong w/o the locked check?
> + pool_ts = READ_ONCE(pool->last_progress_ts);
pool->last_progress_ts is write protected by pool lock. No need for
READ_ONCE().
> + if (time_after(pool_ts, touched))
> + ts = pool_ts;
> + else
> + ts = touched;
> + raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
> + if (!time_after(now, ts + thresh))
> + continue;
Thanks.
--
tejun
* Re: [PATCH] workqueue: Fix false positive stall reports
From: Breno Leitao @ 2026-03-23 9:35 UTC (permalink / raw)
To: Song Liu; +Cc: linux-kernel, tj, jiangshanlai, pmladek, kernel-team, puranjay
On Fri, Mar 20, 2026 at 12:23:32PM -0700, Song Liu wrote:
> On weakly ordered architectures (e.g., arm64), the lockless check in
> wq_watchdog_timer_fn() can observe a reordering between the worklist
> insertion and the last_progress_ts update. Specifically, the watchdog
> can see a non-empty worklist (from a list_add) while reading a stale
> last_progress_ts value, causing a false positive stall report.
>
> This was confirmed by reading pool->last_progress_ts again after holding
> pool->lock in wq_watchdog_timer_fn():
>
> workqueue watchdog: pool 7 false positive detected!
> lockless_ts=4784580465 locked_ts=4785033728
> diff=453263ms worklist_empty=0
>
> To avoid slowing down the hot path (queue_work, etc.), recheck
> last_progress_ts with pool->lock held. This will eliminate the false
> positive with minimal overhead.
>
> Remove two extra empty lines in wq_watchdog_timer_fn() while we are at it.
>
> Assisted-by: claude-code:claude-opus-4-6
> Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Breno Leitao <leitao@debian.org>