* Re: [PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics
[not found] ` <20260305-wqstall_start-at-v2-4-b60863ee0899@debian.org>
@ 2026-05-07 10:20 ` Jiri Slaby
2026-05-07 13:11 ` Breno Leitao
0 siblings, 1 reply; 3+ messages in thread
From: Jiri Slaby @ 2026-05-07 10:20 UTC
To: Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, Omar Sandoval, Song Liu, Danielle Costantino,
kasan-dev, Petr Mladek, kernel-team
On 05. 03. 26, 17:15, Breno Leitao wrote:
> show_cpu_pool_hog() only prints workers whose task is currently running
> on the CPU (task_is_running()). This misses workers that are busy
> processing a work item but are sleeping or blocked — for example, a
> worker that clears PF_WQ_WORKER and enters wait_event_idle(). Such a
> worker still occupies a pool slot and prevents progress, yet produces
> an empty backtrace section in the watchdog output.
>
> This is happening on real arm64 systems, where
> toggle_allocation_gate() IPIs every single CPU in the machine (which
> lacks NMI), causing workqueue stalls that show empty backtraces because
> toggle_allocation_gate() is sleeping in wait_event_idle().
>
> Remove the task_is_running() filter so every in-flight worker in the
> pool's busy_hash is dumped. The busy_hash is protected by pool->lock,
> which is already held.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> kernel/workqueue.c | 28 +++++++++++++---------------
> 1 file changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 56d8af13843f8..09b9ad78d566c 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -7583,9 +7583,9 @@ MODULE_PARM_DESC(panic_on_stall_time, "Panic if stall exceeds this many seconds
>
>  /*
>   * Show workers that might prevent the processing of pending work items.
> - * The only candidates are CPU-bound workers in the running state.
> - * Pending work items should be handled by another idle worker
> - * in all other situations.
> + * A busy worker that is not running on the CPU (e.g. sleeping in
> + * wait_event_idle() with PF_WQ_WORKER cleared) can stall the pool just as
> + * effectively as a CPU-bound one, so dump every in-flight worker.
>   */
>  static void show_cpu_pool_hog(struct worker_pool *pool)
>  {
> @@ -7596,19 +7596,17 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
>  	raw_spin_lock_irqsave(&pool->lock, irq_flags);
>
>  	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
> -		if (task_is_running(worker->task)) {
We see dumps from non-existent CPUs on 7.0, for example in
https://bugzilla.suse.com/show_bug.cgi?id=1263947:
BUG: workqueue lockup - pool cpus=144 node=0 flags=0x4 nice=0 stuck for 168224s!
...
Showing busy workqueues and worker pools:
workqueue rcu_gp: flags=0x108
pwq 578: cpus=144 node=0 flags=0x4 nice=0 active=3 refcnt=4
Can this (or another patch from the series) cause this? Should there be
something like cpu_online() instead of task_is_running() somewhere?
> -			/*
> -			 * Defer printing to avoid deadlocks in console
> -			 * drivers that queue work while holding locks
> -			 * also taken in their write paths.
> -			 */
> -			printk_deferred_enter();
> +		/*
> +		 * Defer printing to avoid deadlocks in console
> +		 * drivers that queue work while holding locks
> +		 * also taken in their write paths.
> +		 */
> +		printk_deferred_enter();
>
> -			pr_info("pool %d:\n", pool->id);
> -			sched_show_task(worker->task);
> +		pr_info("pool %d:\n", pool->id);
> +		sched_show_task(worker->task);
>
> -			printk_deferred_exit();
> -		}
> +		printk_deferred_exit();
>  	}
>
>  	raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
thanks,
--
js
suse labs
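The filter change in the quoted patch can be modelled in userspace C. What
follows is a minimal sketch, not kernel code: the model_* types and
model_show_pool_hog() are hypothetical stand-ins for struct worker, struct
worker_pool and show_cpu_pool_hog(), and a plain array stands in for
pool->busy_hash. It only illustrates why a pool whose busy workers are all
sleeping produced an empty backtrace section before the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins; not the kernel's data structures. */
enum model_task_state { MODEL_RUNNING, MODEL_SLEEPING };

struct model_worker {
	const char *name;
	enum model_task_state state;
};

struct model_pool {
	int id;
	const struct model_worker *busy;	/* stands in for pool->busy_hash */
	size_t nr_busy;
};

/*
 * Print the busy workers a stall report would dump and return how many
 * were printed.
 * only_running == true models the old task_is_running() filter;
 * only_running == false models the patched behaviour (dump every busy worker).
 */
static size_t model_show_pool_hog(const struct model_pool *pool, bool only_running)
{
	size_t shown = 0;

	assert(pool != NULL);
	for (size_t i = 0; i < pool->nr_busy; i++) {
		const struct model_worker *w = &pool->busy[i];

		/* Old behaviour: skip workers that are busy but not on the CPU. */
		if (only_running && w->state != MODEL_RUNNING)
			continue;
		printf("pool %d: %s\n", pool->id, w->name);
		shown++;
	}
	return shown;
}
```

With a pool holding one running and two sleeping busy workers, the old
filter reports one worker and the patched loop reports all three.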
* Re: [PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics
From: Breno Leitao @ 2026-05-07 13:11 UTC
To: Jiri Slaby
Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, linux-kernel,
Omar Sandoval, Song Liu, Danielle Costantino, kasan-dev,
Petr Mladek, kernel-team
Hi Jiri,
On Thu, May 07, 2026 at 12:20:33PM +0200, Jiri Slaby wrote:
> On 05. 03. 26, 17:15, Breno Leitao wrote:
>
> BUG: workqueue lockup - pool cpus=144 node=0 flags=0x4 nice=0 stuck for
> 168224s!
That's an extremely long stall (~1.95 days).
> ...
> Showing busy workqueues and worker pools:
> workqueue rcu_gp: flags=0x108
> pwq 578: cpus=144 node=0 flags=0x4 nice=0 active=3 refcnt=4
> in:
> https://bugzilla.suse.com/show_bug.cgi?id=1263947
> ?
>
> Can this (or other patch from the series) cause this? Should there be
> something like cpu_online() instead of task_is_running() somewhere?
This series only affects stall reporting, not detection. The changes run
after the watchdog has identified a stall, so the detection logic itself
remains unchanged.
To help diagnose this issue, could you provide some additional information:
1) Was CPU 144 online at any point? If so, when was it taken offline?
2) Does this message appear repeatedly? If you bring CPU 144 online, does
the issue resolve?
3) Have you run similar tests on earlier kernel versions without seeing
this behavior, or is this a clear regression?
Thanks for the report,
--breno
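The distinction drawn in the reply above, that the series changes stall
reporting but not stall detection, can be sketched as two separate steps.
This is an illustrative userspace model under stated assumptions, not the
kernel's watchdog: struct model_pool_ts, model_pool_stalled() and
model_workers_reported() are hypothetical names, and the detection rule
(elapsed time since last progress exceeds a threshold) is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; field and function names are illustrative only. */
struct model_pool_ts {
	unsigned long last_progress;	/* seconds; last time the pool made progress */
	size_t nr_running;		/* busy workers currently on a CPU */
	size_t nr_sleeping;		/* busy workers that are sleeping or blocked */
};

/* Detection: depends only on the timestamps and the threshold. */
static bool model_pool_stalled(const struct model_pool_ts *pool,
			       unsigned long now, unsigned long thresh)
{
	assert(pool != NULL);
	return now - pool->last_progress > thresh;
}

/*
 * Reporting: runs only for a pool that detection already flagged.
 * The only_running flag selects the old or the new reporting filter;
 * it has no influence on whether the pool is flagged at all.
 */
static size_t model_workers_reported(const struct model_pool_ts *pool,
				     unsigned long now, unsigned long thresh,
				     bool only_running)
{
	if (!model_pool_stalled(pool, now, thresh))
		return 0;
	return only_running ? pool->nr_running
			    : pool->nr_running + pool->nr_sleeping;
}
```

In this model, flipping only_running changes how many workers appear in the
report for a stalled pool, but a pool that detection does not flag is never
reported under either setting.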
* Re: [PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics
From: Jiri Slaby @ 2026-05-11 5:21 UTC
To: Breno Leitao
Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, linux-kernel,
Omar Sandoval, Song Liu, Danielle Costantino, kasan-dev,
Petr Mladek, kernel-team
Hi,
We currently have several reports of this, on s390, ppc64, and x86_64.
On 07. 05. 26, 15:11, Breno Leitao wrote:
> Hi Jiri,
>
> On Thu, May 07, 2026 at 12:20:33PM +0200, Jiri Slaby wrote:
>> On 05. 03. 26, 17:15, Breno Leitao wrote:
>>
>> BUG: workqueue lockup - pool cpus=144 node=0 flags=0x4 nice=0 stuck for
>> 168224s!
>
> That's an extremely long stall (~1.95 days).
>
>> ...
>> Showing busy workqueues and worker pools:
>> workqueue rcu_gp: flags=0x108
>> pwq 578: cpus=144 node=0 flags=0x4 nice=0 active=3 refcnt=4
>> in:
>> https://bugzilla.suse.com/show_bug.cgi?id=1263947
>> ?
>>
>> Can this (or other patch from the series) cause this? Should there be
>> something like cpu_online() instead of task_is_running() somewhere?
>
> This series only affects stall reporting, not detection. The changes run
> after the watchdog has identified a stall, so the detection logic itself
> remains unchanged.
>
> To help diagnose this issue, could you provide some additional information:
>
> 1) Was CPU 144 online at any point? If so, when was it taken offline?
It was not; it's non-present.
> 2) Does this message appear repeatedly? If you bring CPU 144 online, does
> the issue resolve?
Yes, look at this new x86_64 report's dmesg (I believe it is related to
the above report):
BUG: workqueue lockup - pool cpus=2 node=0 flags=0x4 nice=0 stuck for 50s!
in:
https://bugzilla.suse.com/attachment.cgi?id=890229
$ grep -c BUG sl.txt
504
$ grep -c pwq sl.txt
509
It comes from:
https://bugzilla.suse.com/show_bug.cgi?id=1264554
> 3) Have you run similar tests on earlier kernel versions without seeing
> this behavior, or is this a clear regression?
It's new in 7.0. Going back to 6.19.12 makes it disappear.
thanks,
--
js
suse labs