public inbox for linux-kernel@vger.kernel.org
From: Breno Leitao <leitao@debian.org>
To: Petr Mladek <pmladek@suse.com>
Cc: Tejun Heo <tj@kernel.org>, Lai Jiangshan <jiangshanlai@gmail.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	 Omar Sandoval <osandov@osandov.com>, Song Liu <song@kernel.org>,
	 Danielle Costantino <dcostantino@meta.com>,
	kasan-dev@googlegroups.com, kernel-team@meta.com
Subject: Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers
Date: Fri, 13 Mar 2026 10:36:09 -0700	[thread overview]
Message-ID: <abREbwgbsviqGq51@gmail.com> (raw)
In-Reply-To: <abQhgUAyAphVTHWd@pathway>

On Fri, Mar 13, 2026 at 03:38:57PM +0100, Petr Mladek wrote:
> On Fri 2026-03-13 05:24:54, Breno Leitao wrote:

> > Right, but let's look at this case:
> > 
> > 	 BUG: workqueue lockup - pool 55 cpu 13 curr 0 (swapper/13) stack ffff800085640000 cpus=13 node=0 flags=0x0 nice=-20 stuck for 679s!
> > 	  work func=blk_mq_timeout_work data=0xffff0000ad7e3a05
> > 	  Showing busy workqueues and worker pools:
> > 	  workqueue events_unbound: flags=0x2
> > 	    pwq 288: cpus=0-71 flags=0x4 nice=0 active=1 refcnt=2
> > 	      in-flight: 4083734:btrfs_extent_map_shrinker_worker
> > 	  workqueue mm_percpu_wq: flags=0x8
> > 	    pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=1 refcnt=2
> > 	      pending: vmstat_update
> > 	  pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 1715676 4086805 3860852 3587585 4065550 4014041 3944711 3744484
> > 	  Showing backtraces of running workers in stalled CPU-bound worker pools:
> > 		# Nothing in here
> > 
> > It seems CPU 13 is idle (curr = 0) and blk_mq_timeout_work has been pending for
> > 679s?
> 
> It looks like progress is not blocked by an overloaded CPU.

Looking at the data address, it seems it always has the low bits 0x5 set,
meaning that WORK_STRUCT_PENDING and WORK_STRUCT_PWQ are set, right?

So, the work has been pending for a huge amount of time (see more examples below).

> One interesting thing is that there is no "pwq XXX: cpus=13" in the list
> of busy workqueues and worker pools. IMHO, the watchdog should report
> a stall only when there is pending work. This report does not make much
> sense to me.
> 
> BTW: I looked at pr_cont_pool_info() in the mainline and it does
> not print the name of the current process and its stack address.
> I guess that it is printed by another debugging patch?

Sorry, this was a simple change we carried initially; it basically does:

	void *curr_stack;

	curr_stack = try_get_task_stack(curr);
	pr_emerg("BUG: workqueue lockup - pool %d cpu %d curr %d (%s) stack %px\n",
		 pool->id, pool->cpu, curr->pid,
		 curr->comm, curr_stack);
> 
> 
> > 	  pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 17
> 
> > > I wonder what blocked the idle worker from waking or forking
> > > a new worker. Was it caused by the IPIs?
> > 
> > Not sure; keep in mind that these hosts (arm64) do not have NMIs, so
> > IPIs are just regular interrupts that can take a long time to be handled.
> > toggle_allocation_gate() was a good example, given that it was sending IPIs
> > very frequently, and I took it as an example for the cover letter, but this
> > problem also shows up in different places (more examples later).
> > 
> > > Did printing the sleeping workers help to analyze the problem?
> > 
> > That is my hope. I don't have a reproducer other than the one in this
> > patchset.
> 
> Good to know. Note that the reproducer is not "realistic".
> PF_WQ_WORKER is an internal flag and must not be manipulated
> by the queued work callbacks. It is like shooting yourself in the foot.

Ack!

> > I am currently rolling this patchset to production, and I can report once
> > I get more information.
> 
> That would be great. I am really curious what is the root problem here.

In fact, I got some instances of this issue with this new patchset, and,
still, the backtrace is empty. These are the only three instances I got with
the new patches applied, all of them with the blk_mq_timeout_work function:

	BUG: workqueue lockup - pool 11 cpu 2 curr 686384 (thrmon_agg) stack ffff8002bd200000 cpus=2 node=0 flags=0x0 nice=-20 stuck for 276s!
	   work func=blk_mq_timeout_work data=0xffff0000b88e3405
	   Showing busy workqueues and worker pools:
	   workqueue kblockd: flags=0x18
	     pwq 11: cpus=2 node=0 flags=0x0 nice=-20 active=1 refcnt=2
	       pending: blk_mq_timeout_work
	   Showing backtraces of busy workers in stalled CPU-bound worker pools:

	BUG: workqueue lockup - pool 7 cpu 1 curr 0 (swapper/1) stack ffff800084f80000 cpus=1 node=0 flags=0x0 nice=-20 stuck for 114s!
	   work func=blk_mq_timeout_work data=0xffff0000b88e3205
	   Showing busy workqueues and worker pools:
	   workqueue events: flags=0x0
	     pwq 510: cpus=127 node=1 flags=0x0 nice=0 active=1 refcnt=2
	       pending: psi_avgs_work
	   Showing backtraces of busy workers in stalled CPU-bound worker pools:

	BUG: workqueue lockup - pool 11 cpu 2 curr 24596 (mcrcfg-fci) stack ffff8002b5a40000 cpus=2 node=0 flags=0x0 nice=-20 stuck for 282s!
	   work func=blk_mq_timeout_work data=0xffff0000b8706805
	   Showing busy workqueues and worker pools:
	   Showing backtraces of busy workers in stalled CPU-bound worker pools:

> In all these cases, there is listed some pending work on the stuck
> "cpus=XXX". So, it looks more sane than the 1st report.
> 
> I agree that it looks ugly that it did not print any backtraces.
> But I am not sure if the backtraces would help.
> 
> If there is no running worker then wq_worker_sleeping() should wake up
> another idle worker. And if this is the last idle worker in the
> per-CPU pool then it should create another worker.
> 
> Honestly, I think that there is only a small chance that the backtraces
> of the sleeping workers will help to solve the problem.
> 
> IMHO, the problem is that wq_worker_sleeping() was not able to
> guarantee forward progress. Note that there should always be
> at least one idle worker in CPU-bound worker pools.
> 
> Now, there might be more reasons why it failed:
> 
>   1. It did not wake up any idle worker because it thought
>      this had already been done, for example because of a messed-up
>      worker->sleeping flag, worker->flags & WORKER_NOT_RUNNING flag,
>      or pool->nr_running count.
> 
>      IMHO, the chance of this bug is small.
> 
> 
>   2. The scheduler does not schedule the woken idle worker because:
> 
> 	+ too big load
> 	+ soft/hardlockup on the given CPU
> 	+ the scheduler does not schedule anything, e.g. because of
> 	  stop_machine()
> 
>       It seems that this is not the case in the 1st example where
>       the CPU is idle. But I am not sure how exactly IPIs are
>       handled on arm64.

I don't have information about the load on those machines when the problem
happens, but in some cases the problem happens when there is no workload
(production job) running on the machine, so it is hard to assume that the
load is high.

>    3. There always must be at least one idle worker in each pool.
>       But the last idle worker never processes pending work itself.
>       It has to create another worker instead.
> 
>       create_worker() might fail for several reasons:
> 
> 	+ worker pool limit (is there any?)
> 	+ PID limit
> 	+ memory limit
> 
>       I have personally seen these problems caused by PID limit.
>       Note that containers might have relatively small limits by
>       default !!!

Might this explain why the WORK_STRUCT_PENDING bit stays set for ~200
seconds?


> I think that it might be interesting to print backtrace and
> state of the worker which is supposed to guarantee progress.
> Is it "pool->manager" ?
> 
> Also create_worker() prints an error when it can't create worker.
> But the error is printed only once. And it might get lost on
> huge systems with extensive load and logging.

That is definitely not the case. I've scanned Meta's whole fleet for
create_worker errors, and there is a single instance on an unrelated host.

> Maybe we could add some global variable to allow printing
> these errors again when a workqueue stall is detected.
> 
> Or store some timestamps when the function tried to create a new worker
> and when it succeeded last time. And print it in the stall report.


Thread overview: 24+ messages
2026-03-05 16:15 [PATCH v2 0/5] workqueue: Detect stalled in-flight workers Breno Leitao
2026-03-05 16:15 ` [PATCH v2 1/5] workqueue: Use POOL_BH instead of WQ_BH when checking pool flags Breno Leitao
2026-03-05 17:13   ` Song Liu
2026-03-05 16:15 ` [PATCH v2 2/5] workqueue: Rename pool->watchdog_ts to pool->last_progress_ts Breno Leitao
2026-03-05 17:16   ` Song Liu
2026-03-05 16:15 ` [PATCH v2 3/5] workqueue: Show in-flight work item duration in stall diagnostics Breno Leitao
2026-03-05 17:17   ` Song Liu
2026-03-05 16:15 ` [PATCH v2 4/5] workqueue: Show all busy workers " Breno Leitao
2026-03-05 17:17   ` Song Liu
2026-03-12 17:03   ` Petr Mladek
2026-03-13 12:57     ` Breno Leitao
2026-03-13 16:27       ` Petr Mladek
2026-03-18 11:31         ` Breno Leitao
2026-03-18 15:11           ` Petr Mladek
2026-03-20 10:41             ` Breno Leitao
2026-03-05 16:15 ` [PATCH v2 5/5] workqueue: Add stall detector sample module Breno Leitao
2026-03-05 17:25   ` Song Liu
2026-03-05 17:39 ` [PATCH v2 0/5] workqueue: Improve stall diagnostics Tejun Heo
2026-03-12 16:38 ` [PATCH v2 0/5] workqueue: Detect stalled in-flight workers Petr Mladek
2026-03-13 12:24   ` Breno Leitao
2026-03-13 14:38     ` Petr Mladek
2026-03-13 17:36       ` Breno Leitao [this message]
2026-03-18 16:46         ` Petr Mladek
2026-03-20 10:44           ` Breno Leitao
