Date: Fri, 13 Mar 2026 15:38:57 +0100
From: Petr Mladek
To: Breno Leitao
Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, linux-kernel@vger.kernel.org,
 Omar Sandoval, Song Liu, Danielle Costantino, kasan-dev@googlegroups.com,
 kernel-team@meta.com
Subject: Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers
References: <20260305-wqstall_start-at-v2-0-b60863ee0899@debian.org>

On Fri 2026-03-13 05:24:54, Breno Leitao wrote:
> Hello Petr,
>
> On Thu, Mar 12, 2026 at 05:38:26PM +0100, Petr Mladek wrote:
> > On Thu 2026-03-05 08:15:36, Breno Leitao wrote:
> > > A blind spot exists in the workqueue stall detector (aka
> > > show_cpu_pool_hog()). It only prints workers whose task_is_running() is
> > > true, so a busy worker that is sleeping (e.g. wait_event_idle())
> > > produces an empty backtrace section even though it is the cause of the
> > > stall.
> > >
> > > Additionally, when the watchdog does report stalled pools, the output
> > > doesn't show how long each in-flight work item has been running, making
> > > it harder to identify which specific worker is stuck.
> > >
> > > Example output of the sample code:
> > >
> > >   BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
> > >   Showing busy workqueues and worker pools:
> > >   workqueue events: flags=0x100
> > >     pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
> > >       in-flight: 178:stall_work1_fn [wq_stall]
> > >       pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
> > >   ...
> > >   Showing backtraces of running workers in stalled CPU-bound worker pools:
> > >
> > > I see it happening on real machines, causing some stalls that don't
> > > have any backtrace. This is one of the code paths:
> > >
> > > 1) kfence executes toggle_allocation_gate() as a delayed workqueue
> > >    item (kfence_timer) on the system WQ.
> > >
> > > 2) toggle_allocation_gate() enables a static key, which IPIs every
> > >    CPU to patch code:
> > >
> > >      static_branch_enable(&kfence_allocation_key);
> > >
> > > 3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
> > >    kfence allocation to occur:
> > >
> > >      wait_event_idle(allocation_wait,
> > >                      atomic_read(&kfence_allocation_gate) > 0 || ...);
> > >
> > >    This can last indefinitely if no allocation goes through the
> > >    kfence path (or IPIing all the CPUs takes longer, which is common on
> > >    platforms that do not have NMI).
> > >
> > >    The worker remains in the pool's busy_hash (in-flight) but is no
> > >    longer task_is_running().
> > >
> > > 4) The workqueue watchdog detects the stall and calls
> > >    show_cpu_pool_hog(), which only prints backtraces for workers
> > >    that are actively running on a CPU:
> > >
> > >      static void show_cpu_pool_hog(struct worker_pool *pool) {
> > >              ...
> > >              if (task_is_running(worker->task))
> > >                      sched_show_task(worker->task);
> > >      }
> > >
> > > 5) Nothing is printed because the offending worker is in TASK_IDLE
> > >    state. The output shows "Showing backtraces of running workers in
> > >    stalled CPU-bound worker pools:" followed by nothing, effectively
> > >    hiding the actual culprit.
> >
> > I am trying to better understand the situation. There was a reason
> > why only the workers in the running state were shown.
> >
> > Normally, a sleeping worker should not cause a stall. The scheduler calls
> > wq_worker_sleeping() which should wake up another idle worker. There is
> > always at least one idle worker in the pool. It should start processing
> > the next pending work. Or it should fork another worker when it was
> > the last idle one.
>
> Right, but let's look at this case:
>
> BUG: workqueue lockup - pool 55 cpu 13 curr 0 (swapper/13) stack ffff800085640000 cpus=13 node=0 flags=0x0 nice=-20 stuck for 679s!
>   work func=blk_mq_timeout_work data=0xffff0000ad7e3a05
> Showing busy workqueues and worker pools:
> workqueue events_unbound: flags=0x2
>   pwq 288: cpus=0-71 flags=0x4 nice=0 active=1 refcnt=2
>     in-flight: 4083734:btrfs_extent_map_shrinker_worker
> workqueue mm_percpu_wq: flags=0x8
>   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=1 refcnt=2
>     pending: vmstat_update
> pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 1715676 4086805 3860852 3587585 4065550 4014041 3944711 3744484
> Showing backtraces of running workers in stalled CPU-bound worker pools:
> # Nothing in here
>
> It seems CPU 13 is idle (curr = 0) and blk_mq_timeout_work has been pending for
> 679s ?

It looks like progress is not blocked by an overloaded CPU.

One interesting thing is that there is no "pwq XXX: cpus=13" in the list of
busy workqueues and worker pools. IMHO, the watchdog should report a stall
only when there is pending work. The report does not make much sense to me
otherwise.

BTW: I looked at pr_cont_pool_info() in the mainline and it does not print
the name of the current process and its stack address. I guess that it is
printed by another debugging patch?

> pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 17
>
> > I wonder what blocked the idle worker from waking or forking
> > a new worker. Was it caused by the IPIs?
>
> Not sure, keep in mind that these hosts (arm64) do not have NMI, so
> IPIs are just regular interrupts that could take a long time to be handled. The
> toggle_allocation_gate() was a good example, given it was sending IPIs very
> frequently and I took it as an example for the cover letter, but this problem
> also shows up in different places. (more examples later)
>
> > Did printing the sleeping workers help to analyze the problem?
>
> That is my hope. I don't have a reproducer other than the one in this
> patchset.

Good to know. Note that the reproducer is not "realistic". PF_WQ_WORKER is
an internal flag and must not be manipulated by the queued work callbacks.
It is like shooting yourself in the foot.
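FWIW, the blind spot itself can be shown without touching any workqueue
internals. Something along these lines (an untested sketch, all the demo_*
names are made up) leaves an in-flight worker sleeping in TASK_IDLE, similar
to toggle_allocation_gate(). On its own it should not trigger the watchdog
because the pool can still make forward progress, but whenever a stall gets
reported for that pool, the sleeping worker would show up only as an
in-flight entry with an empty backtrace section:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wait);
static bool demo_done;

static void demo_stall_fn(struct work_struct *work)
{
	/*
	 * Sleep in TASK_IDLE like toggle_allocation_gate() does. The worker
	 * stays in the pool's busy_hash but task_is_running() is false.
	 */
	wait_event_idle(demo_wait, READ_ONCE(demo_done));
}

static DECLARE_WORK(demo_stall_work, demo_stall_fn);

static int __init demo_init(void)
{
	/* Queue on the system workqueue; nothing ever sets demo_done. */
	queue_work(system_wq, &demo_stall_work);
	return 0;
}

static void __exit demo_exit(void)
{
	/* Let the worker finish before the module goes away. */
	WRITE_ONCE(demo_done, true);
	wake_up(&demo_wait);
	flush_work(&demo_stall_work);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");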
> I am currently rolling this patchset to production, and I can report once
> I get more information.

That would be great. I am really curious what the root problem is here.

> > I wonder if we could do better in this case. For example, warn
> > that the scheduler failed to wake up another idle worker when
> > no worker is in the running state. And maybe, print the backtrace
> > of the currently running process on the given CPU because it
> > likely blocks waking/scheduling the idle worker.
>
> I am happy to improve this, given this has been a hard issue. Let me give more
> instances of the "empty" stalls I am seeing. All with empty backtraces:
>
> # Instance 1
> BUG: workqueue lockup - pool cpus=33 node=0 flags=0x0 nice=0 stuck for 33s!
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
>   pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=3 refcnt=4
>     pending: 3*psi_avgs_work
>   pwq 218: cpus=54 node=0 flags=0x0 nice=0 active=1 refcnt=2
>     in-flight: 842:key_garbage_collector
> workqueue mm_percpu_wq: flags=0x8
>   pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=1 refcnt=2
>     pending: vmstat_update
> pool 218: cpus=54 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 11200 524627
> Showing backtraces of running workers in stalled CPU-bound worker pools:
>
> # Instance 2
> BUG: workqueue lockup - pool cpus=53 node=0 flags=0x0 nice=0 stuck for 459s!
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
>   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
>     pending: psi_avgs_work
>   pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=4 refcnt=5
>     pending: 2*psi_avgs_work, drain_local_memcg_stock, iova_depot_work_func
> workqueue events_freezable: flags=0x4
>   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
>     pending: pci_pme_list_scan
> workqueue slub_flushwq: flags=0x8
>   pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=3
>     pending: flush_cpu_slab BAR(7520)
> workqueue mm_percpu_wq: flags=0x8
>   pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=2
>     pending: vmstat_update
> workqueue mlx5_cmd_0002:03:00.1: flags=0x6000a
>   pwq 576: cpus=0-143 flags=0x4 nice=0 active=1 refcnt=146
>     pending: cmd_work_handler
> Showing backtraces of running workers in stalled CPU-bound worker pools:
>
> # Instance 3
> BUG: workqueue lockup - pool cpus=74 node=1 flags=0x0 nice=0 stuck for 31s!
> Showing busy workqueues and worker pools:
> workqueue mm_percpu_wq: flags=0x8
>   pwq 298: cpus=74 node=1 flags=0x0 nice=0 active=1 refcnt=2
>     pending: vmstat_update
> Showing backtraces of running workers in stalled CPU-bound worker pools:
>
> # Instance 4
> BUG: workqueue lockup - pool cpus=71 node=0 flags=0x0 nice=0 stuck for 32s!
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
>   pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=2 refcnt=3
>     pending: psi_avgs_work, fuse_check_timeout
> workqueue events_freezable: flags=0x4
>   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
>     pending: pci_pme_list_scan
> workqueue mm_percpu_wq: flags=0x8
>   pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=1 refcnt=2
>     pending: vmstat_update
> Showing backtraces of running workers in stalled CPU-bound worker pools:

In all these cases, some pending work is listed for the stuck "cpus=XXX".
So, it looks more sane than the 1st report.

I agree that it looks ugly that it did not print any backtraces. But I am
not sure if the backtraces would help. If there is no running worker then
wq_worker_sleeping() should wake up another idle worker. And if this is the
last idle worker in the per-CPU pool then it should create another worker.
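To make that expectation concrete, here is a tiny userspace toy model
(plain C, not the real kernel code; all the names are made up) of the
forward-progress rule. The interesting branch is where no idle worker is
left, or where the last idle one cannot create a replacement; that is
where the watchdog would end up firing:

#include <stdbool.h>
#include <stdio.h>

struct pool {
	int nr_running;		/* busy workers currently on a CPU */
	int nr_idle;		/* idle workers ready to pick up work */
	int nr_pending;		/* queued work items not yet in flight */
	bool can_create;	/* whether creating a new worker would succeed */
};

/* Models the wq_worker_sleeping() step: a busy worker is about to block. */
static void busy_worker_sleeps(struct pool *p)
{
	p->nr_running--;
	if (p->nr_running > 0 || p->nr_pending == 0)
		return;				/* nothing to do */

	if (p->nr_idle == 0) {
		/* Should not happen: the pool always keeps one idle worker. */
		printf("no idle worker at all -> stall\n");
		return;
	}

	/*
	 * The last idle worker has to create a replacement before it starts
	 * processing work. If that fails (PID limit, memory limit, ...),
	 * the pool may stop making progress once this worker blocks too.
	 */
	if (p->nr_idle == 1 && !p->can_create)
		printf("last idle worker, replacement creation failed -> fragile\n");

	p->nr_idle--;
	p->nr_running++;
	p->nr_pending--;
	printf("idle worker took over, progress continues\n");
}

int main(void)
{
	struct pool ok = { .nr_running = 1, .nr_idle = 1,
			   .nr_pending = 2, .can_create = true };
	struct pool stuck = { .nr_running = 1, .nr_idle = 0,
			      .nr_pending = 2, .can_create = false };

	busy_worker_sleeps(&ok);	/* progress continues */
	busy_worker_sleeps(&stuck);	/* the stall case */
	return 0;
}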
Honestly, I think that there is only a small chance that the backtraces of
the sleeping workers will help to solve the problem.

IMHO, the problem is that wq_worker_sleeping() was not able to guarantee
forward progress. Note that there should always be at least one idle worker
in CPU-bound worker pools. Now, there might be more reasons why it failed:

1. It did not wake up any idle worker because it thought that it had already
   been done, for example, because of a messed-up worker->sleeping flag,
   worker->flags & WORKER_NOT_RUNNING flag, or pool->nr_running count.

   IMHO, the chance of this bug is small.

2. The scheduler does not schedule the woken idle worker because of:

   + too big a load
   + a soft/hardlockup on the given CPU
   + the scheduler not scheduling anything at all, e.g. because of
     stop_machine()

   It seems that this is not the case in the 1st example where the CPU is
   idle. But I am not sure how exactly IPIs are handled on arm64.

3. There always must be at least one idle worker in each pool. But the last
   idle worker never processes pending work. It has to create another worker
   instead. create_worker() might fail for more reasons:

   + worker pool limit (is there any?)
   + PID limit
   + memory limit

   I have personally seen these problems caused by the PID limit. Note that
   containers might have relatively small limits by default !!!

4. ???

I think that it might be interesting to print the backtrace and state of the
worker which is supposed to guarantee progress. Is it "pool->manager"?

Also create_worker() prints an error when it can't create a worker. But the
error is printed only once. And it might get lost on huge systems with
extensive load and logging. Maybe we could add some global variable allowing
these errors to be printed once again when a workqueue stall is detected.
Or store timestamps of when the function last tried to create a new worker
and when it last succeeded, and print them in the stall report.

Best Regards,
Petr