public inbox for linuxppc-dev@ozlabs.org
From: Samir M <samir@linux.ibm.com>
To: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>, Tejun Heo <tj@kernel.org>,
	RCU <rcu@vger.kernel.org>,
	linuxppc-dev@lists.ozlabs.org,
	Shrikanth Hegde <sshegde@linux.ibm.com>
Subject: [mainline][BUG] Observed Workqueue lockups on offline CPUs.
Date: Mon, 27 Apr 2026 15:32:35 +0530	[thread overview]
Message-ID: <97a7d011-d573-4754-9e5d-68b562c64089@linux.ibm.com> (raw)

Hi Paul,

I've been testing the latest upstream kernel on a PowerPC system and 
encountered workqueue lockup issues that I've bisected to commit 
61bbcfb50514 ("srcu: Push srcu_node allocation to GP when non-preemptible").
After booting, I'm seeing workqueue lockup warnings for CPUs 81-96, 
which are offline on my system. The workqueues remain stuck for over 237 
seconds:

[  243.309302][    C0] BUG: workqueue lockup - pool cpus=81 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309311][    C0] BUG: workqueue lockup - pool cpus=82 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309318][    C0] BUG: workqueue lockup - pool cpus=83 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309326][    C0] BUG: workqueue lockup - pool cpus=84 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309333][    C0] BUG: workqueue lockup - pool cpus=85 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309341][    C0] BUG: workqueue lockup - pool cpus=86 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309348][    C0] BUG: workqueue lockup - pool cpus=87 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309355][    C0] BUG: workqueue lockup - pool cpus=88 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309363][    C0] BUG: workqueue lockup - pool cpus=89 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309370][    C0] BUG: workqueue lockup - pool cpus=90 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309377][    C0] BUG: workqueue lockup - pool cpus=91 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309384][    C0] BUG: workqueue lockup - pool cpus=92 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309392][    C0] BUG: workqueue lockup - pool cpus=93 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309399][    C0] BUG: workqueue lockup - pool cpus=94 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309406][    C0] BUG: workqueue lockup - pool cpus=95 node=0 flags=0x4 nice=0 stuck for 237s!
[  243.309413][    C0] BUG: workqueue lockup - pool cpus=96 node=0 flags=0x4 nice=0 stuck for 237s!

Git bisect identified this as the first bad commit:

commit 61bbcfb50514a8a94e035a7349697a3790ab4783
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Fri Mar 20 20:29:20 2026 -0700

     srcu: Push srcu_node allocation to GP when non-preemptible

     When the srcutree.convert_to_big and srcutree.big_cpu_lim kernel boot
     parameters specify initialization-time allocation of the srcu_node
     tree for statically allocated srcu_struct structures (for example, in
     DEFINE_SRCU() at build time instead of init_srcu_struct() at runtime),
     init_srcu_struct_nodes() will attempt to dynamically allocate this tree
     at the first run-time update-side use of this srcu_struct structure,
     but while holding a raw spinlock. Because the memory allocator can
     acquire non-raw spinlocks, this can result in lockdep splats.

     This commit therefore uses the same SRCU_SIZE_ALLOC trick that is used
     when the first run-time update-side use of this srcu_struct structure
     happens before srcu_init() is called. The actual allocation then takes
     place from workqueue context at the ends of upcoming SRCU grace 
periods.

     [boqun: Adjust the sha1 of the Fixes tag]

     Fixes: 175b45ed343a ("srcu: Use raw spinlocks so call_srcu() can be 
used under preempt_disable()")
     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
     Signed-off-by: Boqun Feng <boqun@kernel.org>

  kernel/rcu/srcutree.c | 7 +++++--
  1 file changed, 5 insertions(+), 2 deletions(-)

Reverting this commit resolves the issue.

The problem appears to be that work is being queued to the per-CPU 
worker pools of offline CPUs. The commit moves SRCU node allocation 
into workqueue context to avoid lockdep issues with memory allocation 
under raw spinlocks, which makes sense. However, the workqueue 
scheduling in this new path doesn't seem to account for CPU 
online/offline state.

My test environment:
- Architecture: PowerPC
- Kernel version: Latest upstream (7.1-rc1)
- CPUs 81-96 are offline at boot time

I suspect the issue might be related to:
1. Workqueue not checking CPU online status before scheduling SRCU 
allocation work
2. Missing CPU hotplug awareness in the new workqueue-based allocation path
3. Possible race condition with CPU hotplug events

Would it make sense to use queue_work_on() with explicit online CPU 
selection, or add CPU hotplug handlers for this workqueue? I'm not 
deeply familiar with the workqueue internals, so I might be missing 
something.
Please let me know if you need any additional details or if you'd like 
me to test any patches.

If this issue gets fixed, please add the tag below:
Reported-by: Samir M <samir@linux.ibm.com>


Thanks,
Samir

