* [REGRESSION] 6.12: Workqueue lockups in inode_switch_wbs_work_fn (suspect commit 66c14dccd810)
@ 2026-01-12 11:18 Matt Fleming
2026-01-12 17:04 ` Jan Kara
0 siblings, 1 reply; 4+ messages in thread
From: Matt Fleming @ 2026-01-12 11:18 UTC (permalink / raw)
To: Jan Kara
Cc: cgroups, linux-kernel, Tejun Heo, Christian Brauner,
linux-fsdevel, kernel-team
Hi Jan, it's me again :)
I’m writing to report a regression we are observing in our production
environment running kernel 6.12. We are seeing severe workqueue lockups that
appear to be triggered by high-volume cgroup destruction. We have isolated the
issue to 66c14dccd810 ("writeback: Avoid softlockup when switching many
inodes").
We're seeing stalled tasks in the inode_switch_wbs workqueue. The worker
appears to be CPU-bound within inode_switch_wbs_work_fn, leading to RCU stalls
and eventual system lockups.
Here is a representative trace from a stalled CPU-bound worker pool:
[1437023.584832][ C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
[1437023.733923][ C0] pool 358:
[1437023.733924][ C0] task:kworker/89:0 state:R running task stack:0 pid:3136989 tgid:3136989 ppid:2 task_flags:0x4208060 flags:0x00004000
[1437023.733929][ C0] Workqueue: inode_switch_wbs inode_switch_wbs_work_fn
[1437023.733933][ C0] Call Trace:
[1437023.733934][ C0] <TASK>
[1437023.733937][ C0] __schedule+0x4fb/0xbf0
[1437023.733942][ C0] __cond_resched+0x33/0x60
[1437023.733944][ C0] inode_switch_wbs_work_fn+0x481/0x710
[1437023.733948][ C0] process_one_work+0x17b/0x330
[1437023.733950][ C0] worker_thread+0x2ce/0x3f0
Our environment makes heavy use of cgroup-based services. When these services
-- specifically our caching layer -- are shut down, they can trigger the
offlining of a massive number of inodes (approx. 200k-250k+ inodes per service).
We have verified that reverting 66c14dccd810 completely eliminates these
lockups in our production environment.
I am currently working on creating a synthetic reproduction case in the lab to
replicate the inode/cgroup density required to trigger this on demand. In the
meantime, I wanted to share these findings to see if you have any insights.
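Roughly, the synthetic reproducer I have in mind looks like this (an untested sketch: the cgroup mount point, test filesystem, and file count are placeholders, and it needs root with cgroup v2):

```shell
# Untested sketch: associate many inodes with a dedicated cgroup, then
# destroy the cgroup so every inode must switch writeback contexts,
# queueing inode_switch_wbs work items.
CG=/sys/fs/cgroup/wbtest   # placeholder cgroup v2 path
TESTDIR=/mnt/test          # placeholder test filesystem
NFILES=200000              # roughly the density we see per service

if [ "$(id -u)" -eq 0 ] && [ -f /sys/fs/cgroup/cgroup.procs ] && [ -d "$TESTDIR" ]; then
    mkdir -p "$CG"
    echo $$ > "$CG/cgroup.procs"
    # Dirty a large number of inodes while attached to the test cgroup
    # so they become associated with its writeback domain.
    for i in $(seq 1 "$NFILES"); do
        echo data > "$TESTDIR/f$i"
    done
    sync
    # Move back to the root cgroup and remove the test cgroup; the
    # kernel must now switch every associated inode to another wb.
    echo $$ > /sys/fs/cgroup/cgroup.procs
    rmdir "$CG"
else
    echo "skipping: needs root, cgroup v2, and a test filesystem at $TESTDIR"
fi
```
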
Thanks,
Matt
* Re: [REGRESSION] 6.12: Workqueue lockups in inode_switch_wbs_work_fn (suspect commit 66c14dccd810)
@ 2026-01-12 17:04 Jan Kara
2026-01-13 11:46 ` Matt Fleming
0 siblings, 1 reply; 4+ messages in thread
From: Jan Kara @ 2026-01-12 17:04 UTC (permalink / raw)
To: Matt Fleming
Cc: Jan Kara, cgroups, linux-kernel, Tejun Heo, Christian Brauner,
linux-fsdevel, kernel-team

Hi Matt!

On Mon 12-01-26 11:18:04, Matt Fleming wrote:
> I’m writing to report a regression we are observing in our production
> environment running kernel 6.12. We are seeing severe workqueue lockups
> that appear to be triggered by high-volume cgroup destruction. We have
> isolated the issue to 66c14dccd810 ("writeback: Avoid softlockup when
> switching many inodes").
>
> We're seeing stalled tasks in the inode_switch_wbs workqueue. The worker
> appears to be CPU-bound within inode_switch_wbs_work_fn, leading to RCU
> stalls and eventual system lockups.

I agree we are CPU bound in inode_switch_wbs_work_fn() but I don't think
we are really hogging the CPU. The backtrace below indicates the worker
just got rescheduled in cond_resched() to give other tasks a chance to
run. Is the machine dying completely, or does it eventually finish the
cgroup teardown?
> Here is a representative trace from a stalled CPU-bound worker pool:
>
> [1437023.584832][ C0] Showing backtraces of running workers in stalled CPU-bound worker pools:
> [1437023.733923][ C0] pool 358:
> [1437023.733924][ C0] task:kworker/89:0 state:R running task stack:0 pid:3136989 tgid:3136989 ppid:2 task_flags:0x4208060 flags:0x00004000
> [1437023.733929][ C0] Workqueue: inode_switch_wbs inode_switch_wbs_work_fn
> [1437023.733933][ C0] Call Trace:
> [1437023.733934][ C0]  <TASK>
> [1437023.733937][ C0]  __schedule+0x4fb/0xbf0
> [1437023.733942][ C0]  __cond_resched+0x33/0x60
> [1437023.733944][ C0]  inode_switch_wbs_work_fn+0x481/0x710
> [1437023.733948][ C0]  process_one_work+0x17b/0x330
> [1437023.733950][ C0]  worker_thread+0x2ce/0x3f0
>
> Our environment makes heavy use of cgroup-based services. When these
> services -- specifically our caching layer -- are shut down, they can
> trigger the offlining of a massive number of inodes (approx. 200k-250k+
> inodes per service).

Well, these changes were introduced because some services are switching
over 1m inodes on their exit and they were softlocking up the machine :).
So there's some commonality, just something in that setup behaves
differently from yours. Are the inodes clean, dirty, or only with dirty
timestamps? Also, since you mention a 6.12 kernel but this series was
only merged in 6.18, do you carry the full series ending with merge
commit 9426414f0d42f?

> We have verified that reverting 66c14dccd810 completely eliminates these
> lockups in our production environment.
>
> I am currently working on creating a synthetic reproduction case in the
> lab to replicate the inode/cgroup density required to trigger this on
> demand. In the meantime, I wanted to share these findings to see if you
> have any insights.

Yes, having the reproducer would certainly simplify debugging what
exactly is going on that your system is locking up.
Because I was able to tear down a cgroup switching millions of inodes in
a couple of seconds without any issue in my testing...

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [REGRESSION] 6.12: Workqueue lockups in inode_switch_wbs_work_fn (suspect commit 66c14dccd810)
@ 2026-01-13 11:46 Matt Fleming
2026-01-13 12:02 ` Jan Kara
0 siblings, 1 reply; 4+ messages in thread
From: Matt Fleming @ 2026-01-13 11:46 UTC (permalink / raw)
To: Jan Kara
Cc: cgroups, linux-kernel, Tejun Heo, Christian Brauner,
linux-fsdevel, kernel-team

On Mon, Jan 12, 2026 at 06:04:50PM +0100, Jan Kara wrote:
>
> I agree we are CPU bound in inode_switch_wbs_work_fn() but I don't think we
> are really hogging the CPU. The backtrace below indicates the worker just
> got rescheduled in cond_resched() to give other tasks a chance to run. Is
> the machine dying completely or does it eventually finish the cgroup
> teardown?

Yeah, you're right: the CPU isn't hogged, but the interaction with the
workqueue subsystem leads to the machine choking. I've seen 150+
instances of inode_switch_wbs_work_fn() queued up in the workqueue
subsystem:

[1437017.446174][ C0]   in-flight: 3139338:inode_switch_wbs_work_fn, 2420392:inode_switch_wbs_work_fn, 2914179:inode_switch_wbs_work_fn
[1437017.446181][ C0]   pending: 11*inode_switch_wbs_work_fn
[1437017.446185][ C0] pwq 6: cpus=1 node=0 flags=0x2 nice=0 active=23 refcnt=24
[1437017.446186][ C0]   in-flight: 2723771:inode_switch_wbs_work_fn, 1710617:inode_switch_wbs_work_fn, 3228683:inode_switch_wbs_work_fn, 3149692:inode_switch_wbs_work_fn, 3224195:inode_switch_wbs_work_fn
[1437017.446193][ C0]   pending: 18*inode_switch_wbs_work_fn
[1437017.446195][ C0] pwq 10: cpus=2 node=0 flags=0x2 nice=0 active=17 refcnt=18
[1437017.446196][ C0]   in-flight: 3224135:inode_switch_wbs_work_fn, 3193118:inode_switch_wbs_work_fn, 3224106:inode_switch_wbs_work_fn, 3228725:inode_switch_wbs_work_fn, 3087195:inode_switch_wbs_work_fn, 1853835:inode_switch_wbs_work_fn
[1437017.446204][ C0]   pending: 11*inode_switch_wbs_work_fn

It sometimes finishes the cgroup teardown and sometimes hard locks up.
When workqueue items aren't completing things get really bad :)

> Well, these changes were introduced because some services are switching
> over 1m inodes on their exit and they were softlocking up the machine :).
> So there's some commonality, just something in that setup behaves
> differently from your setup. Are the inodes clean, dirty, or only with
> dirty timestamps?

Good question. I don't know but I'll get back to you.

> Also since you mention 6.12 kernel but this series was
> only merged in 6.18, do you carry full series ending with merge commit
> 9426414f0d42f?

We always run the latest 6.12 LTS release and it looks like only these
two commits got backported:

9a6ebbdbd412 ("writeback: Avoid excessively long inode switching times")
66c14dccd810 ("writeback: Avoid softlockup when switching many inodes")
* Re: [REGRESSION] 6.12: Workqueue lockups in inode_switch_wbs_work_fn (suspect commit 66c14dccd810)
@ 2026-01-13 12:02 Jan Kara
0 siblings, 0 replies; 4+ messages in thread
From: Jan Kara @ 2026-01-13 12:02 UTC (permalink / raw)
To: Matt Fleming
Cc: Jan Kara, cgroups, linux-kernel, Tejun Heo, Christian Brauner,
linux-fsdevel, kernel-team

On Tue 13-01-26 11:46:35, Matt Fleming wrote:
> On Mon, Jan 12, 2026 at 06:04:50PM +0100, Jan Kara wrote:
> >
> > I agree we are CPU bound in inode_switch_wbs_work_fn() but I don't think we
> > are really hogging the CPU. The backtrace below indicates the worker just
> > got rescheduled in cond_resched() to give other tasks a chance to run. Is
> > the machine dying completely or does it eventually finish the cgroup
> > teardown?
>
> Yeah, you're right: the CPU isn't hogged, but the interaction with the
> workqueue subsystem leads to the machine choking. I've seen 150+
> instances of inode_switch_wbs_work_fn() queued up in the workqueue
> subsystem:
>
> [workqueue state dump quoted in full in the previous message]
>
> It sometimes finishes the cgroup teardown and sometimes hard locks up.
> When workqueue items aren't completing things get really bad :)
>
> > Well, these changes were introduced because some services are switching
> > over 1m inodes on their exit and they were softlocking up the machine :).
> > So there's some commonality, just something in that setup behaves
> > differently from your setup. Are the inodes clean, dirty, or only with
> > dirty timestamps?
>
> Good question. I don't know but I'll get back to you.
>
> > Also since you mention 6.12 kernel but this series was
> > only merged in 6.18, do you carry full series ending with merge commit
> > 9426414f0d42f?
>
> We always run the latest 6.12 LTS release and it looks like only these
> two commits got backported:
>
> 9a6ebbdbd412 ("writeback: Avoid excessively long inode switching times")
> 66c14dccd810 ("writeback: Avoid softlockup when switching many inodes")

Ah, OK. Then you're missing e1b849cfa6b61f ("writeback: Avoid contention
on wb->list_lock when switching inodes"), which might explain why my
system behaves differently from yours: that commit *heavily* reduces
contention on wb->list_lock when switching inodes, and it also avoids
hogging multiple workers with switching works when only one of them can
proceed at a time (the others just spin on the list_lock). So I'd
suggest you backport that commit and try whether it fixes your issues.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
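For anyone carrying the partial backport, picking up the missing commit on top of a 6.12 stable checkout could look roughly like this (untested sketch; the repository path and branch name are placeholders, and the checkout is assumed to already be at the LTS release being run):

```shell
# Untested sketch: apply the missing commit to a 6.12 stable checkout.
# LINUX and the branch name are placeholders; e1b849cfa6b61f is the
# commit id named in the thread.
LINUX=${LINUX:-$HOME/src/linux}

if git -C "$LINUX" rev-parse --verify "e1b849cfa6b61f^{commit}" >/dev/null 2>&1; then
    # Branch off the current HEAD (assumed to be the deployed LTS tag).
    git -C "$LINUX" checkout -b 6.12-wb-list-lock
    # -x appends "(cherry picked from commit ...)" to the changelog,
    # which stable maintainers expect in backports.
    git -C "$LINUX" cherry-pick -x e1b849cfa6b61f
else
    echo "skipping: commit not found in checkout at $LINUX"
fi
```
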
end of thread, other threads:[~2026-01-13 12:02 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-12 11:18 [REGRESSION] 6.12: Workqueue lockups in inode_switch_wbs_work_fn (suspect commit 66c14dccd810) Matt Fleming
2026-01-12 17:04 ` Jan Kara
2026-01-13 11:46 ` Matt Fleming
2026-01-13 12:02 ` Jan Kara