* [PATCH] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
From: Baokun Li @ 2026-05-13 9:48 UTC (permalink / raw)
To: linux-fsdevel; +Cc: viro, brauner, jack, tj, linux-kernel, Baokun Li
When a container exits, the following BUG_ON() is occasionally triggered:
==================================================================
VFS: Busy inodes after unmount of sdb (ext4)
------------[ cut here ]------------
kernel BUG at fs/super.c:695!
CPU: 3 PID: 6 Comm: containerd-shim Tainted: G OE K 6.6 #1
pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : generic_shutdown_super+0xf0/0x100
lr : generic_shutdown_super+0xf0/0x100
Call trace:
generic_shutdown_super+0xf0/0x100
kill_block_super+0x20/0x48
ext4_kill_sb+0x28/0x60
deactivate_locked_super+0x54/0x130
deactivate_super+0x84/0xa0
cleanup_mnt+0xa4/0x140
__cleanup_mnt+0x18/0x28
task_work_run+0x78/0xe0
do_notify_resume+0x204/0x240
==================================================================
The root cause is a race between cgroup_writeback_umount() and
inode_switch_wbs()/cleanup_offline_cgwb(). There is a window between
inode_prepare_wbs_switch() returning true and the subsequent
wb_queue_isw() call. The following sequence triggers the issue:
CPU A (umount)                      | CPU B (writeback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                    | inode_switch_wbs/cleanup_offline_cgwb
                                    |   atomic_inc(&isw_nr_in_flight)
                                    |   inode_prepare_wbs_switch
                                    |     -> passes SB_ACTIVE check
                                    |     __iget(inode)
generic_shutdown_super              |
  sb->s_flags &= ~SB_ACTIVE         |
  cgroup_writeback_umount(sb)       |
    smp_mb()                        |
    atomic_read(&isw_nr_in_flight)  |
    rcu_barrier()                   |
      -> no pending RCU callbacks   |
    flush_workqueue(isw_wq)         |
      -> nothing queued, returns    |
  evict_inodes(sb)                  |
    -> inode skipped; isw holds ref |
  sop->put_super(sb)                |
    /* destroys percpu counters */  |
  -> VFS: Busy inodes after unmount!|
                                    | wb_queue_isw()
                                    |   queue_work(isw_wq, ...)
                                    | /* later, in the work function */
                                    | inode_switch_wbs_work_fn
                                    |   process_inode_switch_wbs
                                    |     iput() -> evict
                                    |       percpu_counter_dec() // UAF!
A simple way to synchronize the two is to have cgroup_writeback_umount()
wait for the number of in-flight switches on the current super_block to
drop to zero; no new increments can occur once SB_ACTIVE is cleared.
However, isw_nr_in_flight is global, so introduce a new per-sb counter,
s_isw_nr_in_flight, for this purpose, and retain the global counter for
throttling.
Fixes: a1a0e23e4903 ("writeback: flush inode cgroup wb switches instead of pinning super_block")
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/fs-writeback.c | 44 +++++++++++++++++++---------------
include/linux/fs/super_types.h | 3 +++
2 files changed, 28 insertions(+), 19 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 71b7fd036e97..7924cf3484fa 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -556,8 +556,12 @@ static void process_inode_switch_wbs(struct bdi_writeback *new_wb,
 		wb_put_many(old_wb, nr_switched);
 	}
 
-	for (inodep = isw->inodes; *inodep; inodep++)
+	for (inodep = isw->inodes; *inodep; inodep++) {
+		struct super_block *sb = (*inodep)->i_sb;
+
 		iput(*inodep);
+		atomic_dec(&sb->s_isw_nr_in_flight);
+	}
 	wb_put(new_wb);
 	kfree(isw);
 	atomic_dec(&isw_nr_in_flight);
@@ -602,14 +606,15 @@ static bool inode_prepare_wbs_switch(struct inode *inode,
 {
 	/*
 	 * Paired with smp_mb() in cgroup_writeback_umount().
-	 * isw_nr_in_flight must be increased before checking SB_ACTIVE and
-	 * grabbing an inode, otherwise isw_nr_in_flight can be observed as 0
-	 * in cgroup_writeback_umount() and the isw_wq will be not flushed.
+	 * s_isw_nr_in_flight must be increased before checking SB_ACTIVE and
+	 * grabbing an inode, otherwise s_isw_nr_in_flight can be observed as
+	 * 0 in cgroup_writeback_umount() and the isw_wq will not be flushed.
 	 */
+	atomic_inc(&inode->i_sb->s_isw_nr_in_flight);
 	smp_mb();
 
 	if (IS_DAX(inode))
-		return false;
+		goto out_dec;
 
 	/* while holding I_WB_SWITCH, no one else can update the association */
 	spin_lock(&inode->i_lock);
@@ -617,13 +622,17 @@ static bool inode_prepare_wbs_switch(struct inode *inode,
 	    inode_state_read(inode) & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) ||
 	    inode_to_wb(inode) == new_wb) {
 		spin_unlock(&inode->i_lock);
-		return false;
+		goto out_dec;
 	}
 
 	inode_state_set(inode, I_WB_SWITCH);
 	__iget(inode);
 	spin_unlock(&inode->i_lock);
 
 	return true;
+
+out_dec:
+	atomic_dec(&inode->i_sb->s_isw_nr_in_flight);
+	return false;
 }
 
 static void wb_queue_isw(struct bdi_writeback *wb,
@@ -1212,31 +1221,28 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id,
  * @sb: target super_block
  *
  * This function is called when a super_block is about to be destroyed and
- * flushes in-flight inode wb switches. An inode wb switch goes through
- * RCU and then workqueue, so the two need to be flushed in order to ensure
- * that all previously scheduled switches are finished. As wb switches are
- * rare occurrences and synchronize_rcu() can take a while, perform
- * flushing iff wb switches are in flight.
+ * waits for all in-flight inode wb switches on this sb to complete. Since
+ * SB_ACTIVE is cleared before this is called, no new switches can start,
+ * so the per-sb counter will monotonically decrease to zero.
  */
 void cgroup_writeback_umount(struct super_block *sb)
 {
-
 	if (!(sb->s_bdi->capabilities & BDI_CAP_WRITEBACK))
 		return;
 
 	/*
 	 * SB_ACTIVE should be reliably cleared before checking
-	 * isw_nr_in_flight, see generic_shutdown_super().
+	 * s_isw_nr_in_flight, see generic_shutdown_super().
+	 *
+	 * Paired with smp_mb() in inode_prepare_wbs_switch(), which ensures
+	 * that either we see the incremented counter, or the switcher sees
+	 * SB_ACTIVE cleared and aborts.
 	 */
 	smp_mb();
-	if (atomic_read(&isw_nr_in_flight)) {
-		/*
-		 * Use rcu_barrier() to wait for all pending callbacks to
-		 * ensure that all in-flight wb switches are in the workqueue.
-		 */
-		rcu_barrier();
+	while (atomic_read(&sb->s_isw_nr_in_flight)) {
 		flush_workqueue(isw_wq);
+		cond_resched();
 	}
 }
diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
index 383050e7fdf5..8030c873016d 100644
--- a/include/linux/fs/super_types.h
+++ b/include/linux/fs/super_types.h
@@ -274,6 +274,9 @@ struct super_block {
 
 	/* number of fserrors that are being sent to fsnotify/filesystems */
 	refcount_t s_pending_errors;
+
+	/* number of in-flight inode writeback switches on this sb */
+	atomic_t s_isw_nr_in_flight;
 } __randomize_layout;
 
 /*
--
2.43.7
* Re: [PATCH] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
From: Tejun Heo @ 2026-05-13 20:36 UTC
To: Baokun Li; +Cc: linux-fsdevel, viro, brauner, jack, linux-kernel
Hello,
Resending - earlier send dropped the Cc list. Sorry for the noise.
How rcu_barrier() got out of sync, as best I can reconstruct:
- ec084de929e4 ("fs/writeback.c: use rcu_barrier() to wait for inflight
wb switches going into workqueue when umount", 2019) put the inc
after call_rcu(); rcu_barrier() worked from then.
- 8826ee4fe750 ("writeback, cgroup: increment isw_nr_in_flight before
grabbing an inode", 2021) moved the inc back ahead to cover the prep
window, apparently reopening this gap.
- e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when
switching inodes", 2025) replaced call_rcu() with llist_add() +
queue_work(); rcu_barrier() looks like a no-op for this path since.
Could SRCU work instead? srcu_read_lock around the publish (atomic_inc
through wb_queue_isw), with cgroup_writeback_umount() keeping the
counter gate but swapping rcu_barrier() for synchronize_srcu():
	if (atomic_read(&isw_nr_in_flight)) {
		synchronize_srcu(&isw_srcu);
		flush_workqueue(isw_wq);
	}
Thoughts?
Thanks.
--
tejun
* Re: [PATCH] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
From: Baokun Li @ 2026-05-14 2:55 UTC
To: Tejun Heo; +Cc: linux-fsdevel, viro, brauner, jack, linux-kernel, libaokun
On 2026/5/14 04:36, Tejun Heo wrote:
> Hello,
>
> Resending - earlier send dropped the Cc list. Sorry for the noise.
>
> How rcu_barrier() got out of sync, as best I can reconstruct:
>
> - ec084de929e4 ("fs/writeback.c: use rcu_barrier() to wait for inflight
> wb switches going into workqueue when umount", 2019) put the inc
> after call_rcu(); rcu_barrier() worked from then.
>
> - 8826ee4fe750 ("writeback, cgroup: increment isw_nr_in_flight before
> grabbing an inode", 2021) moved the inc back ahead to cover the prep
> window, apparently reopening this gap.
>
> - e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when
> switching inodes", 2025) replaced call_rcu() with llist_add() +
> queue_work(); rcu_barrier() looks like a no-op for this path since.
>
> Could SRCU work instead? srcu_read_lock around the publish (atomic_inc
> through wb_queue_isw), with cgroup_writeback_umount() keeping the
> counter gate but swapping rcu_barrier() for synchronize_srcu():
>
> if (atomic_read(&isw_nr_in_flight)) {
> synchronize_srcu(&isw_srcu);
> flush_workqueue(isw_wq);
> }
>
> Thoughts?
Thanks for the detailed analysis on how rcu_barrier() got out of sync,
that matches my understanding as well.
Regarding the SRCU idea: I considered it, but it has a key drawback.
synchronize_srcu() waits for all read-side critical sections globally
-- it cannot distinguish which superblock a given switcher is working
on. So if sb A is being unmounted while unrelated switchers for sb B/C/D
hold srcu_read_lock(), umount of A gets blocked unnecessarily. The
global isw_nr_in_flight gate makes this worse: any non-zero count from
any sb triggers synchronize_srcu(), even when the target sb has no
in-flight switches at all.
This is especially problematic in high-density container environments,
where many containers with separate filesystems are being created and
destroyed concurrently. Frequent cgroup migrations across multiple
superblocks keep the global isw_nr_in_flight perpetually non-zero,
causing every single umount to pay the synchronize_srcu() cost even
when the target sb has zero in-flight switches.
The per-sb counter avoids this entirely -- cgroup_writeback_umount()
only waits for switches belonging to its own superblock to drain, and
returns immediately when s_isw_nr_in_flight is zero. The global counter
is retained solely for throttling (WB_FRN_MAX_IN_FLIGHT).
The other trade-offs are roughly comparable: both need pairing on all
paths, but the per-sb atomic_t gets zero-initialized by kzalloc for
free, while SRCU needs init/cleanup lifecycle management. The per-cpu
read lock advantage doesn't matter here since wb switching is
infrequent.
So I went with the per-sb counter for its precision and simplicity.
That said, if you prefer the SRCU approach, I'm happy to spin a new
version using it.
Cheers,
Baokun
* Re: [PATCH] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
From: Jan Kara @ 2026-05-14 13:09 UTC
To: Baokun Li; +Cc: Tejun Heo, linux-fsdevel, viro, brauner, jack, linux-kernel
On Thu 14-05-26 10:55:03, Baokun Li wrote:
> On 2026/5/14 04:36, Tejun Heo wrote:
> > Hello,
> >
> > Resending - earlier send dropped the Cc list. Sorry for the noise.
> >
> > How rcu_barrier() got out of sync, as best I can reconstruct:
> >
> > - ec084de929e4 ("fs/writeback.c: use rcu_barrier() to wait for inflight
> > wb switches going into workqueue when umount", 2019) put the inc
> > after call_rcu(); rcu_barrier() worked from then.
> >
> > - 8826ee4fe750 ("writeback, cgroup: increment isw_nr_in_flight before
> > grabbing an inode", 2021) moved the inc back ahead to cover the prep
> > window, apparently reopening this gap.
> >
> > - e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when
> > switching inodes", 2025) replaced call_rcu() with llist_add() +
> > queue_work(); rcu_barrier() looks like a no-op for this path since.
> >
> > Could SRCU work instead? srcu_read_lock around the publish (atomic_inc
> > through wb_queue_isw), with cgroup_writeback_umount() keeping the
> > counter gate but swapping rcu_barrier() for synchronize_srcu():
> >
> > if (atomic_read(&isw_nr_in_flight)) {
> > synchronize_srcu(&isw_srcu);
> > flush_workqueue(isw_wq);
> > }
> >
> > Thoughts?
>
> Thanks for the detailed analysis on how rcu_barrier() got out of sync,
> that matches my understanding as well.
>
> Regarding the SRCU idea: I considered it, but it has a key drawback.
> synchronize_srcu() waits for all read-side critical sections globally
> -- it cannot distinguish which superblock a given switcher is working
> on. So if sb A is being unmounted while unrelated switchers for sb B/C/D
> hold srcu_read_lock(), umount of A gets blocked unnecessarily. The
> global isw_nr_in_flight gate makes this worse: any non-zero count from
> any sb triggers synchronize_srcu(), even when the target sb has no
> in-flight switches at all.
>
> This is especially problematic in high-density container environments,
> where many containers with separate filesystems are being created and
> destroyed concurrently. Frequent cgroup migrations across multiple
> superblocks keep the global isw_nr_in_flight perpetually non-zero,
> causing every single umount to pay the synchronize_srcu() cost even
> when the target sb has zero in-flight switches.
>
> The per-sb counter avoids this entirely -- cgroup_writeback_umount()
> only waits for switches belonging to its own superblock to drain, and
> returns immediately when s_isw_nr_in_flight is zero. The global counter
> is retained solely for throttling (WB_FRN_MAX_IN_FLIGHT).
>
> The other trade-offs are roughly comparable: both need pairing on all
> paths, but the per-sb atomic_t gets zero-initialized by kzalloc for
> free, while SRCU needs init/cleanup lifecycle management. The per-cpu
> read lock advantage doesn't matter here since wb switching is
> infrequent.
>
> So I went with the per-sb counter for its precision and simplicity.
> That said, if you prefer the SRCU approach, I'm happy to spin a new
> version using it.
So I don't think we need a new SRCU. What really needs protection is this
snippet as you correctly identified in your changelog:
	if (!inode_prepare_wbs_switch(inode, new_wb))
		goto out_free;

	isw->inodes[0] = inode;

	trace_inode_switch_wbs_queue(inode->i_wb, new_wb, 1);
	wb_queue_isw(new_wb, isw);
So the window is between __iget() of the inode and the queueing of the
work. The rest gets solved
either by SB_ACTIVE check in inode_prepare_wbs_switch() or
flush_workqueue() in cgroup_writeback_umount(). Since there's no sleep in
this part, we can just wrap it within rcu_read_lock() (with appropriate
comment what's protected) and that fixes the race. We could even downgrade
rcu_barrier() in cgroup_writeback_umount() to synchronize_rcu() and it
would still work. So I'd prefer to fix the race like this (also to ease
backporting to older kernels).
Now the performance optimization to make isw_nr_in_flight per
superblock makes sense and I'm open to it as well. But it should be a
separate commit with proper justification and ideally some numbers backing
it showing the benefit.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR