* [PATCH] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
From: Baokun Li @ 2026-05-13 9:48 UTC (permalink / raw)
To: linux-fsdevel; +Cc: viro, brauner, jack, tj, linux-kernel, Baokun Li
When a container exits, the following BUG_ON() is occasionally triggered:
==================================================================
VFS: Busy inodes after unmount of sdb (ext4)
------------[ cut here ]------------
kernel BUG at fs/super.c:695!
CPU: 3 PID: 6 Comm: containerd-shim Tainted: G OE K 6.6 #1
pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : generic_shutdown_super+0xf0/0x100
lr : generic_shutdown_super+0xf0/0x100
Call trace:
generic_shutdown_super+0xf0/0x100
kill_block_super+0x20/0x48
ext4_kill_sb+0x28/0x60
deactivate_locked_super+0x54/0x130
deactivate_super+0x84/0xa0
cleanup_mnt+0xa4/0x140
__cleanup_mnt+0x18/0x28
task_work_run+0x78/0xe0
do_notify_resume+0x204/0x240
==================================================================
The root cause is a race between cgroup_writeback_umount() and
inode_switch_wbs()/cleanup_offline_cgwb(). There is a window between
inode_prepare_wbs_switch() returning true and the subsequent
wb_queue_isw() call. The following sequence triggers the issue:
CPU A (umount)                      | CPU B (writeback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                    | inode_switch_wbs/cleanup_offline_cgwb
                                    |   atomic_inc(&isw_nr_in_flight)
                                    |   inode_prepare_wbs_switch
                                    |     -> passes SB_ACTIVE check
                                    |     __iget(inode)
generic_shutdown_super              |
  sb->s_flags &= ~SB_ACTIVE         |
  cgroup_writeback_umount(sb)       |
    smp_mb()                        |
    atomic_read(&isw_nr_in_flight)  |
    rcu_barrier()                   |
      -> no pending RCU callbacks   |
    flush_workqueue(isw_wq)         |
      -> nothing queued, returns    |
  evict_inodes(sb)                  |
    -> inode skipped; isw holds ref |
  sop->put_super(sb)                |
    /* destroys percpu counters */  |
  -> VFS: Busy inodes after unmount!|
                                    | wb_queue_isw()
                                    |   queue_work(isw_wq, ...)
                                    | /* later, in the work function */
                                    | inode_switch_wbs_work_fn
                                    |   process_inode_switch_wbs
                                    |     iput() -> evict
                                    |       percpu_counter_dec() // UAF!
A simple way to synchronize the two is to have cgroup_writeback_umount()
wait for the number of in-flight switches on the current super_block to
drop to zero; no new increments can occur once SB_ACTIVE is cleared.
However, isw_nr_in_flight is global, so introduce a new per-sb counter,
s_isw_nr_in_flight, for this purpose, and retain the global counter for
throttling.
Fixes: a1a0e23e4903 ("writeback: flush inode cgroup wb switches instead of pinning super_block")
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
---
fs/fs-writeback.c | 44 +++++++++++++++++++---------------
include/linux/fs/super_types.h | 3 +++
2 files changed, 28 insertions(+), 19 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 71b7fd036e97..7924cf3484fa 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -556,8 +556,12 @@ static void process_inode_switch_wbs(struct bdi_writeback *new_wb,
 		wb_put_many(old_wb, nr_switched);
 	}
 
-	for (inodep = isw->inodes; *inodep; inodep++)
+	for (inodep = isw->inodes; *inodep; inodep++) {
+		struct super_block *sb = (*inodep)->i_sb;
+
 		iput(*inodep);
+		atomic_dec(&sb->s_isw_nr_in_flight);
+	}
 	wb_put(new_wb);
 	kfree(isw);
 	atomic_dec(&isw_nr_in_flight);
@@ -602,14 +606,15 @@ static bool inode_prepare_wbs_switch(struct inode *inode,
 {
 	/*
 	 * Paired with smp_mb() in cgroup_writeback_umount().
-	 * isw_nr_in_flight must be increased before checking SB_ACTIVE and
-	 * grabbing an inode, otherwise isw_nr_in_flight can be observed as 0
-	 * in cgroup_writeback_umount() and the isw_wq will be not flushed.
+	 * s_isw_nr_in_flight must be increased before checking SB_ACTIVE and
+	 * grabbing an inode, otherwise s_isw_nr_in_flight can be observed as
+	 * 0 in cgroup_writeback_umount() and the isw_wq will not be flushed.
 	 */
+	atomic_inc(&inode->i_sb->s_isw_nr_in_flight);
 	smp_mb();
 
 	if (IS_DAX(inode))
-		return false;
+		goto out_dec;
 
 	/* while holding I_WB_SWITCH, no one else can update the association */
 	spin_lock(&inode->i_lock);
@@ -617,13 +622,17 @@ static bool inode_prepare_wbs_switch(struct inode *inode,
 	    inode_state_read(inode) & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) ||
 	    inode_to_wb(inode) == new_wb) {
 		spin_unlock(&inode->i_lock);
-		return false;
+		goto out_dec;
 	}
 
 	inode_state_set(inode, I_WB_SWITCH);
 	__iget(inode);
 	spin_unlock(&inode->i_lock);
 
 	return true;
+
+out_dec:
+	atomic_dec(&inode->i_sb->s_isw_nr_in_flight);
+	return false;
 }
 
 static void wb_queue_isw(struct bdi_writeback *wb,
@@ -1212,31 +1221,28 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id,
  * @sb: target super_block
  *
  * This function is called when a super_block is about to be destroyed and
- * flushes in-flight inode wb switches. An inode wb switch goes through
- * RCU and then workqueue, so the two need to be flushed in order to ensure
- * that all previously scheduled switches are finished. As wb switches are
- * rare occurrences and synchronize_rcu() can take a while, perform
- * flushing iff wb switches are in flight.
+ * waits for all in-flight inode wb switches on this sb to complete. Since
+ * SB_ACTIVE is cleared before this is called, no new switches can start,
+ * so the per-sb counter will monotonically decrease to zero.
  */
 void cgroup_writeback_umount(struct super_block *sb)
 {
-
 	if (!(sb->s_bdi->capabilities & BDI_CAP_WRITEBACK))
 		return;
 
 	/*
 	 * SB_ACTIVE should be reliably cleared before checking
-	 * isw_nr_in_flight, see generic_shutdown_super().
+	 * s_isw_nr_in_flight, see generic_shutdown_super().
+	 *
+	 * Paired with smp_mb() in inode_prepare_wbs_switch(), which ensures
+	 * that either we see the incremented counter, or the switcher sees
+	 * SB_ACTIVE cleared and aborts.
 	 */
 	smp_mb();
-	if (atomic_read(&isw_nr_in_flight)) {
-		/*
-		 * Use rcu_barrier() to wait for all pending callbacks to
-		 * ensure that all in-flight wb switches are in the workqueue.
-		 */
-		rcu_barrier();
+	while (atomic_read(&sb->s_isw_nr_in_flight)) {
 		flush_workqueue(isw_wq);
+		cond_resched();
 	}
 }
diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
index 383050e7fdf5..8030c873016d 100644
--- a/include/linux/fs/super_types.h
+++ b/include/linux/fs/super_types.h
@@ -274,6 +274,9 @@ struct super_block {
 
 	/* number of fserrors that are being sent to fsnotify/filesystems */
 	refcount_t s_pending_errors;
+
+	/* number of in-flight inode writeback switches on this sb */
+	atomic_t s_isw_nr_in_flight;
 } __randomize_layout;
 
 /*
--
2.43.7
* Re: [PATCH] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
From: Tejun Heo @ 2026-05-13 20:36 UTC
To: Baokun Li; +Cc: linux-fsdevel, viro, brauner, jack, linux-kernel
Hello,
Resending - earlier send dropped the Cc list. Sorry for the noise.
How rcu_barrier() got out of sync, as best I can reconstruct:
- ec084de929e4 ("fs/writeback.c: use rcu_barrier() to wait for inflight
wb switches going into workqueue when umount", 2019) put the inc
after call_rcu(); rcu_barrier() worked from then.
- 8826ee4fe750 ("writeback, cgroup: increment isw_nr_in_flight before
grabbing an inode", 2021) moved the inc back ahead to cover the prep
window, apparently reopening this gap.
- e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when
switching inodes", 2025) replaced call_rcu() with llist_add() +
queue_work(); rcu_barrier() looks like a no-op for this path since.
Could SRCU work instead? srcu_read_lock around the publish (atomic_inc
through wb_queue_isw), with cgroup_writeback_umount() keeping the
counter gate but swapping rcu_barrier() for synchronize_srcu():
	if (atomic_read(&isw_nr_in_flight)) {
		synchronize_srcu(&isw_srcu);
		flush_workqueue(isw_wq);
	}
Thoughts?
Thanks.
--
tejun
* Re: [PATCH] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
From: Baokun Li @ 2026-05-14 2:55 UTC
To: Tejun Heo; +Cc: linux-fsdevel, viro, brauner, jack, linux-kernel, libaokun
On 2026/5/14 04:36, Tejun Heo wrote:
> Hello,
>
> Resending - earlier send dropped the Cc list. Sorry for the noise.
>
> How rcu_barrier() got out of sync, as best I can reconstruct:
>
> - ec084de929e4 ("fs/writeback.c: use rcu_barrier() to wait for inflight
> wb switches going into workqueue when umount", 2019) put the inc
> after call_rcu(); rcu_barrier() worked from then.
>
> - 8826ee4fe750 ("writeback, cgroup: increment isw_nr_in_flight before
> grabbing an inode", 2021) moved the inc back ahead to cover the prep
> window, apparently reopening this gap.
>
> - e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when
> switching inodes", 2025) replaced call_rcu() with llist_add() +
> queue_work(); rcu_barrier() looks like a no-op for this path since.
>
> Could SRCU work instead? srcu_read_lock around the publish (atomic_inc
> through wb_queue_isw), with cgroup_writeback_umount() keeping the
> counter gate but swapping rcu_barrier() for synchronize_srcu():
>
> if (atomic_read(&isw_nr_in_flight)) {
> synchronize_srcu(&isw_srcu);
> flush_workqueue(isw_wq);
> }
>
> Thoughts?
Thanks for the detailed analysis on how rcu_barrier() got out of sync,
that matches my understanding as well.
Regarding the SRCU idea: I considered it, but it has a key drawback.
synchronize_srcu() waits for all read-side critical sections globally
-- it cannot distinguish which superblock a given switcher is working
on. So if sb A is being unmounted while unrelated switchers for sb B/C/D
hold srcu_read_lock(), umount of A gets blocked unnecessarily. The
global isw_nr_in_flight gate makes this worse: any non-zero count from
any sb triggers synchronize_srcu(), even when the target sb has no
in-flight switches at all.
This is especially problematic in high-density container environments,
where many containers with separate filesystems are being created and
destroyed concurrently. Frequent cgroup migrations across multiple
superblocks keep the global isw_nr_in_flight perpetually non-zero,
causing every single umount to pay the synchronize_srcu() cost even
when the target sb has zero in-flight switches.
The per-sb counter avoids this entirely -- cgroup_writeback_umount()
only waits for switches belonging to its own superblock to drain, and
returns immediately when s_isw_nr_in_flight is zero. The global counter
is retained solely for throttling (WB_FRN_MAX_IN_FLIGHT).
The other trade-offs are roughly comparable: both need pairing on all
paths, but the per-sb atomic_t gets zero-initialized by kzalloc for
free, while SRCU needs init/cleanup lifecycle management. The per-cpu
read lock advantage doesn't matter here since wb switching is
infrequent.
So I went with the per-sb counter for its precision and simplicity.
That said, if you prefer the SRCU approach, I'm happy to spin a new
version using it.
Cheers,
Baokun
* Re: [PATCH] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
From: Jan Kara @ 2026-05-14 13:09 UTC
To: Baokun Li; +Cc: Tejun Heo, linux-fsdevel, viro, brauner, jack, linux-kernel
On Thu 14-05-26 10:55:03, Baokun Li wrote:
> On 2026/5/14 04:36, Tejun Heo wrote:
> > Hello,
> >
> > Resending - earlier send dropped the Cc list. Sorry for the noise.
> >
> > How rcu_barrier() got out of sync, as best I can reconstruct:
> >
> > - ec084de929e4 ("fs/writeback.c: use rcu_barrier() to wait for inflight
> > wb switches going into workqueue when umount", 2019) put the inc
> > after call_rcu(); rcu_barrier() worked from then.
> >
> > - 8826ee4fe750 ("writeback, cgroup: increment isw_nr_in_flight before
> > grabbing an inode", 2021) moved the inc back ahead to cover the prep
> > window, apparently reopening this gap.
> >
> > - e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when
> > switching inodes", 2025) replaced call_rcu() with llist_add() +
> > queue_work(); rcu_barrier() looks like a no-op for this path since.
> >
> > Could SRCU work instead? srcu_read_lock around the publish (atomic_inc
> > through wb_queue_isw), with cgroup_writeback_umount() keeping the
> > counter gate but swapping rcu_barrier() for synchronize_srcu():
> >
> > if (atomic_read(&isw_nr_in_flight)) {
> > synchronize_srcu(&isw_srcu);
> > flush_workqueue(isw_wq);
> > }
> >
> > Thoughts?
>
> Thanks for the detailed analysis on how rcu_barrier() got out of sync,
> that matches my understanding as well.
>
> Regarding the SRCU idea: I considered it, but it has a key drawback.
> synchronize_srcu() waits for all read-side critical sections globally
> -- it cannot distinguish which superblock a given switcher is working
> on. So if sb A is being unmounted while unrelated switchers for sb B/C/D
> hold srcu_read_lock(), umount of A gets blocked unnecessarily. The
> global isw_nr_in_flight gate makes this worse: any non-zero count from
> any sb triggers synchronize_srcu(), even when the target sb has no
> in-flight switches at all.
>
> This is especially problematic in high-density container environments,
> where many containers with separate filesystems are being created and
> destroyed concurrently. Frequent cgroup migrations across multiple
> superblocks keep the global isw_nr_in_flight perpetually non-zero,
> causing every single umount to pay the synchronize_srcu() cost even
> when the target sb has zero in-flight switches.
>
> The per-sb counter avoids this entirely -- cgroup_writeback_umount()
> only waits for switches belonging to its own superblock to drain, and
> returns immediately when s_isw_nr_in_flight is zero. The global counter
> is retained solely for throttling (WB_FRN_MAX_IN_FLIGHT).
>
> The other trade-offs are roughly comparable: both need pairing on all
> paths, but the per-sb atomic_t gets zero-initialized by kzalloc for
> free, while SRCU needs init/cleanup lifecycle management. The per-cpu
> read lock advantage doesn't matter here since wb switching is
> infrequent.
>
> So I went with the per-sb counter for its precision and simplicity.
> That said, if you prefer the SRCU approach, I'm happy to spin a new
> version using it.
So I don't think we need a new SRCU. What really needs protection is this
snippet as you correctly identified in your changelog:
	if (!inode_prepare_wbs_switch(inode, new_wb))
		goto out_free;

	isw->inodes[0] = inode;

	trace_inode_switch_wbs_queue(inode->i_wb, new_wb, 1);
	wb_queue_isw(new_wb, isw);
So the window is between __iget() of the inode and the queueing of the
work. The rest gets solved
either by SB_ACTIVE check in inode_prepare_wbs_switch() or
flush_workqueue() in cgroup_writeback_umount(). Since there's no sleep in
this part, we can just wrap it within rcu_read_lock() (with appropriate
comment what's protected) and that fixes the race. We could even downgrade
rcu_barrier() in cgroup_writeback_umount() to synchronize_rcu() and it
would still work. So I'd prefer to fix the race like this (also to ease
backporting to older kernels).
Now the performance optimization to make isw_nr_in_flight per
superblock makes sense and I'm open to it as well. But it should be a
separate commit with proper justification and ideally some numbers backing
it showing the benefit.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR