Linux filesystem development
From: Baokun Li <libaokun@linux.alibaba.com>
To: linux-fsdevel@vger.kernel.org
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	tj@kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2 0/3] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
Date: Sun, 17 May 2026 22:21:29 +0800
Message-ID: <20260517142147.3354909-1-libaokun@linux.alibaba.com>

Changes since v1:
 * Use a simple RCU-based fix (patch 1) that is easy to backport to
   older kernels; the per-sb refcount optimization is split out as a
   separate performance patch (patch 3). (Suggested by Jan Kara)

v1: https://patch.msgid.link/20260513094829.867648-1-libaokun@linux.alibaba.com

======

When a container exits, a race between cgroup_writeback_umount() and
inode_switch_wbs()/cleanup_offline_cgwb() can trigger "VFS: Busy inodes
after unmount" followed by a use-after-free on percpu counters.

There is a window between inode_prepare_wbs_switch() returning true
(having passed the SB_ACTIVE check and grabbed the inode) and the
subsequent wb_queue_isw() call.  If cgroup_writeback_umount() observes
the global isw_nr_in_flight counter as non-zero but flush_workqueue()
finds nothing queued, it returns early.  The switcher still holds its
inode reference, which blocks evict_inodes(), and the later iput() then
hits freed percpu counters.

Patch 1 fixes the race by extending the RCU read-side critical section
to cover the window from inode_prepare_wbs_switch() through
wb_queue_isw(), and adding synchronize_rcu() in the umount path so
that all in-flight switchers complete queueing before flush_workqueue()
runs.
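
For illustration only, the window and the effect of patch 1 can be
replayed in a single-threaded userspace model.  This is a sketch, not
kernel code: struct model, switcher_prepare(), switcher_queue(),
flush_queued(), umount_drain() and busy_inodes_after_umount() are all
hypothetical names, and synchronize_rcu() is approximated as "every
switcher inside the window finishes queueing".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model; none of these are kernel APIs. */
struct model {
	bool sb_active;    /* stands in for SB_ACTIVE */
	int nr_in_flight;  /* stands in for isw_nr_in_flight */
	int in_window;     /* switchers past prepare, not yet queued */
	int queued;        /* switch work items sitting on the workqueue */
	int inode_refs;    /* inode references held by switchers */
};

static void model_init(struct model *m)
{
	*m = (struct model){ .sb_active = true };
}

/* Switcher step 1: pass the active check and grab the inode. */
static bool switcher_prepare(struct model *m)
{
	if (!m->sb_active)
		return false;
	m->nr_in_flight++;
	m->inode_refs++;
	m->in_window++;
	return true;
}

/* Switcher step 2: queue the work item, which closes the window. */
static void switcher_queue(struct model *m)
{
	assert(m->in_window > 0);
	m->in_window--;
	m->queued++;
}

/* Flush: run every queued item; each one drops its inode reference. */
static void flush_queued(struct model *m)
{
	while (m->queued > 0) {
		m->queued--;
		m->inode_refs--;
		m->nr_in_flight--;
	}
}

/*
 * Umount-side drain.  Returns the inode references still held, i.e.
 * the "busy inodes".  Assumption: the active flag is cleared before
 * the drain; it is folded in here for brevity.
 */
static int umount_drain(struct model *m, bool wait_for_window)
{
	m->sb_active = false;
	if (m->nr_in_flight > 0) {
		if (wait_for_window) {
			/* Models synchronize_rcu(): in-window switchers queue. */
			while (m->in_window > 0)
				switcher_queue(m);
		}
		flush_queued(m);
	}
	return m->inode_refs;
}

/* Replays the bad interleaving: prepare, then umount before queueing. */
int busy_inodes_after_umount(bool wait_for_window)
{
	struct model m;

	model_init(&m);
	switcher_prepare(&m);
	return umount_drain(&m, wait_for_window);
}
```

In this model, busy_inodes_after_umount(false) leaves one reference
behind because the flush sees an empty queue, while
busy_inodes_after_umount(true) drains it.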

Patch 2 removes the now-dead rcu_barrier(), a leftover from when switch
work was queued via queue_rcu_work(); that usage was removed by commit
e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when
switching inodes").

Patch 3 replaces the global synchronize_rcu()/flush_workqueue() pair
with a per-sb counter (s_isw_nr_in_flight), eliminating the global
serialization penalty.  This also reverts the RCU extension from patch 1
since the per-sb counter makes it unnecessary.
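
As a rough sketch of why the per-sb counter helps, the model below
compares the two "is there anything to wait for?" checks.  It assumes
only what is stated above, namely that each sb tracks its own in-flight
switches; struct sb_model and the helper names are hypothetical, and
the actual wait mechanism in patch 3 is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a superblock with s_isw_nr_in_flight. */
struct sb_model {
	int isw_in_flight;
};

/* Global scheme: a single counter summed over every superblock. */
static int global_in_flight(const struct sb_model *sbs, size_t n)
{
	int total = 0;

	for (size_t i = 0; i < n; i++)
		total += sbs[i].isw_in_flight;
	return total;
}

/* Global check: any switch on any sb makes the target's umount wait. */
bool umount_waits_global(const struct sb_model *sbs, size_t n, size_t target)
{
	assert(target < n);
	return global_in_flight(sbs, n) != 0;
}

/* Per-sb check: only switches on the target sb matter. */
bool umount_waits_per_sb(const struct sb_model *sbs, size_t n, size_t target)
{
	assert(target < n);
	return sbs[target].isw_in_flight != 0;
}

/* Four busy background sbs plus one idle target, as in the benchmark. */
bool idle_sb_waits(bool per_sb)
{
	const struct sb_model sbs[5] = { {3}, {1}, {2}, {4}, {0} };
	const size_t idle = 4;

	return per_sb ? umount_waits_per_sb(sbs, 5, idle)
		      : umount_waits_global(sbs, 5, idle);
}
```

With the global check the idle sb has to wait (idle_sb_waits(false) is
true); with the per-sb check it does not (idle_sb_waits(true) is false).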

The numbers below were measured with 4 background superblocks churning
cgwb switches to keep isw_nr_in_flight non-zero, while a separate idle
sb is umounted in a loop (N=100):

Idle target umount latency under cross-sb cgwb-switch pressure:

                               p50      p95      p99      max
  patch 1+2 (synchronize_rcu) 64.4 ms  95.8 ms 101.4 ms 110.5 ms
  patch 3   (per-sb counter)   5.3 ms   6.9 ms   7.4 ms   7.7 ms
  no-pressure baseline         5.2 ms   5.9 ms   6.0 ms   6.1 ms

8 concurrent umounts of idle sbs under the same pressure (5 batches):

                                p50      p95      max
  patch 1+2 (synchronize_rcu)  57.9 ms  82.1 ms  90.0 ms
  patch 3   (per-sb counter)    7.5 ms   7.8 ms   8.0 ms

In-kernel cgroup_writeback_umount() cumulative cost over 286 calls
(bpftrace, kprobes filtered to the umount call context):

                                cgroup_writeback_umount() time
  patch 1+2 (synchronize_rcu)     8717 ms total  (~30 ms / call)
  patch 3   (per-sb counter)      1.16 ms total  (~4 us / call)
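
The per-call figures are simply the totals divided by the 286 calls.
A small helper to reproduce that arithmetic (per_call() and
ms_to_us() are hypothetical names used only for this check):

```c
#include <assert.h>

/* Average cost per call, in the same unit as the total. */
double per_call(double total, unsigned int calls)
{
	assert(calls > 0);
	return total / calls;
}

/* Convert milliseconds to microseconds. */
double ms_to_us(double ms)
{
	return ms * 1000.0;
}
```

per_call(8717.0, 286) comes to roughly 30.5 ms, and
ms_to_us(per_call(1.16, 286)) to roughly 4.06 us, in line with the
rounded values in the table.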

Comments and questions are, as always, welcome.

Thanks,
Baokun

Baokun Li (3):
  writeback: fix race between cgroup_writeback_umount() and
    inode_switch_wbs()
  writeback: drop now-unnecessary rcu_barrier() in
    cgroup_writeback_umount()
  writeback: use a per-sb counter to drain inode wb switches at umount

 fs/fs-writeback.c              | 52 +++++++++++++++++++---------------
 include/linux/fs/super_types.h |  8 ++++++
 2 files changed, 37 insertions(+), 23 deletions(-)

-- 
2.43.7

