* [PATCH v2 0/3] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
@ 2026-05-17 14:21 Baokun Li
2026-05-17 14:21 ` [PATCH v2 1/3] " Baokun Li
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Baokun Li @ 2026-05-17 14:21 UTC (permalink / raw)
To: linux-fsdevel; +Cc: viro, brauner, jack, tj, linux-kernel
Changes since v1:
* Use a simple RCU-based fix (patch 1) that is easy to backport to
older kernels; the per-sb refcount optimization is split out as a
separate performance patch (patch 3). (Suggested by Jan Kara)
v1: https://patch.msgid.link/20260513094829.867648-1-libaokun@linux.alibaba.com
======
When a container exits, a race between cgroup_writeback_umount() and
inode_switch_wbs()/cleanup_offline_cgwb() can trigger "VFS: Busy inodes
after unmount" followed by a use-after-free on percpu counters.
There is a window between inode_prepare_wbs_switch() returning true
(having passed the SB_ACTIVE check and grabbed the inode) and the
subsequent wb_queue_isw() call. If cgroup_writeback_umount() observes
the global isw_nr_in_flight counter as non-zero but flush_workqueue()
finds nothing queued, it returns early — leaving a held inode reference
that blocks evict_inodes() and a later iput() that hits freed percpu
counters.
Patch 1 fixes the race by extending the RCU read-side critical section
to cover the window from inode_prepare_wbs_switch() through
wb_queue_isw(), and adding synchronize_rcu() in the umount path so
that all in-flight switchers complete queueing before flush_workqueue()
runs.
Patch 2 removes the now-dead rcu_barrier() that was left over from the
old queue_rcu_work() era (removed by commit e1b849cfa6b6 ("writeback:
Avoid contention on wb->list_lock when switching inodes")).
Patch 3 replaces the global synchronize_rcu()/flush_workqueue() pair
with a per-sb counter (s_isw_nr_in_flight), eliminating the global
serialization penalty. This also reverts the RCU extension from patch 1
since the per-sb counter makes it unnecessary.
Measured with 4 background superblocks churning cgwb switches to keep
isw_nr_in_flight non-zero, while a separate idle sb is umounted in a
loop (N=100):
Idle target umount latency under cross-sb cgwb-switch pressure:
p50 p95 p99 max
patch 1+2 (synchronize_rcu) 64.4 ms 95.8 ms 101.4 ms 110.5 ms
patch 3 (per-sb counter) 5.3 ms 6.9 ms 7.4 ms 7.7 ms
no-pressure baseline 5.2 ms 5.9 ms 6.0 ms 6.1 ms
8 concurrent umounts of idle sbs under the same pressure (5 batches):
p50 p95 max
patch 1+2 (synchronize_rcu) 57.9 ms 82.1 ms 90.0 ms
patch 3 (per-sb counter) 7.5 ms 7.8 ms 8.0 ms
In-kernel cgroup_writeback_umount() cumulative cost over 286 calls
(bpftrace, kprobes filtered to the umount call context):
cgroup_writeback_umount() time
patch 1+2 (synchronize_rcu) 8717 ms total (~30 ms / call)
patch 3 (per-sb counter) 1.16 ms total (~4 us / call)
Comments and questions are, as always, welcome.
Thanks,
Baokun
Baokun Li (3):
writeback: fix race between cgroup_writeback_umount() and
inode_switch_wbs()
writeback: drop now-unnecessary rcu_barrier() in
cgroup_writeback_umount()
writeback: use a per-sb counter to drain inode wb switches at umount
fs/fs-writeback.c | 52 +++++++++++++++++++---------------
include/linux/fs/super_types.h | 8 ++++++
2 files changed, 37 insertions(+), 23 deletions(-)
--
2.43.7
^ permalink raw reply [flat|nested] 11+ messages in thread* [PATCH v2 1/3] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs() 2026-05-17 14:21 [PATCH v2 0/3] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs() Baokun Li @ 2026-05-17 14:21 ` Baokun Li 2026-05-18 12:58 ` Jan Kara 2026-05-17 14:21 ` [PATCH v2 2/3] writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount() Baokun Li 2026-05-17 14:21 ` [PATCH v2 3/3] writeback: use a per-sb counter to drain inode wb switches at umount Baokun Li 2 siblings, 1 reply; 11+ messages in thread From: Baokun Li @ 2026-05-17 14:21 UTC (permalink / raw) To: linux-fsdevel; +Cc: viro, brauner, jack, tj, linux-kernel When a container exits, the following BUG_ON() is occasionally triggered: ================================================================== VFS: Busy inodes after unmount of sdb (ext4) ------------[ cut here ]------------ kernel BUG at fs/super.c:695! CPU: 3 PID: 6 Comm: containerd-shim Tainted: G OE K 6.6 #1 pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : generic_shutdown_super+0xf0/0x100 lr : generic_shutdown_super+0xf0/0x100 Call trace: generic_shutdown_super+0xf0/0x100 kill_block_super+0x20/0x48 ext4_kill_sb+0x28/0x60 deactivate_locked_super+0x54/0x130 deactivate_super+0x84/0xa0 cleanup_mnt+0xa4/0x140 __cleanup_mnt+0x18/0x28 task_work_run+0x78/0xe0 do_notify_resume+0x204/0x240 ================================================================== The root cause is a race between cgroup_writeback_umount() and inode_switch_wbs()/cleanup_offline_cgwb(). There is a window between inode_prepare_wbs_switch() returning true and the subsequent wb_queue_isw() call. Following is the process that triggers the issue: CPU A (umount) | CPU B (writeback) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ inode_switch_wbs/cleanup_offline_cgwb atomic_inc(&isw_nr_in_flight) inode_prepare_wbs_switch -> passes SB_ACTIVE check __iget(inode) generic_shutdown_super sb->s_flags &= ~SB_ACTIVE cgroup_writeback_umount(sb) smp_mb() atomic_read(&isw_nr_in_flight) rcu_barrier() -> no pending RCU callbacks flush_workqueue(isw_wq) -> nothing queued, returns evict_inodes(sb) -> Inode skipped as isw still holds a ref. sop->put_super(sb) /* destroys percpu counters */ -> VFS: Busy inodes after unmount! wb_queue_isw() queue_work(isw_wq, ...) /* later in work function */ inode_switch_wbs_work_fn process_inode_switch_wbs iput() -> evict percpu_counter_dec() // UAF! Fix this by extending the RCU read-side critical section in inode_switch_wbs() and cleanup_offline_cgwb() to cover from inode_prepare_wbs_switch() through wb_queue_isw(). Since there is no sleep in this window, rcu_read_lock() can be used. Then add a synchronize_rcu() in cgroup_writeback_umount() before the existing rcu_barrier(), so that all in-flight switchers that have passed the SB_ACTIVE check have completed queue_work() before flush_workqueue() is called. Fixes: a1a0e23e4903 ("writeback: flush inode cgroup wb switches instead of pinning super_block") Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/all/mxnjq2l6guusfchvauxr3v7c4bwjasybxlleqbbh4efloeqspz@iqylk76ohufz Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> --- fs/fs-writeback.c | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index a65694cbfe68..1f95ddcee363 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -665,7 +665,6 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) memcg_css = css_from_id(new_wb_id, &memory_cgrp_subsys); if (memcg_css && !css_tryget(memcg_css)) memcg_css = NULL; - rcu_read_unlock(); if (!memcg_css) goto out_free; @@ -680,10 +679,18 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) isw->inodes[0] = inode; trace_inode_switch_wbs_queue(inode->i_wb, new_wb, 1); + /* + * Paired with synchronize_rcu() in cgroup_writeback_umount(). + * Holding rcu_read_lock across wb_queue_isw() ensures + * synchronize_rcu() cannot return until the work is queued, so + * the subsequent flush_workqueue() will wait for the switch. + */ wb_queue_isw(new_wb, isw); + rcu_read_unlock(); return; out_free: + rcu_read_unlock(); atomic_dec(&isw_nr_in_flight); if (new_wb) wb_put(new_wb); @@ -741,6 +748,14 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) new_wb = &wb->bdi->wb; /* wb_get() is noop for bdi's wb */ nr = 0; + /* + * Paired with synchronize_rcu() in cgroup_writeback_umount(). + * Holding rcu_read_lock across the SB_ACTIVE check, the inode grab + * and wb_queue_isw() ensures synchronize_rcu() cannot return until + * the work is queued, so the subsequent flush_workqueue() will wait + * for the switch. + */ + rcu_read_lock(); spin_lock(&wb->list_lock); /* * In addition to the inodes that have completed writeback, also switch @@ -758,6 +773,7 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) /* no attached inodes? bail out */ if (nr == 0) { + rcu_read_unlock(); atomic_dec(&isw_nr_in_flight); wb_put(new_wb); kfree(isw); @@ -766,6 +782,7 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) trace_inode_switch_wbs_queue(wb, new_wb, nr); wb_queue_isw(new_wb, isw); + rcu_read_unlock(); return restart; } @@ -1221,6 +1238,14 @@ void cgroup_writeback_umount(struct super_block *sb) smp_mb(); if (atomic_read(&isw_nr_in_flight)) { + /* + * Paired with rcu_read_lock() in inode_switch_wbs() and + * cleanup_offline_cgwb(). synchronize_rcu() waits for any + * in-flight switcher that already passed the SB_ACTIVE check + * to finish queueing its work, so flush_workqueue() below + * will then drain it. + */ + synchronize_rcu(); /* * Use rcu_barrier() to wait for all pending callbacks to * ensure that all in-flight wb switches are in the workqueue. -- 2.43.7 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v2 1/3] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs() 2026-05-17 14:21 ` [PATCH v2 1/3] " Baokun Li @ 2026-05-18 12:58 ` Jan Kara 0 siblings, 0 replies; 11+ messages in thread From: Jan Kara @ 2026-05-18 12:58 UTC (permalink / raw) To: Baokun Li; +Cc: linux-fsdevel, viro, brauner, jack, tj, linux-kernel On Sun 17-05-26 22:21:30, Baokun Li wrote: > When a container exits, the following BUG_ON() is occasionally triggered: > > ================================================================== > VFS: Busy inodes after unmount of sdb (ext4) > ------------[ cut here ]------------ > kernel BUG at fs/super.c:695! > CPU: 3 PID: 6 Comm: containerd-shim Tainted: G OE K 6.6 #1 > pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) > pc : generic_shutdown_super+0xf0/0x100 > lr : generic_shutdown_super+0xf0/0x100 > Call trace: > generic_shutdown_super+0xf0/0x100 > kill_block_super+0x20/0x48 > ext4_kill_sb+0x28/0x60 > deactivate_locked_super+0x54/0x130 > deactivate_super+0x84/0xa0 > cleanup_mnt+0xa4/0x140 > __cleanup_mnt+0x18/0x28 > task_work_run+0x78/0xe0 > do_notify_resume+0x204/0x240 > ================================================================== > > The root cause is a race between cgroup_writeback_umount() and > inode_switch_wbs()/cleanup_offline_cgwb(). There is a window between > inode_prepare_wbs_switch() returning true and the subsequent > wb_queue_isw() call. Following is the process that triggers the issue: > > CPU A (umount) | CPU B (writeback) > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > inode_switch_wbs/cleanup_offline_cgwb > atomic_inc(&isw_nr_in_flight) > inode_prepare_wbs_switch > -> passes SB_ACTIVE check > __iget(inode) > generic_shutdown_super > sb->s_flags &= ~SB_ACTIVE > cgroup_writeback_umount(sb) > smp_mb() > atomic_read(&isw_nr_in_flight) > rcu_barrier() > -> no pending RCU callbacks > flush_workqueue(isw_wq) > -> nothing queued, returns > evict_inodes(sb) > -> Inode skipped as isw still holds a ref. > sop->put_super(sb) > /* destroys percpu counters */ > -> VFS: Busy inodes after unmount! > wb_queue_isw() > queue_work(isw_wq, ...) > /* later in work function */ > inode_switch_wbs_work_fn > process_inode_switch_wbs > iput() -> evict > percpu_counter_dec() // UAF! > > Fix this by extending the RCU read-side critical section in > inode_switch_wbs() and cleanup_offline_cgwb() to cover from > inode_prepare_wbs_switch() through wb_queue_isw(). Since there is > no sleep in this window, rcu_read_lock() can be used. Then add a > synchronize_rcu() in cgroup_writeback_umount() before the existing > rcu_barrier(), so that all in-flight switchers that have passed the > SB_ACTIVE check have completed queue_work() before flush_workqueue() > is called. > > Fixes: a1a0e23e4903 ("writeback: flush inode cgroup wb switches instead of pinning super_block") > Suggested-by: Jan Kara <jack@suse.cz> > Link: https://lore.kernel.org/all/mxnjq2l6guusfchvauxr3v7c4bwjasybxlleqbbh4efloeqspz@iqylk76ohufz > Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> Looks good! Thanks! Feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > fs/fs-writeback.c | 27 ++++++++++++++++++++++++++- > 1 file changed, 26 insertions(+), 1 deletion(-) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index a65694cbfe68..1f95ddcee363 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -665,7 +665,6 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) > memcg_css = css_from_id(new_wb_id, &memory_cgrp_subsys); > if (memcg_css && !css_tryget(memcg_css)) > memcg_css = NULL; > - rcu_read_unlock(); > if (!memcg_css) > goto out_free; > > @@ -680,10 +679,18 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) > isw->inodes[0] = inode; > > trace_inode_switch_wbs_queue(inode->i_wb, new_wb, 1); > + /* > + * Paired with synchronize_rcu() in cgroup_writeback_umount(). > + * Holding rcu_read_lock across wb_queue_isw() ensures > + * synchronize_rcu() cannot return until the work is queued, so > + * the subsequent flush_workqueue() will wait for the switch. > + */ > wb_queue_isw(new_wb, isw); > + rcu_read_unlock(); > return; > > out_free: > + rcu_read_unlock(); > atomic_dec(&isw_nr_in_flight); > if (new_wb) > wb_put(new_wb); > @@ -741,6 +748,14 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) > new_wb = &wb->bdi->wb; /* wb_get() is noop for bdi's wb */ > > nr = 0; > + /* > + * Paired with synchronize_rcu() in cgroup_writeback_umount(). > + * Holding rcu_read_lock across the SB_ACTIVE check, the inode grab > + * and wb_queue_isw() ensures synchronize_rcu() cannot return until > + * the work is queued, so the subsequent flush_workqueue() will wait > + * for the switch. > + */ > + rcu_read_lock(); > spin_lock(&wb->list_lock); > /* > * In addition to the inodes that have completed writeback, also switch > @@ -758,6 +773,7 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) > > /* no attached inodes? bail out */ > if (nr == 0) { > + rcu_read_unlock(); > atomic_dec(&isw_nr_in_flight); > wb_put(new_wb); > kfree(isw); > @@ -766,6 +782,7 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) > > trace_inode_switch_wbs_queue(wb, new_wb, nr); > wb_queue_isw(new_wb, isw); > + rcu_read_unlock(); > > return restart; > } > @@ -1221,6 +1238,14 @@ void cgroup_writeback_umount(struct super_block *sb) > smp_mb(); > > if (atomic_read(&isw_nr_in_flight)) { > + /* > + * Paired with rcu_read_lock() in inode_switch_wbs() and > + * cleanup_offline_cgwb(). synchronize_rcu() waits for any > + * in-flight switcher that already passed the SB_ACTIVE check > + * to finish queueing its work, so flush_workqueue() below > + * will then drain it. > + */ > + synchronize_rcu(); > /* > * Use rcu_barrier() to wait for all pending callbacks to > * ensure that all in-flight wb switches are in the workqueue. > -- > 2.43.7 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v2 2/3] writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount() 2026-05-17 14:21 [PATCH v2 0/3] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs() Baokun Li 2026-05-17 14:21 ` [PATCH v2 1/3] " Baokun Li @ 2026-05-17 14:21 ` Baokun Li 2026-05-18 13:01 ` Jan Kara 2026-05-17 14:21 ` [PATCH v2 3/3] writeback: use a per-sb counter to drain inode wb switches at umount Baokun Li 2 siblings, 1 reply; 11+ messages in thread From: Baokun Li @ 2026-05-17 14:21 UTC (permalink / raw) To: linux-fsdevel; +Cc: viro, brauner, jack, tj, linux-kernel Commit e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when switching inodes") replaced the queue_rcu_work() based scheduling of inode wb switches with a plain queue_work(). Since then no switcher goes through call_rcu(), so rcu_barrier() in cgroup_writeback_umount() has no work to wait for and is effectively a no-op. Fixes: e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when switching inodes") Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> --- fs/fs-writeback.c | 5 ----- 1 file changed, 5 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 1f95ddcee363..9ae290547eb2 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1246,11 +1246,6 @@ void cgroup_writeback_umount(struct super_block *sb) * will then drain it. */ synchronize_rcu(); - /* - * Use rcu_barrier() to wait for all pending callbacks to - * ensure that all in-flight wb switches are in the workqueue. - */ - rcu_barrier(); flush_workqueue(isw_wq); } } -- 2.43.7 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v2 2/3] writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount() 2026-05-17 14:21 ` [PATCH v2 2/3] writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount() Baokun Li @ 2026-05-18 13:01 ` Jan Kara 2026-05-18 13:10 ` Baokun Li 0 siblings, 1 reply; 11+ messages in thread From: Jan Kara @ 2026-05-18 13:01 UTC (permalink / raw) To: Baokun Li; +Cc: linux-fsdevel, viro, brauner, jack, tj, linux-kernel On Sun 17-05-26 22:21:31, Baokun Li wrote: > Commit e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when > switching inodes") replaced the queue_rcu_work() based scheduling of > inode wb switches with a plain queue_work(). Since then no switcher > goes through call_rcu(), so rcu_barrier() in cgroup_writeback_umount() > has no work to wait for and is effectively a no-op. > > Fixes: e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when switching inodes") > Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> Looks good. Just I think this should be folded in patch 1, since you have proper Fixes tag, I don't think there's a concern this would be backported to kernels where rcu_barrier() is still needed. Honza > --- > fs/fs-writeback.c | 5 ----- > 1 file changed, 5 deletions(-) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 1f95ddcee363..9ae290547eb2 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -1246,11 +1246,6 @@ void cgroup_writeback_umount(struct super_block *sb) > * will then drain it. > */ > synchronize_rcu(); > - /* > - * Use rcu_barrier() to wait for all pending callbacks to > - * ensure that all in-flight wb switches are in the workqueue. > - */ > - rcu_barrier(); > flush_workqueue(isw_wq); > } > } > -- > 2.43.7 > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2 2/3] writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount() 2026-05-18 13:01 ` Jan Kara @ 2026-05-18 13:10 ` Baokun Li 2026-05-18 15:11 ` Jan Kara 0 siblings, 1 reply; 11+ messages in thread From: Baokun Li @ 2026-05-18 13:10 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, viro, brauner, tj, linux-kernel 在 2026/5/18 21:01, Jan Kara 写道: > On Sun 17-05-26 22:21:31, Baokun Li wrote: >> Commit e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when >> switching inodes") replaced the queue_rcu_work() based scheduling of >> inode wb switches with a plain queue_work(). Since then no switcher >> goes through call_rcu(), so rcu_barrier() in cgroup_writeback_umount() >> has no work to wait for and is effectively a no-op. >> >> Fixes: e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when switching inodes") >> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> > Looks good. Just I think this should be folded in patch 1, since you have > proper Fixes tag, I don't think there's a concern this would be backported > to kernels where rcu_barrier() is still needed. > > Honza Thank you for your review! Because the two patches have different fix tags, there is concern that removing rcu_barrier here could cause the stable branches still using queue_rcu_work (5.10.y, 6.6.y, ...) to reintroduce similar issues when the round of fix patches deletes this necessary synchronization. Therefore, the patches are split into two to make their purposes clearer. Cheers, Baokun >> --- >> fs/fs-writeback.c | 5 ----- >> 1 file changed, 5 deletions(-) >> >> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c >> index 1f95ddcee363..9ae290547eb2 100644 >> --- a/fs/fs-writeback.c >> +++ b/fs/fs-writeback.c >> @@ -1246,11 +1246,6 @@ void cgroup_writeback_umount(struct super_block *sb) >> * will then drain it. >> */ >> synchronize_rcu(); >> - /* >> - * Use rcu_barrier() to wait for all pending callbacks to >> - * ensure that all in-flight wb switches are in the workqueue. >> - */ >> - rcu_barrier(); >> flush_workqueue(isw_wq); >> } >> } >> -- >> 2.43.7 >> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2 2/3] writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount() 2026-05-18 13:10 ` Baokun Li @ 2026-05-18 15:11 ` Jan Kara 0 siblings, 0 replies; 11+ messages in thread From: Jan Kara @ 2026-05-18 15:11 UTC (permalink / raw) To: Baokun Li; +Cc: Jan Kara, linux-fsdevel, viro, brauner, tj, linux-kernel On Mon 18-05-26 21:10:20, Baokun Li wrote: > 在 2026/5/18 21:01, Jan Kara 写道: > > On Sun 17-05-26 22:21:31, Baokun Li wrote: > >> Commit e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when > >> switching inodes") replaced the queue_rcu_work() based scheduling of > >> inode wb switches with a plain queue_work(). Since then no switcher > >> goes through call_rcu(), so rcu_barrier() in cgroup_writeback_umount() > >> has no work to wait for and is effectively a no-op. > >> > >> Fixes: e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when switching inodes") > >> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> > > Looks good. Just I think this should be folded in patch 1, since you have > > proper Fixes tag, I don't think there's a concern this would be backported > > to kernels where rcu_barrier() is still needed. > > > > Honza > > Thank you for your review! > > Because the two patches have different fix tags, there is concern that > removing rcu_barrier here could cause the stable branches still using > queue_rcu_work (5.10.y, 6.6.y, ...) to reintroduce similar issues when > the round of fix patches deletes this necessary synchronization. > > Therefore, the patches are split into two to make their purposes clearer. Ah, ok, I didn't notice that the first patch has so old fixes tag. OK. Feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > >> --- > >> fs/fs-writeback.c | 5 ----- > >> 1 file changed, 5 deletions(-) > >> > >> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > >> index 1f95ddcee363..9ae290547eb2 100644 > >> --- a/fs/fs-writeback.c > >> +++ b/fs/fs-writeback.c > >> @@ -1246,11 +1246,6 @@ void cgroup_writeback_umount(struct super_block *sb) > >> * will then drain it. > >> */ > >> synchronize_rcu(); > >> - /* > >> - * Use rcu_barrier() to wait for all pending callbacks to > >> - * ensure that all in-flight wb switches are in the workqueue. > >> - */ > >> - rcu_barrier(); > >> flush_workqueue(isw_wq); > >> } > >> } > >> -- > >> 2.43.7 > >> > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v2 3/3] writeback: use a per-sb counter to drain inode wb switches at umount 2026-05-17 14:21 [PATCH v2 0/3] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs() Baokun Li 2026-05-17 14:21 ` [PATCH v2 1/3] " Baokun Li 2026-05-17 14:21 ` [PATCH v2 2/3] writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount() Baokun Li @ 2026-05-17 14:21 ` Baokun Li 2026-05-18 11:14 ` Baokun Li 2026-05-18 11:42 ` Christian Brauner 2 siblings, 2 replies; 11+ messages in thread From: Baokun Li @ 2026-05-17 14:21 UTC (permalink / raw) To: linux-fsdevel; +Cc: viro, brauner, jack, tj, linux-kernel Tracking in-flight inode wb switches with a single global counter (isw_nr_in_flight) plus a synchronize_rcu() based wait in cgroup_writeback_umount() forces every umount to take a global hit whenever any other superblock on the system has wb switches in flight, even if the superblock being unmounted has none of its own. Replace the global synchronize_rcu()/flush_workqueue() pair with a per-sb counter, s_isw_nr_in_flight: - inode_prepare_wbs_switch() increments before checking SB_ACTIVE and grabbing the inode; failure paths decrement before returning. - process_inode_switch_wbs() decrements after the matching iput(). - cgroup_writeback_umount() waits for the per-sb counter to reach zero. The smp_mb() pair between inode_prepare_wbs_switch() and cgroup_writeback_umount() keeps the SB_ACTIVE / counter ordering: either the umounter sees a non-zero counter and waits, or the switcher sees SB_ACTIVE cleared and aborts before grabbing the inode. The existing global isw_nr_in_flight is left in place, since it is still used to throttle in-flight switches via WB_FRN_MAX_IN_FLIGHT. The rcu_read_lock() extension in inode_switch_wbs() and cleanup_offline_cgwb() introduced by the race fix is no longer needed and is reverted. The synchronize_rcu() in cgroup_writeback_umount() is dropped as well. inode_prepare_wbs_switch() also gains a lockless SB_ACTIVE check at the top to skip the atomic_inc/smp_mb cost for inodes whose sb is already being torn down. The following numbers were measured with 4 background superblocks each churning "create memcg -> write 1 MiB -> rmdir memcg" to keep the global isw_nr_in_flight non-zero. A separate idle target sb is mounted and umounted in a loop (N=100); only its umount latency is measured. Idle target umount latency under cross-sb cgwb-switch pressure: p50 p95 p99 max global synchronize_rcu() 64.4 ms 95.8 ms 101.4 ms 110.5 ms per-sb counter (this) 5.3 ms 6.9 ms 7.4 ms 7.7 ms no-pressure baseline 5.2 ms 5.9 ms 6.0 ms 6.1 ms 8 concurrent umounts of idle sbs under the same pressure: p50 p95 max global synchronize_rcu() 57.9 ms 82.1 ms 90.0 ms per-sb counter (this) 7.5 ms 7.8 ms 8.0 ms In-kernel cgroup_writeback_umount() time over 286 calls (bpftrace): global synchronize_rcu() 8717 ms total (~30 ms / call) per-sb counter (this) 1.16 ms total (~4 us / call) When s_isw_nr_in_flight is zero the loop body is never entered. Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> --- fs/fs-writeback.c | 74 ++++++++++++++-------------------- include/linux/fs/super_types.h | 8 ++++ 2 files changed, 38 insertions(+), 44 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 9ae290547eb2..4282fcfe027b 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -554,8 +554,12 @@ static void process_inode_switch_wbs(struct bdi_writeback *new_wb, wb_put_many(old_wb, nr_switched); } - for (inodep = isw->inodes; *inodep; inodep++) + for (inodep = isw->inodes; *inodep; inodep++) { + struct super_block *sb = (*inodep)->i_sb; + iput(*inodep); + atomic_dec(&sb->s_isw_nr_in_flight); + } wb_put(new_wb); kfree(isw); atomic_dec(&isw_nr_in_flight); @@ -598,16 +602,19 @@ void inode_switch_wbs_work_fn(struct work_struct *work) static bool inode_prepare_wbs_switch(struct inode *inode, struct bdi_writeback *new_wb) { + /* Avoid the atomic_inc/smp_mb dance once SB_ACTIVE is gone. */ + if (!(inode->i_sb->s_flags & SB_ACTIVE)) + return false; + /* - * Paired with smp_mb() in cgroup_writeback_umount(). - * isw_nr_in_flight must be increased before checking SB_ACTIVE and - * grabbing an inode, otherwise isw_nr_in_flight can be observed as 0 - * in cgroup_writeback_umount() and the isw_wq will be not flushed. + * Pairs with smp_mb() in cgroup_writeback_umount(): the umounter either + * sees a non-zero counter and waits, or we see SB_ACTIVE clear below. */ + atomic_inc(&inode->i_sb->s_isw_nr_in_flight); smp_mb(); if (IS_DAX(inode)) - return false; + goto out_dec; /* while holding I_WB_SWITCH, no one else can update the association */ spin_lock(&inode->i_lock); @@ -615,13 +622,17 @@ static bool inode_prepare_wbs_switch(struct inode *inode, inode_state_read(inode) & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) || inode_to_wb(inode) == new_wb) { spin_unlock(&inode->i_lock); - return false; + goto out_dec; } inode_state_set(inode, I_WB_SWITCH); __iget(inode); spin_unlock(&inode->i_lock); return true; + +out_dec: + atomic_dec(&inode->i_sb->s_isw_nr_in_flight); + return false; } static void wb_queue_isw(struct bdi_writeback *wb, @@ -665,6 +676,7 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) memcg_css = css_from_id(new_wb_id, &memory_cgrp_subsys); if (memcg_css && !css_tryget(memcg_css)) memcg_css = NULL; + rcu_read_unlock(); if (!memcg_css) goto out_free; @@ -679,18 +691,10 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) isw->inodes[0] = inode; trace_inode_switch_wbs_queue(inode->i_wb, new_wb, 1); - /* - * Paired with synchronize_rcu() in cgroup_writeback_umount(). - * Holding rcu_read_lock across wb_queue_isw() ensures - * synchronize_rcu() cannot return until the work is queued, so - * the subsequent flush_workqueue() will wait for the switch. - */ wb_queue_isw(new_wb, isw); - rcu_read_unlock(); return; out_free: - rcu_read_unlock(); atomic_dec(&isw_nr_in_flight); if (new_wb) wb_put(new_wb); @@ -748,14 +752,6 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) new_wb = &wb->bdi->wb; /* wb_get() is noop for bdi's wb */ nr = 0; - /* - * Paired with synchronize_rcu() in cgroup_writeback_umount(). - * Holding rcu_read_lock across the SB_ACTIVE check, the inode grab - * and wb_queue_isw() ensures synchronize_rcu() cannot return until - * the work is queued, so the subsequent flush_workqueue() will wait - * for the switch. - */ - rcu_read_lock(); spin_lock(&wb->list_lock); /* * In addition to the inodes that have completed writeback, also switch @@ -773,7 +769,6 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) /* no attached inodes? bail out */ if (nr == 0) { - rcu_read_unlock(); atomic_dec(&isw_nr_in_flight); wb_put(new_wb); kfree(isw); @@ -782,7 +777,6 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) trace_inode_switch_wbs_queue(wb, new_wb, nr); wb_queue_isw(new_wb, isw); - rcu_read_unlock(); return restart; } @@ -1215,38 +1209,30 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, } /** - * cgroup_writeback_umount - flush inode wb switches for umount + * cgroup_writeback_umount - wait for in-flight inode wb switches on @sb * @sb: target super_block * - * This function is called when a super_block is about to be destroyed and - * flushes in-flight inode wb switches. An inode wb switch goes through - * RCU and then workqueue, so the two need to be flushed in order to ensure - * that all previously scheduled switches are finished. As wb switches are - * rare occurrences and synchronize_rcu() can take a while, perform - * flushing iff wb switches are in flight. + * Wait until every inode wb switch that already passed the SB_ACTIVE + * check on this superblock has been completed by the worker. Since + * SB_ACTIVE is cleared before this is called, no new switches can start + * for @sb, so s_isw_nr_in_flight will monotonically drop to zero. */ void cgroup_writeback_umount(struct super_block *sb) { - if (!(sb->s_bdi->capabilities & BDI_CAP_WRITEBACK)) return; /* - * SB_ACTIVE should be reliably cleared before checking - * isw_nr_in_flight, see generic_shutdown_super(). + * Pairs with smp_mb() in inode_prepare_wbs_switch(): we either observe + * a non-zero counter and wait, or the switcher sees SB_ACTIVE clear + * (cleared by generic_shutdown_super()) and bails before grabbing the + * inode. */ smp_mb(); - if (atomic_read(&isw_nr_in_flight)) { - /* - * Paired with rcu_read_lock() in inode_switch_wbs() and - * cleanup_offline_cgwb(). synchronize_rcu() waits for any - * in-flight switcher that already passed the SB_ACTIVE check - * to finish queueing its work, so flush_workqueue() below - * will then drain it. - */ - synchronize_rcu(); + while (atomic_read(&sb->s_isw_nr_in_flight)) { flush_workqueue(isw_wq); + cond_resched(); } } diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h index 383050e7fdf5..1ab4e2265129 100644 --- a/include/linux/fs/super_types.h +++ b/include/linux/fs/super_types.h @@ -274,6 +274,14 @@ struct super_block { /* number of fserrors that are being sent to fsnotify/filesystems */ refcount_t s_pending_errors; + +#ifdef CONFIG_CGROUP_WRITEBACK + /* + * Number of in-flight inode wb switches for this sb. Drained by + * cgroup_writeback_umount() before tear-down. + */ + atomic_t s_isw_nr_in_flight; +#endif } __randomize_layout; /* -- 2.43.7 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v2 3/3] writeback: use a per-sb counter to drain inode wb switches at umount 2026-05-17 14:21 ` [PATCH v2 3/3] writeback: use a per-sb counter to drain inode wb switches at umount Baokun Li @ 2026-05-18 11:14 ` Baokun Li 2026-05-18 11:42 ` Christian Brauner 1 sibling, 0 replies; 11+ messages in thread From: Baokun Li @ 2026-05-18 11:14 UTC (permalink / raw) To: linux-fsdevel; +Cc: viro, brauner, jack, tj, linux-kernel Please ignore this patch; sashiko found some issues, and I will use wait_var_event() / wake_up_var() in the next version. https://sashiko.dev/#/patchset/20260517142147.3354909-1-libaokun%40linux.alibaba.com Sorry for the noise. Regards, Baokun 在 2026/5/17 22:21, Baokun Li 写道: > Tracking in-flight inode wb switches with a single global counter > (isw_nr_in_flight) plus a synchronize_rcu() based wait in > cgroup_writeback_umount() forces every umount to take a global hit > whenever any other superblock on the system has wb switches in flight, > even if the superblock being unmounted has none of its own. > > Replace the global synchronize_rcu()/flush_workqueue() pair with a > per-sb counter, s_isw_nr_in_flight: > > - inode_prepare_wbs_switch() increments before checking SB_ACTIVE > and grabbing the inode; failure paths decrement before returning. > - process_inode_switch_wbs() decrements after the matching iput(). > - cgroup_writeback_umount() waits for the per-sb counter to reach zero. > > The smp_mb() pair between inode_prepare_wbs_switch() and > cgroup_writeback_umount() keeps the SB_ACTIVE / counter ordering: > either the umounter sees a non-zero counter and waits, or the switcher > sees SB_ACTIVE cleared and aborts before grabbing the inode. > > The existing global isw_nr_in_flight is left in place, since it is > still used to throttle in-flight switches via WB_FRN_MAX_IN_FLIGHT. > > The rcu_read_lock() extension in inode_switch_wbs() and > cleanup_offline_cgwb() introduced by the race fix is no longer > needed and is reverted. The synchronize_rcu() in > cgroup_writeback_umount() is dropped as well. > > inode_prepare_wbs_switch() also gains a lockless SB_ACTIVE check at > the top to skip the atomic_inc/smp_mb cost for inodes whose sb is > already being torn down. > > The following numbers were measured with 4 background superblocks > each churning "create memcg -> write 1 MiB -> rmdir memcg" to keep > the global isw_nr_in_flight non-zero. A separate idle target sb is > mounted and umounted in a loop (N=100); only its umount latency is > measured. > > Idle target umount latency under cross-sb cgwb-switch pressure: > > p50 p95 p99 max > global synchronize_rcu() 64.4 ms 95.8 ms 101.4 ms 110.5 ms > per-sb counter (this) 5.3 ms 6.9 ms 7.4 ms 7.7 ms > no-pressure baseline 5.2 ms 5.9 ms 6.0 ms 6.1 ms > > 8 concurrent umounts of idle sbs under the same pressure: > > p50 p95 max > global synchronize_rcu() 57.9 ms 82.1 ms 90.0 ms > per-sb counter (this) 7.5 ms 7.8 ms 8.0 ms > > In-kernel cgroup_writeback_umount() time over 286 calls (bpftrace): > > global synchronize_rcu() 8717 ms total (~30 ms / call) > per-sb counter (this) 1.16 ms total (~4 us / call) > > When s_isw_nr_in_flight is zero the loop body is never entered. > > Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> > --- > fs/fs-writeback.c | 74 ++++++++++++++-------------------- > include/linux/fs/super_types.h | 8 ++++ > 2 files changed, 38 insertions(+), 44 deletions(-) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 9ae290547eb2..4282fcfe027b 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -554,8 +554,12 @@ static void process_inode_switch_wbs(struct bdi_writeback *new_wb, > wb_put_many(old_wb, nr_switched); > } > > - for (inodep = isw->inodes; *inodep; inodep++) > + for (inodep = isw->inodes; *inodep; inodep++) { > + struct super_block *sb = (*inodep)->i_sb; > + > iput(*inodep); > + atomic_dec(&sb->s_isw_nr_in_flight); > + } > wb_put(new_wb); > kfree(isw); > atomic_dec(&isw_nr_in_flight); > @@ -598,16 +602,19 @@ void inode_switch_wbs_work_fn(struct work_struct *work) > static bool inode_prepare_wbs_switch(struct inode *inode, > struct bdi_writeback *new_wb) > { > + /* Avoid the atomic_inc/smp_mb dance once SB_ACTIVE is gone. */ > + if (!(inode->i_sb->s_flags & SB_ACTIVE)) > + return false; > + > /* > - * Paired with smp_mb() in cgroup_writeback_umount(). > - * isw_nr_in_flight must be increased before checking SB_ACTIVE and > - * grabbing an inode, otherwise isw_nr_in_flight can be observed as 0 > - * in cgroup_writeback_umount() and the isw_wq will be not flushed. > + * Pairs with smp_mb() in cgroup_writeback_umount(): the umounter either > + * sees a non-zero counter and waits, or we see SB_ACTIVE clear below. > */ > + atomic_inc(&inode->i_sb->s_isw_nr_in_flight); > smp_mb(); > > if (IS_DAX(inode)) > - return false; > + goto out_dec; > > /* while holding I_WB_SWITCH, no one else can update the association */ > spin_lock(&inode->i_lock); > @@ -615,13 +622,17 @@ static bool inode_prepare_wbs_switch(struct inode *inode, > inode_state_read(inode) & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) || > inode_to_wb(inode) == new_wb) { > spin_unlock(&inode->i_lock); > - return false; > + goto out_dec; > } > inode_state_set(inode, I_WB_SWITCH); > __iget(inode); > spin_unlock(&inode->i_lock); > > return true; > + > +out_dec: > + atomic_dec(&inode->i_sb->s_isw_nr_in_flight); > + return false; > } > > static void wb_queue_isw(struct bdi_writeback *wb, > @@ -665,6 +676,7 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) > memcg_css = css_from_id(new_wb_id, &memory_cgrp_subsys); > if (memcg_css && !css_tryget(memcg_css)) > memcg_css = NULL; > + rcu_read_unlock(); > if (!memcg_css) > goto out_free; > > @@ -679,18 +691,10 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) > isw->inodes[0] = inode; > > trace_inode_switch_wbs_queue(inode->i_wb, new_wb, 1); > - /* > - * Paired with synchronize_rcu() in cgroup_writeback_umount(). > - * Holding rcu_read_lock across wb_queue_isw() ensures > - * synchronize_rcu() cannot return until the work is queued, so > - * the subsequent flush_workqueue() will wait for the switch. > - */ > wb_queue_isw(new_wb, isw); > - rcu_read_unlock(); > return; > > out_free: > - rcu_read_unlock(); > atomic_dec(&isw_nr_in_flight); > if (new_wb) > wb_put(new_wb); > @@ -748,14 +752,6 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) > new_wb = &wb->bdi->wb; /* wb_get() is noop for bdi's wb */ > > nr = 0; > - /* > - * Paired with synchronize_rcu() in cgroup_writeback_umount(). > - * Holding rcu_read_lock across the SB_ACTIVE check, the inode grab > - * and wb_queue_isw() ensures synchronize_rcu() cannot return until > - * the work is queued, so the subsequent flush_workqueue() will wait > - * for the switch. > - */ > - rcu_read_lock(); > spin_lock(&wb->list_lock); > /* > * In addition to the inodes that have completed writeback, also switch > @@ -773,7 +769,6 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) > > /* no attached inodes? bail out */ > if (nr == 0) { > - rcu_read_unlock(); > atomic_dec(&isw_nr_in_flight); > wb_put(new_wb); > kfree(isw); > @@ -782,7 +777,6 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) > > trace_inode_switch_wbs_queue(wb, new_wb, nr); > wb_queue_isw(new_wb, isw); > - rcu_read_unlock(); > > return restart; > } > @@ -1215,38 +1209,30 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, > } > > /** > - * cgroup_writeback_umount - flush inode wb switches for umount > + * cgroup_writeback_umount - wait for in-flight inode wb switches on @sb > * @sb: target super_block > * > - * This function is called when a super_block is about to be destroyed and > - * flushes in-flight inode wb switches. An inode wb switch goes through > - * RCU and then workqueue, so the two need to be flushed in order to ensure > - * that all previously scheduled switches are finished. As wb switches are > - * rare occurrences and synchronize_rcu() can take a while, perform > - * flushing iff wb switches are in flight. > + * Wait until every inode wb switch that already passed the SB_ACTIVE > + * check on this superblock has been completed by the worker. Since > + * SB_ACTIVE is cleared before this is called, no new switches can start > + * for @sb, so s_isw_nr_in_flight will monotonically drop to zero. > */ > void cgroup_writeback_umount(struct super_block *sb) > { > - > if (!(sb->s_bdi->capabilities & BDI_CAP_WRITEBACK)) > return; > > /* > - * SB_ACTIVE should be reliably cleared before checking > - * isw_nr_in_flight, see generic_shutdown_super(). > + * Pairs with smp_mb() in inode_prepare_wbs_switch(): we either observe > + * a non-zero counter and wait, or the switcher sees SB_ACTIVE clear > + * (cleared by generic_shutdown_super()) and bails before grabbing the > + * inode. > */ > smp_mb(); > > - if (atomic_read(&isw_nr_in_flight)) { > - /* > - * Paired with rcu_read_lock() in inode_switch_wbs() and > - * cleanup_offline_cgwb(). synchronize_rcu() waits for any > - * in-flight switcher that already passed the SB_ACTIVE check > - * to finish queueing its work, so flush_workqueue() below > - * will then drain it. > - */ > - synchronize_rcu(); > + while (atomic_read(&sb->s_isw_nr_in_flight)) { > flush_workqueue(isw_wq); > + cond_resched(); > } > } > > diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h > index 383050e7fdf5..1ab4e2265129 100644 > --- a/include/linux/fs/super_types.h > +++ b/include/linux/fs/super_types.h > @@ -274,6 +274,14 @@ struct super_block { > > /* number of fserrors that are being sent to fsnotify/filesystems */ > refcount_t s_pending_errors; > + > +#ifdef CONFIG_CGROUP_WRITEBACK > + /* > + * Number of in-flight inode wb switches for this sb. Drained by > + * cgroup_writeback_umount() before tear-down. > + */ > + atomic_t s_isw_nr_in_flight; > +#endif > } __randomize_layout; > > /* ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2 3/3] writeback: use a per-sb counter to drain inode wb switches at umount 2026-05-17 14:21 ` [PATCH v2 3/3] writeback: use a per-sb counter to drain inode wb switches at umount Baokun Li 2026-05-18 11:14 ` Baokun Li @ 2026-05-18 11:42 ` Christian Brauner 2026-05-18 11:52 ` Baokun Li 1 sibling, 1 reply; 11+ messages in thread From: Christian Brauner @ 2026-05-18 11:42 UTC (permalink / raw) To: Baokun Li; +Cc: linux-fsdevel, viro, brauner, jack, tj, linux-kernel On Sun, 17 May 2026 22:21:32 +0800, Baokun Li <libaokun@linux.alibaba.com> wrote: > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 9ae290547eb2..4282fcfe027b 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -1215,38 +1209,30 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, > [ ... skip 38 lines ... ] > - * in-flight switcher that already passed the SB_ACTIVE check > - * to finish queueing its work, so flush_workqueue() below > - * will then drain it. > - */ > - synchronize_rcu(); > + while (atomic_read(&sb->s_isw_nr_in_flight)) { Would probably be a bit nicer to just wait until the count is zero? We have that pattern for inode_dio_wait(). So something like (untested): static bool cgroup_writeback_finished(const struct super_block *sb) { if (atomic_read(&sb->s_isw_nr_in_flight) == 0) return true; flush_workqueue(isw_wq); return false; } void cgroup_writeback_wait(const struct super_block *sb) { wait_var_event(&sb->s_isw_nr_in_flight, cgroup_writeback_finished(sb)); } void cgroup_writeback_finish(const struct super_block *sb) { if (atomic_dec_and_test(&sb->s_isw_nr_in_flight)) wake_up_var(&sb->s_isw_nr_in_flight); } -- Christian Brauner <brauner@kernel.org> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2 3/3] writeback: use a per-sb counter to drain inode wb switches at umount 2026-05-18 11:42 ` Christian Brauner @ 2026-05-18 11:52 ` Baokun Li 0 siblings, 0 replies; 11+ messages in thread From: Baokun Li @ 2026-05-18 11:52 UTC (permalink / raw) To: Christian Brauner; +Cc: linux-fsdevel, viro, jack, tj, linux-kernel 在 2026/5/18 19:42, Christian Brauner 写道: > On Sun, 17 May 2026 22:21:32 +0800, Baokun Li <libaokun@linux.alibaba.com> wrote: >> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c >> index 9ae290547eb2..4282fcfe027b 100644 >> --- a/fs/fs-writeback.c >> +++ b/fs/fs-writeback.c >> @@ -1215,38 +1209,30 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, >> [ ... skip 38 lines ... ] >> - * in-flight switcher that already passed the SB_ACTIVE check >> - * to finish queueing its work, so flush_workqueue() below >> - * will then drain it. >> - */ >> - synchronize_rcu(); >> + while (atomic_read(&sb->s_isw_nr_in_flight)) { > Would probably be a bit nicer to just wait until the count is zero? > We have that pattern for inode_dio_wait(). So something like (untested): > > static bool cgroup_writeback_finished(const struct super_block *sb) > { > if (atomic_read(&sb->s_isw_nr_in_flight) == 0) > return true; > > flush_workqueue(isw_wq); > return false; > } > > void cgroup_writeback_wait(const struct super_block *sb) > { > wait_var_event(&sb->s_isw_nr_in_flight, cgroup_writeback_finished(sb)); > } > > void cgroup_writeback_finish(const struct super_block *sb) > { > if (atomic_dec_and_test(&sb->s_isw_nr_in_flight)) > wake_up_var(&sb->s_isw_nr_in_flight); > } > Thank you for your suggestion! Totally agree. sashiko also raised similar opinions, and I will send out the v3 patch using this proposal right away. Cheers, Baokun ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2026-05-18 15:11 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-17 14:21 [PATCH v2 0/3] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs() Baokun Li 2026-05-17 14:21 ` [PATCH v2 1/3] " Baokun Li 2026-05-18 12:58 ` Jan Kara 2026-05-17 14:21 ` [PATCH v2 2/3] writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount() Baokun Li 2026-05-18 13:01 ` Jan Kara 2026-05-18 13:10 ` Baokun Li 2026-05-18 15:11 ` Jan Kara 2026-05-17 14:21 ` [PATCH v2 3/3] writeback: use a per-sb counter to drain inode wb switches at umount Baokun Li 2026-05-18 11:14 ` Baokun Li 2026-05-18 11:42 ` Christian Brauner 2026-05-18 11:52 ` Baokun Li
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox