* [PATCH 0/3] memcg, writeback: Don't wait writeback completion @ 2025-08-20 11:19 Julian Sun 2025-08-20 11:19 ` [PATCH 1/3] writeback: Rename wb_writeback_work->auto_free to free_work Julian Sun ` (3 more replies) 0 siblings, 4 replies; 26+ messages in thread From: Julian Sun @ 2025-08-20 11:19 UTC (permalink / raw) To: linux-fsdevel, cgroups, linux-mm Cc: viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe, tj This patch series aims to eliminate task hangs in mem_cgroup_css_free() caused by calling wb_wait_for_completion(). This is because there may be a large number of writeback tasks in the foreign memcg, involving millions of pages, and the situation is exacerbated by WBT rate limiting—potentially leading to task hangs lasting several hours. Patch 1 is preparatory work and involves no functional changes. Patch 2 implements the automatic release of wb_completion. Patch 3 removes wb_wait_for_completion() from mem_cgroup_css_free(). Julian Sun (3): writeback: Rename wb_writeback_work->auto_free to free_work. writeback: Add wb_writeback_work->free_done memcg: Don't wait writeback completion when release memcg. fs/fs-writeback.c | 22 ++++++++++++++-------- include/linux/backing-dev-defs.h | 6 ++++++ include/linux/memcontrol.h | 2 +- mm/memcontrol.c | 29 ++++++++++++++++++++--------- 4 files changed, 41 insertions(+), 18 deletions(-) -- 2.20.1 ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 1/3] writeback: Rename wb_writeback_work->auto_free to free_work. 2025-08-20 11:19 [PATCH 0/3] memcg, writeback: Don't wait writeback completion Julian Sun @ 2025-08-20 11:19 ` Julian Sun 2025-08-20 11:19 ` [PATCH] writeback: Add wb_writeback_work->free_done Julian Sun ` (2 subsequent siblings) 3 siblings, 0 replies; 26+ messages in thread From: Julian Sun @ 2025-08-20 11:19 UTC (permalink / raw) To: linux-fsdevel, cgroups, linux-mm Cc: viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe, tj This patch is to prepare for subsequent patches. It only does 's/auto_free/free_work' with no functional changes. Signed-off-by: Julian Sun <sunjunchao@bytedance.com> --- fs/fs-writeback.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index cc57367fb641..4a6c22df5649 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -48,7 +48,7 @@ struct wb_writeback_work { unsigned int range_cyclic:1; unsigned int for_background:1; unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ - unsigned int auto_free:1; /* free on completion */ + unsigned int free_work:1; /* free work on completion */ enum wb_reason reason; /* why was writeback initiated? */ struct list_head list; /* pending work list */ @@ -170,7 +170,7 @@ static void finish_writeback_work(struct wb_writeback_work *work) { struct wb_completion *done = work->done; - if (work->auto_free) + if (work->free_work) kfree(work); if (done) { wait_queue_head_t *waitq = done->waitq; @@ -1029,7 +1029,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, if (work) { *work = *base_work; work->nr_pages = nr_pages; - work->auto_free = 1; + work->free_work = 1; wb_queue_work(wb, work); continue; } @@ -1048,7 +1048,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, work = &fallback_work; *work = *base_work; work->nr_pages = nr_pages; - work->auto_free = 0; + work->free_work = 0; work->done = &fallback_work_done; wb_queue_work(wb, work); @@ -1130,7 +1130,7 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, work->range_cyclic = 1; work->reason = reason; work->done = done; - work->auto_free = 1; + work->free_work = 1; wb_queue_work(wb, work); ret = 0; } else { @@ -1237,7 +1237,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, might_sleep(); if (!skip_if_busy || !writeback_in_progress(&bdi->wb)) { - base_work->auto_free = 0; + base_work->free_work = 0; wb_queue_work(&bdi->wb, base_work); } } -- 2.20.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH] writeback: Add wb_writeback_work->free_done 2025-08-20 11:19 [PATCH 0/3] memcg, writeback: Don't wait writeback completion Julian Sun 2025-08-20 11:19 ` [PATCH 1/3] writeback: Rename wb_writeback_work->auto_free to free_work Julian Sun @ 2025-08-20 11:19 ` Julian Sun 2025-08-20 11:19 ` [PATCH] memcg: Don't wait writeback completion when release memcg Julian Sun 2025-08-20 12:16 ` [PATCH 0/3] memcg, writeback: Don't wait writeback completion Giorgi Tchankvetadze 3 siblings, 0 replies; 26+ messages in thread From: Julian Sun @ 2025-08-20 11:19 UTC (permalink / raw) To: linux-fsdevel, cgroups, linux-mm Cc: viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe, tj Add wb_writeback_work->free_done, which is used to free the wb_completion variable when the reference count becomes zero. Signed-off-by: Julian Sun <sunjunchao@bytedance.com> --- fs/fs-writeback.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 4a6c22df5649..56faf5c3d064 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -49,6 +49,7 @@ struct wb_writeback_work { unsigned int for_background:1; unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ unsigned int free_work:1; /* free work on completion */ + unsigned int free_done:1; /* free wb_completion on completion */ enum wb_reason reason; /* why was writeback initiated? */ struct list_head list; /* pending work list */ @@ -169,15 +170,19 @@ static void wb_wakeup_delayed(struct bdi_writeback *wb) static void finish_writeback_work(struct wb_writeback_work *work) { struct wb_completion *done = work->done; + bool free_done = work->free_done; if (work->free_work) kfree(work); if (done) { wait_queue_head_t *waitq = done->waitq; - /* @done can't be accessed after the following dec */ - if (atomic_dec_and_test(&done->cnt)) + /* @done can't be accessed after the following dec unless free_done is true */ + if (atomic_dec_and_test(&done->cnt)) { wake_up_all(waitq); + if (free_done) + kfree(done); + } } } -- 2.20.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-20 11:19 [PATCH 0/3] memcg, writeback: Don't wait writeback completion Julian Sun 2025-08-20 11:19 ` [PATCH 1/3] writeback: Rename wb_writeback_work->auto_free to free_work Julian Sun 2025-08-20 11:19 ` [PATCH] writeback: Add wb_writeback_work->free_done Julian Sun @ 2025-08-20 11:19 ` Julian Sun 2025-08-20 20:58 ` Tejun Heo 2025-08-20 12:16 ` [PATCH 0/3] memcg, writeback: Don't wait writeback completion Giorgi Tchankvetadze 3 siblings, 1 reply; 26+ messages in thread From: Julian Sun @ 2025-08-20 11:19 UTC (permalink / raw) To: linux-fsdevel, cgroups, linux-mm Cc: viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe, tj Recently, we encountered the following hung task: INFO: task kworker/4:1:1334558 blocked for more than 1720 seconds. [Wed Jul 30 17:47:45 2025] Workqueue: cgroup_destroy css_free_rwork_fn [Wed Jul 30 17:47:45 2025] Call Trace: [Wed Jul 30 17:47:45 2025] __schedule+0x934/0xe10 [Wed Jul 30 17:47:45 2025] ? complete+0x3b/0x50 [Wed Jul 30 17:47:45 2025] ? _cond_resched+0x15/0x30 [Wed Jul 30 17:47:45 2025] schedule+0x40/0xb0 [Wed Jul 30 17:47:45 2025] wb_wait_for_completion+0x52/0x80 [Wed Jul 30 17:47:45 2025] ? finish_wait+0x80/0x80 [Wed Jul 30 17:47:45 2025] mem_cgroup_css_free+0x22/0x1b0 [Wed Jul 30 17:47:45 2025] css_free_rwork_fn+0x42/0x380 [Wed Jul 30 17:47:45 2025] process_one_work+0x1a2/0x360 [Wed Jul 30 17:47:45 2025] worker_thread+0x30/0x390 [Wed Jul 30 17:47:45 2025] ? create_worker+0x1a0/0x1a0 [Wed Jul 30 17:47:45 2025] kthread+0x110/0x130 [Wed Jul 30 17:47:45 2025] ? __kthread_cancel_work+0x40/0x40 [Wed Jul 30 17:47:45 2025] ret_from_fork+0x1f/0x30 The direct cause is that memcg spends a long time waiting for dirty page writeback of foreign memcgs during release. The root causes are: a. The wb may have multiple writeback tasks, containing hundreds of thousands of dirty pages, as shown below: >>> for work in list_for_each_entry("struct wb_writeback_work", \ wb.work_list.address_of_(), "list"): ... print(work.nr_pages, work.reason, hex(work)) ... 900628 WB_REASON_FOREIGN_FLUSH 0xffff969e8d956b40 1116521 WB_REASON_FOREIGN_FLUSH 0xffff9698332a9540 1275228 WB_REASON_FOREIGN_FLUSH 0xffff969d9b444bc0 1099673 WB_REASON_FOREIGN_FLUSH 0xffff969f0954d6c0 1351522 WB_REASON_FOREIGN_FLUSH 0xffff969e76713340 2567437 WB_REASON_FOREIGN_FLUSH 0xffff9694ae208400 2954033 WB_REASON_FOREIGN_FLUSH 0xffff96a22d62cbc0 3008860 WB_REASON_FOREIGN_FLUSH 0xffff969eee8ce3c0 3337932 WB_REASON_FOREIGN_FLUSH 0xffff9695b45156c0 3348916 WB_REASON_FOREIGN_FLUSH 0xffff96a22c7a4f40 3345363 WB_REASON_FOREIGN_FLUSH 0xffff969e5d872800 3333581 WB_REASON_FOREIGN_FLUSH 0xffff969efd0f4600 3382225 WB_REASON_FOREIGN_FLUSH 0xffff969e770edcc0 3418770 WB_REASON_FOREIGN_FLUSH 0xffff96a252ceea40 3387648 WB_REASON_FOREIGN_FLUSH 0xffff96a3bda86340 3385420 WB_REASON_FOREIGN_FLUSH 0xffff969efc6eb280 3418730 WB_REASON_FOREIGN_FLUSH 0xffff96a348ab1040 3426155 WB_REASON_FOREIGN_FLUSH 0xffff969d90beac00 3397995 WB_REASON_FOREIGN_FLUSH 0xffff96a2d7288800 3293095 WB_REASON_FOREIGN_FLUSH 0xffff969dab423240 3293595 WB_REASON_FOREIGN_FLUSH 0xffff969c765ff400 3199511 WB_REASON_FOREIGN_FLUSH 0xffff969a72d5e680 3085016 WB_REASON_FOREIGN_FLUSH 0xffff969f0455e000 3035712 WB_REASON_FOREIGN_FLUSH 0xffff969d9bbf4b00 b. The writeback might severely throttled by wbt, with a speed possibly less than 100kb/s, leading to a very long writeback time. >>> wb.write_bandwidth (unsigned long)24 >>> wb.write_bandwidth (unsigned long)13 The wb_wait_for_completion() here is probably only used to prevent use-after-free. Therefore, we manage 'done' separately and automatically free this memory in finish_writeback_work(). This allows us to remove wb_wait_for_completion() while preventing the use-after-free issue. Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing") Signed-off-by: Julian Sun <sunjunchao@bytedance.com> --- fs/fs-writeback.c | 1 + include/linux/backing-dev-defs.h | 6 ++++++ include/linux/memcontrol.h | 2 +- mm/memcontrol.c | 31 ++++++++++++++++++++++--------- 4 files changed, 30 insertions(+), 10 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 56faf5c3d064..fac3ef7e95ae 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1136,6 +1136,7 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, work->reason = reason; work->done = done; work->free_work = 1; + work->free_done = 1; wb_queue_work(wb, work); ret = 0; } else { diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 2ad261082bba..bf28dd9a4783 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -208,6 +208,12 @@ struct wb_lock_cookie { unsigned long flags; }; +static inline void wb_completion_init(struct wb_completion *done, struct wait_queue_head *waitq) +{ + atomic_set(&done->cnt, 1); + done->waitq = waitq; +} + #ifdef CONFIG_CGROUP_WRITEBACK /** diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 785173aa0739..80a6bafbb24a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -161,7 +161,7 @@ struct memcg_cgwb_frn { u64 bdi_id; /* bdi->id of the foreign inode */ int memcg_id; /* memcg->css.id of foreign inode */ u64 at; /* jiffies_64 at the time of dirtying */ - struct wb_completion done; /* tracks in-flight foreign writebacks */ + struct wb_completion *done; /* tracks in-flight foreign writebacks */ }; /* diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8dd7fbed5a94..6e6a1ce7589a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3459,7 +3459,7 @@ void mem_cgroup_track_foreign_dirty_slowpath(struct folio *folio, frn->memcg_id == wb->memcg_css->id) break; if (time_before64(frn->at, oldest_at) && - atomic_read(&frn->done.cnt) == 1) { + atomic_read(&frn->done->cnt) == 1) { oldest = i; oldest_at = frn->at; } @@ -3506,12 +3506,12 @@ void mem_cgroup_flush_foreign(struct bdi_writeback *wb) * already one in flight. */ if (time_after64(frn->at, now - intv) && - atomic_read(&frn->done.cnt) == 1) { + atomic_read(&frn->done->cnt) == 1) { frn->at = 0; trace_flush_foreign(wb, frn->bdi_id, frn->memcg_id); cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id, WB_REASON_FOREIGN_FLUSH, - &frn->done); + frn->done); } } } @@ -3708,7 +3708,7 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) struct memcg_vmstats_percpu __percpu *pstatc_pcpu; struct mem_cgroup *memcg; int node, cpu; - int __maybe_unused i; + int __maybe_unused i = 0; long error; memcg = kmem_cache_zalloc(memcg_cachep, GFP_KERNEL); @@ -3763,9 +3763,14 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) INIT_LIST_HEAD(&memcg->objcg_list); #ifdef CONFIG_CGROUP_WRITEBACK INIT_LIST_HEAD(&memcg->cgwb_list); - for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) - memcg->cgwb_frn[i].done = - __WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq); + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { + struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i]; + + frn->done = kmalloc(sizeof(struct wb_completion), GFP_KERNEL); + if (!frn->done) + goto fail; + wb_completion_init(frn->done, &memcg_cgwb_frn_waitq); + } #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE spin_lock_init(&memcg->deferred_split_queue.split_queue_lock); @@ -3775,6 +3780,10 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent) lru_gen_init_memcg(memcg); return memcg; fail: +#ifdef CONFIG_CGROUP_WRITEBACK + while (i--) + kfree(memcg->cgwb_frn[i].done); +#endif mem_cgroup_id_remove(memcg); __mem_cgroup_free(memcg); return ERR_PTR(error); @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) int __maybe_unused i; #ifdef CONFIG_CGROUP_WRITEBACK - for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) - wb_wait_for_completion(&memcg->cgwb_frn[i].done); + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { + struct wb_completion *done = memcg->cgwb_frn[i].done; + + if (atomic_dec_and_test(&done->cnt)) + kfree(done); + } #endif if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket) static_branch_dec(&memcg_sockets_enabled_key); -- 2.20.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-20 11:19 ` [PATCH] memcg: Don't wait writeback completion when release memcg Julian Sun @ 2025-08-20 20:58 ` Tejun Heo 2025-08-21 2:30 ` [External] " Julian Sun 0 siblings, 1 reply; 26+ messages in thread From: Tejun Heo @ 2025-08-20 20:58 UTC (permalink / raw) To: Julian Sun Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote: > @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) > int __maybe_unused i; > > #ifdef CONFIG_CGROUP_WRITEBACK > - for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) > - wb_wait_for_completion(&memcg->cgwb_frn[i].done); > + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { > + struct wb_completion *done = memcg->cgwb_frn[i].done; > + > + if (atomic_dec_and_test(&done->cnt)) > + kfree(done); > + } > #endif Can't you just remove done? I don't think it's doing anything after your changes anyway. Thanks. -- tejun ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-20 20:58 ` Tejun Heo @ 2025-08-21 2:30 ` Julian Sun 2025-08-21 16:59 ` Tejun Heo 2025-08-25 10:13 ` Jan Kara 0 siblings, 2 replies; 26+ messages in thread From: Julian Sun @ 2025-08-21 2:30 UTC (permalink / raw) To: Tejun Heo Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe On Thu, Aug 21, 2025 at 4:58 AM Tejun Heo <tj@kernel.org> wrote: > > On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote: > > @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) > > int __maybe_unused i; > > > > #ifdef CONFIG_CGROUP_WRITEBACK > > - for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) > > - wb_wait_for_completion(&memcg->cgwb_frn[i].done); > > + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { > > + struct wb_completion *done = memcg->cgwb_frn[i].done; > > + > > + if (atomic_dec_and_test(&done->cnt)) > > + kfree(done); > > + } > > #endif > > Can't you just remove done? I don't think it's doing anything after your > changes anyway. Thanks for your review. AFAICT done is also used to track free slots in mem_cgroup_track_foreign_dirty_slowpath() and mem_cgroup_flush_foreign(), otherwise we have no method to know which one is free and might flush more than what MEMCG_CGWB_FRN_CNT allow. Am I missing something? Thanks, > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-21 2:30 ` [External] " Julian Sun @ 2025-08-21 16:59 ` Tejun Heo 2025-08-21 18:00 ` Julian Sun 2025-08-25 10:13 ` Jan Kara 1 sibling, 1 reply; 26+ messages in thread From: Tejun Heo @ 2025-08-21 16:59 UTC (permalink / raw) To: Julian Sun Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe Hello, On Thu, Aug 21, 2025 at 10:30:30AM +0800, Julian Sun wrote: > On Thu, Aug 21, 2025 at 4:58 AM Tejun Heo <tj@kernel.org> wrote: > > > > On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote: > > > @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) > > > int __maybe_unused i; > > > > > > #ifdef CONFIG_CGROUP_WRITEBACK > > > - for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) > > > - wb_wait_for_completion(&memcg->cgwb_frn[i].done); > > > + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { > > > + struct wb_completion *done = memcg->cgwb_frn[i].done; > > > + > > > + if (atomic_dec_and_test(&done->cnt)) > > > + kfree(done); > > > + } > > > #endif > > > > Can't you just remove done? I don't think it's doing anything after your > > changes anyway. > > Thanks for your review. > > AFAICT done is also used to track free slots in > mem_cgroup_track_foreign_dirty_slowpath() and > mem_cgroup_flush_foreign(), otherwise we have no method to know which > one is free and might flush more than what MEMCG_CGWB_FRN_CNT allow. > > Am I missing something? No, I missed that. I don't think we need to add extra mechanisms in wb for this tho. How about shifting wb_wait_for_completion() and kfree(memcg) into a separate function and punt those to a separate work item? That's going to be a small self-contained change in memcg. Thanks. -- tejun ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-21 16:59 ` Tejun Heo @ 2025-08-21 18:00 ` Julian Sun 2025-08-21 18:16 ` Julian Sun 2025-08-21 19:01 ` Tejun Heo 0 siblings, 2 replies; 26+ messages in thread From: Julian Sun @ 2025-08-21 18:00 UTC (permalink / raw) To: Tejun Heo Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe Hi, On Fri, Aug 22, 2025 at 12:59 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Thu, Aug 21, 2025 at 10:30:30AM +0800, Julian Sun wrote: > > On Thu, Aug 21, 2025 at 4:58 AM Tejun Heo <tj@kernel.org> wrote: > > > > > > On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote: > > > > @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) > > > > int __maybe_unused i; > > > > > > > > #ifdef CONFIG_CGROUP_WRITEBACK > > > > - for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) > > > > - wb_wait_for_completion(&memcg->cgwb_frn[i].done); > > > > + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { > > > > + struct wb_completion *done = memcg->cgwb_frn[i].done; > > > > + > > > > + if (atomic_dec_and_test(&done->cnt)) > > > > + kfree(done); > > > > + } > > > > #endif > > > > > > Can't you just remove done? I don't think it's doing anything after your > > > changes anyway. > > > > Thanks for your review. > > > > AFAICT done is also used to track free slots in > > mem_cgroup_track_foreign_dirty_slowpath() and > > mem_cgroup_flush_foreign(), otherwise we have no method to know which > > one is free and might flush more than what MEMCG_CGWB_FRN_CNT allow. > > > > Am I missing something? > > No, I missed that. I don't think we need to add extra mechanisms in wb for > this tho. How about shifting wb_wait_for_completion() and kfree(memcg) into > a separate function and punt those to a separate work item? That's going to > be a small self-contained change in memcg. > Do you mean logic like this? for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) wb_wait_for_completion(&memcg->cgwb_frn[i].done); kfree(memcg); But there still exist task hang issues as long as wb_wait_for_completion() exists. I think the scope of impact of the current changes should be manageable. I have checked all the other places where wb_queue_work() is called, and their free_done values are all 0, and I also tested this patch with the reproducer in [1] with kasan and kmemleak enabled. The test result looks fine, so this should not have a significant impact. What do you think? [1]: https://lore.kernel.org/all/20190821210235.GN2263813@devbig004.ftw2.facebook.com/ > Thanks. > > -- > tejun Thanks, -- Julian Sun <sunjunchao@bytedance.com> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-21 18:00 ` Julian Sun @ 2025-08-21 18:16 ` Julian Sun 2025-08-21 19:01 ` Tejun Heo 1 sibling, 0 replies; 26+ messages in thread From: Julian Sun @ 2025-08-21 18:16 UTC (permalink / raw) To: Tejun Heo Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe On Fri, Aug 22, 2025 at 2:00 AM Julian Sun <sunjunchao@bytedance.com> wrote: > > Hi, > > On Fri, Aug 22, 2025 at 12:59 AM Tejun Heo <tj@kernel.org> wrote: > > > > Hello, > > > > On Thu, Aug 21, 2025 at 10:30:30AM +0800, Julian Sun wrote: > > > On Thu, Aug 21, 2025 at 4:58 AM Tejun Heo <tj@kernel.org> wrote: > > > > > > > > On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote: > > > > > @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) > > > > > int __maybe_unused i; > > > > > > > > > > #ifdef CONFIG_CGROUP_WRITEBACK > > > > > - for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) > > > > > - wb_wait_for_completion(&memcg->cgwb_frn[i].done); > > > > > + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { > > > > > + struct wb_completion *done = memcg->cgwb_frn[i].done; > > > > > + > > > > > + if (atomic_dec_and_test(&done->cnt)) > > > > > + kfree(done); > > > > > + } > > > > > #endif > > > > > > > > Can't you just remove done? I don't think it's doing anything after your > > > > changes anyway. > > > > > > Thanks for your review. > > > > > > AFAICT done is also used to track free slots in > > > mem_cgroup_track_foreign_dirty_slowpath() and > > > mem_cgroup_flush_foreign(), otherwise we have no method to know which > > > one is free and might flush more than what MEMCG_CGWB_FRN_CNT allow. > > > > > > Am I missing something? > > > > No, I missed that. I don't think we need to add extra mechanisms in wb for > > this tho. How about shifting wb_wait_for_completion() and kfree(memcg) into > > a separate function and punt those to a separate work item? That's going to > > be a small self-contained change in memcg. > > > > Do you mean logic like this? > > for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) > wb_wait_for_completion(&memcg->cgwb_frn[i].done); > kfree(memcg); > > But there still exist task hang issues as long as > wb_wait_for_completion() exists. > I think the scope of impact of the current changes should be > manageable. I have checked all the other places where wb_queue_work() > is called, and their free_done values are all 0, and I also tested > this patch with the reproducer in [1] with kasan and kmemleak enabled. > The test result looks fine, so this should not have a significant > impact. BTW, the test case is like this — it ran for over a night. while true; do ./repro.sh && sleep 300 && ./stop.sh; done sjc@debian:~/linux$ cat stop.sh #!/bin/bash # TEST=/sys/fs/cgroup/test A=$TEST/A B=$TEST/B echo "-memory" > $TEST/cgroup.subtree_control pkill write-range sync sleep 5 sync sleep 5 echo 3 > /proc/sys/vm/drop_caches rmdir $A $B > What do you think? > > [1]: https://lore.kernel.org/all/20190821210235.GN2263813@devbig004.ftw2.facebook.com/ > > Thanks. > > > > -- > > tejun > > > Thanks, > -- > Julian Sun <sunjunchao@bytedance.com> Thanks, -- Julian Sun <sunjunchao@bytedance.com> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-21 18:00 ` Julian Sun 2025-08-21 18:16 ` Julian Sun @ 2025-08-21 19:01 ` Tejun Heo 2025-08-22 8:22 ` Julian Sun 1 sibling, 1 reply; 26+ messages in thread From: Tejun Heo @ 2025-08-21 19:01 UTC (permalink / raw) To: Julian Sun Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe Hello, On Fri, Aug 22, 2025 at 02:00:10AM +0800, Julian Sun wrote: ... > Do you mean logic like this? > > for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) > wb_wait_for_completion(&memcg->cgwb_frn[i].done); > kfree(memcg); > > But there still exist task hang issues as long as > wb_wait_for_completion() exists. Ah, right. I was just thinking about the workqueue being stalled. The problem is that the wait itself is too long. > I think the scope of impact of the current changes should be > manageable. I have checked all the other places where wb_queue_work() > is called, and their free_done values are all 0, and I also tested > this patch with the reproducer in [1] with kasan and kmemleak enabled. > The test result looks fine, so this should not have a significant > impact. > What do you think? My source of reluctance is that it's a peculiar situation where flushing of a cgroup takes that long due to hard throttling and the self-freeing mechanism isn't the prettiest thing. Do you think you can do the same thing through custom waitq wakeup function? Thanks. -- tejun ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-21 19:01 ` Tejun Heo @ 2025-08-22 8:22 ` Julian Sun 2025-08-22 17:56 ` Tejun Heo 0 siblings, 1 reply; 26+ messages in thread From: Julian Sun @ 2025-08-22 8:22 UTC (permalink / raw) To: Tejun Heo Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe On 8/22/25 3:01 AM, Tejun Heo wrote: Hi, > Hello, > > On Fri, Aug 22, 2025 at 02:00:10AM +0800, Julian Sun wrote: > ... >> Do you mean logic like this? >> >> for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) >> wb_wait_for_completion(&memcg->cgwb_frn[i].done); >> kfree(memcg); >> >> But there still exist task hang issues as long as >> wb_wait_for_completion() exists. > > Ah, right. I was just thinking about the workqueue being stalled. The > problem is that the wait itself is too long. > >> I think the scope of impact of the current changes should be >> manageable. I have checked all the other places where wb_queue_work() >> is called, and their free_done values are all 0, and I also tested >> this patch with the reproducer in [1] with kasan and kmemleak enabled. >> The test result looks fine, so this should not have a significant >> impact. >> What do you think? > > My source of reluctance is that it's a peculiar situation where flushing of > a cgroup takes that long due to hard throttling and the self-freeing > mechanism isn't the prettiest thing. Do you think you can do the same thing > through custom waitq wakeup function? Yeah, this method looks more general if I understand correctly. If the idea of the following code makes sense to you, I'd like to split and convert it into formal patches. diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index a07b8cf73ae2..10fede792178 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -172,13 +172,8 @@ static void finish_writeback_work(struct wb_writeback_work *work) if (work->auto_free) kfree(work); - if (done) { - wait_queue_head_t *waitq = done->waitq; - - /* @done can't be accessed after the following dec */ - if (atomic_dec_and_test(&done->cnt)) - wake_up_all(waitq); - } + if (done) + done->wb_waitq->wb_wakeup_func(done->wb_waitq, done); } static void wb_queue_work(struct bdi_writeback *wb, @@ -213,7 +208,7 @@ static void wb_queue_work(struct bdi_writeback *wb, void wb_wait_for_completion(struct wb_completion *done) { atomic_dec(&done->cnt); /* put down the initial count */ - wait_event(*done->waitq, !atomic_read(&done->cnt)); + wait_event(done->wb_waitq->waitq, !atomic_read(&done->cnt)); } #ifdef CONFIG_CGROUP_WRITEBACK diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 2ad261082bba..04699458ac50 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -60,13 +60,56 @@ enum wb_reason { WB_REASON_MAX, }; +struct wb_completion; +typedef struct wb_wait_queue_head wb_wait_queue_head_t; +typedef void (*wb_wait_wakeup_func_t)(wb_wait_queue_head_t *wq_waitq, + struct wb_completion *done); +struct wb_wait_queue_head { + wait_queue_head_t waitq; + wb_wait_wakeup_func_t wb_wakeup_func; +}; + struct wb_completion { atomic_t cnt; - wait_queue_head_t *waitq; + wb_wait_queue_head_t *wb_waitq; }; +static inline void wb_default_wakeup_func(wb_wait_queue_head_t *wq_waitq, + struct wb_completion *done) +{ + /* @done can't be accessed after the following dec */ + if (atomic_dec_and_test(&done->cnt)) + wake_up_all(&wq_waitq->waitq); +} + +/* used for cgwb_frn, be careful here, @done can't be accessed */ +static inline void wb_empty_wakeup_func(wb_wait_queue_head_t *wq_waitq, + struct wb_completion *done) +{ +} + +#define __init_wb_waitqueue_head(wb_waitq, func) \ + do { \ + init_waitqueue_head(&wb_waitq.waitq); \ + wb_waitq.wb_wakeup_func = func; \ + } while (0) + +#define init_wb_waitqueue_head(wb_waitq) \ + __init_wb_waitqueue_head(wb_waitq, wb_default_wakeup_func) + +#define __WB_WAIT_QUEUE_HEAD_INITIALIZER(name, func) { \ + .waitq = __WAIT_QUEUE_HEAD_INITIALIZER(name.waitq), \ + .wb_wakeup_func = func, \ +} + +#define __DECLARE_WB_WAIT_QUEUE_HEAD(name, func) \ + struct wb_wait_queue_head name = __WB_WAIT_QUEUE_HEAD_INITIALIZER(name, func) + +#define DECLARE_WB_WAIT_QUEUE_HEAD(name) \ + __DECLARE_WB_WAIT_QUEUE_HEAD(name, wb_default_wakeup_func) + #define __WB_COMPLETION_INIT(_waitq) \ - (struct wb_completion){ .cnt = ATOMIC_INIT(1), .waitq = (_waitq) } + (struct wb_completion){ .cnt = ATOMIC_INIT(1), .wb_waitq = (_waitq) } /* * If one wants to wait for one or more wb_writeback_works, each work's @@ -190,7 +233,7 @@ struct backing_dev_info { struct mutex cgwb_release_mutex; /* protect shutdown of wb structs */ struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ #endif - wait_queue_head_t wb_waitq; + wb_wait_queue_head_t wb_waitq; struct device *dev; char dev_name[64]; diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 783904d8c5ef..c4fec9e22978 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -1008,7 +1008,7 @@ int bdi_init(struct backing_dev_info *bdi) bdi->max_prop_frac = FPROP_FRAC_BASE; INIT_LIST_HEAD(&bdi->bdi_list); INIT_LIST_HEAD(&bdi->wb_list); - init_waitqueue_head(&bdi->wb_waitq); + init_wb_waitqueue_head(bdi->wb_waitq); bdi->last_bdp_sleep = jiffies; return cgwb_bdi_init(bdi); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8dd7fbed5a94..999624535470 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -99,7 +99,7 @@ static struct kmem_cache *memcg_cachep; static struct kmem_cache *memcg_pn_cachep; #ifdef CONFIG_CGROUP_WRITEBACK -static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq); +static __DECLARE_WB_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq, wb_empty_wakeup_func); #endif static inline bool task_is_dying(void) @@ -3909,12 +3909,7 @@ static void mem_cgroup_css_released(struct cgroup_subsys_state *css) static void mem_cgroup_css_free(struct cgroup_subsys_state *css) { struct mem_cgroup *memcg = mem_cgroup_from_css(css); - int __maybe_unused i; -#ifdef CONFIG_CGROUP_WRITEBACK - for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) - wb_wait_for_completion(&memcg->cgwb_frn[i].done); -#endif if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket) static_branch_dec(&memcg_sockets_enabled_key); > > Thanks. > Thanks, -- Julian Sun <sunjunchao@bytedance.com> ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-22 8:22 ` Julian Sun @ 2025-08-22 17:56 ` Tejun Heo 2025-08-23 6:18 ` Julian Sun 2025-08-25 17:45 ` Julian Sun 0 siblings, 2 replies; 26+ messages in thread From: Tejun Heo @ 2025-08-22 17:56 UTC (permalink / raw) To: Julian Sun Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe Hello, On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun wrote: > +struct wb_wait_queue_head { > + wait_queue_head_t waitq; > + wb_wait_wakeup_func_t wb_wakeup_func; > +}; wait_queue_head_t itself already allows overriding the wakeup function. Please look for init_wait_func() usages in the tree. Hopefully, that should contain the changes within memcg. Thanks. -- tejun ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-22 17:56 ` Tejun Heo @ 2025-08-23 6:18 ` Julian Sun 2025-08-23 8:08 ` Giorgi Tchankvetadze 2025-08-25 17:45 ` Julian Sun 1 sibling, 1 reply; 26+ messages in thread From: Julian Sun @ 2025-08-23 6:18 UTC (permalink / raw) To: Tejun Heo Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe Hi, On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t waitq; > > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t itself already allows overriding the wakeup function. > Please look for init_wait_func() usages in the tree. Hopefully, that should > contain the changes within memcg. Well... Yes, I checked this function before, but it can't do the same thing as in the previous email. There are some differences—please check the code in the last email. First, let's clarify: the key point here is that if we want to remove wb_wait_for_completion() and avoid self-freeing, we must not access "done" in finish_writeback_work(), otherwise it will cause a UAF. However, init_wait_func() can't achieve this. Of course, I also admit that the method in the previous email seems a bit odd. To summarize again, the root causes of the problem here are: 1. When memcg is released, it calls wb_wait_for_completion() to prevent UAF, which is completely unnecessary—cgwb_frn only needs to issue wb work and no need to wait writeback finished. 2. The current finish_writeback_work() will definitely dereference "done", which may lead to UAF. Essentially, cgwb_frn introduces a new scenario where no wake-up is needed. Therefore, we just need to make finish_writeback_work() not dereference "done" and not wake up the waiting thread. However, this cannot keep the modifications within memcg... Please correct me if my understanding is incorrect. > > Thanks. > > -- > tejun Thanks, -- Julian Sun <sunjunchao@bytedance.com> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-23 6:18 ` Julian Sun @ 2025-08-23 8:08 ` Giorgi Tchankvetadze 2025-08-23 8:22 ` Julian Sun 0 siblings, 1 reply; 26+ messages in thread From: Giorgi Tchankvetadze @ 2025-08-23 8:08 UTC (permalink / raw) To: sunjunchao Cc: axboe, brauner, cgroups, hannes, jack, linux-fsdevel, linux-mm, mhocko, muchun.song, roman.gushchin, shakeel.butt, tj, viro Hi there. Can we fix this by allowing callers to set work->done = NULL when no completion is desired? The already-existing "if (done)" check in finish_writeback_work() already provides the necessary protection, so the change is purely mechanical. On 8/23/2025 10:18 AM, Julian Sun wrote: > Hi, > > On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote: >> > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun > wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t waitq; > > > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t > itself already allows overriding the wakeup function. > Please look for > init_wait_func() usages in the tree. Hopefully, that should > contain > the changes within memcg. > Well... Yes, I checked this function before, but it can't do the same > thing as in the previous email. There are some differences—please > check the code in the last email. > > First, let's clarify: the key point here is that if we want to remove > wb_wait_for_completion() and avoid self-freeing, we must not access > "done" in finish_writeback_work(), otherwise it will cause a UAF. > However, init_wait_func() can't achieve this. Of course, I also admit > that the method in the previous email seems a bit odd. > > To summarize again, the root causes of the problem here are: > 1. When memcg is released, it calls wb_wait_for_completion() to > prevent UAF, which is completely unnecessary—cgwb_frn only needs to > issue wb work and no need to wait writeback finished. > 2. The current finish_writeback_work() will definitely dereference > "done", which may lead to UAF. > > Essentially, cgwb_frn introduces a new scenario where no wake-up is > needed. Therefore, we just need to make finish_writeback_work() not > dereference "done" and not wake up the waiting thread. However, this > cannot keep the modifications within memcg... > > Please correct me if my understanding is incorrect. >> > Thanks. > > -- > tejun > > Thanks, > -- > Julian Sun <sunjunchao@bytedance.com> > > Hi, > > On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote: >> > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun > wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t waitq; > > > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t > itself already allows overriding the wakeup function. > Please look for > init_wait_func() usages in the tree. Hopefully, that should > contain > the changes within memcg. > Well... Yes, I checked this function before, but it can't do the same > thing as in the previous email. There are some differences—please > check the code in the last email. > > First, let's clarify: the key point here is that if we want to remove > wb_wait_for_completion() and avoid self-freeing, we must not access > "done" in finish_writeback_work(), otherwise it will cause a UAF. > However, init_wait_func() can't achieve this. Of course, I also admit > that the method in the previous email seems a bit odd. > > To summarize again, the root causes of the problem here are: > 1. When memcg is released, it calls wb_wait_for_completion() to > prevent UAF, which is completely unnecessary—cgwb_frn only needs to > issue wb work and no need to wait writeback finished. > 2. The current finish_writeback_work() will definitely dereference > "done", which may lead to UAF. > > Essentially, cgwb_frn introduces a new scenario where no wake-up is > needed. Therefore, we just need to make finish_writeback_work() not > dereference "done" and not wake up the waiting thread. However, this > cannot keep the modifications within memcg... > > Please correct me if my understanding is incorrect. >> > Thanks. > > -- > tejun > > Thanks, > -- > Julian Sun <sunjunchao@bytedance.com> > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-23 8:08 ` Giorgi Tchankvetadze @ 2025-08-23 8:22 ` Julian Sun 2025-08-23 14:08 ` Giorgi Tchankvetadze 0 siblings, 1 reply; 26+ messages in thread From: Julian Sun @ 2025-08-23 8:22 UTC (permalink / raw) To: Giorgi Tchankvetadze Cc: axboe, brauner, cgroups, hannes, jack, linux-fsdevel, linux-mm, mhocko, muchun.song, roman.gushchin, shakeel.butt, tj, viro Hi, On Sat, Aug 23, 2025 at 4:08 PM Giorgi Tchankvetadze <giorgitchankvetadze1997@gmail.com> wrote: > > Hi there. Can we fix this by allowing callers to set work->done = NULL > when no completion is desired? No, we can't do that. Because cgwb_frn needs to track the state of wb work by work->done.cnt, if we set work->done = Null, then we can not know whether the wb work finished or not. See mem_cgroup_track_foreign_dirty_slowpath() and mem_cgroup_flush_foreign() for details. > The already-existing "if (done)" check in finish_writeback_work() > already provides the necessary protection, so the change is purely > mechanical. > > > > On 8/23/2025 10:18 AM, Julian Sun wrote: > > Hi, > > > > On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote: > >> > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun > > wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t waitq; > > > > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t > > itself already allows overriding the wakeup function. > Please look for > > init_wait_func() usages in the tree. Hopefully, that should > contain > > the changes within memcg. > > Well... Yes, I checked this function before, but it can't do the same > > thing as in the previous email. There are some differences—please > > check the code in the last email. > > > > First, let's clarify: the key point here is that if we want to remove > > wb_wait_for_completion() and avoid self-freeing, we must not access > > "done" in finish_writeback_work(), otherwise it will cause a UAF. > > However, init_wait_func() can't achieve this. Of course, I also admit > > that the method in the previous email seems a bit odd. > > > > To summarize again, the root causes of the problem here are: > > 1. When memcg is released, it calls wb_wait_for_completion() to > > prevent UAF, which is completely unnecessary—cgwb_frn only needs to > > issue wb work and no need to wait writeback finished. > > 2. The current finish_writeback_work() will definitely dereference > > "done", which may lead to UAF. > > > > Essentially, cgwb_frn introduces a new scenario where no wake-up is > > needed. Therefore, we just need to make finish_writeback_work() not > > dereference "done" and not wake up the waiting thread. However, this > > cannot keep the modifications within memcg... > > > > Please correct me if my understanding is incorrect. > >> > Thanks. > > -- > tejun > > > > Thanks, > > -- > > Julian Sun <sunjunchao@bytedance.com> > > > > > > Hi, > > > > On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote: > >> > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun > > wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t waitq; > > > > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t > > itself already allows overriding the wakeup function. > Please look for > > init_wait_func() usages in the tree. Hopefully, that should > contain > > the changes within memcg. > > Well... Yes, I checked this function before, but it can't do the same > > thing as in the previous email. There are some differences—please > > check the code in the last email. > > > > First, let's clarify: the key point here is that if we want to remove > > wb_wait_for_completion() and avoid self-freeing, we must not access > > "done" in finish_writeback_work(), otherwise it will cause a UAF. > > However, init_wait_func() can't achieve this. Of course, I also admit > > that the method in the previous email seems a bit odd. > > > > To summarize again, the root causes of the problem here are: > > 1. When memcg is released, it calls wb_wait_for_completion() to > > prevent UAF, which is completely unnecessary—cgwb_frn only needs to > > issue wb work and no need to wait writeback finished. > > 2. The current finish_writeback_work() will definitely dereference > > "done", which may lead to UAF. > > > > Essentially, cgwb_frn introduces a new scenario where no wake-up is > > needed. Therefore, we just need to make finish_writeback_work() not > > dereference "done" and not wake up the waiting thread. However, this > > cannot keep the modifications within memcg... > > > > Please correct me if my understanding is incorrect. > >> > Thanks. > > -- > tejun > > > > Thanks, > > -- > > Julian Sun <sunjunchao@bytedance.com> > > > > Thanks, -- Julian Sun <sunjunchao@bytedance.com> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-23 8:22 ` Julian Sun @ 2025-08-23 14:08 ` Giorgi Tchankvetadze 2025-08-23 15:17 ` Julian Sun 0 siblings, 1 reply; 26+ messages in thread From: Giorgi Tchankvetadze @ 2025-08-23 14:08 UTC (permalink / raw) To: sunjunchao Cc: axboe, brauner, cgroups, giorgitchankvetadze1997, hannes, jack, linux-fsdevel, linux-mm, mhocko, muchun.song, roman.gushchin, shakeel.butt, tj, viro Makes sense, yeah. What about using a shared, long-lived completion object (done_acc) inside cgwb_frn. All writeback jobs point to this same object via work->done = &frn->done_acc. On 8/23/2025 12:23 PM, Julian Sun wrote: > Hi, > > On Sat, Aug 23, 2025 at 4:08 PM Giorgi Tchankvetadze > <giorgitchankvetadze1997@gmail.com> wrote: >> > Hi there. Can we fix this by allowing callers to set work->done = > NULL > when no completion is desired? > No, we can't do that. Because cgwb_frn needs to track the state of wb > work by work->done.cnt, if we set work->done = Null, then we can not > know whether the wb work finished or not. See > mem_cgroup_track_foreign_dirty_slowpath() and > mem_cgroup_flush_foreign() for details. > >> The already-existing "if (done)" check in finish_writeback_work() > already provides the necessary protection, so the change is purely > > mechanical. > > > > On 8/23/2025 10:18 AM, Julian Sun wrote: > > Hi, > > > > > On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote: > > >> > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun > > > wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t > waitq; > > > > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > > wait_queue_head_t > > itself already allows overriding the wakeup > function. > Please look for > > init_wait_func() usages in the tree. > Hopefully, that should > contain > > the changes within memcg. > > > Well... Yes, I checked this function before, but it can't do the same > > > thing as in the previous email. There are some differences—please > > > check the code in the last email. > > > > First, let's clarify: the key > point here is that if we want to remove > > wb_wait_for_completion() and > avoid self-freeing, we must not access > > "done" in > finish_writeback_work(), otherwise it will cause a UAF. > > However, > init_wait_func() can't achieve this. Of course, I also admit > > that > the method in the previous email seems a bit odd. > > > > To summarize > again, the root causes of the problem here are: > > 1. When memcg is > released, it calls wb_wait_for_completion() to > > prevent UAF, which is > completely unnecessary—cgwb_frn only needs to > > issue wb work and no > need to wait writeback finished. > > 2. The current > finish_writeback_work() will definitely dereference > > "done", which > may lead to UAF. > > > > Essentially, cgwb_frn introduces a new scenario > where no wake-up is > > needed. Therefore, we just need to make > finish_writeback_work() not > > dereference "done" and not wake up the > waiting thread. However, this > > cannot keep the modifications within > memcg... > > > > Please correct me if my understanding is incorrect. > > >> > Thanks. > > -- > tejun > > > > Thanks, > > -- > > Julian Sun > <sunjunchao@bytedance.com> > > > > > > Hi, > > > > On Sat, Aug 23, 2025 > at 1:56 AM Tejun Heo <tj@kernel.org> wrote: > >> > Hello, > > On Fri, > Aug 22, 2025 at 04:22:09PM +0800, Julian Sun > > wrote: > > +struct > wb_wait_queue_head { > > + wait_queue_head_t waitq; > > > > + > wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t > > > itself already allows overriding the wakeup function. > Please look for > > > init_wait_func() usages in the tree. Hopefully, that should > > contain > > the changes within memcg. > > Well... Yes, I checked this > function before, but it can't do the same > > thing as in the previous > email. There are some differences—please > > check the code in the last > email. > > > > First, let's clarify: the key point here is that if we > want to remove > > wb_wait_for_completion() and avoid self-freeing, we > must not access > > "done" in finish_writeback_work(), otherwise it will > cause a UAF. > > However, init_wait_func() can't achieve this. Of > course, I also admit > > that the method in the previous email seems a > bit odd. > > > > To summarize again, the root causes of the problem here > are: > > 1. When memcg is released, it calls wb_wait_for_completion() to > > > prevent UAF, which is completely unnecessary—cgwb_frn only needs to > > > issue wb work and no need to wait writeback finished. > > 2. The > current finish_writeback_work() will definitely dereference > > "done", > which may lead to UAF. > > > > Essentially, cgwb_frn introduces a new > scenario where no wake-up is > > needed. Therefore, we just need to make > finish_writeback_work() not > > dereference "done" and not wake up the > waiting thread. However, this > > cannot keep the modifications within > memcg... > > > > Please correct me if my understanding is incorrect. > > >> > Thanks. > > -- > tejun > > > > Thanks, > > -- > > Julian Sun > <sunjunchao@bytedance.com> > > > > > Thanks, > -- > Julian Sun <sunjunchao@bytedance.com> > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-23 14:08 ` Giorgi Tchankvetadze @ 2025-08-23 15:17 ` Julian Sun 0 siblings, 0 replies; 26+ messages in thread From: Julian Sun @ 2025-08-23 15:17 UTC (permalink / raw) To: Giorgi Tchankvetadze Cc: axboe, brauner, cgroups, hannes, jack, linux-fsdevel, linux-mm, mhocko, muchun.song, roman.gushchin, shakeel.butt, tj, viro Hi, On Sat, Aug 23, 2025 at 10:09 PM Giorgi Tchankvetadze <giorgitchankvetadze1997@gmail.com> wrote: > > Makes sense, yeah. What about using a shared, long-lived completion > object (done_acc) inside cgwb_frn. All writeback jobs point to this same > object via work->done = &frn->done_acc. Well, IMO there's still no method to know which work has finished and which hasn't. All we know is whether all the work has finished or not, isn't it? > > On 8/23/2025 12:23 PM, Julian Sun wrote: > > Hi, > > > > On Sat, Aug 23, 2025 at 4:08 PM Giorgi Tchankvetadze > > <giorgitchankvetadze1997@gmail.com> wrote: > >> > Hi there. Can we fix this by allowing callers to set work->done = > > NULL > when no completion is desired? > > No, we can't do that. Because cgwb_frn needs to track the state of wb > > work by work->done.cnt, if we set work->done = Null, then we can not > > know whether the wb work finished or not. See > > mem_cgroup_track_foreign_dirty_slowpath() and > > mem_cgroup_flush_foreign() for details. > > > >> The already-existing "if (done)" check in finish_writeback_work() > already provides the necessary protection, so the change is purely > > > mechanical. > > > > On 8/23/2025 10:18 AM, Julian Sun wrote: > > Hi, > > > > > > On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote: > > > >> > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun > > > > wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t > > waitq; > > > > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > > > wait_queue_head_t > > itself already allows overriding the wakeup > > function. > Please look for > > init_wait_func() usages in the tree. > > Hopefully, that should > contain > > the changes within memcg. > > > > Well... Yes, I checked this function before, but it can't do the same > > > > thing as in the previous email. There are some differences—please > > > > check the code in the last email. > > > > First, let's clarify: the key > > point here is that if we want to remove > > wb_wait_for_completion() and > > avoid self-freeing, we must not access > > "done" in > > finish_writeback_work(), otherwise it will cause a UAF. > > However, > > init_wait_func() can't achieve this. Of course, I also admit > > that > > the method in the previous email seems a bit odd. > > > > To summarize > > again, the root causes of the problem here are: > > 1. When memcg is > > released, it calls wb_wait_for_completion() to > > prevent UAF, which is > > completely unnecessary—cgwb_frn only needs to > > issue wb work and no > > need to wait writeback finished. > > 2. The current > > finish_writeback_work() will definitely dereference > > "done", which > > may lead to UAF. > > > > Essentially, cgwb_frn introduces a new scenario > > where no wake-up is > > needed. Therefore, we just need to make > > finish_writeback_work() not > > dereference "done" and not wake up the > > waiting thread. However, this > > cannot keep the modifications within > > memcg... > > > > Please correct me if my understanding is incorrect. > > > >> > Thanks. > > -- > tejun > > > > Thanks, > > -- > > Julian Sun > > <sunjunchao@bytedance.com> > > > > > > Hi, > > > > On Sat, Aug 23, 2025 > > at 1:56 AM Tejun Heo <tj@kernel.org> wrote: > >> > Hello, > > On Fri, > > Aug 22, 2025 at 04:22:09PM +0800, Julian Sun > > wrote: > > +struct > > wb_wait_queue_head { > > + wait_queue_head_t waitq; > > > > + > > wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t > > > > itself already allows overriding the wakeup function. > Please look for > > > > init_wait_func() usages in the tree. Hopefully, that should > > > contain > > the changes within memcg. > > Well... Yes, I checked this > > function before, but it can't do the same > > thing as in the previous > > email. There are some differences—please > > check the code in the last > > email. > > > > First, let's clarify: the key point here is that if we > > want to remove > > wb_wait_for_completion() and avoid self-freeing, we > > must not access > > "done" in finish_writeback_work(), otherwise it will > > cause a UAF. > > However, init_wait_func() can't achieve this. Of > > course, I also admit > > that the method in the previous email seems a > > bit odd. > > > > To summarize again, the root causes of the problem here > > are: > > 1. When memcg is released, it calls wb_wait_for_completion() to > > > > prevent UAF, which is completely unnecessary—cgwb_frn only needs to > > > > issue wb work and no need to wait writeback finished. > > 2. The > > current finish_writeback_work() will definitely dereference > > "done", > > which may lead to UAF. > > > > Essentially, cgwb_frn introduces a new > > scenario where no wake-up is > > needed. Therefore, we just need to make > > finish_writeback_work() not > > dereference "done" and not wake up the > > waiting thread. However, this > > cannot keep the modifications within > > memcg... > > > > Please correct me if my understanding is incorrect. > > > >> > Thanks. > > -- > tejun > > > > Thanks, > > -- > > Julian Sun > > <sunjunchao@bytedance.com> > > > > > > Thanks, > > -- > > Julian Sun <sunjunchao@bytedance.com> > > > Thanks, -- Julian Sun <sunjunchao@bytedance.com> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-22 17:56 ` Tejun Heo 2025-08-23 6:18 ` Julian Sun @ 2025-08-25 17:45 ` Julian Sun 2025-08-25 18:53 ` Tejun Heo 1 sibling, 1 reply; 26+ messages in thread From: Julian Sun @ 2025-08-25 17:45 UTC (permalink / raw) To: Tejun Heo Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe Hi, Tejun On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t waitq; > > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t itself already allows overriding the wakeup function. > Please look for init_wait_func() usages in the tree. Hopefully, that should > contain the changes within memcg. Sorry for having misunderstood what you meant before. I’m afraid that init_wait_func() cannot work the same way. Because calling init_wait_func() presupposes that we are preparing to wait for an event(like wb_wait_completion()), but waiting for such an event might lead to a hung task. Please correct me if I'm wrong. > > Thanks. > > -- > tejun Thanks, -- Julian Sun <sunjunchao@bytedance.com> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-25 17:45 ` Julian Sun @ 2025-08-25 18:53 ` Tejun Heo 2025-08-25 19:06 ` Julian Sun 0 siblings, 1 reply; 26+ messages in thread From: Tejun Heo @ 2025-08-25 18:53 UTC (permalink / raw) To: Julian Sun Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe Hello, On Tue, Aug 26, 2025 at 01:45:44AM +0800, Julian Sun wrote: ... > Sorry for having misunderstood what you meant before. I’m afraid that > init_wait_func() cannot work the same way. Because calling > init_wait_func() presupposes that we are preparing to wait for an > event(like wb_wait_completion()), but waiting for such an event might > lead to a hung task. > > Please correct me if I'm wrong. Using init_wait_func() does not require someone waiting for it. AFAICS, you should be able to do the same thing that you did - allocating the done entries individually and freeing them when the done count reaches zero without anyone waiting for it. waitq doesn't really make many assumptions about how it's used - when you call wake_up() on it, it just walks the queued entries and invoke the callbacks there. Thanks. -- tejun ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-25 18:53 ` Tejun Heo @ 2025-08-25 19:06 ` Julian Sun 0 siblings, 0 replies; 26+ messages in thread From: Julian Sun @ 2025-08-25 19:06 UTC (permalink / raw) To: Tejun Heo Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe Hi, On Tue, Aug 26, 2025 at 2:53 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Tue, Aug 26, 2025 at 01:45:44AM +0800, Julian Sun wrote: > ... > > Sorry for having misunderstood what you meant before. I’m afraid that > > init_wait_func() cannot work the same way. Because calling > > init_wait_func() presupposes that we are preparing to wait for an > > event(like wb_wait_completion()), but waiting for such an event might > > lead to a hung task. > > > > Please correct me if I'm wrong. > > Using init_wait_func() does not require someone waiting for it. AFAICS, you > should be able to do the same thing that you did - allocating the done > entries individually and freeing them when the done count reaches zero > without anyone waiting for it. waitq doesn't really make many assumptions > about how it's used - when you call wake_up() on it, it just walks the > queued entries and invoke the callbacks there. Ah, yeah, this sounds great, many thanks for your clarification. I’ll send v2 of the patch after testing. > > Thanks. > > -- > tejun Thanks, -- Julian Sun <sunjunchao@bytedance.com> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-21 2:30 ` [External] " Julian Sun 2025-08-21 16:59 ` Tejun Heo @ 2025-08-25 10:13 ` Jan Kara 2025-08-25 12:08 ` Julian Sun 2025-08-25 18:57 ` [External] " Tejun Heo 1 sibling, 2 replies; 26+ messages in thread From: Jan Kara @ 2025-08-25 10:13 UTC (permalink / raw) To: Julian Sun Cc: Tejun Heo, linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe On Thu 21-08-25 10:30:30, Julian Sun wrote: > On Thu, Aug 21, 2025 at 4:58 AM Tejun Heo <tj@kernel.org> wrote: > > > > On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote: > > > @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) > > > int __maybe_unused i; > > > > > > #ifdef CONFIG_CGROUP_WRITEBACK > > > - for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) > > > - wb_wait_for_completion(&memcg->cgwb_frn[i].done); > > > + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { > > > + struct wb_completion *done = memcg->cgwb_frn[i].done; > > > + > > > + if (atomic_dec_and_test(&done->cnt)) > > > + kfree(done); > > > + } > > > #endif > > > > Can't you just remove done? I don't think it's doing anything after your > > changes anyway. > > Thanks for your review. > > AFAICT done is also used to track free slots in > mem_cgroup_track_foreign_dirty_slowpath() and > mem_cgroup_flush_foreign(), otherwise we have no method to know which > one is free and might flush more than what MEMCG_CGWB_FRN_CNT allow. > > Am I missing something? True, but is that mechanism really needed? Given the approximate nature of foreign flushing, couldn't we just always replace the oldest foreign entry regardless of whether the writeback is running or not? I didn't give too deep thought to this but from a quick look this should work just fine... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-25 10:13 ` Jan Kara @ 2025-08-25 12:08 ` Julian Sun 2025-08-25 18:57 ` [External] " Tejun Heo 1 sibling, 0 replies; 26+ messages in thread From: Julian Sun @ 2025-08-25 12:08 UTC (permalink / raw) To: Jan Kara Cc: Tejun Heo, linux-fsdevel, cgroups, linux-mm, viro, brauner, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe 在 2025/8/25 18:13, Jan Kara 写道: Hi, Jan Thanks for your review and comments. > On Thu 21-08-25 10:30:30, Julian Sun wrote: >> On Thu, Aug 21, 2025 at 4:58 AM Tejun Heo <tj@kernel.org> wrote: >>> >>> On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote: >>>> @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css) >>>> int __maybe_unused i; >>>> >>>> #ifdef CONFIG_CGROUP_WRITEBACK >>>> - for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) >>>> - wb_wait_for_completion(&memcg->cgwb_frn[i].done); >>>> + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { >>>> + struct wb_completion *done = memcg->cgwb_frn[i].done; >>>> + >>>> + if (atomic_dec_and_test(&done->cnt)) >>>> + kfree(done); >>>> + } >>>> #endif >>> >>> Can't you just remove done? I don't think it's doing anything after your >>> changes anyway. >> >> Thanks for your review. >> >> AFAICT done is also used to track free slots in >> mem_cgroup_track_foreign_dirty_slowpath() and >> mem_cgroup_flush_foreign(), otherwise we have no method to know which >> one is free and might flush more than what MEMCG_CGWB_FRN_CNT allow. >> >> Am I missing something? > > True, but is that mechanism really needed? Given the approximate nature of > foreign flushing, couldn't we just always replace the oldest foreign entry > regardless of whether the writeback is running or not? I didn't give too > deep thought to this but from a quick look this should work just fine... AFAICT the mechanism is used to make the max number of wb works that we issue concurrently less than MEMCG_CGWB_FRN_CNT(4). If we replace the oldest and flush wb work whether the writeback is running or not, it seems that we are likely to flush more than MEMCG_CGWB_FRN_CNT wb works concurrently, I'm not sure how much influence this will have... > > Honza Thanks, -- Julian Sun <sunjunchao@bytedance.com> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg. 2025-08-25 10:13 ` Jan Kara 2025-08-25 12:08 ` Julian Sun @ 2025-08-25 18:57 ` Tejun Heo 1 sibling, 0 replies; 26+ messages in thread From: Tejun Heo @ 2025-08-25 18:57 UTC (permalink / raw) To: Jan Kara Cc: Julian Sun, linux-fsdevel, cgroups, linux-mm, viro, brauner, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe Hello, On Mon, Aug 25, 2025 at 12:13:53PM +0200, Jan Kara wrote: > True, but is that mechanism really needed? Given the approximate nature of > foreign flushing, couldn't we just always replace the oldest foreign entry > regardless of whether the writeback is running or not? I didn't give too > deep thought to this but from a quick look this should work just fine... Maybe, it's a bit difficult to predict. The thing was built to work around real cases of problem and FWIW we haven't heard of similar issues afterwards. Given the peculiarity and subtlety of the problem, I wonder keeping the behavior unchanged, if reasonably possible, is an easier path to follow here. Thanks. -- tejun ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 0/3] memcg, writeback: Don't wait writeback completion 2025-08-20 11:19 [PATCH 0/3] memcg, writeback: Don't wait writeback completion Julian Sun ` (2 preceding siblings ...) 2025-08-20 11:19 ` [PATCH] memcg: Don't wait writeback completion when release memcg Julian Sun @ 2025-08-20 12:16 ` Giorgi Tchankvetadze 2025-08-21 2:37 ` [External] " Julian Sun 3 siblings, 1 reply; 26+ messages in thread From: Giorgi Tchankvetadze @ 2025-08-20 12:16 UTC (permalink / raw) To: Julian Sun, linux-fsdevel, cgroups, linux-mm Cc: viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe, tj Could we add wb_pending_pages to memory.events? Very cheap and useful. A single atomic counter is already kept internally; exposing it is one line in memcontrol.c plus one line in the ABI doc. On 8/20/2025 3:19 PM, Julian Sun wrote: > This patch series aims to eliminate task hangs in mem_cgroup_css_free() > caused by calling wb_wait_for_completion(). > This is because there may be a large number of writeback tasks in the > foreign memcg, involving millions of pages, and the situation is > exacerbated by WBT rate limiting—potentially leading to task hangs > lasting several hours. > > Patch 1 is preparatory work and involves no functional changes. > Patch 2 implements the automatic release of wb_completion. > Patch 3 removes wb_wait_for_completion() from mem_cgroup_css_free(). > > > Julian Sun (3): > writeback: Rename wb_writeback_work->auto_free to free_work. > writeback: Add wb_writeback_work->free_done > memcg: Don't wait writeback completion when release memcg. > > fs/fs-writeback.c | 22 ++++++++++++++-------- > include/linux/backing-dev-defs.h | 6 ++++++ > include/linux/memcontrol.h | 2 +- > mm/memcontrol.c | 29 ++++++++++++++++++++--------- > 4 files changed, 41 insertions(+), 18 deletions(-) > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH 0/3] memcg, writeback: Don't wait writeback completion 2025-08-20 12:16 ` [PATCH 0/3] memcg, writeback: Don't wait writeback completion Giorgi Tchankvetadze @ 2025-08-21 2:37 ` Julian Sun 2025-08-22 9:29 ` Giorgi Tchankvetadze 0 siblings, 1 reply; 26+ messages in thread From: Julian Sun @ 2025-08-21 2:37 UTC (permalink / raw) To: Giorgi Tchankvetadze Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe, tj Hi, thanks for your review. On Wed, Aug 20, 2025 at 8:17 PM Giorgi Tchankvetadze <giorgitchankvetadze1997@gmail.com> wrote: > > Could we add wb_pending_pages to memory.events? > Very cheap and useful. > A single atomic counter is already kept internally; exposing it is one > line in memcontrol.c plus one line in the ABI doc. Not sure what do you mean by wb_pending_pages? Another counter besides existing MEMCG_LOW MEMCG_HIGH MEMCG_MAX, etc.? And AFAIK there's no pending pages in this patch set. Could you give more details? Thanks, > > > On 8/20/2025 3:19 PM, Julian Sun wrote: > > This patch series aims to eliminate task hangs in mem_cgroup_css_free() > > caused by calling wb_wait_for_completion(). > > This is because there may be a large number of writeback tasks in the > > foreign memcg, involving millions of pages, and the situation is > > exacerbated by WBT rate limiting—potentially leading to task hangs > > lasting several hours. > > > > Patch 1 is preparatory work and involves no functional changes. > > Patch 2 implements the automatic release of wb_completion. > > Patch 3 removes wb_wait_for_completion() from mem_cgroup_css_free(). > > > > > > Julian Sun (3): > > writeback: Rename wb_writeback_work->auto_free to free_work. > > writeback: Add wb_writeback_work->free_done > > memcg: Don't wait writeback completion when release memcg. > > > > fs/fs-writeback.c | 22 ++++++++++++++-------- > > include/linux/backing-dev-defs.h | 6 ++++++ > > include/linux/memcontrol.h | 2 +- > > mm/memcontrol.c | 29 ++++++++++++++++++++--------- > > 4 files changed, 41 insertions(+), 18 deletions(-) > > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [External] Re: [PATCH 0/3] memcg, writeback: Don't wait writeback completion 2025-08-21 2:37 ` [External] " Julian Sun @ 2025-08-22 9:29 ` Giorgi Tchankvetadze 0 siblings, 0 replies; 26+ messages in thread From: Giorgi Tchankvetadze @ 2025-08-22 9:29 UTC (permalink / raw) To: Julian Sun Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe, tj `memory.stat:writeback` already gives you the exact real-time count of pages still waiting to finish write-back for that cgroup, and it is updated atomically in the hot path (`set_page_writeback()` / `end_page_writeback()`). Reading it is just an `atomic_long_read()` (or per-cpu equivalent), so the extra CPU cost of exposing it is essentially zero. I was thinking that extra additional info would help us Thanks, Giorgi On 8/21/2025 6:37 AM, Julian Sun wrote: > Hi, thanks for your review. > > On Wed, Aug 20, 2025 at 8:17 PM Giorgi Tchankvetadze > <giorgitchankvetadze1997@gmail.com> wrote: >> >> Could we add wb_pending_pages to memory.events? >> Very cheap and useful. >> A single atomic counter is already kept internally; exposing it is one >> line in memcontrol.c plus one line in the ABI doc. > > Not sure what do you mean by wb_pending_pages? Another counter besides > existing MEMCG_LOW MEMCG_HIGH MEMCG_MAX, etc.? And AFAIK there's no > pending pages in this patch set. Could you give more details? > > Thanks, >> >> >> On 8/20/2025 3:19 PM, Julian Sun wrote: >>> This patch series aims to eliminate task hangs in mem_cgroup_css_free() >>> caused by calling wb_wait_for_completion(). >>> This is because there may be a large number of writeback tasks in the >>> foreign memcg, involving millions of pages, and the situation is >>> exacerbated by WBT rate limiting—potentially leading to task hangs >>> lasting several hours. >>> >>> Patch 1 is preparatory work and involves no functional changes. >>> Patch 2 implements the automatic release of wb_completion. >>> Patch 3 removes wb_wait_for_completion() from mem_cgroup_css_free(). >>> >>> >>> Julian Sun (3): >>> writeback: Rename wb_writeback_work->auto_free to free_work. >>> writeback: Add wb_writeback_work->free_done >>> memcg: Don't wait writeback completion when release memcg. >>> >>> fs/fs-writeback.c | 22 ++++++++++++++-------- >>> include/linux/backing-dev-defs.h | 6 ++++++ >>> include/linux/memcontrol.h | 2 +- >>> mm/memcontrol.c | 29 ++++++++++++++++++++--------- >>> 4 files changed, 41 insertions(+), 18 deletions(-) >>> >> ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2025-08-25 19:06 UTC | newest] Thread overview: 26+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-08-20 11:19 [PATCH 0/3] memcg, writeback: Don't wait writeback completion Julian Sun 2025-08-20 11:19 ` [PATCH 1/3] writeback: Rename wb_writeback_work->auto_free to free_work Julian Sun 2025-08-20 11:19 ` [PATCH] writeback: Add wb_writeback_work->free_done Julian Sun 2025-08-20 11:19 ` [PATCH] memcg: Don't wait writeback completion when release memcg Julian Sun 2025-08-20 20:58 ` Tejun Heo 2025-08-21 2:30 ` [External] " Julian Sun 2025-08-21 16:59 ` Tejun Heo 2025-08-21 18:00 ` Julian Sun 2025-08-21 18:16 ` Julian Sun 2025-08-21 19:01 ` Tejun Heo 2025-08-22 8:22 ` Julian Sun 2025-08-22 17:56 ` Tejun Heo 2025-08-23 6:18 ` Julian Sun 2025-08-23 8:08 ` Giorgi Tchankvetadze 2025-08-23 8:22 ` Julian Sun 2025-08-23 14:08 ` Giorgi Tchankvetadze 2025-08-23 15:17 ` Julian Sun 2025-08-25 17:45 ` Julian Sun 2025-08-25 18:53 ` Tejun Heo 2025-08-25 19:06 ` Julian Sun 2025-08-25 10:13 ` Jan Kara 2025-08-25 12:08 ` Julian Sun 2025-08-25 18:57 ` [External] " Tejun Heo 2025-08-20 12:16 ` [PATCH 0/3] memcg, writeback: Don't wait writeback completion Giorgi Tchankvetadze 2025-08-21 2:37 ` [External] " Julian Sun 2025-08-22 9:29 ` Giorgi Tchankvetadze
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).