linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] memcg, writeback: Don't wait writeback completion
@ 2025-08-20 11:19 Julian Sun
  2025-08-20 11:19 ` [PATCH 1/3] writeback: Rename wb_writeback_work->auto_free to free_work Julian Sun
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Julian Sun @ 2025-08-20 11:19 UTC (permalink / raw)
  To: linux-fsdevel, cgroups, linux-mm
  Cc: viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, axboe, tj

This patch series aims to eliminate task hangs in mem_cgroup_css_free() 
caused by calling wb_wait_for_completion(). 
This is because there may be a large number of writeback tasks in the 
foreign memcg, involving millions of pages, and the situation is 
exacerbated by WBT rate limiting—potentially leading to task hangs 
lasting several hours.

Patch 1 is preparatory work and involves no functional changes.
Patch 2 implements the automatic release of wb_completion.
Patch 3 removes wb_wait_for_completion() from mem_cgroup_css_free().


Julian Sun (3):
  writeback: Rename wb_writeback_work->auto_free to free_work.
  writeback: Add wb_writeback_work->free_done
  memcg: Don't wait writeback completion when release memcg.

 fs/fs-writeback.c                | 22 ++++++++++++++--------
 include/linux/backing-dev-defs.h |  6 ++++++
 include/linux/memcontrol.h       |  2 +-
 mm/memcontrol.c                  | 29 ++++++++++++++++++++---------
 4 files changed, 41 insertions(+), 18 deletions(-)

-- 
2.20.1



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/3] writeback: Rename wb_writeback_work->auto_free to free_work.
  2025-08-20 11:19 [PATCH 0/3] memcg, writeback: Don't wait writeback completion Julian Sun
@ 2025-08-20 11:19 ` Julian Sun
  2025-08-20 11:19 ` [PATCH] writeback: Add wb_writeback_work->free_done Julian Sun
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Julian Sun @ 2025-08-20 11:19 UTC (permalink / raw)
  To: linux-fsdevel, cgroups, linux-mm
  Cc: viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, axboe, tj

This patch is to prepare for subsequent patches. It only does
's/auto_free/free_work' with no functional changes.

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
---
 fs/fs-writeback.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index cc57367fb641..4a6c22df5649 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -48,7 +48,7 @@ struct wb_writeback_work {
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
-	unsigned int auto_free:1;	/* free on completion */
+	unsigned int free_work:1;	/* free work on completion */
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -170,7 +170,7 @@ static void finish_writeback_work(struct wb_writeback_work *work)
 {
 	struct wb_completion *done = work->done;
 
-	if (work->auto_free)
+	if (work->free_work)
 		kfree(work);
 	if (done) {
 		wait_queue_head_t *waitq = done->waitq;
@@ -1029,7 +1029,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 		if (work) {
 			*work = *base_work;
 			work->nr_pages = nr_pages;
-			work->auto_free = 1;
+			work->free_work = 1;
 			wb_queue_work(wb, work);
 			continue;
 		}
@@ -1048,7 +1048,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 		work = &fallback_work;
 		*work = *base_work;
 		work->nr_pages = nr_pages;
-		work->auto_free = 0;
+		work->free_work = 0;
 		work->done = &fallback_work_done;
 
 		wb_queue_work(wb, work);
@@ -1130,7 +1130,7 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id,
 		work->range_cyclic = 1;
 		work->reason = reason;
 		work->done = done;
-		work->auto_free = 1;
+		work->free_work = 1;
 		wb_queue_work(wb, work);
 		ret = 0;
 	} else {
@@ -1237,7 +1237,7 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 	might_sleep();
 
 	if (!skip_if_busy || !writeback_in_progress(&bdi->wb)) {
-		base_work->auto_free = 0;
+		base_work->free_work = 0;
 		wb_queue_work(&bdi->wb, base_work);
 	}
 }
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH] writeback: Add wb_writeback_work->free_done
  2025-08-20 11:19 [PATCH 0/3] memcg, writeback: Don't wait writeback completion Julian Sun
  2025-08-20 11:19 ` [PATCH 1/3] writeback: Rename wb_writeback_work->auto_free to free_work Julian Sun
@ 2025-08-20 11:19 ` Julian Sun
  2025-08-20 11:19 ` [PATCH] memcg: Don't wait writeback completion when release memcg Julian Sun
  2025-08-20 12:16 ` [PATCH 0/3] memcg, writeback: Don't wait writeback completion Giorgi Tchankvetadze
  3 siblings, 0 replies; 26+ messages in thread
From: Julian Sun @ 2025-08-20 11:19 UTC (permalink / raw)
  To: linux-fsdevel, cgroups, linux-mm
  Cc: viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, axboe, tj

Add wb_writeback_work->free_done, which is used to free the
wb_completion variable when the reference count becomes zero.

Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
---
 fs/fs-writeback.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 4a6c22df5649..56faf5c3d064 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -49,6 +49,7 @@ struct wb_writeback_work {
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
 	unsigned int free_work:1;	/* free work on completion */
+	unsigned int free_done:1;	/* free wb_completion on completion */
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -169,15 +170,19 @@ static void wb_wakeup_delayed(struct bdi_writeback *wb)
 static void finish_writeback_work(struct wb_writeback_work *work)
 {
 	struct wb_completion *done = work->done;
+	bool free_done = work->free_done;
 
 	if (work->free_work)
 		kfree(work);
 	if (done) {
 		wait_queue_head_t *waitq = done->waitq;
 
-		/* @done can't be accessed after the following dec */
-		if (atomic_dec_and_test(&done->cnt))
+		/* @done can't be accessed after the following dec unless free_done is true */
+		if (atomic_dec_and_test(&done->cnt)) {
 			wake_up_all(waitq);
+			if (free_done)
+				kfree(done);
+		}
 	}
 }
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-20 11:19 [PATCH 0/3] memcg, writeback: Don't wait writeback completion Julian Sun
  2025-08-20 11:19 ` [PATCH 1/3] writeback: Rename wb_writeback_work->auto_free to free_work Julian Sun
  2025-08-20 11:19 ` [PATCH] writeback: Add wb_writeback_work->free_done Julian Sun
@ 2025-08-20 11:19 ` Julian Sun
  2025-08-20 20:58   ` Tejun Heo
  2025-08-20 12:16 ` [PATCH 0/3] memcg, writeback: Don't wait writeback completion Giorgi Tchankvetadze
  3 siblings, 1 reply; 26+ messages in thread
From: Julian Sun @ 2025-08-20 11:19 UTC (permalink / raw)
  To: linux-fsdevel, cgroups, linux-mm
  Cc: viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, axboe, tj

Recently, we encountered the following hung task:

INFO: task kworker/4:1:1334558 blocked for more than 1720 seconds.
[Wed Jul 30 17:47:45 2025] Workqueue: cgroup_destroy css_free_rwork_fn
[Wed Jul 30 17:47:45 2025] Call Trace:
[Wed Jul 30 17:47:45 2025]  __schedule+0x934/0xe10
[Wed Jul 30 17:47:45 2025]  ? complete+0x3b/0x50
[Wed Jul 30 17:47:45 2025]  ? _cond_resched+0x15/0x30
[Wed Jul 30 17:47:45 2025]  schedule+0x40/0xb0
[Wed Jul 30 17:47:45 2025]  wb_wait_for_completion+0x52/0x80
[Wed Jul 30 17:47:45 2025]  ? finish_wait+0x80/0x80
[Wed Jul 30 17:47:45 2025]  mem_cgroup_css_free+0x22/0x1b0
[Wed Jul 30 17:47:45 2025]  css_free_rwork_fn+0x42/0x380
[Wed Jul 30 17:47:45 2025]  process_one_work+0x1a2/0x360
[Wed Jul 30 17:47:45 2025]  worker_thread+0x30/0x390
[Wed Jul 30 17:47:45 2025]  ? create_worker+0x1a0/0x1a0
[Wed Jul 30 17:47:45 2025]  kthread+0x110/0x130
[Wed Jul 30 17:47:45 2025]  ? __kthread_cancel_work+0x40/0x40
[Wed Jul 30 17:47:45 2025]  ret_from_fork+0x1f/0x30

The direct cause is that memcg spends a long time waiting for dirty page
writeback of foreign memcgs during release.

The root causes are:
    a. The wb may have multiple writeback tasks, containing hundreds of
       thousands of dirty pages, as shown below:

>>> for work in list_for_each_entry("struct wb_writeback_work", \
				    wb.work_list.address_of_(), "list"):
...     print(work.nr_pages, work.reason, hex(work))
...
900628  WB_REASON_FOREIGN_FLUSH 0xffff969e8d956b40
1116521 WB_REASON_FOREIGN_FLUSH 0xffff9698332a9540
1275228 WB_REASON_FOREIGN_FLUSH 0xffff969d9b444bc0
1099673 WB_REASON_FOREIGN_FLUSH 0xffff969f0954d6c0
1351522 WB_REASON_FOREIGN_FLUSH 0xffff969e76713340
2567437 WB_REASON_FOREIGN_FLUSH 0xffff9694ae208400
2954033 WB_REASON_FOREIGN_FLUSH 0xffff96a22d62cbc0
3008860 WB_REASON_FOREIGN_FLUSH 0xffff969eee8ce3c0
3337932 WB_REASON_FOREIGN_FLUSH 0xffff9695b45156c0
3348916 WB_REASON_FOREIGN_FLUSH 0xffff96a22c7a4f40
3345363 WB_REASON_FOREIGN_FLUSH 0xffff969e5d872800
3333581 WB_REASON_FOREIGN_FLUSH 0xffff969efd0f4600
3382225 WB_REASON_FOREIGN_FLUSH 0xffff969e770edcc0
3418770 WB_REASON_FOREIGN_FLUSH 0xffff96a252ceea40
3387648 WB_REASON_FOREIGN_FLUSH 0xffff96a3bda86340
3385420 WB_REASON_FOREIGN_FLUSH 0xffff969efc6eb280
3418730 WB_REASON_FOREIGN_FLUSH 0xffff96a348ab1040
3426155 WB_REASON_FOREIGN_FLUSH 0xffff969d90beac00
3397995 WB_REASON_FOREIGN_FLUSH 0xffff96a2d7288800
3293095 WB_REASON_FOREIGN_FLUSH 0xffff969dab423240
3293595 WB_REASON_FOREIGN_FLUSH 0xffff969c765ff400
3199511 WB_REASON_FOREIGN_FLUSH 0xffff969a72d5e680
3085016 WB_REASON_FOREIGN_FLUSH 0xffff969f0455e000
3035712 WB_REASON_FOREIGN_FLUSH 0xffff969d9bbf4b00

    b. The writeback might severely throttled by wbt, with a speed
       possibly less than 100kb/s, leading to a very long writeback time.

>>> wb.write_bandwidth
(unsigned long)24
>>> wb.write_bandwidth
(unsigned long)13

The wb_wait_for_completion() here is probably only used to prevent
use-after-free. Therefore, we manage 'done' separately and automatically
free this memory in finish_writeback_work().

This allows us to remove wb_wait_for_completion() while preventing
the use-after-free issue.

Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing")
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
---
 fs/fs-writeback.c                |  1 +
 include/linux/backing-dev-defs.h |  6 ++++++
 include/linux/memcontrol.h       |  2 +-
 mm/memcontrol.c                  | 31 ++++++++++++++++++++++---------
 4 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 56faf5c3d064..fac3ef7e95ae 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1136,6 +1136,7 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id,
 		work->reason = reason;
 		work->done = done;
 		work->free_work = 1;
+		work->free_done = 1;
 		wb_queue_work(wb, work);
 		ret = 0;
 	} else {
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 2ad261082bba..bf28dd9a4783 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -208,6 +208,12 @@ struct wb_lock_cookie {
 	unsigned long flags;
 };
 
+static inline void wb_completion_init(struct wb_completion *done, struct wait_queue_head *waitq)
+{
+	atomic_set(&done->cnt, 1);
+	done->waitq = waitq;
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 785173aa0739..80a6bafbb24a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -161,7 +161,7 @@ struct memcg_cgwb_frn {
 	u64 bdi_id;			/* bdi->id of the foreign inode */
 	int memcg_id;			/* memcg->css.id of foreign inode */
 	u64 at;				/* jiffies_64 at the time of dirtying */
-	struct wb_completion done;	/* tracks in-flight foreign writebacks */
+	struct wb_completion *done;	/* tracks in-flight foreign writebacks */
 };
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8dd7fbed5a94..6e6a1ce7589a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3459,7 +3459,7 @@ void mem_cgroup_track_foreign_dirty_slowpath(struct folio *folio,
 		    frn->memcg_id == wb->memcg_css->id)
 			break;
 		if (time_before64(frn->at, oldest_at) &&
-		    atomic_read(&frn->done.cnt) == 1) {
+		    atomic_read(&frn->done->cnt) == 1) {
 			oldest = i;
 			oldest_at = frn->at;
 		}
@@ -3506,12 +3506,12 @@ void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
 		 * already one in flight.
 		 */
 		if (time_after64(frn->at, now - intv) &&
-		    atomic_read(&frn->done.cnt) == 1) {
+		    atomic_read(&frn->done->cnt) == 1) {
 			frn->at = 0;
 			trace_flush_foreign(wb, frn->bdi_id, frn->memcg_id);
 			cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id,
 					       WB_REASON_FOREIGN_FLUSH,
-					       &frn->done);
+					       frn->done);
 		}
 	}
 }
@@ -3708,7 +3708,7 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	struct memcg_vmstats_percpu __percpu *pstatc_pcpu;
 	struct mem_cgroup *memcg;
 	int node, cpu;
-	int __maybe_unused i;
+	int __maybe_unused i = 0;
 	long error;
 
 	memcg = kmem_cache_zalloc(memcg_cachep, GFP_KERNEL);
@@ -3763,9 +3763,14 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	INIT_LIST_HEAD(&memcg->objcg_list);
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
-	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
-		memcg->cgwb_frn[i].done =
-			__WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq);
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+		struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i];
+
+		frn->done = kmalloc(sizeof(struct wb_completion), GFP_KERNEL);
+		if (!frn->done)
+			goto fail;
+		wb_completion_init(frn->done, &memcg_cgwb_frn_waitq);
+	}
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	spin_lock_init(&memcg->deferred_split_queue.split_queue_lock);
@@ -3775,6 +3780,10 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	lru_gen_init_memcg(memcg);
 	return memcg;
 fail:
+#ifdef CONFIG_CGROUP_WRITEBACK
+	while (i--)
+		kfree(memcg->cgwb_frn[i].done);
+#endif
 	mem_cgroup_id_remove(memcg);
 	__mem_cgroup_free(memcg);
 	return ERR_PTR(error);
@@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	int __maybe_unused i;
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
-		wb_wait_for_completion(&memcg->cgwb_frn[i].done);
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+		struct wb_completion *done = memcg->cgwb_frn[i].done;
+
+		if (atomic_dec_and_test(&done->cnt))
+			kfree(done);
+	}
 #endif
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
 		static_branch_dec(&memcg_sockets_enabled_key);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/3] memcg, writeback: Don't wait writeback completion
  2025-08-20 11:19 [PATCH 0/3] memcg, writeback: Don't wait writeback completion Julian Sun
                   ` (2 preceding siblings ...)
  2025-08-20 11:19 ` [PATCH] memcg: Don't wait writeback completion when release memcg Julian Sun
@ 2025-08-20 12:16 ` Giorgi Tchankvetadze
  2025-08-21  2:37   ` [External] " Julian Sun
  3 siblings, 1 reply; 26+ messages in thread
From: Giorgi Tchankvetadze @ 2025-08-20 12:16 UTC (permalink / raw)
  To: Julian Sun, linux-fsdevel, cgroups, linux-mm
  Cc: viro, brauner, jack, hannes, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, axboe, tj

Could we add wb_pending_pages to memory.events?
Very cheap and useful.
A single atomic counter is already kept internally; exposing it is one 
line in memcontrol.c plus one line in the ABI doc.


On 8/20/2025 3:19 PM, Julian Sun wrote:
> This patch series aims to eliminate task hangs in mem_cgroup_css_free()
> caused by calling wb_wait_for_completion().
> This is because there may be a large number of writeback tasks in the
> foreign memcg, involving millions of pages, and the situation is
> exacerbated by WBT rate limiting—potentially leading to task hangs
> lasting several hours.
> 
> Patch 1 is preparatory work and involves no functional changes.
> Patch 2 implements the automatic release of wb_completion.
> Patch 3 removes wb_wait_for_completion() from mem_cgroup_css_free().
> 
> 
> Julian Sun (3):
>    writeback: Rename wb_writeback_work->auto_free to free_work.
>    writeback: Add wb_writeback_work->free_done
>    memcg: Don't wait writeback completion when release memcg.
> 
>   fs/fs-writeback.c                | 22 ++++++++++++++--------
>   include/linux/backing-dev-defs.h |  6 ++++++
>   include/linux/memcontrol.h       |  2 +-
>   mm/memcontrol.c                  | 29 ++++++++++++++++++++---------
>   4 files changed, 41 insertions(+), 18 deletions(-)
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-20 11:19 ` [PATCH] memcg: Don't wait writeback completion when release memcg Julian Sun
@ 2025-08-20 20:58   ` Tejun Heo
  2025-08-21  2:30     ` [External] " Julian Sun
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-08-20 20:58 UTC (permalink / raw)
  To: Julian Sun
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote:
> @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  	int __maybe_unused i;
>  
>  #ifdef CONFIG_CGROUP_WRITEBACK
> -	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
> -		wb_wait_for_completion(&memcg->cgwb_frn[i].done);
> +	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
> +		struct wb_completion *done = memcg->cgwb_frn[i].done;
> +
> +		if (atomic_dec_and_test(&done->cnt))
> +			kfree(done);
> +	}
>  #endif

Can't you just remove done? I don't think it's doing anything after your
changes anyway.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-20 20:58   ` Tejun Heo
@ 2025-08-21  2:30     ` Julian Sun
  2025-08-21 16:59       ` Tejun Heo
  2025-08-25 10:13       ` Jan Kara
  0 siblings, 2 replies; 26+ messages in thread
From: Julian Sun @ 2025-08-21  2:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

On Thu, Aug 21, 2025 at 4:58 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote:
> > @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
> >       int __maybe_unused i;
> >
> >  #ifdef CONFIG_CGROUP_WRITEBACK
> > -     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
> > -             wb_wait_for_completion(&memcg->cgwb_frn[i].done);
> > +     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
> > +             struct wb_completion *done = memcg->cgwb_frn[i].done;
> > +
> > +             if (atomic_dec_and_test(&done->cnt))
> > +                     kfree(done);
> > +     }
> >  #endif
>
> Can't you just remove done? I don't think it's doing anything after your
> changes anyway.

Thanks for your review.

AFAICT done is also used to track free slots in
mem_cgroup_track_foreign_dirty_slowpath() and
mem_cgroup_flush_foreign(), otherwise we have no method to know which
one is free and might flush more than what MEMCG_CGWB_FRN_CNT allow.

Am I missing something?

Thanks,
>
> Thanks.
>
> --
> tejun


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH 0/3] memcg, writeback: Don't wait writeback completion
  2025-08-20 12:16 ` [PATCH 0/3] memcg, writeback: Don't wait writeback completion Giorgi Tchankvetadze
@ 2025-08-21  2:37   ` Julian Sun
  2025-08-22  9:29     ` Giorgi Tchankvetadze
  0 siblings, 1 reply; 26+ messages in thread
From: Julian Sun @ 2025-08-21  2:37 UTC (permalink / raw)
  To: Giorgi Tchankvetadze
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe, tj

Hi, thanks for your review.

On Wed, Aug 20, 2025 at 8:17 PM Giorgi Tchankvetadze
<giorgitchankvetadze1997@gmail.com> wrote:
>
> Could we add wb_pending_pages to memory.events?
> Very cheap and useful.
> A single atomic counter is already kept internally; exposing it is one
> line in memcontrol.c plus one line in the ABI doc.

Not sure what do you mean by wb_pending_pages? Another counter besides
existing MEMCG_LOW MEMCG_HIGH MEMCG_MAX, etc.? And AFAIK there's no
pending pages in this patch set. Could you give more details?

Thanks,
>
>
> On 8/20/2025 3:19 PM, Julian Sun wrote:
> > This patch series aims to eliminate task hangs in mem_cgroup_css_free()
> > caused by calling wb_wait_for_completion().
> > This is because there may be a large number of writeback tasks in the
> > foreign memcg, involving millions of pages, and the situation is
> > exacerbated by WBT rate limiting—potentially leading to task hangs
> > lasting several hours.
> >
> > Patch 1 is preparatory work and involves no functional changes.
> > Patch 2 implements the automatic release of wb_completion.
> > Patch 3 removes wb_wait_for_completion() from mem_cgroup_css_free().
> >
> >
> > Julian Sun (3):
> >    writeback: Rename wb_writeback_work->auto_free to free_work.
> >    writeback: Add wb_writeback_work->free_done
> >    memcg: Don't wait writeback completion when release memcg.
> >
> >   fs/fs-writeback.c                | 22 ++++++++++++++--------
> >   include/linux/backing-dev-defs.h |  6 ++++++
> >   include/linux/memcontrol.h       |  2 +-
> >   mm/memcontrol.c                  | 29 ++++++++++++++++++++---------
> >   4 files changed, 41 insertions(+), 18 deletions(-)
> >
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-21  2:30     ` [External] " Julian Sun
@ 2025-08-21 16:59       ` Tejun Heo
  2025-08-21 18:00         ` Julian Sun
  2025-08-25 10:13       ` Jan Kara
  1 sibling, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-08-21 16:59 UTC (permalink / raw)
  To: Julian Sun
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

Hello,

On Thu, Aug 21, 2025 at 10:30:30AM +0800, Julian Sun wrote:
> On Thu, Aug 21, 2025 at 4:58 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote:
> > > @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
> > >       int __maybe_unused i;
> > >
> > >  #ifdef CONFIG_CGROUP_WRITEBACK
> > > -     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
> > > -             wb_wait_for_completion(&memcg->cgwb_frn[i].done);
> > > +     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
> > > +             struct wb_completion *done = memcg->cgwb_frn[i].done;
> > > +
> > > +             if (atomic_dec_and_test(&done->cnt))
> > > +                     kfree(done);
> > > +     }
> > >  #endif
> >
> > Can't you just remove done? I don't think it's doing anything after your
> > changes anyway.
> 
> Thanks for your review.
> 
> AFAICT done is also used to track free slots in
> mem_cgroup_track_foreign_dirty_slowpath() and
> mem_cgroup_flush_foreign(), otherwise we have no method to know which
> one is free and might flush more than what MEMCG_CGWB_FRN_CNT allow.
> 
> Am I missing something?

No, I missed that. I don't think we need to add extra mechanisms in wb for
this tho. How about shifting wb_wait_for_completion() and kfree(memcg) into
a separate function and punt those to a separate work item? That's going to
be a small self-contained change in memcg.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-21 16:59       ` Tejun Heo
@ 2025-08-21 18:00         ` Julian Sun
  2025-08-21 18:16           ` Julian Sun
  2025-08-21 19:01           ` Tejun Heo
  0 siblings, 2 replies; 26+ messages in thread
From: Julian Sun @ 2025-08-21 18:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

Hi,

On Fri, Aug 22, 2025 at 12:59 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Aug 21, 2025 at 10:30:30AM +0800, Julian Sun wrote:
> > On Thu, Aug 21, 2025 at 4:58 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote:
> > > > @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
> > > >       int __maybe_unused i;
> > > >
> > > >  #ifdef CONFIG_CGROUP_WRITEBACK
> > > > -     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
> > > > -             wb_wait_for_completion(&memcg->cgwb_frn[i].done);
> > > > +     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
> > > > +             struct wb_completion *done = memcg->cgwb_frn[i].done;
> > > > +
> > > > +             if (atomic_dec_and_test(&done->cnt))
> > > > +                     kfree(done);
> > > > +     }
> > > >  #endif
> > >
> > > Can't you just remove done? I don't think it's doing anything after your
> > > changes anyway.
> >
> > Thanks for your review.
> >
> > AFAICT done is also used to track free slots in
> > mem_cgroup_track_foreign_dirty_slowpath() and
> > mem_cgroup_flush_foreign(), otherwise we have no method to know which
> > one is free and might flush more than what MEMCG_CGWB_FRN_CNT allow.
> >
> > Am I missing something?
>
> No, I missed that. I don't think we need to add extra mechanisms in wb for
> this tho. How about shifting wb_wait_for_completion() and kfree(memcg) into
> a separate function and punt those to a separate work item? That's going to
> be a small self-contained change in memcg.
>

Do you mean logic like this?

    for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
        wb_wait_for_completion(&memcg->cgwb_frn[i].done);
    kfree(memcg);

But there still exist task hang issues as long as
wb_wait_for_completion() exists.
I think the scope of impact of the current changes should be
manageable. I have checked all the other places where wb_queue_work()
is called, and their free_done values are all 0, and I also tested
this patch with the reproducer in [1] with kasan and kmemleak enabled.
The test result looks fine, so this should not have a significant
impact.
What do you think?

[1]: https://lore.kernel.org/all/20190821210235.GN2263813@devbig004.ftw2.facebook.com/
> Thanks.
>
> --
> tejun


Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-21 18:00         ` Julian Sun
@ 2025-08-21 18:16           ` Julian Sun
  2025-08-21 19:01           ` Tejun Heo
  1 sibling, 0 replies; 26+ messages in thread
From: Julian Sun @ 2025-08-21 18:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

On Fri, Aug 22, 2025 at 2:00 AM Julian Sun <sunjunchao@bytedance.com> wrote:
>
> Hi,
>
> On Fri, Aug 22, 2025 at 12:59 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Thu, Aug 21, 2025 at 10:30:30AM +0800, Julian Sun wrote:
> > > On Thu, Aug 21, 2025 at 4:58 AM Tejun Heo <tj@kernel.org> wrote:
> > > >
> > > > On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote:
> > > > > @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
> > > > >       int __maybe_unused i;
> > > > >
> > > > >  #ifdef CONFIG_CGROUP_WRITEBACK
> > > > > -     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
> > > > > -             wb_wait_for_completion(&memcg->cgwb_frn[i].done);
> > > > > +     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
> > > > > +             struct wb_completion *done = memcg->cgwb_frn[i].done;
> > > > > +
> > > > > +             if (atomic_dec_and_test(&done->cnt))
> > > > > +                     kfree(done);
> > > > > +     }
> > > > >  #endif
> > > >
> > > > Can't you just remove done? I don't think it's doing anything after your
> > > > changes anyway.
> > >
> > > Thanks for your review.
> > >
> > > AFAICT done is also used to track free slots in
> > > mem_cgroup_track_foreign_dirty_slowpath() and
> > > mem_cgroup_flush_foreign(), otherwise we have no method to know which
> > > one is free and might flush more than what MEMCG_CGWB_FRN_CNT allow.
> > >
> > > Am I missing something?
> >
> > No, I missed that. I don't think we need to add extra mechanisms in wb for
> > this tho. How about shifting wb_wait_for_completion() and kfree(memcg) into
> > a separate function and punt those to a separate work item? That's going to
> > be a small self-contained change in memcg.
> >
>
> Do you mean logic like this?
>
>     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
>         wb_wait_for_completion(&memcg->cgwb_frn[i].done);
>     kfree(memcg);
>
> But there still exist task hang issues as long as
> wb_wait_for_completion() exists.
> I think the scope of impact of the current changes should be
> manageable. I have checked all the other places where wb_queue_work()
> is called, and their free_done values are all 0, and I also tested
> this patch with the reproducer in [1] with kasan and kmemleak enabled.
> The test result looks fine, so this should not have a significant
> impact.

BTW, the test case is like this — it ran for over a night.
    while true; do ./repro.sh && sleep 300 && ./stop.sh; done

sjc@debian:~/linux$ cat stop.sh
#!/bin/bash
#

TEST=/sys/fs/cgroup/test
A=$TEST/A
B=$TEST/B
echo "-memory" > $TEST/cgroup.subtree_control
pkill write-range

sync
sleep 5
sync
sleep 5
echo 3 > /proc/sys/vm/drop_caches

rmdir $A $B

> What do you think?
>
> [1]: https://lore.kernel.org/all/20190821210235.GN2263813@devbig004.ftw2.facebook.com/
> > Thanks.
> >
> > --
> > tejun
>
>
> Thanks,
> --
> Julian Sun <sunjunchao@bytedance.com>


Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-21 18:00         ` Julian Sun
  2025-08-21 18:16           ` Julian Sun
@ 2025-08-21 19:01           ` Tejun Heo
  2025-08-22  8:22             ` Julian Sun
  1 sibling, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-08-21 19:01 UTC (permalink / raw)
  To: Julian Sun
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

Hello,

On Fri, Aug 22, 2025 at 02:00:10AM +0800, Julian Sun wrote:
...
> Do you mean logic like this?
> 
>     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
>         wb_wait_for_completion(&memcg->cgwb_frn[i].done);
>     kfree(memcg);
> 
> But there still exist task hang issues as long as
> wb_wait_for_completion() exists.

Ah, right. I was just thinking about the workqueue being stalled. The
problem is that the wait itself is too long.

> I think the scope of impact of the current changes should be
> manageable. I have checked all the other places where wb_queue_work()
> is called, and their free_done values are all 0, and I also tested
> this patch with the reproducer in [1] with kasan and kmemleak enabled.
> The test result looks fine, so this should not have a significant
> impact.
> What do you think?

My source of reluctance is that it's a peculiar situation where flushing of
a cgroup takes that long due to hard throttling and the self-freeing
mechanism isn't the prettiest thing. Do you think you can do the same thing
through custom waitq wakeup function?

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-21 19:01           ` Tejun Heo
@ 2025-08-22  8:22             ` Julian Sun
  2025-08-22 17:56               ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Julian Sun @ 2025-08-22  8:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

On 8/22/25 3:01 AM, Tejun Heo wrote:

Hi,

> Hello,
> 
> On Fri, Aug 22, 2025 at 02:00:10AM +0800, Julian Sun wrote:
> ...
>> Do you mean logic like this?
>>
>>      for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
>>          wb_wait_for_completion(&memcg->cgwb_frn[i].done);
>>      kfree(memcg);
>>
>> But there still exist task hang issues as long as
>> wb_wait_for_completion() exists.
> 
> Ah, right. I was just thinking about the workqueue being stalled. The
> problem is that the wait itself is too long.
> 
>> I think the scope of impact of the current changes should be
>> manageable. I have checked all the other places where wb_queue_work()
>> is called, and their free_done values are all 0, and I also tested
>> this patch with the reproducer in [1] with kasan and kmemleak enabled.
>> The test result looks fine, so this should not have a significant
>> impact.
>> What do you think?
> 
> My source of reluctance is that it's a peculiar situation where flushing of
> a cgroup takes that long due to hard throttling and the self-freeing
> mechanism isn't the prettiest thing. Do you think you can do the same thing
> through custom waitq wakeup function?

Yeah, this method looks more general if I understand correctly.

If the idea of the following code makes sense to you, I'd like to split
and convert it into formal patches.

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a07b8cf73ae2..10fede792178 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -172,13 +172,8 @@ static void finish_writeback_work(struct 
wb_writeback_work *work)

  	if (work->auto_free)
  		kfree(work);
-	if (done) {
-		wait_queue_head_t *waitq = done->waitq;
-
-		/* @done can't be accessed after the following dec */
-		if (atomic_dec_and_test(&done->cnt))
-			wake_up_all(waitq);
-	}
+	if (done)
+		done->wb_waitq->wb_wakeup_func(done->wb_waitq, done);
  }

  static void wb_queue_work(struct bdi_writeback *wb,
@@ -213,7 +208,7 @@ static void wb_queue_work(struct bdi_writeback *wb,
  void wb_wait_for_completion(struct wb_completion *done)
  {
  	atomic_dec(&done->cnt);		/* put down the initial count */
-	wait_event(*done->waitq, !atomic_read(&done->cnt));
+	wait_event(done->wb_waitq->waitq, !atomic_read(&done->cnt));
  }

  #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/include/linux/backing-dev-defs.h 
b/include/linux/backing-dev-defs.h
index 2ad261082bba..04699458ac50 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -60,13 +60,56 @@ enum wb_reason {
  	WB_REASON_MAX,
  };

+struct wb_completion;
+typedef struct wb_wait_queue_head wb_wait_queue_head_t;
+typedef void (*wb_wait_wakeup_func_t)(wb_wait_queue_head_t *wq_waitq,
+									  struct wb_completion *done);
+struct wb_wait_queue_head {
+	wait_queue_head_t waitq;
+	wb_wait_wakeup_func_t wb_wakeup_func;
+};
+
  struct wb_completion {
  	atomic_t		cnt;
-	wait_queue_head_t	*waitq;
+	wb_wait_queue_head_t	*wb_waitq;
  };

+static inline void wb_default_wakeup_func(wb_wait_queue_head_t *wq_waitq,
+										  struct wb_completion *done)
+{
+	/* @done can't be accessed after the following dec */
+	if (atomic_dec_and_test(&done->cnt))
+		wake_up_all(&wq_waitq->waitq);
+}
+
+/* used for cgwb_frn, be careful here, @done can't be accessed */
+static inline void wb_empty_wakeup_func(wb_wait_queue_head_t *wq_waitq,
+										struct wb_completion *done)
+{
+}
+
+#define __init_wb_waitqueue_head(wb_waitq, func) 	\
+	do {											\
+		init_waitqueue_head(&wb_waitq.waitq);		\
+		wb_waitq.wb_wakeup_func = func; 			\
+	} while (0)
+
+#define init_wb_waitqueue_head(wb_waitq) 	\
+	__init_wb_waitqueue_head(wb_waitq, wb_default_wakeup_func)
+
+#define __WB_WAIT_QUEUE_HEAD_INITIALIZER(name, func) {	\
+	.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(name.waitq),	\
+	.wb_wakeup_func = func, 							\
+}
+
+#define __DECLARE_WB_WAIT_QUEUE_HEAD(name, func) \
+	struct wb_wait_queue_head name = 
__WB_WAIT_QUEUE_HEAD_INITIALIZER(name, func)
+
+#define DECLARE_WB_WAIT_QUEUE_HEAD(name) \
+	__DECLARE_WB_WAIT_QUEUE_HEAD(name, wb_default_wakeup_func)
+
  #define __WB_COMPLETION_INIT(_waitq)	\
-	(struct wb_completion){ .cnt = ATOMIC_INIT(1), .waitq = (_waitq) }
+	(struct wb_completion){ .cnt = ATOMIC_INIT(1), .wb_waitq = (_waitq) }

  /*
   * If one wants to wait for one or more wb_writeback_works, each work's
@@ -190,7 +233,7 @@ struct backing_dev_info {
  	struct mutex cgwb_release_mutex;  /* protect shutdown of wb structs */
  	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
  #endif
-	wait_queue_head_t wb_waitq;
+	wb_wait_queue_head_t wb_waitq;

  	struct device *dev;
  	char dev_name[64];
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 783904d8c5ef..c4fec9e22978 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1008,7 +1008,7 @@ int bdi_init(struct backing_dev_info *bdi)
  	bdi->max_prop_frac = FPROP_FRAC_BASE;
  	INIT_LIST_HEAD(&bdi->bdi_list);
  	INIT_LIST_HEAD(&bdi->wb_list);
-	init_waitqueue_head(&bdi->wb_waitq);
+	init_wb_waitqueue_head(bdi->wb_waitq);
  	bdi->last_bdp_sleep = jiffies;

  	return cgwb_bdi_init(bdi);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8dd7fbed5a94..999624535470 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -99,7 +99,7 @@ static struct kmem_cache *memcg_cachep;
  static struct kmem_cache *memcg_pn_cachep;

  #ifdef CONFIG_CGROUP_WRITEBACK
-static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
+static __DECLARE_WB_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq, 
wb_empty_wakeup_func);
  #endif

  static inline bool task_is_dying(void)
@@ -3909,12 +3909,7 @@ static void mem_cgroup_css_released(struct 
cgroup_subsys_state *css)
  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
  {
  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	int __maybe_unused i;

-#ifdef CONFIG_CGROUP_WRITEBACK
-	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
-		wb_wait_for_completion(&memcg->cgwb_frn[i].done);
-#endif
  	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
  		static_branch_dec(&memcg_sockets_enabled_key);



> 
> Thanks.
> 

Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH 0/3] memcg, writeback: Don't wait writeback completion
  2025-08-21  2:37   ` [External] " Julian Sun
@ 2025-08-22  9:29     ` Giorgi Tchankvetadze
  0 siblings, 0 replies; 26+ messages in thread
From: Giorgi Tchankvetadze @ 2025-08-22  9:29 UTC (permalink / raw)
  To: Julian Sun
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe, tj


`memory.stat:writeback` already gives you the exact real-time count of 
pages still waiting to finish write-back for that cgroup, and it is 
updated atomically in the hot path (`set_page_writeback()` / 
`end_page_writeback()`). Reading it is just an `atomic_long_read()` (or 
per-cpu equivalent), so the extra CPU cost of exposing it is essentially 
zero. I was thinking that extra additional info would help us

Thanks,
Giorgi

On 8/21/2025 6:37 AM, Julian Sun wrote:
> Hi, thanks for your review.
> 
> On Wed, Aug 20, 2025 at 8:17 PM Giorgi Tchankvetadze
> <giorgitchankvetadze1997@gmail.com> wrote:
>>
>> Could we add wb_pending_pages to memory.events?
>> Very cheap and useful.
>> A single atomic counter is already kept internally; exposing it is one
>> line in memcontrol.c plus one line in the ABI doc.
> 
> Not sure what do you mean by wb_pending_pages? Another counter besides
> existing MEMCG_LOW MEMCG_HIGH MEMCG_MAX, etc.? And AFAIK there's no
> pending pages in this patch set. Could you give more details?
> 
> Thanks,
>>
>>
>> On 8/20/2025 3:19 PM, Julian Sun wrote:
>>> This patch series aims to eliminate task hangs in mem_cgroup_css_free()
>>> caused by calling wb_wait_for_completion().
>>> This is because there may be a large number of writeback tasks in the
>>> foreign memcg, involving millions of pages, and the situation is
>>> exacerbated by WBT rate limiting—potentially leading to task hangs
>>> lasting several hours.
>>>
>>> Patch 1 is preparatory work and involves no functional changes.
>>> Patch 2 implements the automatic release of wb_completion.
>>> Patch 3 removes wb_wait_for_completion() from mem_cgroup_css_free().
>>>
>>>
>>> Julian Sun (3):
>>>     writeback: Rename wb_writeback_work->auto_free to free_work.
>>>     writeback: Add wb_writeback_work->free_done
>>>     memcg: Don't wait writeback completion when release memcg.
>>>
>>>    fs/fs-writeback.c                | 22 ++++++++++++++--------
>>>    include/linux/backing-dev-defs.h |  6 ++++++
>>>    include/linux/memcontrol.h       |  2 +-
>>>    mm/memcontrol.c                  | 29 ++++++++++++++++++++---------
>>>    4 files changed, 41 insertions(+), 18 deletions(-)
>>>
>>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-22  8:22             ` Julian Sun
@ 2025-08-22 17:56               ` Tejun Heo
  2025-08-23  6:18                 ` Julian Sun
  2025-08-25 17:45                 ` Julian Sun
  0 siblings, 2 replies; 26+ messages in thread
From: Tejun Heo @ 2025-08-22 17:56 UTC (permalink / raw)
  To: Julian Sun
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

Hello,

On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun wrote:
> +struct wb_wait_queue_head {
> +	wait_queue_head_t waitq;
> +	wb_wait_wakeup_func_t wb_wakeup_func;
> +};

wait_queue_head_t itself already allows overriding the wakeup function.
Please look for init_wait_func() usages in the tree. Hopefully, that should
contain the changes within memcg.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-22 17:56               ` Tejun Heo
@ 2025-08-23  6:18                 ` Julian Sun
  2025-08-23  8:08                   ` Giorgi Tchankvetadze
  2025-08-25 17:45                 ` Julian Sun
  1 sibling, 1 reply; 26+ messages in thread
From: Julian Sun @ 2025-08-23  6:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

Hi,

On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun wrote:
> > +struct wb_wait_queue_head {
> > +     wait_queue_head_t waitq;
> > +     wb_wait_wakeup_func_t wb_wakeup_func;
> > +};
>
> wait_queue_head_t itself already allows overriding the wakeup function.
> Please look for init_wait_func() usages in the tree. Hopefully, that should
> contain the changes within memcg.

Well... Yes, I checked this function before, but it can't do the same
thing as in the previous email. There are some differences—please
check the code in the last email.

First, let's clarify: the key point here is that if we want to remove
wb_wait_for_completion() and avoid self-freeing, we must not access
"done" in finish_writeback_work(), otherwise it will cause a UAF.
However, init_wait_func() can't achieve this. Of course, I also admit
that the method in the previous email seems a bit odd.

To summarize again, the root causes of the problem here are:
1. When memcg is released, it calls wb_wait_for_completion() to
prevent UAF, which is completely unnecessary—cgwb_frn only needs to
issue wb work and no need to wait writeback finished.
2. The current finish_writeback_work() will definitely dereference
"done", which may lead to UAF.

Essentially, cgwb_frn introduces a new scenario where no wake-up is
needed. Therefore, we just need to make finish_writeback_work() not
dereference "done" and not wake up the waiting thread. However, this
cannot keep the modifications within memcg...

Please correct me if my understanding is incorrect.
>
> Thanks.
>
> --
> tejun


Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-23  6:18                 ` Julian Sun
@ 2025-08-23  8:08                   ` Giorgi Tchankvetadze
  2025-08-23  8:22                     ` Julian Sun
  0 siblings, 1 reply; 26+ messages in thread
From: Giorgi Tchankvetadze @ 2025-08-23  8:08 UTC (permalink / raw)
  To: sunjunchao
  Cc: axboe, brauner, cgroups, hannes, jack, linux-fsdevel, linux-mm,
	mhocko, muchun.song, roman.gushchin, shakeel.butt, tj, viro

Hi there. Can we fix this by allowing callers to set work->done = NULL 
when no completion is desired?
The already-existing "if (done)" check in finish_writeback_work() 
already provides the necessary protection, so the change is purely 
mechanical.



On 8/23/2025 10:18 AM, Julian Sun wrote:
> Hi,
> 
> On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote:
>> > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun 
> wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t waitq; > 
>  > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t 
> itself already allows overriding the wakeup function. > Please look for 
> init_wait_func() usages in the tree. Hopefully, that should > contain 
> the changes within memcg.
> Well... Yes, I checked this function before, but it can't do the same
> thing as in the previous email. There are some differences—please
> check the code in the last email.
> 
> First, let's clarify: the key point here is that if we want to remove
> wb_wait_for_completion() and avoid self-freeing, we must not access
> "done" in finish_writeback_work(), otherwise it will cause a UAF.
> However, init_wait_func() can't achieve this. Of course, I also admit
> that the method in the previous email seems a bit odd.
> 
> To summarize again, the root causes of the problem here are:
> 1. When memcg is released, it calls wb_wait_for_completion() to
> prevent UAF, which is completely unnecessary—cgwb_frn only needs to
> issue wb work and no need to wait writeback finished.
> 2. The current finish_writeback_work() will definitely dereference
> "done", which may lead to UAF.
> 
> Essentially, cgwb_frn introduces a new scenario where no wake-up is
> needed. Therefore, we just need to make finish_writeback_work() not
> dereference "done" and not wake up the waiting thread. However, this
> cannot keep the modifications within memcg...
> 
> Please correct me if my understanding is incorrect.
>> > Thanks. > > -- > tejun
> 
> Thanks,
> -- 
> Julian Sun <sunjunchao@bytedance.com>
> 


> Hi,
> 
> On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote:
>> > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun 
> wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t waitq; > 
>  > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t 
> itself already allows overriding the wakeup function. > Please look for 
> init_wait_func() usages in the tree. Hopefully, that should > contain 
> the changes within memcg.
> Well... Yes, I checked this function before, but it can't do the same
> thing as in the previous email. There are some differences—please
> check the code in the last email.
> 
> First, let's clarify: the key point here is that if we want to remove
> wb_wait_for_completion() and avoid self-freeing, we must not access
> "done" in finish_writeback_work(), otherwise it will cause a UAF.
> However, init_wait_func() can't achieve this. Of course, I also admit
> that the method in the previous email seems a bit odd.
> 
> To summarize again, the root causes of the problem here are:
> 1. When memcg is released, it calls wb_wait_for_completion() to
> prevent UAF, which is completely unnecessary—cgwb_frn only needs to
> issue wb work and no need to wait writeback finished.
> 2. The current finish_writeback_work() will definitely dereference
> "done", which may lead to UAF.
> 
> Essentially, cgwb_frn introduces a new scenario where no wake-up is
> needed. Therefore, we just need to make finish_writeback_work() not
> dereference "done" and not wake up the waiting thread. However, this
> cannot keep the modifications within memcg...
> 
> Please correct me if my understanding is incorrect.
>> > Thanks. > > -- > tejun
> 
> Thanks,
> -- 
> Julian Sun <sunjunchao@bytedance.com>
> 




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-23  8:08                   ` Giorgi Tchankvetadze
@ 2025-08-23  8:22                     ` Julian Sun
  2025-08-23 14:08                       ` Giorgi Tchankvetadze
  0 siblings, 1 reply; 26+ messages in thread
From: Julian Sun @ 2025-08-23  8:22 UTC (permalink / raw)
  To: Giorgi Tchankvetadze
  Cc: axboe, brauner, cgroups, hannes, jack, linux-fsdevel, linux-mm,
	mhocko, muchun.song, roman.gushchin, shakeel.butt, tj, viro

Hi,

On Sat, Aug 23, 2025 at 4:08 PM Giorgi Tchankvetadze
<giorgitchankvetadze1997@gmail.com> wrote:
>
> Hi there. Can we fix this by allowing callers to set work->done = NULL
> when no completion is desired?

No, we can't do that. Because cgwb_frn needs to track the state of wb
work by work->done.cnt, if we set work->done = Null, then we can not
know whether the wb work finished or not. See
mem_cgroup_track_foreign_dirty_slowpath() and
mem_cgroup_flush_foreign() for details.

> The already-existing "if (done)" check in finish_writeback_work()
> already provides the necessary protection, so the change is purely
> mechanical.
>
>
>
> On 8/23/2025 10:18 AM, Julian Sun wrote:
> > Hi,
> >
> > On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote:
> >> > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun
> > wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t waitq; >
> >  > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t
> > itself already allows overriding the wakeup function. > Please look for
> > init_wait_func() usages in the tree. Hopefully, that should > contain
> > the changes within memcg.
> > Well... Yes, I checked this function before, but it can't do the same
> > thing as in the previous email. There are some differences—please
> > check the code in the last email.
> >
> > First, let's clarify: the key point here is that if we want to remove
> > wb_wait_for_completion() and avoid self-freeing, we must not access
> > "done" in finish_writeback_work(), otherwise it will cause a UAF.
> > However, init_wait_func() can't achieve this. Of course, I also admit
> > that the method in the previous email seems a bit odd.
> >
> > To summarize again, the root causes of the problem here are:
> > 1. When memcg is released, it calls wb_wait_for_completion() to
> > prevent UAF, which is completely unnecessary—cgwb_frn only needs to
> > issue wb work and no need to wait writeback finished.
> > 2. The current finish_writeback_work() will definitely dereference
> > "done", which may lead to UAF.
> >
> > Essentially, cgwb_frn introduces a new scenario where no wake-up is
> > needed. Therefore, we just need to make finish_writeback_work() not
> > dereference "done" and not wake up the waiting thread. However, this
> > cannot keep the modifications within memcg...
> >
> > Please correct me if my understanding is incorrect.
> >> > Thanks. > > -- > tejun
> >
> > Thanks,
> > --
> > Julian Sun <sunjunchao@bytedance.com>
> >
>
>
> > Hi,
> >
> > On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote:
> >> > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun
> > wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t waitq; >
> >  > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t
> > itself already allows overriding the wakeup function. > Please look for
> > init_wait_func() usages in the tree. Hopefully, that should > contain
> > the changes within memcg.
> > Well... Yes, I checked this function before, but it can't do the same
> > thing as in the previous email. There are some differences—please
> > check the code in the last email.
> >
> > First, let's clarify: the key point here is that if we want to remove
> > wb_wait_for_completion() and avoid self-freeing, we must not access
> > "done" in finish_writeback_work(), otherwise it will cause a UAF.
> > However, init_wait_func() can't achieve this. Of course, I also admit
> > that the method in the previous email seems a bit odd.
> >
> > To summarize again, the root causes of the problem here are:
> > 1. When memcg is released, it calls wb_wait_for_completion() to
> > prevent UAF, which is completely unnecessary—cgwb_frn only needs to
> > issue wb work and no need to wait writeback finished.
> > 2. The current finish_writeback_work() will definitely dereference
> > "done", which may lead to UAF.
> >
> > Essentially, cgwb_frn introduces a new scenario where no wake-up is
> > needed. Therefore, we just need to make finish_writeback_work() not
> > dereference "done" and not wake up the waiting thread. However, this
> > cannot keep the modifications within memcg...
> >
> > Please correct me if my understanding is incorrect.
> >> > Thanks. > > -- > tejun
> >
> > Thanks,
> > --
> > Julian Sun <sunjunchao@bytedance.com>
> >
>
>

Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-23  8:22                     ` Julian Sun
@ 2025-08-23 14:08                       ` Giorgi Tchankvetadze
  2025-08-23 15:17                         ` Julian Sun
  0 siblings, 1 reply; 26+ messages in thread
From: Giorgi Tchankvetadze @ 2025-08-23 14:08 UTC (permalink / raw)
  To: sunjunchao
  Cc: axboe, brauner, cgroups, giorgitchankvetadze1997, hannes, jack,
	linux-fsdevel, linux-mm, mhocko, muchun.song, roman.gushchin,
	shakeel.butt, tj, viro

Makes sense, yeah. What about using a shared, long-lived completion 
object (done_acc) inside cgwb_frn. All writeback jobs point to this same 
object via work->done = &frn->done_acc.

On 8/23/2025 12:23 PM, Julian Sun wrote:
> Hi,
> 
> On Sat, Aug 23, 2025 at 4:08 PM Giorgi Tchankvetadze
> <giorgitchankvetadze1997@gmail.com> wrote:
>> > Hi there. Can we fix this by allowing callers to set work->done = 
> NULL > when no completion is desired?
> No, we can't do that. Because cgwb_frn needs to track the state of wb
> work by work->done.cnt, if we set work->done = Null, then we can not
> know whether the wb work finished or not. See
> mem_cgroup_track_foreign_dirty_slowpath() and
> mem_cgroup_flush_foreign() for details.
> 
>> The already-existing "if (done)" check in finish_writeback_work() > already provides the necessary protection, so the change is purely > 
> mechanical. > > > > On 8/23/2025 10:18 AM, Julian Sun wrote: > > Hi, > > 
>  > > On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote: > 
>  >> > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun > 
>  > wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t 
> waitq; > > > > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > 
> wait_queue_head_t > > itself already allows overriding the wakeup 
> function. > Please look for > > init_wait_func() usages in the tree. 
> Hopefully, that should > contain > > the changes within memcg. > > 
> Well... Yes, I checked this function before, but it can't do the same > 
>  > thing as in the previous email. There are some differences—please > > 
> check the code in the last email. > > > > First, let's clarify: the key 
> point here is that if we want to remove > > wb_wait_for_completion() and 
> avoid self-freeing, we must not access > > "done" in 
> finish_writeback_work(), otherwise it will cause a UAF. > > However, 
> init_wait_func() can't achieve this. Of course, I also admit > > that 
> the method in the previous email seems a bit odd. > > > > To summarize 
> again, the root causes of the problem here are: > > 1. When memcg is 
> released, it calls wb_wait_for_completion() to > > prevent UAF, which is 
> completely unnecessary—cgwb_frn only needs to > > issue wb work and no 
> need to wait writeback finished. > > 2. The current 
> finish_writeback_work() will definitely dereference > > "done", which 
> may lead to UAF. > > > > Essentially, cgwb_frn introduces a new scenario 
> where no wake-up is > > needed. Therefore, we just need to make 
> finish_writeback_work() not > > dereference "done" and not wake up the 
> waiting thread. However, this > > cannot keep the modifications within 
> memcg... > > > > Please correct me if my understanding is incorrect. > 
>  >> > Thanks. > > -- > tejun > > > > Thanks, > > -- > > Julian Sun 
> <sunjunchao@bytedance.com> > > > > > > Hi, > > > > On Sat, Aug 23, 2025 
> at 1:56 AM Tejun Heo <tj@kernel.org> wrote: > >> > Hello, > > On Fri, 
> Aug 22, 2025 at 04:22:09PM +0800, Julian Sun > > wrote: > > +struct 
> wb_wait_queue_head { > > + wait_queue_head_t waitq; > > > > + 
> wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t > > 
> itself already allows overriding the wakeup function. > Please look for 
>  > > init_wait_func() usages in the tree. Hopefully, that should > 
> contain > > the changes within memcg. > > Well... Yes, I checked this 
> function before, but it can't do the same > > thing as in the previous 
> email. There are some differences—please > > check the code in the last 
> email. > > > > First, let's clarify: the key point here is that if we 
> want to remove > > wb_wait_for_completion() and avoid self-freeing, we 
> must not access > > "done" in finish_writeback_work(), otherwise it will 
> cause a UAF. > > However, init_wait_func() can't achieve this. Of 
> course, I also admit > > that the method in the previous email seems a 
> bit odd. > > > > To summarize again, the root causes of the problem here 
> are: > > 1. When memcg is released, it calls wb_wait_for_completion() to 
>  > > prevent UAF, which is completely unnecessary—cgwb_frn only needs to 
>  > > issue wb work and no need to wait writeback finished. > > 2. The 
> current finish_writeback_work() will definitely dereference > > "done", 
> which may lead to UAF. > > > > Essentially, cgwb_frn introduces a new 
> scenario where no wake-up is > > needed. Therefore, we just need to make 
> finish_writeback_work() not > > dereference "done" and not wake up the 
> waiting thread. However, this > > cannot keep the modifications within 
> memcg... > > > > Please correct me if my understanding is incorrect. > 
>  >> > Thanks. > > -- > tejun > > > > Thanks, > > -- > > Julian Sun 
> <sunjunchao@bytedance.com> > > > >
> Thanks,
> -- 
> Julian Sun <sunjunchao@bytedance.com>
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-23 14:08                       ` Giorgi Tchankvetadze
@ 2025-08-23 15:17                         ` Julian Sun
  0 siblings, 0 replies; 26+ messages in thread
From: Julian Sun @ 2025-08-23 15:17 UTC (permalink / raw)
  To: Giorgi Tchankvetadze
  Cc: axboe, brauner, cgroups, hannes, jack, linux-fsdevel, linux-mm,
	mhocko, muchun.song, roman.gushchin, shakeel.butt, tj, viro

Hi,

On Sat, Aug 23, 2025 at 10:09 PM Giorgi Tchankvetadze
<giorgitchankvetadze1997@gmail.com> wrote:
>
> Makes sense, yeah. What about using a shared, long-lived completion
> object (done_acc) inside cgwb_frn. All writeback jobs point to this same
> object via work->done = &frn->done_acc.

Well, IMO there's still no method to know which work has finished and
which hasn't. All we know is whether all the work has finished or not,
isn't it?
>
> On 8/23/2025 12:23 PM, Julian Sun wrote:
> > Hi,
> >
> > On Sat, Aug 23, 2025 at 4:08 PM Giorgi Tchankvetadze
> > <giorgitchankvetadze1997@gmail.com> wrote:
> >> > Hi there. Can we fix this by allowing callers to set work->done =
> > NULL > when no completion is desired?
> > No, we can't do that. Because cgwb_frn needs to track the state of wb
> > work by work->done.cnt, if we set work->done = Null, then we can not
> > know whether the wb work finished or not. See
> > mem_cgroup_track_foreign_dirty_slowpath() and
> > mem_cgroup_flush_foreign() for details.
> >
> >> The already-existing "if (done)" check in finish_writeback_work() > already provides the necessary protection, so the change is purely >
> > mechanical. > > > > On 8/23/2025 10:18 AM, Julian Sun wrote: > > Hi, > >
> >  > > On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote: >
> >  >> > Hello, > > On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun >
> >  > wrote: > > +struct wb_wait_queue_head { > > + wait_queue_head_t
> > waitq; > > > > + wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > >
> > wait_queue_head_t > > itself already allows overriding the wakeup
> > function. > Please look for > > init_wait_func() usages in the tree.
> > Hopefully, that should > contain > > the changes within memcg. > >
> > Well... Yes, I checked this function before, but it can't do the same >
> >  > thing as in the previous email. There are some differences—please > >
> > check the code in the last email. > > > > First, let's clarify: the key
> > point here is that if we want to remove > > wb_wait_for_completion() and
> > avoid self-freeing, we must not access > > "done" in
> > finish_writeback_work(), otherwise it will cause a UAF. > > However,
> > init_wait_func() can't achieve this. Of course, I also admit > > that
> > the method in the previous email seems a bit odd. > > > > To summarize
> > again, the root causes of the problem here are: > > 1. When memcg is
> > released, it calls wb_wait_for_completion() to > > prevent UAF, which is
> > completely unnecessary—cgwb_frn only needs to > > issue wb work and no
> > need to wait writeback finished. > > 2. The current
> > finish_writeback_work() will definitely dereference > > "done", which
> > may lead to UAF. > > > > Essentially, cgwb_frn introduces a new scenario
> > where no wake-up is > > needed. Therefore, we just need to make
> > finish_writeback_work() not > > dereference "done" and not wake up the
> > waiting thread. However, this > > cannot keep the modifications within
> > memcg... > > > > Please correct me if my understanding is incorrect. >
> >  >> > Thanks. > > -- > tejun > > > > Thanks, > > -- > > Julian Sun
> > <sunjunchao@bytedance.com> > > > > > > Hi, > > > > On Sat, Aug 23, 2025
> > at 1:56 AM Tejun Heo <tj@kernel.org> wrote: > >> > Hello, > > On Fri,
> > Aug 22, 2025 at 04:22:09PM +0800, Julian Sun > > wrote: > > +struct
> > wb_wait_queue_head { > > + wait_queue_head_t waitq; > > > > +
> > wb_wait_wakeup_func_t wb_wakeup_func; > > +}; > > wait_queue_head_t > >
> > itself already allows overriding the wakeup function. > Please look for
> >  > > init_wait_func() usages in the tree. Hopefully, that should >
> > contain > > the changes within memcg. > > Well... Yes, I checked this
> > function before, but it can't do the same > > thing as in the previous
> > email. There are some differences—please > > check the code in the last
> > email. > > > > First, let's clarify: the key point here is that if we
> > want to remove > > wb_wait_for_completion() and avoid self-freeing, we
> > must not access > > "done" in finish_writeback_work(), otherwise it will
> > cause a UAF. > > However, init_wait_func() can't achieve this. Of
> > course, I also admit > > that the method in the previous email seems a
> > bit odd. > > > > To summarize again, the root causes of the problem here
> > are: > > 1. When memcg is released, it calls wb_wait_for_completion() to
> >  > > prevent UAF, which is completely unnecessary—cgwb_frn only needs to
> >  > > issue wb work and no need to wait writeback finished. > > 2. The
> > current finish_writeback_work() will definitely dereference > > "done",
> > which may lead to UAF. > > > > Essentially, cgwb_frn introduces a new
> > scenario where no wake-up is > > needed. Therefore, we just need to make
> > finish_writeback_work() not > > dereference "done" and not wake up the
> > waiting thread. However, this > > cannot keep the modifications within
> > memcg... > > > > Please correct me if my understanding is incorrect. >
> >  >> > Thanks. > > -- > tejun > > > > Thanks, > > -- > > Julian Sun
> > <sunjunchao@bytedance.com> > > > >
> > Thanks,
> > --
> > Julian Sun <sunjunchao@bytedance.com>
> >
>

Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-21  2:30     ` [External] " Julian Sun
  2025-08-21 16:59       ` Tejun Heo
@ 2025-08-25 10:13       ` Jan Kara
  2025-08-25 12:08         ` Julian Sun
  2025-08-25 18:57         ` [External] " Tejun Heo
  1 sibling, 2 replies; 26+ messages in thread
From: Jan Kara @ 2025-08-25 10:13 UTC (permalink / raw)
  To: Julian Sun
  Cc: Tejun Heo, linux-fsdevel, cgroups, linux-mm, viro, brauner, jack,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

On Thu 21-08-25 10:30:30, Julian Sun wrote:
> On Thu, Aug 21, 2025 at 4:58 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote:
> > > @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
> > >       int __maybe_unused i;
> > >
> > >  #ifdef CONFIG_CGROUP_WRITEBACK
> > > -     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
> > > -             wb_wait_for_completion(&memcg->cgwb_frn[i].done);
> > > +     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
> > > +             struct wb_completion *done = memcg->cgwb_frn[i].done;
> > > +
> > > +             if (atomic_dec_and_test(&done->cnt))
> > > +                     kfree(done);
> > > +     }
> > >  #endif
> >
> > Can't you just remove done? I don't think it's doing anything after your
> > changes anyway.
> 
> Thanks for your review.
> 
> AFAICT done is also used to track free slots in
> mem_cgroup_track_foreign_dirty_slowpath() and
> mem_cgroup_flush_foreign(), otherwise we have no method to know which
> one is free and might flush more than what MEMCG_CGWB_FRN_CNT allow.
> 
> Am I missing something?

True, but is that mechanism really needed? Given the approximate nature of
foreign flushing, couldn't we just always replace the oldest foreign entry
regardless of whether the writeback is running or not? I didn't give too
deep thought to this but from a quick look this should work just fine...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-25 10:13       ` Jan Kara
@ 2025-08-25 12:08         ` Julian Sun
  2025-08-25 18:57         ` [External] " Tejun Heo
  1 sibling, 0 replies; 26+ messages in thread
From: Julian Sun @ 2025-08-25 12:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Tejun Heo, linux-fsdevel, cgroups, linux-mm, viro, brauner,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

在 2025/8/25 18:13, Jan Kara 写道:

Hi, Jan

Thanks for your review and comments.

> On Thu 21-08-25 10:30:30, Julian Sun wrote:
>> On Thu, Aug 21, 2025 at 4:58 AM Tejun Heo <tj@kernel.org> wrote:
>>>
>>> On Wed, Aug 20, 2025 at 07:19:40PM +0800, Julian Sun wrote:
>>>> @@ -3912,8 +3921,12 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>>>>        int __maybe_unused i;
>>>>
>>>>   #ifdef CONFIG_CGROUP_WRITEBACK
>>>> -     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
>>>> -             wb_wait_for_completion(&memcg->cgwb_frn[i].done);
>>>> +     for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
>>>> +             struct wb_completion *done = memcg->cgwb_frn[i].done;
>>>> +
>>>> +             if (atomic_dec_and_test(&done->cnt))
>>>> +                     kfree(done);
>>>> +     }
>>>>   #endif
>>>
>>> Can't you just remove done? I don't think it's doing anything after your
>>> changes anyway.
>>
>> Thanks for your review.
>>
>> AFAICT done is also used to track free slots in
>> mem_cgroup_track_foreign_dirty_slowpath() and
>> mem_cgroup_flush_foreign(), otherwise we have no method to know which
>> one is free and might flush more than what MEMCG_CGWB_FRN_CNT allow.
>>
>> Am I missing something?
> 
> True, but is that mechanism really needed? Given the approximate nature of
> foreign flushing, couldn't we just always replace the oldest foreign entry
> regardless of whether the writeback is running or not? I didn't give too
> deep thought to this but from a quick look this should work just fine...


AFAICT the mechanism is used to make the max number of wb works that we 
issue concurrently less than MEMCG_CGWB_FRN_CNT(4). If we replace the 
oldest and flush wb work whether the writeback is running or not, it 
seems that we are likely to flush more than MEMCG_CGWB_FRN_CNT wb works 
concurrently, I'm not sure how much influence this will have...
> 
> 								Honza

Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-22 17:56               ` Tejun Heo
  2025-08-23  6:18                 ` Julian Sun
@ 2025-08-25 17:45                 ` Julian Sun
  2025-08-25 18:53                   ` Tejun Heo
  1 sibling, 1 reply; 26+ messages in thread
From: Julian Sun @ 2025-08-25 17:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

Hi, Tejun

On Sat, Aug 23, 2025 at 1:56 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Aug 22, 2025 at 04:22:09PM +0800, Julian Sun wrote:
> > +struct wb_wait_queue_head {
> > +     wait_queue_head_t waitq;
> > +     wb_wait_wakeup_func_t wb_wakeup_func;
> > +};
>
> wait_queue_head_t itself already allows overriding the wakeup function.
> Please look for init_wait_func() usages in the tree. Hopefully, that should
> contain the changes within memcg.

Sorry for having misunderstood what you meant before. I’m afraid that
init_wait_func() cannot work the same way. Because calling
init_wait_func() presupposes that we are preparing to wait for an
event(like wb_wait_completion()), but waiting for such an event might
lead to a hung task.

Please correct me if I'm wrong.
>
> Thanks.
>
> --
> tejun


Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-25 17:45                 ` Julian Sun
@ 2025-08-25 18:53                   ` Tejun Heo
  2025-08-25 19:06                     ` Julian Sun
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2025-08-25 18:53 UTC (permalink / raw)
  To: Julian Sun
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

Hello,

On Tue, Aug 26, 2025 at 01:45:44AM +0800, Julian Sun wrote:
...
> Sorry for having misunderstood what you meant before. I’m afraid that
> init_wait_func() cannot work the same way. Because calling
> init_wait_func() presupposes that we are preparing to wait for an
> event(like wb_wait_completion()), but waiting for such an event might
> lead to a hung task.
> 
> Please correct me if I'm wrong.

Using init_wait_func() does not require someone waiting for it. AFAICS, you
should be able to do the same thing that you did - allocating the done
entries individually and freeing them when the done count reaches zero
without anyone waiting for it. waitq doesn't really make many assumptions
about how it's used - when you call wake_up() on it, it just walks the
queued entries and invoke the callbacks there.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-25 10:13       ` Jan Kara
  2025-08-25 12:08         ` Julian Sun
@ 2025-08-25 18:57         ` Tejun Heo
  1 sibling, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2025-08-25 18:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Julian Sun, linux-fsdevel, cgroups, linux-mm, viro, brauner,
	hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

Hello,

On Mon, Aug 25, 2025 at 12:13:53PM +0200, Jan Kara wrote:
> True, but is that mechanism really needed? Given the approximate nature of
> foreign flushing, couldn't we just always replace the oldest foreign entry
> regardless of whether the writeback is running or not? I didn't give too
> deep thought to this but from a quick look this should work just fine...

Maybe, it's a bit difficult to predict. The thing was built to work around
real cases of problem and FWIW we haven't heard of similar issues
afterwards. Given the peculiarity and subtlety of the problem, I wonder
keeping the behavior unchanged, if reasonably possible, is an easier path to
follow here.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [External] Re: [PATCH] memcg: Don't wait writeback completion when release memcg.
  2025-08-25 18:53                   ` Tejun Heo
@ 2025-08-25 19:06                     ` Julian Sun
  0 siblings, 0 replies; 26+ messages in thread
From: Julian Sun @ 2025-08-25 19:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-fsdevel, cgroups, linux-mm, viro, brauner, jack, hannes,
	mhocko, roman.gushchin, shakeel.butt, muchun.song, axboe

Hi,

On Tue, Aug 26, 2025 at 2:53 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Aug 26, 2025 at 01:45:44AM +0800, Julian Sun wrote:
> ...
> > Sorry for having misunderstood what you meant before. I’m afraid that
> > init_wait_func() cannot work the same way. Because calling
> > init_wait_func() presupposes that we are preparing to wait for an
> > event(like wb_wait_completion()), but waiting for such an event might
> > lead to a hung task.
> >
> > Please correct me if I'm wrong.
>
> Using init_wait_func() does not require someone waiting for it. AFAICS, you
> should be able to do the same thing that you did - allocating the done
> entries individually and freeing them when the done count reaches zero
> without anyone waiting for it. waitq doesn't really make many assumptions
> about how it's used - when you call wake_up() on it, it just walks the
> queued entries and invoke the callbacks there.

Ah, yeah, this sounds great, many thanks for your clarification.  I’ll
send v2 of the patch after testing.
>
> Thanks.
>
> --
> tejun


Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2025-08-25 19:06 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-20 11:19 [PATCH 0/3] memcg, writeback: Don't wait writeback completion Julian Sun
2025-08-20 11:19 ` [PATCH 1/3] writeback: Rename wb_writeback_work->auto_free to free_work Julian Sun
2025-08-20 11:19 ` [PATCH] writeback: Add wb_writeback_work->free_done Julian Sun
2025-08-20 11:19 ` [PATCH] memcg: Don't wait writeback completion when release memcg Julian Sun
2025-08-20 20:58   ` Tejun Heo
2025-08-21  2:30     ` [External] " Julian Sun
2025-08-21 16:59       ` Tejun Heo
2025-08-21 18:00         ` Julian Sun
2025-08-21 18:16           ` Julian Sun
2025-08-21 19:01           ` Tejun Heo
2025-08-22  8:22             ` Julian Sun
2025-08-22 17:56               ` Tejun Heo
2025-08-23  6:18                 ` Julian Sun
2025-08-23  8:08                   ` Giorgi Tchankvetadze
2025-08-23  8:22                     ` Julian Sun
2025-08-23 14:08                       ` Giorgi Tchankvetadze
2025-08-23 15:17                         ` Julian Sun
2025-08-25 17:45                 ` Julian Sun
2025-08-25 18:53                   ` Tejun Heo
2025-08-25 19:06                     ` Julian Sun
2025-08-25 10:13       ` Jan Kara
2025-08-25 12:08         ` Julian Sun
2025-08-25 18:57         ` [External] " Tejun Heo
2025-08-20 12:16 ` [PATCH 0/3] memcg, writeback: Don't wait writeback completion Giorgi Tchankvetadze
2025-08-21  2:37   ` [External] " Julian Sun
2025-08-22  9:29     ` Giorgi Tchankvetadze

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).