* [PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
@ 2026-03-27 10:21 Baolin Wang
2026-03-27 15:30 ` Andrew Morton
2026-03-27 16:41 ` Johannes Weiner
0 siblings, 2 replies; 5+ messages in thread
From: Baolin Wang @ 2026-03-27 10:21 UTC (permalink / raw)
To: akpm, hannes
Cc: david, mhocko, zhengqi.arch, shakeel.butt, axelrasmussen, yuanchu,
weixugc, ljs, baohua, kasong, baolin.wang, linux-mm, linux-kernel
balance_dirty_pages() does not throttle dirty folios on cgroup v1. See commit
9badce000e2c ("cgroup, writeback: don't enable cgroup writeback on traditional
hierarchies").
Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
longer attempt to write back filesystem folios through reclaim.
On large memory systems, the flusher may not be able to write back quickly
enough. Consequently, MGLRU will encounter many folios that are already
under writeback. Since we cannot reclaim these dirty folios, the system
may run out of memory and trigger the OOM killer.
Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
pages throttling on cgroup v1"), to avoid unnecessary OOM.
The following test steps can easily reproduce the OOM issue. With this patch
applied, the test passes successfully.
$mkdir /sys/fs/cgroup/memory/test
$echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
$echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
$dd if=/dev/zero of=/mnt/data.bin bs=1M count=800
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Changes from RFC:
- Add the Fixes tag.
- Add reviewed tags from Barry and Kairui. Thanks.
---
mm/vmscan.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 46657d2cef42..b5fdad1444af 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5036,9 +5036,24 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
* If too many file cache in the coldest generation can't be evicted
* due to being dirty, wake up the flusher.
*/
- if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
+ if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
wakeup_flusher_threads(WB_REASON_VMSCAN);
+ /*
+ * For cgroup v1, dirty throttling is achieved by waking up
+ * the kernel flusher here and later waiting on folios
+ * which are in writeback to finish (see shrink_folio_list()).
+ *
+ * The flusher may not be able to issue writeback quickly
+ * enough for cgroup v1 writeback throttling to work
+ * on a large system.
+ */
+ if (!writeback_throttling_sane(sc))
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+ }
+
/* whether this lruvec should be rotated */
return nr_to_scan < 0;
}
--
2.47.3
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
2026-03-27 10:21 [PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU Baolin Wang
@ 2026-03-27 15:30 ` Andrew Morton
2026-03-28 2:38 ` Baolin Wang
2026-03-27 16:41 ` Johannes Weiner
1 sibling, 1 reply; 5+ messages in thread
From: Andrew Morton @ 2026-03-27 15:30 UTC (permalink / raw)
To: Baolin Wang
Cc: hannes, david, mhocko, zhengqi.arch, shakeel.butt, axelrasmussen,
yuanchu, weixugc, ljs, baohua, kasong, linux-mm, linux-kernel
On Fri, 27 Mar 2026 18:21:08 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1.
> See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
> on traditional hierarchies").
>
> Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
> longer attempt to write back filesystem folios through reclaim.
>
> On large memory systems, the flusher may not be able to write back quickly
> enough. Consequently, MGLRU will encounter many folios that are already
> under writeback. Since we cannot reclaim these dirty folios, the system
> may run out of memory and trigger the OOM killer.
>
> Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
> which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
> pages throttling on cgroup v1"), to avoid unnecessary OOM.
>
> The following test program can easily reproduce the OOM issue. With this patch
> applied, the test passes successfully.
>
> $mkdir /sys/fs/cgroup/memory/test
> $echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> $echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
> $dd if=/dev/zero of=/mnt/data.bin bs=1M count=800
>
> Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
That's 3+ years ago; I don't see a need to rush this into 7.0.
But should we cc:stable?
* Re: [PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
2026-03-27 10:21 [PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU Baolin Wang
2026-03-27 15:30 ` Andrew Morton
@ 2026-03-27 16:41 ` Johannes Weiner
2026-03-27 17:59 ` Kairui Song
1 sibling, 1 reply; 5+ messages in thread
From: Johannes Weiner @ 2026-03-27 16:41 UTC (permalink / raw)
To: Baolin Wang
Cc: akpm, david, mhocko, zhengqi.arch, shakeel.butt, axelrasmussen,
yuanchu, weixugc, ljs, baohua, kasong, linux-mm, linux-kernel
On Fri, Mar 27, 2026 at 06:21:08PM +0800, Baolin Wang wrote:
> The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1.
> See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
> on traditional hierarchies").
>
> Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
> longer attempt to write back filesystem folios through reclaim.
>
> On large memory systems, the flusher may not be able to write back quickly
> enough. Consequently, MGLRU will encounter many folios that are already
> under writeback. Since we cannot reclaim these dirty folios, the system
> may run out of memory and trigger the OOM killer.
>
> Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
> which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
> pages throttling on cgroup v1"), to avoid unnecessary OOM.
This fix for cgroup1 makes sense to me. For cgroup2, MGLRU shares the
shrink_node() reclaim throttling.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
The remaining notable difference is global reclaim. I don't see any
equivalent throttling in the lru_gen_shrink_node() path. What prevents
premature OOMs at the system level?
* Re: [PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
2026-03-27 16:41 ` Johannes Weiner
@ 2026-03-27 17:59 ` Kairui Song
0 siblings, 0 replies; 5+ messages in thread
From: Kairui Song @ 2026-03-27 17:59 UTC (permalink / raw)
To: Johannes Weiner
Cc: Baolin Wang, akpm, david, mhocko, zhengqi.arch, shakeel.butt,
axelrasmussen, yuanchu, weixugc, ljs, baohua, kasong, linux-mm,
linux-kernel
On Fri, Mar 27, 2026 at 12:41:06PM +0800, Johannes Weiner wrote:
> On Fri, Mar 27, 2026 at 06:21:08PM +0800, Baolin Wang wrote:
> > The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1.
> > See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
> > on traditional hierarchies").
> >
> > Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
> > longer attempt to write back filesystem folios through reclaim.
> >
> > On large memory systems, the flusher may not be able to write back quickly
> > enough. Consequently, MGLRU will encounter many folios that are already
> > under writeback. Since we cannot reclaim these dirty folios, the system
> > may run out of memory and trigger the OOM killer.
> >
> > Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
> > which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
> > pages throttling on cgroup v1"), to avoid unnecessary OOM.
>
> This fix for cgroup1 makes sense to me. For cgroup2, MGLRU shares the
> shrink_node() reclaim throttling.
>
Ahem, I think that throttling is actually broken, so I'm fixing it. I shared
a patch for it yesterday, as I saw Baolin is fixing v1:
https://lore.kernel.org/linux-mm/acPOn07xah2eh0WU@KASONG-MC4/

It still needs a ~3 LOC change on top of the patch above, since I also need
to clean up MGLRU's forced reset of PG_reclaim. I will post it tomorrow as
part of the v2 of MGLRU's dirty folio handling rework [1]. That series
improves the batching and dirty handling, so fixing this becomes much easier
and cleaner by sharing the same routine; before that series it could get
very messy and cause overly aggressive throttling.

I will post it tomorrow since my stress test suite is still running today;
it seems all green so far, but just in case.

I initially wanted to post that fix after the series, but Ridong suggested
just fixing it too. It actually helps to deduplicate the code, and the
change is also small.
And right, MGLRU had some issues with dirty flushing previously; Jingxiang
and I fixed them:
https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/

But premature OOM is still a problem, not only with dirty or writeback
folios but also somehow related to how aging and protection work. That new
series also fixes quite a lot of these, e.g. the OOM reproducer in the
cover letter:
https://lore.kernel.org/linux-mm/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com/

It's still not perfect, but no worries, as we will solve all of them
(hopefully very soon):
https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/

My local reproducer using dm-delay and dd can trigger OOM without the fix
I posted above, and there is no more problem with it applied.

I guess storage nowadays might just be too fast and rarely congests the
whole of memory, so no one has reported this yet. Of course, it is a real
issue; we did see a few suspicious OOMs, rare but possibly related.
* Re: [PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
2026-03-27 15:30 ` Andrew Morton
@ 2026-03-28 2:38 ` Baolin Wang
0 siblings, 0 replies; 5+ messages in thread
From: Baolin Wang @ 2026-03-28 2:38 UTC (permalink / raw)
To: Andrew Morton
Cc: hannes, david, mhocko, zhengqi.arch, shakeel.butt, axelrasmussen,
yuanchu, weixugc, ljs, baohua, kasong, linux-mm, linux-kernel
On 3/27/26 11:30 PM, Andrew Morton wrote:
> On Fri, 27 Mar 2026 18:21:08 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>
>> The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1.
>> See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
>> on traditional hierarchies").
>>
>> Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
>> longer attempt to write back filesystem folios through reclaim.
>>
>> On large memory systems, the flusher may not be able to write back quickly
>> enough. Consequently, MGLRU will encounter many folios that are already
>> under writeback. Since we cannot reclaim these dirty folios, the system
>> may run out of memory and trigger the OOM killer.
>>
>> Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
>> which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
>> pages throttling on cgroup v1"), to avoid unnecessary OOM.
>>
>> The following test program can easily reproduce the OOM issue. With this patch
>> applied, the test passes successfully.
>>
>> $mkdir /sys/fs/cgroup/memory/test
>> $echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>> $echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
>> $dd if=/dev/zero of=/mnt/data.bin bs=1M count=800
>>
>> Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
>
> 3+ years ago, I don't see a need to rush this into 7.0.
Agree.
> But should we cc:stable?
I don't think it's necessary. The issue isn't that serious :)