[RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU

public inbox for linux-mm@kvack.org

* [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
@ 2026-03-25 11:50 Baolin Wang
  2026-03-25 11:55 ` Baolin Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Baolin Wang @ 2026-03-25 11:50 UTC (permalink / raw)
  To: akpm, hannes
  Cc: david, mhocko, zhengqi.arch, shakeel.butt, axelrasmussen, yuanchu,
	weixugc, baohua, kasong, baolin.wang, linux-mm, linux-kernel

balance_dirty_pages() does not throttle dirty folios on cgroup v1.
See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
on traditional hierarchies").

Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
longer attempt to write back filesystem folios through reclaim.

On large memory systems, the flusher may not be able to write back quickly
enough. Consequently, MGLRU will encounter many folios that are already
under writeback. Since we cannot reclaim these dirty folios, the system
may run out of memory and trigger the OOM killer.

Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
pages throttling on cgroup v1"), to avoid unnecessary OOM.

The following commands easily reproduce the OOM issue. With this patch
applied, the test passes.

$ mkdir /sys/fs/cgroup/memory/test
$ echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
$ echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
$ dd if=/dev/zero of=/mnt/data.bin bs=1M count=800

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/vmscan.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33287ba4a500..a9648269fae8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5036,9 +5036,20 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	 * If too many file cache in the coldest generation can't be evicted
 	 * due to being dirty, wake up the flusher.
 	 */
-	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
+	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
+		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
 		wakeup_flusher_threads(WB_REASON_VMSCAN);

+		/*
+		 * For cgroupv1 dirty throttling is achieved by waking up
+		 * the kernel flusher here and later waiting on folios
+		 * which are in writeback to finish (see shrink_folio_list()).
+		 */
+		if (!writeback_throttling_sane(sc))
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+	}
+
 	/* whether this lruvec should be rotated */
 	return nr_to_scan < 0;
 }
-- 
2.47.3


* Re: [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
  2026-03-25 11:50 [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU Baolin Wang
@ 2026-03-25 11:55 ` Baolin Wang
  2026-03-25 12:07 ` Kairui Song
  2026-03-26  5:04 ` Barry Song
  2 siblings, 0 replies; 8+ messages in thread
From: Baolin Wang @ 2026-03-25 11:55 UTC (permalink / raw)
  To: akpm, hannes
  Cc: david, mhocko, zhengqi.arch, shakeel.butt, axelrasmussen, yuanchu,
	weixugc, baohua, kasong, linux-mm, linux-kernel,
	Lorenzo Stoakes (Oracle)

CC Lorenzo.

(Sorry, Lorenzo. I switched to use your new email address, but forgot to 
CC you.)

On 3/25/26 7:50 PM, Baolin Wang wrote:
> The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1.
> See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
> on traditional hierarchies").
> 
> Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
> longer attempt to write back filesystem folios through reclaim.
> 
> On large memory systems, the flusher may not be able to write back quickly
> enough. Consequently, MGLRU will encounter many folios that are already
> under writeback. Since we cannot reclaim these dirty folios, the system
> may run out of memory and trigger the OOM killer.
> 
> Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
> which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
> pages throttling on cgroup v1"), to avoid unnecessary OOM.
> 
> The following test program can easily reproduce the OOM issue. With this patch
> applied, the test passes successfully.
> 
> $mkdir /sys/fs/cgroup/memory/test
> $echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> $echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
> $dd if=/dev/zero of=/mnt/data.bin bs=1M count=800
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>   mm/vmscan.c | 13 ++++++++++++-
>   1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 33287ba4a500..a9648269fae8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5036,9 +5036,20 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>   	 * If too many file cache in the coldest generation can't be evicted
>   	 * due to being dirty, wake up the flusher.
>   	 */
> -	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
> +	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
> +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
>   		wakeup_flusher_threads(WB_REASON_VMSCAN);
>   
> +		/*
> +		 * For cgroupv1 dirty throttling is achieved by waking up
> +		 * the kernel flusher here and later waiting on folios
> +		 * which are in writeback to finish (see shrink_folio_list()).
> +		 */
> +		if (!writeback_throttling_sane(sc))
> +			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> +	}
> +
>   	/* whether this lruvec should be rotated */
>   	return nr_to_scan < 0;
>   }




* Re: [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
  2026-03-25 11:50 [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU Baolin Wang
  2026-03-25 11:55 ` Baolin Wang
@ 2026-03-25 12:07 ` Kairui Song
  2026-03-25 13:20   ` Baolin Wang
  2026-03-26  5:04 ` Barry Song
  2 siblings, 1 reply; 8+ messages in thread
From: Kairui Song @ 2026-03-25 12:07 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	axelrasmussen, yuanchu, weixugc, baohua, kasong, linux-mm,
	linux-kernel

On Wed, Mar 25, 2026 at 07:50:40PM +0800, Baolin Wang wrote:
> The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1.
> See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
> on traditional hierarchies").
> 
> Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
> longer attempt to write back filesystem folios through reclaim.
> 
> On large memory systems, the flusher may not be able to write back quickly
> enough. Consequently, MGLRU will encounter many folios that are already
> under writeback. Since we cannot reclaim these dirty folios, the system
> may run out of memory and trigger the OOM killer.
> 
> Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
> which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
> pages throttling on cgroup v1"), to avoid unnecessary OOM.
> 
> The following test program can easily reproduce the OOM issue. With this patch
> applied, the test passes successfully.
> 
> $mkdir /sys/fs/cgroup/memory/test
> $echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> $echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
> $dd if=/dev/zero of=/mnt/data.bin bs=1M count=800
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  mm/vmscan.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 33287ba4a500..a9648269fae8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5036,9 +5036,20 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  	 * If too many file cache in the coldest generation can't be evicted
>  	 * due to being dirty, wake up the flusher.
>  	 */
> -	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
> +	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
> +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
>  		wakeup_flusher_threads(WB_REASON_VMSCAN);
>  
> +		/*
> +		 * For cgroupv1 dirty throttling is achieved by waking up
> +		 * the kernel flusher here and later waiting on folios
> +		 * which are in writeback to finish (see shrink_folio_list()).
> +		 */
> +		if (!writeback_throttling_sane(sc))
> +			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> +	}
> +
>  	/* whether this lruvec should be rotated */
>  	return nr_to_scan < 0;
>  }

Hi Baolin

Interesting, I want to fix this too, after or together with:
https://lore.kernel.org/linux-mm/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com/

With the fix you posted, MGLRU's dirty throttling is still
a bit different from the active/inactive LRU. In fact, MGLRU
treats dirty folios quite differently, which causes other issues too.
For example, dirty folios are much more likely to get stuck at the
tail under MGLRU, so simply applying the throttling could throttle
too aggressively. Or the batch may be too large to ever trigger the
throttling.

So I'm planning to add the patch below to V2 of that series (this
was also suggested by Ridong). What do you think? There are several
other throttling issues to fix too, beyond just the
V1 support. I can add your Suggested-by too.

commit e9fc6fe9c1236f7f70eeb45d9c47c56125d14013
Author: Kairui Song <kasong@tencent.com>
Date:   Tue Mar 24 19:45:26 2026 +0800

    mm/vmscan: unify writeback reclaim statistic and throttling
    
    Currently MGLRU and non-MGLRU handle the reclaim statistics and
    writeback differently, especially throttling. For MGLRU the
    throttling part is basically ignored.
    
    Let's just unify this part so both setups have the same behavior.
    
    Signed-off-by: Kairui Song <kasong@tencent.com>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bdf611544880..fcb91a644277 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1943,6 +1943,44 @@ static int current_may_throttle(void)
 	return !(current->flags & PF_LOCAL_THROTTLE);
 }
 
+static void handle_reclaim_writeback(unsigned long nr_taken,
+				     struct pglist_data *pgdat,
+				     struct scan_control *sc,
+				     struct reclaim_stat *stat)
+{
+	/*
+	 * If dirty folios are scanned that are not queued for IO, it
+	 * implies that flushers are not doing their job. This can
+	 * happen when memory pressure pushes dirty folios to the end of
+	 * the LRU before the dirty limits are breached and the dirty
+	 * data has expired. It can also happen when the proportion of
+	 * dirty folios grows not through writes but through memory
+	 * pressure reclaiming all the clean cache. And in some cases,
+	 * the flushers simply cannot keep up with the allocation
+	 * rate. Nudge the flusher threads in case they are asleep.
+	 */
+	if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
+		wakeup_flusher_threads(WB_REASON_VMSCAN);
+		/*
+		 * For cgroupv1 dirty throttling is achieved by waking up
+		 * the kernel flusher here and later waiting on folios
+		 * which are in writeback to finish (see shrink_folio_list()).
+		 *
+		 * Flusher may not be able to issue writeback quickly
+		 * enough for cgroupv1 writeback throttling to work
+		 * on a large system.
+		 */
+		if (!writeback_throttling_sane(sc))
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+	}
+
+	sc->nr.dirty += stat->nr_dirty;
+	sc->nr.congested += stat->nr_congested;
+	sc->nr.writeback += stat->nr_writeback;
+	sc->nr.immediate += stat->nr_immediate;
+	sc->nr.taken += nr_taken;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_node().  It returns the number
  * of reclaimed pages
@@ -2006,39 +2044,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	lruvec_lock_irq(lruvec);
 	lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
 					nr_scanned - nr_reclaimed);
-
-	/*
-	 * If dirty folios are scanned that are not queued for IO, it
-	 * implies that flushers are not doing their job. This can
-	 * happen when memory pressure pushes dirty folios to the end of
-	 * the LRU before the dirty limits are breached and the dirty
-	 * data has expired. It can also happen when the proportion of
-	 * dirty folios grows not through writes but through memory
-	 * pressure reclaiming all the clean cache. And in some cases,
-	 * the flushers simply cannot keep up with the allocation
-	 * rate. Nudge the flusher threads in case they are asleep.
-	 */
-	if (stat.nr_unqueued_dirty == nr_taken) {
-		wakeup_flusher_threads(WB_REASON_VMSCAN);
-		/*
-		 * For cgroupv1 dirty throttling is achieved by waking up
-		 * the kernel flusher here and later waiting on folios
-		 * which are in writeback to finish (see shrink_folio_list()).
-		 *
-		 * Flusher may not be able to issue writeback quickly
-		 * enough for cgroupv1 writeback throttling to work
-		 * on a large system.
-		 */
-		if (!writeback_throttling_sane(sc))
-			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
-	}
-
-	sc->nr.dirty += stat.nr_dirty;
-	sc->nr.congested += stat.nr_congested;
-	sc->nr.writeback += stat.nr_writeback;
-	sc->nr.immediate += stat.nr_immediate;
-	sc->nr.taken += nr_taken;
-
+	handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
 			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
 	return nr_reclaimed;
@@ -4848,17 +4854,11 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 retry:
 	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
 	sc->nr_reclaimed += reclaimed;
+	handle_reclaim_writeback(isolated, pgdat, sc, &stat);
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
 			type_scanned, reclaimed, &stat, sc->priority,
 			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 
-	/*
-	 * If too many file cache in the coldest generation can't be evicted
-	 * due to being dirty, wake up the flusher.
-	 */
-	if (stat.nr_unqueued_dirty == isolated)
-		wakeup_flusher_threads(WB_REASON_VMSCAN);
-
 	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
 		DEFINE_MIN_SEQ(lruvec);
 
@@ -4901,6 +4901,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	if (!list_empty(&list)) {
 		skip_retry = true;
+		isolated = 0;
 		goto retry;
 	}



* Re: [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
  2026-03-25 12:07 ` Kairui Song
@ 2026-03-25 13:20   ` Baolin Wang
  2026-03-25 13:35     ` Kairui Song
  0 siblings, 1 reply; 8+ messages in thread
From: Baolin Wang @ 2026-03-25 13:20 UTC (permalink / raw)
  To: Kairui Song
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	axelrasmussen, yuanchu, weixugc, baohua, kasong, linux-mm,
	linux-kernel, Lorenzo Stoakes (Oracle)

Hi Kairui,

On 3/25/26 8:07 PM, Kairui Song wrote:
> On Wed, Mar 25, 2026 at 07:50:40PM +0800, Baolin Wang wrote:
>> The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1.
>> See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
>> on traditional hierarchies").
>>
>> Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
>> longer attempt to write back filesystem folios through reclaim.
>>
>> On large memory systems, the flusher may not be able to write back quickly
>> enough. Consequently, MGLRU will encounter many folios that are already
>> under writeback. Since we cannot reclaim these dirty folios, the system
>> may run out of memory and trigger the OOM killer.
>>
>> Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
>> which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
>> pages throttling on cgroup v1"), to avoid unnecessary OOM.
>>
>> The following test program can easily reproduce the OOM issue. With this patch
>> applied, the test passes successfully.
>>
>> $mkdir /sys/fs/cgroup/memory/test
>> $echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>> $echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
>> $dd if=/dev/zero of=/mnt/data.bin bs=1M count=800
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/vmscan.c | 13 ++++++++++++-
>>   1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 33287ba4a500..a9648269fae8 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -5036,9 +5036,20 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>   	 * If too many file cache in the coldest generation can't be evicted
>>   	 * due to being dirty, wake up the flusher.
>>   	 */
>> -	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
>> +	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
>> +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> +
>>   		wakeup_flusher_threads(WB_REASON_VMSCAN);
>>   
>> +		/*
>> +		 * For cgroupv1 dirty throttling is achieved by waking up
>> +		 * the kernel flusher here and later waiting on folios
>> +		 * which are in writeback to finish (see shrink_folio_list()).
>> +		 */
>> +		if (!writeback_throttling_sane(sc))
>> +			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>> +	}
>> +
>>   	/* whether this lruvec should be rotated */
>>   	return nr_to_scan < 0;
>>   }
> 
> Hi Baolin
> 
> Interesting I want to fix this too, after or with:
> https://lore.kernel.org/linux-mm/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com/

Thanks for taking a look.

> 
> With current fix you posted, MGLRU's dirty throttling is still
> a bit different from active / inactive LRU. In fact MGLRU
> treat dirty folios quite differently causing many other issues too,
> e.g. it's much more likely for dirty folios to stuck at the tail
> for MGLRU so simply apply the throttling could cause too
> aggressive throttling. Or batch is too large to trigger the
> throttling.

Thanks for sharing this.

> So I'm planning to add below patch to V2 of that series (also this
> is suggested by Ridong), how do you think? There are several
> other throttling things to be fixed too, more than just the
> V1 support. I can have your suggested-by too.

But I still think this fix deserves its own commit, because it fixes a 
real issue that I ran into. Even if the throttling isn't perfect for 
cgroup v1, it aligns with the legacy LRU behavior and, first of all, is 
essential to avoid premature OOMs. Improving MGLRU dirty folio handling 
can be done as a separate optimization in your series.

Anyway, let's also wait for more feedback from others.

> commit e9fc6fe9c1236f7f70eeb45d9c47c56125d14013
> Author: Kairui Song <kasong@tencent.com>
> Date:   Tue Mar 24 19:45:26 2026 +0800
> 
>      mm/vmscan: unify writeback reclaim statistic and throttling
>      
>      Currently MGLRU and non-MGLRU handles the reclaim statistic and
>      writeback handling, especially throttling differently. For MGLRU the
>      throttling part is basically ignore.
>      
>      Let just unify this part so both setup will have the same behavior.
>      
>      Signed-off-by: Kairui Song <kasong@tencent.com>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bdf611544880..fcb91a644277 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1943,6 +1943,44 @@ static int current_may_throttle(void)
>   	return !(current->flags & PF_LOCAL_THROTTLE);
>   }
>   
> +static void handle_reclaim_writeback(unsigned long nr_taken,
> +				     struct pglist_data *pgdat,
> +				     struct scan_control *sc,
> +				     struct reclaim_stat *stat)
> +{
> +	/*
> +	 * If dirty folios are scanned that are not queued for IO, it
> +	 * implies that flushers are not doing their job. This can
> +	 * happen when memory pressure pushes dirty folios to the end of
> +	 * the LRU before the dirty limits are breached and the dirty
> +	 * data has expired. It can also happen when the proportion of
> +	 * dirty folios grows not through writes but through memory
> +	 * pressure reclaiming all the clean cache. And in some cases,
> +	 * the flushers simply cannot keep up with the allocation
> +	 * rate. Nudge the flusher threads in case they are asleep.
> +	 */
> +	if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
> +		wakeup_flusher_threads(WB_REASON_VMSCAN);
> +		/*
> +		 * For cgroupv1 dirty throttling is achieved by waking up
> +		 * the kernel flusher here and later waiting on folios
> +		 * which are in writeback to finish (see shrink_folio_list()).
> +		 *
> +		 * Flusher may not be able to issue writeback quickly
> +		 * enough for cgroupv1 writeback throttling to work
> +		 * on a large system.
> +		 */
> +		if (!writeback_throttling_sane(sc))
> +			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> +	}
> +
> +	sc->nr.dirty += stat->nr_dirty;
> +	sc->nr.congested += stat->nr_congested;
> +	sc->nr.writeback += stat->nr_writeback;
> +	sc->nr.immediate += stat->nr_immediate;
> +	sc->nr.taken += nr_taken;
> +}
> +
>   /*
>    * shrink_inactive_list() is a helper for shrink_node().  It returns the number
>    * of reclaimed pages
> @@ -2006,39 +2044,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>   	lruvec_lock_irq(lruvec);
>   	lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
>   					nr_scanned - nr_reclaimed);
> -
> -	/*
> -	 * If dirty folios are scanned that are not queued for IO, it
> -	 * implies that flushers are not doing their job. This can
> -	 * happen when memory pressure pushes dirty folios to the end of
> -	 * the LRU before the dirty limits are breached and the dirty
> -	 * data has expired. It can also happen when the proportion of
> -	 * dirty folios grows not through writes but through memory
> -	 * pressure reclaiming all the clean cache. And in some cases,
> -	 * the flushers simply cannot keep up with the allocation
> -	 * rate. Nudge the flusher threads in case they are asleep.
> -	 */
> -	if (stat.nr_unqueued_dirty == nr_taken) {
> -		wakeup_flusher_threads(WB_REASON_VMSCAN);
> -		/*
> -		 * For cgroupv1 dirty throttling is achieved by waking up
> -		 * the kernel flusher here and later waiting on folios
> -		 * which are in writeback to finish (see shrink_folio_list()).
> -		 *
> -		 * Flusher may not be able to issue writeback quickly
> -		 * enough for cgroupv1 writeback throttling to work
> -		 * on a large system.
> -		 */
> -		if (!writeback_throttling_sane(sc))
> -			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> -	}
> -
> -	sc->nr.dirty += stat.nr_dirty;
> -	sc->nr.congested += stat.nr_congested;
> -	sc->nr.writeback += stat.nr_writeback;
> -	sc->nr.immediate += stat.nr_immediate;
> -	sc->nr.taken += nr_taken;
> -
> +	handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
>   	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>   			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
>   	return nr_reclaimed;
> @@ -4848,17 +4854,11 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>   retry:
>   	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
>   	sc->nr_reclaimed += reclaimed;
> +	handle_reclaim_writeback(isolated, pgdat, sc, &stat);
>   	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>   			type_scanned, reclaimed, &stat, sc->priority,
>   			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>   
> -	/*
> -	 * If too many file cache in the coldest generation can't be evicted
> -	 * due to being dirty, wake up the flusher.
> -	 */
> -	if (stat.nr_unqueued_dirty == isolated)
> -		wakeup_flusher_threads(WB_REASON_VMSCAN);
> -
>   	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
>   		DEFINE_MIN_SEQ(lruvec);
>   
> @@ -4901,6 +4901,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>   
>   	if (!list_empty(&list)) {
>   		skip_retry = true;
> +		isolated = 0;
>   		goto retry;
>   	}




* Re: [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
  2026-03-25 13:20   ` Baolin Wang
@ 2026-03-25 13:35     ` Kairui Song
  2026-03-26  1:57       ` Baolin Wang
  0 siblings, 1 reply; 8+ messages in thread
From: Kairui Song @ 2026-03-25 13:35 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	axelrasmussen, yuanchu, weixugc, baohua, kasong, linux-mm,
	linux-kernel, Lorenzo Stoakes (Oracle)

On Wed, Mar 25, 2026 at 09:20:55PM +0800, Baolin Wang wrote:
> Hi Kairui,
> 
> On 3/25/26 8:07 PM, Kairui Song wrote:
> > On Wed, Mar 25, 2026 at 07:50:40PM +0800, Baolin Wang wrote:
> > > The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1.
> > > See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
> > > on traditional hierarchies").
> > > 
> > > Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
> > > longer attempt to write back filesystem folios through reclaim.
> > > 
> > > On large memory systems, the flusher may not be able to write back quickly
> > > enough. Consequently, MGLRU will encounter many folios that are already
> > > under writeback. Since we cannot reclaim these dirty folios, the system
> > > may run out of memory and trigger the OOM killer.
> > > 
> > > Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
> > > which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
> > > pages throttling on cgroup v1"), to avoid unnecessary OOM.
> > > 
> > > The following test program can easily reproduce the OOM issue. With this patch
> > > applied, the test passes successfully.
> > > 
> > > $mkdir /sys/fs/cgroup/memory/test
> > > $echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> > > $echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
> > > $dd if=/dev/zero of=/mnt/data.bin bs=1M count=800
> > > 
> > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > ---
> > >   mm/vmscan.c | 13 ++++++++++++-
> > >   1 file changed, 12 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 33287ba4a500..a9648269fae8 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -5036,9 +5036,20 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > >   	 * If too many file cache in the coldest generation can't be evicted
> > >   	 * due to being dirty, wake up the flusher.
> > >   	 */
> > > -	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
> > > +	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
> > > +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > +
> > >   		wakeup_flusher_threads(WB_REASON_VMSCAN);
> > > +		/*
> > > +		 * For cgroupv1 dirty throttling is achieved by waking up
> > > +		 * the kernel flusher here and later waiting on folios
> > > +		 * which are in writeback to finish (see shrink_folio_list()).
> > > +		 */
> > > +		if (!writeback_throttling_sane(sc))
> > > +			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> > > +	}
> > > +
> > >   	/* whether this lruvec should be rotated */
> > >   	return nr_to_scan < 0;
> > >   }
> > 
> > Hi Baolin
> > 
> > Interesting I want to fix this too, after or with:
> > https://lore.kernel.org/linux-mm/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com/
> 
> Thanks for taking a look.
> 
> > 
> > With current fix you posted, MGLRU's dirty throttling is still
> > a bit different from active / inactive LRU. In fact MGLRU
> > treat dirty folios quite differently causing many other issues too,
> > e.g. it's much more likely for dirty folios to stuck at the tail
> > for MGLRU so simply apply the throttling could cause too
> > aggressive throttling. Or batch is too large to trigger the
> > throttling.
> 
> Thanks for sharing this.

Hi Baolin,

> 
> > So I'm planning to add below patch to V2 of that series (also this
> > is suggested by Ridong), how do you think? There are several
> > other throttling things to be fixed too, more than just the
> > V1 support. I can have your suggested-by too.
> 
> But I still think this fix deserves its own commit, because this is indeed
> fixing a real issue that I ran into. Even if the throttling isn't perfect
> for cgroup v1, it aligns with the legacy-LRU behavior and is essential to
> avoid premature OOMs firstly. MGLRU dirty folio handling improvement can be
> done as a separate optimization in your series.
> 
> Anyway, let's also wait for more feedback from others.
> 

Sure, fixing this first is fine with me. I'm just saying that you may
still see unexpected or ineffective throttling with this.

There is no conflict between these two approaches. I can rebase that
series on top of yours, and that series would help to solve the
rest of the issues.



* Re: [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
  2026-03-25 13:35     ` Kairui Song
@ 2026-03-26  1:57       ` Baolin Wang
  0 siblings, 0 replies; 8+ messages in thread
From: Baolin Wang @ 2026-03-26  1:57 UTC (permalink / raw)
  To: Kairui Song
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	axelrasmussen, yuanchu, weixugc, baohua, kasong, linux-mm,
	linux-kernel, Lorenzo Stoakes (Oracle)



On 3/25/26 9:35 PM, Kairui Song wrote:
> On Wed, Mar 25, 2026 at 09:20:55PM +0800, Baolin Wang wrote:
>> Hi Kairui,
>>
>> On 3/25/26 8:07 PM, Kairui Song wrote:
>>> On Wed, Mar 25, 2026 at 07:50:40PM +0800, Baolin Wang wrote:
>>>> The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1.
>>>> See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
>>>> on traditional hierarchies").
>>>>
>>>> Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
>>>> longer attempt to write back filesystem folios through reclaim.
>>>>
>>>> On large memory systems, the flusher may not be able to write back quickly
>>>> enough. Consequently, MGLRU will encounter many folios that are already
>>>> under writeback. Since we cannot reclaim these dirty folios, the system
>>>> may run out of memory and trigger the OOM killer.
>>>>
>>>> Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
>>>> which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
>>>> pages throttling on cgroup v1"), to avoid unnecessary OOM.
>>>>
>>>> The following test program can easily reproduce the OOM issue. With this patch
>>>> applied, the test passes successfully.
>>>>
>>>> $mkdir /sys/fs/cgroup/memory/test
>>>> $echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>>>> $echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
>>>> $dd if=/dev/zero of=/mnt/data.bin bs=1M count=800
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>>    mm/vmscan.c | 13 ++++++++++++-
>>>>    1 file changed, 12 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 33287ba4a500..a9648269fae8 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -5036,9 +5036,20 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>>>    	 * If too many file cache in the coldest generation can't be evicted
>>>>    	 * due to being dirty, wake up the flusher.
>>>>    	 */
>>>> -	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
>>>> +	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
>>>> +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>> +
>>>>    		wakeup_flusher_threads(WB_REASON_VMSCAN);
>>>> +		/*
>>>> +		 * For cgroupv1 dirty throttling is achieved by waking up
>>>> +		 * the kernel flusher here and later waiting on folios
>>>> +		 * which are in writeback to finish (see shrink_folio_list()).
>>>> +		 */
>>>> +		if (!writeback_throttling_sane(sc))
>>>> +			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>>>> +	}
>>>> +
>>>>    	/* whether this lruvec should be rotated */
>>>>    	return nr_to_scan < 0;
>>>>    }
>>>
>>> Hi Baolin
>>>
>>> Interesting, I want to fix this too, after or together with:
>>> https://lore.kernel.org/linux-mm/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com/
>>
>> Thanks for taking a look.
>>
>>>
>>> With the fix you posted, MGLRU's dirty throttling is still
>>> a bit different from the active / inactive LRU. In fact, MGLRU
>>> treats dirty folios quite differently, which causes many other
>>> issues too. E.g. dirty folios are much more likely to get stuck at
>>> the tail under MGLRU, so simply applying the throttling could make
>>> it too aggressive. Or the batch may be too large to trigger the
>>> throttling.
>>
>> Thanks for sharing this.
> 
> Hi Baolin,
> 
>>
>>> So I'm planning to add the patch below to V2 of that series (this
>>> was also suggested by Ridong). What do you think? There are several
>>> other throttling issues to be fixed too, beyond just the
>>> V1 support. I can add your Suggested-by too.
>>
>> But I still think this fix deserves its own commit, because it fixes
>> a real issue that I ran into. Even if the throttling isn't perfect
>> for cgroup v1, it aligns with the legacy-LRU behavior and is essential to
>> avoid premature OOMs first. Improvements to MGLRU dirty folio handling can
>> be done as a separate optimization in your series.
>>
>> Anyway, let's also wait for more feedback from others.
>>
> 
> Sure, fixing this first is fine with me. I'm just saying that you may
> still see unexpected or ineffective throttling with this.
> 
> There is no conflict between these two approaches. I can rebase that
> series on top of yours, and that series would help solve the
> remaining issues.

OK. Thanks.


* Re: [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
  2026-03-25 11:50 [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU Baolin Wang
  2026-03-25 11:55 ` Baolin Wang
  2026-03-25 12:07 ` Kairui Song
@ 2026-03-26  5:04 ` Barry Song
  2026-03-26  8:41   ` Baolin Wang
  2 siblings, 1 reply; 8+ messages in thread
From: Barry Song @ 2026-03-26  5:04 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	axelrasmussen, yuanchu, weixugc, kasong, linux-mm, linux-kernel

On Wed, Mar 25, 2026 at 7:51 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1.
> See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
> on traditional hierarchies").
>
> Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
> longer attempt to write back filesystem folios through reclaim.
>
> On large memory systems, the flusher may not be able to write back quickly
> enough. Consequently, MGLRU will encounter many folios that are already
> under writeback. Since we cannot reclaim these dirty folios, the system
> may run out of memory and trigger the OOM killer.
>
> Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
> which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
> pages throttling on cgroup v1"), to avoid unnecessary OOM.
>
> The following test program can easily reproduce the OOM issue. With this patch
> applied, the test passes successfully.
>
> $mkdir /sys/fs/cgroup/memory/test
> $echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> $echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
> $dd if=/dev/zero of=/mnt/data.bin bs=1M count=800
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>

LGTM,

Reviewed-by: Barry Song <baohua@kernel.org>

Maybe we can extract a common inline helper to avoid the copy-paste duplication.
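
For illustration only, such a helper might look roughly like the
user-space model below. This is a sketch under stated assumptions, not
the kernel implementation: struct reclaim_model and
model_wake_flusher_and_throttle() are hypothetical names, the two
counters stand in for calls to wakeup_flusher_threads() and
reclaim_throttle(), and the throttling_sane flag stands in for
writeback_throttling_sane(sc). The condition mirrors the MGLRU hunk
quoted in this message.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the reclaim state; not the real struct scan_control. */
struct reclaim_model {
	unsigned long nr_unqueued_dirty; /* dirty folios not yet queued for writeback */
	unsigned long nr_file_taken;     /* file folios isolated in this pass */
	bool throttling_sane;            /* models writeback_throttling_sane(sc) */
	int flusher_wakeups;             /* counts modelled wakeup_flusher_threads() calls */
	int throttle_calls;              /* counts modelled reclaim_throttle() calls */
};

/*
 * Hypothetical shared helper: wake the flusher when every isolated file
 * folio was dirty and unqueued, and additionally throttle reclaim when
 * writeback throttling is not "sane" (the cgroup v1 case in the patch).
 * Returns true if the flusher was woken.
 */
static inline bool model_wake_flusher_and_throttle(struct reclaim_model *m)
{
	if (!m->nr_unqueued_dirty || m->nr_unqueued_dirty != m->nr_file_taken)
		return false;

	m->flusher_wakeups++;
	if (!m->throttling_sane)
		m->throttle_calls++;
	return true;
}
```

In this model a cgroup v1 caller (throttling_sane == false) both wakes
the flusher and throttles, while a cgroup v2 caller only wakes the
flusher.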

> ---
>  mm/vmscan.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 33287ba4a500..a9648269fae8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5036,9 +5036,20 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>          * If too many file cache in the coldest generation can't be evicted
>          * due to being dirty, wake up the flusher.
>          */
> -       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
> +       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
> +               struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
>                 wakeup_flusher_threads(WB_REASON_VMSCAN);
>
> +               /*
> +                * For cgroupv1 dirty throttling is achieved by waking up
> +                * the kernel flusher here and later waiting on folios
> +                * which are in writeback to finish (see shrink_folio_list()).
> +                */
> +               if (!writeback_throttling_sane(sc))
> +                       reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> +       }
> +
>         /* whether this lruvec should be rotated */
>         return nr_to_scan < 0;
>  }
> --
> 2.47.3
>


* Re: [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
  2026-03-26  5:04 ` Barry Song
@ 2026-03-26  8:41   ` Baolin Wang
  0 siblings, 0 replies; 8+ messages in thread
From: Baolin Wang @ 2026-03-26  8:41 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, shakeel.butt,
	axelrasmussen, yuanchu, weixugc, kasong, linux-mm, linux-kernel



On 3/26/26 1:04 PM, Barry Song wrote:
> On Wed, Mar 25, 2026 at 7:51 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>> The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1.
>> See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
>> on traditional hierarchies").
>>
>> Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
>> longer attempt to write back filesystem folios through reclaim.
>>
>> On large memory systems, the flusher may not be able to write back quickly
>> enough. Consequently, MGLRU will encounter many folios that are already
>> under writeback. Since we cannot reclaim these dirty folios, the system
>> may run out of memory and trigger the OOM killer.
>>
>> Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
>> which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
>> pages throttling on cgroup v1"), to avoid unnecessary OOM.
>>
>> The following test program can easily reproduce the OOM issue. With this patch
>> applied, the test passes successfully.
>>
>> $mkdir /sys/fs/cgroup/memory/test
>> $echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>> $echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
>> $dd if=/dev/zero of=/mnt/data.bin bs=1M count=800
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> LGTM,
> 
> Reviewed-by: Barry Song <baohua@kernel.org>

Thanks.

> Maybe we can extract a common inline helper to avoid the copy-paste duplication.

Kairui is planning further optimizations here (including using a helper) 
[1], so it might be better to leave that for his series.

For this patch, I intend it to be a standalone fix, and I will add the 
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation") tag.

[1] 
https://lore.kernel.org/all/20260318-mglru-reclaim-v1-7-2c46f9eb0508@tencent.com/

>> ---
>>   mm/vmscan.c | 13 ++++++++++++-
>>   1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 33287ba4a500..a9648269fae8 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -5036,9 +5036,20 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>           * If too many file cache in the coldest generation can't be evicted
>>           * due to being dirty, wake up the flusher.
>>           */
>> -       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
>> +       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
>> +               struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> +
>>                  wakeup_flusher_threads(WB_REASON_VMSCAN);
>>
>> +               /*
>> +                * For cgroupv1 dirty throttling is achieved by waking up
>> +                * the kernel flusher here and later waiting on folios
>> +                * which are in writeback to finish (see shrink_folio_list()).
>> +                */
>> +               if (!writeback_throttling_sane(sc))
>> +                       reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>> +       }
>> +
>>          /* whether this lruvec should be rotated */
>>          return nr_to_scan < 0;
>>   }
>> --
>> 2.47.3
>>



Thread overview: 8+ messages
2026-03-25 11:50 [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU Baolin Wang
2026-03-25 11:55 ` Baolin Wang
2026-03-25 12:07 ` Kairui Song
2026-03-25 13:20   ` Baolin Wang
2026-03-25 13:35     ` Kairui Song
2026-03-26  1:57       ` Baolin Wang
2026-03-26  5:04 ` Barry Song
2026-03-26  8:41   ` Baolin Wang
