Re: [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU

public inbox for linux-mm@kvack.org

From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, david@kernel.org,
	mhocko@kernel.org, zhengqi.arch@bytedance.com,
	shakeel.butt@linux.dev, axelrasmussen@google.com,
	yuanchu@google.com, weixugc@google.com, baohua@kernel.org,
	kasong@tencent.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	"Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
Subject: Re: [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
Date: Wed, 25 Mar 2026 21:20:55 +0800
Message-ID: <f3d680da-7480-4d05-ac44-e669e0914a32@linux.alibaba.com>
In-Reply-To: <acPOn07xah2eh0WU@KASONG-MC4>

Hi Kairui,

On 3/25/26 8:07 PM, Kairui Song wrote:
> On Wed, Mar 25, 2026 at 07:50:40PM +0800, Baolin Wang wrote:
>> balance_dirty_pages() does not throttle dirty folios on cgroup v1.
>> See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
>> on traditional hierarchies").
>>
>> Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
>> longer attempt to write back filesystem folios through reclaim.
>>
>> On large memory systems, the flusher may not be able to write back quickly
>> enough. Consequently, MGLRU will encounter many folios that are already
>> under writeback. Since we cannot reclaim these dirty folios, the system
>> may run out of memory and trigger the OOM killer.
>>
>> Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
>> which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
>> pages throttling on cgroup v1"), to avoid unnecessary OOM.
>>
>> The following test program can easily reproduce the OOM issue. With this patch
>> applied, the test passes successfully.
>>
>> $ mkdir /sys/fs/cgroup/memory/test
>> $ echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>> $ echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
>> $ dd if=/dev/zero of=/mnt/data.bin bs=1M count=800
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   mm/vmscan.c | 13 ++++++++++++-
>>   1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 33287ba4a500..a9648269fae8 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -5036,9 +5036,20 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>   	 * If too many file cache in the coldest generation can't be evicted
>>   	 * due to being dirty, wake up the flusher.
>>   	 */
>> -	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
>> +	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
>> +		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> +
>>   		wakeup_flusher_threads(WB_REASON_VMSCAN);
>>   
>> +		/*
>> +		 * For cgroupv1 dirty throttling is achieved by waking up
>> +		 * the kernel flusher here and later waiting on folios
>> +		 * which are in writeback to finish (see shrink_folio_list()).
>> +		 */
>> +		if (!writeback_throttling_sane(sc))
>> +			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>> +	}
>> +
>>   	/* whether this lruvec should be rotated */
>>   	return nr_to_scan < 0;
>>   }
> 
> Hi Baolin
> 
> Interesting, I want to fix this too, after or together with:
> https://lore.kernel.org/linux-mm/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com/

Thanks for taking a look.

> 
> With the fix you posted, MGLRU's dirty throttling is still
> a bit different from the active/inactive LRU. In fact, MGLRU
> treats dirty folios quite differently, which causes many other
> issues too. For example, dirty folios are much more likely to get
> stuck at the tail with MGLRU, so simply applying the throttling
> could make it too aggressive. Or the batch may be too large to
> trigger the throttling.

Thanks for sharing this.

> So I'm planning to add the patch below to V2 of that series (this
> was also suggested by Ridong). What do you think? There are several
> other throttling issues to fix too, beyond just the
> V1 support. I can add your Suggested-by too.

But I still think this fix deserves its own commit, because it fixes a
real issue that I ran into. Even if the throttling isn't perfect for
cgroup v1, it aligns with the legacy LRU behavior and is needed first to
avoid premature OOMs. Improvements to MGLRU's dirty folio handling can
be done as a separate optimization in your series.
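
For reference, the decision this patch adds can be modeled in userspace
roughly as below. This is only a sketch under stated assumptions:
struct scan_model, its throttling_sane flag, enum reclaim_action and
decide() are hypothetical stand-ins for sc->nr.*,
writeback_throttling_sane(sc) and the new branch in
try_to_shrink_lruvec(). It calls no kernel API and is not the patch
itself.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the sc->nr counters the patch looks at. */
struct scan_model {
	unsigned long unqueued_dirty;	/* dirty folios not yet queued for IO */
	unsigned long file_taken;	/* file folios isolated in this pass */
	bool throttling_sane;		/* models writeback_throttling_sane(sc) */
};

/* Hypothetical outcome labels; the kernel has no such enum. */
enum reclaim_action {
	ACTION_NONE,		/* condition not met, do nothing */
	ACTION_WAKE,		/* wake the flusher only */
	ACTION_WAKE_THROTTLE,	/* wake the flusher, then throttle reclaim */
};

static enum reclaim_action decide(const struct scan_model *m)
{
	/* Same test as the patch: everything taken was unqueued dirty. */
	if (!m->unqueued_dirty || m->unqueued_dirty != m->file_taken)
		return ACTION_NONE;

	/* Throttle only where writeback throttling is not "sane" (cgroup v1). */
	return m->throttling_sane ? ACTION_WAKE : ACTION_WAKE_THROTTLE;
}
```

With unqueued_dirty == file_taken == 32 and throttling_sane == false,
decide() returns ACTION_WAKE_THROTTLE, which corresponds to the cgroup v1
case in the reproducer. The same counters with throttling_sane == true
give ACTION_WAKE, i.e. the behavior before this patch.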

Anyway, let's also wait for more feedback from others.

> commit e9fc6fe9c1236f7f70eeb45d9c47c56125d14013
> Author: Kairui Song <kasong@tencent.com>
> Date:   Tue Mar 24 19:45:26 2026 +0800
> 
>      mm/vmscan: unify writeback reclaim statistic and throttling
>      
>      Currently MGLRU and non-MGLRU handle the reclaim statistics and
>      writeback, especially throttling, differently. For MGLRU the
>      throttling part is basically ignored.
>      
>      Let's just unify this part so both setups have the same behavior.
>      
>      Signed-off-by: Kairui Song <kasong@tencent.com>
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bdf611544880..fcb91a644277 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1943,6 +1943,44 @@ static int current_may_throttle(void)
>   	return !(current->flags & PF_LOCAL_THROTTLE);
>   }
>   
> +static void handle_reclaim_writeback(unsigned long nr_taken,
> +				     struct pglist_data *pgdat,
> +				     struct scan_control *sc,
> +				     struct reclaim_stat *stat)
> +{
> +	/*
> +	 * If dirty folios are scanned that are not queued for IO, it
> +	 * implies that flushers are not doing their job. This can
> +	 * happen when memory pressure pushes dirty folios to the end of
> +	 * the LRU before the dirty limits are breached and the dirty
> +	 * data has expired. It can also happen when the proportion of
> +	 * dirty folios grows not through writes but through memory
> +	 * pressure reclaiming all the clean cache. And in some cases,
> +	 * the flushers simply cannot keep up with the allocation
> +	 * rate. Nudge the flusher threads in case they are asleep.
> +	 */
> +	if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
> +		wakeup_flusher_threads(WB_REASON_VMSCAN);
> +		/*
> +		 * For cgroupv1 dirty throttling is achieved by waking up
> +		 * the kernel flusher here and later waiting on folios
> +		 * which are in writeback to finish (see shrink_folio_list()).
> +		 *
> +		 * Flusher may not be able to issue writeback quickly
> +		 * enough for cgroupv1 writeback throttling to work
> +		 * on a large system.
> +		 */
> +		if (!writeback_throttling_sane(sc))
> +			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> +	}
> +
> +	sc->nr.dirty += stat->nr_dirty;
> +	sc->nr.congested += stat->nr_congested;
> +	sc->nr.writeback += stat->nr_writeback;
> +	sc->nr.immediate += stat->nr_immediate;
> +	sc->nr.taken += nr_taken;
> +}
> +
>   /*
>    * shrink_inactive_list() is a helper for shrink_node().  It returns the number
>    * of reclaimed pages
> @@ -2006,39 +2044,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>   	lruvec_lock_irq(lruvec);
>   	lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
>   					nr_scanned - nr_reclaimed);
> -
> -	/*
> -	 * If dirty folios are scanned that are not queued for IO, it
> -	 * implies that flushers are not doing their job. This can
> -	 * happen when memory pressure pushes dirty folios to the end of
> -	 * the LRU before the dirty limits are breached and the dirty
> -	 * data has expired. It can also happen when the proportion of
> -	 * dirty folios grows not through writes but through memory
> -	 * pressure reclaiming all the clean cache. And in some cases,
> -	 * the flushers simply cannot keep up with the allocation
> -	 * rate. Nudge the flusher threads in case they are asleep.
> -	 */
> -	if (stat.nr_unqueued_dirty == nr_taken) {
> -		wakeup_flusher_threads(WB_REASON_VMSCAN);
> -		/*
> -		 * For cgroupv1 dirty throttling is achieved by waking up
> -		 * the kernel flusher here and later waiting on folios
> -		 * which are in writeback to finish (see shrink_folio_list()).
> -		 *
> -		 * Flusher may not be able to issue writeback quickly
> -		 * enough for cgroupv1 writeback throttling to work
> -		 * on a large system.
> -		 */
> -		if (!writeback_throttling_sane(sc))
> -			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> -	}
> -
> -	sc->nr.dirty += stat.nr_dirty;
> -	sc->nr.congested += stat.nr_congested;
> -	sc->nr.writeback += stat.nr_writeback;
> -	sc->nr.immediate += stat.nr_immediate;
> -	sc->nr.taken += nr_taken;
> -
> +	handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
>   	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>   			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
>   	return nr_reclaimed;
> @@ -4848,17 +4854,11 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>   retry:
>   	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
>   	sc->nr_reclaimed += reclaimed;
> +	handle_reclaim_writeback(isolated, pgdat, sc, &stat);
>   	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>   			type_scanned, reclaimed, &stat, sc->priority,
>   			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>   
> -	/*
> -	 * If too many file cache in the coldest generation can't be evicted
> -	 * due to being dirty, wake up the flusher.
> -	 */
> -	if (stat.nr_unqueued_dirty == isolated)
> -		wakeup_flusher_threads(WB_REASON_VMSCAN);
> -
>   	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
>   		DEFINE_MIN_SEQ(lruvec);
>   
> @@ -4901,6 +4901,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>   
>   	if (!list_empty(&list)) {
>   		skip_retry = true;
> +		isolated = 0;
>   		goto retry;
>   	}
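
As far as I can tell from the quoted hunks, resetting isolated before
"goto retry" keeps the retry pass from counting the taken folios twice
or waking the flusher again. A minimal userspace sketch of that reading
follows. struct stat_model, struct nr_model, handle_model() and
run_with_retry() are hypothetical names that only mirror the shape of
handle_reclaim_writeback(); the throttling call is left out.

```c
/* Hypothetical stand-ins for struct reclaim_stat and sc->nr. */
struct stat_model {
	unsigned long nr_unqueued_dirty;
	unsigned long nr_dirty;
};

struct nr_model {
	unsigned long dirty;
	unsigned long taken;
	unsigned int wakeups;	/* counts modeled flusher wakeups */
};

/* Mirrors the condition and accounting of the quoted helper. */
static void handle_model(unsigned long nr_taken, struct nr_model *nr,
			 const struct stat_model *stat)
{
	if (stat->nr_unqueued_dirty == nr_taken && nr_taken)
		nr->wakeups++;	/* stands in for wakeup_flusher_threads() */

	nr->dirty += stat->nr_dirty;
	nr->taken += nr_taken;
}

/* One pass with 'isolated' folios, then one retry pass with isolated = 0. */
static struct nr_model run_with_retry(unsigned long isolated,
				      const struct stat_model *first,
				      const struct stat_model *retry)
{
	struct nr_model nr = { 0, 0, 0 };

	handle_model(isolated, &nr, first);
	isolated = 0;		/* as in the quoted hunk, before "goto retry" */
	handle_model(isolated, &nr, retry);
	return nr;
}
```

In this model the "&& nr_taken" part of the condition is what stops the
retry pass (nr_taken == 0, nr_unqueued_dirty == 0) from registering a
second wakeup, and taken is only accumulated once.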


Thread overview: 8+ messages
2026-03-25 11:50 [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU Baolin Wang
2026-03-25 11:55 ` Baolin Wang
2026-03-25 12:07 ` Kairui Song
2026-03-25 13:20   ` Baolin Wang [this message]
2026-03-25 13:35     ` Kairui Song
2026-03-26  1:57       ` Baolin Wang
2026-03-26  5:04 ` Barry Song
2026-03-26  8:41   ` Baolin Wang
