From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, david@kernel.org,
mhocko@kernel.org, zhengqi.arch@bytedance.com,
shakeel.butt@linux.dev, axelrasmussen@google.com,
yuanchu@google.com, weixugc@google.com, baohua@kernel.org,
kasong@tencent.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org,
"Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
Subject: Re: [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU
Date: Wed, 25 Mar 2026 21:20:55 +0800 [thread overview]
Message-ID: <f3d680da-7480-4d05-ac44-e669e0914a32@linux.alibaba.com> (raw)
In-Reply-To: <acPOn07xah2eh0WU@KASONG-MC4>
Hi Kairui,
On 3/25/26 8:07 PM, Kairui Song wrote:
> On Wed, Mar 25, 2026 at 07:50:40PM +0800, Baolin Wang wrote:
>> The balance_dirty_pages() won't do the dirty folios throttling on cgroupv1.
>> See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
>> on traditional hierarchies").
>>
>> Moreover, after commit 6b0dfabb3555 ("fs: Remove aops->writepage"), we no
>> longer attempt to write back filesystem folios through reclaim.
>>
>> On large memory systems, the flusher may not be able to write back quickly
>> enough. Consequently, MGLRU will encounter many folios that are already
>> under writeback. Since we cannot reclaim these dirty folios, the system
>> may run out of memory and trigger the OOM killer.
>>
>> Hence, for cgroup v1, let's throttle reclaim after waking up the flusher,
>> which is similar to commit 81a70c21d917 ("mm/cgroup/reclaim: fix dirty
>> pages throttling on cgroup v1"), to avoid unnecessary OOM.
>>
>> The following test program can easily reproduce the OOM issue. With this patch
>> applied, the test passes successfully.
>>
>> $mkdir /sys/fs/cgroup/memory/test
>> $echo 256M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>> $echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
>> $dd if=/dev/zero of=/mnt/data.bin bs=1M count=800
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> mm/vmscan.c | 13 ++++++++++++-
>> 1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 33287ba4a500..a9648269fae8 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -5036,9 +5036,20 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>> * If too many file cache in the coldest generation can't be evicted
>> * due to being dirty, wake up the flusher.
>> */
>> - if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
>> + if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
>> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> +
>> wakeup_flusher_threads(WB_REASON_VMSCAN);
>>
>> + /*
>> + * For cgroupv1 dirty throttling is achieved by waking up
>> + * the kernel flusher here and later waiting on folios
>> + * which are in writeback to finish (see shrink_folio_list()).
>> + */
>> + if (!writeback_throttling_sane(sc))
>> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>> + }
>> +
>> /* whether this lruvec should be rotated */
>> return nr_to_scan < 0;
>> }
>
> Hi Baolin
>
> Interesting I want to fix this too, after or with:
> https://lore.kernel.org/linux-mm/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com/
Thanks for taking a look.
>
> With current fix you posted, MGLRU's dirty throttling is still
> a bit different from active / inactive LRU. In fact MGLRU
> treat dirty folios quite differently causing many other issues too,
> e.g. it's much more likely for dirty folios to stuck at the tail
> for MGLRU so simply apply the throttling could cause too
> aggressive throttling. Or batch is too large to trigger the
> throttling.
Thanks for sharing this.
> So I'm planning to add below patch to V2 of that series (also this
> is suggested by Ridong), how do you think? There are several
> other throttling things to be fixed too, more than just the
> V1 support. I can have your suggested-by too.
But I still think this fix deserves its own commit, because this is
indeed fixing a real issue that I ran into. Even if the throttling isn't
perfect for cgroup v1, it aligns with the legacy-LRU behavior and is
essential to avoid premature OOMs firstly. MGLRU dirty folio handling
improvement can be done as a separate optimization in your series.
Anyway, let's also wait for more feedback from others.
> commit e9fc6fe9c1236f7f70eeb45d9c47c56125d14013
> Author: Kairui Song <kasong@tencent.com>
> Date: Tue Mar 24 19:45:26 2026 +0800
>
> mm/vmscan: unify writeback reclaim statistic and throttling
>
> Currently MGLRU and non-MGLRU handles the reclaim statistic and
> writeback handling, especially throttling differently. For MGLRU the
> throttling part is basically ignore.
>
> Let just unify this part so both setup will have the same behavior.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bdf611544880..fcb91a644277 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1943,6 +1943,44 @@ static int current_may_throttle(void)
> return !(current->flags & PF_LOCAL_THROTTLE);
> }
>
> +static void handle_reclaim_writeback(unsigned long nr_taken,
> + struct pglist_data *pgdat,
> + struct scan_control *sc,
> + struct reclaim_stat *stat)
> +{
> + /*
> + * If dirty folios are scanned that are not queued for IO, it
> + * implies that flushers are not doing their job. This can
> + * happen when memory pressure pushes dirty folios to the end of
> + * the LRU before the dirty limits are breached and the dirty
> + * data has expired. It can also happen when the proportion of
> + * dirty folios grows not through writes but through memory
> + * pressure reclaiming all the clean cache. And in some cases,
> + * the flushers simply cannot keep up with the allocation
> + * rate. Nudge the flusher threads in case they are asleep.
> + */
> + if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> + /*
> + * For cgroupv1 dirty throttling is achieved by waking up
> + * the kernel flusher here and later waiting on folios
> + * which are in writeback to finish (see shrink_folio_list()).
> + *
> + * Flusher may not be able to issue writeback quickly
> + * enough for cgroupv1 writeback throttling to work
> + * on a large system.
> + */
> + if (!writeback_throttling_sane(sc))
> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> + }
> +
> + sc->nr.dirty += stat->nr_dirty;
> + sc->nr.congested += stat->nr_congested;
> + sc->nr.writeback += stat->nr_writeback;
> + sc->nr.immediate += stat->nr_immediate;
> + sc->nr.taken += nr_taken;
> +}
> +
> /*
> * shrink_inactive_list() is a helper for shrink_node(). It returns the number
> * of reclaimed pages
> @@ -2006,39 +2044,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> lruvec_lock_irq(lruvec);
> lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
> nr_scanned - nr_reclaimed);
> -
> - /*
> - * If dirty folios are scanned that are not queued for IO, it
> - * implies that flushers are not doing their job. This can
> - * happen when memory pressure pushes dirty folios to the end of
> - * the LRU before the dirty limits are breached and the dirty
> - * data has expired. It can also happen when the proportion of
> - * dirty folios grows not through writes but through memory
> - * pressure reclaiming all the clean cache. And in some cases,
> - * the flushers simply cannot keep up with the allocation
> - * rate. Nudge the flusher threads in case they are asleep.
> - */
> - if (stat.nr_unqueued_dirty == nr_taken) {
> - wakeup_flusher_threads(WB_REASON_VMSCAN);
> - /*
> - * For cgroupv1 dirty throttling is achieved by waking up
> - * the kernel flusher here and later waiting on folios
> - * which are in writeback to finish (see shrink_folio_list()).
> - *
> - * Flusher may not be able to issue writeback quickly
> - * enough for cgroupv1 writeback throttling to work
> - * on a large system.
> - */
> - if (!writeback_throttling_sane(sc))
> - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> - }
> -
> - sc->nr.dirty += stat.nr_dirty;
> - sc->nr.congested += stat.nr_congested;
> - sc->nr.writeback += stat.nr_writeback;
> - sc->nr.immediate += stat.nr_immediate;
> - sc->nr.taken += nr_taken;
> -
> + handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> nr_scanned, nr_reclaimed, &stat, sc->priority, file);
> return nr_reclaimed;
> @@ -4848,17 +4854,11 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> retry:
> reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
> sc->nr_reclaimed += reclaimed;
> + handle_reclaim_writeback(isolated, pgdat, sc, &stat);
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> type_scanned, reclaimed, &stat, sc->priority,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> - /*
> - * If too many file cache in the coldest generation can't be evicted
> - * due to being dirty, wake up the flusher.
> - */
> - if (stat.nr_unqueued_dirty == isolated)
> - wakeup_flusher_threads(WB_REASON_VMSCAN);
> -
> list_for_each_entry_safe_reverse(folio, next, &list, lru) {
> DEFINE_MIN_SEQ(lruvec);
>
> @@ -4901,6 +4901,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>
> if (!list_empty(&list)) {
> skip_retry = true;
> + isolated = 0;
> goto retry;
> }
next prev parent reply other threads:[~2026-03-25 13:21 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-25 11:50 [RFC PATCH] mm: vmscan: fix dirty folios throttling on cgroup v1 for MGLRU Baolin Wang
2026-03-25 11:55 ` Baolin Wang
2026-03-25 12:07 ` Kairui Song
2026-03-25 13:20 ` Baolin Wang [this message]
2026-03-25 13:35 ` Kairui Song
2026-03-26 1:57 ` Baolin Wang
2026-03-26 5:04 ` Barry Song
2026-03-26 8:41 ` Baolin Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f3d680da-7480-4d05-ac44-e669e0914a32@linux.alibaba.com \
--to=baolin.wang@linux.alibaba.com \
--cc=akpm@linux-foundation.org \
--cc=axelrasmussen@google.com \
--cc=baohua@kernel.org \
--cc=david@kernel.org \
--cc=hannes@cmpxchg.org \
--cc=kasong@tencent.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=mhocko@kernel.org \
--cc=ryncsn@gmail.com \
--cc=shakeel.butt@linux.dev \
--cc=weixugc@google.com \
--cc=yuanchu@google.com \
--cc=zhengqi.arch@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox