From: Matt Fleming <matt@readmodwrite.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>,
Jens Axboe <axboe@kernel.dk>,
Sergey Senozhatsky <senozhatsky@chromium.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Minchan Kim <minchan@kernel.org>,
kernel-team@cloudflare.com,
Matt Fleming <mfleming@cloudflare.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
Barry Song <baohua@kernel.org>,
Vlastimil Babka <vbabka@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Brendan Jackman <jackmanb@google.com>, Zi Yan <ziy@nvidia.com>,
Axel Rasmussen <axelrasmussen@google.com>,
Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
David Hildenbrand <david@kernel.org>,
Qi Zheng <zhengqi.arch@bytedance.com>,
Shakeel Butt <shakeel.butt@linux.dev>,
Lorenzo Stoakes <ljs@kernel.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH] mm: Require LRU reclaim progress before retrying direct reclaim
Date: Fri, 10 Apr 2026 11:15:49 +0100 [thread overview]
Message-ID: <20260410101550.2930139-1-matt@readmodwrite.com> (raw)
From: Matt Fleming <mfleming@cloudflare.com>
should_reclaim_retry() uses zone_reclaimable_pages() to estimate whether
retrying reclaim could eventually satisfy an allocation. It's possible
for reclaim to make minimal or no progress on an LRU type despite having
ample reclaimable pages, e.g. anonymous pages when the only swap is
RAM-backed (zram). This can cause the reclaim path to loop indefinitely.
Track LRU reclaim progress (anon vs file) through a new struct
reclaim_progress passed out of try_to_free_pages(), and only count a
type's reclaimable pages if at least reclaim_progress_pct% was actually
reclaimed in the last cycle.
The threshold is exposed as /proc/sys/vm/reclaim_progress_pct (default
1, range 0-100). Setting 0 disables the gate and restores the previous
behaviour. Environments with only RAM-backed swap (zram) and limited
memory may need a higher value to prevent futile anon LRU churn from
keeping the allocator spinning.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
---
include/linux/swap.h | 13 +++++-
mm/page_alloc.c | 101 +++++++++++++++++++++++++++++++++++--------
mm/vmscan.c | 72 ++++++++++++++++++++++--------
3 files changed, 146 insertions(+), 40 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62fc7499b408..d46477365cd9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -368,9 +368,18 @@ void folio_mark_lazyfree(struct folio *folio);
extern void swap_setup(void);
/* linux/mm/vmscan.c */
+struct reclaim_progress {
+ unsigned long nr_reclaimed;
+ unsigned long nr_anon;
+ unsigned long nr_file;
+};
+
extern unsigned long zone_reclaimable_pages(struct zone *zone);
-extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
- gfp_t gfp_mask, nodemask_t *mask);
+extern unsigned long zone_reclaimable_file_pages(struct zone *zone);
+extern unsigned long zone_reclaimable_anon_pages(struct zone *zone);
+extern void try_to_free_pages(struct zonelist *zonelist, int order,
+ gfp_t gfp_mask, nodemask_t *mask,
+ struct reclaim_progress *progress);
#define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
#define MEMCG_RECLAIM_PROACTIVE (1 << 2)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2d4b6f1a554e..0f2597542ace 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4407,12 +4407,11 @@ static unsigned int check_retry_zonelist(unsigned int seq)
}
/* Perform direct synchronous page reclaim */
-static unsigned long
-__perform_reclaim(gfp_t gfp_mask, unsigned int order,
- const struct alloc_context *ac)
+static void __perform_reclaim(gfp_t gfp_mask, unsigned int order,
+ const struct alloc_context *ac,
+ struct reclaim_progress *progress)
{
unsigned int noreclaim_flag;
- unsigned long progress;
cond_resched();
@@ -4421,30 +4420,27 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
fs_reclaim_acquire(gfp_mask);
noreclaim_flag = memalloc_noreclaim_save();
- progress = try_to_free_pages(ac->zonelist, order, gfp_mask,
- ac->nodemask);
+ try_to_free_pages(ac->zonelist, order, gfp_mask, ac->nodemask, progress);
memalloc_noreclaim_restore(noreclaim_flag);
fs_reclaim_release(gfp_mask);
cond_resched();
-
- return progress;
}
/* The really slow allocator path where we enter direct reclaim */
static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
- unsigned long *did_some_progress)
+ struct reclaim_progress *progress)
{
struct page *page = NULL;
unsigned long pflags;
bool drained = false;
psi_memstall_enter(&pflags);
- *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
- if (unlikely(!(*did_some_progress)))
+ __perform_reclaim(gfp_mask, order, ac, progress);
+ if (unlikely(!progress->nr_reclaimed))
goto out;
retry:
@@ -4586,6 +4582,41 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
return !!__gfp_pfmemalloc_flags(gfp_mask);
}
+/*
+ * Minimum percentage of LRU reclaimable pages that must have been
+ * reclaimed in the last cycle for that type to be counted towards the
+ * "can we satisfy this allocation?" watermark check in
+ * should_reclaim_retry().
+ *
+ * This prevents systems with only RAM-backed swap (zram) from
+ * endlessly retrying reclaim for anon pages when minimal progress is
+ * made despite seemingly having lots of reclaimable pages.
+ *
+ * Setting this to 0 disables the per-LRU progress check: all
+ * reclaimable pages are always counted towards the watermark.
+ */
+static int reclaim_progress_pct __read_mostly = 1;
+
+/*
+ * Return true if reclaim for this LRU type made at least
+ * reclaim_progress_pct% progress in the last cycle, or if the LRU
+ * progress check is disabled.
+ */
+static inline bool reclaim_progress_sufficient(unsigned long reclaimed,
+ unsigned long reclaimable)
+{
+ unsigned long threshold;
+
+ if (!reclaim_progress_pct)
+ return true;
+
+ if (!reclaimable)
+ return false;
+
+ threshold = DIV_ROUND_UP(reclaimable * reclaim_progress_pct, 100);
+ return reclaimed >= threshold;
+}
+
/*
* Checks whether it makes sense to retry the reclaim to make a forward progress
* for the given allocation request.
@@ -4599,11 +4630,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
static inline bool
should_reclaim_retry(gfp_t gfp_mask, unsigned order,
struct alloc_context *ac, int alloc_flags,
- bool did_some_progress, int *no_progress_loops)
+ struct reclaim_progress *progress,
+ int *no_progress_loops)
{
struct zone *zone;
struct zoneref *z;
bool ret = false;
+ bool did_some_progress = progress->nr_reclaimed > 0;
/*
* Costly allocations might have made a progress but this doesn't mean
@@ -4629,6 +4662,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
ac->highest_zoneidx, ac->nodemask) {
unsigned long available;
unsigned long reclaimable;
+ unsigned long reclaimable_anon;
+ unsigned long reclaimable_file;
unsigned long min_wmark = min_wmark_pages(zone);
bool wmark;
@@ -4637,7 +4672,24 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
!__cpuset_zone_allowed(zone, gfp_mask))
continue;
- available = reclaimable = zone_reclaimable_pages(zone);
+ /*
+ * Only count reclaimable pages from an LRU type if reclaim
+ * actually made headway on that type in the last cycle.
+ * This prevents the allocator from looping endlessly on
+ * account of a large pool of pages that reclaim cannot make
+ * progress on, e.g. anonymous pages when the only swap is
+ * RAM-backed (zram).
+ */
+ reclaimable = 0;
+ reclaimable_file = zone_reclaimable_file_pages(zone);
+ reclaimable_anon = zone_reclaimable_anon_pages(zone);
+
+ if (reclaim_progress_sufficient(progress->nr_file, reclaimable_file))
+ reclaimable += reclaimable_file;
+ if (reclaim_progress_sufficient(progress->nr_anon, reclaimable_anon))
+ reclaimable += reclaimable_anon;
+
+ available = reclaimable;
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
/*
@@ -4716,7 +4768,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
struct page *page = NULL;
unsigned int alloc_flags;
- unsigned long did_some_progress;
+ struct reclaim_progress reclaim_progress = {};
+ unsigned long oom_progress;
enum compact_priority compact_priority;
enum compact_result compact_result;
int compaction_retries;
@@ -4727,6 +4780,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
bool compact_first = false;
bool can_retry_reserves = true;
+
if (unlikely(nofail)) {
/*
* Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
@@ -4844,7 +4898,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
/* Try direct reclaim and then allocating */
if (!compact_first) {
page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags,
- ac, &did_some_progress);
+ ac, &reclaim_progress);
if (page)
goto got_pg;
}
@@ -4904,7 +4958,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto restart;
if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
- did_some_progress > 0, &no_progress_loops))
+ &reclaim_progress, &no_progress_loops))
goto retry;
/*
@@ -4913,7 +4967,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* implementation of the compaction depends on the sufficient amount
* of free memory (see __compaction_suitable)
*/
- if (did_some_progress > 0 && can_compact &&
+ if (reclaim_progress.nr_reclaimed > 0 && can_compact &&
should_compact_retry(ac, order, alloc_flags,
compact_result, &compact_priority,
&compaction_retries))
@@ -4934,7 +4988,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto restart;
/* Reclaim has failed us, start killing things */
- page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+ page = __alloc_pages_may_oom(gfp_mask, order, ac, &oom_progress);
if (page)
goto got_pg;
@@ -4945,7 +4999,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto nopage;
/* Retry as long as the OOM killer is making progress */
- if (did_some_progress) {
+ if (oom_progress) {
no_progress_loops = 0;
goto retry;
}
@@ -6775,6 +6829,15 @@ static const struct ctl_table page_alloc_sysctl_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
},
+ {
+ .procname = "reclaim_progress_pct",
+ .data = &reclaim_progress_pct,
+ .maxlen = sizeof(reclaim_progress_pct),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE_HUNDRED,
+ },
{
.procname = "percpu_pagelist_high_fraction",
.data = &percpu_pagelist_high_fraction,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e8251..9087b4e0a704 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -167,6 +167,10 @@ struct scan_control {
/* Number of pages freed so far during a call to shrink_zones() */
unsigned long nr_reclaimed;
+ /* Anon/file LRU contributions to nr_reclaimed */
+ unsigned long nr_reclaimed_anon;
+ unsigned long nr_reclaimed_file;
+
struct {
unsigned int dirty;
unsigned int unqueued_dirty;
@@ -385,6 +389,21 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
return can_demote(nid, sc, memcg);
}
+unsigned long zone_reclaimable_file_pages(struct zone *zone)
+{
+ return zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
+ zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
+}
+
+unsigned long zone_reclaimable_anon_pages(struct zone *zone)
+{
+ if (!can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
+ return 0;
+
+ return zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
+ zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
+}
+
/*
* This misses isolated folios which are not accounted for to save counters.
* As the data only determines if reclaim or compaction continues, it is
@@ -392,15 +411,8 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
*/
unsigned long zone_reclaimable_pages(struct zone *zone)
{
- unsigned long nr;
-
- nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
- zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
- if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
- nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
- zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
-
- return nr;
+ return zone_reclaimable_file_pages(zone) +
+ zone_reclaimable_anon_pages(zone);
}
/**
@@ -4718,6 +4730,10 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr_reclaimed += reclaimed;
+ if (type)
+ sc->nr_reclaimed_file += reclaimed;
+ else
+ sc->nr_reclaimed_anon += reclaimed;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
@@ -5776,6 +5792,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
unsigned long nr_to_scan;
enum lru_list lru;
unsigned long nr_reclaimed = 0;
+ unsigned long nr_reclaimed_anon = 0;
+ unsigned long nr_reclaimed_file = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
bool proportional_reclaim;
struct blk_plug plug;
@@ -5812,11 +5830,18 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
for_each_evictable_lru(lru) {
if (nr[lru]) {
+ unsigned long reclaimed;
+
nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
nr[lru] -= nr_to_scan;
- nr_reclaimed += shrink_list(lru, nr_to_scan,
- lruvec, sc);
+ reclaimed = shrink_list(lru, nr_to_scan,
+ lruvec, sc);
+ nr_reclaimed += reclaimed;
+ if (is_file_lru(lru))
+ nr_reclaimed_file += reclaimed;
+ else
+ nr_reclaimed_anon += reclaimed;
}
}
@@ -5876,6 +5901,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
}
blk_finish_plug(&plug);
sc->nr_reclaimed += nr_reclaimed;
+ sc->nr_reclaimed_anon += nr_reclaimed_anon;
+ sc->nr_reclaimed_file += nr_reclaimed_file;
/*
* Even if we did not try to evict anon pages at all, we want to
@@ -6563,8 +6590,9 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
return false;
}
-unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
- gfp_t gfp_mask, nodemask_t *nodemask)
+void try_to_free_pages(struct zonelist *zonelist, int order,
+ gfp_t gfp_mask, nodemask_t *nodemask,
+ struct reclaim_progress *progress)
{
unsigned long nr_reclaimed;
struct scan_control sc = {
@@ -6588,12 +6616,14 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
BUILD_BUG_ON(MAX_NR_ZONES > S8_MAX);
/*
- * Do not enter reclaim if fatal signal was delivered while throttled.
- * 1 is returned so that the page allocator does not OOM kill at this
- * point.
+ * Do not enter reclaim if fatal signal was delivered while
+ * throttled. nr_reclaimed is set to 1 so that the page
+ * allocator does not OOM kill at this point.
*/
- if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
- return 1;
+ if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask)) {
+ nr_reclaimed = 1;
+ goto out;
+ }
set_task_reclaim_state(current, &sc.reclaim_state);
trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
@@ -6603,7 +6633,11 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
set_task_reclaim_state(current, NULL);
- return nr_reclaimed;
+ progress->nr_anon = sc.nr_reclaimed_anon;
+ progress->nr_file = sc.nr_reclaimed_file;
+
+out:
+ progress->nr_reclaimed = nr_reclaimed;
}
#ifdef CONFIG_MEMCG
--
2.43.0
next reply other threads:[~2026-04-10 10:15 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-10 10:15 Matt Fleming [this message]
-- strict thread matches above, loose matches on Subject: below --
2026-03-03 11:53 [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap Matt Fleming
2026-04-10 9:41 ` [PATCH] mm: Require LRU reclaim progress before retrying direct reclaim Matt Fleming
2026-04-10 10:13 ` Matt Fleming
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260410101550.2930139-1-matt@readmodwrite.com \
--to=matt@readmodwrite.com \
--cc=akpm@linux-foundation.org \
--cc=axboe@kernel.dk \
--cc=axelrasmussen@google.com \
--cc=baohua@kernel.org \
--cc=bhe@redhat.com \
--cc=chrisl@kernel.org \
--cc=david@kernel.org \
--cc=hannes@cmpxchg.org \
--cc=hch@infradead.org \
--cc=jackmanb@google.com \
--cc=kasong@tencent.com \
--cc=kernel-team@cloudflare.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=mfleming@cloudflare.com \
--cc=mhocko@suse.com \
--cc=minchan@kernel.org \
--cc=nphamcs@gmail.com \
--cc=roman.gushchin@linux.dev \
--cc=senozhatsky@chromium.org \
--cc=shakeel.butt@linux.dev \
--cc=shikemeng@huaweicloud.com \
--cc=surenb@google.com \
--cc=vbabka@kernel.org \
--cc=weixugc@google.com \
--cc=yuanchu@google.com \
--cc=zhengqi.arch@bytedance.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox