* [to-be-updated] mm-vmscan-mitigate-spurious-kswapd_failures-reset-from-direct-reclaim.patch removed from -mm tree
@ 2025-12-23 2:33 Andrew Morton
From: Andrew Morton @ 2025-12-23 2:33 UTC
To: mm-commits, zhengqi.arch, yuanchu, weixugc, shakeel.butt, mhocko,
lorenzo.stoakes, hannes, david, axelrasmussen, jiayuan.chen, akpm
The quilt patch titled
Subject: mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
has been removed from the -mm tree. Its filename was
mm-vmscan-mitigate-spurious-kswapd_failures-reset-from-direct-reclaim.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: Jiayuan Chen <jiayuan.chen@shopee.com>
Subject: mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Date: Mon, 22 Dec 2025 20:20:21 +0800
When kswapd fails to reclaim memory, kswapd_failures is incremented. Once
it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid futile
reclaim attempts. However, any successful direct reclaim unconditionally
resets kswapd_failures to 0, which can cause problems.
We observed an issue in production on a multi-NUMA system where a process
allocated large amounts of anonymous pages on a single NUMA node, pushing
that node's free memory below the high watermark and evicting most of its
file pages:
$ numastat -m
Per-node system memory usage (in MBs):
                         Node 0          Node 1           Total
                --------------- --------------- ---------------
MemTotal              128222.19       127983.91       256206.11
MemFree                 1414.48         1432.80         2847.29
MemUsed               126807.71       126551.11       252358.82
SwapCached                 0.00            0.00            0.00
Active                 29017.91        25554.57        54572.48
Inactive               92749.06        95377.00       188126.06
Active(anon)           28998.96        23356.47        52355.43
Inactive(anon)         92685.27        87466.11       180151.39
Active(file)              18.95         2198.10         2217.05
Inactive(file)            63.79         7910.89         7974.68
With swap disabled, only file pages can be reclaimed. When kswapd is
woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
raise free memory above the high watermark since reclaimable file pages
are insufficient. Normally, kswapd would eventually stop after
kswapd_failures reaches MAX_RECLAIM_RETRIES.
However, pods on this machine have memory.high set in their cgroup.
Business processes continuously trigger the high limit, causing frequent
direct reclaim that keeps resetting kswapd_failures to 0. This prevents
kswapd from ever stopping.
The result is that kswapd runs endlessly, repeatedly evicting the few
remaining file pages, which are actually hot. These pages constantly
refault, generating sustained heavy read I/O pressure.
This is a multi-NUMA system where the memory pressure is not global but
node-local. The key observation is:
- Node 0: Under memory pressure, most memory is anonymous (unreclaimable
  without swap)
- Node 1: Has plenty of reclaimable memory (~60GB file cache out of 125GB
  total)
- Node 0's kswapd runs continuously but cannot reclaim anything
- Direct reclaim succeeds by reclaiming from Node 1
- Direct reclaim resets kswapd_failures, preventing Node 0's kswapd from
  stopping
- The few file pages on Node 0 are hot and keep refaulting, causing heavy
  I/O
From a per-node perspective, Node 0 is truly out of reclaimable memory
and its kswapd should stop. But the global direct reclaim success
(from Node 1) incorrectly keeps Node 0's kswapd alive.
Fix this by only resetting kswapd_failures from direct reclaim when the
node is actually balanced. This prevents direct reclaim from keeping
kswapd alive when the node cannot be balanced through reclaim alone.
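As a rough userspace illustration of the behaviour described above (not
kernel code), the toy model below counts how many failing kswapd passes run
before kswapd gives up, under the old unconditional reset and under a reset
gated on the node being balanced. Every toy_* name, the policy enum, and the
retry limit of 16 are assumptions made for this sketch; the real logic lives
in mm/vmscan.c and depends on pgdat_balanced().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed limit for the sketch; the kernel has its own MAX_RECLAIM_RETRIES. */
#define TOY_MAX_RECLAIM_RETRIES 16

enum toy_reset_policy {
	TOY_RESET_UNCONDITIONAL,	/* behaviour before the patch */
	TOY_RESET_IF_BALANCED,		/* behaviour proposed by the patch */
};

struct toy_node {
	int kswapd_failures;
	bool balanced;			/* stand-in for pgdat_balanced() */
};

/* Model one successful direct reclaim pass against the node. */
static void toy_direct_reclaim_success(struct toy_node *node,
				       enum toy_reset_policy policy)
{
	if (policy == TOY_RESET_UNCONDITIONAL || node->balanced)
		node->kswapd_failures = 0;
}

/*
 * Run up to 'rounds' kswapd passes that all fail; a direct reclaim succeeds
 * after every 'direct_every' passes. Returns how many kswapd passes ran
 * before kswapd gave up, or 'rounds' if it never gave up.
 */
static int toy_simulate(enum toy_reset_policy policy, bool balanced,
			int rounds, int direct_every)
{
	struct toy_node node = { .kswapd_failures = 0, .balanced = balanced };
	int pass;

	assert(rounds >= 0 && direct_every > 0);

	for (pass = 0; pass < rounds; pass++) {
		if (node.kswapd_failures >= TOY_MAX_RECLAIM_RETRIES)
			return pass;		/* kswapd stops here */
		node.kswapd_failures++;		/* this pass made no progress */
		if ((pass + 1) % direct_every == 0)
			toy_direct_reclaim_success(&node, policy);
	}
	return rounds;
}

#ifdef TOY_STANDALONE
int main(void)
{
	printf("unconditional: %d passes\n",
	       toy_simulate(TOY_RESET_UNCONDITIONAL, false, 1000, 4));
	printf("gated:         %d passes\n",
	       toy_simulate(TOY_RESET_IF_BALANCED, false, 1000, 4));
	return 0;
}
#endif
```

Built with -DTOY_STANDALONE, the unconditional policy runs all 1000 passes on
an unbalanced node, while the gated policy stops after 16. The model
deliberately ignores everything else kswapd does; it only shows why the
counter can never reach the limit when an unrelated reclaimer keeps clearing
it.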
Link: https://lkml.kernel.org/r/20251222122022.254268-1-jiayuan.chen@linux.dev
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/vmscan.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
--- a/mm/vmscan.c~mm-vmscan-mitigate-spurious-kswapd_failures-reset-from-direct-reclaim
+++ a/mm/vmscan.c
@@ -2648,6 +2648,15 @@ static bool can_age_anon_pages(struct lr
 			  lruvec_memcg(lruvec));
 }
 
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
+static inline void reset_kswapd_failures(struct pglist_data *pgdat,
+					 struct scan_control *sc)
+{
+	if (!current_is_kswapd() &&
+	    pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
+		atomic_set(&pgdat->kswapd_failures, 0);
+}
+
 #ifdef CONFIG_LRU_GEN
 
 #ifdef CONFIG_LRU_GEN_ENABLED
@@ -5065,7 +5074,7 @@ static void lru_gen_shrink_node(struct p
 	blk_finish_plug(&plug);
 done:
 	if (sc->nr_reclaimed > reclaimed)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		reset_kswapd_failures(pgdat, sc);
 }
 
 /******************************************************************************
@@ -6139,7 +6148,7 @@ again:
 	 * successful direct reclaim run will revive a dormant kswapd.
 	 */
 	if (reclaimable)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		reset_kswapd_failures(pgdat, sc);
 	else if (sc->cache_trim_mode)
 		sc->cache_trim_mode_failed = 1;
 }
_
Patches currently in -mm which might be from jiayuan.chen@shopee.com are
* [to-be-updated] mm-vmscan-mitigate-spurious-kswapd_failures-reset-from-direct-reclaim.patch removed from -mm tree
@ 2026-01-15 23:31 Andrew Morton
From: Andrew Morton @ 2026-01-15 23:31 UTC
To: mm-commits, zhengqi.arch, yuanchu, weixugc, shakeel.butt, mhocko,
lorenzo.stoakes, hannes, david, axelrasmussen, jiayuan.chen, akpm
The quilt patch titled
Subject: mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
has been removed from the -mm tree. Its filename was
mm-vmscan-mitigate-spurious-kswapd_failures-reset-from-direct-reclaim.patch
This patch was dropped because an updated version will be issued
------------------------------------------------------
From: Jiayuan Chen <jiayuan.chen@shopee.com>
Subject: mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
Date: Fri, 26 Dec 2025 16:00:42 +0800
When kswapd fails to reclaim memory, kswapd_failures is incremented. Once
it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid futile
reclaim attempts. However, any successful direct reclaim unconditionally
resets kswapd_failures to 0, which can cause problems.
We observed an issue in production on a multi-NUMA system where a process
allocated large amounts of anonymous pages on a single NUMA node, pushing
that node's free memory below the high watermark and evicting most of its
file pages:
$ numastat -m
Per-node system memory usage (in MBs):
                         Node 0          Node 1           Total
                --------------- --------------- ---------------
MemTotal              128222.19       127983.91       256206.11
MemFree                 1414.48         1432.80         2847.29
MemUsed               126807.71       126551.11       252358.82
SwapCached                 0.00            0.00            0.00
Active                 29017.91        25554.57        54572.48
Inactive               92749.06        95377.00       188126.06
Active(anon)           28998.96        23356.47        52355.43
Inactive(anon)         92685.27        87466.11       180151.39
Active(file)              18.95         2198.10         2217.05
Inactive(file)            63.79         7910.89         7974.68
With swap disabled, only file pages can be reclaimed. When kswapd is
woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
raise free memory above the high watermark since reclaimable file pages
are insufficient. Normally, kswapd would eventually stop after
kswapd_failures reaches MAX_RECLAIM_RETRIES.
However, containers on this machine have memory.high set in their cgroup.
Business processes continuously trigger the high limit, causing frequent
direct reclaim that keeps resetting kswapd_failures to 0. This prevents
kswapd from ever stopping.
The key insight is that direct reclaim triggered by cgroup memory.high
performs aggressive scanning to throttle the allocating process. With
sufficiently aggressive scanning, even hot pages will eventually be
reclaimed, making direct reclaim "successful" at freeing some memory.
However, this success does not mean the node has reached a balanced
state: the freed memory may still be insufficient to bring free pages
above the high watermark. Unconditionally resetting kswapd_failures in
this case keeps kswapd alive indefinitely.
The result is that kswapd runs endlessly. Unlike direct reclaim, which
only reclaims from the allocating cgroup, kswapd scans the entire node's
memory. This causes hot file pages from all workloads on the node to be
evicted, not just those from the cgroup triggering memory.high. These
pages constantly refault, generating sustained heavy read I/O pressure
across the entire system.
Fix this by only resetting kswapd_failures when the node is actually
balanced. This allows both kswapd and direct reclaim to clear
kswapd_failures upon successful reclaim, but only when the reclaim
actually resolves the memory pressure (i.e., the node becomes balanced).
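To make the difference from the earlier posting in this thread concrete, the
sketch below writes the two reset conditions as plain boolean predicates.
This is a userspace illustration only: the toy_* names are hypothetical, and
the two booleans stand in for current_is_kswapd() and pgdat_balanced() as
used in the respective diffs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Earlier revision: only a non-kswapd (direct) reclaimer may reset. */
static bool toy_should_reset_v1(bool is_kswapd, bool node_balanced)
{
	return !is_kswapd && node_balanced;
}

/* This revision: any reclaimer may reset, once the node is balanced. */
static bool toy_should_reset_v2(bool is_kswapd, bool node_balanced)
{
	(void)is_kswapd;	/* caller context no longer matters */
	return node_balanced;
}

/* Count the (is_kswapd, node_balanced) cases where the revisions differ. */
static int toy_count_differences(void)
{
	int is_kswapd, balanced, differences = 0;

	for (is_kswapd = 0; is_kswapd <= 1; is_kswapd++)
		for (balanced = 0; balanced <= 1; balanced++)
			if (toy_should_reset_v1(is_kswapd, balanced) !=
			    toy_should_reset_v2(is_kswapd, balanced))
				differences++;
	return differences;
}

#ifdef TOY_STANDALONE
int main(void)
{
	printf("cases that differ: %d\n", toy_count_differences());
	return 0;
}
#endif
```

The two predicates disagree in exactly one case: kswapd itself reclaiming on
a node that has become balanced. The earlier revision would leave the counter
untouched there, whereas this one clears it.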
Link: https://lkml.kernel.org/r/20251226080042.291657-1-jiayuan.chen@linux.dev
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/vmscan.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
--- a/mm/vmscan.c~mm-vmscan-mitigate-spurious-kswapd_failures-reset-from-direct-reclaim
+++ a/mm/vmscan.c
@@ -2648,6 +2648,20 @@ static bool can_age_anon_pages(struct lr
 			  lruvec_memcg(lruvec));
 }
 
+/*
+ * Reset kswapd_failures only when the node is balanced. Without this
+ * check, successful direct reclaim (e.g., from cgroup memory.high
+ * throttling) can keep resetting kswapd_failures even when the node
+ * cannot be balanced, causing kswapd to run endlessly.
+ */
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
+static inline void reset_kswapd_failures(struct pglist_data *pgdat,
+					 struct scan_control *sc)
+{
+	if (pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
+		atomic_set(&pgdat->kswapd_failures, 0);
+}
+
 #ifdef CONFIG_LRU_GEN
 
 #ifdef CONFIG_LRU_GEN_ENABLED
@@ -5065,7 +5079,7 @@ static void lru_gen_shrink_node(struct p
 	blk_finish_plug(&plug);
 done:
 	if (sc->nr_reclaimed > reclaimed)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		reset_kswapd_failures(pgdat, sc);
 }
 
 /******************************************************************************
@@ -6139,7 +6153,7 @@ again:
 	 * successful direct reclaim run will revive a dormant kswapd.
 	 */
 	if (reclaimable)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		reset_kswapd_failures(pgdat, sc);
 	else if (sc->cache_trim_mode)
 		sc->cache_trim_mode_failed = 1;
 }
_
Patches currently in -mm which might be from jiayuan.chen@shopee.com are