[PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
@ 2025-12-22 12:20 Jiayuan Chen
  2025-12-22 18:29 ` Andrew Morton
  2025-12-22 21:15 ` Shakeel Butt
  0 siblings, 2 replies; 15+ messages in thread
From: Jiayuan Chen @ 2025-12-22 12:20 UTC (permalink / raw)
  To: linux-mm
  Cc: Jiayuan Chen, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Shakeel Butt, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

From: Jiayuan Chen <jiayuan.chen@shopee.com>

When kswapd fails to reclaim memory, kswapd_failures is incremented.
Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
futile reclaim attempts. However, any successful direct reclaim
unconditionally resets kswapd_failures to 0, which can cause problems.

We observed an issue in production on a multi-NUMA system where a
process allocated large amounts of anonymous pages on a single NUMA
node, causing its watermark to drop below high and evicting most file
pages:

$ numastat -m
Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal               128222.19       127983.91       256206.11
MemFree                  1414.48         1432.80         2847.29
MemUsed                126807.71       126551.11       252358.82
SwapCached                  0.00            0.00            0.00
Active                  29017.91        25554.57        54572.48
Inactive                92749.06        95377.00       188126.06
Active(anon)            28998.96        23356.47        52355.43
Inactive(anon)          92685.27        87466.11       180151.39
Active(file)               18.95         2198.10         2217.05
Inactive(file)             63.79         7910.89         7974.68

With swap disabled, only file pages can be reclaimed. When kswapd is
woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
raise free memory above the high watermark since reclaimable file pages
are insufficient. Normally, kswapd would eventually stop after
kswapd_failures reaches MAX_RECLAIM_RETRIES.

However, pods on this machine have memory.high set in their cgroup.
Business processes continuously trigger the high limit, causing frequent
direct reclaim that keeps resetting kswapd_failures to 0. This prevents
kswapd from ever stopping.

The result is that kswapd runs endlessly, repeatedly evicting the few
remaining file pages which are actually hot. These pages constantly
refault, generating sustained heavy IO READ pressure.

Fix this by only resetting kswapd_failures from direct reclaim when the
node is actually balanced. This prevents direct reclaim from keeping
kswapd alive when the node cannot be balanced through reclaim alone.

Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
 mm/vmscan.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 453d654727c1..b450bde4e489 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2648,6 +2648,15 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 			  lruvec_memcg(lruvec));
 }
 
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
+static inline void reset_kswapd_failures(struct pglist_data *pgdat,
+					 struct scan_control *sc)
+{
+	if (!current_is_kswapd() &&
+	    pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
+		atomic_set(&pgdat->kswapd_failures, 0);
+}
+
 #ifdef CONFIG_LRU_GEN
 
 #ifdef CONFIG_LRU_GEN_ENABLED
@@ -5065,7 +5074,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
 	blk_finish_plug(&plug);
 done:
 	if (sc->nr_reclaimed > reclaimed)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		reset_kswapd_failures(pgdat, sc);
 }
 
 /******************************************************************************
@@ -6139,7 +6148,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * successful direct reclaim run will revive a dormant kswapd.
 	 */
 	if (reclaimable)
-		atomic_set(&pgdat->kswapd_failures, 0);
+		reset_kswapd_failures(pgdat, sc);
 	else if (sc->cache_trim_mode)
 		sc->cache_trim_mode_failed = 1;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2025-12-22 12:20 [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim Jiayuan Chen
@ 2025-12-22 18:29 ` Andrew Morton
  2025-12-23  1:51   ` Jiayuan Chen
  2025-12-22 21:15 ` Shakeel Butt
  1 sibling, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2025-12-22 18:29 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Jiayuan Chen, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Shakeel Butt, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Mon, 22 Dec 2025 20:20:21 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:

> From: Jiayuan Chen <jiayuan.chen@shopee.com>
> 
> When kswapd fails to reclaim memory, kswapd_failures is incremented.
> Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> futile reclaim attempts. However, any successful direct reclaim
> unconditionally resets kswapd_failures to 0, which can cause problems.
> 
> We observed an issue in production on a multi-NUMA system where a
> process allocated large amounts of anonymous pages on a single NUMA
> node, causing its watermark to drop below high and evicting most file
> pages:
> 
> $ numastat -m
> Per-node system memory usage (in MBs):
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> MemTotal               128222.19       127983.91       256206.11
> MemFree                  1414.48         1432.80         2847.29
> MemUsed                126807.71       126551.11       252358.82
> SwapCached                  0.00            0.00            0.00
> Active                  29017.91        25554.57        54572.48
> Inactive                92749.06        95377.00       188126.06
> Active(anon)            28998.96        23356.47        52355.43
> Inactive(anon)          92685.27        87466.11       180151.39
> Active(file)               18.95         2198.10         2217.05
> Inactive(file)             63.79         7910.89         7974.68
> 
> With swap disabled, only file pages can be reclaimed. When kswapd is
> woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> raise free memory above the high watermark since reclaimable file pages
> are insufficient. Normally, kswapd would eventually stop after
> kswapd_failures reaches MAX_RECLAIM_RETRIES.
> 
> However, pods on this machine have memory.high set in their cgroup.

What's a "pod"?

> Business processes continuously trigger the high limit, causing frequent
> direct reclaim that keeps resetting kswapd_failures to 0. This prevents
> kswapd from ever stopping.
> 
> The result is that kswapd runs endlessly, repeatedly evicting the few
> remaining file pages which are actually hot. These pages constantly
> refault, generating sustained heavy IO READ pressure.

Yes, not good.

> Fix this by only resetting kswapd_failures from direct reclaim when the
> node is actually balanced. This prevents direct reclaim from keeping
> kswapd alive when the node cannot be balanced through reclaim alone.
>
> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2648,6 +2648,15 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
>  			  lruvec_memcg(lruvec));
>  }
>  
> +static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);

Forward declaration could be avoided by relocating pgdat_balanced(),
although the patch will get a lot larger.

> +static inline void reset_kswapd_failures(struct pglist_data *pgdat,
> +					 struct scan_control *sc)

It would be nice to have a nice comment explaining why this is here. 
Why are we checking for balanced?

> +{
> +	if (!current_is_kswapd() &&

kswapd can no longer clear ->kswapd_failures.  What's the thinking here?

> +	    pgdat_balanced(pgdat, sc->order, sc->reclaim_idx))
> +		atomic_set(&pgdat->kswapd_failures, 0);
> +}
> +
>  #ifdef CONFIG_LRU_GEN
>  
>  #ifdef CONFIG_LRU_GEN_ENABLED
> @@ -5065,7 +5074,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
>  	blk_finish_plug(&plug);
>  done:
>  	if (sc->nr_reclaimed > reclaimed)
> -		atomic_set(&pgdat->kswapd_failures, 0);
> +		reset_kswapd_failures(pgdat, sc);
>  }
>  
>  /******************************************************************************
> @@ -6139,7 +6148,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	 * successful direct reclaim run will revive a dormant kswapd.
>  	 */
>  	if (reclaimable)
> -		atomic_set(&pgdat->kswapd_failures, 0);
> +		reset_kswapd_failures(pgdat, sc);
>  	else if (sc->cache_trim_mode)
>  		sc->cache_trim_mode_failed = 1;
>  }


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2025-12-22 18:29 ` Andrew Morton
@ 2025-12-23  1:51   ` Jiayuan Chen
  0 siblings, 0 replies; 15+ messages in thread
From: Jiayuan Chen @ 2025-12-23  1:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Jiayuan Chen, Johannes  Weiner, David Hildenbrand,
	Michal  Hocko, Qi Zheng, Shakeel  Butt, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

December 23, 2025 at 02:29, "Andrew Morton" <akpm@linux-foundation.org mailto:akpm@linux-foundation.org?to=%22Andrew%20Morton%22%20%3Cakpm%40linux-foundation.org%3E > wrote:

Hi Andrew,
Thanks for the review.
> 
> On Mon, 22 Dec 2025 20:20:21 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
> 
> > 
> > From: Jiayuan Chen <jiayuan.chen@shopee.com>
> >  
> >  When kswapd fails to reclaim memory, kswapd_failures is incremented.
> >  Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> >  futile reclaim attempts. However, any successful direct reclaim
> >  unconditionally resets kswapd_failures to 0, which can cause problems.
> >  
> >  We observed an issue in production on a multi-NUMA system where a
> >  process allocated large amounts of anonymous pages on a single NUMA
> >  node, causing its watermark to drop below high and evicting most file
> >  pages:
> >  
> >  $ numastat -m
> >  Per-node system memory usage (in MBs):
> >  Node 0 Node 1 Total
> >  --------------- --------------- ---------------
> >  MemTotal 128222.19 127983.91 256206.11
> >  MemFree 1414.48 1432.80 2847.29
> >  MemUsed 126807.71 126551.11 252358.82
> >  SwapCached 0.00 0.00 0.00
> >  Active 29017.91 25554.57 54572.48
> >  Inactive 92749.06 95377.00 188126.06
> >  Active(anon) 28998.96 23356.47 52355.43
> >  Inactive(anon) 92685.27 87466.11 180151.39
> >  Active(file) 18.95 2198.10 2217.05
> >  Inactive(file) 63.79 7910.89 7974.68
> >  
> >  With swap disabled, only file pages can be reclaimed. When kswapd is
> >  woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> >  raise free memory above the high watermark since reclaimable file pages
> >  are insufficient. Normally, kswapd would eventually stop after
> >  kswapd_failures reaches MAX_RECLAIM_RETRIES.
> >  
> >  However, pods on this machine have memory.high set in their cgroup.
> > 
> What's a "pod"?

A pod is a Kubernetes container. Sorry for the unclear terminology.


> > 
> > Business processes continuously trigger the high limit, causing frequent
> >  direct reclaim that keeps resetting kswapd_failures to 0. This prevents
> >  kswapd from ever stopping.
> >  
> >  The result is that kswapd runs endlessly, repeatedly evicting the few
> >  remaining file pages which are actually hot. These pages constantly
> >  refault, generating sustained heavy IO READ pressure.
> > 
> Yes, not good.
> 
> > 
> > Fix this by only resetting kswapd_failures from direct reclaim when the
> >  node is actually balanced. This prevents direct reclaim from keeping
> >  kswapd alive when the node cannot be balanced through reclaim alone.
> > 
> >  ...
> > 
> >  --- a/mm/vmscan.c
> >  +++ b/mm/vmscan.c
> >  @@ -2648,6 +2648,15 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
> >  lruvec_memcg(lruvec));
> >  }
> >  
> >  +static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx);
> > 
> Forward declaration could be avoided by relocating pgdat_balanced(),
> although the patch will get a lot larger.

Thanks for pointing this out.

> > 
> > +static inline void reset_kswapd_failures(struct pglist_data *pgdat,
> >  + struct scan_control *sc)
> > 
> It would be nice to have a nice comment explaining why this is here. 
> Why are we checking for balanced?

You're right, a comment explaining the rationale would be helpful.


> > 
> > +{
> >  + if (!current_is_kswapd() &&
> > 
> kswapd can no longer clear ->kswapd_failures. What's the thinking here?


Good catch. My original thinking was that kswapd already checks pgdat_balanced()
in its own path after successful reclaim, so I wanted to avoid redundant checks.
But looking at the code again, this is indeed a bug - kswapd's reclaim path does
need to clear kswapd_failures on successful reclaim.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2025-12-22 12:20 [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim Jiayuan Chen
  2025-12-22 18:29 ` Andrew Morton
@ 2025-12-22 21:15 ` Shakeel Butt
  2025-12-23  1:42   ` Jiayuan Chen
  1 sibling, 1 reply; 15+ messages in thread
From: Shakeel Butt @ 2025-12-22 21:15 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Jiayuan Chen, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Mon, Dec 22, 2025 at 08:20:21PM +0800, Jiayuan Chen wrote:
> From: Jiayuan Chen <jiayuan.chen@shopee.com>
> 
> When kswapd fails to reclaim memory, kswapd_failures is incremented.
> Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> futile reclaim attempts. However, any successful direct reclaim
> unconditionally resets kswapd_failures to 0, which can cause problems.
> 
> We observed an issue in production on a multi-NUMA system where a
> process allocated large amounts of anonymous pages on a single NUMA
> node, causing its watermark to drop below high and evicting most file
> pages:
> 
> $ numastat -m
> Per-node system memory usage (in MBs):
>                           Node 0          Node 1           Total
>                  --------------- --------------- ---------------
> MemTotal               128222.19       127983.91       256206.11
> MemFree                  1414.48         1432.80         2847.29
> MemUsed                126807.71       126551.11       252358.82
> SwapCached                  0.00            0.00            0.00
> Active                  29017.91        25554.57        54572.48
> Inactive                92749.06        95377.00       188126.06
> Active(anon)            28998.96        23356.47        52355.43
> Inactive(anon)          92685.27        87466.11       180151.39
> Active(file)               18.95         2198.10         2217.05
> Inactive(file)             63.79         7910.89         7974.68
> 
> With swap disabled, only file pages can be reclaimed. When kswapd is
> woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> raise free memory above the high watermark since reclaimable file pages
> are insufficient. Normally, kswapd would eventually stop after
> kswapd_failures reaches MAX_RECLAIM_RETRIES.
> 
> However, pods on this machine have memory.high set in their cgroup.
> Business processes continuously trigger the high limit, causing frequent
> direct reclaim that keeps resetting kswapd_failures to 0. This prevents
> kswapd from ever stopping.
> 
> The result is that kswapd runs endlessly, repeatedly evicting the few
> remaining file pages which are actually hot. These pages constantly
> refault, generating sustained heavy IO READ pressure.

I don't think kswapd is an issue here. The system is out of memory and
most of the memory is unreclaimable. Either change the workload to use
less memory or enable swap (or zswap) to have more reclaimable memory.

Other than that we can discuss memcg reclaim resetting the kswapd
failure count should be changed or not but that is a separate
discussion.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2025-12-22 21:15 ` Shakeel Butt
@ 2025-12-23  1:42   ` Jiayuan Chen
  2025-12-23  6:11     ` Shakeel Butt
  0 siblings, 1 reply; 15+ messages in thread
From: Jiayuan Chen @ 2025-12-23  1:42 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: linux-mm, Jiayuan Chen, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

December 23, 2025 at 05:15, "Shakeel Butt" <shakeel.butt@linux.dev mailto:shakeel.butt@linux.dev?to=%22Shakeel%20Butt%22%20%3Cshakeel.butt%40linux.dev%3E > wrote:


> 
> On Mon, Dec 22, 2025 at 08:20:21PM +0800, Jiayuan Chen wrote:
> 
> > 
> > From: Jiayuan Chen <jiayuan.chen@shopee.com>
> >  
> >  When kswapd fails to reclaim memory, kswapd_failures is incremented.
> >  Once it reaches MAX_RECLAIM_RETRIES, kswapd stops running to avoid
> >  futile reclaim attempts. However, any successful direct reclaim
> >  unconditionally resets kswapd_failures to 0, which can cause problems.
> >  
> >  We observed an issue in production on a multi-NUMA system where a
> >  process allocated large amounts of anonymous pages on a single NUMA
> >  node, causing its watermark to drop below high and evicting most file
> >  pages:
> >  
> >  $ numastat -m
> >  Per-node system memory usage (in MBs):
> >  Node 0 Node 1 Total
> >  --------------- --------------- ---------------
> >  MemTotal 128222.19 127983.91 256206.11
> >  MemFree 1414.48 1432.80 2847.29
> >  MemUsed 126807.71 126551.11 252358.82
> >  SwapCached 0.00 0.00 0.00
> >  Active 29017.91 25554.57 54572.48
> >  Inactive 92749.06 95377.00 188126.06
> >  Active(anon) 28998.96 23356.47 52355.43
> >  Inactive(anon) 92685.27 87466.11 180151.39
> >  Active(file) 18.95 2198.10 2217.05
> >  Inactive(file) 63.79 7910.89 7974.68
> >  
> >  With swap disabled, only file pages can be reclaimed. When kswapd is
> >  woken (e.g., via wake_all_kswapds()), it runs continuously but cannot
> >  raise free memory above the high watermark since reclaimable file pages
> >  are insufficient. Normally, kswapd would eventually stop after
> >  kswapd_failures reaches MAX_RECLAIM_RETRIES.
> >  
> >  However, pods on this machine have memory.high set in their cgroup.
> >  Business processes continuously trigger the high limit, causing frequent
> >  direct reclaim that keeps resetting kswapd_failures to 0. This prevents
> >  kswapd from ever stopping.
> >  
> >  The result is that kswapd runs endlessly, repeatedly evicting the few
> >  remaining file pages which are actually hot. These pages constantly
> >  refault, generating sustained heavy IO READ pressure.
> > 
> I don't think kswapd is an issue here. The system is out of memory and
> most of the memory is unreclaimable. Either change the workload to use
> less memory or enable swap (or zswap) to have more reclaimable memory.


Hi,
Thanks for looking into this.

Sorry, I didn't describe the scenario clearly enough in the original patch. Let me clarify:

This is a multi-NUMA system where the memory pressure is not global but node-local. The key observation is:

Node 0: Under memory pressure, most memory is anonymous (unreclaimable without swap)
Node 1: Has plenty of reclaimable memory (~60GB file cache out of 125GB total)

Node 0's kswapd runs continuously but cannot reclaim anything
Direct reclaim succeeds by reclaiming from Node 1
Direct reclaim resets kswapd_failures, preventing Node 0's kswapd from stopping
The few file pages on Node 0 are hot and keep refaulting, causing heavy I/O

From a per-node perspective, Node 0 is truly out of reclaimable memory and its kswapd
should stop. But the global direct reclaim success (from Node 1) incorrectly keeps
Node 0's kswapd alive.


Thanks.

> Other than that we can discuss memcg reclaim resetting the kswapd
> failure count should be changed or not but that is a separate
> discussion.
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2025-12-23  1:42   ` Jiayuan Chen
@ 2025-12-23  6:11     ` Shakeel Butt
  2025-12-23  8:22       ` Jiayuan Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Shakeel Butt @ 2025-12-23  6:11 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Jiayuan Chen, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Tue, Dec 23, 2025 at 01:42:37AM +0000, Jiayuan Chen wrote:
> December 23, 2025 at 05:15, "Shakeel Butt" <shakeel.butt@linux.dev mailto:shakeel.butt@linux.dev?to=%22Shakeel%20Butt%22%20%3Cshakeel.butt%40linux.dev%3E > wrote:
> 
[...]
> 
> > > 
> > I don't think kswapd is an issue here. The system is out of memory and
> > most of the memory is unreclaimable. Either change the workload to use
> > less memory or enable swap (or zswap) to have more reclaimable memory.
> 
> 
> Hi,
> Thanks for looking into this.
> 
> Sorry, I didn't describe the scenario clearly enough in the original patch. Let me clarify:
> 
> This is a multi-NUMA system where the memory pressure is not global but node-local. The key observation is:
> 
> Node 0: Under memory pressure, most memory is anonymous (unreclaimable without swap)
> Node 1: Has plenty of reclaimable memory (~60GB file cache out of 125GB total)

Thanks and now the situation is much more clear. IIUC you are running
multiple workloads (pods) on the system. How is the memcg limits
configured for these workloads. You mentioned memory.high, what about
memory.max? Also are you using cpusets to limit the pods to individual
nodes (cpu & memory) or they can run on any node?

Overall I still think it is unbalanced numa nodes in terms of memory and
may for cpu as well. Anyways let's talk about kswapd.

> 
> Node 0's kswapd runs continuously but cannot reclaim anything
> Direct reclaim succeeds by reclaiming from Node 1
> Direct reclaim resets kswapd_failures,

So successful reclaim on one node does not reset kswapd_failures on
other node. The kernel reclaims each node one by one, so if Node 0
direct reclaim was successfull only then kernel allows to reset the
kswapd_failures of Node 0 to be reset.

> preventing Node 0's kswapd from stopping
> The few file pages on Node 0 are hot and keep refaulting, causing heavy I/O
> 

Have you tried numa balancing? Though I think it would be better to
schedule upfront in a way that one node is not overcommitted but numa
balancing provides a dynamic way to adjust the load on each node.

Can you dig deeper on who and why Node 0's kswapd_failures is getting
reset?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2025-12-23  6:11     ` Shakeel Butt
@ 2025-12-23  8:22       ` Jiayuan Chen
  2026-01-05  4:51         ` Shakeel Butt
  0 siblings, 1 reply; 15+ messages in thread
From: Jiayuan Chen @ 2025-12-23  8:22 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: linux-mm, Jiayuan Chen, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

December 23, 2025 at 14:11, "Shakeel Butt" <shakeel.butt@linux.dev mailto:shakeel.butt@linux.dev?to=%22Shakeel%20Butt%22%20%3Cshakeel.butt%40linux.dev%3E > wrote:

> 
> On Tue, Dec 23, 2025 at 01:42:37AM +0000, Jiayuan Chen wrote:
> 
> > 
> > December 23, 2025 at 05:15, "Shakeel Butt" <shakeel.butt@linux.dev mailto:shakeel.butt@linux.dev?to=%22Shakeel%20Butt%22%20%3Cshakeel.butt%40linux.dev%3E > wrote:
> > 
> [...]
> 
> > 
> > > 
> >  I don't think kswapd is an issue here. The system is out of memory and
> >  most of the memory is unreclaimable. Either change the workload to use
> >  less memory or enable swap (or zswap) to have more reclaimable memory.
> >  
> >  
> >  Hi,
> >  Thanks for looking into this.
> >  
> >  Sorry, I didn't describe the scenario clearly enough in the original patch. Let me clarify:
> >  
> >  This is a multi-NUMA system where the memory pressure is not global but node-local. The key observation is:
> >  
> >  Node 0: Under memory pressure, most memory is anonymous (unreclaimable without swap)
> >  Node 1: Has plenty of reclaimable memory (~60GB file cache out of 125GB total)
> > 
> Thanks and now the situation is much more clear. IIUC you are running
> multiple workloads (pods) on the system. How is the memcg limits
> configured for these workloads. You mentioned memory.high, what about

Thanks for the questions. We have pods configured with memory.high and pods configured with memory.max.

Actually, memory.max itself causes heavy I/O issues for us, because it keeps trying to reclaim hot
pages within the cgroup aggressively without killing the process. 

So we configured some pods with memory.high instead, since it performs reclaim in resume_user_mode_work,
which somewhat throttles the memory allocation of user processes.

> memory.max? Also are you using cpusets to limit the pods to individual
> nodes (cpu & memory) or they can run on any node?

Yes, we have cpusets(only cpuset.cpus not cpuset.mems) configured for our cgroups, binding
them to specific NUMA nodes. But I don't think this is directly related to the issue - the
problem can occur with or without cpusets. Even without cpusets.cpus, the kernel prefers
to allocate memory from the node where the process is running, so if a process happens to
run on a CPU belonging to Node 0, the behavior would be similar.

> 
> Overall I still think it is unbalanced numa nodes in terms of memory and
> may for cpu as well. Anyways let's talk about kswapd.
> > 
> > Node 0's kswapd runs continuously but cannot reclaim anything
> >  Direct reclaim succeeds by reclaiming from Node 1
> >  Direct reclaim resets kswapd_failures,
> > 
> So successful reclaim on one node does not reset kswapd_failures on
> other node. The kernel reclaims each node one by one, so if Node 0
> direct reclaim was successfull only then kernel allows to reset the
> kswapd_failures of Node 0 to be reset.

Let me dig deeper into this.

When either memory.max or memory.high is reached, direct reclaim is
triggered. The memory being reclaimed depends on the CPU where the
process is running.

When the problem occurred, we had workloads continuously hitting 
memory.max and workloads continuously hitting memory.high:

reclaim_high    ->   -> try_to_free_mem_cgroup_pages
                   ^      do_try_to_free_pages(zone of current node)
                   |         shrink_zones()
try_charge_memcg  -              shrink_node()
                                     kswapd_failures = 0

Although the pages are hot, if we scan aggressively enough, they will eventually
be reclaimed, and then kswapd_failures gets reset to 0 - because even reclaiming
a single page resets kswapd_failures to 0.

The end result is that we most workloads, which didn't even hit their high
or max limits, experiencing continuous refaults, causing heavy I/O.

Thanks.

> > 
> > preventing Node 0's kswapd from stopping
> >  The few file pages on Node 0 are hot and keep refaulting, causing heavy I/O
> > 
> Have you tried numa balancing? Though I think it would be better to
> schedule upfront in a way that one node is not overcommitted but numa
> balancing provides a dynamic way to adjust the load on each node.

Yes, we have tried it. Actually, I submitted a patch about a month ago to improve
its observability:
https://lore.kernel.org/all/20251124153331.465306a2@gandalf.local.home/
(though only Steven replied, a bit awkward :( ).

We found that the default settings didn't work well for our workloads. When we tried
to increase scan_size to make it more aggressive, we noticed the system load started
to increase. So we haven't fully adopted it yet.

> Can you dig deeper on who and why Node 0's kswapd_failures is getting
> reset?
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2025-12-23  8:22       ` Jiayuan Chen
@ 2026-01-05  4:51         ` Shakeel Butt
  2026-01-06  5:25           ` Jiayuan Chen
  0 siblings, 1 reply; 15+ messages in thread
From: Shakeel Butt @ 2026-01-05  4:51 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Jiayuan Chen, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

Hi Jiayuan,

Sorry for late reply due to holidays/break. I will still be slow to
respond this week but will be fully back after one more week. Anyways,
let me respond below.

On Tue, Dec 23, 2025 at 08:22:43AM +0000, Jiayuan Chen wrote:
> December 23, 2025 at 14:11, "Shakeel Butt" <shakeel.butt@linux.dev mailto:shakeel.butt@linux.dev?to=%22Shakeel%20Butt%22%20%3Cshakeel.butt%40linux.dev%3E > wrote:
> 
> 
> > 
> > On Tue, Dec 23, 2025 at 01:42:37AM +0000, Jiayuan Chen wrote:
> > 
> > > 
> > > December 23, 2025 at 05:15, "Shakeel Butt" <shakeel.butt@linux.dev mailto:shakeel.butt@linux.dev?to=%22Shakeel%20Butt%22%20%3Cshakeel.butt%40linux.dev%3E > wrote:
> > > 
> > [...]
> > 
> > > 
> > > > 
> > >  I don't think kswapd is an issue here. The system is out of memory and
> > >  most of the memory is unreclaimable. Either change the workload to use
> > >  less memory or enable swap (or zswap) to have more reclaimable memory.
> > >  
> > >  
> > >  Hi,
> > >  Thanks for looking into this.
> > >  
> > >  Sorry, I didn't describe the scenario clearly enough in the original patch. Let me clarify:
> > >  
> > >  This is a multi-NUMA system where the memory pressure is not global but node-local. The key observation is:
> > >  
> > >  Node 0: Under memory pressure, most memory is anonymous (unreclaimable without swap)
> > >  Node 1: Has plenty of reclaimable memory (~60GB file cache out of 125GB total)
> > > 
> > Thanks and now the situation is much more clear. IIUC you are running
> > multiple workloads (pods) on the system. How is the memcg limits
> > configured for these workloads. You mentioned memory.high, what about
> 
> Thanks for the questions. We have pods configured with memory.high and pods configured with memory.max.
> 
> Actually, memory.max itself causes heavy I/O issues for us, because it keeps trying to reclaim hot
> pages within the cgroup aggressively without killing the process. 
> 
> So we configured some pods with memory.high instead, since it performs reclaim in resume_user_mode_work,
> which somewhat throttles the memory allocation of user processes.
> 
> > memory.max? Also are you using cpusets to limit the pods to individual
> > nodes (cpu & memory) or they can run on any node?
> 
> Yes, we have cpusets(only cpuset.cpus not cpuset.mems) configured for our cgroups, binding
> them to specific NUMA nodes. But I don't think this is directly related to the issue - the
> problem can occur with or without cpusets. Even without cpusets.cpus, the kernel prefers
> to allocate memory from the node where the process is running, so if a process happens to
> run on a CPU belonging to Node 0, the behavior would be similar.

Are you limiting (using cpuset.cpus) the workloads to single respective
nodes or the individual workloads can still run on multiple nodes? For
example do you have a workload which can run on both (or more) nodes?

> 
> 
> > 
> > Overall I still think it is unbalanced numa nodes in terms of memory and
> > may for cpu as well. Anyways let's talk about kswapd.
> > > 
> > > Node 0's kswapd runs continuously but cannot reclaim anything
> > >  Direct reclaim succeeds by reclaiming from Node 1
> > >  Direct reclaim resets kswapd_failures,
> > > 
> > So successful reclaim on one node does not reset kswapd_failures on
> > other node. The kernel reclaims each node one by one, so if Node 0
> > direct reclaim was successfull only then kernel allows to reset the
> > kswapd_failures of Node 0 to be reset.
> 
> Let me dig deeper into this.
> 
> When either memory.max or memory.high is reached, direct reclaim is
> triggered. The memory being reclaimed depends on the CPU where the
> process is running.
> 
> When the problem occurred, we had workloads continuously hitting 
> memory.max and workloads continuously hitting memory.high:
> 
> reclaim_high    ->   -> try_to_free_mem_cgroup_pages
>                    ^      do_try_to_free_pages(zone of current node)
>                    |         shrink_zones()
> try_charge_memcg  -              shrink_node()
>                                      kswapd_failures = 0
> 
> Although the pages are hot, if we scan aggressively enough, they will eventually
> be reclaimed, and then kswapd_failures gets reset to 0 - because even reclaiming
> a single page resets kswapd_failures to 0.
> 
> The end result is that we most workloads, which didn't even hit their high
> or max limits, experiencing continuous refaults, causing heavy I/O.
> 

So, the decision to reset kswapd_failures on memcg reclaim can be
re-evaluated but I think that is not the root cause here. The
kswapd_failures mechanism is for situations where kswapd is unable to
reclaim and then punting on the direct reclaimers but in your situation
the workloads are not numa memory bound and thus there really is not any
numa level direct reclaimers. Also the lack of reclaimable memory is
making the situation worse.


> Thanks.
> 
> > > 
> > > preventing Node 0's kswapd from stopping
> > >  The few file pages on Node 0 are hot and keep refaulting, causing heavy I/O
> > > 
> > Have you tried numa balancing? Though I think it would be better to
> > schedule upfront in a way that one node is not overcommitted but numa
> > balancing provides a dynamic way to adjust the load on each node.
> 
> Yes, we have tried it. Actually, I submitted a patch about a month ago to improve
> its observability:
> https://lore.kernel.org/all/20251124153331.465306a2@gandalf.local.home/
> (though only Steven replied, a bit awkward :( ).
> 
> We found that the default settings didn't work well for our workloads. When we tried
> to increase scan_size to make it more aggressive, we noticed the system load started
> to increase. So we haven't fully adopted it yet.
> 

I feel the numa balancing will not help as well as or it might make it
worse as the workloads may have allocated some memory on the other node
which numa balancing might try to move to the node which is already
under pressure.

Let me say what I think is the issue. You have the situation where node
0 is overcommitted and is mostly filled with unreclaimable memory. The
workloads running on node 0 have their workingset continuously getting
reclaimed due to node 0 being OOM.

I think the simplest solution for you is to enable swap to have more
reclaimable memory on the system. Hopefully you will have workingset of
the workloads fully in memory on each node.

You can try to change application/workload to be more numa aware and
balance their anon memory on the given nodes but I think that would much
more involved and error prone.

Shakeel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2026-01-05  4:51         ` Shakeel Butt
@ 2026-01-06  5:25           ` Jiayuan Chen
  2026-01-06  9:49             ` Michal Hocko
  2026-01-06 17:45             ` Shakeel Butt
  0 siblings, 2 replies; 15+ messages in thread
From: Jiayuan Chen @ 2026-01-06  5:25 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: linux-mm, Jiayuan Chen, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

January 5, 2026 at 12:51, "Shakeel Butt" <shakeel.butt@linux.dev mailto:shakeel.butt@linux.dev?to=%22Shakeel%20Butt%22%20%3Cshakeel.butt%40linux.dev%3E > wrote:


> 
> Hi Jiayuan,
> 
> Sorry for late reply due to holidays/break. I will still be slow to
> respond this week but will be fully back after one more week. Anyways,
> let me respond below.

No worries about the delay - happy holidays!

> On Tue, Dec 23, 2025 at 08:22:43AM +0000, Jiayuan Chen wrote:
> 
> > 
> > December 23, 2025 at 14:11, "Shakeel Butt" <shakeel.butt@linux.dev mailto:shakeel.butt@linux.dev?to=%22Shakeel%20Butt%22%20%3Cshakeel.butt%40linux.dev%3E > wrote:
> >  
> >  
> >  
> >  On Tue, Dec 23, 2025 at 01:42:37AM +0000, Jiayuan Chen wrote:
> >  
> >  > 
> >  > December 23, 2025 at 05:15, "Shakeel Butt" <shakeel.butt@linux.dev mailto:shakeel.butt@linux.dev?to=%22Shakeel%20Butt%22%20%3Cshakeel.butt%40linux.dev%3E > wrote:
> >  > 
> >  [...]
> >  
> >  > 
> >  > > 
> >  > I don't think kswapd is an issue here. The system is out of memory and
> >  > most of the memory is unreclaimable. Either change the workload to use
> >  > less memory or enable swap (or zswap) to have more reclaimable memory.
> >  > 
> >  > 
> >  > Hi,
> >  > Thanks for looking into this.
> >  > 
> >  > Sorry, I didn't describe the scenario clearly enough in the original patch. Let me clarify:
> >  > 
> >  > This is a multi-NUMA system where the memory pressure is not global but node-local. The key observation is:
> >  > 
> >  > Node 0: Under memory pressure, most memory is anonymous (unreclaimable without swap)
> >  > Node 1: Has plenty of reclaimable memory (~60GB file cache out of 125GB total)
> >  > 
> >  Thanks and now the situation is much more clear. IIUC you are running
> >  multiple workloads (pods) on the system. How is the memcg limits
> >  configured for these workloads. You mentioned memory.high, what about
> >  
> >  Thanks for the questions. We have pods configured with memory.high and pods configured with memory.max.
> >  
> >  Actually, memory.max itself causes heavy I/O issues for us, because it keeps trying to reclaim hot
> >  pages within the cgroup aggressively without killing the process. 
> >  
> >  So we configured some pods with memory.high instead, since it performs reclaim in resume_user_mode_work,
> >  which somewhat throttles the memory allocation of user processes.
> >  
> >  memory.max? Also are you using cpusets to limit the pods to individual
> >  nodes (cpu & memory) or they can run on any node?
> >  
> >  Yes, we have cpusets(only cpuset.cpus not cpuset.mems) configured for our cgroups, binding
> >  them to specific NUMA nodes. But I don't think this is directly related to the issue - the
> >  problem can occur with or without cpusets. Even without cpusets.cpus, the kernel prefers
> >  to allocate memory from the node where the process is running, so if a process happens to
> >  run on a CPU belonging to Node 0, the behavior would be similar.
> > 
> Are you limiting (using cpuset.cpus) the workloads to single respective
> nodes or the individual workloads can still run on multiple nodes? For
> example do you have a workload which can run on both (or more) nodes?

We have many workloads. Some performance-sensitive ones have cpuset.cpus configured to
bind to a specific node, while others don't.

> > 
> > Overall I still think it is unbalanced numa nodes in terms of memory and
> >  may for cpu as well. Anyways let's talk about kswapd.
> >  > 
> >  > Node 0's kswapd runs continuously but cannot reclaim anything
> >  > Direct reclaim succeeds by reclaiming from Node 1
> >  > Direct reclaim resets kswapd_failures,
> >  > 
> >  So successful reclaim on one node does not reset kswapd_failures on
> >  other node. The kernel reclaims each node one by one, so if Node 0
> >  direct reclaim was successfull only then kernel allows to reset the
> >  kswapd_failures of Node 0 to be reset.
> >  
> >  Let me dig deeper into this.
> >  
> >  When either memory.max or memory.high is reached, direct reclaim is
> >  triggered. The memory being reclaimed depends on the CPU where the
> >  process is running.
> >  
> >  When the problem occurred, we had workloads continuously hitting 
> >  memory.max and workloads continuously hitting memory.high:
> >  
> >  reclaim_high -> -> try_to_free_mem_cgroup_pages
> >  ^ do_try_to_free_pages(zone of current node)
> >  | shrink_zones()
> >  try_charge_memcg - shrink_node()
> >  kswapd_failures = 0
> >  
> >  Although the pages are hot, if we scan aggressively enough, they will eventually
> >  be reclaimed, and then kswapd_failures gets reset to 0 - because even reclaiming
> >  a single page resets kswapd_failures to 0.
> >  
> >  The end result is that we most workloads, which didn't even hit their high
> >  or max limits, experiencing continuous refaults, causing heavy I/O.
> > 
> So, the decision to reset kswapd_failures on memcg reclaim can be
> re-evaluated but I think that is not the root cause here. The


The workloads triggering direct reclaim have their memory spread across multiple nodes,
since we don't set cpuset.mems, so the cgroup can reclaim memory from multiple nodes.
In particular, complex applications have many threads, different threads allocating and
freeing large amounts of memory (both anonymous and file pages), and these allocations
can consume memory from nodes that are above the low watermark.

You're right that multiple factors contribute to the issue I described. This patch addresses
one of them, just like the boost_watermark patch I submitted before, and the recent patch
about memory.high causing high I/O. There are other scenarios as well that I'm still trying
to reproduce.

That said, I believe this patch is still a valid fix on its own - resetting kswapd_failures
when the node is not actually balanced doesn't seem like correct behavior regardless of the
broader context.

> kswapd_failures mechanism is for situations where kswapd is unable to
> reclaim and then punting on the direct reclaimers but in your situation
> the workloads are not numa memory bound and thus there really is not any
> numa level direct reclaimers. Also the lack of reclaimable memory is
> making the situation worse.


> > 
> > Thanks.
> >  
> >  > 
> >  > preventing Node 0's kswapd from stopping
> >  > The few file pages on Node 0 are hot and keep refaulting, causing heavy I/O
> >  > 
> >  Have you tried numa balancing? Though I think it would be better to
> >  schedule upfront in a way that one node is not overcommitted but numa
> >  balancing provides a dynamic way to adjust the load on each node.
> >  
> >  Yes, we have tried it. Actually, I submitted a patch about a month ago to improve
> >  its observability:
> >  https://lore.kernel.org/all/20251124153331.465306a2@gandalf.local.home/
> >  (though only Steven replied, a bit awkward :( ).
> >  
> >  We found that the default settings didn't work well for our workloads. When we tried
> >  to increase scan_size to make it more aggressive, we noticed the system load started
> >  to increase. So we haven't fully adopted it yet.
> > 
> I feel the numa balancing will not help as well as or it might make it
> worse as the workloads may have allocated some memory on the other node
> which numa balancing might try to move to the node which is already
> under pressure.

Agreed.

> Let me say what I think is the issue. You have the situation where node
> 0 is overcommitted and is mostly filled with unreclaimable memory. The
> workloads running on node 0 have their workingset continuously getting
> reclaimed due to node 0 being OOM.

From our monitoring, only a single cgroup triggered direct reclaim - some
hitting memory.high and some hitting memory.max (we have tracepoints for monitoring).

> I think the simplest solution for you is to enable swap to have more
> reclaimable memory on the system. Hopefully you will have workingset of
> the workloads fully in memory on each node.
> 
> You can try to change application/workload to be more numa aware and
> balance their anon memory on the given nodes but I think that would much
> more involved and error prone.

Enabling swap is one solution, but due to historical reasons we haven't
enabled it - our disk performance is relatively poor. zram is also an
option, but the migration would take significant time.

Thanks

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2026-01-06  5:25           ` Jiayuan Chen
@ 2026-01-06  9:49             ` Michal Hocko
  2026-01-06 11:19               ` Jiayuan Chen
  2026-01-06 17:45             ` Shakeel Butt
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2026-01-06  9:49 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: Shakeel Butt, linux-mm, Jiayuan Chen, Andrew Morton,
	Johannes Weiner, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Tue 06-01-26 05:25:42, Jiayuan Chen wrote:
> That said, I believe this patch is still a valid fix on its own - resetting kswapd_failures
> when the node is not actually balanced doesn't seem like correct behavior regardless of the
> broader context.

Originally I was more inclined to opt out memcg reclaim from reseting
kswapd retry counter but the more I am thiking about that the more your
patch makes sense to me. 

The reason being that it handles both memcg and global direct reclaims
in the same way which makes the logic easier to follow. Afterall the
primary purpose is to resurrect kswapd after we can see there is a
better chance to reclaim something for kswapd. Until that moment direct
reclaim is the only reclaim mechanism.

Relying on pgdat_balanced might lead to re-enabling kswapd way much
later while memory reclaim would be still mostly direct reclaim bound -
thus increase allocation latencies.

If we wanted to do better we would need to evaluate recent
refaults/thrashing behavior but even then I am not sure we can make a
good cut off.

So in the end pgdat_balanced approach seems worth trying and see whether
this could cause any corner cases.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2026-01-06  9:49             ` Michal Hocko
@ 2026-01-06 11:19               ` Jiayuan Chen
  2026-01-06 12:59                 ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Jiayuan Chen @ 2026-01-06 11:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, linux-mm, Jiayuan Chen, Andrew Morton,
	Johannes Weiner, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

January 6, 2026 at 17:49, "Michal Hocko" <mhocko@suse.com mailto:mhocko@suse.com?to=%22Michal%20Hocko%22%20%3Cmhocko%40suse.com%3E > wrote:


> 
> On Tue 06-01-26 05:25:42, Jiayuan Chen wrote:
> 
> > 
> > That said, I believe this patch is still a valid fix on its own - resetting kswapd_failures
> >  when the node is not actually balanced doesn't seem like correct behavior regardless of the
> >  broader context.
> > 
> Originally I was more inclined to opt out memcg reclaim from reseting
> kswapd retry counter but the more I am thiking about that the more your
> patch makes sense to me. 
> 
> The reason being that it handles both memcg and global direct reclaims
> in the same way which makes the logic easier to follow. Afterall the
> primary purpose is to resurrect kswapd after we can see there is a
> better chance to reclaim something for kswapd. Until that moment direct
> reclaim is the only reclaim mechanism.
> 
> Relying on pgdat_balanced might lead to re-enabling kswapd way much
> later while memory reclaim would be still mostly direct reclaim bound -
> thus increase allocation latencies.
> If we wanted to do better we would need to evaluate recent
> refaults/thrashing behavior but even then I am not sure we can make a
> good cut off.
> 
> So in the end pgdat_balanced approach seems worth trying and see whether
> this could cause any corner cases.

Thanks Michal.

Regarding the allocation latency concern - we are already
in the direct reclaim slowpath, so a little extra overhead
from the pgdat_balanced check should be negligible.

> -- 
> Michal Hocko
> SUSE Labs
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2026-01-06 11:19               ` Jiayuan Chen
@ 2026-01-06 12:59                 ` Michal Hocko
  2026-01-06 16:50                   ` Shakeel Butt
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2026-01-06 12:59 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: Shakeel Butt, linux-mm, Jiayuan Chen, Andrew Morton,
	Johannes Weiner, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Tue 06-01-26 11:19:21, Jiayuan Chen wrote:
> January 6, 2026 at 17:49, "Michal Hocko" <mhocko@suse.com mailto:mhocko@suse.com?to=%22Michal%20Hocko%22%20%3Cmhocko%40suse.com%3E > wrote:
> 
> 
> > 
> > On Tue 06-01-26 05:25:42, Jiayuan Chen wrote:
> > 
> > > 
> > > That said, I believe this patch is still a valid fix on its own - resetting kswapd_failures
> > >  when the node is not actually balanced doesn't seem like correct behavior regardless of the
> > >  broader context.
> > > 
> > Originally I was more inclined to opt out memcg reclaim from reseting
> > kswapd retry counter but the more I am thiking about that the more your
> > patch makes sense to me. 
> > 
> > The reason being that it handles both memcg and global direct reclaims
> > in the same way which makes the logic easier to follow. Afterall the
> > primary purpose is to resurrect kswapd after we can see there is a
> > better chance to reclaim something for kswapd. Until that moment direct
> > reclaim is the only reclaim mechanism.
> > 
> > Relying on pgdat_balanced might lead to re-enabling kswapd way much
> > later while memory reclaim would be still mostly direct reclaim bound -
> > thus increase allocation latencies.
> > If we wanted to do better we would need to evaluate recent
> > refaults/thrashing behavior but even then I am not sure we can make a
> > good cut off.
> > 
> > So in the end pgdat_balanced approach seems worth trying and see whether
> > this could cause any corner cases.
> 
> Thanks Michal.
> 
> Regarding the allocation latency concern - we are already
> in the direct reclaim slowpath, so a little extra overhead
> from the pgdat_balanced check should be negligible.

Yes, I do not think that pgdat_balanced call itself adds to the latency
in the reclaim (slow) path. Mine main concern regarding latencies is
about direct reclaim as a sole source of reclaim itself (as kswapd is
not active).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2026-01-06 12:59                 ` Michal Hocko
@ 2026-01-06 16:50                   ` Shakeel Butt
  2026-01-06 19:14                     ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Shakeel Butt @ 2026-01-06 16:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jiayuan Chen, linux-mm, Jiayuan Chen, Andrew Morton,
	Johannes Weiner, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Tue, Jan 06, 2026 at 01:59:15PM +0100, Michal Hocko wrote:
> On Tue 06-01-26 11:19:21, Jiayuan Chen wrote:
> > January 6, 2026 at 17:49, "Michal Hocko" <mhocko@suse.com mailto:mhocko@suse.com?to=%22Michal%20Hocko%22%20%3Cmhocko%40suse.com%3E > wrote:
> > 
> > 
> > > 
> > > On Tue 06-01-26 05:25:42, Jiayuan Chen wrote:
> > > 
> > > > 
> > > > That said, I believe this patch is still a valid fix on its own - resetting kswapd_failures
> > > >  when the node is not actually balanced doesn't seem like correct behavior regardless of the
> > > >  broader context.
> > > > 
> > > Originally I was more inclined to opt out memcg reclaim from reseting
> > > kswapd retry counter but the more I am thiking about that the more your
> > > patch makes sense to me. 
> > > 
> > > The reason being that it handles both memcg and global direct reclaims
> > > in the same way which makes the logic easier to follow. Afterall the
> > > primary purpose is to resurrect kswapd after we can see there is a
> > > better chance to reclaim something for kswapd. Until that moment direct
> > > reclaim is the only reclaim mechanism.
> > > 
> > > Relying on pgdat_balanced might lead to re-enabling kswapd way much
> > > later while memory reclaim would be still mostly direct reclaim bound -
> > > thus increase allocation latencies.
> > > If we wanted to do better we would need to evaluate recent
> > > refaults/thrashing behavior but even then I am not sure we can make a
> > > good cut off.
> > > 
> > > So in the end pgdat_balanced approach seems worth trying and see whether
> > > this could cause any corner cases.
> > 
> > Thanks Michal.
> > 
> > Regarding the allocation latency concern - we are already
> > in the direct reclaim slowpath, so a little extra overhead
> > from the pgdat_balanced check should be negligible.
> 
> Yes, I do not think that pgdat_balanced call itself adds to the latency
> in the reclaim (slow) path. Mine main concern regarding latencies is
> about direct reclaim as a sole source of reclaim itself (as kswapd is
> not active).

Yes we will be punting on direct reclaimers to collectively balance the
node which I think is fine for such cases i.e. high kswapd_failures.
However I still think the high kswapd_failures is most probably caused
by misconfiguration of the system by the users (like overcommitting zones
or nodes with unreclaimable memory or very memory.min). Yes, we can
reduce the suffering of such misconfigurations like this patch but
somehow the user should be notified that the system is misconfigured.
Anyways, I think we can proceed with this path.

Juayuan, have you tested this patch on your production environment? 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2026-01-06 16:50                   ` Shakeel Butt
@ 2026-01-06 19:14                     ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2026-01-06 19:14 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jiayuan Chen, linux-mm, Jiayuan Chen, Andrew Morton,
	Johannes Weiner, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Tue 06-01-26 08:50:11, Shakeel Butt wrote:
> On Tue, Jan 06, 2026 at 01:59:15PM +0100, Michal Hocko wrote:
> > On Tue 06-01-26 11:19:21, Jiayuan Chen wrote:
> > > January 6, 2026 at 17:49, "Michal Hocko" <mhocko@suse.com mailto:mhocko@suse.com?to=%22Michal%20Hocko%22%20%3Cmhocko%40suse.com%3E > wrote:
> > > 
> > > 
> > > > 
> > > > On Tue 06-01-26 05:25:42, Jiayuan Chen wrote:
> > > > 
> > > > > 
> > > > > That said, I believe this patch is still a valid fix on its own - resetting kswapd_failures
> > > > >  when the node is not actually balanced doesn't seem like correct behavior regardless of the
> > > > >  broader context.
> > > > > 
> > > > Originally I was more inclined to opt out memcg reclaim from reseting
> > > > kswapd retry counter but the more I am thiking about that the more your
> > > > patch makes sense to me. 
> > > > 
> > > > The reason being that it handles both memcg and global direct reclaims
> > > > in the same way which makes the logic easier to follow. Afterall the
> > > > primary purpose is to resurrect kswapd after we can see there is a
> > > > better chance to reclaim something for kswapd. Until that moment direct
> > > > reclaim is the only reclaim mechanism.
> > > > 
> > > > Relying on pgdat_balanced might lead to re-enabling kswapd way much
> > > > later while memory reclaim would be still mostly direct reclaim bound -
> > > > thus increase allocation latencies.
> > > > If we wanted to do better we would need to evaluate recent
> > > > refaults/thrashing behavior but even then I am not sure we can make a
> > > > good cut off.
> > > > 
> > > > So in the end pgdat_balanced approach seems worth trying and see whether
> > > > this could cause any corner cases.
> > > 
> > > Thanks Michal.
> > > 
> > > Regarding the allocation latency concern - we are already
> > > in the direct reclaim slowpath, so a little extra overhead
> > > from the pgdat_balanced check should be negligible.
> > 
> > Yes, I do not think that pgdat_balanced call itself adds to the latency
> > in the reclaim (slow) path. Mine main concern regarding latencies is
> > about direct reclaim as a sole source of reclaim itself (as kswapd is
> > not active).
> 
> Yes we will be punting on direct reclaimers to collectively balance the
> node which I think is fine for such cases i.e. high kswapd_failures.
> However I still think the high kswapd_failures is most probably caused
> by misconfiguration of the system by the users (like overcommitting zones
> or nodes with unreclaimable memory or very memory.min).

I am not questioning a misconfiguration. It is just far from great that
kswapd adds to the problem under those conditions without a very good
reason. I would be pushing back on increasing complexity for apparently
misonfigured systems but I believe it is fair to say that failure
counter reset logic could see some improvements. So let's see whether we
can deal with the situation better while improving on this logic without
much of an added complexity.

> Yes, we can
> reduce the suffering of such misconfigurations like this patch but
> somehow the user should be notified that the system is misconfigured.
> Anyways, I think we can proceed with this path.
> 
> Juayuan, have you tested this patch on your production environment? 

Yes, getting some reclaim stats to the changelog would be highly
appreciated (with and without the patch of course if you can reproduce
the issue).

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim
  2026-01-06  5:25           ` Jiayuan Chen
  2026-01-06  9:49             ` Michal Hocko
@ 2026-01-06 17:45             ` Shakeel Butt
  1 sibling, 0 replies; 15+ messages in thread
From: Shakeel Butt @ 2026-01-06 17:45 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Jiayuan Chen, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Lorenzo Stoakes,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel

On Tue, Jan 06, 2026 at 05:25:42AM +0000, Jiayuan Chen wrote:
> January 5, 2026 at 12:51, "Shakeel Butt" <shakeel.butt@linux.dev mailto:shakeel.butt@linux.dev?to=%22Shakeel%20Butt%22%20%3Cshakeel.butt%40linux.dev%3E > wrote:
> 
> 
> 
> > I think the simplest solution for you is to enable swap to have more
> > reclaimable memory on the system. Hopefully you will have workingset of
> > the workloads fully in memory on each node.
> > 
> > You can try to change application/workload to be more numa aware and
> > balance their anon memory on the given nodes but I think that would much
> > more involved and error prone.
> 
> Enabling swap is one solution, but due to historical reasons we haven't
> enabled it - our disk performance is relatively poor. zram is also an
> option, but the migration would take significant time.

Beside zram, You can try zswap with memory.zswap.writeback=0 to avoid
disk for swap. I would suggest to try swap (zswap or swap on zram) on
couple of impacted machines to see if the issue you are seeing is
resolved.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2026-01-06 19:14 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-22 12:20 [PATCH v1] mm/vmscan: mitigate spurious kswapd_failures reset from direct reclaim Jiayuan Chen
2025-12-22 18:29 ` Andrew Morton
2025-12-23  1:51   ` Jiayuan Chen
2025-12-22 21:15 ` Shakeel Butt
2025-12-23  1:42   ` Jiayuan Chen
2025-12-23  6:11     ` Shakeel Butt
2025-12-23  8:22       ` Jiayuan Chen
2026-01-05  4:51         ` Shakeel Butt
2026-01-06  5:25           ` Jiayuan Chen
2026-01-06  9:49             ` Michal Hocko
2026-01-06 11:19               ` Jiayuan Chen
2026-01-06 12:59                 ` Michal Hocko
2026-01-06 16:50                   ` Shakeel Butt
2026-01-06 19:14                     ` Michal Hocko
2026-01-06 17:45             ` Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox