* [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
@ 2025-10-24 2:27 Jiayuan Chen
2025-10-26 4:40 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Jiayuan Chen @ 2025-10-24 2:27 UTC (permalink / raw)
To: linux-mm
Cc: Jiayuan Chen, Andrew Morton, Johannes Weiner, David Hildenbrand,
Michal Hocko, Qi Zheng, Shakeel Butt, Lorenzo Stoakes,
Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-kernel
We encountered a scenario where direct memory reclaim was triggered,
leading to increased system latency:
1. The memory.low values set on host pods are quite large: some
pods are set to 10GB, others to 20GB, etc.
2. Since most pods have memory protection configured, each time kswapd is
woken up, if a pod's memory usage hasn't exceeded its own memory.low,
its memory won't be reclaimed.
3. When applications start up, rapidly consume memory, or experience
network traffic bursts, the kernel reaches steal_suitable_fallback(),
which sets watermark_boost and subsequently wakes kswapd.
4. In the core logic of the kswapd thread (balance_pgdat()), when reclaim
is triggered by watermark_boost, priority never drops below 10
(DEF_PRIORITY - 2), i.e. scanning never becomes more aggressive than
that. Higher priority values mean less aggressive LRU scanning, which
can result in no pages being reclaimed during a single scan cycle:

	if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
		raise_priority = false;
5. This eventually causes pgdat->kswapd_failures to continuously
accumulate, exceeding MAX_RECLAIM_RETRIES, and consequently kswapd stops
working. At this point, the system's available memory is still
significantly above the high watermark — it's inappropriate for kswapd
to stop under these conditions.
The final observable issue is that a brief period of rapid memory
allocation causes kswapd to stop running, ultimately triggering direct
reclaim and making the applications unresponsive.
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
v1 -> v2: Do not modify memory.low handling
https://lore.kernel.org/linux-mm/20251014081850.65379-1-jiayuan.chen@linux.dev/
---
mm/vmscan.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 92f4ca99b73c..fa8663781086 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7128,7 +7128,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 		goto restart;
 	}
 
-	if (!sc.nr_reclaimed)
+	/*
+	 * If the reclaim was boosted, we might still be far from the
+	 * watermark_high at this point. We need to avoid increasing the
+	 * failure count to prevent the kswapd thread from stopping.
+	 */
+	if (!sc.nr_reclaimed && !boosted)
 		atomic_inc(&pgdat->kswapd_failures);
 
 out:
--
2.43.0
^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  2025-10-24  2:27 [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted Jiayuan Chen
@ 2025-10-26  4:40 ` Andrew Morton
  2025-11-08  1:11 ` Shakeel Butt
  2025-11-13 23:47 ` Shakeel Butt
  2 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2025-10-26  4:40 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Johannes Weiner, David Hildenbrand, Michal Hocko,
	Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-kernel

On Fri, 24 Oct 2025 10:27:11 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:

> We encountered a scenario where direct memory reclaim was triggered,
> leading to increased system latency:

Who is "we", if I may ask?

> 1. The memory.low values set on host pods are actually quite large, some
> pods are set to 10GB, others to 20GB, etc.
> 2. Since most pods have memory protection configured, each time kswapd is
> woken up, if a pod's memory usage hasn't exceeded its own memory.low,
> its memory won't be reclaimed.
> 3. When applications start up, rapidly consume memory, or experience
> network traffic bursts, the kernel reaches steal_suitable_fallback(),
> which sets watermark_boost and subsequently wakes kswapd.
> 4. In the core logic of kswapd thread (balance_pgdat()), when reclaim is
> triggered by watermark_boost, the maximum priority is 10. Higher
> priority values mean less aggressive LRU scanning, which can result in
> no pages being reclaimed during a single scan cycle:
>
>	if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
>		raise_priority = false;
>
> 5. This eventually causes pgdat->kswapd_failures to continuously
> accumulate, exceeding MAX_RECLAIM_RETRIES, and consequently kswapd stops
> working. At this point, the system's available memory is still
> significantly above the high watermark — it's inappropriate for kswapd
> to stop under these conditions.
>
> The final observable issue is that a brief period of rapid memory
> allocation causes kswapd to stop running, ultimately triggering direct
> reclaim and making the applications unresponsive.

This logic appears to be at least eight years old.  Can you suggest why
this issue is being observed after so much time?

> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -7128,7 +7128,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
>  		goto restart;
>  	}
>
> -	if (!sc.nr_reclaimed)
> +	/*
> +	 * If the reclaim was boosted, we might still be far from the
> +	 * watermark_high at this point. We need to avoid increasing the
> +	 * failure count to prevent the kswapd thread from stopping.
> +	 */
> +	if (!sc.nr_reclaimed && !boosted)
>  		atomic_inc(&pgdat->kswapd_failures);

Thanks, I'll toss it in for testing and shall await reviewer input
before proceeding further.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  2025-10-24  2:27 [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted Jiayuan Chen
  2025-10-26  4:40 ` Andrew Morton
@ 2025-11-08  1:11 ` Shakeel Butt
  2025-11-12  2:21   ` Jiayuan Chen
                     ` (2 more replies)
  2025-11-13 23:47 ` Shakeel Butt
  2 siblings, 3 replies; 12+ messages in thread
From: Shakeel Butt @ 2025-11-08  1:11 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-kernel

On Fri, Oct 24, 2025 at 10:27:11AM +0800, Jiayuan Chen wrote:
> We encountered a scenario where direct memory reclaim was triggered,
> leading to increased system latency:
>
> 1. The memory.low values set on host pods are actually quite large, some
> pods are set to 10GB, others to 20GB, etc.
> 2. Since most pods have memory protection configured, each time kswapd is
> woken up, if a pod's memory usage hasn't exceeded its own memory.low,
> its memory won't be reclaimed.

Can you share the numa configuration of your system? How many nodes are
there?

> 3. When applications start up, rapidly consume memory, or experience
> network traffic bursts, the kernel reaches steal_suitable_fallback(),
> which sets watermark_boost and subsequently wakes kswapd.
> 4. In the core logic of kswapd thread (balance_pgdat()), when reclaim is
> triggered by watermark_boost, the maximum priority is 10. Higher
> priority values mean less aggressive LRU scanning, which can result in
> no pages being reclaimed during a single scan cycle:
>
>	if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
>		raise_priority = false;

Am I understanding this correctly that watermark boost increases the
chances of this issue but it can still happen without it?

> 5. This eventually causes pgdat->kswapd_failures to continuously
> accumulate, exceeding MAX_RECLAIM_RETRIES, and consequently kswapd stops
> working. At this point, the system's available memory is still
> significantly above the high watermark — it's inappropriate for kswapd
> to stop under these conditions.
>
> The final observable issue is that a brief period of rapid memory
> allocation causes kswapd to stop running, ultimately triggering direct
> reclaim and making the applications unresponsive.
>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
>
> ---
> v1 -> v2: Do not modify memory.low handling
> https://lore.kernel.org/linux-mm/20251014081850.65379-1-jiayuan.chen@linux.dev/
> ---
>  mm/vmscan.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 92f4ca99b73c..fa8663781086 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -7128,7 +7128,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
>  		goto restart;
>  	}
>
> -	if (!sc.nr_reclaimed)
> +	/*
> +	 * If the reclaim was boosted, we might still be far from the
> +	 * watermark_high at this point. We need to avoid increasing the
> +	 * failure count to prevent the kswapd thread from stopping.
> +	 */
> +	if (!sc.nr_reclaimed && !boosted)
>  		atomic_inc(&pgdat->kswapd_failures);

In general I think not incrementing the failure count for a boosted
kswapd iteration is right. If this issue (high protection causing kswapd
failures) happens in the non-boosted case, I am not sure what the right
behavior should be, i.e. allocators doing direct reclaim potentially
below low protection, or allowing kswapd to reclaim below low. For min,
it is very clear that the direct reclaimer has to reclaim as they may
have to trigger oom-kill. For low protection, I am not sure.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  2025-11-08  1:11 ` Shakeel Butt
@ 2025-11-12  2:21   ` Jiayuan Chen
  2025-11-13 23:41     ` Shakeel Butt
  2025-11-13 10:02   ` Michal Hocko
  2026-02-27  2:15   ` Jiayuan Chen
  2 siblings, 1 reply; 12+ messages in thread
From: Jiayuan Chen @ 2025-11-12  2:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: linux-mm, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-kernel

2025/11/8 09:11, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:

> On Fri, Oct 24, 2025 at 10:27:11AM +0800, Jiayuan Chen wrote:

[...]

> Can you share the numa configuration of your system? How many nodes are
> there?

My system has 2 nodes.

[...]

> >	if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
> >		raise_priority = false;
>
> Am I understanding this correctly that watermark boost increase the
> chances of this issue but it can still happen?

Yes. In the case of watermark_boost, because the priority has a lower
limit, the scanning intensity is relatively low, making this issue more
likely to occur even when I haven't configured memory.low.

However, this issue can theoretically happen even without watermark_boost
– for example, if the memory.low values for all pods are set very high.
But I consider that a configuration error (based on the current logic
where kswapd does not attempt to reclaim memory whose usage is below
memory.low, [...]

[...]

> > -	if (!sc.nr_reclaimed)
> > +	/*
> > +	 * If the reclaim was boosted, we might still be far from the
> > +	 * watermark_high at this point. We need to avoid increasing the
> > +	 * failure count to prevent the kswapd thread from stopping.
> > +	 */
> > +	if (!sc.nr_reclaimed && !boosted)
> >  		atomic_inc(&pgdat->kswapd_failures);
>
> In general I think not incrementing the failure for boosted kswapd
> iteration is right.

Thanks. I applied a livepatch, and it indeed prevented the occurrence of
direct memory reclaim.

> If this issue (high protection causing kswap
> failures) happen on non-boosted case, I am not sure what should be right
> behavior i.e. allocators doing direct reclaim potentially below low
> protection or allowing kswapd to reclaim below low. For min, it is very
> clear that direct reclaimer has to reclaim as they may have to trigger
> oom-kill. For low protection, I am not sure.

We have also encountered this issue in non-boosted scenarios. For
instance, when we disabled swap (meaning only file pages are reclaimed,
not anonymous pages), it indeed occurred even without memory.low
configured, especially when anonymous pages constituted the majority.
Another scenario is misconfigured memory.low. However, in our production
environment, the memory.low configurations are generally reasonable – the
sum of all low values is only about half of the system's total memory.

Regarding how to handle memory.low, I believe there is still room for
optimization in kswapd. From an administrator's perspective, we typically
calculate memory.low as a percentage of memory.max (applications often
iterate quickly, and usually no one knows the exact optimal threshold for
low). Furthermore, to make the low protection as effective as possible,
memory.low values tend to be set on the higher side. This inevitably
leads to a significant amount of reclaimable memory not being reclaimed.

In the scenarios I've encountered, memory.low, although intended as a
soft limit, doesn't seem very "soft" in practice. This was also the goal
of the v1 patch, although more refined work might still be needed.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  2025-11-12  2:21 ` Jiayuan Chen
@ 2025-11-13 23:41   ` Shakeel Butt
  0 siblings, 0 replies; 12+ messages in thread
From: Shakeel Butt @ 2025-11-13 23:41 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-kernel

On Wed, Nov 12, 2025 at 02:21:37AM +0000, Jiayuan Chen wrote:
> 2025/11/8 09:11, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:

[...]

> > If this issue (high protection causing kswap
> > failures) happen on non-boosted case, I am not sure what should be right
> > behavior i.e. allocators doing direct reclaim potentially below low
> > protection or allowing kswapd to reclaim below low. For min, it is very
> > clear that direct reclaimer has to reclaim as they may have to trigger
> > oom-kill. For low protection, I am not sure.
>
> We have also encountered this issue in non-boosted scenarios. For
> instance, when we disabled swap (meaning only file pages are reclaimed,
> not anonymous pages), it indeed occurred even without memory.low
> configured, especially when anonymous pages constituted the majority.

Basically whenever the amount of reclaimable memory is low (i.e. in the
range where it is very hard to satisfy the watermarks), this issue can
happen. 'This' as in kswapd failures. This functionality was added to
detect such scenarios where kswapd hogs a CPU doing nothing. The
memory.low adds a twist that we can allow kswapd to go after, but let's
punt on that for now and go after the clear and easy issue, i.e. kswapd
failures due to boosted watermarks.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  2025-11-08  1:11 ` Shakeel Butt
  2025-11-12  2:21   ` Jiayuan Chen
@ 2025-11-13 10:02   ` Michal Hocko
  2025-11-13 19:28     ` Shakeel Butt
  2026-02-27  2:15   ` Jiayuan Chen
  2 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2025-11-13 10:02 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jiayuan Chen, linux-mm, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Qi Zheng, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-kernel

On Fri 07-11-25 17:11:58, Shakeel Butt wrote:
> On Fri, Oct 24, 2025 at 10:27:11AM +0800, Jiayuan Chen wrote:
> > We encountered a scenario where direct memory reclaim was triggered,
> > leading to increased system latency:
> >
> > 1. The memory.low values set on host pods are actually quite large, some
> > pods are set to 10GB, others to 20GB, etc.
> > 2. Since most pods have memory protection configured, each time kswapd is
> > woken up, if a pod's memory usage hasn't exceeded its own memory.low,
> > its memory won't be reclaimed.
>
> Can you share the numa configuration of your system? How many nodes are
> there?
>
> > 3. When applications start up, rapidly consume memory, or experience
> > network traffic bursts, the kernel reaches steal_suitable_fallback(),
> > which sets watermark_boost and subsequently wakes kswapd.
> > 4. In the core logic of kswapd thread (balance_pgdat()), when reclaim is
> > triggered by watermark_boost, the maximum priority is 10. Higher
> > priority values mean less aggressive LRU scanning, which can result in
> > no pages being reclaimed during a single scan cycle:
> >
> >	if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
> >		raise_priority = false;
>
> Am I understanding this correctly that watermark boost increase the
> chances of this issue but it can still happen?
>
> > 5. This eventually causes pgdat->kswapd_failures to continuously
> > accumulate, exceeding MAX_RECLAIM_RETRIES, and consequently kswapd stops
> > working. At this point, the system's available memory is still
> > significantly above the high watermark — it's inappropriate for kswapd
> > to stop under these conditions.
> >
> > The final observable issue is that a brief period of rapid memory
> > allocation causes kswapd to stop running, ultimately triggering direct
> > reclaim and making the applications unresponsive.
> >
> > Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> >
> > ---
> > v1 -> v2: Do not modify memory.low handling
> > https://lore.kernel.org/linux-mm/20251014081850.65379-1-jiayuan.chen@linux.dev/
> > ---
> >  mm/vmscan.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 92f4ca99b73c..fa8663781086 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -7128,7 +7128,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> >  		goto restart;
> >  	}
> >
> > -	if (!sc.nr_reclaimed)
> > +	/*
> > +	 * If the reclaim was boosted, we might still be far from the
> > +	 * watermark_high at this point. We need to avoid increasing the
> > +	 * failure count to prevent the kswapd thread from stopping.
> > +	 */
> > +	if (!sc.nr_reclaimed && !boosted)
> >  		atomic_inc(&pgdat->kswapd_failures);
>
> In general I think not incrementing the failure for boosted kswapd
> iteration is right. If this issue (high protection causing kswap
> failures) happen on non-boosted case, I am not sure what should be right
> behavior i.e. allocators doing direct reclaim potentially below low
> protection or allowing kswapd to reclaim below low. For min, it is very
> clear that direct reclaimer has to reclaim as they may have to trigger
> oom-kill. For low protection, I am not sure.

Our current documentation gives us some room for interpretation. I am
wondering whether we need to change the existing implementation though.
If kswapd is not able to make progress then we surely have direct
reclaim happening. So I would only change this if we had examples of
properly/sensibly configured systems where a kswapd low-limit breach
could help to reduce stalls (improve performance) while the end result
in terms of the amount of reclaimed memory would be the same/very
similar.

This specific report is an example where boosting was not low-limit
aware, and I agree that not accounting kswapd_failures for boosted runs
is a reasonable thing to do. I am not yet sure this is a complete fix
but it is certainly a good direction.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  2025-11-13 10:02 ` Michal Hocko
@ 2025-11-13 19:28   ` Shakeel Butt
  2025-11-14  2:23     ` Jiayuan Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Shakeel Butt @ 2025-11-13 19:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jiayuan Chen, linux-mm, Andrew Morton, Johannes Weiner,
	David Hildenbrand, Qi Zheng, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-kernel

On Thu, Nov 13, 2025 at 11:02:41AM +0100, Michal Hocko wrote:
> >
> > In general I think not incrementing the failure for boosted kswapd
> > iteration is right. If this issue (high protection causing kswap
> > failures) happen on non-boosted case, I am not sure what should be right
> > behavior i.e. allocators doing direct reclaim potentially below low
> > protection or allowing kswapd to reclaim below low. For min, it is very
> > clear that direct reclaimer has to reclaim as they may have to trigger
> > oom-kill. For low protection, I am not sure.
>
> Our current documention gives us some room for interpretation. I am
> wondering whether we need to change the existing implemnetation though.
> If kswapd is not able to make progress then we surely have direct
> reclaim happening. So I would only change this if we had examples of
> properly/sensibly configured systems where kswapd low limit breach could
> help to reuduce stalls (improve performance) while the end result from
> the amount of reclaimed memory would be same/very similar.

Yes, I think any change here will need much more brainstorming and
experimentation. There are definitely corner cases where the right
solution might not be in the kernel. One such case I was thinking about
is an unbalanced (memory) numa node, where I don't think the kswapd of
that node should do anything because of the disconnect between numa
memory usage and memcg limits. In such cases either numa balancing or
the promotion/demotion systems under discussion would be more
appropriate. Anyways this is orthogonal.

> This specific report is an example where boosting was not low limit
> aware and I agree that not accounting kswapd_failures for boosted runs
> is reasonable thing to do. I am not yet sure this is a complete fix but
> it is certainly a good direction.

Yes, I think we should move forward with this and keep an eye on whether
this situation occurs in non-boosted environments.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  2025-11-13 19:28 ` Shakeel Butt
@ 2025-11-14  2:23   ` Jiayuan Chen
  0 siblings, 0 replies; 12+ messages in thread
From: Jiayuan Chen @ 2025-11-14  2:23 UTC (permalink / raw)
  To: Shakeel Butt, Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Qi Zheng, Lorenzo Stoakes, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	linux-kernel

2025/11/14 03:28, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:

> On Thu, Nov 13, 2025 at 11:02:41AM +0100, Michal Hocko wrote:
> > >
> > > In general I think not incrementing the failure for boosted kswapd
> > > iteration is right. If this issue (high protection causing kswap
> > > failures) happen on non-boosted case, I am not sure what should be right
> > > behavior i.e. allocators doing direct reclaim potentially below low
> > > protection or allowing kswapd to reclaim below low. For min, it is very
> > > clear that direct reclaimer has to reclaim as they may have to trigger
> > > oom-kill. For low protection, I am not sure.
> >
> > Our current documention gives us some room for interpretation. I am
> > wondering whether we need to change the existing implemnetation though.
> > If kswapd is not able to make progress then we surely have direct
> > reclaim happening. So I would only change this if we had examples of
> > properly/sensibly configured systems where kswapd low limit breach could
> > help to reuduce stalls (improve performance) while the end result from
> > the amount of reclaimed memory would be same/very similar.
>
> Yes, I think any change here will need much more brainstorming and
> experimentation. There are definitely corner cases which the right
> solution might not be in kernel. One such case I was thinking about is
> unbalanced (memory) numa node where I don't think kswapd of that node
> should do anything because of the disconnect between numa memory usage
> and memcg limits. On such cases either numa balancing or
> promotion/demotion systems under discussion would be more appropriate.
> Anyways this is orthogonal.

Can I ask for a link or some keywords to search the mailing list
regarding the NUMA imbalance you mentioned? I'm not sure if it's similar
to a problem I encountered before.

We have a system with 2 nodes and swap is disabled. After running for a
while, we found that anonymous pages occupied over 99% of one node. When
kswapd on that node runs, it continuously tries to reclaim the 1% file
pages. However, these file pages are mostly code pages and are hot,
leading to frenzied refaults, which eventually causes sustained high
read I/O load on the disk.

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  2025-11-08  1:11 ` Shakeel Butt
  2025-11-12  2:21   ` Jiayuan Chen
  2025-11-13 10:02   ` Michal Hocko
@ 2026-02-27  2:15   ` Jiayuan Chen
  2 siblings, 0 replies; 12+ messages in thread
From: Jiayuan Chen @ 2026-02-27  2:15 UTC (permalink / raw)
  To: Shakeel Butt, mhocko, hannes, akpm
  Cc: linux-mm, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-kernel, chunguang.xu

On Fri, Nov 07, 2025 at 05:11:58PM +0800, Shakeel Butt wrote:
> On Fri, Oct 24, 2025 at 10:27:11AM +0800, Jiayuan Chen wrote:
> > We encountered a scenario where direct memory reclaim was triggered,
> > leading to increased system latency:
> >
> > 1. The memory.low values set on host pods are actually quite large, some
> > pods are set to 10GB, others to 20GB, etc.
> > 2. Since most pods have memory protection configured, each time kswapd is
> > woken up, if a pod's memory usage hasn't exceeded its own memory.low,
> > its memory won't be reclaimed.
>
> Can you share the numa configuration of your system? How many nodes are
> there?
>
> > 3. When applications start up, rapidly consume memory, or experience
> > network traffic bursts, the kernel reaches steal_suitable_fallback(),
> > which sets watermark_boost and subsequently wakes kswapd.
> > 4. In the core logic of kswapd thread (balance_pgdat()), when reclaim is
> > triggered by watermark_boost, the maximum priority is 10. Higher
> > priority values mean less aggressive LRU scanning, which can result in
> > no pages being reclaimed during a single scan cycle:
> >
> >	if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
> >		raise_priority = false;
>
> Am I understanding this correctly that watermark boost increase the
> chances of this issue but it can still happen?
>
> > 5. This eventually causes pgdat->kswapd_failures to continuously
> > accumulate, exceeding MAX_RECLAIM_RETRIES, and consequently kswapd stops
> > working. At this point, the system's available memory is still
> > significantly above the high watermark — it's inappropriate for kswapd
> > to stop under these conditions.
> >
> > The final observable issue is that a brief period of rapid memory
> > allocation causes kswapd to stop running, ultimately triggering direct
> > reclaim and making the applications unresponsive.
> >
> > Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> >
> > ---
> > v1 -> v2: Do not modify memory.low handling
> > https://lore.kernel.org/linux-mm/20251014081850.65379-1-jiayuan.chen@linux.dev/
> > ---
> >  mm/vmscan.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 92f4ca99b73c..fa8663781086 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -7128,7 +7128,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> >  		goto restart;
> >  	}
> >
> > -	if (!sc.nr_reclaimed)
> > +	/*
> > +	 * If the reclaim was boosted, we might still be far from the
> > +	 * watermark_high at this point. We need to avoid increasing the
> > +	 * failure count to prevent the kswapd thread from stopping.
> > +	 */
> > +	if (!sc.nr_reclaimed && !boosted)
> >  		atomic_inc(&pgdat->kswapd_failures);
>
> In general I think not incrementing the failure for boosted kswapd
> iteration is right. If this issue (high protection causing kswap
> failures) happen on non-boosted case, I am not sure what should be right
> behavior i.e. allocators doing direct reclaim potentially below low
> protection or allowing kswapd to reclaim below low. For min, it is very
> clear that direct reclaimer has to reclaim as they may have to trigger
> oom-kill. For low protection, I am not sure.

Hi all,

Sorry to bring this up late, but I've been thinking about a potential
corner case with this patch and would appreciate some input.

Since steal_suitable_fallback() triggers boost_watermark() whenever
pages are stolen across migrate types, and this patch prevents
kswapd_failures from incrementing during boosted reclaim, I'm wondering
if there's a theoretical scenario where kswapd could end up running
continuously.

For example, if UNMOVABLE and MOVABLE allocations are competing for
memory over a sustained period, the repeated cross-migratetype stealing
would keep boosting watermarks and waking kswapd. If kswapd can't
actually reclaim anything in this situation, it would never hit
MAX_RECLAIM_RETRIES and just keep spinning.

Two questions:

1. Has anyone seen this kind of sustained migratetype stealing in
practice? I'm not sure how realistic this scenario is.

2. If it does happen, waking kswapd for boost reclaim itself makes total
sense - reclaiming order-0 pages to reduce fragmentation is the right
thing to do. But a busy-looping kswapd that can't actually reclaim
anything would still burn CPU cycles, and keep grabbing lruvec->lru_lock
and zone->lock on every pass through shrink_node(), which could hurt
page fault and allocation latency for other threads. Would it be worth
adding some backoff mechanism for boosted reclaim failures?

Just want to understand if this is something worth worrying about or
purely theoretical.

Thanks,

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  2025-10-24  2:27 [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted Jiayuan Chen
  2025-10-26  4:40 ` Andrew Morton
  2025-11-08  1:11 ` Shakeel Butt
@ 2025-11-13 23:47 ` Shakeel Butt
  2025-11-14  4:17   ` Jiayuan Chen
  2 siblings, 1 reply; 12+ messages in thread
From: Shakeel Butt @ 2025-11-13 23:47 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: linux-mm, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-kernel

On Fri, Oct 24, 2025 at 10:27:11AM +0800, Jiayuan Chen wrote:
> We encountered a scenario where direct memory reclaim was triggered,
> leading to increased system latency:
>
> 1. The memory.low values set on host pods are actually quite large, some
> pods are set to 10GB, others to 20GB, etc.
> 2. Since most pods have memory protection configured, each time kswapd is
> woken up, if a pod's memory usage hasn't exceeded its own memory.low,
> its memory won't be reclaimed.
> 3. When applications start up, rapidly consume memory, or experience
> network traffic bursts, the kernel reaches steal_suitable_fallback(),
> which sets watermark_boost and subsequently wakes kswapd.
> 4. In the core logic of kswapd thread (balance_pgdat()), when reclaim is
> triggered by watermark_boost, the maximum priority is 10. Higher
> priority values mean less aggressive LRU scanning, which can result in
> no pages being reclaimed during a single scan cycle:
>
>	if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
>		raise_priority = false;
>
> 5. This eventually causes pgdat->kswapd_failures to continuously
> accumulate, exceeding MAX_RECLAIM_RETRIES, and consequently kswapd stops
> working. At this point, the system's available memory is still
> significantly above the high watermark — it's inappropriate for kswapd
> to stop under these conditions.
>
> The final observable issue is that a brief period of rapid memory
> allocation causes kswapd to stop running, ultimately triggering direct
> reclaim and making the applications unresponsive.
>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

Please resolve Andrew's comment, and add a couple of lines to the commit
message on the boosted watermark increasing the chances of kswapd
failures and the patch only targeting that particular scenario, with the
general solution TBD.

With that, you can add:

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  2025-11-13 23:47 ` Shakeel Butt
@ 2025-11-14  4:17   ` Jiayuan Chen
  2025-11-15  0:40     ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Jiayuan Chen @ 2025-11-14  4:17 UTC (permalink / raw)
To: Shakeel Butt, Andrew Morton
Cc: linux-mm, Andrew Morton, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-kernel

November 14, 2025 at 07:47, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:
[...]
> > The final observable issue is that a brief period of rapid memory
> > allocation causes kswapd to stop running, ultimately triggering direct
> > reclaim and making the applications unresponsive.
> >
> > Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
>
> Please resolve Andrew's comment and add a couple of lines on the boosted
> watermark increasing the chances of kswapd failures, the patch only
> targeting that particular scenario, and the general solution being TBD
> in the commit message.
>
> With that, you can add:
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>

I see this patch is already in mm-next, so I'm not sure how to proceed.
Perhaps Andrew needs to do a git rebase and then reword the commit
message? But regardless, I'll reword the commit message here; please let
me know how to proceed if possible:

'''
mm/vmscan: skip increasing kswapd_failures when reclaim was boosted

We have a colocation cluster used for deploying both offline and online
services simultaneously. In this environment, we encountered a scenario
where direct memory reclamation was triggered because kswapd was not
running.

1. When applications start up, rapidly consume memory, or experience
   network traffic bursts, the kernel reaches steal_suitable_fallback(),
   which sets watermark_boost and subsequently wakes kswapd.

2. In the core logic of the kswapd thread (balance_pgdat()), when reclaim
   is triggered by watermark_boost, the maximum priority is 10. Higher
   priority values mean less aggressive LRU scanning, which can result in
   no pages being reclaimed during a single scan cycle:

	if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
		raise_priority = false;

3. Additionally, many of our pods are configured with memory.low, which
   prevents memory reclamation in certain cgroups, further increasing the
   chance of failing to reclaim memory.

4. This eventually causes pgdat->kswapd_failures to continuously
   accumulate, exceeding MAX_RECLAIM_RETRIES, and consequently kswapd
   stops working. At this point, the system's available memory is still
   significantly above the high watermark — it's inappropriate for kswapd
   to stop under these conditions.

The final observable issue is that a brief period of rapid memory
allocation causes kswapd to stop running, ultimately triggering direct
reclaim and making the applications unresponsive.

This problem leading to direct memory reclamation has been a
long-standing issue in our production environment. We initially held the
simple assumption that it was caused by applications allocating memory
too rapidly for kswapd to keep up with reclamation. However, after we
began monitoring kswapd's runtime behavior, we discovered a different
pattern: kswapd initially exhibits very aggressive activity even when
there is still considerable free memory, but it subsequently stops
running entirely, even as memory levels approach the low watermark.

In summary, both boosted watermarks and memory.low increase the
probability of kswapd operation failures. This patch specifically
addresses the scenario involving boosted watermarks by not incrementing
kswapd_failures when reclamation fails. A more general solution,
potentially addressing memory.low or other cases, requires further
discussion.
Link: https://lkml.kernel.org/r/20251024022711.382238-1-jiayuan.chen@linux.dev
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
'''

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  2025-11-14  4:17 ` Jiayuan Chen
@ 2025-11-15  0:40   ` Andrew Morton
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2025-11-15  0:40 UTC (permalink / raw)
To: Jiayuan Chen
Cc: Shakeel Butt, linux-mm, Johannes Weiner, David Hildenbrand,
	Michal Hocko, Qi Zheng, Lorenzo Stoakes, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, linux-kernel

On Fri, 14 Nov 2025 04:17:40 +0000 "Jiayuan Chen" <jiayuan.chen@linux.dev> wrote:

> > > Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> >
> > Please resolve Andrew's comment and add a couple of lines on the boosted
> > watermark increasing the chances of kswapd failures, the patch only
> > targeting that particular scenario, and the general solution being TBD
> > in the commit message.
> >
> > With that, you can add:
> >
> > Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>
> I see this patch is already in mm-next, so I'm not sure how to proceed.
> Perhaps Andrew needs to do a git rebase and then reword the commit message?

A rebase would be needed if the patch had been placed in mm.git's
mm-stable branch. But it's still in the mm-unstable branch, where patches
are kept in quilt form and are imported into git for each mm.git release.

> But regardless, I'll reword the commit message here; please let me know
> how to proceed if possible:

Which is why I do things this way. Easy peasy, edited, thanks.


From: Jiayuan Chen <jiayuan.chen@linux.dev>
Subject: mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
Date: Fri, 24 Oct 2025 10:27:11 +0800

We have a colocation cluster used for deploying both offline and online
services simultaneously. In this environment, we encountered a scenario
where direct memory reclamation was triggered due to kswapd not running.

1. When applications start up, rapidly consume memory, or experience
   network traffic bursts, the kernel reaches steal_suitable_fallback(),
   which sets watermark_boost and subsequently wakes kswapd.

2. In the core logic of the kswapd thread (balance_pgdat()), when reclaim
   is triggered by watermark_boost, the maximum priority is 10. Higher
   priority values mean less aggressive LRU scanning, which can result in
   no pages being reclaimed during a single scan cycle:

	if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
		raise_priority = false;

3. Additionally, many of our pods are configured with memory.low, which
   prevents memory reclamation in certain cgroups, further increasing the
   chance of failing to reclaim memory.

4. This eventually causes pgdat->kswapd_failures to continuously
   accumulate, exceeding MAX_RECLAIM_RETRIES, and consequently kswapd
   stops working. At this point, the system's available memory is still
   significantly above the high watermark — it's inappropriate for kswapd
   to stop under these conditions.

The final observable issue is that a brief period of rapid memory
allocation causes kswapd to stop running, ultimately triggering direct
reclaim and making the applications unresponsive.

This problem leading to direct memory reclamation has been a
long-standing issue in our production environment. We initially held the
simple assumption that it was caused by applications allocating memory
too rapidly for kswapd to keep up with reclamation. However, after we
began monitoring kswapd's runtime behavior, we discovered a different
pattern: kswapd initially exhibits very aggressive activity even when
there is still considerable free memory, but it subsequently stops
running entirely, even as memory levels approach the low watermark.

In summary, both boosted watermarks and memory.low increase the
probability of kswapd operation failures.
This patch specifically addresses the scenario involving boosted
watermarks by not incrementing kswapd_failures when reclamation fails. A
more general solution, potentially addressing memory.low or other cases,
requires further discussion.

Link: https://lkml.kernel.org/r/53de0b3ee0b822418e909db29bfa6513faff9d36@linux.dev
Link: https://lkml.kernel.org/r/20251024022711.382238-1-jiayuan.chen@linux.dev
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- a/mm/vmscan.c~mm-vmscan-skip-increasing-kswapd_failures-when-reclaim-was-boosted
+++ a/mm/vmscan.c
@@ -7127,7 +7127,12 @@ restart:
 		goto restart;
 	}
 
-	if (!sc.nr_reclaimed)
+	/*
+	 * If the reclaim was boosted, we might still be far from the
+	 * watermark_high at this point. We need to avoid increasing the
+	 * failure count to prevent the kswapd thread from stopping.
+	 */
+	if (!sc.nr_reclaimed && !boosted)
 		atomic_inc(&pgdat->kswapd_failures);
 out:
_

^ permalink raw reply	[flat|nested] 12+ messages in thread
end of thread, other threads: [~2026-02-27  2:15 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2025-10-24  2:27 [PATCH v2] mm/vmscan: skip increasing kswapd_failures when reclaim was boosted Jiayuan Chen
2025-10-26  4:40 ` Andrew Morton
2025-11-08  1:11 ` Shakeel Butt
2025-11-12  2:21   ` Jiayuan Chen
2025-11-13 23:41     ` Shakeel Butt
2025-11-13 10:02   ` Michal Hocko
2025-11-13 19:28     ` Shakeel Butt
2025-11-14  2:23       ` Jiayuan Chen
2026-02-27  2:15         ` Jiayuan Chen
2025-11-13 23:47 ` Shakeel Butt
2025-11-14  4:17   ` Jiayuan Chen
2025-11-15  0:40     ` Andrew Morton