[PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX

Linux-mm Archive on lore.kernel.org

* [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
@ 2026-05-19 20:08 JP Kobryn (Meta)
  2026-05-25 10:02 ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 3+ messages in thread
From: JP Kobryn (Meta) @ 2026-05-19 20:08 UTC
  To: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm
  Cc: linux-kernel, kernel-team

compact_gap() returns 2 << order, which is used as watermark headroom in
__compaction_suitable() and as a reclaim target in kswapd. The computed
value scales exponentially by order. For order-9 THP allocations this
evaluates to 1024 pages, but the compaction free scanner's working set is
bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free
pages once it matches the migration batch. The current gap over-reserves by
32x.

On fragmented production hosts, kswapd will try to reclaim up to the gap,
but it only reaches that threshold 18% of the time, causing reclaim to
continue a majority of the time. The over-sized gap also causes 46% of
order-9 compaction suitability checks to fail unnecessarily - the zone has
sufficient free pages for the scanner to operate, but not enough to clear
the inflated threshold.

Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
with the scanner's actual capacity. Orders 0-4 are unaffected since their
gap is <= 32.
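
The arithmetic above can be checked with a minimal userspace sketch. It is not
kernel code: gap_uncapped(), gap_capped() and COMPACT_CLUSTER_MAX_PAGES are
illustrative names, and the value 32 is taken from the changelog text.

```c
#include <assert.h>

/* Assumption: stands in for the kernel's COMPACT_CLUSTER_MAX (32 pages). */
#define COMPACT_CLUSTER_MAX_PAGES 32UL

/* Current formula: the gap doubles with every order. */
static unsigned long gap_uncapped(unsigned int order)
{
	/* Keep the shift within the width of unsigned long. */
	assert(order < 8 * sizeof(unsigned long) - 1);
	return 2UL << order;
}

/* Proposed formula: same value, clamped to the scanner batch size. */
static unsigned long gap_capped(unsigned int order)
{
	unsigned long gap = gap_uncapped(order);

	return gap < COMPACT_CLUSTER_MAX_PAGES ? gap : COMPACT_CLUSTER_MAX_PAGES;
}
```

With these helpers, order 9 yields 1024 pages uncapped and 32 pages capped,
while order 4 yields 32 pages either way, so only orders 5 and above change.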

A/B test on ~100 Instagram production hosts (64GB, 60s measurement):

Unpatched (43 hosts)
pgscan_kswapd (mean/host): ~1.6M
reclaim efficiency (steal/scan): 83.8%
compaction success (success/stall): 2.1%
THP success (alloc/alloc+fallback): 4.9%
forced lru_add_drain (mean/host): ~107K

Patched (59 hosts)
pgscan_kswapd (mean/host): ~449K
reclaim efficiency (steal/scan): 91.0%
compaction success (success/stall): 28.3%
THP success (alloc/alloc+fallback): 17.2%
forced lru_add_drain (mean/host): ~64K

Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
---
 include/linux/compaction.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 173d9c07a8952..09aea63b8a89d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H

+#include <linux/swap.h>
+
 /*
  * Determines how hard direct compaction should try to succeed.
  * Lower value means higher priority, analogically to reclaim priority.
@@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
 	 * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
 	 * that the migrate scanner can have isolated on migrate list, and free
 	 * scanner is only invoked when the number of isolated free pages is
-	 * lower than that. But it's not worth to complicate the formula here
-	 * as a bigger gap for higher orders than strictly necessary can also
-	 * improve chances of compaction success.
+	 * lower than that.
 	 */
-	return 2UL << order;
+	return min(2UL << order, COMPACT_CLUSTER_MAX);
 }

 static inline int current_is_kcompactd(void)
-- 
2.53.0-Meta


* Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
  2026-05-19 20:08 [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX JP Kobryn (Meta)
@ 2026-05-25 10:02 ` Vlastimil Babka (SUSE)
  2026-05-27  0:10   ` JP Kobryn
  0 siblings, 1 reply; 3+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-25 10:02 UTC
  To: JP Kobryn (Meta), akpm, surenb, mhocko, jackmanb, hannes, ziy,
	linux-mm
  Cc: linux-kernel, kernel-team

On 5/19/26 22:08, JP Kobryn (Meta) wrote:
> compact_gap() returns 2 << order, which is used as watermark headroom in
> __compaction_suitable() and as a reclaim target in kswapd. The computed
> value scales exponentially by order. For order-9 THP allocations this
> evaluates to 1024 pages, but the compaction free scanner's working set is
> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free
> pages once it matches the migration batch. The current gap over-reserves by
> 32x.
> 
> On fragmented production hosts, kswapd will try and reclaim up to the gap,
> but it only reaches that threshold 18% of the time, causing reclaim to
> continue a majority of the time.

But doesn't that mean there's genuine memory pressure? We're effectively
raising the high watermark by 4 MB, but if processes are continuously
allocating, we'd be reclaiming without the gap as well? Unless the workload
is sized to fit without the gap.
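
For reference, the 4 MB figure follows from the order-9 gap if base pages are
4 KiB. The page size is an assumption here, since the thread does not state
it, and gap_bytes() is an illustrative helper rather than a kernel function.

```c
#include <assert.h>

/* Assumption: 4 KiB base pages, as on typical x86-64 configurations. */
#define ASSUMED_PAGE_SIZE 4096UL

/* Convert a gap expressed in pages to bytes. */
static unsigned long gap_bytes(unsigned long gap_pages)
{
	/* Guard against overflow in the multiplication. */
	assert(gap_pages <= (unsigned long)-1 / ASSUMED_PAGE_SIZE);
	return gap_pages * ASSUMED_PAGE_SIZE;
}
```

Under that assumption, gap_bytes(2UL << 9) is 4 MiB for the uncapped order-9
gap, and the capped gap of 32 pages corresponds to 128 KiB.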

> The over-sized gap also causes 46% of
> order-9 compaction suitability checks to fail unnecessarily - the zone has
> sufficient free pages for the scanner to operate, but not enough to clear
> the inflated threshold.
> 
> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
> with the scanner's actual capacity. Orders 0-4 are unaffected since their
> gap is <= 32.
> 
> A/B test on ~100 instagram production hosts (64GB, 60s measurement):

What was the base kernel version?

> Unpatched (43 hosts)
> pgscan_kswapd (mean/host): ~1.6M
> reclaim efficiency (steal/scan): 83.8%
> compaction success (success/stall): 2.1%
> THP success (alloc/alloc+fallback): 4.9%
> forced lru_add_drain (mean/host): ~107K
> 
> Patched (59 hosts)
> pgscan_kswapd (mean/host): ~449K

Did the extra reclaim just disappear because we allow the allocations to use
4MB more memory? Or it shifted to direct reclaim?

> reclaim efficiency (steal/scan): 91.0%
> compaction success (success/stall): 28.3%

Is this compaction success per compaction stall or per alloc stall?

> THP success (alloc/alloc+fallback): 17.2%

Weird that things would improve that much. I would expect the free memory
just to stabilize around the lower gap but then behave similarly. Are we
missing something here?

> forced lru_add_drain (mean/host): ~64K
> 
> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
> ---
>  include/linux/compaction.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 173d9c07a8952..09aea63b8a89d 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -2,6 +2,8 @@
>  #ifndef _LINUX_COMPACTION_H
>  #define _LINUX_COMPACTION_H
>  
> +#include <linux/swap.h>
> +
>  /*
>   * Determines how hard direct compaction should try to succeed.
>   * Lower value means higher priority, analogically to reclaim priority.
> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
>  	 * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>  	 * that the migrate scanner can have isolated on migrate list, and free
>  	 * scanner is only invoked when the number of isolated free pages is
> -	 * lower than that. But it's not worth to complicate the formula here
> -	 * as a bigger gap for higher orders than strictly necessary can also
> -	 * improve chances of compaction success.
> +	 * lower than that.
>  	 */
> -	return 2UL << order;
> +	return min(2UL << order, COMPACT_CLUSTER_MAX);

Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?

>  }
>  
>  static inline int current_is_kcompactd(void)




* Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
  2026-05-25 10:02 ` Vlastimil Babka (SUSE)
@ 2026-05-27  0:10   ` JP Kobryn
  0 siblings, 0 replies; 3+ messages in thread
From: JP Kobryn @ 2026-05-27  0:10 UTC
  To: Vlastimil Babka (SUSE), akpm, surenb, mhocko, jackmanb, hannes,
	ziy, linux-mm
  Cc: linux-kernel, kernel-team

On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>> compact_gap() returns 2 << order, which is used as watermark headroom in
>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>> value scales exponentially by order. For order-9 THP allocations this
>> evaluates to 1024 pages, but the compaction free scanner's working set is
>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free
>> pages once it matches the migration batch. The current gap over-reserves by
>> 32x.
>>
>> On fragmented production hosts, kswapd will try and reclaim up to the gap,
>> but it only reaches that threshold 18% of the time, causing reclaim to
>> continue a majority of the time.
> But doesn't that mean there's genuine memory pressure? We're effectively
> raising the high watermark by 4 MB, but if processes are continuously
> allocating, we'd be reclaiming without the gap as well? Unless the workload
> is sized to fit without the gap.

It wasn't actual memory pressure; repeated order-9 THP allocation failures
were waking up kswapd. I should make this clearer in the changelog. When I
looked into why so much reclaim was occurring, the compact gap stood out
because it dictates the reclaim target.

>> The over-sized gap also causes 46% of
>> order-9 compaction suitability checks to fail unnecessarily - the zone has
>> sufficient free pages for the scanner to operate, but not enough to clear
>> the inflated threshold.
>>
>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>> gap is <= 32.
>>
>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
> What was the base kernel version?

6.13. Additional benchmarks were done using a recent mm-new build as well,
and they showed similar reductions in reclaim.

>> Unpatched (43 hosts)
>> pgscan_kswapd (mean/host): ~1.6M
>> reclaim efficiency (steal/scan): 83.8%
>> compaction success (success/stall): 2.1%
>> THP success (alloc/alloc+fallback): 4.9%
>> forced lru_add_drain (mean/host): ~107K
>>
>> Patched (59 hosts)
>> pgscan_kswapd (mean/host): ~449K
> Did the extra reclaim just disappear because we allow the allocations to use
> 4MB more memory? Or it shifted to direct reclaim?

Specifically in the order-9 case, the reclaim target goes from 1024 pages to
32. The data shows that capping the gap lets compaction take over sooner and
start producing the large pages needed for THP, whereas before the patch,
trying to reclaim the full 2x THP size delays compaction.

>> reclaim efficiency (steal/scan): 91.0%
>> compaction success (success/stall): 28.3%
> Is this compaction success per compaction stall or per alloc stall?

That's per compaction stall.

>> THP success (alloc/alloc+fallback): 17.2%
> Weird that things would improve that much. I would expect the free memory
> just to stabilize around the lower gap but then behave similarly. Are we
> missing something here?

This patch was tested in isolation, but at the same time bursty network
allocations were reserving many pageblocks as high atomic. So as THP-sized
pages become eligible, their blocks get reserved before they can be
allocated as THP.

>> forced lru_add_drain (mean/host): ~64K
>>
>> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
>> ---
>>  include/linux/compaction.h | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> index 173d9c07a8952..09aea63b8a89d 100644
>> --- a/include/linux/compaction.h
>> +++ b/include/linux/compaction.h
>> @@ -2,6 +2,8 @@
>>  #ifndef _LINUX_COMPACTION_H
>>  #define _LINUX_COMPACTION_H
>>
>> +#include <linux/swap.h>
>> +
>>  /*
>>   * Determines how hard direct compaction should try to succeed.
>>   * Lower value means higher priority, analogically to reclaim priority.
>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
>>  	 * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>  	 * that the migrate scanner can have isolated on migrate list, and free
>>  	 * scanner is only invoked when the number of isolated free pages is
>> -	 * lower than that. But it's not worth to complicate the formula here
>> -	 * as a bigger gap for higher orders than strictly necessary can also
>> -	 * improve chances of compaction success.
>> +	 * lower than that.
>>  	 */
>> -	return 2UL << order;
>> +	return min(2UL << order, COMPACT_CLUSTER_MAX);
> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?

I'm thinking I could reframe this patch as reclaim-focused and use
min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
either leaving the other non-reclaim users of this function alone or
using the 2x form you suggest above. That is, I could split out a separate
reclaim_compact_gap() that uses the originally proposed cap.
Thoughts?
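
A rough userspace sketch of the split floated above, to make the two options
concrete. This is not a posted patch: CLUSTER_MAX stands in for
COMPACT_CLUSTER_MAX, min_ul() replaces the kernel's min(), and the names
compact_gap_unchanged() and compact_gap_2x() are hypothetical. Only
reclaim_compact_gap() is a name taken from the discussion.

```c
#include <assert.h>

/* Assumption: stands in for COMPACT_CLUSTER_MAX (32 pages). */
#define CLUSTER_MAX 32UL

/* Userspace replacement for the kernel's min() macro. */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Reclaim-only target using the originally proposed cap. */
static unsigned long reclaim_compact_gap(unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long) - 1);
	return min_ul(2UL << order, CLUSTER_MAX);
}

/* Option A for non-reclaim users: leave the existing formula alone. */
static unsigned long compact_gap_unchanged(unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long) - 1);
	return 2UL << order;
}

/* Option B for non-reclaim users: cap at 2x the cluster size instead. */
static unsigned long compact_gap_2x(unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long) - 1);
	return min_ul(2UL << order, 2 * CLUSTER_MAX);
}
```

For order 9 this gives a reclaim target of 32 pages, with the suitability
headroom at either 1024 pages (option A) or 64 pages (option B).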


