Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
@ 2026-05-19 20:08 JP Kobryn (Meta)
  2026-05-25 10:02 ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 6+ messages in thread
From: JP Kobryn (Meta) @ 2026-05-19 20:08 UTC (permalink / raw)
  To: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm
  Cc: linux-kernel, kernel-team

compact_gap() returns 2 << order, which is used as watermark headroom in
__compaction_suitable() and as a reclaim target in kswapd. The computed
value scales exponentially by order. For order-9 THP allocations this
evaluates to 1024 pages, but the compaction free scanner's working set is
bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free
pages once it matches the migration batch. The current gap over-reserves by
32x.

On fragmented production hosts, kswapd will try and reclaim up to the gap,
but it only reaches that threshold 18% of the time, causing reclaim to
continue a majority of the time. The over-sized gap also causes 46% of
order-9 compaction suitability checks to fail unnecessarily - the zone has
sufficient free pages for the scanner to operate, but not enough to clear
the inflated threshold.

Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
with the scanner's actual capacity. Orders 0-4 are unaffected since their
gap is <= 32.

A/B test on ~100 instagram production hosts (64GB, 60s measurement):

Unpatched (43 hosts)
pgscan_kswapd (mean/host): ~1.6M
reclaim efficiency (steal/scan): 83.8%
compaction success (success/stall): 2.1%
THP success (alloc/alloc+fallback): 4.9%
forced lru_add_drain (mean/host): ~107K

Patched (59 hosts)
pgscan_kswapd (mean/host): ~449K
reclaim efficiency (steal/scan): 91.0%
compaction success (success/stall): 28.3%
THP success (alloc/alloc+fallback): 17.2%
forced lru_add_drain (mean/host): ~64K

Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
---
 include/linux/compaction.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 173d9c07a8952..09aea63b8a89d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
+#include <linux/swap.h>
+
 /*
  * Determines how hard direct compaction should try to succeed.
  * Lower value means higher priority, analogically to reclaim priority.
@@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
 	 * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
 	 * that the migrate scanner can have isolated on migrate list, and free
 	 * scanner is only invoked when the number of isolated free pages is
-	 * lower than that. But it's not worth to complicate the formula here
-	 * as a bigger gap for higher orders than strictly necessary can also
-	 * improve chances of compaction success.
+	 * lower than that.
 	 */
-	return 2UL << order;
+	return min(2UL << order, COMPACT_CLUSTER_MAX);
 }
 
 static inline int current_is_kcompactd(void)
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
  2026-05-19 20:08 [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX JP Kobryn (Meta)
@ 2026-05-25 10:02 ` Vlastimil Babka (SUSE)
  2026-05-27  0:10   ` JP Kobryn
  0 siblings, 1 reply; 6+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-25 10:02 UTC (permalink / raw)
  To: JP Kobryn (Meta), akpm, surenb, mhocko, jackmanb, hannes, ziy,
	linux-mm
  Cc: linux-kernel, kernel-team

On 5/19/26 22:08, JP Kobryn (Meta) wrote:
> compact_gap() returns 2 << order, which is used as watermark headroom in
> __compaction_suitable() and as a reclaim target in kswapd. The computed
> value scales exponentially by order. For order-9 THP allocations this
> evaluates to 1024 pages, but the compaction free scanner's working set is
> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free
> pages once it matches the migration batch. The current gap over-reserves by
> 32x.
> 
> On fragmented production hosts, kswapd will try and reclaim up to the gap,
> but it only reaches that threshold 18% of the time, causing reclaim to
> continue a majority of the time.

But doesn't that mean there's genuine memory pressure? We're effectively
raising the high watermark by 4 MB, but if processes are continuously
allocating, we'd be reclaiming without the gap as well? Unless the workload
is sized to fit without the gap.

> The over-sized gap also causes 46% of
> order-9 compaction suitability checks to fail unnecessarily - the zone has
> sufficient free pages for the scanner to operate, but not enough to clear
> the inflated threshold.
> 
> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
> with the scanner's actual capacity. Orders 0-4 are unaffected since their
> gap is <= 32.
> 
> A/B test on ~100 instagram production hosts (64GB, 60s measurement):

What was the base kernel version?

> Unpatched (43 hosts)
> pgscan_kswapd (mean/host): ~1.6M
> reclaim efficiency (steal/scan): 83.8%
> compaction success (success/stall): 2.1%
> THP success (alloc/alloc+fallback): 4.9%
> forced lru_add_drain (mean/host): ~107K
> 
> Patched (59 hosts)
> pgscan_kswapd (mean/host): ~449K

Did the extra reclaim just disappear because we allow the allocations to use
4MB more memory? Or it shifted to direct reclaim?

> reclaim efficiency (steal/scan): 91.0%
> compaction success (success/stall): 28.3%

Is this compaction success per compaction stall or per alloc stall?

> THP success (alloc/alloc+fallback): 17.2%

Weird that things would improve that much. I would expect the free memory
just to stabilize around the lower gap but then behave similarly. Are we
missing something here?

> forced lru_add_drain (mean/host): ~64K
> 
> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
> ---
>  include/linux/compaction.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 173d9c07a8952..09aea63b8a89d 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -2,6 +2,8 @@
>  #ifndef _LINUX_COMPACTION_H
>  #define _LINUX_COMPACTION_H
>  
> +#include <linux/swap.h>
> +
>  /*
>   * Determines how hard direct compaction should try to succeed.
>   * Lower value means higher priority, analogically to reclaim priority.
> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
>  	 * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>  	 * that the migrate scanner can have isolated on migrate list, and free
>  	 * scanner is only invoked when the number of isolated free pages is
> -	 * lower than that. But it's not worth to complicate the formula here
> -	 * as a bigger gap for higher orders than strictly necessary can also
> -	 * improve chances of compaction success.
> +	 * lower than that.
>  	 */
> -	return 2UL << order;
> +	return min(2UL << order, COMPACT_CLUSTER_MAX);

Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?

>  }
>  
>  static inline int current_is_kcompactd(void)



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
  2026-05-25 10:02 ` Vlastimil Babka (SUSE)
@ 2026-05-27  0:10   ` JP Kobryn
  2026-05-28  8:51     ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 6+ messages in thread
From: JP Kobryn @ 2026-05-27  0:10 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), akpm, surenb, mhocko, jackmanb, hannes,
	ziy, linux-mm
  Cc: linux-kernel, kernel-team

On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>> compact_gap() returns 2 << order, which is used as watermark headroom in
>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>> value scales exponentially by order. For order-9 THP allocations this
>> evaluates to 1024 pages, but the compaction free scanner's working set is
>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops 
>> isolating free
>> pages once it matches the migration batch. The current gap 
>> over-reserves by
>> 32x.
>>
>> On fragmented production hosts, kswapd will try and reclaim up to the 
>> gap,
>> but it only reaches that threshold 18% of the time, causing reclaim to
>> continue a majority of the time.
> But doesn't that mean there's genuine memory pressure? We're effectively
> raising the high watermark by 4 MB, but if processes are continuously
> allocating, we'd be reclaiming without the gap as well? Unless the 
> workload
> is sized to fit without the gap.

It wasn't actual pressure, but the repetitive order-9 THP failures that were
waking up kswapd. I should make this more clear in the changelog. After
looking into why so much reclaim was occurring though, the compact gap stood
out since it dictates the target amount to reclaim.

>> The over-sized gap also causes 46% of
>> order-9 compaction suitability checks to fail unnecessarily - the 
>> zone has
>> sufficient free pages for the scanner to operate, but not enough to clear
>> the inflated threshold.
>>
>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>> gap is <= 32.
>>
>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
> What was the base kernel version?

6.13. Additional benchmarks were done using a recent mm-new build as well,
and they showed similar reductions in reclaim.

>> Unpatched (43 hosts)
>> pgscan_kswapd (mean/host): ~1.6M
>> reclaim efficiency (steal/scan): 83.8%
>> compaction success (success/stall): 2.1%
>> THP success (alloc/alloc+fallback): 4.9%
>> forced lru_add_drain (mean/host): ~107K
>>
>> Patched (59 hosts)
>> pgscan_kswapd (mean/host): ~449K
> Did the extra reclaim just disappear because we allow the allocations 
> to use
> 4MB more memory? Or it shifted to direct reclaim?

Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
What the data shows is that capping the gap allows compaction to take over
sooner and start working to produce large size pages needed for THP. Whereas
in the pre-patch state, trying to reclaim the full 2x THP delays compaction.

>> reclaim efficiency (steal/scan): 91.0%
>> compaction success (success/stall): 28.3%
> Is this compaction success per compaction stall or per alloc stall?

That's per compaction.

>> THP success (alloc/alloc+fallback): 17.2%
> Weird that things would improve that much. I would expect the free memory
> just to stabilize around the lower gap but then behave similarly. Are we
> missing something here?

This patch was tested in isolation, but also occurring was the case where
bursty net allocations reserve many pageblocks as high atomic. So as
THP-size pages become eligible, their blocks are reserved before being
allocated as THP.

>> forced lru_add_drain (mean/host): ~64K
>>
>> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
>> ---
>> include/linux/compaction.h | 8 ++++----
>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> index 173d9c07a8952..09aea63b8a89d 100644
>> --- a/include/linux/compaction.h
>> +++ b/include/linux/compaction.h
>> @@ -2,6 +2,8 @@
>> #ifndef _LINUX_COMPACTION_H
>> #define _LINUX_COMPACTION_H
>> +#include <linux/swap.h>
>> +
>> /*
>> * Determines how hard direct compaction should try to succeed.
>> * Lower value means higher priority, analogically to reclaim priority.
>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned 
>> int order)
>> * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>> * that the migrate scanner can have isolated on migrate list, and free
>> * scanner is only invoked when the number of isolated free pages is
>> - * lower than that. But it's not worth to complicate the formula here
>> - * as a bigger gap for higher orders than strictly necessary can also
>> - * improve chances of compaction success.
>> + * lower than that.
>> */
>> - return 2UL << order;
>> + return min(2UL << order, COMPACT_CLUSTER_MAX);
> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?

I'm thinking I could reframe this patch as reclaim-focused and use
min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
either leaving the other non-reclaim users of this function alone or
using the 2x form you suggest above. i.e. I can split this function
into a separate reclaim_compact_gap() and use the originally proposed cap.
Thoughts?


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
  2026-05-27  0:10   ` JP Kobryn
@ 2026-05-28  8:51     ` Vlastimil Babka (SUSE)
  2026-06-02  1:48       ` JP Kobryn
  0 siblings, 1 reply; 6+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-28  8:51 UTC (permalink / raw)
  To: JP Kobryn, akpm, surenb, mhocko, jackmanb, hannes, ziy, linux-mm
  Cc: linux-kernel, kernel-team

On 5/27/26 02:10, JP Kobryn wrote:
> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
>> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>>> compact_gap() returns 2 << order, which is used as watermark headroom in
>>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>>> value scales exponentially by order. For order-9 THP allocations this
>>> evaluates to 1024 pages, but the compaction free scanner's working set is
>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops 
>>> isolating free
>>> pages once it matches the migration batch. The current gap 
>>> over-reserves by
>>> 32x.
>>>
>>> On fragmented production hosts, kswapd will try and reclaim up to the 
>>> gap,
>>> but it only reaches that threshold 18% of the time, causing reclaim to
>>> continue a majority of the time.
>> But doesn't that mean there's genuine memory pressure? We're effectively
>> raising the high watermark by 4 MB, but if processes are continuously
>> allocating, we'd be reclaiming without the gap as well? Unless the 
>> workload
>> is sized to fit without the gap.
> 
> It wasn't actual pressure, but the repetitive order-9 THP failures that were
> waking up kswapd. I should make this more clear in the changelog. After
> looking into why so much reclaim was occurring though, the compact gap stood
> out since it dictates the target amount to reclaim.

But the "amount to reclaim" is still defined as "reach high watermark +
compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did
I miss something non-obvious.

So if kswapd did any work, it means the memory was consumed (i.e. there was
some memory pressure) and amount of free memory was below high watermark +
compact_gap()?

BTW, are you using mglru here? (probably not)
As that might be different and I'm not so familiar with it.

>>> The over-sized gap also causes 46% of
>>> order-9 compaction suitability checks to fail unnecessarily - the 
>>> zone has
>>> sufficient free pages for the scanner to operate, but not enough to clear
>>> the inflated threshold.
>>>
>>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>>> gap is <= 32.
>>>
>>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
>> What was the base kernel version?
> 
> 6.13. Additional benchmarks were done using a recent mm-new build as well,
> and they showed similar reductions in reclaim.

If it's a NUMA machine, we recently found an over-reclaim issue there fixed
by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE
THP allocations")

>>> Unpatched (43 hosts)
>>> pgscan_kswapd (mean/host): ~1.6M
>>> reclaim efficiency (steal/scan): 83.8%
>>> compaction success (success/stall): 2.1%
>>> THP success (alloc/alloc+fallback): 4.9%
>>> forced lru_add_drain (mean/host): ~107K
>>>
>>> Patched (59 hosts)
>>> pgscan_kswapd (mean/host): ~449K
>> Did the extra reclaim just disappear because we allow the allocations 
>> to use
>> 4MB more memory? Or it shifted to direct reclaim?
> 
> Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
> What the data shows is that capping the gap allows compaction to take over
> sooner and start working to produce large size pages needed for THP. Whereas
> in the pre-patch state, trying to reclaim the full 2x THP delays compaction.

So do I understand correctly we might have an issue due to lack of
hysteresis? We require reaching high watermark + compact_gap() to terminate
reclaim, but then compaction can find out we meanwhile dropped below that
(due to concurrent allocations) and it's not suitable again?

However the suitability checks e.g. compaction_zonelist_suitable() are using
min watermark, so that should provide the difference already.
Actually it's low watermark because of __compaction_suitable() adding an
extra low-min gap for costly orders. But still.

I did just notice compaction_ready() might be too strict. It wants
effectivly high wmark plus the gap plus the low-min difference. Is it
perhaps the underlying issue here?

>>> reclaim efficiency (steal/scan): 91.0%
>>> compaction success (success/stall): 28.3%
>> Is this compaction success per compaction stall or per alloc stall?
> 
> That's per compaction.
> 
>>> THP success (alloc/alloc+fallback): 17.2%
>> Weird that things would improve that much. I would expect the free memory
>> just to stabilize around the lower gap but then behave similarly. Are we
>> missing something here?
> 
> This patch was tested in isolation, but also occurring was the case where
> bursty net allocations reserve many pageblocks as high atomic. So as
> THP-size pages become eligible, their blocks are reserved before being
> allocated as THP.
> 
>>> forced lru_add_drain (mean/host): ~64K
>>>
>>> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
>>> ---
>>> include/linux/compaction.h | 8 ++++----
>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>>> index 173d9c07a8952..09aea63b8a89d 100644
>>> --- a/include/linux/compaction.h
>>> +++ b/include/linux/compaction.h
>>> @@ -2,6 +2,8 @@
>>> #ifndef _LINUX_COMPACTION_H
>>> #define _LINUX_COMPACTION_H
>>> +#include <linux/swap.h>
>>> +
>>> /*
>>> * Determines how hard direct compaction should try to succeed.
>>> * Lower value means higher priority, analogically to reclaim priority.
>>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned 
>>> int order)
>>> * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>> * that the migrate scanner can have isolated on migrate list, and free
>>> * scanner is only invoked when the number of isolated free pages is
>>> - * lower than that. But it's not worth to complicate the formula here
>>> - * as a bigger gap for higher orders than strictly necessary can also
>>> - * improve chances of compaction success.
>>> + * lower than that.
>>> */
>>> - return 2UL << order;
>>> + return min(2UL << order, COMPACT_CLUSTER_MAX);
>> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?
> 
> I'm thinking I could reframe this patch as reclaim-focused and use
> min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
> either leaving the other non-reclaim users of this function alone or
> using the 2x form you suggest above. i.e. I can split this function
> into a separate reclaim_compact_gap() and use the originally proposed cap.
> Thoughts?

Do I understand correctly you want to cap the reclaim target by
COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is?
But wouldn't that mean we'll actually make changes of passing
compaction_suitable() worse?


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
  2026-05-28  8:51     ` Vlastimil Babka (SUSE)
@ 2026-06-02  1:48       ` JP Kobryn
  2026-06-02  8:40         ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 6+ messages in thread
From: JP Kobryn @ 2026-06-02  1:48 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE), akpm, surenb, mhocko, jackmanb, hannes,
	ziy, linux-mm
  Cc: linux-kernel, kernel-team

On 5/28/26 1:51 AM, Vlastimil Babka (SUSE) wrote:
> On 5/27/26 02:10, JP Kobryn wrote:
>> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
>>> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>>>> compact_gap() returns 2 << order, which is used as watermark headroom in
>>>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>>>> value scales exponentially by order. For order-9 THP allocations this
>>>> evaluates to 1024 pages, but the compaction free scanner's working set is
>>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops
>>>> isolating free
>>>> pages once it matches the migration batch. The current gap
>>>> over-reserves by
>>>> 32x.
>>>>
>>>> On fragmented production hosts, kswapd will try and reclaim up to the
>>>> gap,
>>>> but it only reaches that threshold 18% of the time, causing reclaim to
>>>> continue a majority of the time.
>>> But doesn't that mean there's genuine memory pressure? We're effectively
>>> raising the high watermark by 4 MB, but if processes are continuously
>>> allocating, we'd be reclaiming without the gap as well? Unless the
>>> workload
>>> is sized to fit without the gap.
>> It wasn't actual pressure, but the repetitive order-9 THP failures that were
>> waking up kswapd. I should make this more clear in the changelog. After
>> looking into why so much reclaim was occurring though, the compact gap stood
>> out since it dictates the target amount to reclaim.
> But the "amount to reclaim" is still defined as "reach high watermark +
> compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did
> I miss something non-obvious.
Within kswapd_shrink_node(), sc->nr_to_reclaim is the sum of max(zone high
watermark or SWAP_CLUSTER_MAX) for each zone combined. The gap is not 
added to
that reclaim target though. It's used afterward as the threshold for 
abandoning
high order reclaim:

if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
     sc->order = 0;

balance_pgdat() then returns sc->order and that becomes the kswapd 
reclaim_order
value, allowing this branch to be taken:

if (reclaim_order < alloc_order)
     goto kswapd_try_sleep;

Then in prepare_kswapd_sleep(), if pgdat_balanced() succeeds (at order-0),
kcompactd is woken up for the original alloc_order (order-9).

> So if kswapd did any work, it means the memory was consumed (i.e. there was
> some memory pressure) and amount of free memory was below high watermark +
> compact_gap()?
Hmm but kswapd can be woken up on a high order failure despite plenty of 
lower
order availability. That's really the case where compact_gap() matters for
higher orders. Unless by pressure you mean the high order pages were gone?

> BTW, are you using mglru here? (probably not)
> As that might be different and I'm not so familiar with it.
Using classic LRU.

>>>> The over-sized gap also causes 46% of
>>>> order-9 compaction suitability checks to fail unnecessarily - the
>>>> zone has
>>>> sufficient free pages for the scanner to operate, but not enough to clear
>>>> the inflated threshold.
>>>>
>>>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>>>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>>>> gap is <= 32.
>>>>
>>>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
>>> What was the base kernel version?
>> 6.13. Additional benchmarks were done using a recent mm-new build as well,
>> and they showed similar reductions in reclaim.
> If it's a NUMA machine, we recently found an over-reclaim issue there fixed
> by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE
> THP allocations")
Thanks for pointing this out. I tested this on a recent mm-new built that
includes 9c9828d3ead6, and I found the compact_gap() change was still 
helpful.
My understanding is that 9c9828d3ead6 addresses direct reclaim for THP
allocations, while this patch affects the kswapd reclaim-compaction hand-off
path. The test runs still showed a benefit from capping the gap.

>>>> Unpatched (43 hosts)
>>>> pgscan_kswapd (mean/host): ~1.6M
>>>> reclaim efficiency (steal/scan): 83.8%
>>>> compaction success (success/stall): 2.1%
>>>> THP success (alloc/alloc+fallback): 4.9%
>>>> forced lru_add_drain (mean/host): ~107K
>>>>
>>>> Patched (59 hosts)
>>>> pgscan_kswapd (mean/host): ~449K
>>> Did the extra reclaim just disappear because we allow the allocations
>>> to use
>>> 4MB more memory? Or it shifted to direct reclaim?
>> Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
>> What the data shows is that capping the gap allows compaction to take over
>> sooner and start working to produce large size pages needed for THP. Whereas
>> in the pre-patch state, trying to reclaim the full 2x THP delays compaction.
> So do I understand correctly we might have an issue due to lack of
> hysteresis? We require reaching high watermark + compact_gap() to terminate
> reclaim, but then compaction can find out we meanwhile dropped below that
> (due to concurrent allocations) and it's not suitable again?
On an unpatched kernel in a fragmented environment, 
compaction_suitable() can
remain false because the effective threshold for costly orders is the low
watermark + the compact gap. Kswapd has to keep reclaiming in high order 
mode
as a result. By capping the gap at SWAP_CLUSTER_MAX, compaction becomes 
suitable
sooner and kswapd reaches the high order reclaim cutoff sooner. So with 
the patch,
kswapd is able to fall back to order-0 balancing earlier and wake up 
kcompactd
for the original high order request.

> However the suitability checks e.g. compaction_zonelist_suitable() are using
> min watermark, so that should provide the difference already.
> Actually it's low watermark because of __compaction_suitable() adding an
> extra low-min gap for costly orders. But still.
>
> I did just notice compaction_ready() might be too strict. It wants
> effectivly high wmark plus the gap plus the low-min difference. Is it
> perhaps the underlying issue here?
It's a good point. It does seem like that's worth looking into, and I'd be
happy to explore that separately. My thought at the moment though is that
changing compaction_ready() would be a different direction from the the 
original
focus of this patch, which started with the realization that the compaction
scanner working set is bounded by COMPACT_CLUSTER_MAX. Since 
compact_gap() is
used in multiple reclaim and compaction decisions, including 
compaction_ready(),
fixing its definition seemed like the right first change if the gap 
itself is
oversized.

>>>> reclaim efficiency (steal/scan): 91.0%
>>>> compaction success (success/stall): 28.3%
>>> Is this compaction success per compaction stall or per alloc stall?
>> That's per compaction.
>>
>>>> THP success (alloc/alloc+fallback): 17.2%
>>> Weird that things would improve that much. I would expect the free memory
>>> just to stabilize around the lower gap but then behave similarly. Are we
>>> missing something here?
>> This patch was tested in isolation, but also occurring was the case where
>> bursty net allocations reserve many pageblocks as high atomic. So as
>> THP-size pages become eligible, their blocks are reserved before being
>> allocated as THP.
>>
>>>> forced lru_add_drain (mean/host): ~64K
>>>>
>>>> Signed-off-by: JP Kobryn (Meta)<jp.kobryn@linux.dev>
>>>> ---
>>>> include/linux/compaction.h | 8 ++++----
>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>>>> index 173d9c07a8952..09aea63b8a89d 100644
>>>> --- a/include/linux/compaction.h
>>>> +++ b/include/linux/compaction.h
>>>> @@ -2,6 +2,8 @@
>>>> #ifndef _LINUX_COMPACTION_H
>>>> #define _LINUX_COMPACTION_H
>>>> +#include <linux/swap.h>
>>>> +
>>>> /*
>>>> * Determines how hard direct compaction should try to succeed.
>>>> * Lower value means higher priority, analogically to reclaim priority.
>>>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned
>>>> int order)
>>>> * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>>> * that the migrate scanner can have isolated on migrate list, and free
>>>> * scanner is only invoked when the number of isolated free pages is
>>>> - * lower than that. But it's not worth to complicate the formula here
>>>> - * as a bigger gap for higher orders than strictly necessary can also
>>>> - * improve chances of compaction success.
>>>> + * lower than that.
>>>> */
>>>> - return 2UL << order;
>>>> + return min(2UL << order, COMPACT_CLUSTER_MAX);
>>> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?
>> I'm thinking I could reframe this patch as reclaim-focused and use
>> min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
>> either leaving the other non-reclaim users of this function alone or
>> using the 2x form you suggest above. i.e. I can split this function
>> into a separate reclaim_compact_gap() and use the originally proposed cap.
>> Thoughts?
> Do I understand correctly you want to cap the reclaim target by
> COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is?
> But wouldn't that mean we'll actually make changes of passing
> compaction_suitable() worse?
Good call. I was trying to find some middle ground, but I realize that the
change is better left unified.

Also, I tested a 2x COMPACT_CLUSTER_MAX cap and I saw mixed results - either
similar to this patch or worse, with no improvements over the
COMPACT_CLUSTER_MAX cap.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
  2026-06-02  1:48       ` JP Kobryn
@ 2026-06-02  8:40         ` Vlastimil Babka (SUSE)
  0 siblings, 0 replies; 6+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-06-02  8:40 UTC (permalink / raw)
  To: JP Kobryn, akpm, surenb, mhocko, jackmanb, hannes, ziy, linux-mm
  Cc: linux-kernel, kernel-team

On 6/2/26 03:48, JP Kobryn wrote:
> On 5/28/26 1:51 AM, Vlastimil Babka (SUSE) wrote:
>> On 5/27/26 02:10, JP Kobryn wrote:
>>> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
>>>> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>>>>> compact_gap() returns 2 << order, which is used as watermark headroom in
>>>>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>>>>> value scales exponentially by order. For order-9 THP allocations this
>>>>> evaluates to 1024 pages, but the compaction free scanner's working set is
>>>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops
>>>>> isolating free
>>>>> pages once it matches the migration batch. The current gap
>>>>> over-reserves by
>>>>> 32x.
>>>>>
>>>>> On fragmented production hosts, kswapd will try and reclaim up to the
>>>>> gap,
>>>>> but it only reaches that threshold 18% of the time, causing reclaim to
>>>>> continue a majority of the time.
>>>> But doesn't that mean there's genuine memory pressure? We're effectively
>>>> raising the high watermark by 4 MB, but if processes are continuously
>>>> allocating, we'd be reclaiming without the gap as well? Unless the
>>>> workload
>>>> is sized to fit without the gap.
>>> It wasn't actual pressure, but the repetitive order-9 THP failures that were
>>> waking up kswapd. I should make this more clear in the changelog. After
>>> looking into why so much reclaim was occurring though, the compact gap stood
>>> out since it dictates the target amount to reclaim.
>> But the "amount to reclaim" is still defined as "reach high watermark +
>> compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did
>> I miss something non-obvious.
> Within kswapd_shrink_node(), sc->nr_to_reclaim is the sum of max(zone high
> watermark or SWAP_CLUSTER_MAX) for each zone combined. The gap is not 
> added to
> that reclaim target though. It's used afterward as the threshold for 
> abandoning
> high order reclaim:
> 
> if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
>      sc->order = 0;
> 
> balance_pgdat() then returns sc->order and that becomes the kswapd 
> reclaim_order
> value, allowing this branch to be taken:
> 
> if (reclaim_order < alloc_order)
>      goto kswapd_try_sleep;
> 
> Then in prepare_kswapd_sleep(), if pgdat_balanced() succeeds (at order-0),
> kcompactd is woken up for the original alloc_order (order-9).

Oh I see, thanks for explaining. I think it makes sense to target this
particular part (checking sc->nr_reclaimed) than change compact_gap()
globally then? It seems we have some mismatch in the various heuristics? IIUC:

- in shrink_node() we have a should_continue_reclaim() call, which will
return false as soon as compaction is suitable, but before that, we are
likely to not accumulate enough sc->nr_reclaimed, because sc->nr_to_reclaim
would be capped by SWAP_CLUSTER_MAX's

- thus we won't pass the sc->nr_reclaimed >= compact_gap check in
kswapd_shrink_node()

- balance_pgdat() will keep looping because we're not raising priority
(kswapd_shrink_node() returned a high order) and pgdat_balanced() is false
(it checks for high-order page availability)

Maybe only reduce the sc->nr_reclaimed threshold to 2*COMPACT_CLUSTER_MAX then?

>> So if kswapd did any work, it means the memory was consumed (i.e. there was
>> some memory pressure) and amount of free memory was below high watermark +
>> compact_gap()?
> Hmm but kswapd can be woken up on a high order failure despite plenty of 
> lower
> order availability. That's really the case where compact_gap() matters for
> higher orders.

Ack, thanks.

> Unless by pressure you mean the high order pages were gone?

No.

> 
>> BTW, are you using mglru here? (probably not)
>> As that might be different and I'm not so familiar with it.
> Using classic LRU.
> 
>>>>> The over-sized gap also causes 46% of
>>>>> order-9 compaction suitability checks to fail unnecessarily - the
>>>>> zone has
>>>>> sufficient free pages for the scanner to operate, but not enough to clear
>>>>> the inflated threshold.
>>>>>
>>>>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>>>>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>>>>> gap is <= 32.
>>>>>
>>>>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
>>>> What was the base kernel version?
>>> 6.13. Additional benchmarks were done using a recent mm-new build as well,
>>> and they showed similar reductions in reclaim.
>> If it's a NUMA machine, we recently found an over-reclaim issue there fixed
>> by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE
>> THP allocations")
> Thanks for pointing this out. I tested this on a recent mm-new built that
> includes 9c9828d3ead6, and I found the compact_gap() change was still 
> helpful.

OK.

> My understanding is that 9c9828d3ead6 addresses direct reclaim for THP
> allocations, while this patch affects the kswapd reclaim-compaction hand-off
> path. The test runs still showed a benefit from capping the gap.

Yep.

>>>>> Unpatched (43 hosts)
>>>>> pgscan_kswapd (mean/host): ~1.6M
>>>>> reclaim efficiency (steal/scan): 83.8%
>>>>> compaction success (success/stall): 2.1%
>>>>> THP success (alloc/alloc+fallback): 4.9%
>>>>> forced lru_add_drain (mean/host): ~107K
>>>>>
>>>>> Patched (59 hosts)
>>>>> pgscan_kswapd (mean/host): ~449K
>>>> Did the extra reclaim just disappear because we allow the allocations
>>>> to use
>>>> 4MB more memory? Or it shifted to direct reclaim?
>>> Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
>>> What the data shows is that capping the gap allows compaction to take over
>>> sooner and start working to produce large size pages needed for THP. Whereas
>>> in the pre-patch state, trying to reclaim the full 2x THP delays compaction.
>> So do I understand correctly we might have an issue due to lack of
>> hysteresis? We require reaching high watermark + compact_gap() to terminate
>> reclaim, but then compaction can find out we meanwhile dropped below that
>> (due to concurrent allocations) and it's not suitable again?
> On an unpatched kernel in a fragmented environment, 
> compaction_suitable() can
> remain false because the effective threshold for costly orders is the low
> watermark + the compact gap. Kswapd has to keep reclaiming in high order 
> mode
> as a result.

I think this part might be ok.

> By capping the gap at SWAP_CLUSTER_MAX, compaction becomes 
> suitable
> sooner and kswapd reaches the high order reclaim cutoff sooner. So with 

And the problem is with the cutoff only which is not based on a
watermark+gap threshold but wants to reclaim at least gap pages regardless
of how many pages are already free.

> the patch,
> kswapd is able to fall back to order-0 balancing earlier and wake up 
> kcompactd
> for the original high order request.

Yeah that's likely the crucial part.

>> However the suitability checks e.g. compaction_zonelist_suitable() are using
>> min watermark, so that should provide the difference already.
>> Actually it's low watermark because of __compaction_suitable() adding an
>> extra low-min gap for costly orders. But still.
>>
>> I did just notice compaction_ready() might be too strict. It wants
>> effectivly high wmark plus the gap plus the low-min difference. Is it
>> perhaps the underlying issue here?
> It's a good point. It does seem like that's worth looking into, and I'd be
> happy to explore that separately.My thought at the moment though is that
> changing compaction_ready() would be a different direction from the the 
> original
> focus of this patch, which started with the realization that the compaction
> scanner working set is bounded by COMPACT_CLUSTER_MAX. Since 

Ack.

> compact_gap() is
> used in multiple reclaim and compaction decisions, including 
> compaction_ready(),
> fixing its definition seemed like the right first change if the gap 
> itself is
> oversized.

Still I'd try to address the sc->nr_reclaimed usage first, and see if that's
enough.

>>>>> reclaim efficiency (steal/scan): 91.0%
>>>>> compaction success (success/stall): 28.3%
>>>> Is this compaction success per compaction stall or per alloc stall?
>>> That's per compaction.
>>>
>>>>> THP success (alloc/alloc+fallback): 17.2%
>>>> Weird that things would improve that much. I would expect the free memory
>>>> just to stabilize around the lower gap but then behave similarly. Are we
>>>> missing something here?
>>> This patch was tested in isolation, but also occurring was the case where
>>> bursty net allocations reserve many pageblocks as high atomic. So as
>>> THP-size pages become eligible, their blocks are reserved before being
>>> allocated as THP.
>>>
>>>>> forced lru_add_drain (mean/host): ~64K
>>>>>
>>>>> Signed-off-by: JP Kobryn (Meta)<jp.kobryn@linux.dev>
>>>>> ---
>>>>> include/linux/compaction.h | 8 ++++----
>>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>>>>> index 173d9c07a8952..09aea63b8a89d 100644
>>>>> --- a/include/linux/compaction.h
>>>>> +++ b/include/linux/compaction.h
>>>>> @@ -2,6 +2,8 @@
>>>>> #ifndef _LINUX_COMPACTION_H
>>>>> #define _LINUX_COMPACTION_H
>>>>> +#include <linux/swap.h>
>>>>> +
>>>>> /*
>>>>> * Determines how hard direct compaction should try to succeed.
>>>>> * Lower value means higher priority, analogically to reclaim priority.
>>>>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned
>>>>> int order)
>>>>> * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>>>> * that the migrate scanner can have isolated on migrate list, and free
>>>>> * scanner is only invoked when the number of isolated free pages is
>>>>> - * lower than that. But it's not worth to complicate the formula here
>>>>> - * as a bigger gap for higher orders than strictly necessary can also
>>>>> - * improve chances of compaction success.
>>>>> + * lower than that.
>>>>> */
>>>>> - return 2UL << order;
>>>>> + return min(2UL << order, COMPACT_CLUSTER_MAX);
>>>> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?
>>> I'm thinking I could reframe this patch as reclaim-focused and use
>>> min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
>>> either leaving the other non-reclaim users of this function alone or
>>> using the 2x form you suggest above. i.e. I can split this function
>>> into a separate reclaim_compact_gap() and use the originally proposed cap.
>>> Thoughts?
>> Do I understand correctly you want to cap the reclaim target by
>> COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is?
>> But wouldn't that mean we'll actually make changes of passing
>> compaction_suitable() worse?
> Good call. I was trying to find some middle ground, but I realize that the
> change is better left unified.

My question was based on not understanding the underlying issue, and that
the "reclaim-only target" isn't based on watermark+gap but "reclaim gap
worth of pages".  Now I think sc->nr_reclaimed is indeed the check that
should be relaxed first.

> Also, I tested a 2x COMPACT_CLUSTER_MAX cap and I saw mixed results - either
> similar to this patch or worse, with no improvements over the
> COMPACT_CLUSTER_MAX cap.

Right. Thanks!



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-06-02  8:41 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-19 20:08 [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX JP Kobryn (Meta)
2026-05-25 10:02 ` Vlastimil Babka (SUSE)
2026-05-27  0:10   ` JP Kobryn
2026-05-28  8:51     ` Vlastimil Babka (SUSE)
2026-06-02  1:48       ` JP Kobryn
2026-06-02  8:40         ` Vlastimil Babka (SUSE)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox