* [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
From: JP Kobryn (Meta) @ 2026-05-19 20:08 UTC
To: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm
Cc: linux-kernel, kernel-team

compact_gap() returns 2 << order, which is used as watermark headroom in
__compaction_suitable() and as a reclaim target in kswapd. The computed
value scales exponentially by order. For order-9 THP allocations this
evaluates to 1024 pages, but the compaction free scanner's working set is
bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free
pages once it matches the migration batch. The current gap over-reserves by
32x.

On fragmented production hosts, kswapd will try and reclaim up to the gap,
but it only reaches that threshold 18% of the time, causing reclaim to
continue a majority of the time. The over-sized gap also causes 46% of
order-9 compaction suitability checks to fail unnecessarily - the zone has
sufficient free pages for the scanner to operate, but not enough to clear
the inflated threshold.

Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
with the scanner's actual capacity. Orders 0-4 are unaffected since their
gap is <= 32.

A/B test on ~100 instagram production hosts (64GB, 60s measurement):

Unpatched (43 hosts)
  pgscan_kswapd (mean/host):           ~1.6M
  reclaim efficiency (steal/scan):     83.8%
  compaction success (success/stall):  2.1%
  THP success (alloc/alloc+fallback):  4.9%
  forced lru_add_drain (mean/host):    ~107K

Patched (59 hosts)
  pgscan_kswapd (mean/host):           ~449K
  reclaim efficiency (steal/scan):     91.0%
  compaction success (success/stall):  28.3%
  THP success (alloc/alloc+fallback):  17.2%
  forced lru_add_drain (mean/host):    ~64K

Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
---
 include/linux/compaction.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 173d9c07a8952..09aea63b8a89d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H

+#include <linux/swap.h>
+
 /*
  * Determines how hard direct compaction should try to succeed.
  * Lower value means higher priority, analogically to reclaim priority.
@@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
 	 * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
 	 * that the migrate scanner can have isolated on migrate list, and free
 	 * scanner is only invoked when the number of isolated free pages is
-	 * lower than that. But it's not worth to complicate the formula here
-	 * as a bigger gap for higher orders than strictly necessary can also
-	 * improve chances of compaction success.
+	 * lower than that.
 	 */
-	return 2UL << order;
+	return min(2UL << order, COMPACT_CLUSTER_MAX);
 }

 static inline int current_is_kcompactd(void)
--
2.53.0-Meta
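For reference, the arithmetic in the changelog can be checked with a small userspace sketch. This is not the kernel code: compact_gap_uncapped() and compact_gap_capped() are hypothetical names for the current and the proposed formula, and COMPACT_CLUSTER_MAX is hard-coded here to the 32 pages quoted in the changelog.

```c
#include <assert.h>

/* Assumption: 32 pages, the value quoted in the changelog. */
#define COMPACT_CLUSTER_MAX 32UL

/* Current formula: the gap doubles with every order. */
static unsigned long compact_gap_uncapped(unsigned int order)
{
	/* Keep the shift within the width of unsigned long. */
	assert(order < 8 * sizeof(unsigned long) - 1);
	return 2UL << order;
}

/* Proposed formula: same value, capped at COMPACT_CLUSTER_MAX. */
static unsigned long compact_gap_capped(unsigned int order)
{
	unsigned long gap = compact_gap_uncapped(order);

	return gap < COMPACT_CLUSTER_MAX ? gap : COMPACT_CLUSTER_MAX;
}
```

Under these assumptions order 9 yields 1024 pages uncapped and 32 pages capped, which is the 32x over-reservation the changelog refers to, while orders 0-4 return the same value from both helpers.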
* Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
From: Vlastimil Babka (SUSE) @ 2026-05-25 10:02 UTC
To: JP Kobryn (Meta), akpm, surenb, mhocko, jackmanb, hannes, ziy, linux-mm
Cc: linux-kernel, kernel-team

On 5/19/26 22:08, JP Kobryn (Meta) wrote:
> compact_gap() returns 2 << order, which is used as watermark headroom in
> __compaction_suitable() and as a reclaim target in kswapd. The computed
> value scales exponentially by order. For order-9 THP allocations this
> evaluates to 1024 pages, but the compaction free scanner's working set is
> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free
> pages once it matches the migration batch. The current gap over-reserves by
> 32x.
>
> On fragmented production hosts, kswapd will try and reclaim up to the gap,
> but it only reaches that threshold 18% of the time, causing reclaim to
> continue a majority of the time.

But doesn't that mean there's genuine memory pressure? We're effectively
raising the high watermark by 4 MB, but if processes are continuously
allocating, we'd be reclaiming without the gap as well? Unless the workload
is sized to fit without the gap.

> The over-sized gap also causes 46% of
> order-9 compaction suitability checks to fail unnecessarily - the zone has
> sufficient free pages for the scanner to operate, but not enough to clear
> the inflated threshold.
>
> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
> with the scanner's actual capacity. Orders 0-4 are unaffected since their
> gap is <= 32.
>
> A/B test on ~100 instagram production hosts (64GB, 60s measurement):

What was the base kernel version?

> Unpatched (43 hosts)
>   pgscan_kswapd (mean/host):           ~1.6M
>   reclaim efficiency (steal/scan):     83.8%
>   compaction success (success/stall):  2.1%
>   THP success (alloc/alloc+fallback):  4.9%
>   forced lru_add_drain (mean/host):    ~107K
>
> Patched (59 hosts)
>   pgscan_kswapd (mean/host):           ~449K

Did the extra reclaim just disappear because we allow the allocations to use
4MB more memory? Or it shifted to direct reclaim?

>   reclaim efficiency (steal/scan):     91.0%
>   compaction success (success/stall):  28.3%

Is this compaction success per compaction stall or per alloc stall?

>   THP success (alloc/alloc+fallback):  17.2%

Weird that things would improve that much. I would expect the free memory
just to stabilize around the lower gap but then behave similarly. Are we
missing something here?

>   forced lru_add_drain (mean/host):    ~64K
>
> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
> ---
>  include/linux/compaction.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 173d9c07a8952..09aea63b8a89d 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -2,6 +2,8 @@
>  #ifndef _LINUX_COMPACTION_H
>  #define _LINUX_COMPACTION_H
>
> +#include <linux/swap.h>
> +
>  /*
>   * Determines how hard direct compaction should try to succeed.
>   * Lower value means higher priority, analogically to reclaim priority.
> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
>  	 * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>  	 * that the migrate scanner can have isolated on migrate list, and free
>  	 * scanner is only invoked when the number of isolated free pages is
> -	 * lower than that. But it's not worth to complicate the formula here
> -	 * as a bigger gap for higher orders than strictly necessary can also
> -	 * improve chances of compaction success.
> +	 * lower than that.
>  	 */
> -	return 2UL << order;
> +	return min(2UL << order, COMPACT_CLUSTER_MAX);

Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?

>  }
>
>  static inline int current_is_kcompactd(void)
* Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
From: JP Kobryn @ 2026-05-27 0:10 UTC
To: Vlastimil Babka (SUSE), akpm, surenb, mhocko, jackmanb, hannes, ziy, linux-mm
Cc: linux-kernel, kernel-team

On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>> compact_gap() returns 2 << order, which is used as watermark headroom in
>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>> value scales exponentially by order. For order-9 THP allocations this
>> evaluates to 1024 pages, but the compaction free scanner's working set is
>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free
>> pages once it matches the migration batch. The current gap over-reserves by
>> 32x.
>>
>> On fragmented production hosts, kswapd will try and reclaim up to the gap,
>> but it only reaches that threshold 18% of the time, causing reclaim to
>> continue a majority of the time.
> But doesn't that mean there's genuine memory pressure? We're effectively
> raising the high watermark by 4 MB, but if processes are continuously
> allocating, we'd be reclaiming without the gap as well? Unless the workload
> is sized to fit without the gap.

It wasn't actual pressure, but the repetitive order-9 THP failures that were
waking up kswapd. I should make this more clear in the changelog. After
looking into why so much reclaim was occurring though, the compact gap stood
out since it dictates the target amount to reclaim.

>> The over-sized gap also causes 46% of
>> order-9 compaction suitability checks to fail unnecessarily - the zone has
>> sufficient free pages for the scanner to operate, but not enough to clear
>> the inflated threshold.
>>
>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>> gap is <= 32.
>>
>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
> What was the base kernel version?

6.13. Additional benchmarks were done using a recent mm-new build as well,
and they showed similar reductions in reclaim.

>> Unpatched (43 hosts)
>>   pgscan_kswapd (mean/host):           ~1.6M
>>   reclaim efficiency (steal/scan):     83.8%
>>   compaction success (success/stall):  2.1%
>>   THP success (alloc/alloc+fallback):  4.9%
>>   forced lru_add_drain (mean/host):    ~107K
>>
>> Patched (59 hosts)
>>   pgscan_kswapd (mean/host):           ~449K
> Did the extra reclaim just disappear because we allow the allocations to use
> 4MB more memory? Or it shifted to direct reclaim?

Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
What the data shows is that capping the gap allows compaction to take over
sooner and start working to produce large size pages needed for THP. Whereas
in the pre-patch state, trying to reclaim the full 2x THP delays compaction.

>>   reclaim efficiency (steal/scan):     91.0%
>>   compaction success (success/stall):  28.3%
> Is this compaction success per compaction stall or per alloc stall?

That's per compaction.

>>   THP success (alloc/alloc+fallback):  17.2%
> Weird that things would improve that much. I would expect the free memory
> just to stabilize around the lower gap but then behave similarly. Are we
> missing something here?

This patch was tested in isolation, but also occurring was the case where
bursty net allocations reserve many pageblocks as high atomic. So as
THP-size pages become eligible, their blocks are reserved before being
allocated as THP.

>>   forced lru_add_drain (mean/host):    ~64K
>>
>> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
>> ---
>>  include/linux/compaction.h | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> index 173d9c07a8952..09aea63b8a89d 100644
>> --- a/include/linux/compaction.h
>> +++ b/include/linux/compaction.h
>> @@ -2,6 +2,8 @@
>>  #ifndef _LINUX_COMPACTION_H
>>  #define _LINUX_COMPACTION_H
>>
>> +#include <linux/swap.h>
>> +
>>  /*
>>   * Determines how hard direct compaction should try to succeed.
>>   * Lower value means higher priority, analogically to reclaim priority.
>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
>>  	 * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>  	 * that the migrate scanner can have isolated on migrate list, and free
>>  	 * scanner is only invoked when the number of isolated free pages is
>> -	 * lower than that. But it's not worth to complicate the formula here
>> -	 * as a bigger gap for higher orders than strictly necessary can also
>> -	 * improve chances of compaction success.
>> +	 * lower than that.
>>  	 */
>> -	return 2UL << order;
>> +	return min(2UL << order, COMPACT_CLUSTER_MAX);
> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?

I'm thinking I could reframe this patch as reclaim-focused and use
min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
either leaving the other non-reclaim users of this function alone or
using the 2x form you suggest above. i.e. I can split this function
into a separate reclaim_compact_gap() and use the originally proposed cap.
Thoughts?
* Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
From: Vlastimil Babka (SUSE) @ 2026-05-28 8:51 UTC
To: JP Kobryn, akpm, surenb, mhocko, jackmanb, hannes, ziy, linux-mm
Cc: linux-kernel, kernel-team

On 5/27/26 02:10, JP Kobryn wrote:
> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
>> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>>> compact_gap() returns 2 << order, which is used as watermark headroom in
>>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>>> value scales exponentially by order. For order-9 THP allocations this
>>> evaluates to 1024 pages, but the compaction free scanner's working set is
>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free
>>> pages once it matches the migration batch. The current gap over-reserves by
>>> 32x.
>>>
>>> On fragmented production hosts, kswapd will try and reclaim up to the gap,
>>> but it only reaches that threshold 18% of the time, causing reclaim to
>>> continue a majority of the time.
>> But doesn't that mean there's genuine memory pressure? We're effectively
>> raising the high watermark by 4 MB, but if processes are continuously
>> allocating, we'd be reclaiming without the gap as well? Unless the workload
>> is sized to fit without the gap.
>
> It wasn't actual pressure, but the repetitive order-9 THP failures that were
> waking up kswapd. I should make this more clear in the changelog. After
> looking into why so much reclaim was occurring though, the compact gap stood
> out since it dictates the target amount to reclaim.

But the "amount to reclaim" is still defined as "reach high watermark +
compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did
I miss something non-obvious.

So if kswapd did any work, it means the memory was consumed (i.e. there was
some memory pressure) and amount of free memory was below high watermark +
compact_gap()?

BTW, are you using mglru here? (probably not)
As that might be different and I'm not so familiar with it.

>>> The over-sized gap also causes 46% of
>>> order-9 compaction suitability checks to fail unnecessarily - the zone has
>>> sufficient free pages for the scanner to operate, but not enough to clear
>>> the inflated threshold.
>>>
>>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>>> gap is <= 32.
>>>
>>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
>> What was the base kernel version?
>
> 6.13. Additional benchmarks were done using a recent mm-new build as well,
> and they showed similar reductions in reclaim.

If it's a NUMA machine, we recently found an over-reclaim issue there fixed
by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE
THP allocations")

>>> Unpatched (43 hosts)
>>>   pgscan_kswapd (mean/host):           ~1.6M
>>>   reclaim efficiency (steal/scan):     83.8%
>>>   compaction success (success/stall):  2.1%
>>>   THP success (alloc/alloc+fallback):  4.9%
>>>   forced lru_add_drain (mean/host):    ~107K
>>>
>>> Patched (59 hosts)
>>>   pgscan_kswapd (mean/host):           ~449K
>> Did the extra reclaim just disappear because we allow the allocations to use
>> 4MB more memory? Or it shifted to direct reclaim?
>
> Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
> What the data shows is that capping the gap allows compaction to take over
> sooner and start working to produce large size pages needed for THP. Whereas
> in the pre-patch state, trying to reclaim the full 2x THP delays compaction.

So do I understand correctly we might have an issue due to lack of
hysteresis? We require reaching high watermark + compact_gap() to terminate
reclaim, but then compaction can find out we meanwhile dropped below that
(due to concurrent allocations) and it's not suitable again?

However the suitability checks e.g. compaction_zonelist_suitable() are using
min watermark, so that should provide the difference already.
Actually it's low watermark because of __compaction_suitable() adding an
extra low-min gap for costly orders. But still.

I did just notice compaction_ready() might be too strict. It wants
effectively high wmark plus the gap plus the low-min difference. Is it
perhaps the underlying issue here?

>>> reclaim efficiency (steal/scan):     91.0%
>>> compaction success (success/stall):  28.3%
>> Is this compaction success per compaction stall or per alloc stall?
>
> That's per compaction.
>
>>> THP success (alloc/alloc+fallback):  17.2%
>> Weird that things would improve that much. I would expect the free memory
>> just to stabilize around the lower gap but then behave similarly. Are we
>> missing something here?
>
> This patch was tested in isolation, but also occurring was the case where
> bursty net allocations reserve many pageblocks as high atomic. So as
> THP-size pages become eligible, their blocks are reserved before being
> allocated as THP.
>
>>> forced lru_add_drain (mean/host):    ~64K
>>>
>>> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
>>> ---
>>>  include/linux/compaction.h | 8 ++++----
>>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>>> index 173d9c07a8952..09aea63b8a89d 100644
>>> --- a/include/linux/compaction.h
>>> +++ b/include/linux/compaction.h
>>> @@ -2,6 +2,8 @@
>>>  #ifndef _LINUX_COMPACTION_H
>>>  #define _LINUX_COMPACTION_H
>>>
>>> +#include <linux/swap.h>
>>> +
>>>  /*
>>>   * Determines how hard direct compaction should try to succeed.
>>>   * Lower value means higher priority, analogically to reclaim priority.
>>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
>>>  	 * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>>  	 * that the migrate scanner can have isolated on migrate list, and free
>>>  	 * scanner is only invoked when the number of isolated free pages is
>>> -	 * lower than that. But it's not worth to complicate the formula here
>>> -	 * as a bigger gap for higher orders than strictly necessary can also
>>> -	 * improve chances of compaction success.
>>> +	 * lower than that.
>>>  	 */
>>> -	return 2UL << order;
>>> +	return min(2UL << order, COMPACT_CLUSTER_MAX);
>> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?
>
> I'm thinking I could reframe this patch as reclaim-focused and use
> min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
> either leaving the other non-reclaim users of this function alone or
> using the 2x form you suggest above. i.e. I can split this function
> into a separate reclaim_compact_gap() and use the originally proposed cap.
> Thoughts?

Do I understand correctly you want to cap the reclaim target by
COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is?
But wouldn't that mean we'll actually make chances of passing
compaction_suitable() worse?
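The thresholds compared in the message above can be put side by side in a short userspace model. This is only a sketch of what the thread describes, not the kernel implementation: struct zone_wmarks, suitable_threshold() and ready_threshold() are hypothetical names, COSTLY_ORDER stands in for PAGE_ALLOC_COSTLY_ORDER, and the watermark values used in the note below are invented for illustration.

```c
#include <assert.h>

/* Stand-in for PAGE_ALLOC_COSTLY_ORDER; orders above it are "costly". */
#define COSTLY_ORDER 3

/* Hypothetical container for one zone's watermarks, in pages. */
struct zone_wmarks {
	unsigned long min;
	unsigned long low;
	unsigned long high;
};

/* Extra low-min headroom that the thread says is added for costly orders. */
static unsigned long costly_extra(const struct zone_wmarks *w,
				  unsigned int order)
{
	assert(w->low >= w->min);
	return order > COSTLY_ORDER ? w->low - w->min : 0;
}

/* Suitability check as described: min watermark + gap (+ low-min). */
static unsigned long suitable_threshold(const struct zone_wmarks *w,
					unsigned int order, unsigned long gap)
{
	return w->min + gap + costly_extra(w, order);
}

/* compaction_ready() as described: high watermark + gap (+ low-min). */
static unsigned long ready_threshold(const struct zone_wmarks *w,
				     unsigned int order, unsigned long gap)
{
	return w->high + gap + costly_extra(w, order);
}
```

With made-up watermarks of min=1000, low=1250 and high=1500 pages, an order-9 request needs 2274 free pages to be suitable and 2774 to be ready with the 1024-page gap, against 1282 and 1782 with a 32-page gap. In this model the high-min distance separates the two checks regardless of the gap size, which is the strictness being questioned for compaction_ready().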
* Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
From: JP Kobryn @ 2026-06-02 1:48 UTC
To: Vlastimil Babka (SUSE), akpm, surenb, mhocko, jackmanb, hannes, ziy, linux-mm
Cc: linux-kernel, kernel-team

On 5/28/26 1:51 AM, Vlastimil Babka (SUSE) wrote:
> On 5/27/26 02:10, JP Kobryn wrote:
>> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
>>> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>>>> compact_gap() returns 2 << order, which is used as watermark headroom in
>>>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>>>> value scales exponentially by order. For order-9 THP allocations this
>>>> evaluates to 1024 pages, but the compaction free scanner's working set is
>>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free
>>>> pages once it matches the migration batch. The current gap over-reserves by
>>>> 32x.
>>>>
>>>> On fragmented production hosts, kswapd will try and reclaim up to the gap,
>>>> but it only reaches that threshold 18% of the time, causing reclaim to
>>>> continue a majority of the time.
>>> But doesn't that mean there's genuine memory pressure? We're effectively
>>> raising the high watermark by 4 MB, but if processes are continuously
>>> allocating, we'd be reclaiming without the gap as well? Unless the workload
>>> is sized to fit without the gap.
>> It wasn't actual pressure, but the repetitive order-9 THP failures that were
>> waking up kswapd. I should make this more clear in the changelog. After
>> looking into why so much reclaim was occurring though, the compact gap stood
>> out since it dictates the target amount to reclaim.
> But the "amount to reclaim" is still defined as "reach high watermark +
> compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did
> I miss something non-obvious.

Within kswapd_shrink_node(), sc->nr_to_reclaim is the sum, over all zones,
of max(zone high watermark, SWAP_CLUSTER_MAX). The gap is not added to that
reclaim target though. It's used afterward as the threshold for abandoning
high order reclaim:

	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
		sc->order = 0;

balance_pgdat() then returns sc->order and that becomes the kswapd
reclaim_order value, allowing this branch to be taken:

	if (reclaim_order < alloc_order)
		goto kswapd_try_sleep;

Then in prepare_kswapd_sleep(), if pgdat_balanced() succeeds (at order-0),
kcompactd is woken up for the original alloc_order (order-9).

> So if kswapd did any work, it means the memory was consumed (i.e. there was
> some memory pressure) and amount of free memory was below high watermark +
> compact_gap()?

Hmm but kswapd can be woken up on a high order failure despite plenty of
lower order availability. That's really the case where compact_gap() matters
for higher orders. Unless by pressure you mean the high order pages were
gone?

> BTW, are you using mglru here? (probably not)
> As that might be different and I'm not so familiar with it.

Using classic LRU.

>>>> The over-sized gap also causes 46% of
>>>> order-9 compaction suitability checks to fail unnecessarily - the zone has
>>>> sufficient free pages for the scanner to operate, but not enough to clear
>>>> the inflated threshold.
>>>>
>>>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>>>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>>>> gap is <= 32.
>>>>
>>>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
>>> What was the base kernel version?
>> 6.13. Additional benchmarks were done using a recent mm-new build as well,
>> and they showed similar reductions in reclaim.
> If it's a NUMA machine, we recently found an over-reclaim issue there fixed
> by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE
> THP allocations")

Thanks for pointing this out. I tested this on a recent mm-new build that
includes 9c9828d3ead6, and I found the compact_gap() change was still
helpful. My understanding is that 9c9828d3ead6 addresses direct reclaim for
THP allocations, while this patch affects the kswapd reclaim-compaction
hand-off path. The test runs still showed a benefit from capping the gap.

>>>> Unpatched (43 hosts)
>>>>   pgscan_kswapd (mean/host):           ~1.6M
>>>>   reclaim efficiency (steal/scan):     83.8%
>>>>   compaction success (success/stall):  2.1%
>>>>   THP success (alloc/alloc+fallback):  4.9%
>>>>   forced lru_add_drain (mean/host):    ~107K
>>>>
>>>> Patched (59 hosts)
>>>>   pgscan_kswapd (mean/host):           ~449K
>>> Did the extra reclaim just disappear because we allow the allocations to use
>>> 4MB more memory? Or it shifted to direct reclaim?
>> Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
>> What the data shows is that capping the gap allows compaction to take over
>> sooner and start working to produce large size pages needed for THP. Whereas
>> in the pre-patch state, trying to reclaim the full 2x THP delays compaction.
> So do I understand correctly we might have an issue due to lack of
> hysteresis? We require reaching high watermark + compact_gap() to terminate
> reclaim, but then compaction can find out we meanwhile dropped below that
> (due to concurrent allocations) and it's not suitable again?

On an unpatched kernel in a fragmented environment, compaction_suitable() can
remain false because the effective threshold for costly orders is the low
watermark + the compact gap. Kswapd has to keep reclaiming in high order mode
as a result. By capping the gap at COMPACT_CLUSTER_MAX, compaction becomes
suitable sooner and kswapd reaches the high order reclaim cutoff sooner. So
with the patch, kswapd is able to fall back to order-0 balancing earlier and
wake up kcompactd for the original high order request.

> However the suitability checks e.g. compaction_zonelist_suitable() are using
> min watermark, so that should provide the difference already.
> Actually it's low watermark because of __compaction_suitable() adding an
> extra low-min gap for costly orders. But still.
>
> I did just notice compaction_ready() might be too strict. It wants
> effectively high wmark plus the gap plus the low-min difference. Is it
> perhaps the underlying issue here?

It's a good point. It does seem like that's worth looking into, and I'd be
happy to explore that separately. My thought at the moment though is that
changing compaction_ready() would be a different direction from the original
focus of this patch, which started with the realization that the compaction
scanner working set is bounded by COMPACT_CLUSTER_MAX. Since compact_gap() is
used in multiple reclaim and compaction decisions, including
compaction_ready(), fixing its definition seemed like the right first change
if the gap itself is oversized.

>>>> reclaim efficiency (steal/scan):     91.0%
>>>> compaction success (success/stall):  28.3%
>>> Is this compaction success per compaction stall or per alloc stall?
>> That's per compaction.
>>
>>>> THP success (alloc/alloc+fallback):  17.2%
>>> Weird that things would improve that much. I would expect the free memory
>>> just to stabilize around the lower gap but then behave similarly. Are we
>>> missing something here?
>> This patch was tested in isolation, but also occurring was the case where
>> bursty net allocations reserve many pageblocks as high atomic. So as
>> THP-size pages become eligible, their blocks are reserved before being
>> allocated as THP.
>>
>>>> forced lru_add_drain (mean/host):    ~64K
>>>>
>>>> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
>>>> ---
>>>>  include/linux/compaction.h | 8 ++++----
>>>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>>>> index 173d9c07a8952..09aea63b8a89d 100644
>>>> --- a/include/linux/compaction.h
>>>> +++ b/include/linux/compaction.h
>>>> @@ -2,6 +2,8 @@
>>>>  #ifndef _LINUX_COMPACTION_H
>>>>  #define _LINUX_COMPACTION_H
>>>>
>>>> +#include <linux/swap.h>
>>>> +
>>>>  /*
>>>>   * Determines how hard direct compaction should try to succeed.
>>>>   * Lower value means higher priority, analogically to reclaim priority.
>>>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
>>>>  	 * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>>>  	 * that the migrate scanner can have isolated on migrate list, and free
>>>>  	 * scanner is only invoked when the number of isolated free pages is
>>>> -	 * lower than that. But it's not worth to complicate the formula here
>>>> -	 * as a bigger gap for higher orders than strictly necessary can also
>>>> -	 * improve chances of compaction success.
>>>> +	 * lower than that.
>>>>  	 */
>>>> -	return 2UL << order;
>>>> +	return min(2UL << order, COMPACT_CLUSTER_MAX);
>>> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?
>> I'm thinking I could reframe this patch as reclaim-focused and use
>> min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
>> either leaving the other non-reclaim users of this function alone or
>> using the 2x form you suggest above. i.e. I can split this function
>> into a separate reclaim_compact_gap() and use the originally proposed cap.
>> Thoughts?
> Do I understand correctly you want to cap the reclaim target by
> COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is?
> But wouldn't that mean we'll actually make chances of passing
> compaction_suitable() worse?

Good call. I was trying to find some middle ground, but I realize that the
change is better left unified. Also, I tested a 2x COMPACT_CLUSTER_MAX cap
and I saw mixed results - either similar to this patch or worse, with no
improvements over the COMPACT_CLUSTER_MAX cap.
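The kswapd hand-off described in this message can be modelled in userspace as follows. This is a simplified sketch under stated assumptions, not the kernel code: order_after_shrink() and hands_off_to_kcompactd() are hypothetical helpers that mirror only the two conditionals quoted above, and the gap is passed in as a parameter so the 1024-page and 32-page cases can be compared.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Models the cutoff quoted from kswapd_shrink_node(): once at least
 * `gap` pages have been reclaimed, high order reclaim is abandoned and
 * the order drops to 0. Otherwise the order is left unchanged.
 */
static unsigned int order_after_shrink(unsigned int order,
				       unsigned long nr_reclaimed,
				       unsigned long gap)
{
	if (order && nr_reclaimed >= gap)
		return 0;
	return order;
}

/*
 * Models the branch quoted from kswapd: when the order returned by
 * balancing is lower than the order originally asked for, kswapd heads
 * for the sleep path, where kcompactd can be woken for alloc_order.
 */
static bool hands_off_to_kcompactd(unsigned int alloc_order,
				   unsigned int reclaim_order)
{
	/* The modelled cutoff only ever lowers the order. */
	assert(reclaim_order <= alloc_order);
	return reclaim_order < alloc_order;
}
```

As an illustration with an invented figure of 200 pages reclaimed in one pass for an order-9 wakeup: against a 1024-page gap the order stays at 9 and no hand-off happens, while against a 32-page gap the order drops to 0 and the hand-off branch is taken.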
* Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX 2026-06-02 1:48 ` JP Kobryn @ 2026-06-02 8:40 ` Vlastimil Babka (SUSE) 0 siblings, 0 replies; 6+ messages in thread From: Vlastimil Babka (SUSE) @ 2026-06-02 8:40 UTC (permalink / raw) To: JP Kobryn, akpm, surenb, mhocko, jackmanb, hannes, ziy, linux-mm Cc: linux-kernel, kernel-team On 6/2/26 03:48, JP Kobryn wrote: > On 5/28/26 1:51 AM, Vlastimil Babka (SUSE) wrote: >> On 5/27/26 02:10, JP Kobryn wrote: >>> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote: >>>> On 5/19/26 22:08, JP Kobryn (Meta) wrote: >>>>> compact_gap() returns 2 << order, which is used as watermark headroom in >>>>> __compaction_suitable() and as a reclaim target in kswapd. The computed >>>>> value scales exponentially by order. For order-9 THP allocations this >>>>> evaluates to 1024 pages, but the compaction free scanner's working set is >>>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops >>>>> isolating free >>>>> pages once it matches the migration batch. The current gap >>>>> over-reserves by >>>>> 32x. >>>>> >>>>> On fragmented production hosts, kswapd will try and reclaim up to the >>>>> gap, >>>>> but it only reaches that threshold 18% of the time, causing reclaim to >>>>> continue a majority of the time. >>>> But doesn't that mean there's genuine memory pressure? We're effectively >>>> raising the high watermark by 4 MB, but if processes are continuously >>>> allocating, we'd be reclaiming without the gap as well? Unless the >>>> workload >>>> is sized to fit without the gap. >>> It wasn't actual pressure, but the repetitive order-9 THP failures that were >>> waking up kswapd. I should make this more clear in the changelog. After >>> looking into why so much reclaim was occurring though, the compact gap stood >>> out since it dictates the target amount to reclaim. 
>> But the "amount to reclaim" is still defined as "reach high watermark + >> compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did >> I miss something non-obvious. > Within kswapd_shrink_node(), sc->nr_to_reclaim is the sum of max(zone high > watermark or SWAP_CLUSTER_MAX) for each zone combined. The gap is not > added to > that reclaim target though. It's used afterward as the threshold for > abandoning > high order reclaim: > > if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order)) > sc->order = 0; > > balance_pgdat() then returns sc->order and that becomes the kswapd > reclaim_order > value, allowing this branch to be taken: > > if (reclaim_order < alloc_order) > goto kswapd_try_sleep; > > Then in prepare_kswapd_sleep(), if pgdat_balanced() succeeds (at order-0), > kcompactd is woken up for the original alloc_order (order-9). Oh I see, thanks for explaining. I think it makes sense to target this particular part (checking sc->nr_reclaimed) than change compact_gap() globally then? It seems we have some mismatch in the various heuristics? IIUC: - in shrink_node() we have a should_continue_reclaim() call, which will return false as soon as compaction is suitable, but before that, we are likely to not accumulate enough sc->nr_reclaimed, because sc->nr_to_reclaim would be capped by SWAP_CLUSTER_MAX's - thus we won't pass the sc->nr_reclaimed >= compact_gap check in kswapd_shrink_node() - balance_pgdat() will keep looping because we're not raising priority (kswapd_shrink_node() returned a high order) and pgdat_balanced() is false (it checks for high-order page availability) Maybe only reduce the sc->nr_reclaimed threshold to 2*COMPACT_CLUSTER_MAX then? >> So if kswapd did any work, it means the memory was consumed (i.e. there was >> some memory pressure) and amount of free memory was below high watermark + >> compact_gap()? > Hmm but kswapd can be woken up on a high order failure despite plenty of > lower > order availability. 
> That's really the case where compact_gap() matters for higher orders.

Ack, thanks.

> Unless by pressure you mean the high order pages were gone?

No.

>
>> BTW, are you using mglru here? (probably not)
>> As that might be different and I'm not so familiar with it.
> Using classic LRU.
>
>>>>> The over-sized gap also causes 46% of order-9 compaction suitability
>>>>> checks to fail unnecessarily - the zone has sufficient free pages for
>>>>> the scanner to operate, but not enough to clear the inflated threshold.
>>>>>
>>>>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>>>>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>>>>> gap is <= 32.
>>>>>
>>>>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
>>>> What was the base kernel version?
>>> 6.13. Additional benchmarks were done using a recent mm-new build as well,
>>> and they showed similar reductions in reclaim.
>> If it's a NUMA machine, we recently found an over-reclaim issue there fixed
>> by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE
>> THP allocations")
> Thanks for pointing this out. I tested this on a recent mm-new build that
> includes 9c9828d3ead6, and I found the compact_gap() change was still
> helpful.

OK.

> My understanding is that 9c9828d3ead6 addresses direct reclaim for THP
> allocations, while this patch affects the kswapd reclaim-compaction hand-off
> path. The test runs still showed a benefit from capping the gap.

Yep.

>>>>> Unpatched (43 hosts)
>>>>> pgscan_kswapd (mean/host): ~1.6M
>>>>> reclaim efficiency (steal/scan): 83.8%
>>>>> compaction success (success/stall): 2.1%
>>>>> THP success (alloc/alloc+fallback): 4.9%
>>>>> forced lru_add_drain (mean/host): ~107K
>>>>>
>>>>> Patched (59 hosts)
>>>>> pgscan_kswapd (mean/host): ~449K
>>>> Did the extra reclaim just disappear because we allow the allocations
>>>> to use 4MB more memory? Or it shifted to direct reclaim?
>>> Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
>>> What the data shows is that capping the gap allows compaction to take over
>>> sooner and start working to produce large size pages needed for THP. Whereas
>>> in the pre-patch state, trying to reclaim the full 2x THP delays compaction.
>> So do I understand correctly we might have an issue due to lack of
>> hysteresis? We require reaching high watermark + compact_gap() to terminate
>> reclaim, but then compaction can find out we meanwhile dropped below that
>> (due to concurrent allocations) and it's not suitable again?
> On an unpatched kernel in a fragmented environment, compaction_suitable()
> can remain false because the effective threshold for costly orders is the
> low watermark + the compact gap. Kswapd has to keep reclaiming in high
> order mode as a result.

I think this part might be ok.

> By capping the gap at SWAP_CLUSTER_MAX, compaction becomes suitable
> sooner and kswapd reaches the high order reclaim cutoff sooner. So with

And the problem is with the cutoff only, which is not based on a
watermark+gap threshold but wants to reclaim at least gap pages regardless
of how many pages are already free.

> the patch, kswapd is able to fall back to order-0 balancing earlier and
> wake up kcompactd for the original high order request.

Yeah that's likely the crucial part.

>> However the suitability checks e.g. compaction_zonelist_suitable() are using
>> min watermark, so that should provide the difference already.
>> Actually it's low watermark because of __compaction_suitable() adding an
>> extra low-min gap for costly orders. But still.
>>
>> I did just notice compaction_ready() might be too strict. It wants
>> effectively high wmark plus the gap plus the low-min difference. Is it
>> perhaps the underlying issue here?
> It's a good point.
> It does seem like that's worth looking into, and I'd be
> happy to explore that separately. My thought at the moment though is that
> changing compaction_ready() would be a different direction from the original
> focus of this patch, which started with the realization that the compaction
> scanner working set is bounded by COMPACT_CLUSTER_MAX. Since

Ack.

> compact_gap() is used in multiple reclaim and compaction decisions,
> including compaction_ready(), fixing its definition seemed like the right
> first change if the gap itself is oversized.

Still I'd try to address the sc->nr_reclaimed usage first, and see if that's
enough.

>>>>> reclaim efficiency (steal/scan): 91.0%
>>>>> compaction success (success/stall): 28.3%
>>>> Is this compaction success per compaction stall or per alloc stall?
>>> That's per compaction.
>>>
>>>>> THP success (alloc/alloc+fallback): 17.2%
>>>> Weird that things would improve that much. I would expect the free memory
>>>> just to stabilize around the lower gap but then behave similarly. Are we
>>>> missing something here?
>>> This patch was tested in isolation, but also occurring was the case where
>>> bursty net allocations reserve many pageblocks as high atomic. So as
>>> THP-size pages become eligible, their blocks are reserved before being
>>> allocated as THP.
>>>
>>>>> forced lru_add_drain (mean/host): ~64K
>>>>>
>>>>> Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
>>>>> ---
>>>>> include/linux/compaction.h | 8 ++++----
>>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>>>>> index 173d9c07a8952..09aea63b8a89d 100644
>>>>> --- a/include/linux/compaction.h
>>>>> +++ b/include/linux/compaction.h
>>>>> @@ -2,6 +2,8 @@
>>>>> #ifndef _LINUX_COMPACTION_H
>>>>> #define _LINUX_COMPACTION_H
>>>>> +#include <linux/swap.h>
>>>>> +
>>>>> /*
>>>>> * Determines how hard direct compaction should try to succeed.
>>>>> * Lower value means higher priority, analogically to reclaim priority.
>>>>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
>>>>> * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>>>> * that the migrate scanner can have isolated on migrate list, and free
>>>>> * scanner is only invoked when the number of isolated free pages is
>>>>> - * lower than that. But it's not worth to complicate the formula here
>>>>> - * as a bigger gap for higher orders than strictly necessary can also
>>>>> - * improve chances of compaction success.
>>>>> + * lower than that.
>>>>> */
>>>>> - return 2UL << order;
>>>>> + return min(2UL << order, COMPACT_CLUSTER_MAX);
>>>> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?
>>> I'm thinking I could reframe this patch as reclaim-focused and use
>>> min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
>>> either leaving the other non-reclaim users of this function alone or
>>> using the 2x form you suggest above. i.e. I can split this function
>>> into a separate reclaim_compact_gap() and use the originally proposed cap.
>>> Thoughts?
>> Do I understand correctly you want to cap the reclaim target by
>> COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is?
>> But wouldn't that mean we'll actually make chances of passing
>> compaction_suitable() worse?
> Good call. I was trying to find some middle ground, but I realize that the
> change is better left unified.

My question was based on not understanding the underlying issue, and that the
"reclaim-only target" isn't based on watermark+gap but "reclaim gap worth of
pages". Now I think sc->nr_reclaimed is indeed the check that should be
relaxed first.

> Also, I tested a 2x COMPACT_CLUSTER_MAX cap and I saw mixed results - either
> similar to this patch or worse, with no improvements over the
> COMPACT_CLUSTER_MAX cap.

Right. Thanks!
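
For readers following the numbers in this thread, the gap arithmetic and the
kswapd high-order cutoff can be modelled in user space. This is only a rough
sketch, not kernel code. It assumes COMPACT_CLUSTER_MAX is 32 pages, as stated
in the changelog. The names compact_gap_capped() and
kswapd_order_after_cutoff() are hypothetical and exist only for illustration;
the second one mirrors the "sc->nr_reclaimed >= compact_gap(sc->order)" check
quoted from kswapd_shrink_node(), with the threshold passed in so that the
alternatives discussed above can be compared.

```c
/*
 * User-space model of the compact_gap() arithmetic and the kswapd
 * high-order cutoff discussed in the thread. Not kernel code.
 */

/* Assumption: 32 pages, the value quoted in the patch changelog. */
#define COMPACT_CLUSTER_MAX 32UL

/* Current formula, as quoted in the diff: grows exponentially with order. */
static inline unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}

/* Hypothetical name: the capped formula proposed by the patch. */
static inline unsigned long compact_gap_capped(unsigned int order)
{
	unsigned long gap = 2UL << order;

	return gap < COMPACT_CLUSTER_MAX ? gap : COMPACT_CLUSTER_MAX;
}

/*
 * Hypothetical helper modelling the cutoff quoted from
 * kswapd_shrink_node(): once nr_reclaimed reaches the threshold,
 * high-order reclaim is abandoned and kswapd continues at order 0.
 * Returns the order kswapd keeps reclaiming for.
 */
static inline unsigned int kswapd_order_after_cutoff(unsigned int order,
						     unsigned long nr_reclaimed,
						     unsigned long threshold)
{
	if (order && nr_reclaimed >= threshold)
		return 0;
	return order;
}
```

With these definitions, compact_gap(9) is 1024 and compact_gap_capped(9) is
32, while orders 0-4 give the same value from both. If kswapd has reclaimed
100 pages for an order-9 request, kswapd_order_after_cutoff(9, 100, 1024)
stays at 9, whereas a threshold of 32 (the patch) or of
2 * COMPACT_CLUSTER_MAX (the alternative suggested in the reply) drops it
to 0.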