Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: JP Kobryn <jp.kobryn@linux.dev>
To: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>,
	akpm@linux-foundation.org, surenb@google.com, mhocko@suse.com,
	jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
	linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
Date: Mon, 1 Jun 2026 18:48:50 -0700	[thread overview]
Message-ID: <b17dd2a3-8ca3-484e-8398-e5423f5df9c4@linux.dev> (raw)
In-Reply-To: <e65672e7-9a56-4489-84a1-db25d2c75f28@kernel.org>

On 5/28/26 1:51 AM, Vlastimil Babka (SUSE) wrote:
> On 5/27/26 02:10, JP Kobryn wrote:
>> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
>>> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>>>> compact_gap() returns 2 << order, which is used as watermark headroom in
>>>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>>>> value scales exponentially by order. For order-9 THP allocations this
>>>> evaluates to 1024 pages, but the compaction free scanner's working set is
>>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops
>>>> isolating free
>>>> pages once it matches the migration batch. The current gap
>>>> over-reserves by
>>>> 32x.
>>>>
>>>> On fragmented production hosts, kswapd will try and reclaim up to the
>>>> gap,
>>>> but it only reaches that threshold 18% of the time, causing reclaim to
>>>> continue a majority of the time.
>>> But doesn't that mean there's genuine memory pressure? We're effectively
>>> raising the high watermark by 4 MB, but if processes are continuously
>>> allocating, we'd be reclaiming without the gap as well? Unless the
>>> workload
>>> is sized to fit without the gap.
>> It wasn't actual pressure, but the repetitive order-9 THP failures that were
>> waking up kswapd. I should make this more clear in the changelog. After
>> looking into why so much reclaim was occurring though, the compact gap stood
>> out since it dictates the target amount to reclaim.
> But the "amount to reclaim" is still defined as "reach high watermark +
> compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did
> I miss something non-obvious.
Within kswapd_shrink_node(), sc->nr_to_reclaim is the sum of max(zone high
watermark or SWAP_CLUSTER_MAX) for each zone combined. The gap is not 
added to
that reclaim target though. It's used afterward as the threshold for 
abandoning
high order reclaim:

if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
     sc->order = 0;

balance_pgdat() then returns sc->order and that becomes the kswapd 
reclaim_order
value, allowing this branch to be taken:

if (reclaim_order < alloc_order)
     goto kswapd_try_sleep;

Then in prepare_kswapd_sleep(), if pgdat_balanced() succeeds (at order-0),
kcompactd is woken up for the original alloc_order (order-9).

> So if kswapd did any work, it means the memory was consumed (i.e. there was
> some memory pressure) and amount of free memory was below high watermark +
> compact_gap()?
Hmm but kswapd can be woken up on a high order failure despite plenty of 
lower
order availability. That's really the case where compact_gap() matters for
higher orders. Unless by pressure you mean the high order pages were gone?

> BTW, are you using mglru here? (probably not)
> As that might be different and I'm not so familiar with it.
Using classic LRU.

>>>> The over-sized gap also causes 46% of
>>>> order-9 compaction suitability checks to fail unnecessarily - the
>>>> zone has
>>>> sufficient free pages for the scanner to operate, but not enough to clear
>>>> the inflated threshold.
>>>>
>>>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>>>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>>>> gap is <= 32.
>>>>
>>>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
>>> What was the base kernel version?
>> 6.13. Additional benchmarks were done using a recent mm-new build as well,
>> and they showed similar reductions in reclaim.
> If it's a NUMA machine, we recently found an over-reclaim issue there fixed
> by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE
> THP allocations")
Thanks for pointing this out. I tested this on a recent mm-new built that
includes 9c9828d3ead6, and I found the compact_gap() change was still 
helpful.
My understanding is that 9c9828d3ead6 addresses direct reclaim for THP
allocations, while this patch affects the kswapd reclaim-compaction hand-off
path. The test runs still showed a benefit from capping the gap.

>>>> Unpatched (43 hosts)
>>>> pgscan_kswapd (mean/host): ~1.6M
>>>> reclaim efficiency (steal/scan): 83.8%
>>>> compaction success (success/stall): 2.1%
>>>> THP success (alloc/alloc+fallback): 4.9%
>>>> forced lru_add_drain (mean/host): ~107K
>>>>
>>>> Patched (59 hosts)
>>>> pgscan_kswapd (mean/host): ~449K
>>> Did the extra reclaim just disappear because we allow the allocations
>>> to use
>>> 4MB more memory? Or it shifted to direct reclaim?
>> Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
>> What the data shows is that capping the gap allows compaction to take over
>> sooner and start working to produce large size pages needed for THP. Whereas
>> in the pre-patch state, trying to reclaim the full 2x THP delays compaction.
> So do I understand correctly we might have an issue due to lack of
> hysteresis? We require reaching high watermark + compact_gap() to terminate
> reclaim, but then compaction can find out we meanwhile dropped below that
> (due to concurrent allocations) and it's not suitable again?
On an unpatched kernel in a fragmented environment, 
compaction_suitable() can
remain false because the effective threshold for costly orders is the low
watermark + the compact gap. Kswapd has to keep reclaiming in high order 
mode
as a result. By capping the gap at SWAP_CLUSTER_MAX, compaction becomes 
suitable
sooner and kswapd reaches the high order reclaim cutoff sooner. So with 
the patch,
kswapd is able to fall back to order-0 balancing earlier and wake up 
kcompactd
for the original high order request.

> However the suitability checks e.g. compaction_zonelist_suitable() are using
> min watermark, so that should provide the difference already.
> Actually it's low watermark because of __compaction_suitable() adding an
> extra low-min gap for costly orders. But still.
>
> I did just notice compaction_ready() might be too strict. It wants
> effectivly high wmark plus the gap plus the low-min difference. Is it
> perhaps the underlying issue here?
It's a good point. It does seem like that's worth looking into, and I'd be
happy to explore that separately. My thought at the moment though is that
changing compaction_ready() would be a different direction from the the 
original
focus of this patch, which started with the realization that the compaction
scanner working set is bounded by COMPACT_CLUSTER_MAX. Since 
compact_gap() is
used in multiple reclaim and compaction decisions, including 
compaction_ready(),
fixing its definition seemed like the right first change if the gap 
itself is
oversized.

>>>> reclaim efficiency (steal/scan): 91.0%
>>>> compaction success (success/stall): 28.3%
>>> Is this compaction success per compaction stall or per alloc stall?
>> That's per compaction.
>>
>>>> THP success (alloc/alloc+fallback): 17.2%
>>> Weird that things would improve that much. I would expect the free memory
>>> just to stabilize around the lower gap but then behave similarly. Are we
>>> missing something here?
>> This patch was tested in isolation, but also occurring was the case where
>> bursty net allocations reserve many pageblocks as high atomic. So as
>> THP-size pages become eligible, their blocks are reserved before being
>> allocated as THP.
>>
>>>> forced lru_add_drain (mean/host): ~64K
>>>>
>>>> Signed-off-by: JP Kobryn (Meta)<jp.kobryn@linux.dev>
>>>> ---
>>>> include/linux/compaction.h | 8 ++++----
>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>>>> index 173d9c07a8952..09aea63b8a89d 100644
>>>> --- a/include/linux/compaction.h
>>>> +++ b/include/linux/compaction.h
>>>> @@ -2,6 +2,8 @@
>>>> #ifndef _LINUX_COMPACTION_H
>>>> #define _LINUX_COMPACTION_H
>>>> +#include <linux/swap.h>
>>>> +
>>>> /*
>>>> * Determines how hard direct compaction should try to succeed.
>>>> * Lower value means higher priority, analogically to reclaim priority.
>>>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned
>>>> int order)
>>>> * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>>> * that the migrate scanner can have isolated on migrate list, and free
>>>> * scanner is only invoked when the number of isolated free pages is
>>>> - * lower than that. But it's not worth to complicate the formula here
>>>> - * as a bigger gap for higher orders than strictly necessary can also
>>>> - * improve chances of compaction success.
>>>> + * lower than that.
>>>> */
>>>> - return 2UL << order;
>>>> + return min(2UL << order, COMPACT_CLUSTER_MAX);
>>> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?
>> I'm thinking I could reframe this patch as reclaim-focused and use
>> min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
>> either leaving the other non-reclaim users of this function alone or
>> using the 2x form you suggest above. i.e. I can split this function
>> into a separate reclaim_compact_gap() and use the originally proposed cap.
>> Thoughts?
> Do I understand correctly you want to cap the reclaim target by
> COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is?
> But wouldn't that mean we'll actually make changes of passing
> compaction_suitable() worse?
Good call. I was trying to find some middle ground, but I realize that the
change is better left unified.

Also, I tested a 2x COMPACT_CLUSTER_MAX cap and I saw mixed results - either
similar to this patch or worse, with no improvements over the
COMPACT_CLUSTER_MAX cap.

next prev parent reply	other threads:[~2026-06-02  1:49 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-19 20:08 [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX JP Kobryn (Meta)
2026-05-25 10:02 ` Vlastimil Babka (SUSE)
2026-05-27  0:10   ` JP Kobryn
2026-05-28  8:51     ` Vlastimil Babka (SUSE)
2026-06-02  1:48       ` JP Kobryn [this message]
2026-06-02  8:40         ` Vlastimil Babka (SUSE)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b17dd2a3-8ca3-484e-8398-e5423f5df9c4@linux.dev \
    --to=jp.kobryn@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=jackmanb@google.com \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=surenb@google.com \
    --cc=vbabka@kernel.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox