All of lore.kernel.org
 help / color / mirror / Atom feed
From: JP Kobryn <jp.kobryn@linux.dev>
To: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>,
	akpm@linux-foundation.org, surenb@google.com, mhocko@suse.com,
	jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
	linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
Date: Mon, 1 Jun 2026 18:48:50 -0700	[thread overview]
Message-ID: <b17dd2a3-8ca3-484e-8398-e5423f5df9c4@linux.dev> (raw)
In-Reply-To: <e65672e7-9a56-4489-84a1-db25d2c75f28@kernel.org>

On 5/28/26 1:51 AM, Vlastimil Babka (SUSE) wrote:
> On 5/27/26 02:10, JP Kobryn wrote:
>> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
>>> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>>>> compact_gap() returns 2 << order, which is used as watermark headroom in
>>>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>>>> value scales exponentially by order. For order-9 THP allocations this
>>>> evaluates to 1024 pages, but the compaction free scanner's working set is
>>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops
>>>> isolating free
>>>> pages once it matches the migration batch. The current gap
>>>> over-reserves by
>>>> 32x.
>>>>
>>>> On fragmented production hosts, kswapd will try and reclaim up to the
>>>> gap,
>>>> but it only reaches that threshold 18% of the time, causing reclaim to
>>>> continue a majority of the time.
>>> But doesn't that mean there's genuine memory pressure? We're effectively
>>> raising the high watermark by 4 MB, but if processes are continuously
>>> allocating, we'd be reclaiming without the gap as well? Unless the
>>> workload
>>> is sized to fit without the gap.
>> It wasn't actual pressure, but the repetitive order-9 THP failures that were
>> waking up kswapd. I should make this more clear in the changelog. After
>> looking into why so much reclaim was occurring though, the compact gap stood
>> out since it dictates the target amount to reclaim.
> But the "amount to reclaim" is still defined as "reach high watermark +
> compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did
> I miss something non-obvious.
Within kswapd_shrink_node(), sc->nr_to_reclaim is the sum of max(zone high
watermark or SWAP_CLUSTER_MAX) for each zone combined. The gap is not 
added to
that reclaim target though. It's used afterward as the threshold for 
abandoning
high order reclaim:

if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
     sc->order = 0;

balance_pgdat() then returns sc->order and that becomes the kswapd 
reclaim_order
value, allowing this branch to be taken:

if (reclaim_order < alloc_order)
     goto kswapd_try_sleep;

Then in prepare_kswapd_sleep(), if pgdat_balanced() succeeds (at order-0),
kcompactd is woken up for the original alloc_order (order-9).

> So if kswapd did any work, it means the memory was consumed (i.e. there was
> some memory pressure) and amount of free memory was below high watermark +
> compact_gap()?
Hmm but kswapd can be woken up on a high order failure despite plenty of 
lower
order availability. That's really the case where compact_gap() matters for
higher orders. Unless by pressure you mean the high order pages were gone?

> BTW, are you using mglru here? (probably not)
> As that might be different and I'm not so familiar with it.
Using classic LRU.

>>>> The over-sized gap also causes 46% of
>>>> order-9 compaction suitability checks to fail unnecessarily - the
>>>> zone has
>>>> sufficient free pages for the scanner to operate, but not enough to clear
>>>> the inflated threshold.
>>>>
>>>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>>>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>>>> gap is <= 32.
>>>>
>>>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
>>> What was the base kernel version?
>> 6.13. Additional benchmarks were done using a recent mm-new build as well,
>> and they showed similar reductions in reclaim.
> If it's a NUMA machine, we recently found an over-reclaim issue there fixed
> by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE
> THP allocations")
Thanks for pointing this out. I tested this on a recent mm-new built that
includes 9c9828d3ead6, and I found the compact_gap() change was still 
helpful.
My understanding is that 9c9828d3ead6 addresses direct reclaim for THP
allocations, while this patch affects the kswapd reclaim-compaction hand-off
path. The test runs still showed a benefit from capping the gap.

>>>> Unpatched (43 hosts)
>>>> pgscan_kswapd (mean/host): ~1.6M
>>>> reclaim efficiency (steal/scan): 83.8%
>>>> compaction success (success/stall): 2.1%
>>>> THP success (alloc/alloc+fallback): 4.9%
>>>> forced lru_add_drain (mean/host): ~107K
>>>>
>>>> Patched (59 hosts)
>>>> pgscan_kswapd (mean/host): ~449K
>>> Did the extra reclaim just disappear because we allow the allocations
>>> to use
>>> 4MB more memory? Or it shifted to direct reclaim?
>> Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
>> What the data shows is that capping the gap allows compaction to take over
>> sooner and start working to produce large size pages needed for THP. Whereas
>> in the pre-patch state, trying to reclaim the full 2x THP delays compaction.
> So do I understand correctly we might have an issue due to lack of
> hysteresis? We require reaching high watermark + compact_gap() to terminate
> reclaim, but then compaction can find out we meanwhile dropped below that
> (due to concurrent allocations) and it's not suitable again?
On an unpatched kernel in a fragmented environment, 
compaction_suitable() can
remain false because the effective threshold for costly orders is the low
watermark + the compact gap. Kswapd has to keep reclaiming in high order 
mode
as a result. By capping the gap at SWAP_CLUSTER_MAX, compaction becomes 
suitable
sooner and kswapd reaches the high order reclaim cutoff sooner. So with 
the patch,
kswapd is able to fall back to order-0 balancing earlier and wake up 
kcompactd
for the original high order request.

> However the suitability checks e.g. compaction_zonelist_suitable() are using
> min watermark, so that should provide the difference already.
> Actually it's low watermark because of __compaction_suitable() adding an
> extra low-min gap for costly orders. But still.
>
> I did just notice compaction_ready() might be too strict. It wants
> effectivly high wmark plus the gap plus the low-min difference. Is it
> perhaps the underlying issue here?
It's a good point. It does seem like that's worth looking into, and I'd be
happy to explore that separately. My thought at the moment though is that
changing compaction_ready() would be a different direction from the the 
original
focus of this patch, which started with the realization that the compaction
scanner working set is bounded by COMPACT_CLUSTER_MAX. Since 
compact_gap() is
used in multiple reclaim and compaction decisions, including 
compaction_ready(),
fixing its definition seemed like the right first change if the gap 
itself is
oversized.

>>>> reclaim efficiency (steal/scan): 91.0%
>>>> compaction success (success/stall): 28.3%
>>> Is this compaction success per compaction stall or per alloc stall?
>> That's per compaction.
>>
>>>> THP success (alloc/alloc+fallback): 17.2%
>>> Weird that things would improve that much. I would expect the free memory
>>> just to stabilize around the lower gap but then behave similarly. Are we
>>> missing something here?
>> This patch was tested in isolation, but also occurring was the case where
>> bursty net allocations reserve many pageblocks as high atomic. So as
>> THP-size pages become eligible, their blocks are reserved before being
>> allocated as THP.
>>
>>>> forced lru_add_drain (mean/host): ~64K
>>>>
>>>> Signed-off-by: JP Kobryn (Meta)<jp.kobryn@linux.dev>
>>>> ---
>>>> include/linux/compaction.h | 8 ++++----
>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>>>> index 173d9c07a8952..09aea63b8a89d 100644
>>>> --- a/include/linux/compaction.h
>>>> +++ b/include/linux/compaction.h
>>>> @@ -2,6 +2,8 @@
>>>> #ifndef _LINUX_COMPACTION_H
>>>> #define _LINUX_COMPACTION_H
>>>> +#include <linux/swap.h>
>>>> +
>>>> /*
>>>> * Determines how hard direct compaction should try to succeed.
>>>> * Lower value means higher priority, analogically to reclaim priority.
>>>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned
>>>> int order)
>>>> * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>>> * that the migrate scanner can have isolated on migrate list, and free
>>>> * scanner is only invoked when the number of isolated free pages is
>>>> - * lower than that. But it's not worth to complicate the formula here
>>>> - * as a bigger gap for higher orders than strictly necessary can also
>>>> - * improve chances of compaction success.
>>>> + * lower than that.
>>>> */
>>>> - return 2UL << order;
>>>> + return min(2UL << order, COMPACT_CLUSTER_MAX);
>>> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?
>> I'm thinking I could reframe this patch as reclaim-focused and use
>> min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
>> either leaving the other non-reclaim users of this function alone or
>> using the 2x form you suggest above. i.e. I can split this function
>> into a separate reclaim_compact_gap() and use the originally proposed cap.
>> Thoughts?
> Do I understand correctly you want to cap the reclaim target by
> COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is?
> But wouldn't that mean we'll actually make changes of passing
> compaction_suitable() worse?
Good call. I was trying to find some middle ground, but I realize that the
change is better left unified.

Also, I tested a 2x COMPACT_CLUSTER_MAX cap and I saw mixed results - either
similar to this patch or worse, with no improvements over the
COMPACT_CLUSTER_MAX cap.

  reply	other threads:[~2026-06-02  1:49 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-19 20:08 [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX JP Kobryn (Meta)
2026-05-25 10:02 ` Vlastimil Babka (SUSE)
2026-05-27  0:10   ` JP Kobryn
2026-05-28  8:51     ` Vlastimil Babka (SUSE)
2026-06-02  1:48       ` JP Kobryn [this message]
2026-06-02  8:40         ` Vlastimil Babka (SUSE)
2026-06-03  7:15           ` JP Kobryn
2026-06-03  8:19             ` Vlastimil Babka (SUSE)
2026-06-03  8:20 ` Vlastimil Babka (SUSE)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b17dd2a3-8ca3-484e-8398-e5423f5df9c4@linux.dev \
    --to=jp.kobryn@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=jackmanb@google.com \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=surenb@google.com \
    --cc=vbabka@kernel.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.