All of lore.kernel.org
 help / color / mirror / Atom feed
From: JP Kobryn <jp.kobryn@linux.dev>
To: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>,
	akpm@linux-foundation.org, surenb@google.com, mhocko@suse.com,
	jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
	linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
Date: Wed, 3 Jun 2026 00:15:58 -0700	[thread overview]
Message-ID: <89e01c8e-1b08-45a6-a702-b23e45b70a29@linux.dev> (raw)
In-Reply-To: <71edb773-d156-49e6-ba6c-8159665a2dbc@kernel.org>

On 6/2/26 1:40 AM, Vlastimil Babka (SUSE) wrote:
> On 6/2/26 03:48, JP Kobryn wrote:
>> On 5/28/26 1:51 AM, Vlastimil Babka (SUSE) wrote:
>>> On 5/27/26 02:10, JP Kobryn wrote:
>>>> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
>>>>> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>>>>>> compact_gap() returns 2 << order, which is used as watermark headroom in
>>>>>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>>>>>> value scales exponentially by order. For order-9 THP allocations this
>>>>>> evaluates to 1024 pages, but the compaction free scanner's working set is
>>>>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops
>>>>>> isolating free
>>>>>> pages once it matches the migration batch. The current gap
>>>>>> over-reserves by
>>>>>> 32x.
>>>>>>
>>>>>> On fragmented production hosts, kswapd will try and reclaim up to the
>>>>>> gap,
>>>>>> but it only reaches that threshold 18% of the time, causing reclaim to
>>>>>> continue a majority of the time.
>>>>> But doesn't that mean there's genuine memory pressure? We're effectively
>>>>> raising the high watermark by 4 MB, but if processes are continuously
>>>>> allocating, we'd be reclaiming without the gap as well? Unless the
>>>>> workload
>>>>> is sized to fit without the gap.
>>>> It wasn't actual pressure, but the repetitive order-9 THP failures that were
>>>> waking up kswapd. I should make this more clear in the changelog. After
>>>> looking into why so much reclaim was occurring though, the compact gap stood
>>>> out since it dictates the target amount to reclaim.
>>> But the "amount to reclaim" is still defined as "reach high watermark +
>>> compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did
>>> I miss something non-obvious.
>> Within kswapd_shrink_node(), sc->nr_to_reclaim is the sum of max(zone high
>> watermark or SWAP_CLUSTER_MAX) for each zone combined. The gap is not 
>> added to
>> that reclaim target though. It's used afterward as the threshold for 
>> abandoning
>> high order reclaim:
>>
>> if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
>>      sc->order = 0;
>>
>> balance_pgdat() then returns sc->order and that becomes the kswapd 
>> reclaim_order
>> value, allowing this branch to be taken:
>>
>> if (reclaim_order < alloc_order)
>>      goto kswapd_try_sleep;
>>
>> Then in prepare_kswapd_sleep(), if pgdat_balanced() succeeds (at order-0),
>> kcompactd is woken up for the original alloc_order (order-9).
> 
> Oh I see, thanks for explaining. I think it makes sense to target this
> particular part (checking sc->nr_reclaimed) than change compact_gap()
> globally then? It seems we have some mismatch in the various heuristics? IIUC:

I gave this a try and got some interesting results. Based on mm-new as
of earlier today, I ran three variations: original compact_gap (2 <<
order), capped compact_gap (this patch), and capped downgrade gate which
has the original compact_gap (2 << order) but caps within
kswapd_shrink_node():

-       if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order)) {
+       if (sc->order && sc->nr_reclaimed >=
+           min(compact_gap(sc->order), SWAP_CLUSTER_MAX)) {
              sc->order = 0;
        }

The new approach showed improvements in THP allocations.

thp_fault_fallback
  original gap: 1217
  capped gap (global): 738
  capped gap at downgrade gate: 898

More details are below.

> 
> - in shrink_node() we have a should_continue_reclaim() call, which will
> return false as soon as compaction is suitable, but before that, we are
> likely to not accumulate enough sc->nr_reclaimed, because sc->nr_to_reclaim
> would be capped by SWAP_CLUSTER_MAX's
> 
> - thus we won't pass the sc->nr_reclaimed >= compact_gap check in
> kswapd_shrink_node()
> 
> - balance_pgdat() will keep looping because we're not raising priority
> (kswapd_shrink_node() returned a high order) and pgdat_balanced() is false
> (it checks for high-order page availability)

I added some temporary tracepoints to verify paths taken. The average
hits across three 60s runs are shown below.

kswapd_shrink_node downgrade to order-0
  original gap:                     0
  capped gap (global, this patch): 28
  capped gap at downgrade gate:    80

So the downgrades are more frequent, but the suggested approach
regressed harder in terms of reclaim.

pgscan_kswapd
  original gap:                    6328
  capped gap (global, this patch): 3773
  capped gap at downgrade gate:    7988

pgsteal_kswapd
  original gap:                    5657
  capped gap (global, this patch): 3243
  capped gap at downgrade gate:    7101

This is because the suitability checks are still using the inflated gap
causing the split below.
  kswapd_shrink_node() gap:      32
  __compaction_suitable() gap: 1024

So it seems that capping globally (this patch) is the better option to
avoid the split above which causes unnecessary reclaim.

> 
> Maybe only reduce the sc->nr_reclaimed threshold to 2*COMPACT_CLUSTER_MAX then?

This was considered as well. But with the new findings, it would imply
even more reclaim.

> 
>>> So if kswapd did any work, it means the memory was consumed (i.e. there was
>>> some memory pressure) and amount of free memory was below high watermark +
>>> compact_gap()?
>> Hmm but kswapd can be woken up on a high order failure despite plenty of 
>> lower
>> order availability. That's really the case where compact_gap() matters for
>> higher orders.
> 
> Ack, thanks.
> 
>> Unless by pressure you mean the high order pages were gone?
> 
> No.
> 
>>
>>> BTW, are you using mglru here? (probably not)
>>> As that might be different and I'm not so familiar with it.
>> Using classic LRU.
>>
>>>>>> The over-sized gap also causes 46% of
>>>>>> order-9 compaction suitability checks to fail unnecessarily - the
>>>>>> zone has
>>>>>> sufficient free pages for the scanner to operate, but not enough to clear
>>>>>> the inflated threshold.
>>>>>>
>>>>>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>>>>>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>>>>>> gap is <= 32.
>>>>>>
>>>>>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
>>>>> What was the base kernel version?
>>>> 6.13. Additional benchmarks were done using a recent mm-new build as well,
>>>> and they showed similar reductions in reclaim.
>>> If it's a NUMA machine, we recently found an over-reclaim issue there fixed
>>> by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE
>>> THP allocations")
>> Thanks for pointing this out. I tested this on a recent mm-new built that
>> includes 9c9828d3ead6, and I found the compact_gap() change was still 
>> helpful.
> 
> OK.
> 
>> My understanding is that 9c9828d3ead6 addresses direct reclaim for THP
>> allocations, while this patch affects the kswapd reclaim-compaction hand-off
>> path. The test runs still showed a benefit from capping the gap.
> 
> Yep.
> 
>>>>>> Unpatched (43 hosts)
>>>>>> pgscan_kswapd (mean/host): ~1.6M
>>>>>> reclaim efficiency (steal/scan): 83.8%
>>>>>> compaction success (success/stall): 2.1%
>>>>>> THP success (alloc/alloc+fallback): 4.9%
>>>>>> forced lru_add_drain (mean/host): ~107K
>>>>>>
>>>>>> Patched (59 hosts)
>>>>>> pgscan_kswapd (mean/host): ~449K
>>>>> Did the extra reclaim just disappear because we allow the allocations
>>>>> to use
>>>>> 4MB more memory? Or it shifted to direct reclaim?
>>>> Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
>>>> What the data shows is that capping the gap allows compaction to take over
>>>> sooner and start working to produce large size pages needed for THP. Whereas
>>>> in the pre-patch state, trying to reclaim the full 2x THP delays compaction.
>>> So do I understand correctly we might have an issue due to lack of
>>> hysteresis? We require reaching high watermark + compact_gap() to terminate
>>> reclaim, but then compaction can find out we meanwhile dropped below that
>>> (due to concurrent allocations) and it's not suitable again?
>> On an unpatched kernel in a fragmented environment, 
>> compaction_suitable() can
>> remain false because the effective threshold for costly orders is the low
>> watermark + the compact gap. Kswapd has to keep reclaiming in high order 
>> mode
>> as a result.
> 
> I think this part might be ok.
> 
>> By capping the gap at SWAP_CLUSTER_MAX, compaction becomes 
>> suitable
>> sooner and kswapd reaches the high order reclaim cutoff sooner. So with 
> 
> And the problem is with the cutoff only which is not based on a
> watermark+gap threshold but wants to reclaim at least gap pages regardless
> of how many pages are already free.
> 
>> the patch,
>> kswapd is able to fall back to order-0 balancing earlier and wake up 
>> kcompactd
>> for the original high order request.
> 
> Yeah that's likely the crucial part.
> 
>>> However the suitability checks e.g. compaction_zonelist_suitable() are using
>>> min watermark, so that should provide the difference already.
>>> Actually it's low watermark because of __compaction_suitable() adding an
>>> extra low-min gap for costly orders. But still.
>>>
>>> I did just notice compaction_ready() might be too strict. It wants
>>> effectivly high wmark plus the gap plus the low-min difference. Is it
>>> perhaps the underlying issue here?
>> It's a good point. It does seem like that's worth looking into, and I'd be
>> happy to explore that separately.My thought at the moment though is that
>> changing compaction_ready() would be a different direction from the the 
>> original
>> focus of this patch, which started with the realization that the compaction
>> scanner working set is bounded by COMPACT_CLUSTER_MAX. Since 
> 
> Ack.
> 
>> compact_gap() is
>> used in multiple reclaim and compaction decisions, including 
>> compaction_ready(),
>> fixing its definition seemed like the right first change if the gap 
>> itself is
>> oversized.
> 
> Still I'd try to address the sc->nr_reclaimed usage first, and see if that's
> enough.

Let me know if I covered that above.

> 
>>>>>> reclaim efficiency (steal/scan): 91.0%
>>>>>> compaction success (success/stall): 28.3%
>>>>> Is this compaction success per compaction stall or per alloc stall?
>>>> That's per compaction.
>>>>
>>>>>> THP success (alloc/alloc+fallback): 17.2%
>>>>> Weird that things would improve that much. I would expect the free memory
>>>>> just to stabilize around the lower gap but then behave similarly. Are we
>>>>> missing something here?
>>>> This patch was tested in isolation, but also occurring was the case where
>>>> bursty net allocations reserve many pageblocks as high atomic. So as
>>>> THP-size pages become eligible, their blocks are reserved before being
>>>> allocated as THP.
>>>>
>>>>>> forced lru_add_drain (mean/host): ~64K
>>>>>>
>>>>>> Signed-off-by: JP Kobryn (Meta)<jp.kobryn@linux.dev>
>>>>>> ---
>>>>>> include/linux/compaction.h | 8 ++++----
>>>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>>>>>> index 173d9c07a8952..09aea63b8a89d 100644
>>>>>> --- a/include/linux/compaction.h
>>>>>> +++ b/include/linux/compaction.h
>>>>>> @@ -2,6 +2,8 @@
>>>>>> #ifndef _LINUX_COMPACTION_H
>>>>>> #define _LINUX_COMPACTION_H
>>>>>> +#include <linux/swap.h>
>>>>>> +
>>>>>> /*
>>>>>> * Determines how hard direct compaction should try to succeed.
>>>>>> * Lower value means higher priority, analogically to reclaim priority.
>>>>>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned
>>>>>> int order)
>>>>>> * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>>>>> * that the migrate scanner can have isolated on migrate list, and free
>>>>>> * scanner is only invoked when the number of isolated free pages is
>>>>>> - * lower than that. But it's not worth to complicate the formula here
>>>>>> - * as a bigger gap for higher orders than strictly necessary can also
>>>>>> - * improve chances of compaction success.
>>>>>> + * lower than that.
>>>>>> */
>>>>>> - return 2UL << order;
>>>>>> + return min(2UL << order, COMPACT_CLUSTER_MAX);
>>>>> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?
>>>> I'm thinking I could reframe this patch as reclaim-focused and use
>>>> min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
>>>> either leaving the other non-reclaim users of this function alone or
>>>> using the 2x form you suggest above. i.e. I can split this function
>>>> into a separate reclaim_compact_gap() and use the originally proposed cap.
>>>> Thoughts?
>>> Do I understand correctly you want to cap the reclaim target by
>>> COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is?
>>> But wouldn't that mean we'll actually make changes of passing
>>> compaction_suitable() worse?
>> Good call. I was trying to find some middle ground, but I realize that the
>> change is better left unified.
> 
> My question was based on not understanding the underlying issue, and that
> the "reclaim-only target" isn't based on watermark+gap but "reclaim gap
> worth of pages".  Now I think sc->nr_reclaimed is indeed the check that
> should be relaxed first.

Hopefully the data helps this time around.


  reply	other threads:[~2026-06-03  7:16 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-19 20:08 [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX JP Kobryn (Meta)
2026-05-25 10:02 ` Vlastimil Babka (SUSE)
2026-05-27  0:10   ` JP Kobryn
2026-05-28  8:51     ` Vlastimil Babka (SUSE)
2026-06-02  1:48       ` JP Kobryn
2026-06-02  8:40         ` Vlastimil Babka (SUSE)
2026-06-03  7:15           ` JP Kobryn [this message]
2026-06-03  8:19             ` Vlastimil Babka (SUSE)
2026-06-03  8:20 ` Vlastimil Babka (SUSE)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=89e01c8e-1b08-45a6-a702-b23e45b70a29@linux.dev \
    --to=jp.kobryn@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=jackmanb@google.com \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=surenb@google.com \
    --cc=vbabka@kernel.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.