From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id A595BCD6E55 for ; Wed, 3 Jun 2026 07:16:13 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D621B6B0088; Wed, 3 Jun 2026 03:16:12 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id CEC5E6B008A; Wed, 3 Jun 2026 03:16:12 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BB4D56B008C; Wed, 3 Jun 2026 03:16:12 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id A69076B0088 for ; Wed, 3 Jun 2026 03:16:12 -0400 (EDT) Received: from smtpin11.hostedemail.com (lb01a-stub [10.200.18.249]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 2D1A38CDA6 for ; Wed, 3 Jun 2026 07:16:12 +0000 (UTC) X-FDA: 84837742584.11.D5E6CED Received: from out-184.mta0.migadu.com (out-184.mta0.migadu.com [91.218.175.184]) by imf05.hostedemail.com (Postfix) with ESMTP id 99A66100003 for ; Wed, 3 Jun 2026 07:16:08 +0000 (UTC) Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=pZNxQFnK; spf=pass (imf05.hostedemail.com: domain of jp.kobryn@linux.dev designates 91.218.175.184 as permitted sender) smtp.mailfrom=jp.kobryn@linux.dev; dmarc=pass (policy=none) header.from=linux.dev ARC-Seal: i=1; a=rsa-sha256; d=hostedemail.com; s=arc-20220608; cv=none; t=1780470969; b=7Jl0KmzAh4OAz67vYypElrAAJgu8N1OKWHLRT/i0c30kjjX8Ib91XTYwZFi1GAFel+gzgs I1xf39ZBWbLEICjBevWIwm2rafIXrODkFGv2A7DI6eGVM+byjFYefhEVsTLzi4So6w3O0F 8kGlfOucRg1ReE78y4SZtWCXKiAOiQg= ARC-Authentication-Results: i=1; imf05.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b=pZNxQFnK; spf=pass (imf05.hostedemail.com: domain of jp.kobryn@linux.dev designates 91.218.175.184 as permitted sender) smtp.mailfrom=jp.kobryn@linux.dev; dmarc=pass (policy=none) header.from=linux.dev ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1780470969; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=Q5sVXXG5iIB+s5bZUY6unt5WcxY1ePGpKzFek5BfxMA=; b=ZMRaJP/JHnUD5fo2kpDK3y8+MuEQrdQqP86Hk1t7+lQ8dD6wHIWGthPNohm1ZMum9v5rgd kJOJplEmp3X8BLRk2lziByxrpo28MTJZXc01x1NP8eBQeKMcY58lUx0TeiiiAzByeCHk7E F0jMajtGnEy42Z52barE1cE3/FYKM4M= Message-ID: <89e01c8e-1b08-45a6-a702-b23e45b70a29@linux.dev> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780470966; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Q5sVXXG5iIB+s5bZUY6unt5WcxY1ePGpKzFek5BfxMA=; b=pZNxQFnKDfa4zACSef5UpUY/aGkNzGtg0dkRpfw2GtYv5O2Ap7oALYQKZV/t+iG/sUqmXu M9XBUgSy0pyER1DnjNwk6k8tPvmBLEczfg2Rh46i75Bjdc8BJXMPfes7YTkMhvRePB90aQ Bl2OvLcjKYUX70QwouK66aFPFqUTMUU= Date: Wed, 3 Jun 2026 00:15:58 -0700 MIME-Version: 1.0 Subject: Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX To: "Vlastimil Babka (SUSE)" , akpm@linux-foundation.org, surenb@google.com, mhocko@suse.com, jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com, linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, kernel-team@meta.com References: <20260519200851.141955-1-jp.kobryn@linux.dev> <71edb773-d156-49e6-ba6c-8159665a2dbc@kernel.org> Content-Language: en-US X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: JP Kobryn In-Reply-To: <71edb773-d156-49e6-ba6c-8159665a2dbc@kernel.org> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT X-Rspamd-Server: rspam11 X-Rspamd-Queue-Id: 99A66100003 X-Rspam-User: X-Stat-Signature: f6uqswohogbxcsg3y5mz59mezb3pb78m X-HE-Tag: 1780470968-931223 X-HE-Meta: U2FsdGVkX1+VXbL/OGWc/K8KQ5cwH32Ku5vlqkzFPZp7yonxxL9He+0oFo/Zwum1K+omyWHHVLT8xq8pHBkCvlB6a2sfwhAS/IvXhxiYXNApHhilY65eHo80/Jil/pEPAZUqWQ9g8CCjtkE++8yDXNCQLklUur+RqfTwMKY9Fl2one/LoQUw+RiTUFwapP9xJ0DUVcfCbcBd+np0pHvRi639RF9I9lCjKkYCaoZh8fiRdSEDtBH677p3qD1foID/NDKxd3fvoMcYqgD6sYtOOLgFGYsmEYVW1UW58Y17IXMX22XoVyXMGOTOXRRdNjaIu0TV8PMRztu7PwGNz+ho7wvgQZoxT49ajQoSKHru+stSzfnXyhf0Lh4XC8rXALS5cQuFS+LucX2/hQ4Sjm6r2BQOIgWJFs/xyzG+eooZVoVdbDD/DNqwH8xfkrCgWCN8z9MJkrSEzwe3aPGWV3c1Fojk/zPJs0P1EU94JhEr9LEbcLLC6qXFQKp/iAo398o1d8HaxG1hp0o1Fv0qIQ5WUck4MGsYHZFutyKdAWbsJ+dOBXbDPyWcPUQqS3fvP9oRUHsbOCfOvrWY1Ztlv9MtHBZar8r6Urnuuiaykmi1M1sAOS/NbDCusAS3xscVXEbXS/gVqetwfLWo/85jUzy5yCm7UN81yoSCVZLzvjw7YJuCj1sunMm6e6xbuQukCO0WwVLelnI7dNN5slctqW928VxPwINbEo9y/Ar2F23R4zCHO68MvJdIZJ7MwiUF7P3xua5gElRy5nvHCPogqhVWbFsPUrbS08qKw02XLhHFSe9i7orpY0LkvKNdGxMHfcKXUgmQyI1DmN9vS+ppsMO3oQgKqRVX38rM3xpMrjSWRihARqkWJn0tJ4AxAs7vhyxFH0N7OYjcZT6IZVBwHwyUhaVrTlZDGdI03mKVgKpNDGzMMf8axpYee2WxgXt9hAX8PszalQb+09KksCqrzBk 1i1jHdR8 ItmqzDo5wgKBc8p9wA7Ps6YQnx3flAgOAsc0cYMZuBkbR2k8eRxBazhu60h78DOIjegNZDvFfRDRp+tvBZSIpvYmECIFxM2gRaUGr/0HEgLz2RKK0rzyE4nAc0XhwGa5eYSk2plXB14WvLFo3GC+Vqes8h/yGGmR+EMUvaSxT5LtLL9gHc9Aj6AHJAdiqkUEX99qSKXFX1SETIk8itOA1gntsriAYB0Q06vd8lTBLOFLuGK2miIJtTZCKehkpsmpkeijth1Nixwa9HaYtfQIKxeg5pb9LuUcle5NtI3NDQTaXjnIMeJsUYEjjig3XyTk6KBqr Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 6/2/26 1:40 AM, Vlastimil Babka (SUSE) wrote: > On 6/2/26 03:48, JP Kobryn wrote: >> On 5/28/26 1:51 AM, Vlastimil Babka (SUSE) wrote: >>> On 5/27/26 02:10, JP Kobryn wrote: >>>> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote: >>>>> On 5/19/26 22:08, JP Kobryn (Meta) wrote: >>>>>> compact_gap() returns 2 << order, which is used as watermark headroom in >>>>>> __compaction_suitable() and as a reclaim target in kswapd. The computed >>>>>> value scales exponentially by order. For order-9 THP allocations this >>>>>> evaluates to 1024 pages, but the compaction free scanner's working set is >>>>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops >>>>>> isolating free >>>>>> pages once it matches the migration batch. The current gap >>>>>> over-reserves by >>>>>> 32x. >>>>>> >>>>>> On fragmented production hosts, kswapd will try and reclaim up to the >>>>>> gap, >>>>>> but it only reaches that threshold 18% of the time, causing reclaim to >>>>>> continue a majority of the time. >>>>> But doesn't that mean there's genuine memory pressure? We're effectively >>>>> raising the high watermark by 4 MB, but if processes are continuously >>>>> allocating, we'd be reclaiming without the gap as well? Unless the >>>>> workload >>>>> is sized to fit without the gap. >>>> It wasn't actual pressure, but the repetitive order-9 THP failures that were >>>> waking up kswapd. I should make this more clear in the changelog. After >>>> looking into why so much reclaim was occurring though, the compact gap stood >>>> out since it dictates the target amount to reclaim. >>> But the "amount to reclaim" is still defined as "reach high watermark + >>> compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did >>> I miss something non-obvious. >> Within kswapd_shrink_node(), sc->nr_to_reclaim is the sum of max(zone high >> watermark or SWAP_CLUSTER_MAX) for each zone combined. The gap is not >> added to >> that reclaim target though. It's used afterward as the threshold for >> abandoning >> high order reclaim: >> >> if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order)) >>     sc->order = 0; >> >> balance_pgdat() then returns sc->order and that becomes the kswapd >> reclaim_order >> value, allowing this branch to be taken: >> >> if (reclaim_order < alloc_order) >>     goto kswapd_try_sleep; >> >> Then in prepare_kswapd_sleep(), if pgdat_balanced() succeeds (at order-0), >> kcompactd is woken up for the original alloc_order (order-9). > > Oh I see, thanks for explaining. I think it makes sense to target this > particular part (checking sc->nr_reclaimed) than change compact_gap() > globally then? It seems we have some mismatch in the various heuristics? IIUC: I gave this a try and got some interesting results. Based on mm-new as of earlier today, I ran three variations: original compact_gap (2 << order), capped compact_gap (this patch), and capped downgrade gate which has the original compact_gap (2 << order) but caps within kswapd_shrink_node(): - if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order)) { + if (sc->order && sc->nr_reclaimed >= + min(compact_gap(sc->order), SWAP_CLUSTER_MAX)) { sc->order = 0; } The new approach showed improvements in THP allocations. thp_fault_fallback original gap: 1217 capped gap (global): 738 capped gap at downgrade gate: 898 More details are below. > > - in shrink_node() we have a should_continue_reclaim() call, which will > return false as soon as compaction is suitable, but before that, we are > likely to not accumulate enough sc->nr_reclaimed, because sc->nr_to_reclaim > would be capped by SWAP_CLUSTER_MAX's > > - thus we won't pass the sc->nr_reclaimed >= compact_gap check in > kswapd_shrink_node() > > - balance_pgdat() will keep looping because we're not raising priority > (kswapd_shrink_node() returned a high order) and pgdat_balanced() is false > (it checks for high-order page availability) I added some temporary tracepoints to verify paths taken. The average hits across three 60s runs are shown below. kswapd_shrink_node downgrade to order-0 original gap: 0 capped gap (global, this patch): 28 capped gap at downgrade gate: 80 So the downgrades are more frequent, but the suggested approach regressed harder in terms of reclaim. pgscan_kswapd original gap: 6328 capped gap (global, this patch): 3773 capped gap at downgrade gate: 7988 pgsteal_kswapd original gap: 5657 capped gap (global, this patch): 3243 capped gap at downgrade gate: 7101 This is because the suitability checks are still using the inflated gap causing the split below. kswapd_shrink_node() gap: 32 __compaction_suitable() gap: 1024 So it seems that capping globally (this patch) is the better option to avoid the split above which causes unnecessary reclaim. > > Maybe only reduce the sc->nr_reclaimed threshold to 2*COMPACT_CLUSTER_MAX then? This was considered as well. But with the new findings, it would imply even more reclaim. > >>> So if kswapd did any work, it means the memory was consumed (i.e. there was >>> some memory pressure) and amount of free memory was below high watermark + >>> compact_gap()? >> Hmm but kswapd can be woken up on a high order failure despite plenty of >> lower >> order availability. That's really the case where compact_gap() matters for >> higher orders. > > Ack, thanks. > >> Unless by pressure you mean the high order pages were gone? > > No. > >> >>> BTW, are you using mglru here? (probably not) >>> As that might be different and I'm not so familiar with it. >> Using classic LRU. >> >>>>>> The over-sized gap also causes 46% of >>>>>> order-9 compaction suitability checks to fail unnecessarily - the >>>>>> zone has >>>>>> sufficient free pages for the scanner to operate, but not enough to clear >>>>>> the inflated threshold. >>>>>> >>>>>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom >>>>>> with the scanner's actual capacity. Orders 0-4 are unaffected since their >>>>>> gap is <= 32. >>>>>> >>>>>> A/B test on ~100 instagram production hosts (64GB, 60s measurement): >>>>> What was the base kernel version? >>>> 6.13. Additional benchmarks were done using a recent mm-new build as well, >>>> and they showed similar reductions in reclaim. >>> If it's a NUMA machine, we recently found an over-reclaim issue there fixed >>> by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE >>> THP allocations") >> Thanks for pointing this out. I tested this on a recent mm-new built that >> includes 9c9828d3ead6, and I found the compact_gap() change was still >> helpful. > > OK. > >> My understanding is that 9c9828d3ead6 addresses direct reclaim for THP >> allocations, while this patch affects the kswapd reclaim-compaction hand-off >> path. The test runs still showed a benefit from capping the gap. > > Yep. > >>>>>> Unpatched (43 hosts) >>>>>> pgscan_kswapd (mean/host): ~1.6M >>>>>> reclaim efficiency (steal/scan): 83.8% >>>>>> compaction success (success/stall): 2.1% >>>>>> THP success (alloc/alloc+fallback): 4.9% >>>>>> forced lru_add_drain (mean/host): ~107K >>>>>> >>>>>> Patched (59 hosts) >>>>>> pgscan_kswapd (mean/host): ~449K >>>>> Did the extra reclaim just disappear because we allow the allocations >>>>> to use >>>>> 4MB more memory? Or it shifted to direct reclaim? >>>> Specifically in the order-9 case, the reclaim target goes from 1024 to 32. >>>> What the data shows is that capping the gap allows compaction to take over >>>> sooner and start working to produce large size pages needed for THP. Whereas >>>> in the pre-patch state, trying to reclaim the full 2x THP delays compaction. >>> So do I understand correctly we might have an issue due to lack of >>> hysteresis? We require reaching high watermark + compact_gap() to terminate >>> reclaim, but then compaction can find out we meanwhile dropped below that >>> (due to concurrent allocations) and it's not suitable again? >> On an unpatched kernel in a fragmented environment, >> compaction_suitable() can >> remain false because the effective threshold for costly orders is the low >> watermark + the compact gap. Kswapd has to keep reclaiming in high order >> mode >> as a result. > > I think this part might be ok. > >> By capping the gap at SWAP_CLUSTER_MAX, compaction becomes >> suitable >> sooner and kswapd reaches the high order reclaim cutoff sooner. So with > > And the problem is with the cutoff only which is not based on a > watermark+gap threshold but wants to reclaim at least gap pages regardless > of how many pages are already free. > >> the patch, >> kswapd is able to fall back to order-0 balancing earlier and wake up >> kcompactd >> for the original high order request. > > Yeah that's likely the crucial part. > >>> However the suitability checks e.g. compaction_zonelist_suitable() are using >>> min watermark, so that should provide the difference already. >>> Actually it's low watermark because of __compaction_suitable() adding an >>> extra low-min gap for costly orders. But still. >>> >>> I did just notice compaction_ready() might be too strict. It wants >>> effectivly high wmark plus the gap plus the low-min difference. Is it >>> perhaps the underlying issue here? >> It's a good point. It does seem like that's worth looking into, and I'd be >> happy to explore that separately.My thought at the moment though is that >> changing compaction_ready() would be a different direction from the the >> original >> focus of this patch, which started with the realization that the compaction >> scanner working set is bounded by COMPACT_CLUSTER_MAX. Since > > Ack. > >> compact_gap() is >> used in multiple reclaim and compaction decisions, including >> compaction_ready(), >> fixing its definition seemed like the right first change if the gap >> itself is >> oversized. > > Still I'd try to address the sc->nr_reclaimed usage first, and see if that's > enough. Let me know if I covered that above. > >>>>>> reclaim efficiency (steal/scan): 91.0% >>>>>> compaction success (success/stall): 28.3% >>>>> Is this compaction success per compaction stall or per alloc stall? >>>> That's per compaction. >>>> >>>>>> THP success (alloc/alloc+fallback): 17.2% >>>>> Weird that things would improve that much. I would expect the free memory >>>>> just to stabilize around the lower gap but then behave similarly. Are we >>>>> missing something here? >>>> This patch was tested in isolation, but also occurring was the case where >>>> bursty net allocations reserve many pageblocks as high atomic. So as >>>> THP-size pages become eligible, their blocks are reserved before being >>>> allocated as THP. >>>> >>>>>> forced lru_add_drain (mean/host): ~64K >>>>>> >>>>>> Signed-off-by: JP Kobryn (Meta) >>>>>> --- >>>>>> include/linux/compaction.h | 8 ++++---- >>>>>> 1 file changed, 4 insertions(+), 4 deletions(-) >>>>>> >>>>>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h >>>>>> index 173d9c07a8952..09aea63b8a89d 100644 >>>>>> --- a/include/linux/compaction.h >>>>>> +++ b/include/linux/compaction.h >>>>>> @@ -2,6 +2,8 @@ >>>>>> #ifndef _LINUX_COMPACTION_H >>>>>> #define _LINUX_COMPACTION_H >>>>>> +#include >>>>>> + >>>>>> /* >>>>>> * Determines how hard direct compaction should try to succeed. >>>>>> * Lower value means higher priority, analogically to reclaim priority. >>>>>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned >>>>>> int order) >>>>>> * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum >>>>>> * that the migrate scanner can have isolated on migrate list, and free >>>>>> * scanner is only invoked when the number of isolated free pages is >>>>>> - * lower than that. But it's not worth to complicate the formula here >>>>>> - * as a bigger gap for higher orders than strictly necessary can also >>>>>> - * improve chances of compaction success. >>>>>> + * lower than that. >>>>>> */ >>>>>> - return 2UL << order; >>>>>> + return min(2UL << order, COMPACT_CLUSTER_MAX); >>>>> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX? >>>> I'm thinking I could reframe this patch as reclaim-focused and use >>>> min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while >>>> either leaving the other non-reclaim users of this function alone or >>>> using the 2x form you suggest above. i.e. I can split this function >>>> into a separate reclaim_compact_gap() and use the originally proposed cap. >>>> Thoughts? >>> Do I understand correctly you want to cap the reclaim target by >>> COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is? >>> But wouldn't that mean we'll actually make changes of passing >>> compaction_suitable() worse? >> Good call. I was trying to find some middle ground, but I realize that the >> change is better left unified. > > My question was based on not understanding the underlying issue, and that > the "reclaim-only target" isn't based on watermark+gap but "reclaim gap > worth of pages". Now I think sc->nr_reclaimed is indeed the check that > should be relaxed first. Hopefully the data helps this time around.