From: Salvatore Dipietro <dipiets@amazon.it>
To: <willy@infradead.org>
Cc: <abuehaze@amazon.com>, <akpm@linux-foundation.org>,
<alisaidi@amazon.com>, <blakgeof@amazon.com>,
<brauner@kernel.org>, <dipietro.salvatore@gmail.com>,
<dipiets@amazon.it>, <djwong@kernel.org>,
<linux-fsdevel@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<linux-mm@kvack.org>, <linux-xfs@vger.kernel.org>,
<ritesh.list@gmail.com>, <stable@vger.kernel.org>,
<vbabka@suse.com>
Subject: Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
Date: Wed, 6 May 2026 12:33:18 +0000 [thread overview]
Message-ID: <20260506123326.17293-1-dipiets@amazon.it> (raw)
In-Reply-To: <afc3xFgKogxF5Lbq@casper.infradead.org>
On 5/03/26 05:52, Ritesh Harjani wrote:
> Also as per the documentation [1], huge_pages=try option is the default
> setting. So I am assuming in production we at least won't suffer from
> this memory fragmentation, correct?
Yes, huge_pages=try is the default option, but unless enough huge pages to back
the entire shared_buffers size are pre-allocated via "vm.nr_hugepages" (which is
not done automatically), huge pages will not be used and the system effectively
falls back to the huge_pages=off behaviour. Even with a partial pre-allocation,
PostgreSQL will not be able to use huge pages.
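As a hypothetical illustration of the sizing involved (the values below are
assumed, not taken from our setup): with shared_buffers = 8 GiB and the common
2 MiB huge page size (see Hugepagesize in /proc/meminfo), the required
vm.nr_hugepages can be computed as:

```shell
# Assumed sizing: shared_buffers = 8 GiB (8192 MiB), 2 MiB huge pages.
# PostgreSQL needs the whole shared memory segment backed by huge pages,
# so round the page count up rather than down.
shared_buffers_mb=8192
hugepage_size_mb=2
nr_hugepages=$(( (shared_buffers_mb + hugepage_size_mb - 1) / hugepage_size_mb ))
echo "$nr_hugepages"
# Apply with e.g.: sysctl -w vm.nr_hugepages=$nr_hugepages
```

Note the pool must be large enough for the whole segment; as mentioned above,
a partial pool does not help.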
On 5/03/26 11:55, Matthew Wilcox wrote:
> or we need more understandable GFP flags. Or the page allocator could
> use the __GFP_NORETRY flag to say "oh well, this allocation has a fallback,
> I'll kick kcompactd to try to compact some more memory, but I'll fail
> the allocation".
We also tested kicking off kcompactd in the background when __GFP_NORETRY is
passed, returning "nopage" to avoid blocking the folio allocation request.
Here is the patch, tested under the same conditions as the others, on top of the
PREEMPT_NONE patch [1]:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65e205111553..d4f322910992 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4818,6 +4818,26 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (current->flags & PF_MEMALLOC)
goto nopage;
+ /*
+ * Costly allocations with __GFP_NORETRY are opportunistic: don't
+ * stall on direct compaction or reclaim; instead, kick
+ * kcompactd on the preferred node so large pages may become
+ * available for future allocations and let the caller fall back now.
+ *
+ * Direct compaction is way too costly for hot allocation paths on
+ * large systems: each attempt calls drain_all_pages() which IPIs
+ * every CPU. Only wake kcompactd on the local node to avoid
+ * cross-NUMA interference with unrelated workloads.
+ */
+ if (costly_order && (gfp_mask & __GFP_NORETRY)) {
+ struct zone *preferred_zone = ac->preferred_zoneref->zone;
+
+ if (preferred_zone)
+ wakeup_kcompactd(preferred_zone->zone_pgdat, order,
+ ac->highest_zoneidx);
+ goto nopage;
+ }
+
/* Try direct reclaim and then allocating */
if (!compact_first) {
page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags,
Here are the results we collected, with the new kcompactd-background variant
added alongside the earlier numbers:
| Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline |
|----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
| Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — |
| Proposed patch | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45% |
| Ritesh's suggestion | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50% |
| Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07% |
| kcompactd background | 146,760.75 | 128,094.92 | 127,979.74 | 134,278.47 | +31.67% |
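For reference, the averages and the "% vs Baseline" column can be re-derived
from the raw per-run throughput numbers in the table; a quick awk sketch for
the baseline and kcompactd-background rows:

```shell
# Recompute the table's averages and percentage delta for the baseline
# and kcompactd-background rows from the three per-run values of each.
awk 'BEGIN {
    baseline  = (107064.61 + 97043.86 + 101830.78) / 3
    kcompactd = (146760.75 + 128094.92 + 127979.74) / 3
    printf "%.2f %.2f +%.2f%%\n", baseline, kcompactd,
           (kcompactd / baseline - 1) * 100
}'
```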
[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368
Thread overview: 17+ messages
2026-04-03 19:35 [PATCH 0/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
2026-04-04 1:13 ` Ritesh Harjani
2026-04-04 4:15 ` Matthew Wilcox
2026-04-04 16:47 ` Ritesh Harjani
2026-04-04 20:46 ` Matthew Wilcox
2026-04-16 15:14 ` Ritesh Harjani
2026-04-20 16:33 ` Salvatore Dipietro
2026-04-20 18:44 ` Matthew Wilcox
2026-04-21 1:16 ` Ritesh Harjani
2026-04-28 15:02 ` Salvatore Dipietro
2026-05-03 5:52 ` Ritesh Harjani
2026-05-03 11:55 ` Matthew Wilcox
2026-05-06 12:33 ` Salvatore Dipietro [this message]
2026-04-05 22:43 ` Dave Chinner
2026-04-07 5:40 ` Christoph Hellwig
2026-04-21 9:02 ` Vlastimil Babka