From: Salvatore Dipietro <dipiets@amazon.it>
To: <willy@infradead.org>
Cc: <abuehaze@amazon.com>, <akpm@linux-foundation.org>,
<alisaidi@amazon.com>, <blakgeof@amazon.com>,
<brauner@kernel.org>, <dipietro.salvatore@gmail.com>,
<dipiets@amazon.it>, <djwong@kernel.org>,
<linux-fsdevel@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<linux-mm@kvack.org>, <linux-xfs@vger.kernel.org>,
<ritesh.list@gmail.com>, <stable@vger.kernel.org>,
<vbabka@suse.com>
Subject: Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
Date: Wed, 6 May 2026 12:33:18 +0000
Message-ID: <20260506123326.17293-1-dipiets@amazon.it>
In-Reply-To: <afc3xFgKogxF5Lbq@casper.infradead.org>
On 5/03/26 05:52, Ritesh Harjani wrote:
> Also as per the documentation [1], huge_pages=try option is the default
> setting. So I am assuming in production we at least won't suffer from
> this memory fragmentation, correct?
Yes, huge_pages=try is the default, but huge pages are only used if enough
of them have been pre-allocated via "vm.nr_hugepages" to hold the entire
shared_buffers region, and that is not done automatically. Without that
pre-allocation the system behaves as if huge_pages=off were set. A partial
pre-allocation does not help either: PostgreSQL will still not be able to
use huge pages.
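For illustration, the pre-allocation requirement above amounts to a round-up division. The sketch below is not taken from PostgreSQL or the kernel: the helper name is hypothetical, and the 2 MiB huge page size used in the usage note is an assumption (the real value is the Hugepagesize reported in /proc/meminfo). PostgreSQL's actual shared memory request is typically somewhat larger than shared_buffers alone, so treat the result as a lower bound.

```c
#include <stdint.h>

/*
 * Hypothetical helper: minimum number of huge pages needed to back
 * `bytes` of shared memory when each huge page is `hugepage_bytes` long.
 */
static uint64_t hugepages_needed(uint64_t bytes, uint64_t hugepage_bytes)
{
	/* Round up so that a partially used trailing page is still counted. */
	return (bytes + hugepage_bytes - 1) / hugepage_bytes;
}
```

Under these assumptions, an 8 GiB region with 2 MiB huge pages gives hugepages_needed(8ULL << 30, 2ULL << 20) == 4096, so vm.nr_hugepages would need to be at least 4096.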
On 5/03/26 11:55, Matthew Wilcox wrote:
> or we need more understandable GFP flags. Or the page allocator could
> use the __GFP_NORETRY flag to say "oh well, this allocation has a fallback,
> I'll kick kcompactd to try to compact some more memory, but I'll fail
> the allocation".
We also tested kicking kcompactd in the background when __GFP_NORETRY is
passed and returning "nopage" right away, so that the folio allocation
request does not block. Here is the patch, tested the same way as the
others, with the PREEMPT_NONE patch [1] applied:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65e205111553..d4f322910992 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4818,6 +4818,26 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (current->flags & PF_MEMALLOC)
 		goto nopage;

+	/*
+	 * Costly allocations with __GFP_NORETRY are opportunistic - Don't
+	 * stall on direct compaction or reclaim; instead, kick
+	 * kcompactd on the preferred node so large pages may become
+	 * available for future allocations and let the caller fall back now.
+	 *
+	 * Direct compaction is way too costly for hot allocation paths on
+	 * large systems: each attempt calls drain_all_pages() which IPIs
+	 * every CPU. Only wake kcompactd on the local node to avoid
+	 * cross-NUMA interference with unrelated workloads.
+	 */
+	if (costly_order && (gfp_mask & __GFP_NORETRY)) {
+		struct zone *preferred_zone = ac->preferred_zoneref->zone;
+
+		if (preferred_zone)
+			wakeup_kcompactd(preferred_zone->zone_pgdat, order,
+					 ac->highest_zoneidx);
+		goto nopage;
+	}
+
 	/* Try direct reclaim and then allocating */
 	if (!compact_first) {
 		page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags,
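The branch added by this hunk can be modelled in userspace to make the decision explicit. This is only a sketch of the condition, not kernel code: the enum, the function name, and the flag bit are hypothetical, and the only value borrowed from the kernel is the costly-order threshold (PAGE_ALLOC_COSTLY_ORDER, which is 3).

```c
#include <stdbool.h>

#define MODEL_COSTLY_ORDER 3u         /* mirrors PAGE_ALLOC_COSTLY_ORDER */
#define MODEL_GFP_NORETRY  (1u << 0)  /* hypothetical bit, not the kernel's value */

enum slowpath_action {
	SLOWPATH_DIRECT_RECLAIM_COMPACT,   /* existing behaviour: may stall */
	SLOWPATH_WAKE_KCOMPACTD_AND_FAIL,  /* new behaviour: fail fast, compact later */
};

/* Model of the check in the hunk above: costly order plus NORETRY fails fast. */
static enum slowpath_action model_slowpath(unsigned int order,
					   unsigned int gfp_mask)
{
	bool costly_order = order > MODEL_COSTLY_ORDER;

	if (costly_order && (gfp_mask & MODEL_GFP_NORETRY))
		return SLOWPATH_WAKE_KCOMPACTD_AND_FAIL;

	return SLOWPATH_DIRECT_RECLAIM_COMPACT;
}
```

In this model an order-9 request with the NORETRY bit set takes the fail-fast path, while an order-3 request with the same bit, or an order-9 request without it, keeps the existing behaviour.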
Here are the results we collected; the patch above is the "kcompactd
background" row:
| Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline |
|----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
| Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — |
| Proposed patch | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45% |
| Ritesh's suggestion | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50% |
| Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07% |
| kcompactd background | 146,760.75 | 128,094.92 | 127,979.74 | 134,278.47 | +31.67% |
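The Average and "% vs Baseline" columns can be reproduced from the three runs with the arithmetic below. The helper names are hypothetical and only restate the computation; the inputs are the figures from the table.

```c
/* Arithmetic mean of the three benchmark runs. */
static double mean3(double run1, double run2, double run3)
{
	return (run1 + run2 + run3) / 3.0;
}

/* Relative change of `avg` against `baseline_avg`, in percent. */
static double pct_vs_baseline(double avg, double baseline_avg)
{
	return (avg / baseline_avg - 1.0) * 100.0;
}
```

For example, the baseline mean is mean3(107064.61, 97043.86, 101830.78), about 101,979.75, and the "kcompactd background" mean of about 134,278.47 is roughly 31.67% above it.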
[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368