From: Matthew Wilcox <willy@infradead.org>
To: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Salvatore Dipietro <dipiets@amazon.it>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, Vlastimil Babka <vbabka@suse.com>,
abuehaze@amazon.com, alisaidi@amazon.com, blakgeof@amazon.com,
brauner@kernel.org, dipietro.salvatore@gmail.com,
djwong@kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
stable@vger.kernel.org
Subject: Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
Date: Sun, 3 May 2026 12:55:48 +0100
Message-ID: <afc3xFgKogxF5Lbq@casper.infradead.org>
In-Reply-To: <a4uhrqet.ritesh.list@gmail.com>
On Sun, May 03, 2026 at 11:22:10AM +0530, Ritesh Harjani wrote:
> Now, this is what I believe could be the reason for the memory
> fragmentation seen with this workload -
> In Linux, each PTE page table occupies a 4KB page (assuming a 4KB
> system PAGE_SIZE). When your workload forks a child process for each
> new connection, the child gets its own copy of the page tables which
> map the shared buffer.
> Since each PTE table is a single 4KB page, spawning hundreds of
> connections means hundreds of thousands of single-page allocations for
> page tables. So it looks like the major source of your memory
> fragmentation problem must be this flood of order-0 allocations for
> PTE page table pages.
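(A rough worked example of that claim, with assumed numbers: one 4KB
PTE page holds 512 eight-byte entries and so maps 2MB of address
space.  For a hypothetical 1GB shared buffer that is 1GB / 2MB = 512
PTE pages per child, so 500 connections would mean ~256,000 order-0
page-table allocations.)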
While memory is fragmented, the _problem_ is that we try too hard to
defragment. From the original post:
: When memory is fragmented, each failed allocation triggers
: compaction and drain_all_pages() via __alloc_pages_slowpath()
We really should only try compaction once. If it didn't make useful
progress last time, it won't this time either.
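To illustrate the policy I have in mind, here is a self-contained
userspace sketch (not actual mm/page_alloc.c code; every name in it is
made up):

/*
 * Sketch of a "compact at most once" allocation policy.  try_alloc()
 * and run_compaction() are hypothetical stand-ins for the allocator's
 * internals; both are stubbed to simulate a fragmented machine.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

static void *try_alloc(unsigned int order)
{
	return NULL;		/* pretend the free lists are fragmented */
}

/* Returns true iff compaction made useful progress at this order. */
static bool run_compaction(unsigned int order)
{
	return false;		/* pretend compaction achieved nothing */
}

static void *alloc_compact_once(unsigned int order)
{
	void *page = try_alloc(order);

	if (page)
		return page;

	/*
	 * Compact exactly once.  If that made no useful progress, a
	 * second attempt will not do better, so fail fast and let the
	 * caller fall back (e.g. to a smaller folio order).
	 */
	if (run_compaction(order))
		return try_alloc(order);

	return NULL;
}

int main(void)
{
	if (!alloc_compact_once(6))
		puts("order-6 failed fast; falling back to a lower order");
	return 0;
}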
> > | Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline |
> > |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
> > | Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — |
> > | Proposed patch | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45% |
> > | Ritesh's suggestion | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50% |
> > | Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07% |
>
>
> The main reason I proposed the patch below is that it only affects
> costly-order allocations (i.e. order > PAGE_ALLOC_COSTLY_ORDER) by
> skipping direct reclaim for those orders, while keeping the behaviour
> unchanged for all the others.
>
> So, for the smaller orders (order > min_order and order <=
> PAGE_ALLOC_COSTLY_ORDER), the allocator will still attempt direct
> reclaim and compaction (which I guess is also required to avoid OOM?).
> And this looks like a change that could easily be backported :)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4e636647100c..f2343c26dd63 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
> gfp_t alloc_gfp = gfp;
>
> err = -ENOMEM;
> - if (order > min_order)
> - alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> + if (order > min_order) {
> + alloc_gfp |= __GFP_NOWARN;
> + if (order > PAGE_ALLOC_COSTLY_ORDER)
> + alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> + else
> + alloc_gfp |= __GFP_NORETRY;
> + }
>
>
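For anyone following along, my reading of the hunk above (assuming the
mapping's gfp is plain GFP_KERNEL, and recalling that
PAGE_ALLOC_COSTLY_ORDER is 3):

  order == min_order:      GFP_KERNEL (unchanged)
  min_order < order <= 3:  GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN
  order > 3:               (GFP_KERNEL | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM

Clearing __GFP_DIRECT_RECLAIM means the costly orders can neither
reclaim nor compact synchronously: they are served from the free lists
or fail immediately (kswapd can still be woken, since GFP_KERNEL also
includes __GFP_KSWAPD_RECLAIM).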
> But of course, let's hear others' suggestions / thoughts. Maybe
> filemap is not the right place to fix this, as Matthew, Andrew and
> others were pointing out. Any other suggestions on how to approach
> this, please?
filemap.c REALLY shouldn't know about PAGE_ALLOC_COSTLY_ORDER.
That's an internal detail of the memory allocator.
Either we want an API to say "allocate me a folio between orders A and B",
or we need more understandable GFP flags.  Or the page allocator could
use the __GFP_NORETRY flag to say "oh well, this allocation has a
fallback; I'll kick kcompactd to try to compact some more memory, but
I'll fail this allocation".
Thread overview: 16+ messages
2026-04-03 19:35 [PATCH 0/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
2026-04-04 1:13 ` Ritesh Harjani
2026-04-04 4:15 ` Matthew Wilcox
2026-04-04 16:47 ` Ritesh Harjani
2026-04-04 20:46 ` Matthew Wilcox
2026-04-16 15:14 ` Ritesh Harjani
2026-04-20 16:33 ` Salvatore Dipietro
2026-04-20 18:44 ` Matthew Wilcox
2026-04-21 1:16 ` Ritesh Harjani
2026-04-28 15:02 ` Salvatore Dipietro
2026-05-03 5:52 ` Ritesh Harjani
2026-05-03 11:55 ` Matthew Wilcox [this message]
2026-04-05 22:43 ` Dave Chinner
2026-04-07 5:40 ` Christoph Hellwig
2026-04-21 9:02 ` Vlastimil Babka