From: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
To: Salvatore Dipietro <dipiets@amazon.it>,
Matthew Wilcox <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, Vlastimil Babka <vbabka@suse.com>
Cc: abuehaze@amazon.com, alisaidi@amazon.com, blakgeof@amazon.com,
brauner@kernel.org, dipietro.salvatore@gmail.com,
dipiets@amazon.it, djwong@kernel.org,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-xfs@vger.kernel.org, stable@vger.kernel.org,
willy@infradead.org
Subject: Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
Date: Sun, 03 May 2026 11:22:10 +0530 [thread overview]
Message-ID: <a4uhrqet.ritesh.list@gmail.com> (raw)
In-Reply-To: <20260428150240.3009-1-dipiets@amazon.it>
Sorry about the delayed response, got caught up in some other work.
Salvatore Dipietro <dipiets@amazon.it> writes:
> On 4/21/26 00:43, Ritesh Harjani wrote:
>> Also, given that the Maintainers (willy, Christoph, Dave) have shown their
>> disinterest in taking the patch in its current form, the right way is
>> to get back with performance data with both the approaches (which we
>> were discussing) and first get the consensus from everyone, before
>> proposing this as a patch :).
>
> Thank you for the follow-up and the additional context, Ritesh.
> I might have misunderstood the previous request and will make sure to
> link back to previous patch versions in the future.
> Here are the performance results that we have collected on our end with
> the proposed patches:
>
>
> | Patch    | Run 1      | Run 2     | Run 3      | Average    | % vs Baseline |
> |----------|-----------:|----------:|-----------:|-----------:|:-------------:|
> | Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | —             |
>
> | PG huge_pages + pre-alloc mem | THP   | Run 1   | Run 2   | Run 3   | Average |
> |-------------------------------|-------|--------:|--------:|--------:|--------:|
> | on                            | never | 189,418 | 187,764 | 188,207 | 188,463 |
>
>
First of all, thanks for sharing the detailed performance numbers.
Here is what I understood from the data you shared.

This performance problem is mostly seen with PostgreSQL huge_pages=off
[1][2], i.e.

  baseline-no-patches                    ~102K
  v/s
  baseline-no-patches + huge_pages=on    ~188K

Also, the observation with huge_pages=off is that ~40% of memory is used
as page table memory (as you pointed out below).
> We do not use any tool to fragment the memory in advance. Collecting
> memory metric of this system, we noticed that ~40% of memory is used by
> PageTables since PostgreSQL spawns a new process for each client limiting
> significantly the available caching and free memory.
>
So there must be 2 things going on with the huge_pages=on option here:

1. Huge pages use PMD-size mappings, which eliminates the need for PTE
tables entirely. This reduces the amount of memory consumed by page
tables. Without huge pages, the page table overhead becomes significant
(~40% of DRAM), because on fork each child process gets its own copy
of the PTE tables (even though the underlying shared memory pages
remain the same).

2. The second saving might come from the fact that Linux supports
CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING with hugetlb. With this, the
PMD table pages themselves are shared among processes.
[1]:
https://www.postgresql.org/docs/current/runtime-config-resource.html
[2]:
https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-HUGE-PAGES
So the above explains the ~40% of memory used up in page tables.
Now this is what I believe could be the reason for memory fragmentation
with this workload:

In Linux, each PTE page table occupies a 4KB page (assuming a 4KB system
PAGE_SIZE). When your workload forks a child process for each new
connection, the child gets its own copy of the page tables mapping the
shared buffers. Since each PTE table is a single 4KB page, hundreds of
connections mean hundreds of thousands of single-page allocations for
page tables. So it looks like the major source of your memory
fragmentation problem must be these many order-0 allocations for PTE
page table pages.
Also, as per the documentation [1], huge_pages=try is the default
setting. So I am assuming that in production we at least won't suffer
from this memory fragmentation, correct?
[1]: https://www.postgresql.org/docs/current/runtime-config-resource.html
> PostgreSQL write pattern consists mostly of 8/16 KB data but during
> the database checkpoints, by default every 5 minutes, it flushes write-ahead
> logs to disk, which uses large folios. At this point, the system attempts to
> satisfy the folio allocation request, triggering the regression and falling
> into the slow path, as shown by the Linux perf profile below:
>
> `-0.26%-__arm64_sys_pwrite64
> `-0.26%-vfs_write
> `-0.26%-xfs_file_write_iter
> `-0.26%-xfs_file_buffered_write
> `-0.26%-iomap_file_buffered_write
> `-0.26%-iomap_write_iter
> `-0.22%-iomap_write_begin
> `-0.22%-iomap_get_folio
> `-0.22%-__filemap_get_folio
> `-0.21%-filemap_alloc_folio->alloc_pages
> `-0.20%-__alloc_pages_slowpath
> |-0.12%-__alloc_pages_direct_compact
> | `-0.12%-try_to_compact_pages
> | `-0.11%-compact_zone
> | `-0.11%-isolate_migratepages
> `-0.07%-__drain_all_pages
> `-0.07%-drain_pages_zone
> `-0.07%-free_pcppages_bulk
>
However, I agree that it still makes sense to look into a possible
solution to address the performance gap you pointed out when the system
has memory fragmentation (with huge_pages=off).
> | Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline |
> |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
> | Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — |
> | Proposed patch | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45% |
> | Ritesh's suggestion | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50% |
> | Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07% |
The main reason I proposed the patch below is that it only affects
costly order allocations (i.e. order > PAGE_ALLOC_COSTLY_ORDER) by
skipping direct reclaim for those orders, while keeping the behaviour
the same for the rest.

So for smaller orders (min_order < order <= PAGE_ALLOC_COSTLY_ORDER),
the allocator will still attempt direct reclaim and compaction (which I
guess is also required to avoid OOM?). This also looks like a change
that could be easily backported :)
diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..f2343c26dd63 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
 		gfp_t alloc_gfp = gfp;

 		err = -ENOMEM;
-		if (order > min_order)
-			alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
+		if (order > min_order) {
+			alloc_gfp |= __GFP_NOWARN;
+			if (order > PAGE_ALLOC_COSTLY_ORDER)
+				alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
+			else
+				alloc_gfp |= __GFP_NORETRY;
+		}
But of course, let's hear from others on their suggestions/thoughts.
Maybe filemap is not the right place to fix this, as Matthew, Andrew
and others were pointing out. Any other suggestions on how to approach
this, please?
-ritesh
Thread overview:
2026-04-03 19:35 [PATCH 0/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
2026-04-04 1:13 ` Ritesh Harjani
2026-04-04 4:15 ` Matthew Wilcox
2026-04-04 16:47 ` Ritesh Harjani
2026-04-04 20:46 ` Matthew Wilcox
2026-04-16 15:14 ` Ritesh Harjani
2026-04-20 16:33 ` Salvatore Dipietro
2026-04-20 18:44 ` Matthew Wilcox
2026-04-21 1:16 ` Ritesh Harjani
2026-04-28 15:02 ` Salvatore Dipietro
2026-05-03 5:52 ` Ritesh Harjani [this message]
2026-05-03 11:55 ` Matthew Wilcox
2026-04-05 22:43 ` Dave Chinner
2026-04-07 5:40 ` Christoph Hellwig
2026-04-21 9:02 ` Vlastimil Babka