* [PATCH 0/1] iomap: avoid compaction for costly folio order allocation
From: Salvatore Dipietro @ 2026-04-03 19:35 UTC
To: linux-kernel
Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel
We are reporting a throughput regression on PostgreSQL pgbench
(simple-update) on arm64 caused by commit 5d8edfb900d5 ("iomap:
Copy larger chunks from userspace") introduced in v6.6-rc1.
The regression manifests as a 0.75x throughput drop on a pgbench
simple-update workload with 1024 clients on a 96-vCPU arm64
system. When memory is even slightly fragmented, each failed
high-order allocation enters into __alloc_pages_slowpath() which
runs 2 memory compactions and drain_all_pages(), forcing all
vCPUs to release their pages. This is done multiple times, one
for each order (up to 6), until the allocation succeeds.
The patch makes costly-order folio allocations in the iomap
buffered write path purely opportunistic -- no direct reclaim,
no compaction, no drain_all_pages().
Combined with the separate PREEMPT_LAZY regression [1], the
total impact is a 2.87x throughput and latency loss.
1. Test environment
___________________
Hardware: 1x AWS EC2 m8g.24xlarge
          (12x 1TB IO2 32000 iops RAID0 XFS)
OS:       AL2023 (ami-03a8d3251f401ffca)
Kernel:   next-20260331
Database: PostgreSQL 17
Workload: pgbench simple-update
          1024 clients, 96 threads, 1200s duration
          scale factor 8470, fillfactor=90, prepared protocol
2. Results
__________
Config                 Run1       Run2       Run3       Avg        x
_____________________  _________  _________  _________  _________  ____
baseline                47242.39   53369.18   51644.29   50751.96  1.00
iomap patch             69305.92   66994.08   64603.33   66967.78  1.32
preempt-none [1]        92906.62  103976.03   98814.94   98565.86  1.94
iomap+preempt-none[1]  145904.53  146470.95  144728.91  145701.46  2.87
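As a cross-check of the table above, here is a minimal C sketch that recomputes the Avg and x columns from the three runs. The struct and function names (result_row, row_average, row_speedup, print_results) are made up for this illustration; only the numbers come from the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_RUNS 3

/* One row of the results table: configuration name and per-run throughput. */
struct result_row {
	const char *name;
	double runs[NUM_RUNS];
};

/* Values copied from the table above; row 0 is the baseline. */
const struct result_row results[] = {
	{ "baseline",           {  47242.39,  53369.18,  51644.29 } },
	{ "iomap patch",        {  69305.92,  66994.08,  64603.33 } },
	{ "preempt-none",       {  92906.62, 103976.03,  98814.94 } },
	{ "iomap+preempt-none", { 145904.53, 146470.95, 144728.91 } },
};

#define NUM_ROWS (sizeof(results) / sizeof(results[0]))

/* Arithmetic mean of the runs in one row. */
double row_average(const struct result_row *row)
{
	double sum = 0.0;

	for (int i = 0; i < NUM_RUNS; i++)
		sum += row->runs[i];
	return sum / NUM_RUNS;
}

/* Ratio of a row's average to the baseline average (the "x" column). */
double row_speedup(const struct result_row *row)
{
	double base = row_average(&results[0]);

	assert(base > 0.0);
	return row_average(row) / base;
}

/* Print every row with its recomputed average and ratio. */
void print_results(void)
{
	for (size_t i = 0; i < NUM_ROWS; i++)
		printf("%-20s avg=%10.2f x=%.2f\n", results[i].name,
		       row_average(&results[i]), row_speedup(&results[i]));
}
```

Calling print_results() from a small main() should reproduce the Avg and x columns to within rounding. The same numbers give baseline / iomap-patch of roughly 0.76, which is presumably what the "0.75x" figure in the text refers to.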
3. Reproduction
_______________
On the AWS EC2 m8g.24xlarge, install and run the PostgreSQL
database using the repro-collection repository like:
# Reproducer code:
git clone https://github.com/aws/repro-collection.git ~/repro-collection
# Setup and start PostgreSQL server in terminal 1:
~/repro-collection/run.sh postgresql SUT --ldg=127.0.0.1
# Run pgbench load generator in terminal 2:
PGBENCH_SCALE=8470 \
PGBENCH_INIT_EXTRA_ARGS="--fillfactor=90" \
PGBENCH_CLIENTS=1024 \
PGBENCH_THREADS=96 \
PGBENCH_DURATION=1200 \
PGBENCH_BUILTIN=simple-update \
PGBENCH_PROTOCOL=prepared \
~/repro-collection/run.sh postgresql LDG --sut=127.0.0.1
[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#t
Salvatore Dipietro (1):
iomap: avoid compaction for costly folio order allocation
fs/iomap/buffered-io.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
base-commit: 9147566d801602c9e7fc7f85e989735735bf38ba
--
2.50.1 (Apple Git-155)
AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico
* [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
From: Salvatore Dipietro @ 2026-04-03 19:35 UTC
To: linux-kernel
Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
    stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel

Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
introduced high-order folio allocations in the buffered write
path. When memory is fragmented, each failed allocation triggers
compaction and drain_all_pages() via __alloc_pages_slowpath(),
causing a 0.75x throughput drop on pgbench (simple-update) with
1024 clients on a 96-vCPU arm64 system.

Strip __GFP_DIRECT_RECLAIM from folio allocations in
iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
making them purely opportunistic.

Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
---
 fs/iomap/buffered-io.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 92a831cf4bf1..cb843d54b4d9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
+	gfp_t gfp;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
@@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 		fgp |= FGP_DONTCACHE;
 	fgp |= fgf_set_order(len);
 
+	gfp = mapping_gfp_mask(iter->inode->i_mapping);
+
+	/*
+	 * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
+	 * strip __GFP_DIRECT_RECLAIM to make the allocation purely
+	 * opportunistic. This avoids compaction + drain_all_pages()
+	 * in __alloc_pages_slowpath() that devastate throughput
+	 * on large systems during buffered writes.
+	 */
+	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
+		gfp &= ~__GFP_DIRECT_RECLAIM;
+
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
-			fgp, mapping_gfp_mask(iter->inode->i_mapping));
+			fgp, gfp);
 }
 EXPORT_SYMBOL_GPL(iomap_get_folio);
 
-- 
2.50.1 (Apple Git-155)
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
From: Ritesh Harjani @ 2026-04-04 1:13 UTC
To: Salvatore Dipietro, linux-kernel
Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
    stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel,
    linux-mm

Let's cc: linux-mm too.

Salvatore Dipietro <dipiets@amazon.it> writes:

> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers

Isn't that the right thing to do, i.e. run compaction, when memory is
fragmented?

> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with
> 1024 clients on a 96-vCPU arm64 system.
>

I think removing the __GFP_DIRECT_RECLAIM flag unconditionally at the
caller may cause -ENOMEM. Note that it is __filemap_get_folio() which
retries with smaller order allocations, so instead of changing the
callers, shouldn't this be fixed in __filemap_get_folio()?

Maybe in there too, we should keep the reclaim flag (if passed by the
caller) at least for <= PAGE_ALLOC_COSTLY_ORDER + 1.

Thoughts?

-ritesh
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
From: Matthew Wilcox @ 2026-04-04 4:15 UTC
To: Salvatore Dipietro
Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
    stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel,
    linux-mm

On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers
> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with
> 1024 clients on a 96-vCPU arm64 system.
>
> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> making them purely opportunistic.

If you look at __filemap_get_folio_mpol(), that's kind of being tried
already:

	if (order > min_order)
		alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;

 * %__GFP_NORETRY: The VM implementation will try only very lightweight
 * memory direct reclaim to get some memory under memory pressure (thus
 * it can sleep). It will avoid disruptive actions like OOM killer. The
 * caller must handle the failure which is quite likely to happen under
 * heavy memory pressure. The flag is suitable when failure can easily be
 * handled at small cost, such as reduced throughput.

which, from the description, seemed like the right approach. So either
the description or the implementation should be updated, I suppose?

Now, what happens if you change those two lines to:

	if (order > min_order) {
		alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
		alloc_gfp |= __GFP_NOWARN;
	}

Do you recover the performance?
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
From: Ritesh Harjani @ 2026-04-04 16:47 UTC
To: Matthew Wilcox, Salvatore Dipietro
Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
    stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel,
    linux-mm

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> introduced high-order folio allocations in the buffered write
>> path. When memory is fragmented, each failed allocation triggers
>> compaction and drain_all_pages() via __alloc_pages_slowpath(),
>> causing a 0.75x throughput drop on pgbench (simple-update) with
>> 1024 clients on a 96-vCPU arm64 system.
>>
>> Strip __GFP_DIRECT_RECLAIM from folio allocations in
>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
>> making them purely opportunistic.
>
> If you look at __filemap_get_folio_mpol(), that's kind of being tried
> already:
>
> 	if (order > min_order)
> 		alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
>
>  * %__GFP_NORETRY: The VM implementation will try only very lightweight
>  * memory direct reclaim to get some memory under memory pressure (thus
>  * it can sleep). It will avoid disruptive actions like OOM killer. The
>  * caller must handle the failure which is quite likely to happen under
>  * heavy memory pressure. The flag is suitable when failure can easily be
>  * handled at small cost, such as reduced throughput.
>
> which, from the description, seemed like the right approach. So either
> the description or the implementation should be updated, I suppose?
>
> Now, what happens if you change those two lines to:
>
> 	if (order > min_order) {
> 		alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> 		alloc_gfp |= __GFP_NOWARN;
> 	}

Hi Matthew,

Shouldn't we try this instead? This would still allow us to keep
__GFP_NORETRY, and hence lightweight direct reclaim/compaction, for
at least the non-costly order allocations, right?

	if (order > min_order) {
		alloc_gfp |= __GFP_NOWARN;
		if (order > PAGE_ALLOC_COSTLY_ORDER)
			alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
		else
			alloc_gfp |= __GFP_NORETRY;
	}

-ritesh

>
> Do you recover the performance?
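To make the difference between the three flag policies discussed so far easier to see, here is a small userspace sketch that applies each one to a stand-in flag word. The MOCK_* bit values and all function names are invented for illustration and are not the kernel's real gfp_t encoding; the only kernel fact relied on is that PAGE_ALLOC_COSTLY_ORDER is 3.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Stand-in flag bits for illustration only. These are NOT the kernel's
 * real gfp_t values; only the names mirror the flags discussed above.
 */
#define MOCK_GFP_DIRECT_RECLAIM	0x1u
#define MOCK_GFP_NORETRY	0x2u
#define MOCK_GFP_NOWARN		0x4u

/* Mirrors PAGE_ALLOC_COSTLY_ORDER, which the kernel defines as 3. */
#define MOCK_COSTLY_ORDER	3u

/* Existing behaviour quoted by Matthew: NORETRY | NOWARN above min_order. */
unsigned int gfp_current(unsigned int gfp, unsigned int order,
			 unsigned int min_order)
{
	if (order > min_order)
		gfp |= MOCK_GFP_NORETRY | MOCK_GFP_NOWARN;
	return gfp;
}

/* Matthew's experiment: no direct reclaim for any order above min_order. */
unsigned int gfp_no_reclaim(unsigned int gfp, unsigned int order,
			    unsigned int min_order)
{
	if (order > min_order) {
		gfp &= ~MOCK_GFP_DIRECT_RECLAIM;
		gfp |= MOCK_GFP_NOWARN;
	}
	return gfp;
}

/* Ritesh's variant: drop direct reclaim only for costly orders. */
unsigned int gfp_costly_only(unsigned int gfp, unsigned int order,
			     unsigned int min_order)
{
	if (order > min_order) {
		gfp |= MOCK_GFP_NOWARN;
		if (order > MOCK_COSTLY_ORDER)
			gfp &= ~MOCK_GFP_DIRECT_RECLAIM;
		else
			gfp |= MOCK_GFP_NORETRY;
	}
	return gfp;
}

/* Summarise the reclaim behaviour a flag word asks for. */
const char *reclaim_mode(unsigned int gfp)
{
	if (!(gfp & MOCK_GFP_DIRECT_RECLAIM))
		return "none";
	return (gfp & MOCK_GFP_NORETRY) ? "light" : "full";
}

/* Walk orders downwards, as a fallback to smaller orders would. */
void print_policy_table(unsigned int max_order, unsigned int min_order)
{
	const unsigned int base = MOCK_GFP_DIRECT_RECLAIM;

	assert(max_order >= min_order);
	printf("order  current  no-reclaim  costly-only\n");
	for (unsigned int order = max_order; ; order--) {
		printf("%5u  %-7s  %-10s  %s\n", order,
		       reclaim_mode(gfp_current(base, order, min_order)),
		       reclaim_mode(gfp_no_reclaim(base, order, min_order)),
		       reclaim_mode(gfp_costly_only(base, order, min_order)));
		if (order == min_order)
			break;
	}
}
```

Calling print_policy_table(6, 0) from a main() lists, for each order, whether the request would carry full, light (__GFP_NORETRY), or no direct reclaim under each policy. The sketch says nothing about what the page allocator then does with those flags, which is the open question in the thread.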
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
From: Matthew Wilcox @ 2026-04-04 20:46 UTC
To: Ritesh Harjani
Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
    dipietro.salvatore, stable, Christian Brauner, Darrick J. Wong,
    linux-xfs, linux-fsdevel, linux-mm

On Sat, Apr 04, 2026 at 10:17:33PM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
>
> > On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
> >> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> >> introduced high-order folio allocations in the buffered write
> >> path. When memory is fragmented, each failed allocation triggers
> >> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> >> causing a 0.75x throughput drop on pgbench (simple-update) with
> >> 1024 clients on a 96-vCPU arm64 system.
> >>
> >> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> >> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> >> making them purely opportunistic.
> >
> > If you look at __filemap_get_folio_mpol(), that's kind of being tried
> > already:
> >
> > 	if (order > min_order)
> > 		alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> >
> >  * %__GFP_NORETRY: The VM implementation will try only very lightweight
> >  * memory direct reclaim to get some memory under memory pressure (thus
> >  * it can sleep). It will avoid disruptive actions like OOM killer. The
> >  * caller must handle the failure which is quite likely to happen under
> >  * heavy memory pressure. The flag is suitable when failure can easily be
> >  * handled at small cost, such as reduced throughput.
> >
> > which, from the description, seemed like the right approach. So either
> > the description or the implementation should be updated, I suppose?
> >
> > Now, what happens if you change those two lines to:
> >
> > 	if (order > min_order) {
> > 		alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> > 		alloc_gfp |= __GFP_NOWARN;
> > 	}
>
> Hi Matthew,
>
> Shouldn't we try this instead? This would still allow us to keep
> __GFP_NORETRY, and hence lightweight direct reclaim/compaction, for
> at least the non-costly order allocations, right?
>
> 	if (order > min_order) {
> 		alloc_gfp |= __GFP_NOWARN;
> 		if (order > PAGE_ALLOC_COSTLY_ORDER)
> 			alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> 		else
> 			alloc_gfp |= __GFP_NORETRY;
> 	}

Uhh ... maybe? I'd want someone more familiar with the page allocator
than I am to say whether that's the right approach. If it is, that
seems too complex, and maybe we need a better approach to the page
allocator flags.
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
From: Ritesh Harjani @ 2026-04-16 15:14 UTC
To: Matthew Wilcox, Salvatore Dipietro
Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
    stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel,
    linux-mm

Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:

> Matthew Wilcox <willy@infradead.org> writes:
>
>> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
>>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>>> introduced high-order folio allocations in the buffered write
>>> path. When memory is fragmented, each failed allocation triggers
>>> compaction and drain_all_pages() via __alloc_pages_slowpath(),
>>> causing a 0.75x throughput drop on pgbench (simple-update) with
>>> 1024 clients on a 96-vCPU arm64 system.
>>>
>>> Strip __GFP_DIRECT_RECLAIM from folio allocations in
>>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
>>> making them purely opportunistic.
>>
>> If you look at __filemap_get_folio_mpol(), that's kind of being tried
>> already:
>>
>> 	if (order > min_order)
>> 		alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
>>
>>  * %__GFP_NORETRY: The VM implementation will try only very lightweight
>>  * memory direct reclaim to get some memory under memory pressure (thus
>>  * it can sleep). It will avoid disruptive actions like OOM killer. The
>>  * caller must handle the failure which is quite likely to happen under
>>  * heavy memory pressure. The flag is suitable when failure can easily be
>>  * handled at small cost, such as reduced throughput.
>>
>> which, from the description, seemed like the right approach. So either
>> the description or the implementation should be updated, I suppose?
>>
>> Now, what happens if you change those two lines to:
>>
>> 	if (order > min_order) {
>> 		alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
>> 		alloc_gfp |= __GFP_NOWARN;
>> 	}
>
> Hi Matthew,
>
> Shouldn't we try this instead? This would still allow us to keep
> __GFP_NORETRY, and hence lightweight direct reclaim/compaction, for
> at least the non-costly order allocations, right?
>
> 	if (order > min_order) {
> 		alloc_gfp |= __GFP_NOWARN;
> 		if (order > PAGE_ALLOC_COSTLY_ORDER)
> 			alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> 		else
> 			alloc_gfp |= __GFP_NORETRY;
> 	}
>

Hi Salvatore,

Did you get a chance to test the above two options (shared by Matthew
and me)? And were you able to recover the performance with those?

In the longer run, as Dave suggested, we might need to fix this by
considering removing compaction from the direct reclaim path. But for
fixing it in older kernel releases we might need a quick fix, so it is
probably worth trying the changes suggested above.

Also, I am somehow not able to hit this problem at my end (even after
creating a bit of memory fragmentation). So please feel free to share
the steps, if you have a setup that re-creates it easily.

-ritesh
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
From: Salvatore Dipietro @ 2026-04-20 16:33 UTC
To: ritesh.list
Cc: abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore, dipiets,
    djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs, stable, willy

I have submitted a v2 of the patch based on Ritesh's suggestion.
https://lore.kernel.org/linux-mm/20260420161404.642-1-dipiets@amazon.it/T/#u

Salvatore
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
From: Matthew Wilcox @ 2026-04-20 18:44 UTC
To: Salvatore Dipietro
Cc: ritesh.list, abuehaze, alisaidi, blakgeof, brauner,
    dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm,
    linux-xfs, stable

On Mon, Apr 20, 2026 at 04:33:28PM +0000, Salvatore Dipietro wrote:
> I have submitted a v2 of the patch based on Ritesh's suggestion.
> https://lore.kernel.org/linux-mm/20260420161404.642-1-dipiets@amazon.it/T/#u

... but without linking back to this thread, so nobody who was exposed
to that thread for the first time knows about this one. That's poor form.
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
From: Ritesh Harjani @ 2026-04-21 1:16 UTC
To: Salvatore Dipietro
Cc: Matthew Wilcox, abuehaze, alisaidi, blakgeof, brauner,
    dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm,
    linux-xfs, stable

Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Apr 20, 2026 at 04:33:28PM +0000, Salvatore Dipietro wrote:
>> I have submitted a v2 of the patch based on Ritesh's suggestion.
>> https://lore.kernel.org/linux-mm/20260420161404.642-1-dipiets@amazon.it/T/#u
>
> ... but without linking back to this thread, so nobody who was exposed
> to that thread for the first time knows about this one. That's poor form.

Yup.

Also, given the Maintainers (willy, Christoph, Dave) have shown their
disinterest in taking the patch in its current form, the right way is
to get back with performance data with both the approaches (which we
were discussing) and first get the consensus from everyone, before
proposing this as a patch :).

Having said that, we do care if a genuine performance issue gets
reported. In that context, I wanted to understand your setup a bit from
a memory fragmentation perspective. Are you trying to simulate memory
fragmentation and then benchmarking? Or was this problem hit when you
simply ran the reproduction steps mentioned in your cover letter?

BTW - I was following the other thread too, where the PREEMPT_LAZY
problem was being discussed. From what I understood, you mentioned [1]
that enabling THP on the system made that problem go away. It also looks
like enabling THP is the right thing to do for this kind of workload.
Does that mean enabling THP fixed this problem too? Do you still hit
memory fragmentation and/or a similar throughput drop without this fix
after you enable THP? It would be good to know those details too, please.

[1]: https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#md88ca4258766e897e432df85874d197db476c7d1

-ritesh
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
From: Salvatore Dipietro @ 2026-04-28 15:02 UTC
To: ritesh.list
Cc: abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore, dipiets,
    djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs, stable, willy

On 4/21/26 00:43, Ritesh Harjani wrote:
> Also, given the Maintainers (willy, Christoph, Dave) have shown their
> disinterest in taking the patch in its current form, the right way is
> to get back with performance data with both the approaches (which we
> were discussing) and first get the consensus from everyone, before
> proposing this as a patch :).

Thank you for the follow-up and the additional context, Ritesh.
I might have misunderstood the previous request and will make sure to
link back to previous patch versions in the future.
Here are the performance results that we have collected on our end with
the proposed patches:

| Patch                |      Run 1 |      Run 2 |      Run 3 |    Average | % vs Baseline |
|----------------------|-----------:|-----------:|-----------:|-----------:|:-------------:|
| Baseline             | 107,064.61 |  97,043.86 | 101,830.78 | 101,979.75 |       -       |
| Proposed patch       | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 |    +38.45%    |
| Ritesh's suggestion  | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 |    +36.50%    |
| Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 |    +41.07%    |

On 4/21/26 00:43, Ritesh Harjani wrote:
> In that context, I wanted to understand your setup a bit from
> a memory fragmentation perspective. Are you trying to simulate memory
> fragmentation and then benchmarking? Or was this problem hit when
> you simply ran the reproduction steps mentioned in your cover
> letter?

All results were collected on fresh AWS instances as described in the
cover letter. Patch [1] has been applied on all instances to avoid the
other regression. The instance has been restarted to pick up the patched
kernel and ensure clean memory before installing and starting the
PostgreSQL benchmark via repro-collection [2].

We do not use any tool to fragment the memory in advance. Collecting
memory metrics on this system, we noticed that ~40% of memory is used by
PageTables, since PostgreSQL spawns a new process for each client, which
significantly limits the memory available for caching and free memory.

PostgreSQL's write pattern consists mostly of 8/16 KB writes, but during
database checkpoints (by default every 5 minutes) it flushes write-ahead
logs to disk, which uses large folios. At this point, the system attempts
to satisfy the folio allocation request, triggering the regression and
falling into the slow path, as shown by the Linux perf profile below:

`-0.26%-__arm64_sys_pwrite64
  `-0.26%-vfs_write
    `-0.26%-xfs_file_write_iter
      `-0.26%-xfs_file_buffered_write
        `-0.26%-iomap_file_buffered_write
          `-0.26%-iomap_write_iter
            `-0.22%-iomap_write_begin
              `-0.22%-iomap_get_folio
                `-0.22%-__filemap_get_folio
                  `-0.21%-filemap_alloc_folio->alloc_pages
                    `-0.20%-__alloc_pages_slowpath
                      |-0.12%-__alloc_pages_direct_compact
                      | `-0.12%-try_to_compact_pages
                      |   `-0.11%-compact_zone
                      |     `-0.11%-isolate_migratepages
                      `-0.07%-__drain_all_pages
                        `-0.07%-drain_pages_zone
                          `-0.07%-free_pcppages_bulk

This is also visible in the intermediate pgbench results, which drop
significantly while a checkpoint is running:

# Normal execution:
[20260421.141505] [INFO] progress: 660.0 s, 138828.2 tps, lat 7.509 ms stddev 16.985, 0 failed
[20260421.141515] [INFO] progress: 670.0 s, 151505.1 tps, lat 6.708 ms stddev 8.308, 0 failed
[20260421.141525] [INFO] progress: 680.0 s, 166558.7 tps, lat 6.190 ms stddev 6.537, 0 failed
[20260421.141535] [INFO] progress: 690.0 s, 141267.1 tps, lat 7.246 ms stddev 5.951, 0 failed

# During checkpoints:
[20260421.141605] [INFO] progress: 720.0 s, 54119.8 tps, lat 18.894 ms stddev 81.816, 0 failed
[20260421.141615] [INFO] progress: 730.0 s, 55184.7 tps, lat 18.564 ms stddev 12.729, 0 failed
[20260421.141625] [INFO] progress: 740.0 s, 37334.0 tps, lat 27.302 ms stddev 25.060, 0 failed
[20260421.141635] [INFO] progress: 750.0 s, 53387.6 tps, lat 19.259 ms stddev 18.313, 0 failed
[20260421.141645] [INFO] progress: 760.0 s, 41247.3 tps, lat 24.805 ms stddev 24.116, 0 failed

On 4/21/26 00:43, Ritesh Harjani wrote:
> BTW - I was following the other thread too, where the PREEMPT_LAZY
> problem was being discussed. From what I understood, you mentioned [1]
> that enabling THP on the system made that problem go away. It also looks
> like enabling THP is the right thing to do for this kind of workload.
> Does that mean enabling THP fixed this problem too? Do you still hit
> memory fragmentation and/or a similar throughput drop without this fix
> after you enable THP? It would be good to know those details too, please.

We have run more benchmarks (as baseline) with the PostgreSQL huge_pages
option (on, off), pre-allocating the shared buffer memory with
"vm.nr_hugepages" (~25% of total memory, 2MB size), and with the
Transparent Huge Pages (THP) options (always, madvise, never).
PostgreSQL performance improves only when the PostgreSQL huge_pages
option with pre-allocated memory is enabled. THP has no significant
effect on PostgreSQL or system performance in this case.

| PG huge_pages + pre-alloc mem | THP     |   Run 1 |   Run 2 |   Run 3 | Average |
|-------------------------------|---------|--------:|--------:|--------:|--------:|
| on                            | never   | 189,418 | 187,764 | 188,207 | 188,463 |
| on                            | always  | 188,813 | 189,798 | 190,032 | 189,548 |
| on                            | madvise | 187,405 | 192,234 | 189,201 | 189,613 |
| off                           | never   | 102,609 | 109,394 | 100,868 | 104,290 |
| off                           | always  |  90,274 | 103,831 | 102,515 |  98,874 |
| off                           | madvise |  90,508 | 103,855 |  96,574 |  96,979 |

[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368
[2] https://github.com/aws/repro-collection.git
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-28 15:02 ` Salvatore Dipietro @ 2026-05-03 5:52 ` Ritesh Harjani 2026-05-03 11:55 ` Matthew Wilcox 0 siblings, 1 reply; 25+ messages in thread From: Ritesh Harjani @ 2026-05-03 5:52 UTC (permalink / raw) To: Salvatore Dipietro, Matthew Wilcox, Andrew Morton, linux-mm, Vlastimil Babka Cc: abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore, dipiets, djwong, linux-fsdevel, linux-kernel, linux-xfs, stable, willy Sorry about the delayed response, got caught up in some other work. Salvatore Dipietro <dipiets@amazon.it> writes: > On 4/21/26 00:43, Ritesh Harjani wrote: >> Also, given the Maintainers (willy, Christoph, Dave) shown their >> dis-interest in taking the patch in it's current form, the right way is >> to get back with performance data with both the approaches (which we >> were discussing) and first get the consensus from everyone, before >> proposing this as a patch :). > > Thank you for the follow-up and the additional context, Ritesh. > I might have misunderstood the previous request and will make sure to > link back to previous patch versions in the future. > Here are the performance results that we have collected on our end with > the proposed patches: > > > | Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline | > |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:| > | Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — | > | PG huge_pages + pre-alloc mem | THP | Run 1 | Run 2 | Run 3 | Average | > |-------------------------------|---------|--------:|--------:|--------:|--------:| > | on | never | 189,418 | 187,764 | 188,207 | 188,463 | > First of all thanks for sharing the detailed performance numbers. Ok, so here is what I understood from the data you shared. This performance problem is mostly seen with PostgreSQL huge_pages=off [1][2] i.e. 
baseline-no-patches ~104K v/s baseline-no-patches+huge_pages=on ~188K Also the observation with huge_pages=off is - we have 40% of memory as page table memory (as you pointed below) > We do not use any tool to fragment the memory in advance. Collecting > memory metric of this system, we noticed that ~40% of memory is used by > PageTables since PostgreSQL spawns a new process for each client limiting > significantly the available caching and free memory. > So there must be 2 things going on with huge_pages=on option here: 1. Huge pages use PMD-size mapping, which eliminates the need of PTE tables entirely. This then reduces the amount of memory consumed by page tables. W/O huges pages, the page table overhead become significant (~40% of DRAM), because on fork, each child process gets it's own copy of the PTE tables (even though the underlying shared memory pages remains the same) 2. The second savings might come from the fact that Linux supports CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING with hugetlb. With this, the PMD table pages themselves are shared among proceses. [1]: https://www.postgresql.org/docs/current/runtime-config-resource.html [2]: https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-HUGE-PAGES So the above explain the 40% of memory used up in Page tables. Now this is what I believe could be the reason for memory fragmentation with this workload - In Linux, each PTE page table uses 4KB size (assuming you are using 4KB system PAGE_SIZE). When your workload forks a child process for each new connection, child gets its own copy of the page tables which maps the shared buffer. Since each PTE table is a single 4KB page, hundreds of connections spawning means hundreds of thousands of single-page allocations for page tables. So it looks like, the major source of your memory fragmentation problem must be these several order-0 allocations for PTE page table pages. Also as per the documentation [1], huge_pages=try option is the default setting. 
So I am assuming in production we at least won't suffer from
this memory fragmentation, correct?

[1]: https://www.postgresql.org/docs/current/runtime-config-resource.html

> PostgreSQL write pattern consists mostly of 8/16 KB data but during
> the database checkpoints, by default every 5 minutes, it flushes write-ahead
> logs to disk, which uses large folios. At this point, the system attempts to
> satisfy the folio allocation request, triggering the regression and falling
> into the slow path, as shown by the Linux perf profile below:
>
>   `-0.26%-__arm64_sys_pwrite64
>     `-0.26%-vfs_write
>       `-0.26%-xfs_file_write_iter
>         `-0.26%-xfs_file_buffered_write
>           `-0.26%-iomap_file_buffered_write
>             `-0.26%-iomap_write_iter
>               `-0.22%-iomap_write_begin
>                 `-0.22%-iomap_get_folio
>                   `-0.22%-__filemap_get_folio
>                     `-0.21%-filemap_alloc_folio->alloc_pages
>                       `-0.20%-__alloc_pages_slowpath
>                         |-0.12%-__alloc_pages_direct_compact
>                         | `-0.12%-try_to_compact_pages
>                         |   `-0.11%-compact_zone
>                         |     `-0.11%-isolate_migratepages
>                         `-0.07%-__drain_all_pages
>                           `-0.07%-drain_pages_zone
>                             `-0.07%-free_pcppages_bulk
>

However, I agree that it still makes sense to look into possible solutions
to address the performance gap you pointed out when the system has memory
fragmentation (with huge_pages=off).
> | Patch                | Run 1      | Run 2      | Run 3      | Average     | % vs Baseline |
> |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
> | Baseline             | 107,064.61 |  97,043.86 | 101,830.78 |  101,979.75 |       —       |
> | Proposed patch       | 146,012.23 | 136,392.36 | 141,178.00 |  141,194.20 |    +38.45%    |
> | Ritesh's suggestion  | 147,481.50 | 133,069.03 | 137,051.30 |  139,200.61 |    +36.50%    |
> | Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 |  143,863.82 |    +41.07%    |

The main reason, why I proposed the below patch was because, this only
affects costly order allocation (i.e for order > PAGE_ALLOC_COSTLY_ORDER)
by skipping direct reclaim for those orders, while still keeping the
behaviour same for others.

So, for smaller orders (order > min_order and <=
PAGE_ALLOC_COSTLY_ORDER), the allocator will still attempt for direct
reclaim and compaction (which I guess is required to avoid oom too?) And
also, this looks like a change which could be easily backportable :)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..f2343c26dd63 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
 			gfp_t alloc_gfp = gfp;

 			err = -ENOMEM;
-			if (order > min_order)
-				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
+			if (order > min_order) {
+				alloc_gfp |= __GFP_NOWARN;
+				if (order > PAGE_ALLOC_COSTLY_ORDER)
+					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
+				else
+					alloc_gfp |= __GFP_NORETRY;
+			}


But of course let's hear from others on their suggestions / thoughts.
Maybe the filemap is not the right place to fix this as Matthew, Andrew
and others were pointing. Any other suggestions on how to approach this,
please?

-ritesh

^ permalink raw reply related [flat|nested] 25+ messages in thread
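The flag selection in the mm/filemap.c hunk above can be modelled in userspace to see which mask each order ends up with. This is only a sketch: the `MODEL_GFP_*` bit values and the `folio_alloc_gfp()` helper are hypothetical stand-ins and are not the kernel's real gfp_t definitions. Only the value 3 for PAGE_ALLOC_COSTLY_ORDER is taken from the kernel.

```c
#include <assert.h>

/* Hypothetical placeholder bits, NOT the kernel's real gfp_t layout. */
typedef unsigned int gfp_t;
#define MODEL_GFP_DIRECT_RECLAIM	(1u << 0)	/* stands in for __GFP_DIRECT_RECLAIM */
#define MODEL_GFP_NORETRY		(1u << 1)	/* stands in for __GFP_NORETRY */
#define MODEL_GFP_NOWARN		(1u << 2)	/* stands in for __GFP_NOWARN */

/* Same value as the kernel: orders above 3 are treated as costly. */
#define PAGE_ALLOC_COSTLY_ORDER		3

/*
 * Hypothetical helper mirroring the proposed hunk:
 *  - order == min_order: mask is left untouched (the attempt may try hard);
 *  - min_order < order <= costly: lightweight attempt via NORETRY;
 *  - order > costly: direct reclaim is stripped, so no stalling at all.
 */
static gfp_t folio_alloc_gfp(gfp_t gfp, unsigned int order, unsigned int min_order)
{
	gfp_t alloc_gfp = gfp;

	assert(order >= min_order);

	if (order > min_order) {
		alloc_gfp |= MODEL_GFP_NOWARN;
		if (order > PAGE_ALLOC_COSTLY_ORDER)
			alloc_gfp &= ~MODEL_GFP_DIRECT_RECLAIM;
		else
			alloc_gfp |= MODEL_GFP_NORETRY;
	}
	return alloc_gfp;
}
```

With a base mask that allows direct reclaim and min_order 0, an order-3 request keeps direct reclaim but gains NORETRY, while an order-4 request loses direct reclaim entirely.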
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-05-03 5:52 ` Ritesh Harjani @ 2026-05-03 11:55 ` Matthew Wilcox 2026-05-06 12:33 ` Salvatore Dipietro 0 siblings, 1 reply; 25+ messages in thread From: Matthew Wilcox @ 2026-05-03 11:55 UTC (permalink / raw) To: Ritesh Harjani Cc: Salvatore Dipietro, Andrew Morton, linux-mm, Vlastimil Babka, abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-xfs, stable On Sun, May 03, 2026 at 11:22:10AM +0530, Ritesh Harjani wrote: > Now this is what I believe could be the reason for memory fragmentation > with this workload - > In Linux, each PTE page table uses 4KB size (assuming you are using 4KB > system PAGE_SIZE). When your workload forks a > child process for each new connection, child gets its own copy of the > page tables which maps the shared buffer. > Since each PTE table is a single 4KB page, hundreds of connections > spawning means hundreds of thousands of single-page allocations for page > tables. So it looks like, the major source of your memory fragmentation > problem must be these several order-0 allocations for PTE page table > pages. While memory is fragmented, the _problem_ is that we try too hard to defragment. From the original post: : When memory is fragmented, each failed allocation triggers : compaction and drain_all_pages() via __alloc_pages_slowpath() We really should only try compaction once. If it didn't make useful progress last time, it won't this time either. 
> > | Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline | > > |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:| > > | Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — | > > | Proposed patch | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45% | > > | Ritesh's suggestion | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50% | > > | Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07% | > > > The main reason, why I proposed the below patch was because, this only > affects costly order allocation (i.e for order > PAGE_ALLOC_COSTLY_ORDER) > by skipping direct reclaim for those orders, while still keeping the > behaviour same for others. > > So, for smaller orders (order > min_order and <= > PAGE_ALLOC_COSTLY_ORDER), the allocator will still attempt for direct > reclaim and compaction (which I guess is required to avoid oom too?) And > also, this looks like a change which could be easily backportable :) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 4e636647100c..f2343c26dd63 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping, > gfp_t alloc_gfp = gfp; > > err = -ENOMEM; > - if (order > min_order) > - alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; > + if (order > min_order) { > + alloc_gfp |= __GFP_NOWARN; > + if (order > PAGE_ALLOC_COSTLY_ORDER) > + alloc_gfp &= ~__GFP_DIRECT_RECLAIM; > + else > + alloc_gfp |= __GFP_NORETRY; > + } > > > But of course let's hear from others on their suggestions / thoughts. > Maybe the filemap is not the right place to fix this as Matthew, Andrew > and others were pointing. Any other suggestions on how to approach this, > please? filemap.c REALLY shouldn't know about PAGE_ALLOC_COSTLY_ORDER. That's an internal detail of the memory allocator. 
Either we want an API to say "allocate me a folio between orders A and B"
or we need more understandable GFP flags. Or the page allocator could
use the __GFP_NORETRY flag to say "oh well, this allocation has a fallback,
I'll kick kcompactd to try to compact some more memory, but I'll fail
the allocation".

^ permalink raw reply [flat|nested] 25+ messages in thread
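One way to picture the "allocate me a folio between orders A and B" idea is the userspace sketch below. Everything in it is hypothetical: `alloc_order_range()`, the `try_alloc_fn` callback and the toy allocator are illustration only, and no such kernel API is claimed to exist. The assumption being shown is that every order above the minimum is attempted without stalling, and only the minimum order is allowed to block.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical callback: returns true if an allocation of 'order' succeeds.
 * 'may_stall' tells the allocator whether it may block on reclaim/compaction.
 */
typedef bool (*try_alloc_fn)(unsigned int order, bool may_stall, void *ctx);

/* Returns the order that was obtained, or -1 if even min_order failed. */
static int alloc_order_range(unsigned int min_order, unsigned int max_order,
			     try_alloc_fn try_alloc, void *ctx)
{
	assert(min_order <= max_order);

	/* Larger orders have a fallback, so they are purely opportunistic. */
	for (unsigned int order = max_order; order > min_order; order--) {
		if (try_alloc(order, false, ctx))
			return (int)order;
	}

	/* The minimum order has no fallback, so it may stall. */
	return try_alloc(min_order, true, ctx) ? (int)min_order : -1;
}

/* Toy allocator: pretends only orders <= *ctx are currently free. */
static bool toy_alloc(unsigned int order, bool may_stall, void *ctx)
{
	(void)may_stall;
	return order <= *(const unsigned int *)ctx;
}

/* Convenience wrapper for trying the sketch with a given "largest free order". */
static int demo(unsigned int largest_free, unsigned int min_order,
		unsigned int max_order)
{
	return alloc_order_range(min_order, max_order, toy_alloc, &largest_free);
}
```

In this model the caller never needs to know where the costly-order threshold lies; that decision would stay inside the allocator.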
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-05-03 11:55 ` Matthew Wilcox @ 2026-05-06 12:33 ` Salvatore Dipietro 2026-05-27 16:24 ` Salvatore Dipietro 0 siblings, 1 reply; 25+ messages in thread From: Salvatore Dipietro @ 2026-05-06 12:33 UTC (permalink / raw) To: willy Cc: abuehaze, akpm, alisaidi, blakgeof, brauner, dipietro.salvatore, dipiets, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs, ritesh.list, stable, vbabka On 5/03/26 05:52, Ritesh Harjani wrote: > Also as per the documentation [1], huge_pages=try option is the default > setting. So I am assuming in production we at least won't suffer from > this memory fragmentation, correct? Yes, huge_pages=try is the default option, but without pre-allocating the entire shared_buffer size in memory via "vm.nr_hugepages" — which is not done automatically — huge pages will not be used and the system falls into the huge_pages=off category. Even with a partial pre-allocation, PostgreSQL will not be able to use hugepages. On 5/03/26 11:55, Matthew Wilcox wrote: > or we need more understandable GFP flags. Or the page allocator could > use the __GFP_NORETRY flag to say "oh well, this allocation has a fallback, > I'll kick kcompactd to try to compact some more memory, but I'll fail > the allocation". We also tested kicking off kcompactd in the background when __GFP_NORETRY is passed, returning "nopage" to avoid blocking the folio allocation request. 
Here is the patch, tested like the others with the PREEMPT_NONE patch [1] applied:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65e205111553..d4f322910992 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4818,6 +4818,26 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (current->flags & PF_MEMALLOC)
 		goto nopage;

+	/*
+	 * Costly allocations with __GFP_NORETRY are opportunistic - Don't
+	 * stall on direct compaction or reclaim; instead, kick
+	 * kcompactd on the preferred node so large pages may become
+	 * available for future allocations and let the caller fall back now.
+	 *
+	 * Direct compaction is way too costly for hot allocation paths on
+	 * large systems: each attempt calls drain_all_pages() which IPIs
+	 * every CPU. Only wake kcompactd on the local node to avoid
+	 * cross-NUMA interference with unrelated workloads.
+	 */
+	if (costly_order && (gfp_mask & __GFP_NORETRY)) {
+		struct zone *preferred_zone = ac->preferred_zoneref->zone;
+
+		if (preferred_zone)
+			wakeup_kcompactd(preferred_zone->zone_pgdat, order,
+					 ac->highest_zoneidx);
+		goto nopage;
+	}
+
 	/* Try direct reclaim and then allocating */
 	if (!compact_first) {
 		page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags,


Here are the results we collected (kcompactd background):

| Patch                | Run 1      | Run 2      | Run 3      | Average     | % vs Baseline |
|----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
| Baseline             | 107,064.61 |  97,043.86 | 101,830.78 |  101,979.75 |       —       |
| Proposed patch       | 146,012.23 | 136,392.36 | 141,178.00 |  141,194.20 |    +38.45%    |
| Ritesh's suggestion  | 147,481.50 | 133,069.03 | 137,051.30 |  139,200.61 |    +36.50%    |
| Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 |  143,863.82 |    +41.07%    |
| kcompactd background | 146,760.75 | 128,094.92 | 127,979.74 |  134,278.47 |    +31.67%    |


[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368



AMAZON DEVELOPMENT CENTER ITALY SRL,
viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico ^ permalink raw reply related [flat|nested] 25+ messages in thread
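The "Average" and "% vs Baseline" columns in the results table above follow from the per-run numbers by simple arithmetic. The sketch below recomputes the kcompactd row; the helper names (`average()`, `pct_vs_baseline()`, `kcompactd_gain_pct()`) are made up for illustration, and the run values are copied from the table.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n benchmark runs (transactions per second). */
static double average(const double *runs, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += runs[i];
	return sum / (double)n;
}

/* Relative change of 'avg' against the baseline average, in percent. */
static double pct_vs_baseline(double avg, double baseline_avg)
{
	assert(baseline_avg > 0.0);
	return (avg / baseline_avg - 1.0) * 100.0;
}

/* Run values copied from the results table. */
static const double baseline_runs[]  = { 107064.61, 97043.86, 101830.78 };
static const double kcompactd_runs[] = { 146760.75, 128094.92, 127979.74 };

/* Gain of the "kcompactd background" row over the baseline row. */
static double kcompactd_gain_pct(void)
{
	return pct_vs_baseline(average(kcompactd_runs, 3),
			       average(baseline_runs, 3));
}
```

The same two helpers reproduce the other rows when fed their run values.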
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-05-06 12:33 ` Salvatore Dipietro @ 2026-05-27 16:24 ` Salvatore Dipietro 2026-05-31 23:29 ` Karim Manaouil 2026-06-24 8:06 ` Salvatore Dipietro 0 siblings, 2 replies; 25+ messages in thread From: Salvatore Dipietro @ 2026-05-27 16:24 UTC (permalink / raw) To: dipiets Cc: abuehaze, akpm, alisaidi, blakgeof, brauner, dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs, ritesh.list, stable, vbabka, willy Thanks Ritesh and Matthew for the continued feedback and guidance on this thread. I'd like to summarize where we stand and ask for your input on the best path forward. Summary of approaches tested: We've now benchmarked all proposed variations (pgbench simple-update, 1024 clients, 96-vCPU arm64, huge_pages=off, PREEMPT_NONE applied [1]): | Patch | Change Location | Avg TPS | % vs Baseline | |--------------------------------|-----------------------|-----------:|:-------------:| | Baseline (no patch) | — | 101,979.75 | — | | v1 (original, iomap caller) | fs/iomap/buffered-io.c| 141,194.20 | +38.45% | | Ritesh's suggestion | mm/filemap.c | 139,200.61 | +36.50% | | Matthew's suggestion | mm/filemap.c | 143,863.82 | +41.07% | | kcompactd background | mm/page_alloc.c | 134,278.47 | +31.67% | All approaches recover significant throughput. The kcompactd approach (background compaction and returning nopage for costly orders with __GFP_NORETRY) aligns with the architectural direction Dave and Christoph proposed, keeping compaction out of the direct reclaim path, and lives entirely in the page allocator. Based on the discussion, I see two possible directions and would appreciate your guidance: 1. Page allocator fix (mm/page_alloc.c): The kcompactd background approach addresses Matthew's concern that filemap.c shouldn't know about PAGE_ALLOC_COSTLY_ORDER, and aligns with Dave's vision of removing compaction from the direct reclaim path. 2. 
filemap fix (mm/filemap.c): Both Ritesh's and Matthew's suggestions are minimal, backportable, and preserve lightweight reclaim for non-costly orders. Ritesh's variant differentiates between costly and non-costly orders, while Matthew's is simpler and performs best. Would either of these directions be acceptable for a v3, or would you prefer a different approach? I'm happy to test any additional variations or direction to move this forward Salvatore [1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368 AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-05-27 16:24 ` Salvatore Dipietro @ 2026-05-31 23:29 ` Karim Manaouil 2026-06-05 10:58 ` Salvatore Dipietro 2026-06-24 8:06 ` Salvatore Dipietro 1 sibling, 1 reply; 25+ messages in thread From: Karim Manaouil @ 2026-05-31 23:29 UTC (permalink / raw) To: Salvatore Dipietro Cc: abuehaze, akpm, alisaidi, blakgeof, brauner, dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs, ritesh.list, stable, vbabka, willy On Wed, May 27, 2026 at 04:24:10PM +0000, Salvatore Dipietro wrote: > > Thanks Ritesh and Matthew for the continued feedback and guidance on this thread. > I'd like to summarize where we stand and ask for your input on the best path forward. > > Summary of approaches tested: > We've now benchmarked all proposed variations (pgbench simple-update, 1024 clients, > 96-vCPU arm64, huge_pages=off, PREEMPT_NONE applied [1]): > > | Patch | Change Location | Avg TPS | % vs Baseline | > |--------------------------------|-----------------------|-----------:|:-------------:| > | Baseline (no patch) | — | 101,979.75 | — | > | v1 (original, iomap caller) | fs/iomap/buffered-io.c| 141,194.20 | +38.45% | > | Ritesh's suggestion | mm/filemap.c | 139,200.61 | +36.50% | > | Matthew's suggestion | mm/filemap.c | 143,863.82 | +41.07% | > | kcompactd background | mm/page_alloc.c | 134,278.47 | +31.67% | > > > All approaches recover significant throughput. The kcompactd approach (background > compaction and returning nopage for costly orders with __GFP_NORETRY) aligns with the > architectural direction Dave and Christoph proposed, keeping compaction out of the direct > reclaim path, and lives entirely in the page allocator. > > Based on the discussion, I see two possible directions and would appreciate your guidance: > > 1. 
Page allocator fix (mm/page_alloc.c): The kcompactd background approach addresses
> Matthew's concern that filemap.c shouldn't know about PAGE_ALLOC_COSTLY_ORDER, and aligns
> with Dave's vision of removing compaction from the direct reclaim path.
>
> 2. filemap fix (mm/filemap.c): Both Ritesh's and Matthew's suggestions are minimal,
> backportable, and preserve lightweight reclaim for non-costly orders.
> Ritesh's variant differentiates between costly and non-costly orders, while Matthew's
> is simpler and performs best.

I am not very familiar with THPs in the page cache, but for anonymous
memory, we have /sys/kernel/mm/transparent_hugepage/defrag which
decides what to do in the event of a THP allocation failure, whether to
enter a synchronous compaction or wake up kcompactd.

Check vma_thp_gfp_mask(). Maybe you should adopt something similar
called file_thp_gfp_mask().

The problem with fallback is that your application is never going to get
a THP and eventually TLB pressure might actually end up slowing you down
in the long run.

Also compaction is only really tried if it makes sense. That is if
enough free memory is available to actually perform the compaction and
have a chance of creating a large enough huge page. So compaction is
actually never performed under acute memory pressure. Which means your
system actually has enough free pages, but somehow the compaction is slow
and inefficient.

I am just trying to think loudly here and address the root cause. The
real problem here is fragmentation due to unmovable pages, probably in
your case the page tables. We should work more on reducing pageblock
type mixing. Also page tables can actually be made movable so that
compaction can treat them as movable pages.

>
> Would either of these directions be acceptable for a v3, or would you prefer a different approach?
> > I'm happy to test any additional variations or direction to move this forward > > Salvatore > > > [1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368 > > > > > AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico > > -- ~karim ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-05-31 23:29 ` Karim Manaouil
@ 2026-06-05 10:58 ` Salvatore Dipietro
0 siblings, 0 replies; 25+ messages in thread
From: Salvatore Dipietro @ 2026-06-05 10:58 UTC (permalink / raw)
To: kmanaouil.dev
Cc: abuehaze, akpm, alisaidi, blakgeof, brauner, dipietro.salvatore,
dipiets, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs,
ritesh.list, stable, vbabka, willy

On Sat, May 31, 2026 at 11:29:00PM +0000, Karim Manaouil wrote:
> I am not very familiar with THPs in the page cache, but for anonymous
> memory, we have /sys/kernel/mm/transparent_hugepage/defrag which
> decides what to do in the event of a THP allocation failure, whether to
> enter a synchronous compaction or wake up kcompactd.

Thanks Karim for the suggestions. To clarify, THPs are not used, and they
make no performance difference on this workload, as reported in [1].
The failing allocations are for high-order file-backed folios in the
iomap buffered write path.

> I am just trying to think loudly here and address the root cause. The
> real problem here is fragmentation due to unmovable pages, probably in
> your case the page tables. We should work more on reducing pageblock
> type mixing. Also page tables can actually be made movable so that
> compaction can treat them as movable pages.

I agree that making PTEs movable could potentially resolve the
fragmentation at its root, since page table pages are indeed the primary
source of unmovable fragmentation in this workload. However, making page
tables movable has much broader implications.

[1] https://lore.kernel.org/all/20260428150240.3009-1-dipiets@amazon.it/

--
Salvatore



AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-05-27 16:24 ` Salvatore Dipietro 2026-05-31 23:29 ` Karim Manaouil @ 2026-06-24 8:06 ` Salvatore Dipietro 2026-06-24 12:21 ` Ritesh Harjani 2026-06-24 13:34 ` Christoph Hellwig 1 sibling, 2 replies; 25+ messages in thread From: Salvatore Dipietro @ 2026-06-24 8:06 UTC (permalink / raw) To: ritesh.list, willy Cc: dipiets, abuehaze, akpm, alisaidi, blakgeof, brauner, dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs, stable, vbabka Hi Ritesh, Matthew, I wanted to kindly follow up on my summary from May 27th regarding the best path forward for this patch. To recap, we benchmarked all proposed variations and shared the results: | Patch | Change Location | Avg TPS | % vs Baseline | |--------------------------------|------------------------|------------|:-------------:| | Baseline (no patch) | — | 101,979.75 | — | | v1 (original, iomap caller) | fs/iomap/buffered-io.c | 141,194.20 | +38.45% | | Ritesh's suggestion | mm/filemap.c | 139,200.61 | +36.50% | | Matthew's suggestion | mm/filemap.c | 143,863.82 | +41.07% | | kcompactd background | mm/page_alloc.c | 134,278.47 | +31.67% | I'd really appreciate any guidance on which direction would be acceptable for a v3 — whether that's the page allocator approach (kcompactd background), one of the filemap.c fixes, or something else entirely. I'm happy to test any additional variations or direction to move this forward -- Salvatore AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-06-24 8:06 ` Salvatore Dipietro @ 2026-06-24 12:21 ` Ritesh Harjani 2026-06-24 13:34 ` Christoph Hellwig 1 sibling, 0 replies; 25+ messages in thread From: Ritesh Harjani @ 2026-06-24 12:21 UTC (permalink / raw) To: Salvatore Dipietro, willy Cc: dipiets, abuehaze, akpm, alisaidi, blakgeof, brauner, dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs, stable, vbabka, David Hildenbrand, Lorenzo Stoakes, Andrew Morton, Mike Rapoport, Michal Hocko, Vlastimil Babka Salvatore Dipietro <dipiets@amazon.it> writes: > Hi Ritesh, Matthew, > > I wanted to kindly follow up on my summary from May 27th regarding the best path > forward for this patch. > Hi Salvatore, Sorry about the delay. I did bring this topic up in one of our internal ext4 community calls. And to share some context, MM community thinks we need a better long term fix for this problem rather than patching call sites and/or playing tricks like - diff --git a/mm/filemap.c b/mm/filemap.c index 4e636647100c..f2343c26dd63 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping, gfp_t alloc_gfp = gfp; err = -ENOMEM; - if (order > min_order) - alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; + if (order > min_order) { + alloc_gfp |= __GFP_NOWARN; + if (order > PAGE_ALLOC_COSTLY_ORDER) + alloc_gfp &= ~__GFP_DIRECT_RECLAIM; + else + alloc_gfp |= __GFP_NORETRY; + } Unfortunately most of the folks might be missing free cycles to work on this problem right now :( - Hence the delay in addressing this.. However - I would like to bring this problem to other MM community members as well who might have an interest in this space. Can we look into the proposed solutions from Salvatore and suggest the next steps please? Maybe if someone can share what is MM community looking for here - I guess that will be a good start. 
Looking at the table, I think Salvatore also shared a diff for
kicking kcompactd in the background [2].

[2]: https://lore.kernel.org/all/20260506123326.17293-1-dipiets@amazon.it/

(Sorry, I still have a few other things on my plate before I can start
looking into this more actively. But let's hear from others, who have
better knowledge than me on this.)

> To recap, we benchmarked all proposed variations and shared the results:
>
> | Patch                          | Change Location        | Avg TPS    | % vs Baseline |
> |--------------------------------|------------------------|------------|:-------------:|
> | Baseline (no patch)            | —                      | 101,979.75 |       —       |
> | v1 (original, iomap caller)    | fs/iomap/buffered-io.c | 141,194.20 |    +38.45%    |
> | Ritesh's suggestion            | mm/filemap.c           | 139,200.61 |    +36.50%    |
> | Matthew's suggestion           | mm/filemap.c           | 143,863.82 |    +41.07%    |
> | kcompactd background           | mm/page_alloc.c        | 134,278.47 |    +31.67%    |
>

-ritesh

^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-06-24 8:06 ` Salvatore Dipietro
2026-06-24 12:21 ` Ritesh Harjani
@ 2026-06-24 13:34 ` Christoph Hellwig
1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2026-06-24 13:34 UTC (permalink / raw)
To: Salvatore Dipietro
Cc: ritesh.list, willy, abuehaze, akpm, alisaidi, blakgeof, brauner,
dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm,
linux-xfs, stable, vbabka

On Wed, Jun 24, 2026 at 08:06:36AM +0000, Salvatore Dipietro wrote:
>
> Hi Ritesh, Matthew,
>
> I wanted to kindly follow up on my summary from May 27th regarding the best path
> forward for this patch.
>
> To recap, we benchmarked all proposed variations and shared the results:
>
> | Patch                          | Change Location        | Avg TPS    | % vs Baseline |
> |--------------------------------|------------------------|------------|:-------------:|
> | Baseline (no patch)            | —                      | 101,979.75 |       —       |
> | v1 (original, iomap caller)    | fs/iomap/buffered-io.c | 141,194.20 |    +38.45%    |
> | Ritesh's suggestion            | mm/filemap.c           | 139,200.61 |    +36.50%    |
> | Matthew's suggestion           | mm/filemap.c           | 143,863.82 |    +41.07%    |
> | kcompactd background           | mm/page_alloc.c        | 134,278.47 |    +31.67%    |
>
> I'd really appreciate any guidance on which direction would be acceptable for a v3 —
> whether that's the page allocator approach (kcompactd background), one of the filemap.c
> fixes, or something else entirely.

Do you have pointers to the patches for each approach above?

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro 2026-04-04 1:13 ` Ritesh Harjani 2026-04-04 4:15 ` Matthew Wilcox @ 2026-04-05 22:43 ` Dave Chinner 2026-04-07 5:40 ` Christoph Hellwig 2026-04-21 9:02 ` Vlastimil Babka 2 siblings, 2 replies; 25+ messages in thread From: Dave Chinner @ 2026-04-05 22:43 UTC (permalink / raw) To: Salvatore Dipietro Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy, stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote: > Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") > introduced high-order folio allocations in the buffered write > path. When memory is fragmented, each failed allocation triggers > compaction and drain_all_pages() via __alloc_pages_slowpath(), > causing a 0.75x throughput drop on pgbench (simple-update) with > 1024 clients on a 96-vCPU arm64 system. > > Strip __GFP_DIRECT_RECLAIM from folio allocations in > iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER, > making them purely opportunistic. 
> > Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") > Cc: stable@vger.kernel.org > Signed-off-by: Salvatore Dipietro <dipiets@amazon.it> > --- > fs/iomap/buffered-io.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 92a831cf4bf1..cb843d54b4d9 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate); > struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len) > { > fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS; > + gfp_t gfp; > > if (iter->flags & IOMAP_NOWAIT) > fgp |= FGP_NOWAIT; > @@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len) > fgp |= FGP_DONTCACHE; > fgp |= fgf_set_order(len); > > + gfp = mapping_gfp_mask(iter->inode->i_mapping); > + > + /* > + * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER, > + * strip __GFP_DIRECT_RECLAIM to make the allocation purely > + * opportunistic. This avoids compaction + drain_all_pages() > + * in __alloc_pages_slowpath() that devastate throughput > + * on large systems during buffered writes. > + */ > + if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER) > + gfp &= ~__GFP_DIRECT_RECLAIM; Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere we need to do high order folio allocation is getting out of hand. Compaction improves long term system performance, so we don't really just want to turn it off whenever we have demand for high order folios. We should be doing is getting rid of compaction out of the direct reclaim path - it is -clearly- way too costly for hot paths that use large allocations, especially those with fallbacks to smaller allocations or vmalloc. Instead, memory reclaim should kick background compaction and let it do the work. 
If the allocation path really, really needs high order allocation to succeed, then it can direct the allocation to retry until it succeeds and the allocator itself can wait for background compaction to make progress. For code that has fallbacks to smaller allocations, then there is no need to wait for compaction - we can attempt fast smaller allocations and continue that way until an allocation succeeds.... -Dave. -- Dave Chinner dgc@kernel.org ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-05 22:43 ` Dave Chinner @ 2026-04-07 5:40 ` Christoph Hellwig 2026-04-21 9:02 ` Vlastimil Babka 1 sibling, 0 replies; 25+ messages in thread From: Christoph Hellwig @ 2026-04-07 5:40 UTC (permalink / raw) To: Dave Chinner Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy, stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel On Mon, Apr 06, 2026 at 08:43:57AM +1000, Dave Chinner wrote: > > + if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER) > > + gfp &= ~__GFP_DIRECT_RECLAIM; > > Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere > we need to do high order folio allocation is getting out of hand. That's what I thought. > Compaction improves long term system performance, so we don't really > just want to turn it off whenever we have demand for high order > folios. Yes. Also if we want to make block size > PAGE_SIZE a real option, just giving up on allocating large folios is not an option. > Instead, memory reclaim should kick background compaction and let it > do the work. If the allocation path really, really needs high order > allocation to succeed, then it can direct the allocation to retry > until it succeeds and the allocator itself can wait for background > compaction to make progress. > > For code that has fallbacks to smaller allocations, then there is no > need to wait for compaction - we can attempt fast smaller allocations > and continue that way until an allocation succeeds.... Yes. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-05 22:43 ` Dave Chinner
2026-04-07 5:40 ` Christoph Hellwig
@ 2026-04-21 9:02 ` Vlastimil Babka
1 sibling, 0 replies; 25+ messages in thread
From: Vlastimil Babka @ 2026-04-21 9:02 UTC (permalink / raw)
To: Dave Chinner, Salvatore Dipietro
Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
willy, stable, Christian Brauner, Darrick J. Wong, linux-xfs,
linux-fsdevel, Ritesh Harjani (IBM), Christoph Hellwig,
linux-mm@kvack.org, Michal Hocko, David Hildenbrand (Red Hat),
Johannes Weiner

On 4/6/26 00:43, Dave Chinner wrote:
> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> introduced high-order folio allocations in the buffered write
>> path. When memory is fragmented, each failed allocation triggers
>> compaction and drain_all_pages() via __alloc_pages_slowpath(),
>> causing a 0.75x throughput drop on pgbench (simple-update) with
>> 1024 clients on a 96-vCPU arm64 system.
>>
>> Strip __GFP_DIRECT_RECLAIM from folio allocations in
>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
>> making them purely opportunistic.
>>
>> Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>

BTW, backporting perf regression fixes to 6.6, when they are only
reported at the time 7.0 is released, might be too risky. There will
likely be a different workload that will regress as a result, no matter
what we do.

>> ---
>>  fs/iomap/buffered-io.c | 15 ++++++++++++++-
>>  1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 92a831cf4bf1..cb843d54b4d9 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>>  {
>>  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
>> +	gfp_t gfp;
>>
>>  	if (iter->flags & IOMAP_NOWAIT)
>>  		fgp |= FGP_NOWAIT;
>> @@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>>  		fgp |= FGP_DONTCACHE;
>>  	fgp |= fgf_set_order(len);
>>
>> +	gfp = mapping_gfp_mask(iter->inode->i_mapping);
>> +
>> +	/*
>> +	 * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
>> +	 * strip __GFP_DIRECT_RECLAIM to make the allocation purely
>> +	 * opportunistic. This avoids compaction + drain_all_pages()
>> +	 * in __alloc_pages_slowpath() that devastate throughput
>> +	 * on large systems during buffered writes.
>> +	 */
>> +	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
>> +		gfp &= ~__GFP_DIRECT_RECLAIM;
>
> Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere
> we need to do high order folio allocation is getting out of hand.
>
> Compaction improves long term system performance, so we don't really
> just want to turn it off whenever we have demand for high order
> folios.
>
> We should be doing is getting rid of compaction out of the direct
> reclaim path - it is -clearly- way too costly for hot paths that use
> large allocations, especially those with fallbacks to smaller
> allocations or vmalloc.
>
> Instead, memory reclaim should kick background compaction and let it
> do the work. If the allocation path really, really needs high order
> allocation to succeed, then it can direct the allocation to retry
> until it succeeds and the allocator itself can wait for background
> compaction to make progress.
>
> For code that has fallbacks to smaller allocations, then there is no
> need to wait for compaction - we can attempt fast smaller allocations
> and continue that way until an allocation succeeds....

So, should we do an LSF/MM session?

But I think in any case, the page allocator needs to know which
allocations do have the fallback. __GFP_NORETRY exists for this. Here it
wasn't tried at all, in v2 [1] it was, but not alone. I'd start from
__GFP_NORETRY alone, and then we can look at tweaking what it does if
it's currently insufficient.

We could have a helper to encapsulate this "turn this allocation to a
lightweight fallbackable one", which would add __GFP_NORETRY. It
probably already exists somewhere but not in gfp.h.

But I'm not sure we can simply change GFP_KERNEL to start failing more
for non-costly orders. We've discussed that a lot in the past :)

[1] https://lore.kernel.org/all/20260420161404.642-1-dipiets@amazon.it/

> -Dave.

^ permalink raw reply [flat|nested] 25+ messages in thread
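[Editorial illustration, not part of the thread.] The helper Vlastimil
suggests above, next to the mask change made by the posted patch, can
be sketched as a userspace model. The MODEL_* bit values are
placeholders chosen for this sketch and are NOT the kernel's real flag
values; model_gfp_fallbackable() and model_gfp_opportunistic() are
hypothetical names, not existing kernel helpers.

```c
/* Userspace stand-in for the kernel's gfp_t; hypothetical. */
typedef unsigned int model_gfp_t;

/* Placeholder bits for this model only; not the kernel's values. */
#define MODEL_GFP_DIRECT_RECLAIM	(1u << 0)
#define MODEL_GFP_KSWAPD_RECLAIM	(1u << 1)
#define MODEL_GFP_NORETRY		(1u << 2)

/*
 * Suggested direction: keep every reclaim bit and only mark the
 * allocation as one that has a fallback, by adding the NORETRY bit.
 */
static model_gfp_t model_gfp_fallbackable(model_gfp_t gfp)
{
	return gfp | MODEL_GFP_NORETRY;
}

/*
 * What the posted patch does for costly orders: clear the direct
 * reclaim bit so the attempt is purely opportunistic.
 */
static model_gfp_t model_gfp_opportunistic(model_gfp_t gfp)
{
	return gfp & ~MODEL_GFP_DIRECT_RECLAIM;
}
```

The two helpers differ in what they tell the allocator: the first
leaves direct reclaim permitted and adds a hint, the second removes
the permission outright. How much work the real allocator does under
either mask is exactly what the thread is debating, so the model says
nothing about that.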
[parent not found: <20260403193201.30479-1-dipiets@amazon.it>]
* [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
[not found] <20260403193201.30479-1-dipiets@amazon.it>
@ 2026-04-03 19:32 ` Salvatore Dipietro
2026-04-04 6:25 ` Greg KH
0 siblings, 1 reply; 25+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:32 UTC (permalink / raw)
To: dipietro.salvatore; +Cc: stable

Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
introduced high-order folio allocations in the buffered write
path. When memory is fragmented, each failed allocation triggers
compaction and drain_all_pages() via __alloc_pages_slowpath(),
causing a 0.75x throughput drop on pgbench (simple-update) with
1024 clients on a 96-vCPU arm64 system.

Strip __GFP_DIRECT_RECLAIM from folio allocations in
iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
making them purely opportunistic.

Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
---
 fs/iomap/buffered-io.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 92a831cf4bf1..cb843d54b4d9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
+	gfp_t gfp;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
@@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 		fgp |= FGP_DONTCACHE;
 	fgp |= fgf_set_order(len);
 
+	gfp = mapping_gfp_mask(iter->inode->i_mapping);
+
+	/*
+	 * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
+	 * strip __GFP_DIRECT_RECLAIM to make the allocation purely
+	 * opportunistic. This avoids compaction + drain_all_pages()
+	 * in __alloc_pages_slowpath() that devastate throughput
+	 * on large systems during buffered writes.
+	 */
+	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
+		gfp &= ~__GFP_DIRECT_RECLAIM;
+
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
-			fgp, mapping_gfp_mask(iter->inode->i_mapping));
+			fgp, gfp);
 }
 EXPORT_SYMBOL_GPL(iomap_get_folio);
-- 
2.50.1 (Apple Git-155)

AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico

^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-03 19:32 ` Salvatore Dipietro
@ 2026-04-04 6:25 ` Greg KH
0 siblings, 0 replies; 25+ messages in thread
From: Greg KH @ 2026-04-04 6:25 UTC (permalink / raw)
To: Salvatore Dipietro; +Cc: dipietro.salvatore, stable

On Fri, Apr 03, 2026 at 07:32:01PM +0000, Salvatore Dipietro wrote:
> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers
> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with
> 1024 clients on a 96-vCPU arm64 system.
>
> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> making them purely opportunistic.
>
> Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> Cc: stable@vger.kernel.org
> Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
> ---
>  fs/iomap/buffered-io.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>

^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2026-06-24 14:24 UTC | newest]
Thread overview: 25+ messages
2026-04-03 19:35 [PATCH 0/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
2026-04-04 1:13 ` Ritesh Harjani
2026-04-04 4:15 ` Matthew Wilcox
2026-04-04 16:47 ` Ritesh Harjani
2026-04-04 20:46 ` Matthew Wilcox
2026-04-16 15:14 ` Ritesh Harjani
2026-04-20 16:33 ` Salvatore Dipietro
2026-04-20 18:44 ` Matthew Wilcox
2026-04-21 1:16 ` Ritesh Harjani
2026-04-28 15:02 ` Salvatore Dipietro
2026-05-03 5:52 ` Ritesh Harjani
2026-05-03 11:55 ` Matthew Wilcox
2026-05-06 12:33 ` Salvatore Dipietro
2026-05-27 16:24 ` Salvatore Dipietro
2026-05-31 23:29 ` Karim Manaouil
2026-06-05 10:58 ` Salvatore Dipietro
2026-06-24 8:06 ` Salvatore Dipietro
2026-06-24 12:21 ` Ritesh Harjani
2026-06-24 13:34 ` Christoph Hellwig
2026-04-05 22:43 ` Dave Chinner
2026-04-07 5:40 ` Christoph Hellwig
2026-04-21 9:02 ` Vlastimil Babka
[not found] <20260403193201.30479-1-dipiets@amazon.it>
2026-04-03 19:32 ` Salvatore Dipietro
2026-04-04 6:25 ` Greg KH