[PATCH 0/1] iomap: avoid compaction for costly folio order allocation

* [PATCH 0/1] iomap: avoid compaction for costly folio order allocation
@ 2026-04-03 19:35 Salvatore Dipietro
  2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
  0 siblings, 1 reply; 25+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
	Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel

We are reporting a throughput regression on PostgreSQL pgbench
(simple-update) on arm64 caused by commit 5d8edfb900d5 ("iomap:
Copy larger chunks from userspace") introduced in v6.6-rc1.

The regression reduces throughput to roughly 0.75x on a pgbench
simple-update workload with 1024 clients on a 96-vCPU arm64
system. When memory is even slightly fragmented, each failed
high-order allocation enters __alloc_pages_slowpath(), which
runs two compaction passes and drain_all_pages(), forcing all
vCPUs to release their per-CPU pages. This is repeated once
for each order (up to 6) until the allocation succeeds.

The patch makes costly-order folio allocations in the iomap
buffered write path purely opportunistic -- no direct reclaim,
no compaction, no drain_all_pages().
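
To see which write lengths fall on the stripped path, the order hint can
be modelled in user space. This is a rough sketch, not kernel code: the
helper names are hypothetical, order_hint() assumes the hint is
floor(log2(len)) minus PAGE_SHIFT (which is how fgf_set_order() appears
to compute it), and 4KB base pages are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT   12	/* assumption: 4KB base pages */
#define MODEL_COSTLY_ORDER 3	/* value of PAGE_ALLOC_COSTLY_ORDER */

/* floor(log2(len)); len must be non-zero */
static inline unsigned int model_ilog2(size_t len)
{
	unsigned int shift = 0;

	assert(len > 0);
	while (len > 1) {
		len >>= 1;
		shift++;
	}
	return shift;
}

/* Folio order hint for a write of len bytes (hypothetical helper) */
static inline unsigned int order_hint(size_t len)
{
	unsigned int shift = len ? model_ilog2(len) : 0;

	return shift <= MODEL_PAGE_SHIFT ? 0 : shift - MODEL_PAGE_SHIFT;
}

/* True when the patch would clear __GFP_DIRECT_RECLAIM for this length */
static inline bool strips_direct_reclaim(size_t len)
{
	return order_hint(len) > MODEL_COSTLY_ORDER;
}
```

Under these assumptions a 32KB write hints order 3 and keeps its flags,
while a 64KB write hints order 4 and becomes opportunistic.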

Combined with the separate PREEMPT_LAZY regression [1], the
total impact is a 2.87x throughput and latency loss.

1. Test environment
___________________

  Hardware:  1x AWS EC2 m8g.24xlarge
             (12x 1TB IO2 32000 iops RAID0 XFS)
  OS:        AL2023 (ami-03a8d3251f401ffca)
  Kernel:    next-20260331
  Database:  PostgreSQL 17
  Workload:  pgbench simple-update
             1024 clients, 96 threads, 1200s duration
             scale factor 8470, fillfactor=90, prepared protocol

2. Results
__________

  Config                 Run1       Run2       Run3       Avg         x
  _____________________  _________  _________  _________  __________  ____
  baseline                47242.39   53369.18   51644.29    50751.96  1.00
  iomap patch             69305.92   66994.08   64603.33    66967.78  1.32
  preempt-none [1]        92906.62  103976.03   98814.94    98565.86  1.94
  iomap+preempt-none[1]  145904.53  146470.95  144728.91   145701.46  2.87

3. Reproduction
_______________

On an AWS EC2 m8g.24xlarge instance, install and run PostgreSQL
using the repro-collection repository as follows:

  # Reproducer code:
  git clone https://github.com/aws/repro-collection.git ~/repro-collection

  # Setup and start PostgreSQL server in terminal 1:
  ~/repro-collection/run.sh postgresql SUT --ldg=127.0.0.1

  # Run pgbench load generator in terminal 2:
  PGBENCH_SCALE=8470 \
  PGBENCH_INIT_EXTRA_ARGS="--fillfactor=90" \
  PGBENCH_CLIENTS=1024 \
  PGBENCH_THREADS=96 \
  PGBENCH_DURATION=1200 \
  PGBENCH_BUILTIN=simple-update \
  PGBENCH_PROTOCOL=prepared \
  ~/repro-collection/run.sh postgresql LDG --sut=127.0.0.1

[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#t

Salvatore Dipietro (1):
  iomap: avoid compaction for costly folio order allocation

 fs/iomap/buffered-io.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

base-commit: 9147566d801602c9e7fc7f85e989735735bf38ba
-- 
2.50.1 (Apple Git-155)

AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-03 19:35 [PATCH 0/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
@ 2026-04-03 19:35 ` Salvatore Dipietro
  2026-04-04  1:13   ` Ritesh Harjani
                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
	stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel

Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
introduced high-order folio allocations in the buffered write
path. When memory is fragmented, each failed allocation triggers
compaction and drain_all_pages() via __alloc_pages_slowpath(),
causing a 0.75x throughput drop on pgbench (simple-update) with 
1024 clients on a 96-vCPU arm64 system.

Strip __GFP_DIRECT_RECLAIM from folio allocations in
iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
making them purely opportunistic.

Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
---
 fs/iomap/buffered-io.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 92a831cf4bf1..cb843d54b4d9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
+	gfp_t gfp;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
@@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 		fgp |= FGP_DONTCACHE;
 	fgp |= fgf_set_order(len);
 
+	gfp = mapping_gfp_mask(iter->inode->i_mapping);
+
+	/*
+	 * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
+	 * strip __GFP_DIRECT_RECLAIM to make the allocation purely
+	 * opportunistic.  This avoids compaction + drain_all_pages()
+	 * in __alloc_pages_slowpath() that devastate throughput
+	 * on large systems during buffered writes.
+	 */
+	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
+		gfp &= ~__GFP_DIRECT_RECLAIM;
+
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
-			fgp, mapping_gfp_mask(iter->inode->i_mapping));
+			fgp, gfp);
 }
 EXPORT_SYMBOL_GPL(iomap_get_folio);
 
-- 
2.50.1 (Apple Git-155)




^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
@ 2026-04-04  1:13   ` Ritesh Harjani
  2026-04-04  4:15   ` Matthew Wilcox
  2026-04-05 22:43   ` Dave Chinner
  2 siblings, 0 replies; 25+ messages in thread
From: Ritesh Harjani @ 2026-04-04  1:13 UTC (permalink / raw)
  To: Salvatore Dipietro, linux-kernel
  Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
	stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel, linux-mm


Let's cc: linux-mm too.

Salvatore Dipietro <dipiets@amazon.it> writes:

> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers

Isn't running compaction the right thing to do when memory is
fragmented?

> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with 
> 1024 clients on a 96-vCPU arm64 system.
>

I think removing the __GFP_DIRECT_RECLAIM flag unconditionally at the
caller may cause -ENOMEM. Note that it is __filemap_get_folio() that
retries with smaller-order allocations, so instead of changing the
callers, shouldn't this be fixed in __filemap_get_folio()?

Maybe there, too, we should keep the reclaim flag (if passed by the
caller), at least for orders <= PAGE_ALLOC_COSTLY_ORDER + 1.

Thoughts?

-ritesh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
  2026-04-04  1:13   ` Ritesh Harjani
@ 2026-04-04  4:15   ` Matthew Wilcox
  2026-04-04 16:47     ` Ritesh Harjani
  2026-04-05 22:43   ` Dave Chinner
  2 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2026-04-04  4:15 UTC (permalink / raw)
  To: Salvatore Dipietro
  Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
	stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel, linux-mm

On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers
> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with 
> 1024 clients on a 96-vCPU arm64 system.
> 
> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> making them purely opportunistic.

If you look at __filemap_get_folio_mpol(), that's kind of being tried
already:

                        if (order > min_order)
                                alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;

 * %__GFP_NORETRY: The VM implementation will try only very lightweight
 * memory direct reclaim to get some memory under memory pressure (thus
 * it can sleep). It will avoid disruptive actions like OOM killer. The
 * caller must handle the failure which is quite likely to happen under
 * heavy memory pressure. The flag is suitable when failure can easily be
 * handled at small cost, such as reduced throughput.

which, from the description, seemed like the right approach.  So either
the description or the implementation should be updated, I suppose?

Now, what happens if you change those two lines to:

			if (order > min_order) {
				alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
				alloc_gfp |= __GFP_NOWARN;
			}

Do you recover the performance?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-04  4:15   ` Matthew Wilcox
@ 2026-04-04 16:47     ` Ritesh Harjani
  2026-04-04 20:46       ` Matthew Wilcox
  2026-04-16 15:14       ` Ritesh Harjani
  0 siblings, 2 replies; 25+ messages in thread
From: Ritesh Harjani @ 2026-04-04 16:47 UTC (permalink / raw)
  To: Matthew Wilcox, Salvatore Dipietro
  Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
	stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel, linux-mm

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> introduced high-order folio allocations in the buffered write
>> path. When memory is fragmented, each failed allocation triggers
>> compaction and drain_all_pages() via __alloc_pages_slowpath(),
>> causing a 0.75x throughput drop on pgbench (simple-update) with 
>> 1024 clients on a 96-vCPU arm64 system.
>> 
>> Strip __GFP_DIRECT_RECLAIM from folio allocations in
>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
>> making them purely opportunistic.
>
> If you look at __filemap_get_folio_mpol(), that's kind of being tried
> already:
>
>                         if (order > min_order)
>                                 alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
>
>  * %__GFP_NORETRY: The VM implementation will try only very lightweight
>  * memory direct reclaim to get some memory under memory pressure (thus
>  * it can sleep). It will avoid disruptive actions like OOM killer. The
>  * caller must handle the failure which is quite likely to happen under
>  * heavy memory pressure. The flag is suitable when failure can easily be
>  * handled at small cost, such as reduced throughput.
>
> which, from the description, seemed like the right approach.  So either
> the description or the implementation should be updated, I suppose?
>
> Now, what happens if you change those two lines to:
>
> 			if (order > min_order) {
> 				alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> 				alloc_gfp |= __GFP_NOWARN;
> 			}

Hi Matthew,

Shouldn't we try this instead? This would still allow us to keep
__GFP_NORETRY and hence lightweight direct reclaim/compaction for
at least the non-costly order allocations, right?

 			if (order > min_order) {
				alloc_gfp |= __GFP_NOWARN;
				if (order > PAGE_ALLOC_COSTLY_ORDER)
					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
				else
					alloc_gfp |= __GFP_NORETRY;
			}
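
The three variants discussed so far (the existing __GFP_NORETRY code,
Matthew's change, and the one just above) can be compared in a small
user-space model. This is only a sketch: the M_* bits are stand-in
values chosen for illustration, not the kernel's real gfp encodings,
and model_alloc_gfp() is a hypothetical helper.

```c
#include <assert.h>

/* Stand-in bits for illustration only; NOT the kernel's gfp values. */
#define M_DIRECT_RECLAIM 0x1u
#define M_NORETRY        0x2u
#define M_NOWARN         0x4u
#define M_COSTLY_ORDER   3u	/* value of PAGE_ALLOC_COSTLY_ORDER */

enum policy {
	POLICY_CURRENT,		/* |= NORETRY | NOWARN */
	POLICY_STRIP_ALL,	/* clear DIRECT_RECLAIM above min_order */
	POLICY_STRIP_COSTLY,	/* clear it only above the costly order */
};

/* Flags used for one allocation attempt at the given order */
static inline unsigned int model_alloc_gfp(enum policy p, unsigned int gfp,
					   unsigned int order,
					   unsigned int min_order)
{
	assert(order >= min_order);
	if (order == min_order)
		return gfp;	/* final attempt: caller's flags untouched */

	switch (p) {
	case POLICY_CURRENT:
		return gfp | M_NORETRY | M_NOWARN;
	case POLICY_STRIP_ALL:
		return (gfp & ~M_DIRECT_RECLAIM) | M_NOWARN;
	case POLICY_STRIP_COSTLY:
		gfp |= M_NOWARN;
		if (order > M_COSTLY_ORDER)
			return gfp & ~M_DIRECT_RECLAIM;
		return gfp | M_NORETRY;
	}
	return gfp;
}
```

In this model the last two policies differ only for orders between
min_order + 1 and PAGE_ALLOC_COSTLY_ORDER, where the third keeps direct
reclaim with __GFP_NORETRY.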

-ritesh

>
> Do you recover the performance?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-04 16:47     ` Ritesh Harjani
@ 2026-04-04 20:46       ` Matthew Wilcox
  2026-04-16 15:14       ` Ritesh Harjani
  1 sibling, 0 replies; 25+ messages in thread
From: Matthew Wilcox @ 2026-04-04 20:46 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
	dipietro.salvatore, stable, Christian Brauner, Darrick J. Wong,
	linux-xfs, linux-fsdevel, linux-mm

On Sat, Apr 04, 2026 at 10:17:33PM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
> >> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> >> introduced high-order folio allocations in the buffered write
> >> path. When memory is fragmented, each failed allocation triggers
> >> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> >> causing a 0.75x throughput drop on pgbench (simple-update) with 
> >> 1024 clients on a 96-vCPU arm64 system.
> >> 
> >> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> >> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> >> making them purely opportunistic.
> >
> > If you look at __filemap_get_folio_mpol(), that's kind of being tried
> > already:
> >
> >                         if (order > min_order)
> >                                 alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> >
> >  * %__GFP_NORETRY: The VM implementation will try only very lightweight
> >  * memory direct reclaim to get some memory under memory pressure (thus
> >  * it can sleep). It will avoid disruptive actions like OOM killer. The
> >  * caller must handle the failure which is quite likely to happen under
> >  * heavy memory pressure. The flag is suitable when failure can easily be
> >  * handled at small cost, such as reduced throughput.
> >
> > which, from the description, seemed like the right approach.  So either
> > the description or the implementation should be updated, I suppose?
> >
> > Now, what happens if you change those two lines to:
> >
> > 			if (order > min_order) {
> > 				alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> > 				alloc_gfp |= __GFP_NOWARN;
> > 			}
> 
> Hi Matthew,
> 
> Shouldn't we try this instead? This would still allow us to keep
> __GFP_NORETRY and hence lightweight direct reclaim/compaction for
> at least the non-costly order allocations, right?
> 
>  			if (order > min_order) {
> 				alloc_gfp |= __GFP_NOWARN;
> 				if (order > PAGE_ALLOC_COSTLY_ORDER)
> 					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> 				else
> 					alloc_gfp |= __GFP_NORETRY;
> 			}

Uhh ... maybe?  I'd want someone more familiar with the page allocator
than I am to say whether that's the right approach.  If it is, that
seems too complex, and maybe we need a better approach to the page
allocator flags.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-04 16:47     ` Ritesh Harjani
  2026-04-04 20:46       ` Matthew Wilcox
@ 2026-04-16 15:14       ` Ritesh Harjani
  2026-04-20 16:33         ` Salvatore Dipietro
  1 sibling, 1 reply; 25+ messages in thread
From: Ritesh Harjani @ 2026-04-16 15:14 UTC (permalink / raw)
  To: Matthew Wilcox, Salvatore Dipietro
  Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
	stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel, linux-mm

Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:

> Matthew Wilcox <willy@infradead.org> writes:
>
>> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
>>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>>> introduced high-order folio allocations in the buffered write
>>> path. When memory is fragmented, each failed allocation triggers
>>> compaction and drain_all_pages() via __alloc_pages_slowpath(),
>>> causing a 0.75x throughput drop on pgbench (simple-update) with 
>>> 1024 clients on a 96-vCPU arm64 system.
>>> 
>>> Strip __GFP_DIRECT_RECLAIM from folio allocations in
>>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
>>> making them purely opportunistic.
>>
>> If you look at __filemap_get_folio_mpol(), that's kind of being tried
>> already:
>>
>>                         if (order > min_order)
>>                                 alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
>>
>>  * %__GFP_NORETRY: The VM implementation will try only very lightweight
>>  * memory direct reclaim to get some memory under memory pressure (thus
>>  * it can sleep). It will avoid disruptive actions like OOM killer. The
>>  * caller must handle the failure which is quite likely to happen under
>>  * heavy memory pressure. The flag is suitable when failure can easily be
>>  * handled at small cost, such as reduced throughput.
>>
>> which, from the description, seemed like the right approach.  So either
>> the description or the implementation should be updated, I suppose?
>>
>> Now, what happens if you change those two lines to:
>>
>> 			if (order > min_order) {
>> 				alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
>> 				alloc_gfp |= __GFP_NOWARN;
>> 			}
>
> Hi Matthew,
>
> Shouldn't we try this instead? This would still allow us to keep
> __GFP_NORETRY and hence lightweight direct reclaim/compaction for
> at least the non-costly order allocations, right?
>
>  			if (order > min_order) {
> 				alloc_gfp |= __GFP_NOWARN;
> 				if (order > PAGE_ALLOC_COSTLY_ORDER)
> 					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> 				else
> 					alloc_gfp |= __GFP_NORETRY;
> 			}
>

Hi Salvatore,

Did you get a chance to test the above two options (shared by Matthew
and me)? And were you able to recover the performance with those?

In the longer run, as Dave suggested, we might need to fix this by
considering removing compaction from the direct reclaim path. But for
fixing it in older kernel releases we might need a quick fix, so it
may be worth trying the changes suggested above.

Also, I am somehow not able to hit this problem at my end (even after
creating a bit of memory fragmentation). So please also feel free to
share the steps, if you have a setup that re-creates it easily.

-ritesh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-16 15:14       ` Ritesh Harjani
@ 2026-04-20 16:33         ` Salvatore Dipietro
  2026-04-20 18:44           ` Matthew Wilcox
  0 siblings, 1 reply; 25+ messages in thread
From: Salvatore Dipietro @ 2026-04-20 16:33 UTC (permalink / raw)
  To: ritesh.list
  Cc: abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore,
	dipiets, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs,
	stable, willy

I have submitted a v2 of the patch based on Ritesh's suggestion.
https://lore.kernel.org/linux-mm/20260420161404.642-1-dipiets@amazon.it/T/#u

Salvatore

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-20 16:33         ` Salvatore Dipietro
@ 2026-04-20 18:44           ` Matthew Wilcox
  2026-04-21  1:16             ` Ritesh Harjani
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2026-04-20 18:44 UTC (permalink / raw)
  To: Salvatore Dipietro
  Cc: ritesh.list, abuehaze, alisaidi, blakgeof, brauner,
	dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm,
	linux-xfs, stable

On Mon, Apr 20, 2026 at 04:33:28PM +0000, Salvatore Dipietro wrote:
> I have submitted a v2 of the patch based on Ritesh's suggestion.
> https://lore.kernel.org/linux-mm/20260420161404.642-1-dipiets@amazon.it/T/#u

... but without linking back to this thread, so nobody who was exposed
to that thread for the first time knows about this one.  That's poor form.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-20 18:44           ` Matthew Wilcox
@ 2026-04-21  1:16             ` Ritesh Harjani
  2026-04-28 15:02               ` Salvatore Dipietro
  0 siblings, 1 reply; 25+ messages in thread
From: Ritesh Harjani @ 2026-04-21  1:16 UTC (permalink / raw)
  To: Salvatore Dipietro
  Cc: Matthew Wilcox, abuehaze, alisaidi, blakgeof, brauner,
	dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm,
	linux-xfs, stable

Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Apr 20, 2026 at 04:33:28PM +0000, Salvatore Dipietro wrote:
>> I have submitted a v2 of the patch based on Ritesh's suggestion.
>> https://lore.kernel.org/linux-mm/20260420161404.642-1-dipiets@amazon.it/T/#u
>
> ... but without linking back to this thread, so nobody who was exposed
> to that thread for the first time knows about this one.  That's poor form.

Yup.
Also, given that the maintainers (willy, Christoph, Dave) have shown their
disinterest in taking the patch in its current form, the right way is
to come back with performance data for both approaches (which we
were discussing) and first get consensus from everyone before
proposing this as a patch :).

Having said that, we do care if a genuine performance issue gets
reported. In that context, I wanted to understand your setup a bit from
a memory fragmentation perspective. Are you simulating memory
fragmentation and then benchmarking? Or does this problem hit when
you simply run the reproduction steps mentioned in your cover
letter?

BTW - I was following the other thread too, where the PREEMPT_LAZY problem
was being discussed. From what I understood, you mentioned [1] that
enabling THP on the system made that problem go away. It also looks like
enabling THP is the right thing to do for this kind of workload. Does
that mean enabling THP fixed this problem too? Do you still hit
memory fragmentation and/or a similar throughput drop w/o this fix after
you enable THP? It would be good to know those details too, please.

[1]: https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#md88ca4258766e897e432df85874d197db476c7d1

-ritesh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-21  1:16             ` Ritesh Harjani
@ 2026-04-28 15:02               ` Salvatore Dipietro
  2026-05-03  5:52                 ` Ritesh Harjani
  0 siblings, 1 reply; 25+ messages in thread
From: Salvatore Dipietro @ 2026-04-28 15:02 UTC (permalink / raw)
  To: ritesh.list
  Cc: abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore,
	dipiets, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs,
	stable, willy

On 4/21/26 00:43, Ritesh Harjani wrote:
> Also, given that the maintainers (willy, Christoph, Dave) have shown their
> disinterest in taking the patch in its current form, the right way is
> to come back with performance data for both approaches (which we
> were discussing) and first get consensus from everyone before
> proposing this as a patch :).

Thank you for the follow-up and the additional context, Ritesh.
I might have misunderstood the previous request and will make sure to 
link back to previous patch versions in the future.
Here are the performance results that we have collected on our end with
the proposed patches:


| Patch                |    Run 1   |    Run 2   |    Run 3   |   Average   | % vs Baseline |
|----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
| Baseline             | 107,064.61 |  97,043.86 | 101,830.78 | 101,979.75  |       —       |
| Proposed patch       | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20  |    +38.45%    |
| Ritesh's suggestion  | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61  |    +36.50%    |
| Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82  |    +41.07%    |
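
The Average and "% vs Baseline" columns can be re-derived from the three
runs with a couple of helpers. This is just a sketch of the arithmetic;
the function names are hypothetical.

```c
#include <assert.h>

/* Arithmetic mean of three runs */
static inline double mean3(double a, double b, double c)
{
	return (a + b + c) / 3.0;
}

/* Relative change of avg against the baseline average, in percent */
static inline double pct_vs_baseline(double avg, double baseline_avg)
{
	assert(baseline_avg > 0.0);
	return (avg / baseline_avg - 1.0) * 100.0;
}
```

For instance, mean3(107064.61, 97043.86, 101830.78) reproduces the
baseline average, and pct_vs_baseline(143863.82, 101979.75) reproduces
the figure on the last row to two decimals.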


On 4/21/26 00:43, Ritesh Harjani wrote:
> In that context, I wanted to understand your setup a bit from
> a memory fragmentation perspective. Are you simulating memory
> fragmentation and then benchmarking? Or does this problem hit when
> you simply run the reproduction steps mentioned in your cover
> letter?

All results were collected on fresh AWS instances as described in the
cover letter. Patch [1] was applied on all instances to avoid the
other regression. Each instance was restarted to pick up the patched
kernel and ensure clean memory before installing and starting the
PostgreSQL benchmark via repro-collection [2].
We do not use any tool to fragment memory in advance. Collecting
memory metrics on this system, we noticed that ~40% of memory is used by
PageTables, since PostgreSQL spawns a new process for each client, which
significantly limits the memory left for caching and the free memory.

PostgreSQL's write pattern consists mostly of 8/16 KB writes, but during
database checkpoints (by default every 5 minutes) it flushes write-ahead
logs to disk, which uses large folios. At this point, the system attempts to
satisfy the folio allocation request and falls into the slow path, triggering
the regression, as shown by the Linux perf profile below:

  `-0.26%-__arm64_sys_pwrite64
    `-0.26%-vfs_write
      `-0.26%-xfs_file_write_iter
        `-0.26%-xfs_file_buffered_write
          `-0.26%-iomap_file_buffered_write
            `-0.26%-iomap_write_iter
              `-0.22%-iomap_write_begin
                `-0.22%-iomap_get_folio
                  `-0.22%-__filemap_get_folio
                    `-0.21%-filemap_alloc_folio->alloc_pages
                      `-0.20%-__alloc_pages_slowpath
                        |-0.12%-__alloc_pages_direct_compact
                        | `-0.12%-try_to_compact_pages
                        |   `-0.11%-compact_zone
                        |     `-0.11%-isolate_migratepages
                        `-0.07%-__drain_all_pages
                          `-0.07%-drain_pages_zone
                            `-0.07%-free_pcppages_bulk

This is also visible in the intermediate pgbench results, which drop
significantly during checkpoints:

# Normal execution:
[20260421.141505] [INFO] progress: 660.0 s, 138828.2 tps, lat 7.509 ms stddev 16.985, 0 failed
[20260421.141515] [INFO] progress: 670.0 s, 151505.1 tps, lat 6.708 ms stddev 8.308, 0 failed
[20260421.141525] [INFO] progress: 680.0 s, 166558.7 tps, lat 6.190 ms stddev 6.537, 0 failed
[20260421.141535] [INFO] progress: 690.0 s, 141267.1 tps, lat 7.246 ms stddev 5.951, 0 failed

# During checkpoints:	
[20260421.141605] [INFO] progress: 720.0 s, 54119.8 tps, lat 18.894 ms stddev 81.816, 0 failed
[20260421.141615] [INFO] progress: 730.0 s, 55184.7 tps, lat 18.564 ms stddev 12.729, 0 failed
[20260421.141625] [INFO] progress: 740.0 s, 37334.0 tps, lat 27.302 ms stddev 25.060, 0 failed
[20260421.141635] [INFO] progress: 750.0 s, 53387.6 tps, lat 19.259 ms stddev 18.313, 0 failed
[20260421.141645] [INFO] progress: 760.0 s, 41247.3 tps, lat 24.805 ms stddev 24.116, 0 failed
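
The size of the drop can be quantified by pulling the tps field out of
each progress line and averaging per phase. The following is a minimal
sketch with hypothetical helper names; it assumes the log format shown
above and skips lines that do not match it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 and fills *tps if line is a pgbench progress line, else 0. */
static inline int parse_progress_tps(const char *line, double *tps)
{
	double elapsed;
	const char *p;

	assert(line && tps);
	p = strstr(line, "progress: ");
	if (!p)
		return 0;
	return sscanf(p, "progress: %lf s, %lf tps", &elapsed, tps) == 2;
}

/* Mean tps over n log lines; lines that do not parse are skipped. */
static inline double mean_tps(const char *const *lines, size_t n)
{
	double sum = 0.0, tps;
	size_t i, used = 0;

	for (i = 0; i < n; i++) {
		if (parse_progress_tps(lines[i], &tps)) {
			sum += tps;
			used++;
		}
	}
	return used ? sum / (double)used : 0.0;
}
```

Applied to the samples above, the four normal lines average about
149.5K tps and the five checkpoint lines about 48.3K tps, roughly a
third of normal throughput.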


On 4/21/26 00:43, Ritesh Harjani wrote:
> BTW - I was following the other thread too, where the PREEMPT_LAZY problem
> was being discussed. From what I understood, you mentioned [1] that
> enabling THP on the system made that problem go away. It also looks like
> enabling THP is the right thing to do for this kind of workload. Does
> that mean enabling THP fixed this problem too? Do you still hit
> memory fragmentation and/or a similar throughput drop w/o this fix after
> you enable THP? It would be good to know those details too, please.

We have run more baseline benchmarks with the PostgreSQL huge_pages option
(on, off), pre-allocating the shared buffer memory with "vm.nr_hugepages"
(~25% of total memory, 2MB page size), and with the Transparent Huge Pages
(THP) options (always, madvise, never). PostgreSQL performance improves only
when the PostgreSQL huge_pages option is enabled with pre-allocated memory.
THP has no significant effect on PostgreSQL or system performance in this
case.

| PG huge_pages + pre-alloc mem | THP     |   Run 1 |   Run 2 |   Run 3 | Average |
|-------------------------------|---------|--------:|--------:|--------:|--------:|
| on                            | never   | 189,418 | 187,764 | 188,207 | 188,463 |
| on                            | always  | 188,813 | 189,798 | 190,032 | 189,548 |
| on                            | madvise | 187,405 | 192,234 | 189,201 | 189,613 |
| off                           | never   | 102,609 | 109,394 | 100,868 | 104,290 |
| off                           | always  |  90,274 | 103,831 | 102,515 |  98,874 |
| off                           | madvise |  90,508 | 103,855 |  96,574 |  96,979 |


[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368
[2] https://github.com/aws/repro-collection.git



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-28 15:02               ` Salvatore Dipietro
@ 2026-05-03  5:52                 ` Ritesh Harjani
  2026-05-03 11:55                   ` Matthew Wilcox
  0 siblings, 1 reply; 25+ messages in thread
From: Ritesh Harjani @ 2026-05-03  5:52 UTC (permalink / raw)
  To: Salvatore Dipietro, Matthew Wilcox, Andrew Morton, linux-mm,
	Vlastimil Babka
  Cc: abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore,
	dipiets, djwong, linux-fsdevel, linux-kernel, linux-xfs, stable,
	willy

Sorry about the delayed response, got caught up in some other work.

Salvatore Dipietro <dipiets@amazon.it> writes:

> On 4/21/26 00:43, Ritesh Harjani wrote:
>> Also, given that the maintainers (willy, Christoph, Dave) have shown their
>> disinterest in taking the patch in its current form, the right way is
>> to come back with performance data for both approaches (which we
>> were discussing) and first get consensus from everyone before
>> proposing this as a patch :).
>
> Thank you for the follow-up and the additional context, Ritesh.
> I might have misunderstood the previous request and will make sure to 
> link back to previous patch versions in the future.
> Here are the performance results that we have collected on our end with
> the proposed patches:
>
>
> | Patch                |    Run 1   |    Run 2   |    Run 3   |   Average   | % vs Baseline |
> |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
> | Baseline             | 107,064.61 |  97,043.86 | 101,830.78 | 101,979.75  |       —       |

> | PG huge_pages + pre-alloc mem | THP     |   Run 1 |   Run 2 |   Run 3 | Average |
> |-------------------------------|---------|--------:|--------:|--------:|--------:|
> | on                            | never   | 189,418 | 187,764 | 188,207 | 188,463 |
>

First of all thanks for sharing the detailed performance numbers.

Ok, so here is what I understood from the data you shared.
This performance problem is mostly seen with PostgreSQL huge_pages=off
[1][2] i.e.

baseline-no-patches                 ~104K
v/s
baseline-no-patches+huge_pages=on   ~188K

Also, the observation with huge_pages=off is that ~40% of memory is used
as page table memory (as you pointed out below).

> We do not use any tool to fragment the memory in advance. Collecting 
> memory metric of this system, we noticed that ~40% of memory is used by
> PageTables since PostgreSQL spawns a new process for each client limiting
> significantly the available caching and free memory.
>

So there must be 2 things going on with the huge_pages=on option here:

1. Huge pages use PMD-size mappings, which eliminates the need for PTE
tables entirely. This in turn reduces the amount of memory consumed by
page tables. Without huge pages, the page table overhead becomes
significant (~40% of DRAM), because on fork each child process gets its
own copy of the PTE tables (even though the underlying shared memory
pages remain the same).

2. The second saving might come from the fact that Linux supports
CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING with hugetlb. With this, the
PMD table pages themselves are shared among processes.

[1]:
https://www.postgresql.org/docs/current/runtime-config-resource.html
[2]:
https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-HUGE-PAGES

So the above explains the 40% of memory used up by page tables.

Now this is what I believe could be the reason for memory fragmentation
with this workload:
In Linux, each PTE page table is 4KB in size (assuming you are using a
4KB system PAGE_SIZE). When your workload forks a child process for each
new connection, the child gets its own copy of the page tables which map
the shared buffer.
Since each PTE table is a single 4KB page, hundreds of connections
spawning means hundreds of thousands of single-page allocations for page
tables. So it looks like the major source of your memory fragmentation
problem must be these many order-0 allocations for PTE page table
pages.
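
[Editorial sketch, not part of the original message.] The page-table
arithmetic above can be made concrete with a rough C model. It assumes
4 KiB base pages and 8-byte PTEs, so one PTE table page maps 2 MiB; the
function names and the sizes in the example are hypothetical and are not
taken from the report.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumed base page size */
#define SKETCH_PTE_SIZE  8ULL    /* assumed size of one PTE */

/* Bytes of address space covered by a single PTE table page. */
static uint64_t bytes_mapped_per_pte_table(void)
{
	return (SKETCH_PAGE_SIZE / SKETCH_PTE_SIZE) * SKETCH_PAGE_SIZE;
}

/*
 * Upper-bound estimate of PTE-table memory when nr_procs processes each
 * map, and fully touch, the same region of region_bytes using base pages.
 */
static uint64_t pte_table_bytes(uint64_t region_bytes, uint64_t nr_procs)
{
	uint64_t span = bytes_mapped_per_pte_table();
	uint64_t tables;

	/* Guard the round-up below against overflow. */
	assert(region_bytes <= UINT64_MAX - span);
	tables = (region_bytes + span - 1) / span; /* round up */

	return tables * SKETCH_PAGE_SIZE * nr_procs;
}
```

Under these assumptions, pte_table_bytes(64ULL << 30, 1024) evaluates to
128 GiB: a hypothetical 64 GiB shared region costs 128 MiB of PTE tables
per process, multiplied by 1024 processes. This is an upper bound that
only applies if every process touches the whole region.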

Also as per the documentation [1], huge_pages=try option is the default
setting. So I am assuming in production we at least won't suffer from
this memory fragmentation, correct?

[1]: https://www.postgresql.org/docs/current/runtime-config-resource.html

> PostgreSQL write pattern consists mostly of 8/16 KB data but during 
> the database checkpoints, by default every 5 minutes, it flushes write-ahead
> logs to disk, which uses large folios. At this point, the system attempts to
> satisfy the folio allocation request, triggering the regression and falling
> into the slow path, as shown by the Linux perf profile below:
>
>   `-0.26%-__arm64_sys_pwrite64
>     `-0.26%-vfs_write
>       `-0.26%-xfs_file_write_iter
>         `-0.26%-xfs_file_buffered_write
>           `-0.26%-iomap_file_buffered_write
>             `-0.26%-iomap_write_iter
>               `-0.22%-iomap_write_begin
>                 `-0.22%-iomap_get_folio
>                   `-0.22%-__filemap_get_folio
>                     `-0.21%-filemap_alloc_folio->alloc_pages
>                       `-0.20%-__alloc_pages_slowpath
>                         |-0.12%-__alloc_pages_direct_compact
>                         | `-0.12%-try_to_compact_pages
>                         |   `-0.11%-compact_zone
>                         |     `-0.11%-isolate_migratepages
>                         `-0.07%-__drain_all_pages
>                           `-0.07%-drain_pages_zone
>                             `-0.07%-free_pcppages_bulk
>

However, I agree that it still makes sense to look into possible
solutions to address the performance gap you pointed out when the system
has memory fragmentation (with huge_pages=off).

> | Patch                |    Run 1   |    Run 2   |    Run 3   |   Average   | % vs Baseline |
> |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
> | Baseline             | 107,064.61 |  97,043.86 | 101,830.78 | 101,979.75  |       —       |
> | Proposed patch       | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20  |    +38.45%    |
> | Ritesh's suggestion  | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61  |    +36.50%    |
> | Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82  |    +41.07%    |

The main reason I proposed the patch below is that it only affects
costly-order allocations (i.e. order > PAGE_ALLOC_COSTLY_ORDER) by
skipping direct reclaim for those orders, while keeping the behaviour
the same for the others.

So, for smaller orders (order > min_order and <=
PAGE_ALLOC_COSTLY_ORDER), the allocator will still attempt direct
reclaim and compaction (which I guess is required to avoid oom too?). And
also, this looks like a change which could be easily backported :)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..f2343c26dd63 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
 			gfp_t alloc_gfp = gfp;

 			err = -ENOMEM;
-			if (order > min_order)
-				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
+			if (order > min_order) {
+				alloc_gfp |= __GFP_NOWARN;
+				if (order > PAGE_ALLOC_COSTLY_ORDER)
+					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
+				else
+					alloc_gfp |= __GFP_NORETRY;
+			}
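
[Editorial sketch, not part of the original message.] The flag selection
in the diff above can be modelled in standalone C as follows. The SK_*
flag bits are illustrative stand-ins rather than the kernel's real
values, and sketch_alloc_gfp() is a hypothetical name; only the
branching mirrors the diff.

```c
#include <assert.h>

/* Stand-in flag bits: the values are illustrative, not the kernel's. */
#define SK_GFP_DIRECT_RECLAIM 0x1u
#define SK_GFP_NORETRY        0x2u
#define SK_GFP_NOWARN         0x4u
#define SK_COSTLY_ORDER       3u /* PAGE_ALLOC_COSTLY_ORDER in mainline */

/* Return the flags used for one allocation attempt at the given order. */
static unsigned int sketch_alloc_gfp(unsigned int gfp, unsigned int order,
				     unsigned int min_order)
{
	unsigned int alloc_gfp = gfp;

	/* The caller walks orders downwards and never goes below min_order. */
	assert(order >= min_order);

	if (order > min_order) {
		alloc_gfp |= SK_GFP_NOWARN;
		if (order > SK_COSTLY_ORDER)
			/* Costly order: opportunistic, no direct reclaim. */
			alloc_gfp &= ~SK_GFP_DIRECT_RECLAIM;
		else
			/* Cheaper order: reclaim allowed, but give up early. */
			alloc_gfp |= SK_GFP_NORETRY;
	}
	return alloc_gfp;
}
```

In this model an order-6 attempt with min_order 0 loses the
direct-reclaim bit, an order-2 attempt keeps it and gains the no-retry
bit, and an attempt at min_order itself is left unchanged.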

But of course let's hear from others on their suggestions / thoughts.
Maybe filemap is not the right place to fix this, as Matthew, Andrew
and others were pointing out. Any other suggestions on how to approach
this, please?

-ritesh

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-05-03  5:52                 ` Ritesh Harjani
@ 2026-05-03 11:55                   ` Matthew Wilcox
  2026-05-06 12:33                     ` Salvatore Dipietro
  0 siblings, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2026-05-03 11:55 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Salvatore Dipietro, Andrew Morton, linux-mm, Vlastimil Babka,
	abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore, djwong,
	linux-fsdevel, linux-kernel, linux-xfs, stable

On Sun, May 03, 2026 at 11:22:10AM +0530, Ritesh Harjani wrote:
> Now this is what I believe could be the reason for memory fragmentation
> with this workload - 
> In Linux, each PTE page table uses 4KB size (assuming you are using 4KB
> system PAGE_SIZE). When your workload forks a
> child process for each new connection, child gets its own copy of the
> page tables which maps the shared buffer.
> Since each PTE table is a single 4KB page, hundreds of connections
> spawning means hundreds of thousands of single-page allocations for page
> tables. So it looks like, the major source of your memory fragmentation
> problem must be these several order-0 allocations for PTE page table
> pages.

While memory is fragmented, the _problem_ is that we try too hard to
defragment.  From the original post:

: When memory is fragmented, each failed allocation triggers
: compaction and drain_all_pages() via __alloc_pages_slowpath()

We really should only try compaction once.  If it didn't make useful
progress last time, it won't this time either.

> > | Patch                |    Run 1   |    Run 2   |    Run 3   |   Average   | % vs Baseline |
> > |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
> > | Baseline             | 107,064.61 |  97,043.86 | 101,830.78 | 101,979.75  |       —       |
> > | Proposed patch       | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20  |    +38.45%    |
> > | Ritesh's suggestion  | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61  |    +36.50%    |
> > | Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82  |    +41.07%    |
> 
> 
> The main reason, why I proposed the below patch was because, this only
> affects costly order allocation (i.e for order > PAGE_ALLOC_COSTLY_ORDER)
> by skipping direct reclaim for those orders, while still keeping the
> behaviour same for others.
> 
> So, for smaller orders (order > min_order and <=
> PAGE_ALLOC_COSTLY_ORDER), the allocator will still attempt for direct
> reclaim and compaction (which I guess is required to avoid oom too?) And
> also, this looks like a change which could be easily backportable :)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4e636647100c..f2343c26dd63 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
>  			gfp_t alloc_gfp = gfp;
>  
>  			err = -ENOMEM;
> -			if (order > min_order)
> -				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> +			if (order > min_order) {
> +				alloc_gfp |= __GFP_NOWARN;
> +				if (order > PAGE_ALLOC_COSTLY_ORDER)
> +					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> +				else
> +					alloc_gfp |= __GFP_NORETRY;
> +			}
> 
> 
> But of course let's hear from others on their suggestions / thoughts.
> Maybe the filemap is not the right place to fix this as Matthew, Andrew
> and others were pointing. Any other suggestions on how to approach this,
> please?

filemap.c REALLY shouldn't know about PAGE_ALLOC_COSTLY_ORDER.
That's an internal detail of the memory allocator.

Either we want an API to say "allocate me a folio between orders A and B"
or we need more understandable GFP flags.  Or the page allocator could
use the __GFP_NORETRY flag to say "oh well, this allocation has a fallback,
I'll kick kcompactd to try to compact some more memory, but I'll fail
the allocation".

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-05-03 11:55                   ` Matthew Wilcox
@ 2026-05-06 12:33                     ` Salvatore Dipietro
  2026-05-27 16:24                       ` Salvatore Dipietro
  0 siblings, 1 reply; 25+ messages in thread
From: Salvatore Dipietro @ 2026-05-06 12:33 UTC (permalink / raw)
  To: willy
  Cc: abuehaze, akpm, alisaidi, blakgeof, brauner, dipietro.salvatore,
	dipiets, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs,
	ritesh.list, stable, vbabka


On 5/03/26 05:52, Ritesh Harjani wrote:
> Also as per the documentation [1], huge_pages=try option is the default
> setting. So I am assuming in production we at least won't suffer from
> this memory fragmentation, correct?

Yes, huge_pages=try is the default option, but without pre-allocating the
entire shared_buffers size in memory via "vm.nr_hugepages" (which is not
done automatically), huge pages will not be used and the system falls into
the huge_pages=off category. Even with a partial pre-allocation, PostgreSQL
will not be able to use huge pages.


On 5/03/26 11:55, Matthew Wilcox wrote:
> or we need more understandable GFP flags.  Or the page allocator could
> use the __GFP_NORETRY flag to say "oh well, this allocation has a fallback,
> I'll kick kcompactd to try to compact some more memory, but I'll fail
> the allocation".

We also tested kicking off kcompactd in the background when __GFP_NORETRY is
passed, returning "nopage" to avoid blocking the folio allocation request.
Here is the patch, tested like the others with the PREEMPT_NONE patch [1]
applied:


diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65e205111553..d4f322910992 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4818,6 +4818,26 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (current->flags & PF_MEMALLOC)
 		goto nopage;
 
+	/*
+	 * Costly allocations with __GFP_NORETRY are opportunistic - Don't
+	 * stall on direct compaction or reclaim; instead, kick
+	 * kcompactd on the preferred node so large pages may become
+	 * available for future allocations and let the caller fall back now.
+	 *
+	 * Direct compaction is way too costly for hot allocation paths on
+	 * large systems: each attempt calls drain_all_pages() which IPIs
+	 * every CPU.  Only wake kcompactd on the local node to avoid
+	 * cross-NUMA interference with unrelated workloads.
+	 */
+	if (costly_order && (gfp_mask & __GFP_NORETRY)) {
+		struct zone *preferred_zone = ac->preferred_zoneref->zone;
+
+		if (preferred_zone)
+			wakeup_kcompactd(preferred_zone->zone_pgdat, order,
+					 ac->highest_zoneidx);
+		goto nopage;
+	}
+
 	/* Try direct reclaim and then allocating */
 	if (!compact_first) {
 		page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags,



Here are the results we collected (kcompactd background):

| Patch                |    Run 1   |    Run 2   |    Run 3   |   Average   | % vs Baseline |
|----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
| Baseline             | 107,064.61 |  97,043.86 | 101,830.78 | 101,979.75  |       —       |
| Proposed patch       | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20  |    +38.45%    |
| Ritesh's suggestion  | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61  |    +36.50%    |
| Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82  |    +41.07%    |
| kcompactd background | 146,760.75 | 128,094.92 | 127,979.74 | 134,278.47  |    +31.67%    |
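
[Editorial sketch, not part of the original message.] The "Average" and
"% vs Baseline" columns above can be recomputed from the per-run figures
with a few lines of C. The helper names are hypothetical; the inputs are
the Run 1 to Run 3 values from the table.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n throughput samples (TPS). */
static double mean_tps(const double *runs, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += runs[i];
	return sum / (double)n;
}

/* Relative change of avg against baseline, in percent. */
static double pct_vs_baseline(double avg, double baseline)
{
	return (avg / baseline - 1.0) * 100.0;
}
```

Feeding the baseline runs (107064.61, 97043.86, 101830.78) and the
kcompactd runs (146760.75, 128094.92, 127979.74) through these helpers
gives roughly 101,979.75 and 134,278.47 TPS, a gain of about 31.67%,
which matches the table.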

  
[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368








^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-05-06 12:33                     ` Salvatore Dipietro
@ 2026-05-27 16:24                       ` Salvatore Dipietro
  2026-05-31 23:29                         ` Karim Manaouil
  2026-06-24  8:06                         ` Salvatore Dipietro
  0 siblings, 2 replies; 25+ messages in thread
From: Salvatore Dipietro @ 2026-05-27 16:24 UTC (permalink / raw)
  To: dipiets
  Cc: abuehaze, akpm, alisaidi, blakgeof, brauner, dipietro.salvatore,
	djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs,
	ritesh.list, stable, vbabka, willy

Thanks Ritesh and Matthew for the continued feedback and guidance on this thread.
I'd like to summarize where we stand and ask for your input on the best path forward.

Summary of approaches tested:
We've now benchmarked all proposed variations (pgbench simple-update, 1024 clients, 
96-vCPU arm64, huge_pages=off, PREEMPT_NONE applied [1]):

| Patch                          | Change Location       | Avg TPS    | % vs Baseline |
|--------------------------------|-----------------------|-----------:|:-------------:|
| Baseline (no patch)            | —                     | 101,979.75 |       —       |
| v1 (original, iomap caller)    | fs/iomap/buffered-io.c| 141,194.20 |    +38.45%    |
| Ritesh's suggestion            | mm/filemap.c          | 139,200.61 |    +36.50%    |
| Matthew's suggestion           | mm/filemap.c          | 143,863.82 |    +41.07%    |
| kcompactd background           | mm/page_alloc.c       | 134,278.47 |    +31.67%    |

All approaches recover significant throughput. The kcompactd approach (background 
compaction and returning nopage for costly orders with __GFP_NORETRY) aligns with the
architectural direction Dave and Christoph proposed, keeping compaction out of the direct 
reclaim path, and lives entirely in the page allocator. 

Based on the discussion, I see two possible directions and would appreciate your guidance:

1. Page allocator fix (mm/page_alloc.c): The kcompactd background approach addresses 
Matthew's concern that filemap.c shouldn't know about PAGE_ALLOC_COSTLY_ORDER, and aligns 
with Dave's vision of removing compaction from the direct reclaim path.

2. filemap fix (mm/filemap.c): Both Ritesh's and Matthew's suggestions are minimal, 
backportable, and preserve lightweight reclaim for non-costly orders. 
Ritesh's variant differentiates between costly and non-costly orders, while Matthew's 
is simpler and performs best.

Would either of these directions be acceptable for a v3, or would you prefer a different approach?

I'm happy to test any additional variations or directions to move this forward.

Salvatore

[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-05-27 16:24                       ` Salvatore Dipietro
@ 2026-05-31 23:29                         ` Karim Manaouil
  2026-06-05 10:58                           ` Salvatore Dipietro
  2026-06-24  8:06                         ` Salvatore Dipietro
  1 sibling, 1 reply; 25+ messages in thread
From: Karim Manaouil @ 2026-05-31 23:29 UTC (permalink / raw)
  To: Salvatore Dipietro
  Cc: abuehaze, akpm, alisaidi, blakgeof, brauner, dipietro.salvatore,
	djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs,
	ritesh.list, stable, vbabka, willy

On Wed, May 27, 2026 at 04:24:10PM +0000, Salvatore Dipietro wrote:
> 
> Thanks Ritesh and Matthew for the continued feedback and guidance on this thread.
> I'd like to summarize where we stand and ask for your input on the best path forward.
> 
> Summary of approaches tested:
> We've now benchmarked all proposed variations (pgbench simple-update, 1024 clients, 
> 96-vCPU arm64, huge_pages=off, PREEMPT_NONE applied [1]):
> 
> | Patch                          | Change Location       | Avg TPS    | % vs Baseline |
> |--------------------------------|-----------------------|-----------:|:-------------:|
> | Baseline (no patch)            | —                     | 101,979.75 |       —       |
> | v1 (original, iomap caller)    | fs/iomap/buffered-io.c| 141,194.20 |    +38.45%    |
> | Ritesh's suggestion            | mm/filemap.c          | 139,200.61 |    +36.50%    |
> | Matthew's suggestion           | mm/filemap.c          | 143,863.82 |    +41.07%    |
> | kcompactd background           | mm/page_alloc.c       | 134,278.47 |    +31.67%    |
> 
> 
> All approaches recover significant throughput. The kcompactd approach (background 
> compaction and returning nopage for costly orders with __GFP_NORETRY) aligns with the
> architectural direction Dave and Christoph proposed, keeping compaction out of the direct 
> reclaim path, and lives entirely in the page allocator. 
> 
> Based on the discussion, I see two possible directions and would appreciate your guidance:
> 
> 1. Page allocator fix (mm/page_alloc.c): The kcompactd background approach addresses 
> Matthew's concern that filemap.c shouldn't know about PAGE_ALLOC_COSTLY_ORDER, and aligns 
> with Dave's vision of removing compaction from the direct reclaim path.
> 
> 2. filemap fix (mm/filemap.c): Both Ritesh's and Matthew's suggestions are minimal, 
> backportable, and preserve lightweight reclaim for non-costly orders. 
> Ritesh's variant differentiates between costly and non-costly orders, while Matthew's 
> is simpler and performs best.

I am not very familiar with THPs in the page cache, but for anonymous
memory, we have /sys/kernel/mm/transparent_hugepage/defrag which
decides what to do in the event of a THP allocation failure: whether to
enter synchronous compaction or wake up kcompactd.

Check vma_thp_gfp_mask(). Maybe you should adopt something similar called
file_thp_gfp_mask().

The problem with fallback is that your application is never going to get
a THP and eventually TLB pressure might actually end up slowing you
down in the long run.

Also, compaction is only really tried if it makes sense, that is, if
enough free memory is available to actually perform the compaction and
have a chance of creating a large enough huge page. So compaction is
actually never performed under acute memory pressure, which means your
system actually has enough free pages, but somehow the compaction is
slow and inefficient.

I am just trying to think out loud here and address the root cause. The
real problem here is fragmentation due to unmovable pages, probably in
your case the page tables. We should work more on reducing pageblock
type mixing. Also, page tables can actually be made movable so that
compaction can treat them as movable pages.


> 
> Would either of these directions be acceptable for a v3, or would you prefer a different approach?
> 
> I'm happy to test any additional variations or direction to move this forward
> 
> Salvatore
> 
> 
> [1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368
> 
> 
> 
> 

-- 
~karim

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-05-31 23:29                         ` Karim Manaouil
@ 2026-06-05 10:58                           ` Salvatore Dipietro
  0 siblings, 0 replies; 25+ messages in thread
From: Salvatore Dipietro @ 2026-06-05 10:58 UTC (permalink / raw)
  To: kmanaouil.dev
  Cc: abuehaze, akpm, alisaidi, blakgeof, brauner, dipietro.salvatore,
	dipiets, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs,
	ritesh.list, stable, vbabka, willy

On Sun, May 31, 2026 at 11:29:00PM +0000, Karim Manaouil wrote:
> I am not very familiar with THPs in the page cache, but for anonymous
> memory, we have /sys/kernel/mm/transparent_hugepages/defrag which
> decides what to do in the event of a THP allocation failure, whether to
> enter a synchronous compaction or wake up kcompactd.

Thanks Karim for the suggestions. To clarify, THPs are not used here and
have no effect on performance for this workload, as reported in [1].
The failing allocations are for high-order file-backed folios in the
iomap buffered write path.

> I am just trying to think loudly here and address the root cause. The
> real problem here is fragmentation due to unmovable pages, probably in
> your case the page tables. We should work more on reducing pageblock
> type mixing. Also page tables can actually be made movable so that
> compaction can treat them as movable pages.

I agree that making PTEs movable could potentially resolve the
fragmentation at its root, since page table pages are indeed the primary
source of unmovable fragmentation in this workload. However, making page 
tables movable has much broader implications.

[1] https://lore.kernel.org/all/20260428150240.3009-1-dipiets@amazon.it/

--
Salvatore


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-05-27 16:24                       ` Salvatore Dipietro
  2026-05-31 23:29                         ` Karim Manaouil
@ 2026-06-24  8:06                         ` Salvatore Dipietro
  2026-06-24 12:21                           ` Ritesh Harjani
  2026-06-24 13:34                           ` Christoph Hellwig
  1 sibling, 2 replies; 25+ messages in thread
From: Salvatore Dipietro @ 2026-06-24  8:06 UTC (permalink / raw)
  To: ritesh.list, willy
  Cc: dipiets, abuehaze, akpm, alisaidi, blakgeof, brauner,
	dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm,
	linux-xfs, stable, vbabka

Hi Ritesh, Matthew,

I wanted to kindly follow up on my summary from May 27th regarding the best path 
forward for this patch.

To recap, we benchmarked all proposed variations and shared the results:

| Patch                          | Change Location        | Avg TPS    | % vs Baseline |
|--------------------------------|------------------------|------------|:-------------:|
| Baseline (no patch)            | —                      | 101,979.75 |       —       |
| v1 (original, iomap caller)    | fs/iomap/buffered-io.c | 141,194.20 |    +38.45%    |
| Ritesh's suggestion            | mm/filemap.c           | 139,200.61 |    +36.50%    |
| Matthew's suggestion           | mm/filemap.c           | 143,863.82 |    +41.07%    |
| kcompactd background           | mm/page_alloc.c        | 134,278.47 |    +31.67%    |

I'd really appreciate any guidance on which direction would be acceptable for a v3 — 
whether that's the page allocator approach (kcompactd background), one of the filemap.c
fixes, or something else entirely.

I'm happy to test any additional variations or directions to move this forward.

--
Salvatore


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-06-24  8:06                         ` Salvatore Dipietro
@ 2026-06-24 12:21                           ` Ritesh Harjani
  2026-06-24 13:34                           ` Christoph Hellwig
  1 sibling, 0 replies; 25+ messages in thread
From: Ritesh Harjani @ 2026-06-24 12:21 UTC (permalink / raw)
  To: Salvatore Dipietro, willy
  Cc: dipiets, abuehaze, akpm, alisaidi, blakgeof, brauner,
	dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm,
	linux-xfs, stable, vbabka, David Hildenbrand, Lorenzo Stoakes,
	Andrew Morton, Mike Rapoport, Michal Hocko, Vlastimil Babka

Salvatore Dipietro <dipiets@amazon.it> writes:

> Hi Ritesh, Matthew,
>
> I wanted to kindly follow up on my summary from May 27th regarding the best path 
> forward for this patch.
>

Hi Salvatore,

Sorry about the delay. I did bring this topic up in one of our internal
ext4 community calls. And to share some context, MM community thinks we
need a better long term fix for this problem rather than patching call
sites and/or playing tricks like - 

diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..f2343c26dd63 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
 			gfp_t alloc_gfp = gfp;

 			err = -ENOMEM;
-			if (order > min_order)
-				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
+			if (order > min_order) {
+				alloc_gfp |= __GFP_NOWARN;
+				if (order > PAGE_ALLOC_COSTLY_ORDER)
+					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
+				else
+					alloc_gfp |= __GFP_NORETRY;
+			}

Unfortunately, most folks may not have free cycles to work on this
problem right now :( - hence the delay in addressing this.

However - I would like to bring this problem to other MM community
members as well who might have an interest in this space. Can we look
into the proposed solutions from Salvatore and suggest the next steps
please? 

Maybe if someone can share what the MM community is looking for here, I
guess that will be a good start. Looking at the table, I think
Salvatore had also shared a diff for kicking kcompactd in the background
[2].

[2]: https://lore.kernel.org/all/20260506123326.17293-1-dipiets@amazon.it/

(Sorry, I still have a few other things on my plate before I start
looking into this more actively. But let's hear from others, who have
better knowledge than me on this.)

> To recap, we benchmarked all proposed variations and shared the results:
>
> | Patch                          | Change Location        | Avg TPS    | % vs Baseline |
> |--------------------------------|------------------------|------------|:-------------:|
> | Baseline (no patch)            | —                      | 101,979.75 |       —       |
> | v1 (original, iomap caller)    | fs/iomap/buffered-io.c | 141,194.20 |    +38.45%    |
> | Ritesh's suggestion            | mm/filemap.c           | 139,200.61 |    +36.50%    |
> | Matthew's suggestion           | mm/filemap.c           | 143,863.82 |    +41.07%    |
> | kcompactd background           | mm/page_alloc.c        | 134,278.47 |    +31.67%    |
>

-ritesh

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-06-24  8:06                         ` Salvatore Dipietro
  2026-06-24 12:21                           ` Ritesh Harjani
@ 2026-06-24 13:34                           ` Christoph Hellwig
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2026-06-24 13:34 UTC (permalink / raw)
  To: Salvatore Dipietro
  Cc: ritesh.list, willy, abuehaze, akpm, alisaidi, blakgeof, brauner,
	dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm,
	linux-xfs, stable, vbabka

On Wed, Jun 24, 2026 at 08:06:36AM +0000, Salvatore Dipietro wrote:
> 
> Hi Ritesh, Matthew,
> 
> I wanted to kindly follow up on my summary from May 27th regarding the best path 
> forward for this patch.
> 
> To recap, we benchmarked all proposed variations and shared the results:
> 
> | Patch                          | Change Location        | Avg TPS    | % vs Baseline |
> |--------------------------------|------------------------|------------|:-------------:|
> | Baseline (no patch)            | —                      | 101,979.75 |       —       |
> | v1 (original, iomap caller)    | fs/iomap/buffered-io.c | 141,194.20 |    +38.45%    |
> | Ritesh's suggestion            | mm/filemap.c           | 139,200.61 |    +36.50%    |
> | Matthew's suggestion           | mm/filemap.c           | 143,863.82 |    +41.07%    |
> | kcompactd background           | mm/page_alloc.c        | 134,278.47 |    +31.67%    |
> 
> I'd really appreciate any guidance on which direction would be acceptable for a v3 — 
> whether that's the page allocator approach (kcompactd background), one of the filemap.c
> fixes, or something else entirely.

Do you have pointers to the patches for each approach above?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
  2026-04-04  1:13   ` Ritesh Harjani
  2026-04-04  4:15   ` Matthew Wilcox
@ 2026-04-05 22:43   ` Dave Chinner
  2026-04-07  5:40     ` Christoph Hellwig
  2026-04-21  9:02     ` Vlastimil Babka
  2 siblings, 2 replies; 25+ messages in thread
From: Dave Chinner @ 2026-04-05 22:43 UTC (permalink / raw)
  To: Salvatore Dipietro
  Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
	willy, stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel

On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers
> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with 
> 1024 clients on a 96-vCPU arm64 system.
> 
> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> making them purely opportunistic.
> 
> Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> Cc: stable@vger.kernel.org
> Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
> ---
>  fs/iomap/buffered-io.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 92a831cf4bf1..cb843d54b4d9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  {
>  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
> +	gfp_t gfp;
>  
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
> @@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  		fgp |= FGP_DONTCACHE;
>  	fgp |= fgf_set_order(len);
>  
> +	gfp = mapping_gfp_mask(iter->inode->i_mapping);
> +
> +	/*
> +	 * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
> +	 * strip __GFP_DIRECT_RECLAIM to make the allocation purely
> +	 * opportunistic.  This avoids compaction + drain_all_pages()
> +	 * in __alloc_pages_slowpath() that devastate throughput
> +	 * on large systems during buffered writes.
> +	 */
> +	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
> +		gfp &= ~__GFP_DIRECT_RECLAIM;

Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere
we need to do high order folio allocation is getting out of hand.

Compaction improves long term system performance, so we don't really
just want to turn it off whenever we have demand for high order
folios.

What we should be doing is getting compaction out of the direct
reclaim path - it is -clearly- way too costly for hot paths that use
large allocations, especially those with fallbacks to smaller
allocations or vmalloc.

Instead, memory reclaim should kick background compaction and let it
do the work. If the allocation path really, really needs high order
allocation to succeed, then it can direct the allocation to retry
until it succeeds and the allocator itself can wait for background
compaction to make progress.

For code that has fallbacks to smaller allocations, there is no
need to wait for compaction - we can attempt fast smaller allocations
and continue that way until an allocation succeeds....
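
As a rough illustration of the fallback behaviour described above, here
is a small userspace C model, not kernel code. Everything named model_*
is hypothetical: the allocator is reduced to a per-order count of free
blocks (no buddy splitting), and a failed high-order attempt only bumps
a counter that stands in for a background compaction request. The
caller steps down to smaller orders and never waits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ORDER 8	/* hypothetical limit for this model only */

/* Hypothetical stand-in for a zone: free block counts per order. */
struct model_zone {
	unsigned int free_blocks[MODEL_MAX_ORDER + 1];
	unsigned int compaction_kicks;	/* background compaction requests */
};

/* Fast-path attempt only: succeeds iff a block of this exact order is free. */
static bool model_try_alloc(struct model_zone *z, unsigned int order)
{
	assert(z != NULL);
	if (order > MODEL_MAX_ORDER || z->free_blocks[order] == 0)
		return false;
	z->free_blocks[order]--;
	return true;
}

/*
 * Try the wanted order first, then fall back to smaller orders without
 * waiting. Each failed attempt above order 0 records a request for
 * background compaction instead of compacting in the caller's context.
 * Returns the order that was obtained, or -1 if nothing was free.
 */
static int model_alloc_with_fallback(struct model_zone *z,
				     unsigned int want_order)
{
	assert(z != NULL);
	if (want_order > MODEL_MAX_ORDER)
		want_order = MODEL_MAX_ORDER;

	for (int order = (int)want_order; order >= 0; order--) {
		if (model_try_alloc(z, (unsigned int)order))
			return order;
		if (order > 0)
			z->compaction_kicks++;
	}
	return -1;
}
```

In this model the cost of a failed high-order attempt is a counter
increment, which is the property the real allocator would need to
approximate for callers that can fall back.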

-Dave.
-- 
Dave Chinner
dgc@kernel.org

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-05 22:43   ` Dave Chinner
@ 2026-04-07  5:40     ` Christoph Hellwig
  2026-04-21  9:02     ` Vlastimil Babka
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2026-04-07  5:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
	dipietro.salvatore, willy, stable, Christian Brauner,
	Darrick J. Wong, linux-xfs, linux-fsdevel

On Mon, Apr 06, 2026 at 08:43:57AM +1000, Dave Chinner wrote:
> > +	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
> > +		gfp &= ~__GFP_DIRECT_RECLAIM;
> 
> Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere
> we need to do high order folio allocation is getting out of hand.

That's what I thought.

> Compaction improves long term system performance, so we don't really
> just want to turn it off whenever we have demand for high order
> folios.

Yes.  Also if we want to make block size > PAGE_SIZE a real option,
just giving up on allocating large folios is not an option.

> Instead, memory reclaim should kick background compaction and let it
> do the work. If the allocation path really, really needs high order
> allocation to succeed, then it can direct the allocation to retry
> until it succeeds and the allocator itself can wait for background
> compaction to make progress.
> 
> For code that has fallbacks to smaller allocations, there is no
> need to wait for compaction - we can attempt fast smaller allocations
> and continue that way until an allocation succeeds....

Yes.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-05 22:43   ` Dave Chinner
  2026-04-07  5:40     ` Christoph Hellwig
@ 2026-04-21  9:02     ` Vlastimil Babka
  1 sibling, 0 replies; 25+ messages in thread
From: Vlastimil Babka @ 2026-04-21  9:02 UTC (permalink / raw)
  To: Dave Chinner, Salvatore Dipietro
  Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
	willy, stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel, Ritesh Harjani (IBM), Christoph Hellwig,
	linux-mm@kvack.org, Michal Hocko, David Hildenbrand (Red Hat),
	Johannes Weiner

On 4/6/26 00:43, Dave Chinner wrote:
> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> introduced high-order folio allocations in the buffered write
>> path. When memory is fragmented, each failed allocation triggers
>> compaction and drain_all_pages() via __alloc_pages_slowpath(),
>> causing a 0.75x throughput drop on pgbench (simple-update) with 
>> 1024 clients on a 96-vCPU arm64 system.
>> 
>> Strip __GFP_DIRECT_RECLAIM from folio allocations in
>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
>> making them purely opportunistic.
>> 
>> Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>

BTW, backporting perf regression fixes to 6.6, when they are only reported
at the time 7.0 is released, might be too risky. There will likely be a
different workload that will regress as a result, no matter what we do.

>> ---
>>  fs/iomap/buffered-io.c | 15 ++++++++++++++-
>>  1 file changed, 14 insertions(+), 1 deletion(-)
>> 
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 92a831cf4bf1..cb843d54b4d9 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>>  {
>>  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
>> +	gfp_t gfp;
>>  
>>  	if (iter->flags & IOMAP_NOWAIT)
>>  		fgp |= FGP_NOWAIT;
>> @@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>>  		fgp |= FGP_DONTCACHE;
>>  	fgp |= fgf_set_order(len);
>>  
>> +	gfp = mapping_gfp_mask(iter->inode->i_mapping);
>> +
>> +	/*
>> +	 * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
>> +	 * strip __GFP_DIRECT_RECLAIM to make the allocation purely
>> +	 * opportunistic.  This avoids compaction + drain_all_pages()
>> +	 * in __alloc_pages_slowpath() that devastate throughput
>> +	 * on large systems during buffered writes.
>> +	 */
>> +	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
>> +		gfp &= ~__GFP_DIRECT_RECLAIM;
> 
> Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere
> we need to do high order folio allocation is getting out of hand.
> 
> Compaction improves long term system performance, so we don't really
> just want to turn it off whenever we have demand for high order
> folios.
> 
> What we should be doing is getting compaction out of the direct
> reclaim path - it is -clearly- way too costly for hot paths that use
> large allocations, especially those with fallbacks to smaller
> allocations or vmalloc.
> 
> Instead, memory reclaim should kick background compaction and let it
> do the work. If the allocation path really, really needs high order
> allocation to succeed, then it can direct the allocation to retry
> until it succeeds and the allocator itself can wait for background
> compaction to make progress.
> 
> For code that has fallbacks to smaller allocations, there is no
> need to wait for compaction - we can attempt fast smaller allocations
> and continue that way until an allocation succeeds....

So, should we do a LSF/MM session?

But I think in any case, the page allocator needs to know which allocations
do have the fallback. __GFP_NORETRY exists for this. Here it wasn't tried at
all, in v2 [1] it was, but not alone. I'd start from __GFP_NORETRY alone,
and then we can look at tweaking what it does if it's currently insufficient.

We could have a helper to encapsulate this "turn this allocation into a
lightweight fallbackable one", which would add __GFP_NORETRY. It probably
already exists somewhere, but not in gfp.h. But I'm not sure we can simply
change GFP_KERNEL to start failing more for non-costly orders. We've
discussed that a lot in the past :)
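
To make the difference between the two approaches concrete, here is a
minimal userspace sketch, not kernel code. The MODEL_GFP_* bit values
are placeholders and do not match the kernel's real gfp bits, and both
function names are hypothetical. model_gfp_v1() mirrors what the posted
patch does for costly orders; model_gfp_fallbackable() is one possible
shape for the helper, adding the NORETRY hint alone and leaving direct
reclaim in place.

```c
#include <assert.h>

/* Placeholder flag bits for illustration; NOT the kernel's real values. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_DIRECT_RECLAIM 0x1u
#define MODEL_GFP_KSWAPD_RECLAIM 0x2u
#define MODEL_GFP_NORETRY        0x4u
#define MODEL_COSTLY_ORDER       3u	/* mirrors PAGE_ALLOC_COSTLY_ORDER */

static_assert((MODEL_GFP_DIRECT_RECLAIM & MODEL_GFP_NORETRY) == 0,
	      "model flag bits must not overlap");

/* Approach of the posted patch: no direct reclaim for costly orders. */
static model_gfp_t model_gfp_v1(model_gfp_t gfp, unsigned int order)
{
	if (order > MODEL_COSTLY_ORDER)
		gfp &= ~MODEL_GFP_DIRECT_RECLAIM;
	return gfp;
}

/*
 * Hypothetical helper: mark the allocation as one with a fallback by
 * adding the NORETRY hint only. Direct reclaim stays enabled, so the
 * allocator decides how much effort a fallbackable request gets.
 */
static model_gfp_t model_gfp_fallbackable(model_gfp_t gfp)
{
	return gfp | MODEL_GFP_NORETRY;
}
```

The point of the second form is that the caller only states "I have a
fallback" and the policy for what that means stays in one place.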

[1] https://lore.kernel.org/all/20260420161404.642-1-dipiets@amazon.it/

> -Dave.


^ permalink raw reply	[flat|nested] 25+ messages in thread

[parent not found: <20260403193201.30479-1-dipiets@amazon.it>]

* [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
       [not found] <20260403193201.30479-1-dipiets@amazon.it>
@ 2026-04-03 19:32 ` Salvatore Dipietro
  2026-04-04  6:25   ` Greg KH
  0 siblings, 1 reply; 25+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:32 UTC (permalink / raw)
  To: dipietro.salvatore; +Cc: stable

Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
introduced high-order folio allocations in the buffered write
path. When memory is fragmented, each failed allocation triggers
compaction and drain_all_pages() via __alloc_pages_slowpath(),
causing a 0.75x throughput drop on pgbench (simple-update) with 
1024 clients on a 96-vCPU arm64 system.

Strip __GFP_DIRECT_RECLAIM from folio allocations in
iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
making them purely opportunistic.

Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
---
 fs/iomap/buffered-io.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 92a831cf4bf1..cb843d54b4d9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
+	gfp_t gfp;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
@@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 		fgp |= FGP_DONTCACHE;
 	fgp |= fgf_set_order(len);
 
+	gfp = mapping_gfp_mask(iter->inode->i_mapping);
+
+	/*
+	 * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
+	 * strip __GFP_DIRECT_RECLAIM to make the allocation purely
+	 * opportunistic.  This avoids compaction + drain_all_pages()
+	 * in __alloc_pages_slowpath() that devastate throughput
+	 * on large systems during buffered writes.
+	 */
+	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
+		gfp &= ~__GFP_DIRECT_RECLAIM;
+
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
-			fgp, mapping_gfp_mask(iter->inode->i_mapping));
+			fgp, gfp);
 }
 EXPORT_SYMBOL_GPL(iomap_get_folio);
 
-- 
2.50.1 (Apple Git-155)

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-03 19:32 ` Salvatore Dipietro
@ 2026-04-04  6:25   ` Greg KH
  0 siblings, 0 replies; 25+ messages in thread
From: Greg KH @ 2026-04-04  6:25 UTC (permalink / raw)
  To: Salvatore Dipietro; +Cc: dipietro.salvatore, stable

On Fri, Apr 03, 2026 at 07:32:01PM +0000, Salvatore Dipietro wrote:
> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers
> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with 
> 1024 clients on a 96-vCPU arm64 system.
> 
> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> making them purely opportunistic.
> 
> Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> Cc: stable@vger.kernel.org
> Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
> ---
>  fs/iomap/buffered-io.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2026-06-24 14:24 UTC | newest]

Thread overview: 25+ messages
2026-04-03 19:35 [PATCH 0/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
2026-04-04  1:13   ` Ritesh Harjani
2026-04-04  4:15   ` Matthew Wilcox
2026-04-04 16:47     ` Ritesh Harjani
2026-04-04 20:46       ` Matthew Wilcox
2026-04-16 15:14       ` Ritesh Harjani
2026-04-20 16:33         ` Salvatore Dipietro
2026-04-20 18:44           ` Matthew Wilcox
2026-04-21  1:16             ` Ritesh Harjani
2026-04-28 15:02               ` Salvatore Dipietro
2026-05-03  5:52                 ` Ritesh Harjani
2026-05-03 11:55                   ` Matthew Wilcox
2026-05-06 12:33                     ` Salvatore Dipietro
2026-05-27 16:24                       ` Salvatore Dipietro
2026-05-31 23:29                         ` Karim Manaouil
2026-06-05 10:58                           ` Salvatore Dipietro
2026-06-24  8:06                         ` Salvatore Dipietro
2026-06-24 12:21                           ` Ritesh Harjani
2026-06-24 13:34                           ` Christoph Hellwig
2026-04-05 22:43   ` Dave Chinner
2026-04-07  5:40     ` Christoph Hellwig
2026-04-21  9:02     ` Vlastimil Babka
     [not found] <20260403193201.30479-1-dipiets@amazon.it>
2026-04-03 19:32 ` Salvatore Dipietro
2026-04-04  6:25   ` Greg KH
