public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/1] iomap: avoid compaction for costly folio order allocation
@ 2026-04-03 19:35 Salvatore Dipietro
  2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
  0 siblings, 1 reply; 7+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
	Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel

We are reporting a throughput regression on PostgreSQL pgbench
(simple-update) on arm64 caused by commit 5d8edfb900d5 ("iomap:
Copy larger chunks from userspace") introduced in v6.6-rc1.

The regression shows up as throughput dropping to 0.75x of baseline
on a pgbench simple-update workload with 1024 clients on a 96-vCPU
arm64 system. When memory is even slightly fragmented, each failed
high-order allocation enters __alloc_pages_slowpath(), which runs
two memory compaction passes plus drain_all_pages(), forcing every
vCPU to release its per-CPU pages. This repeats once per order
(up to 6) until the allocation succeeds.

The patch makes costly-order folio allocations in the iomap
buffered write path purely opportunistic -- no direct reclaim,
no compaction, no drain_all_pages().

Combined with the separate PREEMPT_LAZY regression [1], the two
regressions together account for a 2.87x loss in throughput and
latency.

1. Test environment
___________________

  Hardware:  1x AWS EC2 m8g.24xlarge
             (12x 1TB IO2 32000 iops RAID0 XFS)
  OS:        AL2023 (ami-03a8d3251f401ffca)
  Kernel:    next-20260331
  Database:  PostgreSQL 17
  Workload:  pgbench simple-update
             1024 clients, 96 threads, 1200s duration
             scale factor 8470, fillfactor=90, prepared protocol

2. Results
__________

  Config                 Run1       Run2       Run3       Avg         x
  _____________________  _________  _________  _________  __________  ____
  baseline                47242.39   53369.18   51644.29    50751.96  1.00
  iomap patch             69305.92   66994.08   64603.33    66967.78  1.32
  preempt-none [1]        92906.62  103976.03   98814.94    98565.86  1.94
  iomap+preempt-none[1]  145904.53  146470.95  144728.91   145701.46  2.87

3. Reproduction
_______________

On the AWS EC2 m8g.24xlarge, install and run the PostgreSQL
database using the repro-collection repository like:

  # Reproducer code:
  git clone https://github.com/aws/repro-collection.git ~/repro-collection

  # Setup and start PostgreSQL server in terminal 1:
  ~/repro-collection/run.sh postgresql SUT --ldg=127.0.0.1

  # Run pgbench load generator in terminal 2:
  PGBENCH_SCALE=8470 \
  PGBENCH_INIT_EXTRA_ARGS="--fillfactor=90" \
  PGBENCH_CLIENTS=1024 \
  PGBENCH_THREADS=96 \
  PGBENCH_DURATION=1200 \
  PGBENCH_BUILTIN=simple-update \
  PGBENCH_PROTOCOL=prepared \
  ~/repro-collection/run.sh postgresql LDG --sut=127.0.0.1

[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#t

Salvatore Dipietro (1):
  iomap: avoid compaction for costly folio order allocation

 fs/iomap/buffered-io.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)


base-commit: 9147566d801602c9e7fc7f85e989735735bf38ba
-- 
2.50.1 (Apple Git-155)

AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-03 19:35 [PATCH 0/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
@ 2026-04-03 19:35 ` Salvatore Dipietro
  2026-04-04  1:13   ` Ritesh Harjani
                     ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
	stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel

Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
introduced high-order folio allocations in the buffered write
path. When memory is fragmented, each failed allocation triggers
compaction and drain_all_pages() via __alloc_pages_slowpath(),
causing a 0.75x throughput drop on pgbench (simple-update) with 
1024 clients on a 96-vCPU arm64 system.

Strip __GFP_DIRECT_RECLAIM from folio allocations in
iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
making them purely opportunistic.

Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
---
 fs/iomap/buffered-io.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 92a831cf4bf1..cb843d54b4d9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
+	gfp_t gfp;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
@@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 		fgp |= FGP_DONTCACHE;
 	fgp |= fgf_set_order(len);
 
+	gfp = mapping_gfp_mask(iter->inode->i_mapping);
+
+	/*
+	 * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
+	 * strip __GFP_DIRECT_RECLAIM to make the allocation purely
+	 * opportunistic.  This avoids compaction + drain_all_pages()
+	 * in __alloc_pages_slowpath() that devastate throughput
+	 * on large systems during buffered writes.
+	 */
+	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
+		gfp &= ~__GFP_DIRECT_RECLAIM;
+
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
-			fgp, mapping_gfp_mask(iter->inode->i_mapping));
+			fgp, gfp);
 }
 EXPORT_SYMBOL_GPL(iomap_get_folio);
 
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
@ 2026-04-04  1:13   ` Ritesh Harjani
  2026-04-04  4:15   ` Matthew Wilcox
  2026-04-05 22:43   ` Dave Chinner
  2 siblings, 0 replies; 7+ messages in thread
From: Ritesh Harjani @ 2026-04-04  1:13 UTC (permalink / raw)
  To: Salvatore Dipietro, linux-kernel
  Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
	stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel, linux-mm


Let's cc: linux-mm too.

Salvatore Dipietro <dipiets@amazon.it> writes:

> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers

Isn't running compaction the right thing to do when memory is
fragmented?

> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with 
> 1024 clients on a 96-vCPU arm64 system.
>

I think removing the __GFP_DIRECT_RECLAIM flag unconditionally at the
caller may cause -ENOMEM. Note that it is the __filemap_get_folio()
which retries with smaller order allocations, so instead of changing the
callers, shouldn't this be fixed in __filemap_get_folio() instead?

Maybe in there too, we should keep the reclaim flag (if passed by
caller) at least for <= PAGE_ALLOC_COSTLY_ORDER + 1

Thoughts?

-ritesh

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
  2026-04-04  1:13   ` Ritesh Harjani
@ 2026-04-04  4:15   ` Matthew Wilcox
  2026-04-04 16:47     ` Ritesh Harjani
  2026-04-05 22:43   ` Dave Chinner
  2 siblings, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2026-04-04  4:15 UTC (permalink / raw)
  To: Salvatore Dipietro
  Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
	stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel, linux-mm

On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers
> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with 
> 1024 clients on a 96-vCPU arm64 system.
> 
> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> making them purely opportunistic.

If you look at __filemap_get_folio_mpol(), that's kind of being tried
already:

                        if (order > min_order)
                                alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;

 * %__GFP_NORETRY: The VM implementation will try only very lightweight
 * memory direct reclaim to get some memory under memory pressure (thus
 * it can sleep). It will avoid disruptive actions like OOM killer. The
 * caller must handle the failure which is quite likely to happen under
 * heavy memory pressure. The flag is suitable when failure can easily be
 * handled at small cost, such as reduced throughput.

which, from the description, seemed like the right approach.  So either
the description or the implementation should be updated, I suppose?

Now, what happens if you change those two lines to:

			if (order > min_order) {
				alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
				alloc_gfp |= __GFP_NOWARN;
			}

Do you recover the performance?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-04  4:15   ` Matthew Wilcox
@ 2026-04-04 16:47     ` Ritesh Harjani
  2026-04-04 20:46       ` Matthew Wilcox
  0 siblings, 1 reply; 7+ messages in thread
From: Ritesh Harjani @ 2026-04-04 16:47 UTC (permalink / raw)
  To: Matthew Wilcox, Salvatore Dipietro
  Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
	stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel, linux-mm

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> introduced high-order folio allocations in the buffered write
>> path. When memory is fragmented, each failed allocation triggers
>> compaction and drain_all_pages() via __alloc_pages_slowpath(),
>> causing a 0.75x throughput drop on pgbench (simple-update) with 
>> 1024 clients on a 96-vCPU arm64 system.
>> 
>> Strip __GFP_DIRECT_RECLAIM from folio allocations in
>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
>> making them purely opportunistic.
>
> If you look at __filemap_get_folio_mpol(), that's kind of being tried
> already:
>
>                         if (order > min_order)
>                                 alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
>
>  * %__GFP_NORETRY: The VM implementation will try only very lightweight
>  * memory direct reclaim to get some memory under memory pressure (thus
>  * it can sleep). It will avoid disruptive actions like OOM killer. The
>  * caller must handle the failure which is quite likely to happen under
>  * heavy memory pressure. The flag is suitable when failure can easily be
>  * handled at small cost, such as reduced throughput.
>
> which, from the description, seemed like the right approach.  So either
> the description or the implementation should be updated, I suppose?
>
> Now, what happens if you change those two lines to:
>
> 			if (order > min_order) {
> 				alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> 				alloc_gfp |= __GFP_NOWARN;
> 			}

Hi Matthew,

Shouldn't we try this instead? It would still allow us to keep
__GFP_NORETRY, and hence lightweight direct reclaim/compaction, for
at least the non-costly order allocations, right?

 			if (order > min_order) {
				alloc_gfp |= __GFP_NOWARN;
				if (order > PAGE_ALLOC_COSTLY_ORDER)
					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
				else
					alloc_gfp |= __GFP_NORETRY;
			}

-ritesh

>
> Do you recover the performance?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-04 16:47     ` Ritesh Harjani
@ 2026-04-04 20:46       ` Matthew Wilcox
  0 siblings, 0 replies; 7+ messages in thread
From: Matthew Wilcox @ 2026-04-04 20:46 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
	dipietro.salvatore, stable, Christian Brauner, Darrick J. Wong,
	linux-xfs, linux-fsdevel, linux-mm

On Sat, Apr 04, 2026 at 10:17:33PM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
> >> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> >> introduced high-order folio allocations in the buffered write
> >> path. When memory is fragmented, each failed allocation triggers
> >> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> >> causing a 0.75x throughput drop on pgbench (simple-update) with 
> >> 1024 clients on a 96-vCPU arm64 system.
> >> 
> >> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> >> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> >> making them purely opportunistic.
> >
> > If you look at __filemap_get_folio_mpol(), that's kind of being tried
> > already:
> >
> >                         if (order > min_order)
> >                                 alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> >
> >  * %__GFP_NORETRY: The VM implementation will try only very lightweight
> >  * memory direct reclaim to get some memory under memory pressure (thus
> >  * it can sleep). It will avoid disruptive actions like OOM killer. The
> >  * caller must handle the failure which is quite likely to happen under
> >  * heavy memory pressure. The flag is suitable when failure can easily be
> >  * handled at small cost, such as reduced throughput.
> >
> > which, from the description, seemed like the right approach.  So either
> > the description or the implementation should be updated, I suppose?
> >
> > Now, what happens if you change those two lines to:
> >
> > 			if (order > min_order) {
> > 				alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> > 				alloc_gfp |= __GFP_NOWARN;
> > 			}
> 
> Hi Matthew,
> 
> Shouldn't we try this instead? It would still allow us to keep
> __GFP_NORETRY, and hence lightweight direct reclaim/compaction, for
> at least the non-costly order allocations, right?
> 
>  			if (order > min_order) {
> 				alloc_gfp |= __GFP_NOWARN;
> 				if (order > PAGE_ALLOC_COSTLY_ORDER)
> 					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> 				else
> 					alloc_gfp |= __GFP_NORETRY;
> 			}

Uhh ... maybe?  I'd want someone more familiar with the page allocator
than I am to say whether that's the right approach.  If it is, that
seems too complex, and maybe we need a better approach to the page
allocator flags.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
  2026-04-04  1:13   ` Ritesh Harjani
  2026-04-04  4:15   ` Matthew Wilcox
@ 2026-04-05 22:43   ` Dave Chinner
  2 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2026-04-05 22:43 UTC (permalink / raw)
  To: Salvatore Dipietro
  Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
	willy, stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel

On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers
> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with 
> 1024 clients on a 96-vCPU arm64 system.
> 
> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> making them purely opportunistic.
> 
> Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> Cc: stable@vger.kernel.org
> Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
> ---
>  fs/iomap/buffered-io.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 92a831cf4bf1..cb843d54b4d9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  {
>  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
> +	gfp_t gfp;
>  
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
> @@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  		fgp |= FGP_DONTCACHE;
>  	fgp |= fgf_set_order(len);
>  
> +	gfp = mapping_gfp_mask(iter->inode->i_mapping);
> +
> +	/*
> +	 * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
> +	 * strip __GFP_DIRECT_RECLAIM to make the allocation purely
> +	 * opportunistic.  This avoids compaction + drain_all_pages()
> +	 * in __alloc_pages_slowpath() that devastate throughput
> +	 * on large systems during buffered writes.
> +	 */
> +	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
> +		gfp &= ~__GFP_DIRECT_RECLAIM;

Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere
we need to do high order folio allocation is getting out of hand.

Compaction improves long term system performance, so we don't really
just want to turn it off whenever we have demand for high order
folios.

What we should be doing is getting compaction out of the direct
reclaim path - it is -clearly- way too costly for hot paths that use
large allocations, especially those with fallbacks to smaller
allocations or vmalloc.

Instead, memory reclaim should kick background compaction and let it
do the work. If the allocation path really, really needs high order
allocation to succeed, then it can direct the allocation to retry
until it succeeds and the allocator itself can wait for background
compaction to make progress.

For code that has fallbacks to smaller allocations, then there is no
need to wait for compaction - we can attempt fast smaller allocations
and continue that way until an allocation succeeds....

-Dave.
-- 
Dave Chinner
dgc@kernel.org

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-04-05 22:44 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-03 19:35 [PATCH 0/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
2026-04-03 19:35 ` [PATCH 1/1] " Salvatore Dipietro
2026-04-04  1:13   ` Ritesh Harjani
2026-04-04  4:15   ` Matthew Wilcox
2026-04-04 16:47     ` Ritesh Harjani
2026-04-04 20:46       ` Matthew Wilcox
2026-04-05 22:43   ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox