* [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
[not found] <20260403193201.30479-1-dipiets@amazon.it>
@ 2026-04-03 19:32 ` Salvatore Dipietro
2026-04-04 6:25 ` Greg KH
0 siblings, 1 reply; 17+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:32 UTC (permalink / raw)
To: dipietro.salvatore; +Cc: stable
Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
introduced high-order folio allocations in the buffered write
path. When memory is fragmented, each failed allocation triggers
compaction and drain_all_pages() via __alloc_pages_slowpath(),
causing a 0.75x throughput drop on pgbench (simple-update) with
1024 clients on a 96-vCPU arm64 system.
Strip __GFP_DIRECT_RECLAIM from folio allocations in
iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
making them purely opportunistic.
Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
---
fs/iomap/buffered-io.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 92a831cf4bf1..cb843d54b4d9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
{
fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
+ gfp_t gfp;
if (iter->flags & IOMAP_NOWAIT)
fgp |= FGP_NOWAIT;
@@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
fgp |= FGP_DONTCACHE;
fgp |= fgf_set_order(len);
+ gfp = mapping_gfp_mask(iter->inode->i_mapping);
+
+ /*
+ * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
+ * strip __GFP_DIRECT_RECLAIM to make the allocation purely
+ * opportunistic. This avoids compaction + drain_all_pages()
+ * in __alloc_pages_slowpath() that devastate throughput
+ * on large systems during buffered writes.
+ */
+ if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
+ gfp &= ~__GFP_DIRECT_RECLAIM;
+
return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
- fgp, mapping_gfp_mask(iter->inode->i_mapping));
+ fgp, gfp);
}
EXPORT_SYMBOL_GPL(iomap_get_folio);
--
2.50.1 (Apple Git-155)
AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
[not found] <20260403193535.9970-1-dipiets@amazon.it>
@ 2026-04-03 19:35 ` Salvatore Dipietro
2026-04-04 1:13 ` Ritesh Harjani
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:35 UTC (permalink / raw)
To: linux-kernel
Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
stable, Christian Brauner, Darrick J. Wong, linux-xfs,
linux-fsdevel
Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
introduced high-order folio allocations in the buffered write
path. When memory is fragmented, each failed allocation triggers
compaction and drain_all_pages() via __alloc_pages_slowpath(),
causing a 0.75x throughput drop on pgbench (simple-update) with
1024 clients on a 96-vCPU arm64 system.
Strip __GFP_DIRECT_RECLAIM from folio allocations in
iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
making them purely opportunistic.
Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
---
fs/iomap/buffered-io.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 92a831cf4bf1..cb843d54b4d9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
{
fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
+ gfp_t gfp;
if (iter->flags & IOMAP_NOWAIT)
fgp |= FGP_NOWAIT;
@@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
fgp |= FGP_DONTCACHE;
fgp |= fgf_set_order(len);
+ gfp = mapping_gfp_mask(iter->inode->i_mapping);
+
+ /*
+ * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
+ * strip __GFP_DIRECT_RECLAIM to make the allocation purely
+ * opportunistic. This avoids compaction + drain_all_pages()
+ * in __alloc_pages_slowpath() that devastate throughput
+ * on large systems during buffered writes.
+ */
+ if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
+ gfp &= ~__GFP_DIRECT_RECLAIM;
+
return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
- fgp, mapping_gfp_mask(iter->inode->i_mapping));
+ fgp, gfp);
}
EXPORT_SYMBOL_GPL(iomap_get_folio);
--
2.50.1 (Apple Git-155)
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-03 19:35 ` [PATCH 1/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
@ 2026-04-04 1:13 ` Ritesh Harjani
2026-04-04 4:15 ` Matthew Wilcox
2026-04-05 22:43 ` Dave Chinner
2 siblings, 0 replies; 17+ messages in thread
From: Ritesh Harjani @ 2026-04-04 1:13 UTC (permalink / raw)
To: Salvatore Dipietro, linux-kernel
Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
stable, Christian Brauner, Darrick J. Wong, linux-xfs,
linux-fsdevel, linux-mm
Let's cc: linux-mm too.
Salvatore Dipietro <dipiets@amazon.it> writes:
> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers
Isn't running compaction the right thing to do when memory is
fragmented?
> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with
> 1024 clients on a 96-vCPU arm64 system.
>
I think removing the __GFP_DIRECT_RECLAIM flag unconditionally at the
caller may cause -ENOMEM. Note that it is __filemap_get_folio() which
retries with smaller order allocations, so instead of changing the
callers, shouldn't this be fixed in __filemap_get_folio()?
Maybe in there too, we should keep the reclaim flag (if passed by the
caller) at least for orders <= PAGE_ALLOC_COSTLY_ORDER + 1.
Thoughts?
-ritesh
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-03 19:35 ` [PATCH 1/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
2026-04-04 1:13 ` Ritesh Harjani
@ 2026-04-04 4:15 ` Matthew Wilcox
2026-04-04 16:47 ` Ritesh Harjani
2026-04-05 22:43 ` Dave Chinner
2 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2026-04-04 4:15 UTC (permalink / raw)
To: Salvatore Dipietro
Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
stable, Christian Brauner, Darrick J. Wong, linux-xfs,
linux-fsdevel, linux-mm
On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers
> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with
> 1024 clients on a 96-vCPU arm64 system.
>
> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> making them purely opportunistic.
If you look at __filemap_get_folio_mpol(), that's kind of being tried
already:
if (order > min_order)
alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
* %__GFP_NORETRY: The VM implementation will try only very lightweight
* memory direct reclaim to get some memory under memory pressure (thus
* it can sleep). It will avoid disruptive actions like OOM killer. The
* caller must handle the failure which is quite likely to happen under
* heavy memory pressure. The flag is suitable when failure can easily be
* handled at small cost, such as reduced throughput.
which, from the description, seemed like the right approach. So either
the description or the implementation should be updated, I suppose?
Now, what happens if you change those two lines to:
if (order > min_order) {
alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
alloc_gfp |= __GFP_NOWARN;
}
Do you recover the performance?
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-03 19:32 ` Salvatore Dipietro
@ 2026-04-04 6:25 ` Greg KH
0 siblings, 0 replies; 17+ messages in thread
From: Greg KH @ 2026-04-04 6:25 UTC (permalink / raw)
To: Salvatore Dipietro; +Cc: dipietro.salvatore, stable
On Fri, Apr 03, 2026 at 07:32:01PM +0000, Salvatore Dipietro wrote:
> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers
> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with
> 1024 clients on a 96-vCPU arm64 system.
>
> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> making them purely opportunistic.
>
> Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> Cc: stable@vger.kernel.org
> Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
> ---
> fs/iomap/buffered-io.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
<formletter>
This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.
</formletter>
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-04 4:15 ` Matthew Wilcox
@ 2026-04-04 16:47 ` Ritesh Harjani
2026-04-04 20:46 ` Matthew Wilcox
2026-04-16 15:14 ` Ritesh Harjani
0 siblings, 2 replies; 17+ messages in thread
From: Ritesh Harjani @ 2026-04-04 16:47 UTC (permalink / raw)
To: Matthew Wilcox, Salvatore Dipietro
Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
stable, Christian Brauner, Darrick J. Wong, linux-xfs,
linux-fsdevel, linux-mm
Matthew Wilcox <willy@infradead.org> writes:
> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> introduced high-order folio allocations in the buffered write
>> path. When memory is fragmented, each failed allocation triggers
>> compaction and drain_all_pages() via __alloc_pages_slowpath(),
>> causing a 0.75x throughput drop on pgbench (simple-update) with
>> 1024 clients on a 96-vCPU arm64 system.
>>
>> Strip __GFP_DIRECT_RECLAIM from folio allocations in
>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
>> making them purely opportunistic.
>
> If you look at __filemap_get_folio_mpol(), that's kind of being tried
> already:
>
> if (order > min_order)
> alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
>
> * %__GFP_NORETRY: The VM implementation will try only very lightweight
> * memory direct reclaim to get some memory under memory pressure (thus
> * it can sleep). It will avoid disruptive actions like OOM killer. The
> * caller must handle the failure which is quite likely to happen under
> * heavy memory pressure. The flag is suitable when failure can easily be
> * handled at small cost, such as reduced throughput.
>
> which, from the description, seemed like the right approach. So either
> the description or the implementation should be updated, I suppose?
>
> Now, what happens if you change those two lines to:
>
> if (order > min_order) {
> alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> alloc_gfp |= __GFP_NOWARN;
> }
Hi Matthew,
Shouldn't we try this instead? This would still allow us to keep
__GFP_NORETRY and hence lightweight direct reclaim/compaction for at
least the non-costly order allocations, right?
if (order > min_order) {
alloc_gfp |= __GFP_NOWARN;
if (order > PAGE_ALLOC_COSTLY_ORDER)
alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
else
alloc_gfp |= __GFP_NORETRY;
}
-ritesh
>
> Do you recover the performance?
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-04 16:47 ` Ritesh Harjani
@ 2026-04-04 20:46 ` Matthew Wilcox
2026-04-16 15:14 ` Ritesh Harjani
1 sibling, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2026-04-04 20:46 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
dipietro.salvatore, stable, Christian Brauner, Darrick J. Wong,
linux-xfs, linux-fsdevel, linux-mm
On Sat, Apr 04, 2026 at 10:17:33PM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <willy@infradead.org> writes:
>
> > On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
> >> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> >> introduced high-order folio allocations in the buffered write
> >> path. When memory is fragmented, each failed allocation triggers
> >> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> >> causing a 0.75x throughput drop on pgbench (simple-update) with
> >> 1024 clients on a 96-vCPU arm64 system.
> >>
> >> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> >> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> >> making them purely opportunistic.
> >
> > If you look at __filemap_get_folio_mpol(), that's kind of being tried
> > already:
> >
> > if (order > min_order)
> > alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> >
> > * %__GFP_NORETRY: The VM implementation will try only very lightweight
> > * memory direct reclaim to get some memory under memory pressure (thus
> > * it can sleep). It will avoid disruptive actions like OOM killer. The
> > * caller must handle the failure which is quite likely to happen under
> > * heavy memory pressure. The flag is suitable when failure can easily be
> > * handled at small cost, such as reduced throughput.
> >
> > which, from the description, seemed like the right approach. So either
> > the description or the implementation should be updated, I suppose?
> >
> > Now, what happens if you change those two lines to:
> >
> > if (order > min_order) {
> > alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> > alloc_gfp |= __GFP_NOWARN;
> > }
>
> Hi Matthew,
>
> Shouldn't we try this instead? This would still allow us to keep
> __GFP_NORETRY and hence lightweight direct reclaim/compaction for at
> least the non-costly order allocations, right?
>
> if (order > min_order) {
> alloc_gfp |= __GFP_NOWARN;
> if (order > PAGE_ALLOC_COSTLY_ORDER)
> alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> else
> alloc_gfp |= __GFP_NORETRY;
> }
Uhh ... maybe? I'd want someone more familiar with the page allocator
than I am to say whether that's the right approach. If it is, that
seems too complex, and maybe we need a better approach to the page
allocator flags.
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-03 19:35 ` [PATCH 1/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
2026-04-04 1:13 ` Ritesh Harjani
2026-04-04 4:15 ` Matthew Wilcox
@ 2026-04-05 22:43 ` Dave Chinner
2026-04-07 5:40 ` Christoph Hellwig
2026-04-21 9:02 ` Vlastimil Babka
2 siblings, 2 replies; 17+ messages in thread
From: Dave Chinner @ 2026-04-05 22:43 UTC (permalink / raw)
To: Salvatore Dipietro
Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
willy, stable, Christian Brauner, Darrick J. Wong, linux-xfs,
linux-fsdevel
On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers
> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with
> 1024 clients on a 96-vCPU arm64 system.
>
> Strip __GFP_DIRECT_RECLAIM from folio allocations in
> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
> making them purely opportunistic.
>
> Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> Cc: stable@vger.kernel.org
> Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
> ---
> fs/iomap/buffered-io.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 92a831cf4bf1..cb843d54b4d9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
> struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
> {
> fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
> + gfp_t gfp;
>
> if (iter->flags & IOMAP_NOWAIT)
> fgp |= FGP_NOWAIT;
> @@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
> fgp |= FGP_DONTCACHE;
> fgp |= fgf_set_order(len);
>
> + gfp = mapping_gfp_mask(iter->inode->i_mapping);
> +
> + /*
> + * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
> + * strip __GFP_DIRECT_RECLAIM to make the allocation purely
> + * opportunistic. This avoids compaction + drain_all_pages()
> + * in __alloc_pages_slowpath() that devastate throughput
> + * on large systems during buffered writes.
> + */
> + if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
> + gfp &= ~__GFP_DIRECT_RECLAIM;
Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere
we need to do high order folio allocation is getting out of hand.
Compaction improves long term system performance, so we don't really
just want to turn it off whenever we have demand for high order
folios.
What we should be doing is getting compaction out of the direct
reclaim path - it is -clearly- way too costly for hot paths that use
large allocations, especially those with fallbacks to smaller
allocations or vmalloc.
Instead, memory reclaim should kick background compaction and let it
do the work. If the allocation path really, really needs high order
allocation to succeed, then it can direct the allocation to retry
until it succeeds and the allocator itself can wait for background
compaction to make progress.
For code that has fallbacks to smaller allocations, then there is no
need to wait for compaction - we can attempt fast smaller allocations
and continue that way until an allocation succeeds....
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-05 22:43 ` Dave Chinner
@ 2026-04-07 5:40 ` Christoph Hellwig
2026-04-21 9:02 ` Vlastimil Babka
1 sibling, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2026-04-07 5:40 UTC (permalink / raw)
To: Dave Chinner
Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze,
dipietro.salvatore, willy, stable, Christian Brauner,
Darrick J. Wong, linux-xfs, linux-fsdevel
On Mon, Apr 06, 2026 at 08:43:57AM +1000, Dave Chinner wrote:
> > + if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
> > + gfp &= ~__GFP_DIRECT_RECLAIM;
>
> Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere
> we need to do high order folio allocation is getting out of hand.
That's what I thought.
> Compaction improves long term system performance, so we don't really
> just want to turn it off whenever we have demand for high order
> folios.
Yes. Also if we want to make block size > PAGE_SIZE a real option,
just giving up on allocating large folios is not an option.
> Instead, memory reclaim should kick background compaction and let it
> do the work. If the allocation path really, really needs high order
> allocation to succeed, then it can direct the allocation to retry
> until it succeeds and the allocator itself can wait for background
> compaction to make progress.
>
> For code that has fallbacks to smaller allocations, then there is no
> need to wait for compaction - we can attempt fast smaller allocations
> and continue that way until an allocation succeeds....
Yes.
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-04 16:47 ` Ritesh Harjani
2026-04-04 20:46 ` Matthew Wilcox
@ 2026-04-16 15:14 ` Ritesh Harjani
2026-04-20 16:33 ` Salvatore Dipietro
1 sibling, 1 reply; 17+ messages in thread
From: Ritesh Harjani @ 2026-04-16 15:14 UTC (permalink / raw)
To: Matthew Wilcox, Salvatore Dipietro
Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
stable, Christian Brauner, Darrick J. Wong, linux-xfs,
linux-fsdevel, linux-mm
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> Matthew Wilcox <willy@infradead.org> writes:
>
>> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
>>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>>> introduced high-order folio allocations in the buffered write
>>> path. When memory is fragmented, each failed allocation triggers
>>> compaction and drain_all_pages() via __alloc_pages_slowpath(),
>>> causing a 0.75x throughput drop on pgbench (simple-update) with
>>> 1024 clients on a 96-vCPU arm64 system.
>>>
>>> Strip __GFP_DIRECT_RECLAIM from folio allocations in
>>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
>>> making them purely opportunistic.
>>
>> If you look at __filemap_get_folio_mpol(), that's kind of being tried
>> already:
>>
>> if (order > min_order)
>> alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
>>
>> * %__GFP_NORETRY: The VM implementation will try only very lightweight
>> * memory direct reclaim to get some memory under memory pressure (thus
>> * it can sleep). It will avoid disruptive actions like OOM killer. The
>> * caller must handle the failure which is quite likely to happen under
>> * heavy memory pressure. The flag is suitable when failure can easily be
>> * handled at small cost, such as reduced throughput.
>>
>> which, from the description, seemed like the right approach. So either
>> the description or the implementation should be updated, I suppose?
>>
>> Now, what happens if you change those two lines to:
>>
>> if (order > min_order) {
>> alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
>> alloc_gfp |= __GFP_NOWARN;
>> }
>
> Hi Matthew,
>
> Shouldn't we try this instead? This would still allow us to keep
> __GFP_NORETRY and hence lightweight direct reclaim/compaction for at
> least the non-costly order allocations, right?
>
> if (order > min_order) {
> alloc_gfp |= __GFP_NOWARN;
> if (order > PAGE_ALLOC_COSTLY_ORDER)
> alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> else
> alloc_gfp |= __GFP_NORETRY;
> }
>
Hi Salvatore,
Did you get a chance to test the above two options (shared by Matthew
and me)? And were you able to recover the performance back with those?
So in the longer run, as Dave suggested, we might need to fix this by
removing compaction from the direct reclaim path. But for fixing it in
older kernel releases we might need a quick fix, so it is perhaps
worth trying the above suggested changes.
Also, I am somehow not able to hit this problem on my end (even after
creating a bit of memory fragmentation), so please feel free to share
the steps if you have a setup to re-create it easily.
-ritesh
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-16 15:14 ` Ritesh Harjani
@ 2026-04-20 16:33 ` Salvatore Dipietro
2026-04-20 18:44 ` Matthew Wilcox
0 siblings, 1 reply; 17+ messages in thread
From: Salvatore Dipietro @ 2026-04-20 16:33 UTC (permalink / raw)
To: ritesh.list
Cc: abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore,
dipiets, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs,
stable, willy
I have submitted a v2 of the patch based on Ritesh's suggestion.
https://lore.kernel.org/linux-mm/20260420161404.642-1-dipiets@amazon.it/T/#u
Salvatore
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-20 16:33 ` Salvatore Dipietro
@ 2026-04-20 18:44 ` Matthew Wilcox
2026-04-21 1:16 ` Ritesh Harjani
0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2026-04-20 18:44 UTC (permalink / raw)
To: Salvatore Dipietro
Cc: ritesh.list, abuehaze, alisaidi, blakgeof, brauner,
dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm,
linux-xfs, stable
On Mon, Apr 20, 2026 at 04:33:28PM +0000, Salvatore Dipietro wrote:
> I have submitted a v2 of the patch based on Ritesh's suggestion.
> https://lore.kernel.org/linux-mm/20260420161404.642-1-dipiets@amazon.it/T/#u
... but without linking back to this thread, so nobody who was exposed
to that thread for the first time knows about this one. That's poor form.
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-20 18:44 ` Matthew Wilcox
@ 2026-04-21 1:16 ` Ritesh Harjani
2026-04-28 15:02 ` Salvatore Dipietro
0 siblings, 1 reply; 17+ messages in thread
From: Ritesh Harjani @ 2026-04-21 1:16 UTC (permalink / raw)
To: Salvatore Dipietro
Cc: Matthew Wilcox, abuehaze, alisaidi, blakgeof, brauner,
dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm,
linux-xfs, stable
Matthew Wilcox <willy@infradead.org> writes:
> On Mon, Apr 20, 2026 at 04:33:28PM +0000, Salvatore Dipietro wrote:
>> I have submitted a v2 of the patch based on Ritesh's suggestion.
>> https://lore.kernel.org/linux-mm/20260420161404.642-1-dipiets@amazon.it/T/#u
>
> ... but without linking back to this thread, so nobody who was exposed
> to that thread for the first time knows about this one. That's poor form.
Yup.
Also, given that the maintainers (willy, Christoph, Dave) have shown
their disinterest in taking the patch in its current form, the right
way is to come back with performance data for both the approaches we
were discussing and first get consensus from everyone, before
proposing this as a patch :).
Having said that, we do care if a genuine performance issue gets
reported. In that context, I wanted to understand your setup a bit
from a memory fragmentation perspective. Are you simulating memory
fragmentation and then benchmarking? Or did this problem occur when
you simply ran the reproduction steps mentioned in your cover letter?
BTW - I was following the other thread too, where the PREEMPT_LAZY
problem was being discussed. From what I understood, you mentioned [1]
that enabling THP on the system made that problem go away. It also
looks like enabling THP is the right thing to do for this kind of
workload. Does that mean enabling THP fixed this problem too? Do you
still hit memory fragmentation and/or a similar throughput drop
without this fix after you enable THP? It would be good to know those
details too, please.
[1]: https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#md88ca4258766e897e432df85874d197db476c7d1
-ritesh
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-05 22:43 ` Dave Chinner
2026-04-07 5:40 ` Christoph Hellwig
@ 2026-04-21 9:02 ` Vlastimil Babka
1 sibling, 0 replies; 17+ messages in thread
From: Vlastimil Babka @ 2026-04-21 9:02 UTC (permalink / raw)
To: Dave Chinner, Salvatore Dipietro
Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
willy, stable, Christian Brauner, Darrick J. Wong, linux-xfs,
linux-fsdevel, Ritesh Harjani (IBM), Christoph Hellwig,
linux-mm@kvack.org, Michal Hocko, David Hildenbrand (Red Hat),
Johannes Weiner
On 4/6/26 00:43, Dave Chinner wrote:
> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> introduced high-order folio allocations in the buffered write
>> path. When memory is fragmented, each failed allocation triggers
>> compaction and drain_all_pages() via __alloc_pages_slowpath(),
>> causing a 0.75x throughput drop on pgbench (simple-update) with
>> 1024 clients on a 96-vCPU arm64 system.
>>
>> Strip __GFP_DIRECT_RECLAIM from folio allocations in
>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
>> making them purely opportunistic.
>>
>> Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
BTW, backporting perf regression fixes to 6.6, when they are only
reported at the time 7.0 is released, might be too risky. There will
likely be a different workload that regresses as a result, no matter
what we do.
>> ---
>> fs/iomap/buffered-io.c | 15 ++++++++++++++-
>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 92a831cf4bf1..cb843d54b4d9 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>> struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>> {
>> fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
>> + gfp_t gfp;
>>
>> if (iter->flags & IOMAP_NOWAIT)
>> fgp |= FGP_NOWAIT;
>> @@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>> fgp |= FGP_DONTCACHE;
>> fgp |= fgf_set_order(len);
>>
>> + gfp = mapping_gfp_mask(iter->inode->i_mapping);
>> +
>> + /*
>> + * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
>> + * strip __GFP_DIRECT_RECLAIM to make the allocation purely
>> + * opportunistic. This avoids compaction + drain_all_pages()
>> + * in __alloc_pages_slowpath() that devastate throughput
>> + * on large systems during buffered writes.
>> + */
>> + if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
>> + gfp &= ~__GFP_DIRECT_RECLAIM;
>
> Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere
> we need to do high order folio allocation is getting out of hand.
>
> Compaction improves long term system performance, so we don't really
> just want to turn it off whenever we have demand for high order
> folios.
>
> What we should be doing is getting compaction out of the direct
> reclaim path - it is -clearly- way too costly for hot paths that use
> large allocations, especially those with fallbacks to smaller
> allocations or vmalloc.
>
> Instead, memory reclaim should kick background compaction and let it
> do the work. If the allocation path really, really needs high order
> allocation to succeed, then it can direct the allocation to retry
> until it succeeds and the allocator itself can wait for background
> compaction to make progress.
>
> For code that has fallbacks to smaller allocations, then there is no
> need to wait for compaction - we can attempt fast smaller allocations
> and continue that way until an allocation succeeds....
So, should we do a LSF/MM session?
But I think in any case, the page allocator needs to know which allocations
do have the fallback. __GFP_NORETRY exists for this. Here it wasn't tried at
all, in v2 [1] it was, but not alone. I'd start from __GFP_NORETRY alone,
and then we can look at tweaking what it does if it's currently insufficient.
We could have a helper to encapsulate this "turn this allocation into a
lightweight, fallbackable one", which would add __GFP_NORETRY. It probably
already exists somewhere, but not in gfp.h. But I'm not sure we can simply
change GFP_KERNEL to start failing more often for non-costly orders. We've
discussed that a lot in the past :)
[1] https://lore.kernel.org/all/20260420161404.642-1-dipiets@amazon.it/
> -Dave.
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-21 1:16 ` Ritesh Harjani
@ 2026-04-28 15:02 ` Salvatore Dipietro
2026-05-03 5:52 ` Ritesh Harjani
0 siblings, 1 reply; 17+ messages in thread
From: Salvatore Dipietro @ 2026-04-28 15:02 UTC (permalink / raw)
To: ritesh.list
Cc: abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore,
dipiets, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs,
stable, willy
On 4/21/26 00:43, Ritesh Harjani wrote:
> Also, given that the maintainers (willy, Christoph, Dave) have shown their
> disinterest in taking the patch in its current form, the right way is
> to get back with performance data for both the approaches (which we
> were discussing) and first get consensus from everyone before
> proposing this as a patch :).
Thank you for the follow-up and the additional context, Ritesh.
I might have misunderstood the previous request and will make sure to
link back to previous patch versions in the future.
Here are the performance results that we have collected on our end with
the proposed patches:
| Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline |
|----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
| Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — |
| Proposed patch | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45% |
| Ritesh's suggestion | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50% |
| Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07% |
On 4/21/26 00:43, Ritesh Harjani wrote:
> In that context, I wanted to understand your setup a bit from a
> memory fragmentation perspective. Are you trying to simulate memory
> fragmentation and then benchmarking? Or does this problem hit when
> you simply run the reproduction steps mentioned in your cover
> letter?
All results were collected on fresh AWS instances as described in the
cover letter. Patch [1] was applied on all instances to avoid the
other regression. Each instance was restarted to pick up the patched
kernel and ensure clean memory before installing and starting the
PostgreSQL benchmark via repro-collection [2].
We do not use any tool to fragment the memory in advance. Collecting
memory metrics on this system, we noticed that ~40% of memory is used by
PageTables, since PostgreSQL spawns a new process for each client,
significantly limiting the available caching and free memory.
PostgreSQL's write pattern consists mostly of 8/16 KB writes, but during
database checkpoints (by default every 5 minutes) it flushes write-ahead
logs to disk, which uses large folios. At that point, the system attempts to
satisfy the large folio allocation request, triggering the regression and falling
into the slow path, as shown by the Linux perf profile below:
`-0.26%-__arm64_sys_pwrite64
`-0.26%-vfs_write
`-0.26%-xfs_file_write_iter
`-0.26%-xfs_file_buffered_write
`-0.26%-iomap_file_buffered_write
`-0.26%-iomap_write_iter
`-0.22%-iomap_write_begin
`-0.22%-iomap_get_folio
`-0.22%-__filemap_get_folio
`-0.21%-filemap_alloc_folio->alloc_pages
`-0.20%-__alloc_pages_slowpath
|-0.12%-__alloc_pages_direct_compact
| `-0.12%-try_to_compact_pages
| `-0.11%-compact_zone
| `-0.11%-isolate_migratepages
`-0.07%-__drain_all_pages
`-0.07%-drain_pages_zone
`-0.07%-free_pcppages_bulk
This is also visible in the intermediate PGBench results, which drop
significantly during checkpoint execution:
# Normal execution:
[20260421.141505] [INFO] progress: 660.0 s, 138828.2 tps, lat 7.509 ms stddev 16.985, 0 failed
[20260421.141515] [INFO] progress: 670.0 s, 151505.1 tps, lat 6.708 ms stddev 8.308, 0 failed
[20260421.141525] [INFO] progress: 680.0 s, 166558.7 tps, lat 6.190 ms stddev 6.537, 0 failed
[20260421.141535] [INFO] progress: 690.0 s, 141267.1 tps, lat 7.246 ms stddev 5.951, 0 failed
# During checkpoints:
[20260421.141605] [INFO] progress: 720.0 s, 54119.8 tps, lat 18.894 ms stddev 81.816, 0 failed
[20260421.141615] [INFO] progress: 730.0 s, 55184.7 tps, lat 18.564 ms stddev 12.729, 0 failed
[20260421.141625] [INFO] progress: 740.0 s, 37334.0 tps, lat 27.302 ms stddev 25.060, 0 failed
[20260421.141635] [INFO] progress: 750.0 s, 53387.6 tps, lat 19.259 ms stddev 18.313, 0 failed
[20260421.141645] [INFO] progress: 760.0 s, 41247.3 tps, lat 24.805 ms stddev 24.116, 0 failed
On 4/21/26 00:43, Ritesh Harjani wrote:
> BTW - I was following the other thread too where PREEMPT_LAZY problem
> was getting discussed. And from what I understood, you mentioned [1]
> enabling THP on the system made that problem go away. Also it looks like
> enabling THP is the right thing to do for this kind of workload. Does
> that also mean enabling THP fixed this problem too? Do you still hit
> memory fragmentation and/or similar throughput drop w/o this fix after
> you enable THP? It will be good to know those details too please.
We have run more benchmarks (as baseline) with the PostgreSQL huge_pages
options (on, off), pre-allocating the shared buffer memory with "vm.nr_hugepages"
(~25% of total memory, 2MB page size), and the Transparent Huge Pages (THP)
options (always, madvise, never). PostgreSQL performance improves only when
the PostgreSQL huge_pages option is enabled with pre-allocated memory.
THP has no significant effect on PostgreSQL or system performance in this
case.
| PG huge_pages + pre-alloc mem | THP | Run 1 | Run 2 | Run 3 | Average |
|-------------------------------|---------|--------:|--------:|--------:|--------:|
| on | never | 189,418 | 187,764 | 188,207 | 188,463 |
| on | always | 188,813 | 189,798 | 190,032 | 189,548 |
| on | madvise | 187,405 | 192,234 | 189,201 | 189,613 |
| off | never | 102,609 | 109,394 | 100,868 | 104,290 |
| off | always | 90,274 | 103,831 | 102,515 | 98,874 |
| off | madvise | 90,508 | 103,855 | 96,574 | 96,979 |
[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368
[2] https://github.com/aws/repro-collection.git
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-04-28 15:02 ` Salvatore Dipietro
@ 2026-05-03 5:52 ` Ritesh Harjani
2026-05-03 11:55 ` Matthew Wilcox
0 siblings, 1 reply; 17+ messages in thread
From: Ritesh Harjani @ 2026-05-03 5:52 UTC (permalink / raw)
To: Salvatore Dipietro, Matthew Wilcox, Andrew Morton, linux-mm,
Vlastimil Babka
Cc: abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore,
dipiets, djwong, linux-fsdevel, linux-kernel, linux-xfs, stable,
willy
Sorry about the delayed response, got caught up in some other work.
Salvatore Dipietro <dipiets@amazon.it> writes:
> On 4/21/26 00:43, Ritesh Harjani wrote:
>> Also, given that the maintainers (willy, Christoph, Dave) have shown their
>> disinterest in taking the patch in its current form, the right way is
>> to get back with performance data for both the approaches (which we
>> were discussing) and first get consensus from everyone before
>> proposing this as a patch :).
>
> Thank you for the follow-up and the additional context, Ritesh.
> I might have misunderstood the previous request and will make sure to
> link back to previous patch versions in the future.
> Here are the performance results that we have collected on our end with
> the proposed patches:
>
>
> | Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline |
> |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
> | Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — |
> | PG huge_pages + pre-alloc mem | THP | Run 1 | Run 2 | Run 3 | Average |
> |-------------------------------|---------|--------:|--------:|--------:|--------:|
> | on | never | 189,418 | 187,764 | 188,207 | 188,463 |
>
First of all thanks for sharing the detailed performance numbers.
Ok, so here is what I understood from the data you shared.
This performance problem is mostly seen with PostgreSQL huge_pages=off
[1][2] i.e.
baseline-no-patches ~104K
v/s
baseline-no-patches+huge_pages=on ~188K
The other observation with huge_pages=off is that we have 40% of memory used
as page table memory (as you pointed out below).
> We do not use any tool to fragment the memory in advance. Collecting
> memory metric of this system, we noticed that ~40% of memory is used by
> PageTables since PostgreSQL spawns a new process for each client limiting
> significantly the available caching and free memory.
>
So there must be 2 things going on with the huge_pages=on option here:
1. Huge pages use PMD-size mappings, which eliminates the need for PTE
tables entirely. This reduces the amount of memory consumed by page
tables. Without huge pages, the page table overhead becomes significant
(~40% of DRAM), because on fork each child process gets its own copy
of the PTE tables (even though the underlying shared memory pages
remain the same).
2. The second saving might come from the fact that Linux supports
CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING with hugetlb. With this, the
PMD table pages themselves are shared among processes.
[1]:
https://www.postgresql.org/docs/current/runtime-config-resource.html
[2]:
https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-HUGE-PAGES
So the above explain the 40% of memory used up in Page tables.
Now, this is what I believe could be the reason for memory fragmentation
with this workload -
In Linux, each PTE page table is 4KB in size (assuming you are using a 4KB
system PAGE_SIZE). When your workload forks a
child process for each new connection, the child gets its own copy of the
page tables which map the shared buffer.
Since each PTE table is a single 4KB page, hundreds of connections
spawning means hundreds of thousands of single-page allocations for page
tables. So it looks like the major source of your memory fragmentation
problem must be these many order-0 allocations for PTE page table
pages.
Also, as per the documentation [1], huge_pages=try is the default
setting. So I am assuming that in production we at least won't suffer from
this memory fragmentation, correct?
[1]: https://www.postgresql.org/docs/current/runtime-config-resource.html
> PostgreSQL write pattern consists mostly of 8/16 KB data but during
> the database checkpoints, by default every 5 minutes, it flushes write-ahead
> logs to disk, which uses large folios. At this point, the system attempts to
> satisfy the folio allocation request, triggering the regression and falling
> into the slow path, as shown by the Linux perf profile below:
>
> `-0.26%-__arm64_sys_pwrite64
> `-0.26%-vfs_write
> `-0.26%-xfs_file_write_iter
> `-0.26%-xfs_file_buffered_write
> `-0.26%-iomap_file_buffered_write
> `-0.26%-iomap_write_iter
> `-0.22%-iomap_write_begin
> `-0.22%-iomap_get_folio
> `-0.22%-__filemap_get_folio
> `-0.21%-filemap_alloc_folio->alloc_pages
> `-0.20%-__alloc_pages_slowpath
> |-0.12%-__alloc_pages_direct_compact
> | `-0.12%-try_to_compact_pages
> | `-0.11%-compact_zone
> | `-0.11%-isolate_migratepages
> `-0.07%-__drain_all_pages
> `-0.07%-drain_pages_zone
> `-0.07%-free_pcppages_bulk
>
However, I agree that it still makes sense to look into a possible solution to
address the performance gap you pointed out when the system has
memory fragmentation (with huge_pages=off).
> | Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline |
> |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
> | Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — |
> | Proposed patch | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45% |
> | Ritesh's suggestion | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50% |
> | Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07% |
The main reason I proposed the patch below is that it only
affects costly-order allocations (i.e. order > PAGE_ALLOC_COSTLY_ORDER)
by skipping direct reclaim for those orders, while keeping the
behaviour the same for others.
So, for smaller high orders (order > min_order and <=
PAGE_ALLOC_COSTLY_ORDER), the allocator will still attempt direct
reclaim and compaction (which I guess is required to avoid OOM too?). It
also looks like a change which could be easily backported :)
diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..f2343c26dd63 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
gfp_t alloc_gfp = gfp;
err = -ENOMEM;
- if (order > min_order)
- alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
+ if (order > min_order) {
+ alloc_gfp |= __GFP_NOWARN;
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
+ else
+ alloc_gfp |= __GFP_NORETRY;
+ }
But of course let's hear from others on their suggestions / thoughts.
Maybe filemap is not the right place to fix this, as Matthew, Andrew
and others were pointing out. Any other suggestions on how to approach this,
please?
-ritesh
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
2026-05-03 5:52 ` Ritesh Harjani
@ 2026-05-03 11:55 ` Matthew Wilcox
0 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2026-05-03 11:55 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Salvatore Dipietro, Andrew Morton, linux-mm, Vlastimil Babka,
abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore, djwong,
linux-fsdevel, linux-kernel, linux-xfs, stable
On Sun, May 03, 2026 at 11:22:10AM +0530, Ritesh Harjani wrote:
> Now this is what I believe could be the reason for memory fragmentation
> with this workload -
> In Linux, each PTE page table uses 4KB size (assuming you are using 4KB
> system PAGE_SIZE). When your workload forks a
> child process for each new connection, child gets its own copy of the
> page tables which maps the shared buffer.
> Since each PTE table is a single 4KB page, hundreds of connections
> spawning means hundreds of thousands of single-page allocations for page
> tables. So it looks like, the major source of your memory fragmentation
> problem must be these several order-0 allocations for PTE page table
> pages.
While memory is fragmented, the _problem_ is that we try too hard to
defragment. From the original post:
: When memory is fragmented, each failed allocation triggers
: compaction and drain_all_pages() via __alloc_pages_slowpath()
We really should only try compaction once. If it didn't make useful
progress last time, it won't this time either.
> > | Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline |
> > |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:|
> > | Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — |
> > | Proposed patch | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45% |
> > | Ritesh's suggestion | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50% |
> > | Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07% |
>
>
> The main reason, why I proposed the below patch was because, this only
> affects costly order allocation (i.e for order > PAGE_ALLOC_COSTLY_ORDER)
> by skipping direct reclaim for those orders, while still keeping the
> behaviour same for others.
>
> So, for smaller orders (order > min_order and <=
> PAGE_ALLOC_COSTLY_ORDER), the allocator will still attempt for direct
> reclaim and compaction (which I guess is required to avoid oom too?) And
> also, this looks like a change which could be easily backportable :)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4e636647100c..f2343c26dd63 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
> gfp_t alloc_gfp = gfp;
>
> err = -ENOMEM;
> - if (order > min_order)
> - alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> + if (order > min_order) {
> + alloc_gfp |= __GFP_NOWARN;
> + if (order > PAGE_ALLOC_COSTLY_ORDER)
> + alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> + else
> + alloc_gfp |= __GFP_NORETRY;
> + }
>
>
> But of course let's hear from others on their suggestions / thoughts.
> Maybe filemap is not the right place to fix this, as Matthew, Andrew
> and others were pointing out. Any other suggestions on how to approach this,
> please?
filemap.c REALLY shouldn't know about PAGE_ALLOC_COSTLY_ORDER.
That's an internal detail of the memory allocator.
Either we want an API to say "allocate me a folio between orders A and B"
or we need more understandable GFP flags. Or the page allocator could
use the __GFP_NORETRY flag to say "oh well, this allocation has a fallback,
I'll kick kcompactd to try to compact some more memory, but I'll fail
the allocation".
Thread overview: 17+ messages
[not found] <20260403193535.9970-1-dipiets@amazon.it>
2026-04-03 19:35 ` [PATCH 1/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
2026-04-04 1:13 ` Ritesh Harjani
2026-04-04 4:15 ` Matthew Wilcox
2026-04-04 16:47 ` Ritesh Harjani
2026-04-04 20:46 ` Matthew Wilcox
2026-04-16 15:14 ` Ritesh Harjani
2026-04-20 16:33 ` Salvatore Dipietro
2026-04-20 18:44 ` Matthew Wilcox
2026-04-21 1:16 ` Ritesh Harjani
2026-04-28 15:02 ` Salvatore Dipietro
2026-05-03 5:52 ` Ritesh Harjani
2026-05-03 11:55 ` Matthew Wilcox
2026-04-05 22:43 ` Dave Chinner
2026-04-07 5:40 ` Christoph Hellwig
2026-04-21 9:02 ` Vlastimil Babka
[not found] <20260403193201.30479-1-dipiets@amazon.it>
2026-04-03 19:32 ` Salvatore Dipietro
2026-04-04 6:25 ` Greg KH