public inbox for linux-kernel@vger.kernel.org
* [PATCH v2] mm/filemap: avoid costly reclaim for high-order folio allocations
@ 2026-04-20 16:14 Salvatore Dipietro
  2026-04-20 16:51 ` Andrew Morton
  2026-04-20 19:12 ` Matthew Wilcox
  0 siblings, 2 replies; 4+ messages in thread
From: Salvatore Dipietro @ 2026-04-20 16:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: ritesh.list, abuehaze, alisaidi, blakgeof, brauner,
	dipietro.salvatore, dipiets, djwong, linux-fsdevel, linux-mm,
	linux-xfs, stable, willy, Jan Kara, Andrew Morton

Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
introduced high-order folio allocations in the buffered write path.
When memory is fragmented, each failed allocation above
PAGE_ALLOC_COSTLY_ORDER triggers compaction and drain_all_pages() via
__alloc_pages_slowpath(), dropping pgbench (simple-update) throughput
to 0.75x with 1024 clients on a 96-vCPU arm64 system.

In __filemap_get_folio(), for orders above min_order, split the
allocation behavior by cost:

 - For orders above PAGE_ALLOC_COSTLY_ORDER: strip
   __GFP_DIRECT_RECLAIM, making them purely opportunistic. The
   allocator tries the freelists only and returns NULL immediately if
   pages are not available.

 - For non-costly orders (between min_order and
   PAGE_ALLOC_COSTLY_ORDER): use __GFP_NORETRY to allow lightweight
   direct reclaim without expensive compaction retries.

With this patch, pgbench throughput recovers to 148k TPS (+67% vs
regressed baseline), stable across all iterations.

v2:
- Strip __GFP_DIRECT_RECLAIM to avoid costly reclaim for high-order
  folio allocations
- Move the fix from iomap to the mm/filemap layer

Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
---
 mm/filemap.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..f2343c26dd63 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
 			gfp_t alloc_gfp = gfp;
 
 			err = -ENOMEM;
-			if (order > min_order)
-				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
+			if (order > min_order) {
+				alloc_gfp |= __GFP_NOWARN;
+				if (order > PAGE_ALLOC_COSTLY_ORDER)
+					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
+				else
+					alloc_gfp |= __GFP_NORETRY;
+			}
 			folio = filemap_alloc_folio(alloc_gfp, order, policy);
 			if (!folio)
 				continue;

base-commit: c7275b05bc428c7373d97aa2da02d3a7fa6b9f66
-- 
2.47.3

AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Società con Socio Unico

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] mm/filemap: avoid costly reclaim for high-order folio allocations
  2026-04-20 16:14 [PATCH v2] mm/filemap: avoid costly reclaim for high-order folio allocations Salvatore Dipietro
@ 2026-04-20 16:51 ` Andrew Morton
  2026-04-20 18:41   ` Matthew Wilcox
  2026-04-20 19:12 ` Matthew Wilcox
  1 sibling, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2026-04-20 16:51 UTC (permalink / raw)
  To: Salvatore Dipietro
  Cc: linux-kernel, ritesh.list, abuehaze, alisaidi, blakgeof, brauner,
	dipietro.salvatore, djwong, linux-fsdevel, linux-mm, linux-xfs,
	stable, willy, Jan Kara

On Mon, 20 Apr 2026 16:14:03 +0000 Salvatore Dipietro <dipiets@amazon.it> wrote:

> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write path.
> When memory is fragmented, each failed allocation above
> PAGE_ALLOC_COSTLY_ORDER triggers compaction and drain_all_pages() via
> __alloc_pages_slowpath(), dropping pgbench (simple-update) throughput
> to 0.75x with 1024 clients on a 96-vCPU arm64 system.
> 
> In __filemap_get_folio(), for orders above min_order, split the
> allocation behavior by cost:
> 
>  - For orders above PAGE_ALLOC_COSTLY_ORDER: strip
>    __GFP_DIRECT_RECLAIM, making them purely opportunistic. The
>    allocator tries the freelists only and returns NULL immediately if
>    pages are not available.
> 
>  - For non-costly orders (between min_order and
>    PAGE_ALLOC_COSTLY_ORDER): use __GFP_NORETRY to allow lightweight
>    direct reclaim without expensive compaction retries.
> 
> With this patch, pgbench throughput recovers to 148k TPS (+67% vs
> regressed baseline), stable across all iterations.

"Good money after bad"?  Prove me wrong!

Instead of performing weird fragile hard-to-maintain party tricks with
the page allocator to work around the damage, plan B is to simply
revert 5d8edfb900d5.

5d8edfb900d5 came with no performance testing results.  Does anyone
have any evidence that it improved anything?  By how much?

> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
>  			gfp_t alloc_gfp = gfp;
>  
>  			err = -ENOMEM;
> -			if (order > min_order)
> -				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> +			if (order > min_order) {
> +				alloc_gfp |= __GFP_NOWARN;
> +				if (order > PAGE_ALLOC_COSTLY_ORDER)
> +					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> +				else
> +					alloc_gfp |= __GFP_NORETRY;
> +			}
>  			folio = filemap_alloc_folio(alloc_gfp, order, policy);

I don't think it's reasonable to expect a reader to understand why this
code is as it is.  Hence each clause here should have a comment
explaining why we're taking that step, please.


Look.  I'm being grumpy.  We know that patches which purportedly
improve performance must come with quality performance testing results.
How long have we been at this?

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] mm/filemap: avoid costly reclaim for high-order folio allocations
  2026-04-20 16:51 ` Andrew Morton
@ 2026-04-20 18:41   ` Matthew Wilcox
  0 siblings, 0 replies; 4+ messages in thread
From: Matthew Wilcox @ 2026-04-20 18:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Salvatore Dipietro, linux-kernel, ritesh.list, abuehaze, alisaidi,
	blakgeof, brauner, dipietro.salvatore, djwong, linux-fsdevel,
	linux-mm, linux-xfs, stable, Jan Kara

On Mon, Apr 20, 2026 at 09:51:06AM -0700, Andrew Morton wrote:
> On Mon, 20 Apr 2026 16:14:03 +0000 Salvatore Dipietro <dipiets@amazon.it> wrote:
> > Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> > introduced high-order folio allocations in the buffered write path.

No it didn't.  5d8edfb900d5 only allows the use of larger folios if they
already existed in the page cache.  d6bb59a9444d allows the creation of
large folios.

> > When memory is fragmented, each failed allocation above
> > PAGE_ALLOC_COSTLY_ORDER triggers compaction and drain_all_pages() via
> __alloc_pages_slowpath(), dropping pgbench (simple-update) throughput
> to 0.75x with 1024 clients on a 96-vCPU arm64 system.

Why are you pretending this is new instead of already being the source
of much recent discussion?

https://lore.kernel.org/all/20260403193535.9970-1-dipiets@amazon.it/

> "Good money after bad"?  Prove me wrong!
> 
> Instead of performing weird fragile hard-to-maintain party tricks with
> the page allocator to work around the damage, plan B is to simply
> revert 5d8edfb900d5.

lol.  best of luck with that.  you'd break a lot of other things if you
did.

> 5d8edfb900d5 came with no performance testing results.  Does anyone
> have any evidence that it improved anything?  By how much?

Christoph reported it doubled write performance with NFS once NFS was
converted to use it.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] mm/filemap: avoid costly reclaim for high-order folio allocations
  2026-04-20 16:14 [PATCH v2] mm/filemap: avoid costly reclaim for high-order folio allocations Salvatore Dipietro
  2026-04-20 16:51 ` Andrew Morton
@ 2026-04-20 19:12 ` Matthew Wilcox
  1 sibling, 0 replies; 4+ messages in thread
From: Matthew Wilcox @ 2026-04-20 19:12 UTC (permalink / raw)
  To: Salvatore Dipietro
  Cc: linux-kernel, ritesh.list, abuehaze, alisaidi, blakgeof, brauner,
	dipietro.salvatore, djwong, linux-fsdevel, linux-mm, linux-xfs,
	stable, Jan Kara, Andrew Morton

On Mon, Apr 20, 2026 at 04:14:03PM +0000, Salvatore Dipietro wrote:
> v2:
> - Strip __GFP_DIRECT_RECLAIM to avoid costly reclaim for high-order
>   folio allocations
> - Move the fix from iomap to the mm/filemap layer

I don't think filemap is the right place for this.  And neither does
Dave Chinner, nor Christoph Hellwig:

https://lore.kernel.org/all/adSY3GnLHyQatigQ@infradead.org/

I asked you for performance results with different patches, and you
didn't reply.  Now you're asking for this patch to be merged instead.
THIS IS NOT HOW IT WORKS.  You answer the damned questions being asked
of you by your fellow developers.

>  			err = -ENOMEM;
> -			if (order > min_order)
> -				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> +			if (order > min_order) {
> +				alloc_gfp |= __GFP_NOWARN;
> +				if (order > PAGE_ALLOC_COSTLY_ORDER)
> +					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> +				else
> +					alloc_gfp |= __GFP_NORETRY;
> +			}
>  			folio = filemap_alloc_folio(alloc_gfp, order, policy);
>  			if (!folio)
>  				continue;
> 
> base-commit: c7275b05bc428c7373d97aa2da02d3a7fa6b9f66
> -- 
> 2.47.3

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-04-20 19:12 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-20 16:14 [PATCH v2] mm/filemap: avoid costly reclaim for high-order folio allocations Salvatore Dipietro
2026-04-20 16:51 ` Andrew Morton
2026-04-20 18:41   ` Matthew Wilcox
2026-04-20 19:12 ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox