* [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
       [not found] <20260403193535.9970-1-dipiets@amazon.it>
@ 2026-04-03 19:35 ` Salvatore Dipietro
  2026-04-04  1:13 ` Ritesh Harjani
  ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Salvatore Dipietro @ 2026-04-03 19:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
	stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel

Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
introduced high-order folio allocations in the buffered write
path. When memory is fragmented, each failed allocation triggers
compaction and drain_all_pages() via __alloc_pages_slowpath(),
causing a 0.75x throughput drop on pgbench (simple-update) with
1024 clients on a 96-vCPU arm64 system.

Strip __GFP_DIRECT_RECLAIM from folio allocations in
iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
making them purely opportunistic.

Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
---
 fs/iomap/buffered-io.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 92a831cf4bf1..cb843d54b4d9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
+	gfp_t gfp;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
@@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 		fgp |= FGP_DONTCACHE;
 	fgp |= fgf_set_order(len);
 
+	gfp = mapping_gfp_mask(iter->inode->i_mapping);
+
+	/*
+	 * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
+	 * strip __GFP_DIRECT_RECLAIM to make the allocation purely
+	 * opportunistic. This avoids compaction + drain_all_pages()
+	 * in __alloc_pages_slowpath() that devastate throughput
+	 * on large systems during buffered writes.
+	 */
+	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
+		gfp &= ~__GFP_DIRECT_RECLAIM;
+
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
-			fgp, mapping_gfp_mask(iter->inode->i_mapping));
+			fgp, gfp);
 }
 EXPORT_SYMBOL_GPL(iomap_get_folio);
-- 
2.50.1 (Apple Git-155)

AMAZON DEVELOPMENT CENTER ITALY SRL, viale Monte Grappa 3/5, 20124 Milano, Italia, Registro delle Imprese di Milano Monza Brianza Lodi REA n. 2504859, Capitale Sociale: 10.000 EUR i.v., Cod. Fisc. e P.IVA 10100050961, Societa con Socio Unico

^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-03 19:35 ` [PATCH 1/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
@ 2026-04-04  1:13 ` Ritesh Harjani
  2026-04-04  4:15 ` Matthew Wilcox
  2026-04-05 22:43 ` Dave Chinner
  2 siblings, 0 replies; 17+ messages in thread
From: Ritesh Harjani @ 2026-04-04 1:13 UTC (permalink / raw)
  To: Salvatore Dipietro, linux-kernel
  Cc: dipiets, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy,
	stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel,
	linux-mm

Let's cc: linux-mm too.

Salvatore Dipietro <dipiets@amazon.it> writes:

> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> introduced high-order folio allocations in the buffered write
> path. When memory is fragmented, each failed allocation triggers

Isn't running compaction the right thing to do when memory is
fragmented?

> compaction and drain_all_pages() via __alloc_pages_slowpath(),
> causing a 0.75x throughput drop on pgbench (simple-update) with
> 1024 clients on a 96-vCPU arm64 system.
>

I think removing the __GFP_DIRECT_RECLAIM flag unconditionally at the
caller may cause -ENOMEM. Note that it is __filemap_get_folio() that
retries with smaller order allocations, so instead of changing the
callers, shouldn't this be fixed in __filemap_get_folio() instead?

Maybe in there too, we should keep the reclaim flag (if passed by the
caller) at least for <= PAGE_ALLOC_COSTLY_ORDER + 1. Thoughts?

-ritesh

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-03 19:35 ` [PATCH 1/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro 2026-04-04 1:13 ` Ritesh Harjani @ 2026-04-04 4:15 ` Matthew Wilcox 2026-04-04 16:47 ` Ritesh Harjani 2026-04-05 22:43 ` Dave Chinner 2 siblings, 1 reply; 17+ messages in thread From: Matthew Wilcox @ 2026-04-04 4:15 UTC (permalink / raw) To: Salvatore Dipietro Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-mm On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote: > Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") > introduced high-order folio allocations in the buffered write > path. When memory is fragmented, each failed allocation triggers > compaction and drain_all_pages() via __alloc_pages_slowpath(), > causing a 0.75x throughput drop on pgbench (simple-update) with > 1024 clients on a 96-vCPU arm64 system. > > Strip __GFP_DIRECT_RECLAIM from folio allocations in > iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER, > making them purely opportunistic. If you look at __filemap_get_folio_mpol(), that's kind of being tried already: if (order > min_order) alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; * %__GFP_NORETRY: The VM implementation will try only very lightweight * memory direct reclaim to get some memory under memory pressure (thus * it can sleep). It will avoid disruptive actions like OOM killer. The * caller must handle the failure which is quite likely to happen under * heavy memory pressure. The flag is suitable when failure can easily be * handled at small cost, such as reduced throughput. which, from the description, seemed like the right approach. So either the description or the implementation should be updated, I suppose? 
Now, what happens if you change those two lines to:

	if (order > min_order) {
		alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
		alloc_gfp |= __GFP_NOWARN;
	}

Do you recover the performance?

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-04 4:15 ` Matthew Wilcox @ 2026-04-04 16:47 ` Ritesh Harjani 2026-04-04 20:46 ` Matthew Wilcox 2026-04-16 15:14 ` Ritesh Harjani 0 siblings, 2 replies; 17+ messages in thread From: Ritesh Harjani @ 2026-04-04 16:47 UTC (permalink / raw) To: Matthew Wilcox, Salvatore Dipietro Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-mm Matthew Wilcox <willy@infradead.org> writes: > On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote: >> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") >> introduced high-order folio allocations in the buffered write >> path. When memory is fragmented, each failed allocation triggers >> compaction and drain_all_pages() via __alloc_pages_slowpath(), >> causing a 0.75x throughput drop on pgbench (simple-update) with >> 1024 clients on a 96-vCPU arm64 system. >> >> Strip __GFP_DIRECT_RECLAIM from folio allocations in >> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER, >> making them purely opportunistic. > > If you look at __filemap_get_folio_mpol(), that's kind of being tried > already: > > if (order > min_order) > alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; > > * %__GFP_NORETRY: The VM implementation will try only very lightweight > * memory direct reclaim to get some memory under memory pressure (thus > * it can sleep). It will avoid disruptive actions like OOM killer. The > * caller must handle the failure which is quite likely to happen under > * heavy memory pressure. The flag is suitable when failure can easily be > * handled at small cost, such as reduced throughput. > > which, from the description, seemed like the right approach. So either > the description or the implementation should be updated, I suppose? 
> Now, what happens if you change those two lines to:
>
> 	if (order > min_order) {
> 		alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> 		alloc_gfp |= __GFP_NOWARN;
> 	}

Hi Matthew,

Shouldn't we try this instead? This would still allow us to keep
__GFP_NORETRY and hence lightweight direct reclaim/compaction for at
least the non-costly order allocations, right?

	if (order > min_order) {
		alloc_gfp |= __GFP_NOWARN;
		if (order > PAGE_ALLOC_COSTLY_ORDER)
			alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
		else
			alloc_gfp |= __GFP_NORETRY;
	}

-ritesh

> Do you recover the performance?

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-04 16:47 ` Ritesh Harjani @ 2026-04-04 20:46 ` Matthew Wilcox 2026-04-16 15:14 ` Ritesh Harjani 1 sibling, 0 replies; 17+ messages in thread From: Matthew Wilcox @ 2026-04-04 20:46 UTC (permalink / raw) To: Ritesh Harjani Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-mm On Sat, Apr 04, 2026 at 10:17:33PM +0530, Ritesh Harjani wrote: > Matthew Wilcox <willy@infradead.org> writes: > > > On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote: > >> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") > >> introduced high-order folio allocations in the buffered write > >> path. When memory is fragmented, each failed allocation triggers > >> compaction and drain_all_pages() via __alloc_pages_slowpath(), > >> causing a 0.75x throughput drop on pgbench (simple-update) with > >> 1024 clients on a 96-vCPU arm64 system. > >> > >> Strip __GFP_DIRECT_RECLAIM from folio allocations in > >> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER, > >> making them purely opportunistic. > > > > If you look at __filemap_get_folio_mpol(), that's kind of being tried > > already: > > > > if (order > min_order) > > alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; > > > > * %__GFP_NORETRY: The VM implementation will try only very lightweight > > * memory direct reclaim to get some memory under memory pressure (thus > > * it can sleep). It will avoid disruptive actions like OOM killer. The > > * caller must handle the failure which is quite likely to happen under > > * heavy memory pressure. The flag is suitable when failure can easily be > > * handled at small cost, such as reduced throughput. > > > > which, from the description, seemed like the right approach. So either > > the description or the implementation should be updated, I suppose? 
> > > > Now, what happens if you change those two lines to: > > > > if (order > min_order) { > > alloc_gfp &= ~__GFP_DIRECT_RECLAIM; > > alloc_gfp |= __GFP_NOWARN; > > } > > Hi Matthew, > > Shouldn't we try this instead? This would still allows us to keep > __GFP_NORETRY and hence light weight direct reclaim/compaction for > atleast the non-costly order allocations, right? > > if (order > min_order) { > alloc_gfp |= __GFP_NOWARN; > if (order > PAGE_ALLOC_COSTLY_ORDER) > alloc_gfp &= ~__GFP_DIRECT_RECLAIM; > else > alloc_gfp |= __GFP_NORETRY; > } Uhh ... maybe? I'd want someone more familiar with the page allocator than I am to say whether that's the right approach. If it is, that seems too complex, and maybe we need a better approach to the page allocator flags. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-04 16:47 ` Ritesh Harjani 2026-04-04 20:46 ` Matthew Wilcox @ 2026-04-16 15:14 ` Ritesh Harjani 2026-04-20 16:33 ` Salvatore Dipietro 1 sibling, 1 reply; 17+ messages in thread From: Ritesh Harjani @ 2026-04-16 15:14 UTC (permalink / raw) To: Matthew Wilcox, Salvatore Dipietro Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel, linux-mm Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes: > Matthew Wilcox <willy@infradead.org> writes: > >> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote: >>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") >>> introduced high-order folio allocations in the buffered write >>> path. When memory is fragmented, each failed allocation triggers >>> compaction and drain_all_pages() via __alloc_pages_slowpath(), >>> causing a 0.75x throughput drop on pgbench (simple-update) with >>> 1024 clients on a 96-vCPU arm64 system. >>> >>> Strip __GFP_DIRECT_RECLAIM from folio allocations in >>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER, >>> making them purely opportunistic. >> >> If you look at __filemap_get_folio_mpol(), that's kind of being tried >> already: >> >> if (order > min_order) >> alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; >> >> * %__GFP_NORETRY: The VM implementation will try only very lightweight >> * memory direct reclaim to get some memory under memory pressure (thus >> * it can sleep). It will avoid disruptive actions like OOM killer. The >> * caller must handle the failure which is quite likely to happen under >> * heavy memory pressure. The flag is suitable when failure can easily be >> * handled at small cost, such as reduced throughput. >> >> which, from the description, seemed like the right approach. So either >> the description or the implementation should be updated, I suppose? 
>> Now, what happens if you change those two lines to:
>>
>> 	if (order > min_order) {
>> 		alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
>> 		alloc_gfp |= __GFP_NOWARN;
>> 	}
>
> Hi Matthew,
>
> Shouldn't we try this instead? This would still allows us to keep
> __GFP_NORETRY and hence light weight direct reclaim/compaction for
> atleast the non-costly order allocations, right?
>
> 	if (order > min_order) {
> 		alloc_gfp |= __GFP_NOWARN;
> 		if (order > PAGE_ALLOC_COSTLY_ORDER)
> 			alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
> 		else
> 			alloc_gfp |= __GFP_NORETRY;
> 	}
>

Hi Salvatore,

Did you get a chance to test the above two options (shared by Matthew
and me)? And were you able to recover the performance with those?

So, in the longer run, as Dave suggested, we might need to fix this,
maybe by removing compaction from the direct reclaim path. But I guess
for fixing it in older kernel releases we might need a quick fix, so it
may be worth trying the above suggested changes.

Also, I am somehow not able to hit this problem at my end (even after
creating a bit of memory fragmentation). So please also feel free to
share the steps, if you have a setup to re-create it easily.

-ritesh

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-16 15:14 ` Ritesh Harjani
@ 2026-04-20 16:33 ` Salvatore Dipietro
  2026-04-20 18:44 ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: Salvatore Dipietro @ 2026-04-20 16:33 UTC (permalink / raw)
  To: ritesh.list
  Cc: abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore, dipiets,
	djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs, stable, willy

I have submitted a v2 of the patch based on Ritesh's suggestion.
https://lore.kernel.org/linux-mm/20260420161404.642-1-dipiets@amazon.it/T/#u

Salvatore

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-20 16:33 ` Salvatore Dipietro @ 2026-04-20 18:44 ` Matthew Wilcox 2026-04-21 1:16 ` Ritesh Harjani 0 siblings, 1 reply; 17+ messages in thread From: Matthew Wilcox @ 2026-04-20 18:44 UTC (permalink / raw) To: Salvatore Dipietro Cc: ritesh.list, abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs, stable On Mon, Apr 20, 2026 at 04:33:28PM +0000, Salvatore Dipietro wrote: > I have submitted a v2 of the patch based on Ritesh's suggestion. > https://lore.kernel.org/linux-mm/20260420161404.642-1-dipiets@amazon.it/T/#u ... but without linking back to this thread, so nobody who was exposed to that thread for the first time knows about this one. That's poor form. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-20 18:44 ` Matthew Wilcox
@ 2026-04-21  1:16 ` Ritesh Harjani
  2026-04-28 15:02 ` Salvatore Dipietro
  0 siblings, 1 reply; 17+ messages in thread
From: Ritesh Harjani @ 2026-04-21 1:16 UTC (permalink / raw)
  To: Salvatore Dipietro
  Cc: Matthew Wilcox, abuehaze, alisaidi, blakgeof, brauner,
	dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-mm,
	linux-xfs, stable

Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Apr 20, 2026 at 04:33:28PM +0000, Salvatore Dipietro wrote:
>> I have submitted a v2 of the patch based on Ritesh's suggestion.
>> https://lore.kernel.org/linux-mm/20260420161404.642-1-dipiets@amazon.it/T/#u
>
> ... but without linking back to this thread, so nobody who was exposed
> to that thread for the first time knows about this one. That's poor form.

Yup. Also, given that the maintainers (willy, Christoph, Dave) have
shown their disinterest in taking the patch in its current form, the
right way is to get back with performance data for both the approaches
(which we were discussing) and first get consensus from everyone,
before proposing this as a patch :).

Having said that, we do care if a genuine performance issue gets
reported. In that context, I wanted to understand your setup a bit from
a memory fragmentation perspective. Are you simulating memory
fragmentation and then benchmarking? Or does this problem hit when you
simply run the reproduction steps mentioned in your cover letter?

BTW - I was following the other thread too, where the PREEMPT_LAZY
problem was being discussed. And from what I understood, you mentioned
[1] that enabling THP on the system made that problem go away. It also
looks like enabling THP is the right thing to do for this kind of
workload. Does that also mean enabling THP fixed this problem too? Do
you still hit memory fragmentation and/or a similar throughput drop
w/o this fix after you enable THP?
It would be good to know those details too, please.

[1]: https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#md88ca4258766e897e432df85874d197db476c7d1

-ritesh

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-21 1:16 ` Ritesh Harjani @ 2026-04-28 15:02 ` Salvatore Dipietro 2026-05-03 5:52 ` Ritesh Harjani 0 siblings, 1 reply; 17+ messages in thread From: Salvatore Dipietro @ 2026-04-28 15:02 UTC (permalink / raw) To: ritesh.list Cc: abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore, dipiets, djwong, linux-fsdevel, linux-kernel, linux-mm, linux-xfs, stable, willy On 4/21/26 00:43, Ritesh Harjani wrote: > Also, given the Maintainers (willy, Christoph, Dave) shown their > dis-interest in taking the patch in it's current form, the right way is > to get back with performance data with both the approaches (which we > were discussing) and first get the consensus from everyone, before > proposing this as a patch :). Thank you for the follow-up and the additional context, Ritesh. I might have misunderstood the previous request and will make sure to link back to previous patch versions in the future. Here are the performance results that we have collected on our end with the proposed patches: | Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline | |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:| | Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — | | Proposed patch | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45% | | Ritesh's suggestion | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50% | | Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07% | On 4/21/26 00:43, Ritesh Harjani wrote: > In that context, I wanted to understand your setup a bit from > memory fragmentation perspective. Are you trying to simulate memory > fragmentation and then benchmarking? Or was this problem hitting when > you run simply run the reproduction steps mentioned in your cover > letter? All results were collected on fresh AWS instances as described in the cover letter. 
Patch [1] has been applied on all instances to avoid the other
regression. The instances were restarted to pick up the patched kernel
and ensure clean memory before installing and starting the PostgreSQL
benchmark via repro-collection [2]. We do not use any tool to fragment
the memory in advance. Collecting memory metrics on this system, we
noticed that ~40% of memory is used by PageTables, since PostgreSQL
spawns a new process for each client, significantly limiting the
available cache and free memory.

PostgreSQL's write pattern consists mostly of 8/16 KB writes, but
during database checkpoints (by default every 5 minutes) it flushes
write-ahead logs to disk, which uses large folios. At this point, the
system attempts to satisfy the folio allocation request, triggering the
regression and falling into the slow path, as shown by the Linux perf
profile below:

`-0.26%-__arm64_sys_pwrite64
  `-0.26%-vfs_write
    `-0.26%-xfs_file_write_iter
      `-0.26%-xfs_file_buffered_write
        `-0.26%-iomap_file_buffered_write
          `-0.26%-iomap_write_iter
            `-0.22%-iomap_write_begin
              `-0.22%-iomap_get_folio
                `-0.22%-__filemap_get_folio
                  `-0.21%-filemap_alloc_folio->alloc_pages
                    `-0.20%-__alloc_pages_slowpath
                      |-0.12%-__alloc_pages_direct_compact
                      | `-0.12%-try_to_compact_pages
                      |   `-0.11%-compact_zone
                      |     `-0.11%-isolate_migratepages
                      `-0.07%-__drain_all_pages
                        `-0.07%-drain_pages_zone
                          `-0.07%-free_pcppages_bulk

This is also visible in the intermediate pgbench results, which drop
significantly during checkpoint execution:

# Normal execution:
[20260421.141505] [INFO] progress: 660.0 s, 138828.2 tps, lat 7.509 ms stddev 16.985, 0 failed
[20260421.141515] [INFO] progress: 670.0 s, 151505.1 tps, lat 6.708 ms stddev 8.308, 0 failed
[20260421.141525] [INFO] progress: 680.0 s, 166558.7 tps, lat 6.190 ms stddev 6.537, 0 failed
[20260421.141535] [INFO] progress: 690.0 s, 141267.1 tps, lat 7.246 ms stddev 5.951, 0 failed

# During checkpoints:
[20260421.141605] [INFO] progress: 720.0 s, 54119.8 tps, lat 18.894 ms
stddev 81.816, 0 failed [20260421.141615] [INFO] progress: 730.0 s, 55184.7 tps, lat 18.564 ms stddev 12.729, 0 failed [20260421.141625] [INFO] progress: 740.0 s, 37334.0 tps, lat 27.302 ms stddev 25.060, 0 failed [20260421.141635] [INFO] progress: 750.0 s, 53387.6 tps, lat 19.259 ms stddev 18.313, 0 failed [20260421.141645] [INFO] progress: 760.0 s, 41247.3 tps, lat 24.805 ms stddev 24.116, 0 failed On 4/21/26 00:43, Ritesh Harjani wrote: > BTW - I was following the other thread too where PREEMPT_LAZY problem > was getting discussed. And from what I understood, you mentioned [1] > enabling THP on the system made that problem go away. Also it looks like > enabling THP is the right thing to do for this kind of workload. Does > that also mean enabling THP fixed this problem too? Do you still hit > memory fragmentation and/or similar throughput drop w/o this fix after > you enable THP? It will be good to know those details too please. We have run more benchmarks (as baseline) with PostgreSQL huge_pages options (on, off) pre-allocating the shared buffer memory with "vm.nr_hugepages" (~25% of total memory, 2MB size) and Transparent Huge Pages (THP) options (always, madvise, never). PostgreSQL performance improves only when PostgreSQL huge_pages option with pre-allocated memory is enabled. THP has no significant effect on PostgreSQL or system performance in this case. 
| PG huge_pages + pre-alloc mem | THP     |   Run 1 |   Run 2 |   Run 3 | Average |
|-------------------------------|---------|--------:|--------:|--------:|--------:|
| on                            | never   | 189,418 | 187,764 | 188,207 | 188,463 |
| on                            | always  | 188,813 | 189,798 | 190,032 | 189,548 |
| on                            | madvise | 187,405 | 192,234 | 189,201 | 189,613 |
| off                           | never   | 102,609 | 109,394 | 100,868 | 104,290 |
| off                           | always  |  90,274 | 103,831 | 102,515 |  98,874 |
| off                           | madvise |  90,508 | 103,855 |  96,574 |  96,979 |

[1] https://lore.kernel.org/all/20260403191942.21410-1-dipiets@amazon.it/T/#m8baeeaf48aa7ae5342c8c2db8f4e1c27e03c1368
[2] https://github.com/aws/repro-collection.git

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-28 15:02 ` Salvatore Dipietro @ 2026-05-03 5:52 ` Ritesh Harjani 2026-05-03 11:55 ` Matthew Wilcox 0 siblings, 1 reply; 17+ messages in thread From: Ritesh Harjani @ 2026-05-03 5:52 UTC (permalink / raw) To: Salvatore Dipietro, Matthew Wilcox, Andrew Morton, linux-mm, Vlastimil Babka Cc: abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore, dipiets, djwong, linux-fsdevel, linux-kernel, linux-xfs, stable, willy Sorry about the delayed response, got caught up in some other work. Salvatore Dipietro <dipiets@amazon.it> writes: > On 4/21/26 00:43, Ritesh Harjani wrote: >> Also, given the Maintainers (willy, Christoph, Dave) shown their >> dis-interest in taking the patch in it's current form, the right way is >> to get back with performance data with both the approaches (which we >> were discussing) and first get the consensus from everyone, before >> proposing this as a patch :). > > Thank you for the follow-up and the additional context, Ritesh. > I might have misunderstood the previous request and will make sure to > link back to previous patch versions in the future. > Here are the performance results that we have collected on our end with > the proposed patches: > > > | Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline | > |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:| > | Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — | > | PG huge_pages + pre-alloc mem | THP | Run 1 | Run 2 | Run 3 | Average | > |-------------------------------|---------|--------:|--------:|--------:|--------:| > | on | never | 189,418 | 187,764 | 188,207 | 188,463 | > First of all thanks for sharing the detailed performance numbers. Ok, so here is what I understood from the data you shared. This performance problem is mostly seen with PostgreSQL huge_pages=off [1][2] i.e. 
baseline-no-patches ~104K v/s baseline-no-patches+huge_pages=on ~188K

Also, the observation with huge_pages=off is that we have ~40% of
memory as page table memory (as you pointed out below):

> We do not use any tool to fragment the memory in advance. Collecting
> memory metric of this system, we noticed that ~40% of memory is used by
> PageTables since PostgreSQL spawns a new process for each client limiting
> significantly the available caching and free memory.
>

So there must be 2 things going on with the huge_pages=on option here:

1. Huge pages use PMD-size mappings, which eliminates the need for PTE
tables entirely. This reduces the amount of memory consumed by page
tables. W/O huge pages, the page table overhead becomes significant
(~40% of DRAM), because on fork each child process gets its own copy of
the PTE tables (even though the underlying shared memory pages remain
the same).

2. The second saving might come from the fact that Linux supports
CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING with hugetlb. With this, the PMD
table pages themselves are shared among processes.

[1]: https://www.postgresql.org/docs/current/runtime-config-resource.html
[2]: https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-HUGE-PAGES

So the above explains the ~40% of memory used up in page tables.

Now this is what I believe could be the reason for memory fragmentation
with this workload - in Linux, each PTE page table is a 4KB page
(assuming a 4KB system PAGE_SIZE). When your workload forks a child
process for each new connection, the child gets its own copy of the
page tables which map the shared buffer. Since each PTE table is a
single 4KB page, hundreds of connections mean hundreds of thousands of
single-page allocations for page tables. So it looks like the major
source of your memory fragmentation problem must be these order-0
allocations for PTE page table pages.

Also, as per the documentation [1], the huge_pages=try option is the
default setting.
So I am assuming in production we at least won't suffer from this
memory fragmentation, correct?

[1]: https://www.postgresql.org/docs/current/runtime-config-resource.html

> PostgreSQL write pattern consists mostly of 8/16 KB data but during
> the database checkpoints, by default every 5 minutes, it flushes write-ahead
> logs to disk, which uses large folios. At this point, the system attempts to
> satisfy the folio allocation request, triggering the regression and falling
> into the slow path, as shown by the Linux perf profile below:
>
> `-0.26%-__arm64_sys_pwrite64
> `-0.26%-vfs_write
> `-0.26%-xfs_file_write_iter
> `-0.26%-xfs_file_buffered_write
> `-0.26%-iomap_file_buffered_write
> `-0.26%-iomap_write_iter
> `-0.22%-iomap_write_begin
> `-0.22%-iomap_get_folio
> `-0.22%-__filemap_get_folio
> `-0.21%-filemap_alloc_folio->alloc_pages
> `-0.20%-__alloc_pages_slowpath
> |-0.12%-__alloc_pages_direct_compact
> | `-0.12%-try_to_compact_pages
> | `-0.11%-compact_zone
> | `-0.11%-isolate_migratepages
> `-0.07%-__drain_all_pages
> `-0.07%-drain_pages_zone
> `-0.07%-free_pcppages_bulk

However, I agree that it still makes sense to look into a possible
solution to address this performance gap, which you pointed out when
the system has memory fragmentation (with huge_pages=off).
> | Patch                | Run 1      | Run 2      | Run 3      | Average    | % vs Baseline |
> |----------------------|-----------:|-----------:|-----------:|-----------:|:-------------:|
> | Baseline             | 107,064.61 |  97,043.86 | 101,830.78 | 101,979.75 | —             |
> | Proposed patch       | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45%       |
> | Ritesh's suggestion  | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50%       |
> | Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07%       |

The main reason I proposed the patch below is that it only affects
costly order allocations (i.e. for order > PAGE_ALLOC_COSTLY_ORDER) by
skipping direct reclaim for those orders, while keeping the behaviour
the same for others.

So, for smaller orders (order > min_order and <= PAGE_ALLOC_COSTLY_ORDER),
the allocator will still attempt direct reclaim and compaction (which I
guess is required to avoid OOM too?). And this also looks like a change
which could be easily backported :)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..f2343c26dd63 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
 			gfp_t alloc_gfp = gfp;
 
 			err = -ENOMEM;
-			if (order > min_order)
-				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
+			if (order > min_order) {
+				alloc_gfp |= __GFP_NOWARN;
+				if (order > PAGE_ALLOC_COSTLY_ORDER)
+					alloc_gfp &= ~__GFP_DIRECT_RECLAIM;
+				else
+					alloc_gfp |= __GFP_NORETRY;
+			}

But of course, let's hear from others on their suggestions / thoughts.
Maybe filemap is not the right place to fix this, as Matthew, Andrew
and others were pointing out. Any other suggestions on how to approach
this, please?

-ritesh

^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-05-03 5:52 ` Ritesh Harjani @ 2026-05-03 11:55 ` Matthew Wilcox 0 siblings, 0 replies; 17+ messages in thread From: Matthew Wilcox @ 2026-05-03 11:55 UTC (permalink / raw) To: Ritesh Harjani Cc: Salvatore Dipietro, Andrew Morton, linux-mm, Vlastimil Babka, abuehaze, alisaidi, blakgeof, brauner, dipietro.salvatore, djwong, linux-fsdevel, linux-kernel, linux-xfs, stable On Sun, May 03, 2026 at 11:22:10AM +0530, Ritesh Harjani wrote: > Now this is what I believe could be the reason for memory fragmentation > with this workload - > In Linux, each PTE page table uses 4KB size (assuming you are using 4KB > system PAGE_SIZE). When your workload forks a > child process for each new connection, child gets its own copy of the > page tables which maps the shared buffer. > Since each PTE table is a single 4KB page, hundreds of connections > spawning means hundreds of thousands of single-page allocations for page > tables. So it looks like, the major source of your memory fragmentation > problem must be these several order-0 allocations for PTE page table > pages. While memory is fragmented, the _problem_ is that we try too hard to defragment. From the original post: : When memory is fragmented, each failed allocation triggers : compaction and drain_all_pages() via __alloc_pages_slowpath() We really should only try compaction once. If it didn't make useful progress last time, it won't this time either. 
> > | Patch | Run 1 | Run 2 | Run 3 | Average | % vs Baseline | > > |----------------------|-----------:|-----------:|-----------:|------------:|:-------------:| > > | Baseline | 107,064.61 | 97,043.86 | 101,830.78 | 101,979.75 | — | > > | Proposed patch | 146,012.23 | 136,392.36 | 141,178.00 | 141,194.20 | +38.45% | > > | Ritesh's suggestion | 147,481.50 | 133,069.03 | 137,051.30 | 139,200.61 | +36.50% | > > | Matthew's suggestion | 145,653.91 | 144,169.24 | 141,768.31 | 143,863.82 | +41.07% | > > > The main reason, why I proposed the below patch was because, this only > affects costly order allocation (i.e for order > PAGE_ALLOC_COSTLY_ORDER) > by skipping direct reclaim for those orders, while still keeping the > behaviour same for others. > > So, for smaller orders (order > min_order and <= > PAGE_ALLOC_COSTLY_ORDER), the allocator will still attempt for direct > reclaim and compaction (which I guess is required to avoid oom too?) And > also, this looks like a change which could be easily backportable :) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 4e636647100c..f2343c26dd63 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2007,8 +2007,13 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping, > gfp_t alloc_gfp = gfp; > > err = -ENOMEM; > - if (order > min_order) > - alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN; > + if (order > min_order) { > + alloc_gfp |= __GFP_NOWARN; > + if (order > PAGE_ALLOC_COSTLY_ORDER) > + alloc_gfp &= ~__GFP_DIRECT_RECLAIM; > + else > + alloc_gfp |= __GFP_NORETRY; > + } > > > But of course let's hear from others on their suggestions / thoughts. > Maybe the filemap is not the right place to fix this as Matthew, Andrew > and others were pointing. Any other suggestions on how to approach this, > please? filemap.c REALLY shouldn't know about PAGE_ALLOC_COSTLY_ORDER. That's an internal detail of the memory allocator. 
Either we want an API to say "allocate me a folio between orders A and B" or we need more understandable GFP flags. Or the page allocator could use the __GFP_NORETRY flag to say "oh well, this allocation has a fallback, I'll kick kcompactd to try to compact some more memory, but I'll fail the allocation". ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-03 19:35 ` [PATCH 1/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro 2026-04-04 1:13 ` Ritesh Harjani 2026-04-04 4:15 ` Matthew Wilcox @ 2026-04-05 22:43 ` Dave Chinner 2026-04-07 5:40 ` Christoph Hellwig 2026-04-21 9:02 ` Vlastimil Babka 2 siblings, 2 replies; 17+ messages in thread From: Dave Chinner @ 2026-04-05 22:43 UTC (permalink / raw) To: Salvatore Dipietro Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy, stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote: > Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") > introduced high-order folio allocations in the buffered write > path. When memory is fragmented, each failed allocation triggers > compaction and drain_all_pages() via __alloc_pages_slowpath(), > causing a 0.75x throughput drop on pgbench (simple-update) with > 1024 clients on a 96-vCPU arm64 system. > > Strip __GFP_DIRECT_RECLAIM from folio allocations in > iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER, > making them purely opportunistic. 
>
> Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
> Cc: stable@vger.kernel.org
> Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>
> ---
>  fs/iomap/buffered-io.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 92a831cf4bf1..cb843d54b4d9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  {
>  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
> +	gfp_t gfp;
>
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
> @@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  		fgp |= FGP_DONTCACHE;
>  	fgp |= fgf_set_order(len);
>
> +	gfp = mapping_gfp_mask(iter->inode->i_mapping);
> +
> +	/*
> +	 * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER,
> +	 * strip __GFP_DIRECT_RECLAIM to make the allocation purely
> +	 * opportunistic. This avoids compaction + drain_all_pages()
> +	 * in __alloc_pages_slowpath() that devastate throughput
> +	 * on large systems during buffered writes.
> +	 */
> +	if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
> +		gfp &= ~__GFP_DIRECT_RECLAIM;

Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere
we need to do high order folio allocation is getting out of hand.

Compaction improves long term system performance, so we don't really
just want to turn it off whenever we have demand for high order
folios.

What we should be doing is getting compaction out of the direct
reclaim path - it is -clearly- way too costly for hot paths that use
large allocations, especially those with fallbacks to smaller
allocations or vmalloc.

Instead, memory reclaim should kick background compaction and let it
do the work.
If the allocation path really, really needs high order allocation to succeed, then it can direct the allocation to retry until it succeeds and the allocator itself can wait for background compaction to make progress. For code that has fallbacks to smaller allocations, then there is no need to wait for compaction - we can attempt fast smaller allocations and continue that way until an allocation succeeds.... -Dave. -- Dave Chinner dgc@kernel.org ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-05 22:43 ` Dave Chinner @ 2026-04-07 5:40 ` Christoph Hellwig 2026-04-21 9:02 ` Vlastimil Babka 1 sibling, 0 replies; 17+ messages in thread From: Christoph Hellwig @ 2026-04-07 5:40 UTC (permalink / raw) To: Dave Chinner Cc: Salvatore Dipietro, linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore, willy, stable, Christian Brauner, Darrick J. Wong, linux-xfs, linux-fsdevel On Mon, Apr 06, 2026 at 08:43:57AM +1000, Dave Chinner wrote: > > + if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER) > > + gfp &= ~__GFP_DIRECT_RECLAIM; > > Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere > we need to do high order folio allocation is getting out of hand. That's what I thought. > Compaction improves long term system performance, so we don't really > just want to turn it off whenever we have demand for high order > folios. Yes. Also if we want to make block size > PAGE_SIZE a real option, just giving up on allocating large folios is not an option. > Instead, memory reclaim should kick background compaction and let it > do the work. If the allocation path really, really needs high order > allocation to succeed, then it can direct the allocation to retry > until it succeeds and the allocator itself can wait for background > compaction to make progress. > > For code that has fallbacks to smaller allocations, then there is no > need to wait for compaction - we can attempt fast smaller allocations > and continue that way until an allocation succeeds.... Yes. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation
  2026-04-05 22:43 ` Dave Chinner
  2026-04-07  5:40 ` Christoph Hellwig
@ 2026-04-21  9:02 ` Vlastimil Babka
  1 sibling, 0 replies; 17+ messages in thread
From: Vlastimil Babka @ 2026-04-21 9:02 UTC (permalink / raw)
  To: Dave Chinner, Salvatore Dipietro
  Cc: linux-kernel, alisaidi, blakgeof, abuehaze, dipietro.salvatore,
	willy, stable, Christian Brauner, Darrick J. Wong, linux-xfs,
	linux-fsdevel, Ritesh Harjani (IBM), Christoph Hellwig,
	linux-mm@kvack.org, Michal Hocko, David Hildenbrand (Red Hat),
	Johannes Weiner

On 4/6/26 00:43, Dave Chinner wrote:
> On Fri, Apr 03, 2026 at 07:35:34PM +0000, Salvatore Dipietro wrote:
>> Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> introduced high-order folio allocations in the buffered write
>> path. When memory is fragmented, each failed allocation triggers
>> compaction and drain_all_pages() via __alloc_pages_slowpath(),
>> causing a 0.75x throughput drop on pgbench (simple-update) with
>> 1024 clients on a 96-vCPU arm64 system.
>>
>> Strip __GFP_DIRECT_RECLAIM from folio allocations in
>> iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER,
>> making them purely opportunistic.
>>
>> Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Salvatore Dipietro <dipiets@amazon.it>

BTW, backporting perf regression fixes to 6.6, when they are only
reported at the time 7.0 is released, might be too risky. There will
likely be a different workload that will regress as a result, no matter
what we do.
>> --- >> fs/iomap/buffered-io.c | 15 ++++++++++++++- >> 1 file changed, 14 insertions(+), 1 deletion(-) >> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >> index 92a831cf4bf1..cb843d54b4d9 100644 >> --- a/fs/iomap/buffered-io.c >> +++ b/fs/iomap/buffered-io.c >> @@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate); >> struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len) >> { >> fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS; >> + gfp_t gfp; >> >> if (iter->flags & IOMAP_NOWAIT) >> fgp |= FGP_NOWAIT; >> @@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len) >> fgp |= FGP_DONTCACHE; >> fgp |= fgf_set_order(len); >> >> + gfp = mapping_gfp_mask(iter->inode->i_mapping); >> + >> + /* >> + * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER, >> + * strip __GFP_DIRECT_RECLAIM to make the allocation purely >> + * opportunistic. This avoids compaction + drain_all_pages() >> + * in __alloc_pages_slowpath() that devastate throughput >> + * on large systems during buffered writes. >> + */ >> + if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER) >> + gfp &= ~__GFP_DIRECT_RECLAIM; > > Adding these "gfp &= ~__GFP_DIRECT_RECLAIM" hacks everywhere > we need to do high order folio allocation is getting out of hand. > > Compaction improves long term system performance, so we don't really > just want to turn it off whenever we have demand for high order > folios. > > We should be doing is getting rid of compaction out of the direct > reclaim path - it is -clearly- way too costly for hot paths that use > large allocations, especially those with fallbacks to smaller > allocations or vmalloc. > > Instead, memory reclaim should kick background compaction and let it > do the work. 
If the allocation path really, really needs high order
> allocation to succeed, then it can direct the allocation to retry
> until it succeeds and the allocator itself can wait for background
> compaction to make progress.
>
> For code that has fallbacks to smaller allocations, then there is no
> need to wait for compaction - we can attempt fast smaller allocations
> and continue that way until an allocation succeeds....

So, should we do an LSF/MM session?

But I think in any case, the page allocator needs to know which
allocations do have a fallback. __GFP_NORETRY exists for this. Here it
wasn't tried at all; in v2 [1] it was, but not alone. I'd start from
__GFP_NORETRY alone, and then we can look at tweaking what it does if
it's currently insufficient.

We could have a helper to encapsulate this "turn this allocation into a
lightweight, fallbackable one", which would add __GFP_NORETRY. It
probably already exists somewhere, but not in gfp.h.

But I'm not sure we can simply change GFP_KERNEL to start failing more
for non-costly orders. We've discussed that a lot in the past :)

[1] https://lore.kernel.org/all/20260420161404.642-1-dipiets@amazon.it/

> -Dave.

^ permalink raw reply	[flat|nested] 17+ messages in thread
[parent not found: <20260403193201.30479-1-dipiets@amazon.it>]
* [PATCH 1/1] iomap: avoid compaction for costly folio order allocation [not found] <20260403193201.30479-1-dipiets@amazon.it> @ 2026-04-03 19:32 ` Salvatore Dipietro 2026-04-04 6:25 ` Greg KH 0 siblings, 1 reply; 17+ messages in thread From: Salvatore Dipietro @ 2026-04-03 19:32 UTC (permalink / raw) To: dipietro.salvatore; +Cc: stable Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") introduced high-order folio allocations in the buffered write path. When memory is fragmented, each failed allocation triggers compaction and drain_all_pages() via __alloc_pages_slowpath(), causing a 0.75x throughput drop on pgbench (simple-update) with 1024 clients on a 96-vCPU arm64 system. Strip __GFP_DIRECT_RECLAIM from folio allocations in iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER, making them purely opportunistic. Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") Cc: stable@vger.kernel.org Signed-off-by: Salvatore Dipietro <dipiets@amazon.it> --- fs/iomap/buffered-io.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 92a831cf4bf1..cb843d54b4d9 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -715,6 +715,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate); struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len) { fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS; + gfp_t gfp; if (iter->flags & IOMAP_NOWAIT) fgp |= FGP_NOWAIT; @@ -722,8 +723,20 @@ struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len) fgp |= FGP_DONTCACHE; fgp |= fgf_set_order(len); + gfp = mapping_gfp_mask(iter->inode->i_mapping); + + /* + * If the folio order hint exceeds PAGE_ALLOC_COSTLY_ORDER, + * strip __GFP_DIRECT_RECLAIM to make the allocation purely + * opportunistic. This avoids compaction + drain_all_pages() + * in __alloc_pages_slowpath() that devastate throughput + * on large systems during buffered writes. 
+ */
+ if (FGF_GET_ORDER(fgp) > PAGE_ALLOC_COSTLY_ORDER)
+ gfp &= ~__GFP_DIRECT_RECLAIM;
+
 return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
- fgp, mapping_gfp_mask(iter->inode->i_mapping));
+ fgp, gfp);
 }
 EXPORT_SYMBOL_GPL(iomap_get_folio);
--
2.50.1 (Apple Git-155)

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] iomap: avoid compaction for costly folio order allocation 2026-04-03 19:32 ` Salvatore Dipietro @ 2026-04-04 6:25 ` Greg KH 0 siblings, 0 replies; 17+ messages in thread From: Greg KH @ 2026-04-04 6:25 UTC (permalink / raw) To: Salvatore Dipietro; +Cc: dipietro.salvatore, stable On Fri, Apr 03, 2026 at 07:32:01PM +0000, Salvatore Dipietro wrote: > Commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") > introduced high-order folio allocations in the buffered write > path. When memory is fragmented, each failed allocation triggers > compaction and drain_all_pages() via __alloc_pages_slowpath(), > causing a 0.75x throughput drop on pgbench (simple-update) with > 1024 clients on a 96-vCPU arm64 system. > > Strip __GFP_DIRECT_RECLAIM from folio allocations in > iomap_get_folio() when the order exceeds PAGE_ALLOC_COSTLY_ORDER, > making them purely opportunistic. > > Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") > Cc: stable@vger.kernel.org > Signed-off-by: Salvatore Dipietro <dipiets@amazon.it> > --- > fs/iomap/buffered-io.c | 15 ++++++++++++++- > 1 file changed, 14 insertions(+), 1 deletion(-) > <formletter> This is not the correct way to submit patches for inclusion in the stable kernel tree. Please read: https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html for how to do this properly. </formletter> ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
[not found] <20260403193535.9970-1-dipiets@amazon.it>
2026-04-03 19:35 ` [PATCH 1/1] iomap: avoid compaction for costly folio order allocation Salvatore Dipietro
2026-04-04 1:13 ` Ritesh Harjani
2026-04-04 4:15 ` Matthew Wilcox
2026-04-04 16:47 ` Ritesh Harjani
2026-04-04 20:46 ` Matthew Wilcox
2026-04-16 15:14 ` Ritesh Harjani
2026-04-20 16:33 ` Salvatore Dipietro
2026-04-20 18:44 ` Matthew Wilcox
2026-04-21 1:16 ` Ritesh Harjani
2026-04-28 15:02 ` Salvatore Dipietro
2026-05-03 5:52 ` Ritesh Harjani
2026-05-03 11:55 ` Matthew Wilcox
2026-04-05 22:43 ` Dave Chinner
2026-04-07 5:40 ` Christoph Hellwig
2026-04-21 9:02 ` Vlastimil Babka
[not found] <20260403193201.30479-1-dipiets@amazon.it>
2026-04-03 19:32 ` Salvatore Dipietro
2026-04-04 6:25 ` Greg KH