Linux-mm Archive on lore.kernel.org
* [PATCH v6 0/2] mm: improve large folio readahead for exec memory
@ 2026-05-28 16:55 Usama Arif
  2026-05-28 16:55 ` [PATCH v6 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Usama Arif @ 2026-05-28 16:55 UTC (permalink / raw)
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: r, jack, Andrew Donnellan, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam R.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	ljs, mhocko, npache, pasha.tatashin, rmclure, rppt, surenb,
	vbabka, Al Viro, wilts.infradead.org, "linux-fsdevel, ziy,
	hannes, kas, shakeel.butt, kernel-team, Usama Arif

Two checks in do_sync_mmap_readahead() limit large-folio readahead:

  1. The mmap_miss heuristic is meant to throttle wasteful speculative
     readahead. It is currently also applied to the VM_EXEC readahead
     path, which is targeted rather than speculative. Once mmap_miss exceeds
     MMAP_LOTSAMISS, exec readahead - including the large-folio
     order requested by exec_folio_order() - is disabled. On
     configurations where the mmap_miss decrement paths are not
     active (see patch 1) the counter only grows, so exec readahead
     is permanently disabled after the first 100 faults.

  2. The force_thp_readahead path is gated only on
     HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and always drives the
     readahead at HPAGE_PMD_ORDER. Configurations where
     HPAGE_PMD_ORDER exceeds MAX_PAGECACHE_ORDER never reach this
     path, even when the mapping itself supports usefully large
     folios well below the cap.

Both issues are most visible on arm64 with a 64K base page size,
where HPAGE_PMD_ORDER is 13 (512MB) -- above MAX_PAGECACHE_ORDER
(11) -- and where fault_around_pages collapses to 1, disabling
should_fault_around() (one of the two mmap_miss decrement sites).
However, the fixes are architecture-agnostic: patch 1 reflects the
nature of VM_EXEC readahead regardless of base page size, and
patch 2 generalises the gate so any mapping advertising a usefully
large maximum folio order can benefit.
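
For reference, the sizes quoted above follow from shifting the base page
size by the folio order. A minimal userspace sketch of that arithmetic is
below; folio_bytes() and order_for_bytes() are names made up for this
illustration and are not kernel functions.

```c
#include <stdint.h>

/* Hypothetical helper: bytes covered by one folio of the given order. */
static uint64_t folio_bytes(unsigned int page_shift, unsigned int order)
{
	return (uint64_t)1 << (page_shift + order);
}

/* Hypothetical helper: smallest order whose folio covers at least 'bytes'. */
static unsigned int order_for_bytes(unsigned int page_shift, uint64_t bytes)
{
	unsigned int order = 0;

	while (folio_bytes(page_shift, order) < bytes)
		order++;
	return order;
}
```

With 64K base pages (page_shift 16), folio_bytes(16, 13) is 512MB and
folio_bytes(16, 11) is 128MB, while order_for_bytes(16, 2MB) is 5.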

I created a benchmark that mmaps a large executable file, madvises it
as huge, and calls RET-stub functions at PAGE_SIZE offsets across it.
"Cold" measures fault + readahead cost. "Random" first faults in all
pages with a sequential sweep (not measured), then measures the time
taken to call random offsets, isolating iTLB miss cost for scattered
execution.

The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
512MB executable file on ext4, averaged over 3 runs:

  Phase      | Baseline     | Patched      | Improvement
  -----------|--------------|--------------|------------------
  Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
  Random     | 76.0 ms      | 58.3 ms      | 23% faster

The patches are on top of mm-unstable from 28 May
(8a74e22643189e0ae339afc91110ddb4cab1941b), which includes patch [1].
That patch makes mmap_miss accounting symmetric for VM_SEQ_READ, an
issue sashiko pointed out in the previous revision.

[1] https://lore.kernel.org/all/20260525145751.2671248-1-usama.arif@linux.dev/ 

Kiryl and Jan, I have kept your Reviewed-by tags from the previous revision
for patch 1 as the concept is still the same. Please let me know if that
is not ok.

v5 -> v6: https://lore.kernel.org/all/20260522162422.3856502-1-usama.arif@linux.dev/
- Based on top of patch [1] (sashiko)
- Changes to commit message to make it more accurate for patch 1 and skip
  mmap_miss decrement as well. (sashiko)
- Keep old behaviour if large folio support is not enabled for the
  mapping (sashiko).
- sashiko pointed to a pre-existing TOCTOU data race that my patch could
  make worse. Avoid making it worse by introducing a thp_order local
  variable.

v3 -> v5: https://lore.kernel.org/all/20260402181326.3107102-1-usama.arif@linux.dev/
- (Looks like I messed up the versioning here and went directly from
  v3 to v5.)
- Drop patches for elf thp unmapped area alignment and deal with them
  separately. These patches will just bring folios smaller than PMD
  at the same level as PMD. The 2 patches now should be much easier
  to merge.
- Tackle size of THP for exec pages at the same point as PMD instead
  of tackling using exec_folio_order() (Ryan during LSFMM, Thanks!)

v2 -> v3: https://lore.kernel.org/all/20260320140315.979307-1-usama.arif@linux.dev/
- Take into account READ_ONLY_THP_FOR_FS for elf alignment by aligning
  to HPAGE_PMD_SIZE limited to 2M (Rui)
- Reviewed-by tags for patch 1 from Kiryl and Jan
- Remove preferred_exec_order() (Jan)
- Change ra->order to HPAGE_PMD_ORDER if vma_pages(vma) >= HPAGE_PMD_NR
  otherwise use exec_folio_order() with gfp &= ~__GFP_RECLAIM for
  do_sync_mmap_readahead().
- Change exec_folio_order() to return 2M (cont-pte size) for 64K base
  page size for arm64.
- remove bprm->file NULL check (Matthew)
- Change filp to file (Matthew)
- Improve checking of p_vaddr and p_vaddr (Rui and Matthew)

v1 -> v2: https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev/
- disable mmap_miss logic for VM_EXEC (Jan Kara)
- Align in elf only when segment VA and file offset are already aligned (Rui)
- preferred_exec_order() for VM_EXEC sync mmap_readahead which takes into
  account zone high watermarks (as an approximation of memory pressure)
  (David, or at least my approach to what David suggested in [1] :))
- Extend max alignment to mapping_max_folio_size() instead of
  exec_folio_order()
 
Usama Arif (2):
  mm: bypass mmap_miss heuristic for VM_EXEC readahead
  mm: use mapping_max_folio_order() for force_thp_readahead order

 mm/filemap.c | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

-- 
2.52.0




* [PATCH v6 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead
  2026-05-28 16:55 [PATCH v6 0/2] mm: improve large folio readahead for exec memory Usama Arif
@ 2026-05-28 16:55 ` Usama Arif
  2026-05-29  9:47   ` Pedro Falcato
  2026-05-28 16:55 ` [PATCH v6 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order Usama Arif
  2026-05-28 20:27 ` [PATCH v6 0/2] mm: improve large folio readahead for exec memory Andrew Morton
  2 siblings, 1 reply; 10+ messages in thread
From: Usama Arif @ 2026-05-28 16:55 UTC (permalink / raw)
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: r, jack, Andrew Donnellan, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam R.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	ljs, mhocko, npache, pasha.tatashin, rmclure, rppt, surenb,
	vbabka, Al Viro, wilts.infradead.org, "linux-fsdevel, ziy,
	hannes, kas, shakeel.butt, kernel-team, Usama Arif

The mmap_miss heuristic is intended to stop speculative mmap readahead
when a file looks like a random-access workload. That does not fit the
VM_EXEC path very well.

VM_EXEC readahead is already constrained differently from ordinary mmap
read-around: it is bounded by the VMA, uses exec_folio_order() to choose
an order useful for executable mappings, and sets async_size to 0 so it
does not create follow-on readahead. When VM_HUGEPAGE is also present,
the larger readahead is an explicit userspace opt-in.

The mmap_miss counter is decremented from cache-hit paths in
do_async_mmap_readahead() and filemap_map_pages(). Those paths are not
always enough to balance the synchronous miss increments for executable
mappings. In particular, when fault-around is effectively disabled, such
as configurations where fault_around_pages is 1, filemap_map_pages() is
not reached from the fault path. The counter can then become a stale
throttle for VM_EXEC mappings and suppress the readahead behavior that
the executable-specific path is trying to provide.

Skip both mmap_miss increments and decrements for VM_EXEC mappings,
matching the existing VM_SEQ_READ treatment and keeping the counter
accounting symmetric.
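
As a rough illustration only (not part of the patch), the stale-throttle
effect can be modelled in userspace. The sketch assumes MMAP_LOTSAMISS is
100, consistent with the "first 100 faults" in the cover letter, and that
no decrement path ever runs. struct ra_model, sync_miss() and
faults_with_readahead() are hypothetical names, and the control flow is a
simplification of do_sync_mmap_readahead(), not a copy of it.

```c
#include <stdbool.h>

#define MMAP_LOTSAMISS 100	/* assumed value, see the text above */

/* Hypothetical stand-in for the mmap_miss field of struct file_ra_state. */
struct ra_model {
	unsigned short mmap_miss;
};

/*
 * Model one synchronous miss. Returns true if readahead is still
 * attempted. skip_accounting models the VM_SEQ_READ | VM_EXEC bypass.
 */
static bool sync_miss(struct ra_model *ra, bool skip_accounting)
{
	if (!skip_accounting) {
		if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
			ra->mmap_miss++;
		if (ra->mmap_miss > MMAP_LOTSAMISS)
			return false;	/* throttled: no readahead */
	}
	return true;
}

/* Count how many of nr_faults misses still get readahead. */
static unsigned int faults_with_readahead(unsigned int nr_faults,
					  bool skip_accounting)
{
	struct ra_model ra = { 0 };
	unsigned int served = 0;
	unsigned int i;

	for (i = 0; i < nr_faults; i++)
		if (sync_miss(&ra, skip_accounting))
			served++;
	return served;
}
```

In this model, 1000 misses without the bypass get readahead only for the
first 100, while with the bypass all 1000 do.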

Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
---
 mm/filemap.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index cca20e350c95..a16b33e0fc71 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3339,7 +3339,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 		}
 	}
 
-	if (!(vm_flags & VM_SEQ_READ)) {
+	if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
 		/* Avoid banging the cache line if not needed */
 		mmap_miss = READ_ONCE(ra->mmap_miss);
 		if (mmap_miss < MMAP_LOTSAMISS * 10)
@@ -3434,12 +3434,12 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 	 * times for a single folio and break the balance with mmap_miss
 	 * increase in do_sync_mmap_readahead().
 	 *
-	 * VM_SEQ_READ mappings skip the mmap_miss increment in
+	 * VM_SEQ_READ and VM_EXEC mappings skip the mmap_miss increment in
 	 * do_sync_mmap_readahead(), so skip the decrement here as well to
 	 * keep the counter symmetric.
 	 */
 	if (likely(!folio_test_locked(folio)) &&
-	    !(vmf->vma->vm_flags & VM_SEQ_READ)) {
+	    !(vmf->vma->vm_flags & (VM_SEQ_READ | VM_EXEC))) {
 		mmap_miss = READ_ONCE(ra->mmap_miss);
 		if (mmap_miss)
 			WRITE_ONCE(ra->mmap_miss, --mmap_miss);
@@ -3941,14 +3941,14 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		 * Don't decrease mmap_miss in this scenario to make sure
 		 * we can stop read-ahead.
 		 *
-		 * VM_SEQ_READ mappings skip the mmap_miss increment in
-		 * do_sync_mmap_readahead(), so skip the decrement here as
-		 * well to keep the counter symmetric.
+		 * VM_SEQ_READ and VM_EXEC mappings skip the mmap_miss
+		 * increment in do_sync_mmap_readahead(), so skip the
+		 * decrement here as well to keep the counter symmetric.
 		 */
 		if ((map_ret & VM_FAULT_NOPAGE) &&
 		    !(vmf->flags & FAULT_FLAG_TRIED) &&
 		    !folio_test_workingset(folio) &&
-		    !(vma->vm_flags & VM_SEQ_READ)) {
+		    !(vma->vm_flags & (VM_SEQ_READ | VM_EXEC))) {
 			unsigned short mmap_miss;
 
 			mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
-- 
2.52.0




* [PATCH v6 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order
  2026-05-28 16:55 [PATCH v6 0/2] mm: improve large folio readahead for exec memory Usama Arif
  2026-05-28 16:55 ` [PATCH v6 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
@ 2026-05-28 16:55 ` Usama Arif
  2026-05-29 10:01   ` Pedro Falcato
  2026-05-29 12:36   ` Usama Arif
  2026-05-28 20:27 ` [PATCH v6 0/2] mm: improve large folio readahead for exec memory Andrew Morton
  2 siblings, 2 replies; 10+ messages in thread
From: Usama Arif @ 2026-05-28 16:55 UTC (permalink / raw)
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: r, jack, Andrew Donnellan, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam R.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	ljs, mhocko, npache, pasha.tatashin, rmclure, rppt, surenb,
	vbabka, Al Viro, wilts.infradead.org, "linux-fsdevel, ziy,
	hannes, kas, shakeel.butt, kernel-team, Usama Arif

The force_thp_readahead path in do_sync_mmap_readahead() is gated on
HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and always requests
HPAGE_PMD_ORDER / HPAGE_PMD_NR. On configurations where HPAGE_PMD_ORDER
exceeds MAX_PAGECACHE_ORDER, notably arm64 with a 64K base page size,
VM_HUGEPAGE mappings cannot use this path and fall back to the non-forced
mmap readahead path even when the mapping supports useful large folios.

Keep the existing PMD-sized behavior when HPAGE_PMD_ORDER fits in the
page cache. When it does not, enable forced readahead for mappings that
support large folios and request an order capped by both
mapping_max_folio_order(mapping) and 2MB.

2MB is chosen as the cap because it matches the PMD size on x86_64
and on arm64 with 4K or 16K base pages, so the size/memory-pressure
tradeoff for folios of that size is already well understood. On arm64
with a 64K base page size, 2MB is also the contiguous-PTE (contpte)
block size, so the resulting folios coalesce into a single TLB entry
and reduce TLB pressure on the readahead path.

The final allocation order may still be clamped by page_cache_ra_order()
to the mapping and request geometry, but this gives VM_HUGEPAGE mappings
on such configurations a large-folio readahead request instead of
dropping back to base-page readahead.
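
To make the request geometry concrete, here is a small userspace sketch
of the window arithmetic used on the forced path in the diff below. The
struct ra_window type and forced_window() are hypothetical names for
illustration; only the arithmetic is taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical container for the values the forced path sets up. */
struct ra_window {
	unsigned long index;		/* aligned start index, in pages */
	unsigned long size;		/* pages requested */
	unsigned long async_size;	/* pages left for async readahead */
	unsigned int order;		/* requested folio order */
};

static struct ra_window forced_window(unsigned long fault_index,
				      unsigned int thp_order, bool rand_read)
{
	unsigned long nr = 1UL << thp_order;
	struct ra_window w;

	/* Round the faulting index down to a folio boundary. */
	w.index = fault_index & ~(nr - 1);
	/* Two folios unless the mapping is marked for random reads. */
	w.size = rand_read ? nr : 2 * nr;
	w.async_size = nr;
	w.order = thp_order;
	return w;
}
```

For example, with thp_order 5 (2MB with 64K pages) a fault at page index
100 yields a window starting at index 96 that covers 64 pages, or 32
pages when VM_RAND_READ is set.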

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/filemap.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a16b33e0fc71..bfb891d9da1f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3312,14 +3312,23 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	struct file *fpin = NULL;
 	vm_flags_t vm_flags = vmf->vma->vm_flags;
 	bool force_thp_readahead = false;
+	unsigned int thp_order = 0;
 	unsigned short mmap_miss;
 
 	ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
 
 	/* Use the readahead code, even if readahead is disabled */
-	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
-		force_thp_readahead = true;
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
+		if (HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
+			force_thp_readahead = true;
+			thp_order = HPAGE_PMD_ORDER;
+		} else if (mapping_large_folio_support(mapping)) {
+			force_thp_readahead = true;
+			thp_order = min_t(unsigned int,
+					  mapping_max_folio_order(mapping),
+					  get_order(SZ_2M));
+		}
+	}
 
 	if (!force_thp_readahead) {
 		/*
@@ -3354,17 +3363,19 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	}
 
 	if (force_thp_readahead) {
+		unsigned long folio_nr_pages = 1UL << thp_order;
+
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
-		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
-		ra->size = HPAGE_PMD_NR;
+		ractl._index &= ~(folio_nr_pages - 1);
+		ra->size = folio_nr_pages;
 		/*
-		 * Fetch two PMD folios, so we get the chance to actually
+		 * Fetch two folios so we get the chance to actually
 		 * readahead, unless we've been told not to.
 		 */
 		if (!(vm_flags & VM_RAND_READ))
 			ra->size *= 2;
-		ra->async_size = HPAGE_PMD_NR;
-		ra->order = HPAGE_PMD_ORDER;
+		ra->async_size = folio_nr_pages;
+		ra->order = thp_order;
 		page_cache_ra_order(&ractl, ra);
 		return fpin;
 	}
-- 
2.52.0




* Re: [PATCH v6 0/2] mm: improve large folio readahead for exec memory
  2026-05-28 16:55 [PATCH v6 0/2] mm: improve large folio readahead for exec memory Usama Arif
  2026-05-28 16:55 ` [PATCH v6 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
  2026-05-28 16:55 ` [PATCH v6 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order Usama Arif
@ 2026-05-28 20:27 ` Andrew Morton
  2 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2026-05-28 20:27 UTC (permalink / raw)
  To: Usama Arif
  Cc: david, willy, ryan.roberts, linux-mm, r, jack, Andrew Donnellan,
	apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
	kees, kevin.brodsky, lance.yang, Liam R.Howlett, linux-arm-kernel,
	linux-fsdevel, linux-kernel, ljs, mhocko, npache, pasha.tatashin,
	rmclure, rppt, surenb, vbabka, Al Viro, wilts.infradead.org,
	"linux-fsdevel, ziy, hannes, kas, shakeel.butt, kernel-team

On Thu, 28 May 2026 09:55:18 -0700 Usama Arif <usama.arif@linux.dev> wrote:

> Two checks in do_sync_mmap_readahead() limit large-folio readahead:
> 
>   1. The mmap_miss heuristic is meant to throttle wasteful speculative
>      readahead. It is currently also applied to the VM_EXEC readahead
>      path, which is targeted rather than speculative. Once mmap_miss exceeds
>      MMAP_LOTSAMISS, exec readahead - including the large-folio
>      order requested by exec_folio_order() - is disabled. On
>      configurations where the mmap_miss decrement paths are not
>      active (see patch 1) the counter only grows, so exec readahead
>      is permanently disabled after the first 100 faults.
> 
>   2. The force_thp_readahead path is gated only on
>      HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and always drives the
>      readahead at HPAGE_PMD_ORDER. Configurations where
>      HPAGE_PMD_ORDER exceeds MAX_PAGECACHE_ORDER never reach this
>      path, even when the mapping itself supports usefully large
>      folios well below the cap.
> 
> Both issues are most visible on arm64 with a 64K base page size,
> where HPAGE_PMD_ORDER is 13 (512MB) -- above MAX_PAGECACHE_ORDER
> (11) -- and where fault_around_pages collapses to 1 disabling
> should_fault_around() (one of the two mmap_miss decrement sites).
> However the fixes are architecture-agnostic: patch 1 reflects the
> nature of VM_EXEC readahead regardless of base page size, and
> patch 2 generalises the gate so any mapping advertising a usefully
> large maximum folio order can benefit.

OK, thanks, it's getting late and [2/2] hasn't yet been reviewed.  I'll
add it, just, because I'm a sucker for speedups.




* Re: [PATCH v6 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead
  2026-05-28 16:55 ` [PATCH v6 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
@ 2026-05-29  9:47   ` Pedro Falcato
  0 siblings, 0 replies; 10+ messages in thread
From: Pedro Falcato @ 2026-05-29  9:47 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack,
	Andrew Donnellan, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam R. Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	ljs, mhocko, npache, pasha.tatashin, rmclure, rppt, surenb,
	vbabka, Al Viro, wilts.infradead.org, ziy, hannes, kas,
	shakeel.butt, kernel-team

On Thu, May 28, 2026 at 09:55:19AM -0700, Usama Arif wrote:
> The mmap_miss heuristic is intended to stop speculative mmap readahead
> when a file looks like a random-access workload. That does not fit the
> VM_EXEC path very well.
> 
> VM_EXEC readahead is already constrained differently from ordinary mmap
> read-around: it is bounded by the VMA, uses exec_folio_order() to choose
> an order useful for executable mappings, and sets async_size to 0 so it
> does not create follow-on readahead. When VM_HUGEPAGE is also present,
> the larger readahead is an explicit userspace opt-in.
> 
> The mmap_miss counter is decremented from cache-hit paths in
> do_async_mmap_readahead() and filemap_map_pages(). Those paths are not
> always enough to balance the synchronous miss increments for executable
> mappings. In particular, when fault-around is effectively disabled, such
> as configurations where fault_around_pages is 1, filemap_map_pages() is
> not reached from the fault path. The counter can then become a stale
> throttle for VM_EXEC mappings and suppress the readahead behavior that
> the executable-specific path is trying to provide.
> 
> Skip both mmap_miss increments and decrements for VM_EXEC mappings,
> matching the existing VM_SEQ_READ treatment and keeping the counter
> accounting symmetric.
> 
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

Reviewed-by: Pedro Falcato <pfalcato@suse.de>

This is reasonable, thanks.

> ---
>  mm/filemap.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index cca20e350c95..a16b33e0fc71 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3339,7 +3339,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  		}
>  	}
>  
> -	if (!(vm_flags & VM_SEQ_READ)) {
> +	if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
>  		/* Avoid banging the cache line if not needed */
>  		mmap_miss = READ_ONCE(ra->mmap_miss);
>  		if (mmap_miss < MMAP_LOTSAMISS * 10)
> @@ -3434,12 +3434,12 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
>  	 * times for a single folio and break the balance with mmap_miss
>  	 * increase in do_sync_mmap_readahead().
>  	 *
> -	 * VM_SEQ_READ mappings skip the mmap_miss increment in
> +	 * VM_SEQ_READ and VM_EXEC mappings skip the mmap_miss increment in
>  	 * do_sync_mmap_readahead(), so skip the decrement here as well to
>  	 * keep the counter symmetric.
>  	 */
>  	if (likely(!folio_test_locked(folio)) &&
> -	    !(vmf->vma->vm_flags & VM_SEQ_READ)) {
> +	    !(vmf->vma->vm_flags & (VM_SEQ_READ | VM_EXEC))) {
>  		mmap_miss = READ_ONCE(ra->mmap_miss);
>  		if (mmap_miss)
>  			WRITE_ONCE(ra->mmap_miss, --mmap_miss);
> @@ -3941,14 +3941,14 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>  		 * Don't decrease mmap_miss in this scenario to make sure
>  		 * we can stop read-ahead.
>  		 *
> -		 * VM_SEQ_READ mappings skip the mmap_miss increment in
> -		 * do_sync_mmap_readahead(), so skip the decrement here as
> -		 * well to keep the counter symmetric.
> +		 * VM_SEQ_READ and VM_EXEC mappings skip the mmap_miss
> +		 * increment in do_sync_mmap_readahead(), so skip the
> +		 * decrement here as well to keep the counter symmetric.
>  		 */
>  		if ((map_ret & VM_FAULT_NOPAGE) &&
>  		    !(vmf->flags & FAULT_FLAG_TRIED) &&
>  		    !folio_test_workingset(folio) &&
> -		    !(vma->vm_flags & VM_SEQ_READ)) {
> +		    !(vma->vm_flags & (VM_SEQ_READ | VM_EXEC))) {
>  			unsigned short mmap_miss;
>  
>  			mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
> -- 
> 2.52.0
> 
> 

-- 
Pedro



* Re: [PATCH v6 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order
  2026-05-28 16:55 ` [PATCH v6 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order Usama Arif
@ 2026-05-29 10:01   ` Pedro Falcato
  2026-05-29 12:19     ` Usama Arif
  2026-05-29 12:36   ` Usama Arif
  1 sibling, 1 reply; 10+ messages in thread
From: Pedro Falcato @ 2026-05-29 10:01 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack,
	Andrew Donnellan, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam R. Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	ljs, mhocko, npache, pasha.tatashin, rmclure, rppt, surenb,
	vbabka, Al Viro, wilts.infradead.org, ziy, hannes, kas,
	shakeel.butt, kernel-team

On Thu, May 28, 2026 at 09:55:20AM -0700, Usama Arif wrote:
> The force_thp_readahead path in do_sync_mmap_readahead() is gated on
> HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and always requests
> HPAGE_PMD_ORDER / HPAGE_PMD_NR. On configurations where HPAGE_PMD_ORDER
> exceeds MAX_PAGECACHE_ORDER, notably arm64 with a 64K base page size,
> VM_HUGEPAGE mappings cannot use this path and fall back to the non-forced
> mmap readahead path even when the mapping supports useful large folios.
> 
> Keep the existing PMD-sized behavior when HPAGE_PMD_ORDER fits in the
> page cache. When it does not, enable forced readahead for mappings that
> support large folios and request an order capped by both
> mapping_max_folio_order(mapping) and 2MB.
> 
> 2MB is chosen as the cap because it matches the PMD size on x86_64
> and on arm64 with 4K or 16K base pages, so the size/memory-pressure

16K base page size's PMDs should be 32M, no? Am I misunderstanding what
you mean here?

> tradeoff for folios of that size is already well understood. On arm64
> with a 64K base page size, 2MB is also the contiguous-PTE (contpte)
> block size, so the resulting folios coalesce into a single TLB entry
> and reduce TLB pressure on the readahead path.
> 
> The final allocation order may still be clamped by page_cache_ra_order()
> to the mapping and request geometry, but this gives VM_HUGEPAGE mappings
> on such configurations a large-folio readahead request instead of
> dropping back to base-page readahead.
> 
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>  mm/filemap.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a16b33e0fc71..bfb891d9da1f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3312,14 +3312,23 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	struct file *fpin = NULL;
>  	vm_flags_t vm_flags = vmf->vma->vm_flags;
>  	bool force_thp_readahead = false;
> +	unsigned int thp_order = 0;
>  	unsigned short mmap_miss;
>  
>  	ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
>  
>  	/* Use the readahead code, even if readahead is disabled */
> -	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> -	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
> -		force_thp_readahead = true;
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
> +		if (HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
> +			force_thp_readahead = true;
> +			thp_order = HPAGE_PMD_ORDER;
> +		} else if (mapping_large_folio_support(mapping)) {
> +			force_thp_readahead = true;
> +			thp_order = min_t(unsigned int,
> +					  mapping_max_folio_order(mapping),
> +					  get_order(SZ_2M));

This looks somewhat arbitrary to me (as does the old logic). It seems more
natural (correct?) to do this:

	if (THP_ENABLED && (vm_flags & VM_HUGEPAGE)) {
		if (mapping_large_folio_support(mapping)) {
			force_thp_readahead = true;
			/*
			 * Cap THP readahead to either the max folio order the
			 * mapping supports, or the max order the page cache
			 * supports (useless to try more), or the hugepage PMD
			 * order (CPU can't benefit from larger).
			 */
			thp_order = min3(mapping_max_folio_order(mapping),
					MAX_PAGECACHE_ORDER, HPAGE_PMD_ORDER);
		}
	}

> +		}
> +	}
>  
>  	if (!force_thp_readahead) {
>  		/*
> @@ -3354,17 +3363,19 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	}
>  
>  	if (force_thp_readahead) {
> +		unsigned long folio_nr_pages = 1UL << thp_order;
> +
>  		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> -		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
> -		ra->size = HPAGE_PMD_NR;
> +		ractl._index &= ~(folio_nr_pages - 1);
> +		ra->size = folio_nr_pages;
>  		/*
> -		 * Fetch two PMD folios, so we get the chance to actually
> +		 * Fetch two folios so we get the chance to actually
			  ^^ large folios?
>  		 * readahead, unless we've been told not to.
>  		 */
>  		if (!(vm_flags & VM_RAND_READ))
>  			ra->size *= 2;
> -		ra->async_size = HPAGE_PMD_NR;
> -		ra->order = HPAGE_PMD_ORDER;
> +		ra->async_size = folio_nr_pages;
> +		ra->order = thp_order;

This part LGTM.

-- 
Pedro



* Re: [PATCH v6 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order
  2026-05-29 10:01   ` Pedro Falcato
@ 2026-05-29 12:19     ` Usama Arif
  2026-05-29 13:40       ` Pedro Falcato
  0 siblings, 1 reply; 10+ messages in thread
From: Usama Arif @ 2026-05-29 12:19 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack,
	Andrew Donnellan, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam R. Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	ljs, mhocko, npache, pasha.tatashin, rmclure, rppt, surenb,
	vbabka, Al Viro, wilts.infradead.org, ziy, hannes, kas,
	shakeel.butt, kernel-team



On 29/05/2026 11:01, Pedro Falcato wrote:
> On Thu, May 28, 2026 at 09:55:20AM -0700, Usama Arif wrote:
>> The force_thp_readahead path in do_sync_mmap_readahead() is gated on
>> HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and always requests
>> HPAGE_PMD_ORDER / HPAGE_PMD_NR. On configurations where HPAGE_PMD_ORDER
>> exceeds MAX_PAGECACHE_ORDER, notably arm64 with a 64K base page size,
>> VM_HUGEPAGE mappings cannot use this path and fall back to the non-forced
>> mmap readahead path even when the mapping supports useful large folios.
>>
>> Keep the existing PMD-sized behavior when HPAGE_PMD_ORDER fits in the
>> page cache. When it does not, enable forced readahead for mappings that
>> support large folios and request an order capped by both
>> mapping_max_folio_order(mapping) and 2MB.
>>
>> 2MB is chosen as the cap because it matches the PMD size on x86_64
>> and on arm64 with 4K or 16K base pages, so the size/memory-pressure
> 
> 16K base page size's PMDs should be 32M, no? Am I misunderstanding what
> you mean here?
> 

Yes, I should have just said 4K. I messed up when rewriting the commit
message. Andrew, thanks for merging into mm-new. Would it be possible
to remove "or 16K" from the commit message? I can send a new revision
as well if that is easier and preferred.

>> tradeoff for folios of that size is already well understood. On arm64
>> with a 64K base page size, 2MB is also the contiguous-PTE (contpte)
>> block size, so the resulting folios coalesce into a single TLB entry
>> and reduce TLB pressure on the readahead path.
>>
>> The final allocation order may still be clamped by page_cache_ra_order()
>> to the mapping and request geometry, but this gives VM_HUGEPAGE mappings
>> on such configurations a large-folio readahead request instead of
>> dropping back to base-page readahead.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>> ---
>>  mm/filemap.c | 27 +++++++++++++++++++--------
>>  1 file changed, 19 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index a16b33e0fc71..bfb891d9da1f 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3312,14 +3312,23 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>>  	struct file *fpin = NULL;
>>  	vm_flags_t vm_flags = vmf->vma->vm_flags;
>>  	bool force_thp_readahead = false;
>> +	unsigned int thp_order = 0;
>>  	unsigned short mmap_miss;
>>  
>>  	ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
>>  
>>  	/* Use the readahead code, even if readahead is disabled */
>> -	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
>> -	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
>> -		force_thp_readahead = true;
>> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
>> +		if (HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
>> +			force_thp_readahead = true;
>> +			thp_order = HPAGE_PMD_ORDER;
>> +		} else if (mapping_large_folio_support(mapping)) {
>> +			force_thp_readahead = true;
>> +			thp_order = min_t(unsigned int,
>> +					  mapping_max_folio_order(mapping),
>> +					  get_order(SZ_2M));
> 
> This looks somewhat arbitrary to me (as does the old logic). It seems more
> natural (correct?) to do this:
> 
> 	if (THP_ENABLED && (vm_flags & VM_HUGEPAGE)) {
> 		if (mapping_large_folio_support(mapping)) {
> 			force_thp_readahead = true;
> 			/*
> 			 * Cap THP readahead to either the max folio order the
> 			 * mapping supports, or the max order the page cache
> 			 * supports (useless to try more), or the hugepage PMD
> 			 * order (CPU can't benefit from larger).
> 			 */
> 			thp_order = min3(mapping_max_folio_order(mapping),
> 					MAX_PAGECACHE_ORDER, HPAGE_PMD_ORDER);
> 		}
> 	}
> 

This can result in extreme memory pressure. For example, with a 64K base
page size, HPAGE_PMD_ORDER = 13 (512M) and MAX_PAGECACHE_ORDER = 11 (128M),
so this evaluates thp_order to 11 (128M). I think Jan mentioned in previous
versions that there are already reports of high memory pressure due to
large file folios, and we are seeing this in our fleet as well with xfs.

I chose the 2M cap because that is what most (not all) architectures that
currently take this hugepage code path are used to. On arch + base page
size combinations that currently fall back to base pages here, for example
a 64K base page size, it provides a performance uplift without causing
excessive memory pressure.

In pagemap.h, we have

#define PREFERRED_MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
#define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)

So MAX_PAGECACHE_ORDER <= HPAGE_PMD_ORDER is always true.

At filesystem/inode setup, mapping_set_folio_order_range() does:

	if (max > MAX_PAGECACHE_ORDER)
		max = MAX_PAGECACHE_ORDER;

so mapping_max_folio_order() <= MAX_PAGECACHE_ORDER is also always true.

This means mapping_max_folio_order(mapping) <= MAX_PAGECACHE_ORDER <= HPAGE_PMD_ORDER is always
true, and you don't need the min3(..) in your diff.

Now the question is: why not just do the following?

	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
		if (mapping_large_folio_support(mapping)) {
			force_thp_readahead = true;
			thp_order = min_t(unsigned int,
					  mapping_max_folio_order(mapping),
					  get_order(SZ_2M));
		}
	}


Because this would regress the 16K ARM case, where we already get 32M
folios: someone might upgrade the kernel and start getting 2M folios instead.

So IMHO the current code is correct: it maintains the existing
behaviour while also providing large folios, in a controlled way, to
architectures that currently don't get them.

Later on, if we decide or measure that 2M was too small a cap, it can
be raised. I think it's very easy to raise a cap, but much more difficult
to make the cap smaller, as we might get reports on kernel upgrades that
someone was getting a certain size of folio but now it's much smaller.

In summary, the code in its current form gives:
1) 4K x86/ARM (HPAGE_PMD_ORDER=9 <= 9): first branch -> 2M, unchanged behaviour
2) 16K ARM (HPAGE_PMD_ORDER=11 <= 11): first branch -> 32M, unchanged behaviour
3) 64K ARM (HPAGE_PMD_ORDER=13 > 11): new else-if -> 2M, new capability, conservative
behaviour, which can be increased later if needed.
I know that a 64K base page size getting 2M large folios while a 16K base page size
gets 32M sounds counterintuitive, but I think 32M on 16K was a mistake and
we shouldn't repeat that mistake for 64K.

What your change with min3(mapping_max_folio_order(mapping), MAX_PAGECACHE_ORDER, HPAGE_PMD_ORDER)
would give:
3) 64K ARM -> 128M folios, extremely high memory pressure.

What only having min(mapping_max_folio_order, get_order(SZ_2M)) would give:
2) 16K ARM -> 2M folios (a regression, since we always had 32M on 16K ARM).


>> +		}
>> +	}
>>  
>>  	if (!force_thp_readahead) {
>>  		/*
>> @@ -3354,17 +3363,19 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>>  	}
>>  
>>  	if (force_thp_readahead) {
>> +		unsigned long folio_nr_pages = 1UL << thp_order;
>> +
>>  		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>> -		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
>> -		ra->size = HPAGE_PMD_NR;
>> +		ractl._index &= ~(folio_nr_pages - 1);
>> +		ra->size = folio_nr_pages;
>>  		/*
>> -		 * Fetch two PMD folios, so we get the chance to actually
>> +		 * Fetch two folios so we get the chance to actually
> 			  ^^ large folios?
>>  		 * readahead, unless we've been told not to.
>>  		 */
>>  		if (!(vm_flags & VM_RAND_READ))
>>  			ra->size *= 2;
>> -		ra->async_size = HPAGE_PMD_NR;
>> -		ra->order = HPAGE_PMD_ORDER;
>> +		ra->async_size = folio_nr_pages;
>> +		ra->order = thp_order;
> 
> This part LGTM.
> 




* Re: [PATCH v6 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order
  2026-05-28 16:55 ` [PATCH v6 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order Usama Arif
  2026-05-29 10:01   ` Pedro Falcato
@ 2026-05-29 12:36   ` Usama Arif
  1 sibling, 0 replies; 10+ messages in thread
From: Usama Arif @ 2026-05-29 12:36 UTC (permalink / raw)
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: r, jack, Andrew Donnellan, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam R. Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	ljs, mhocko, npache, pasha.tatashin, rmclure, rppt, surenb,
	vbabka, Al Viro, ziy, hannes, kas, shakeel.butt, kernel-team



On 28/05/2026 17:55, Usama Arif wrote:
> The force_thp_readahead path in do_sync_mmap_readahead() is gated on
> HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and always requests
> HPAGE_PMD_ORDER / HPAGE_PMD_NR. On configurations where HPAGE_PMD_ORDER
> exceeds MAX_PAGECACHE_ORDER, notably arm64 with a 64K base page size,
> VM_HUGEPAGE mappings cannot use this path and fall back to the non-forced
> mmap readahead path even when the mapping supports useful large folios.
> 
> Keep the existing PMD-sized behavior when HPAGE_PMD_ORDER fits in the
> page cache. When it does not, enable forced readahead for mappings that
> support large folios and request an order capped by both
> mapping_max_folio_order(mapping) and 2MB.
> 
> 2MB is chosen as the cap because it matches the PMD size on x86_64
> and on arm64 with 4K or 16K base pages, so the size/memory-pressure
> tradeoff for folios of that size is already well understood. On arm64
> with a 64K base page size, 2MB is also the contiguous-PTE (contpte)
> block size, so the resulting folios coalesce into a single TLB entry
> and reduce TLB pressure on the readahead path.
> 
> The final allocation order may still be clamped by page_cache_ra_order()
> to the mapping and request geometry, but this gives VM_HUGEPAGE mappings
> on such configurations a large-folio readahead request instead of
> dropping back to base-page readahead.
> 
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
>  mm/filemap.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a16b33e0fc71..bfb891d9da1f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3312,14 +3312,23 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	struct file *fpin = NULL;
>  	vm_flags_t vm_flags = vmf->vma->vm_flags;
>  	bool force_thp_readahead = false;
> +	unsigned int thp_order = 0;
>  	unsigned short mmap_miss;
>  
>  	ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
>  
>  	/* Use the readahead code, even if readahead is disabled */
> -	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> -	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
> -		force_thp_readahead = true;
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
> +		if (HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
> +			force_thp_readahead = true;
> +			thp_order = HPAGE_PMD_ORDER;
> +		} else if (mapping_large_folio_support(mapping)) {
> +			force_thp_readahead = true;
> +			thp_order = min_t(unsigned int,
> +					  mapping_max_folio_order(mapping),
> +					  get_order(SZ_2M));
> +		}
> +	}
>  

I think it might be good to include the comment below to explain the decision being
made here:

From 6673c04a434df01d449c1bdb9ac8de74e19d6b7e Mon Sep 17 00:00:00 2001
From: Usama Arif <usama.arif@linux.dev>
Date: Fri, 29 May 2026 05:31:05 -0700
Subject: [PATCH] [fixlet] mm/filemap: add comment explaining design decision

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/filemap.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index bfb891d9da1f..0a3facf452b3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3319,6 +3319,14 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 
 	/* Use the readahead code, even if readahead is disabled */
 	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
+		/*
+		 * Preserve PMD-sized readahead where it already fits in
+		 * the page cache. Otherwise cap the new fallback path at
+		 * 2MB: this is the common PMD-sized hugepage size, and it
+		 * avoids memory pressure from very large forced readahead
+		 * when mapping_max_folio_order() is high (for example,
+		 * 128MB with 64K base pages on arm64).
+		 */
 		if (HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
 			force_thp_readahead = true;
 			thp_order = HPAGE_PMD_ORDER;
-- 
2.53.0-Meta





* Re: [PATCH v6 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order
  2026-05-29 12:19     ` Usama Arif
@ 2026-05-29 13:40       ` Pedro Falcato
  2026-05-29 14:11         ` Usama Arif
  0 siblings, 1 reply; 10+ messages in thread
From: Pedro Falcato @ 2026-05-29 13:40 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack,
	Andrew Donnellan, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam R. Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	ljs, mhocko, npache, pasha.tatashin, rmclure, rppt, surenb,
	vbabka, Al Viro, wilts.infradead.org, ziy, hannes, kas,
	shakeel.butt, kernel-team

On Fri, May 29, 2026 at 01:19:03PM +0100, Usama Arif wrote:
> 
> 
> On 29/05/2026 11:01, Pedro Falcato wrote:
> > On Thu, May 28, 2026 at 09:55:20AM -0700, Usama Arif wrote:
> >> The force_thp_readahead path in do_sync_mmap_readahead() is gated on
> >> HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and always requests
> >> HPAGE_PMD_ORDER / HPAGE_PMD_NR. On configurations where HPAGE_PMD_ORDER
> >> exceeds MAX_PAGECACHE_ORDER, notably arm64 with a 64K base page size,
> >> VM_HUGEPAGE mappings cannot use this path and fall back to the non-forced
> >> mmap readahead path even when the mapping supports useful large folios.
> >>
> >> Keep the existing PMD-sized behavior when HPAGE_PMD_ORDER fits in the
> >> page cache. When it does not, enable forced readahead for mappings that
> >> support large folios and request an order capped by both
> >> mapping_max_folio_order(mapping) and 2MB.
> >>
> >> 2MB is chosen as the cap because it matches the PMD size on x86_64
> >> and on arm64 with 4K or 16K base pages, so the size/memory-pressure
> > 
> > 16K base page size's PMDs should be 32M, no? Am I misunderstanding what
> > you mean here?
> > 
> 
> Yes, should have just said 4K. Messed up when rewriting the commit
> message. Andrew Thanks for merging in mm-new, would it be possible
> to remove "or 16K" from the commit message. I can send a new revision
> as well if that is easier and preferred.
> 
> >> tradeoff for folios of that size is already well understood. On arm64
> >> with a 64K base page size, 2MB is also the contiguous-PTE (contpte)
> >> block size, so the resulting folios coalesce into a single TLB entry
> >> and reduce TLB pressure on the readahead path.
> >>
> >> The final allocation order may still be clamped by page_cache_ra_order()
> >> to the mapping and request geometry, but this gives VM_HUGEPAGE mappings
> >> on such configurations a large-folio readahead request instead of
> >> dropping back to base-page readahead.
> >>
> >> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> >> ---
> >>  mm/filemap.c | 27 +++++++++++++++++++--------
> >>  1 file changed, 19 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/filemap.c b/mm/filemap.c
> >> index a16b33e0fc71..bfb891d9da1f 100644
> >> --- a/mm/filemap.c
> >> +++ b/mm/filemap.c
> >> @@ -3312,14 +3312,23 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> >>  	struct file *fpin = NULL;
> >>  	vm_flags_t vm_flags = vmf->vma->vm_flags;
> >>  	bool force_thp_readahead = false;
> >> +	unsigned int thp_order = 0;
> >>  	unsigned short mmap_miss;
> >>  
> >>  	ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
> >>  
> >>  	/* Use the readahead code, even if readahead is disabled */
> >> -	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> >> -	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
> >> -		force_thp_readahead = true;
> >> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
> >> +		if (HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
> >> +			force_thp_readahead = true;
> >> +			thp_order = HPAGE_PMD_ORDER;
> >> +		} else if (mapping_large_folio_support(mapping)) {
> >> +			force_thp_readahead = true;
> >> +			thp_order = min_t(unsigned int,
> >> +					  mapping_max_folio_order(mapping),
> >> +					  get_order(SZ_2M));
> > 
> > This looks somewhat arbitrary to me (as does the old logic). It seems more
> > natural (correct?) to do this:
> > 
> > 	if (THP_ENABLED && (vm_flags & VM_HUGEPAGE)) {
> > 		if (mapping_large_folio_support(mapping)) {
> > 			force_thp_readahead = true;
> > 			/*
> > 			 * Cap THP readahead to either the max folio order the
> > 			 * mapping supports, or the max order the page cache
> > 			 * supports (useless to try more), or the hugepage PMD
> > 			 * order (CPU can't benefit from larger).
> > 			 */
> > 			thp_order = min3(mapping_max_folio_order(mapping),
> > 					MAX_PAGECACHE_ORDER, HPAGE_PMD_ORDER);
> > 		}
> > 	}
> > 
> 
> So this can result in extreme memory pressure. For example, on 64K base page size,
> with HPAGE_PMD_ORDER = 13 (512M) and MAX_PAGECACHE_ORDER = 11 (128M), this will evaluate
> THP order to 128M. I think Jan mentioned in previous versions that there are already
> reports of high memory pressure due to large file folios, and we are seeing this
> in our fleet as well with xfs.
> 
> I chose 2M cap as that is what most (not all) architectures which currently support this
> hugepage code flow are currently used to, and on archs + base page size combination which
> just evaluate this code to base page, for example 64K base page size, it will provide a
> performance uplift without causing excessive memory pressure.

For what it's worth, memory pressure is still a problem even with 2MB folios. But
yes, I understand.

For what it's worth, I don't think it's such a big issue in this case - you're
asking for THPs, and you're getting them - though maybe not efficiently because
these aren't at the PMD level. But this is also true for e.g. x86, where
an order-3 or order-7 folio is completely useless for the hardware, since
there is no mTHP to speak of (and, with that, you also lose PG_active and
PG_referenced precision as you only keep one of each per folio, amongst
other tradeoffs).

> 
> In pagemap.h, we have
> 
> #define PREFERRED_MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> #define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
> 
> So MAX_PAGECACHE_ORDER <= HPAGE_PMD_ORDER is always true.
> 
> At filesystem/inode setup: mapping_set_folio_order_range() 
> 
> 	if (max > MAX_PAGECACHE_ORDER)
> 		max = MAX_PAGECACHE_ORDER;
> 
> so mapping_max_folio_order() <= MAX_PAGECACHE_ORDER is also always true.

Right, good point. (well, I suppose it will not be true in the future as
someone tries to harness PMD-level cont bits)

> 
> which means mapping_max_folio_order(mapping) <= MAX_PAGECACHE_ORDER <= HPAGE_PMD_ORDER is always
> true, and you dont need the min3(..) in your diff.
> 
> Now the question is if then why not just do:
> 
> 	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
> 		if (mapping_large_folio_support(mapping)) {
> 			force_thp_readahead = true;
> 			thp_order = min_t(unsigned int,
> 					  mapping_max_folio_order(mapping),
> 					  get_order(SZ_2M));
> 		}
> 	}
> 
> 
> This is because this will regress the 16K ARM case where we already got 32M
> folios. Someone might upgrade the kernel and start getting 2M folios now.

So maybe limit to 32MB? It's still arbitrary but at least you get simpler
logic. If the architecture does not support 32MiB folios, it will clamp
the maximum folio order to HPAGE_PMD_ORDER, and you get the same result.

Does this sound correct?

> 
> So IMHO the current code is correct, both in maintaining what the
> existing behaviour has been, but also trying to provide large folios to
> architectures that currently dont get them but in a controlled way.
> 
> Later on, when we think or evaluate that 2M was too small a cap, it can
> be raised. I think its very easy to raise a cap, but much more difficult
> to make the cap smaller, as we might get reports on kernel upgrades that
> someone was getting a certain size of folio but now its much smaller.

I don't think this is a valid argument. You can also have reports that now
folios are too large and they're regressing workloads because of thrashing/
compaction/possibly TLB thrashing. And indeed we have several reports of that
(https://lore.kernel.org/linux-fsdevel/20260403193535.9970-1-dipiets@amazon.it/
comes to mind, plus all of the anon THP regressions over the years, including
in $current_year).

Bottom line is that changing things will always affect someone :) Particularly
since the logic we have is not too careful at deciding what should or should
not be a THP (both in anon and file cases). And if (once?) we make it smarter,
it will surely also regress someone!

-- 
Pedro



* Re: [PATCH v6 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order
  2026-05-29 13:40       ` Pedro Falcato
@ 2026-05-29 14:11         ` Usama Arif
  0 siblings, 0 replies; 10+ messages in thread
From: Usama Arif @ 2026-05-29 14:11 UTC (permalink / raw)
  To: Pedro Falcato, willy, jack
  Cc: Andrew Morton, david, ryan.roberts, linux-mm, r, Andrew Donnellan,
	apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
	kees, kevin.brodsky, lance.yang, Liam R. Howlett,
	linux-arm-kernel, linux-fsdevel, linux-kernel, ljs, mhocko,
	npache, pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
	wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team



On 29/05/2026 14:40, Pedro Falcato wrote:
> On Fri, May 29, 2026 at 01:19:03PM +0100, Usama Arif wrote:
>>
>>
>> On 29/05/2026 11:01, Pedro Falcato wrote:
>>> On Thu, May 28, 2026 at 09:55:20AM -0700, Usama Arif wrote:
>>>> The force_thp_readahead path in do_sync_mmap_readahead() is gated on
>>>> HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and always requests
>>>> HPAGE_PMD_ORDER / HPAGE_PMD_NR. On configurations where HPAGE_PMD_ORDER
>>>> exceeds MAX_PAGECACHE_ORDER, notably arm64 with a 64K base page size,
>>>> VM_HUGEPAGE mappings cannot use this path and fall back to the non-forced
>>>> mmap readahead path even when the mapping supports useful large folios.
>>>>
>>>> Keep the existing PMD-sized behavior when HPAGE_PMD_ORDER fits in the
>>>> page cache. When it does not, enable forced readahead for mappings that
>>>> support large folios and request an order capped by both
>>>> mapping_max_folio_order(mapping) and 2MB.
>>>>
>>>> 2MB is chosen as the cap because it matches the PMD size on x86_64
>>>> and on arm64 with 4K or 16K base pages, so the size/memory-pressure
>>>
>>> 16K base page size's PMDs should be 32M, no? Am I misunderstanding what
>>> you mean here?
>>>
>>
>> Yes, should have just said 4K. Messed up when rewriting the commit
>> message. Andrew Thanks for merging in mm-new, would it be possible
>> to remove "or 16K" from the commit message. I can send a new revision
>> as well if that is easier and preferred.
>>
>>>> tradeoff for folios of that size is already well understood. On arm64
>>>> with a 64K base page size, 2MB is also the contiguous-PTE (contpte)
>>>> block size, so the resulting folios coalesce into a single TLB entry
>>>> and reduce TLB pressure on the readahead path.
>>>>
>>>> The final allocation order may still be clamped by page_cache_ra_order()
>>>> to the mapping and request geometry, but this gives VM_HUGEPAGE mappings
>>>> on such configurations a large-folio readahead request instead of
>>>> dropping back to base-page readahead.
>>>>
>>>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
>>>> ---
>>>>  mm/filemap.c | 27 +++++++++++++++++++--------
>>>>  1 file changed, 19 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>> index a16b33e0fc71..bfb891d9da1f 100644
>>>> --- a/mm/filemap.c
>>>> +++ b/mm/filemap.c
>>>> @@ -3312,14 +3312,23 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>>>>  	struct file *fpin = NULL;
>>>>  	vm_flags_t vm_flags = vmf->vma->vm_flags;
>>>>  	bool force_thp_readahead = false;
>>>> +	unsigned int thp_order = 0;
>>>>  	unsigned short mmap_miss;
>>>>  
>>>>  	ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
>>>>  
>>>>  	/* Use the readahead code, even if readahead is disabled */
>>>> -	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
>>>> -	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
>>>> -		force_thp_readahead = true;
>>>> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
>>>> +		if (HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER) {
>>>> +			force_thp_readahead = true;
>>>> +			thp_order = HPAGE_PMD_ORDER;
>>>> +		} else if (mapping_large_folio_support(mapping)) {
>>>> +			force_thp_readahead = true;
>>>> +			thp_order = min_t(unsigned int,
>>>> +					  mapping_max_folio_order(mapping),
>>>> +					  get_order(SZ_2M));
>>>
>>> This looks somewhat arbitrary to me (as does the old logic). It seems more
>>> natural (correct?) to do this:
>>>
>>> 	if (THP_ENABLED && (vm_flags & VM_HUGEPAGE)) {
>>> 		if (mapping_large_folio_support(mapping)) {
>>> 			force_thp_readahead = true;
>>> 			/*
>>> 			 * Cap THP readahead to either the max folio order the
>>> 			 * mapping supports, or the max order the page cache
>>> 			 * supports (useless to try more), or the hugepage PMD
>>> 			 * order (CPU can't benefit from larger).
>>> 			 */
>>> 			thp_order = min3(mapping_max_folio_order(mapping),
>>> 					MAX_PAGECACHE_ORDER, HPAGE_PMD_ORDER);
>>> 		}
>>> 	}
>>>
>>
>> So this can result in extreme memory pressure. For example, on 64K base page size,
>> with HPAGE_PMD_ORDER = 13 (512M) and MAX_PAGECACHE_ORDER = 11 (128M), this will evaluate
>> THP order to 128M. I think Jan mentioned in previous versions that there are already
>> reports of high memory pressure due to large file folios, and we are seeing this
>> in our fleet as well with xfs.
>>
>> I chose 2M cap as that is what most (not all) architectures which currently support this
>> hugepage code flow are currently used to, and on archs + base page size combination which
>> just evaluate this code to base page, for example 64K base page size, it will provide a
>> performance uplift without causing excessive memory pressure.
> 
> For what it's worth, memory pressure is still a problem even on 2MB folios. But
> yes, I understand.
> 
> For what it's worth, I don't think it's such a big issue in this case - you're
> asking for THPs, and you're getting them - though maybe not efficiently because
> these aren't at the PMD level. But this is also true for e.g x86 where an
> an order-3 folio or order-7 folio are completely useless for the hardware, since
> there is no mTHP to speak of (and, with that, you also lose PG_active and
> PG_reference precision as you only keep one of each for each folio, amongst
> other tradeoffs).
> 

So there is hardware TLB coalescing on almost every new AMD CPU (I believe at 32kB
granularity), and we would benefit from that. On top of that, we would reduce page faults.

>>
>> In pagemap.h, we have
>>
>> #define PREFERRED_MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
>> #define MAX_PAGECACHE_ORDER	min(MAX_XAS_ORDER, PREFERRED_MAX_PAGECACHE_ORDER)
>>
>> So MAX_PAGECACHE_ORDER <= HPAGE_PMD_ORDER is always true.
>>
>> At filesystem/inode setup: mapping_set_folio_order_range() 
>>
>> 	if (max > MAX_PAGECACHE_ORDER)
>> 		max = MAX_PAGECACHE_ORDER;
>>
>> so mapping_max_folio_order() <= MAX_PAGECACHE_ORDER is also always true.
> 
> Right, good point. (well, I suppose it will not be true in the future as
> someone tries to harness PMD-level cont bits)
> 
>>
>> which means mapping_max_folio_order(mapping) <= MAX_PAGECACHE_ORDER <= HPAGE_PMD_ORDER is always
>> true, and you dont need the min3(..) in your diff.
>>
>> Now the question is if then why not just do:
>>
>> 	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
>> 		if (mapping_large_folio_support(mapping)) {
>> 			force_thp_readahead = true;
>> 			thp_order = min_t(unsigned int,
>> 					  mapping_max_folio_order(mapping),
>> 					  get_order(SZ_2M));
>> 		}
>> 	}
>>
>>
>> This is because this will regress the 16K ARM case where we already got 32M
>> folios. Someone might upgrade the kernel and start getting 2M folios now.
> 
> So maybe limit to 32MB? It's still arbitrary but at least you get simpler
> logic. If the architecture does not support 32MiB folios, it will clamp
> the maximum folio order to HPAGE_PMD_ORDER, and you get the same result.
> 
> Does this sound correct?
> 

Yes, so if we replace it with SZ_32M, it sounds correct. I just think
the 32M size is too large. But as you pointed out, even 2M can be too large...


>>
>> So IMHO the current code is correct, both in maintaining what the
>> existing behaviour has been, but also trying to provide large folios to
>> architectures that currently dont get them but in a controlled way.
>>
>> Later on, when we think or evaluate that 2M was too small a cap, it can
>> be raised. I think its very easy to raise a cap, but much more difficult
>> to make the cap smaller, as we might get reports on kernel upgrades that
>> someone was getting a certain size of folio but now its much smaller.
> 
> I don't think this is a valid argument. You can also have reports that now
> folios are too large and they're regressing workloads because of thrashing/
> compaction/possibly TLB thrashing. And indeed we have several reports of that
> (https://lore.kernel.org/linux-fsdevel/20260403193535.9970-1-dipiets@amazon.it/
> comes to mind, plus all of the anon THP regressions over the years, including
> in $current_year).

Ack, and yes, we have seen both thrashing and compaction issues due to xfs
large folios in the fleet on x86.

> 
> Bottom line is that changing things will always affect someone :) Particularly
> since the logic we have is not too careful at deciding what should or should
> not be a THP (both in anon and file cases). And if (once?) we make it smarter,
> it will surely also regress someone!
> 

Yes, I completely agree on this as well.

So personally I prefer keeping the cap at 2M, at least initially, while we
try to solve the issues we see with 2M alone: we are already seeing reports
of thrashing and compaction with just 2M. I also don't think the logic in
this patch, with just an if/else, is that complicated.

Matthew, Jan, do you have any thoughts or strong preferences on the cap size?

Thanks!




Thread overview: 10+ messages
2026-05-28 16:55 [PATCH v6 0/2] mm: improve large folio readahead for exec memory Usama Arif
2026-05-28 16:55 ` [PATCH v6 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
2026-05-29  9:47   ` Pedro Falcato
2026-05-28 16:55 ` [PATCH v6 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order Usama Arif
2026-05-29 10:01   ` Pedro Falcato
2026-05-29 12:19     ` Usama Arif
2026-05-29 13:40       ` Pedro Falcato
2026-05-29 14:11         ` Usama Arif
2026-05-29 12:36   ` Usama Arif
2026-05-28 20:27 ` [PATCH v6 0/2] mm: improve large folio readahead for exec memory Andrew Morton
