[PATCH v7 0/2] mm: improve large folio readahead for exec memory

* [PATCH v7 0/2] mm: improve large folio readahead for exec memory
@ 2026-06-01 10:21 Usama Arif
  2026-06-01 10:21 ` [PATCH v7 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
  2026-06-01 10:21 ` [PATCH v7 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order Usama Arif
  0 siblings, 2 replies; 4+ messages in thread
From: Usama Arif @ 2026-06-01 10:21 UTC (permalink / raw)
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: pfalcato, r, jack, Andrew Donnellan, apopple, baohua, baolin.wang,
	brauner, catalin.marinas, dev.jain, kees, kevin.brodsky,
	lance.yang, Liam R. Howlett, linux-arm-kernel, linux-fsdevel,
	linux-kernel, ljs, mhocko, npache, pasha.tatashin, rmclure, rppt,
	surenb, vbabka, Al Viro, ziy, hannes, kas, shakeel.butt,
	kernel-team, Usama Arif

Hopefully this is the last revision. The only change from the previous
revision is that the logic for deciding THP order was simplified and
the max is now capped to 2M. Thanks Pedro and Jan for the suggestion
and the discussion!

The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
512MB executable file on ext4, averaged over 3 runs:

  Phase      | Baseline     | Patched      | Improvement
  -----------|--------------|--------------|------------------
  Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
  Random     | 76.0 ms      | 58.3 ms      | 23% faster
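
For readers checking the table: the "Improvement" column appears to be the
relative reduction in elapsed time, not a throughput ratio. A minimal sketch
of that arithmetic, using only the numbers from the table (the helper names
are hypothetical and are not part of the benchmark):

```c
/*
 * Illustrative arithmetic only; the function names are made up for this
 * note and do not come from the benchmark harness.
 */

/* Relative reduction in elapsed time, as a percentage of the baseline. */
static double time_reduction_pct(double baseline_ms, double patched_ms)
{
	return (baseline_ms - patched_ms) / baseline_ms * 100.0;
}

/* The same data expressed as a speedup factor (baseline / patched). */
static double speedup_factor(double baseline_ms, double patched_ms)
{
	return baseline_ms / patched_ms;
}
```

With the cold-fault row (83.4 ms and 41.3 ms) the reduction is roughly 50%
and the speedup roughly 2x; with the random row (76.0 ms and 58.3 ms) the
reduction is roughly 23% and the speedup roughly 1.3x.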

The patches are on top of mm-unstable from 28 May
(8a74e22643189e0ae339afc91110ddb4cab1941b), which includes patch [1]
that makes mmap_miss accounting symmetric for VM_SEQ_READ, as pointed
out by sashiko in the previous revision.

[1] https://lore.kernel.org/all/20260525145751.2671248-1-usama.arif@linux.dev/ 

v6 -> v7: https://lore.kernel.org/all/20260528165635.2068012-1-usama.arif@linux.dev/
- Simplify logic and just cap the max THP order to 2M (Pedro and Jan)

v5 -> v6: https://lore.kernel.org/all/20260522162422.3856502-1-usama.arif@linux.dev/
- Based on top of patch [1] (sashiko)
- Changes to commit message to make it more accurate for patch 1 and skip
  mmap_miss decrement as well. (sashiko)
- Keep old behaviour if large folio mappings are not enabled (sashiko).
- sashiko pointed to a TOCTOU data race that was pre-existing. My patch
  could make it worse. Don't make it worse by introducing thp_order local
  variable.

v3 -> v5: https://lore.kernel.org/all/20260402181326.3107102-1-usama.arif@linux.dev/
- (Looks like I messed up the versioning here and went directly from
  v3 to v5.)
- Drop patches for elf thp unmapped area alignment and deal with them
  separately. These patches will just bring folios smaller than PMD
  at the same level as PMD. The 2 patches now should be much easier
  to merge.
- Tackle size of THP for exec pages at the same point as PMD instead
  of tackling using exec_folio_order() (Ryan during LSFMM, Thanks!)

v2 -> v3: https://lore.kernel.org/all/20260320140315.979307-1-usama.arif@linux.dev/
- Take into account READ_ONLY_THP_FOR_FS for elf alignment by aligning
  to HPAGE_PMD_SIZE limited to 2M (Rui)
- Reviewed-by tags for patch 1 from Kiryl and Jan
- Remove preferred_exec_order() (Jan)
- Change ra->order to HPAGE_PMD_ORDER if vma_pages(vma) >= HPAGE_PMD_NR
  otherwise use exec_folio_order() with gfp &= ~__GFP_RECLAIM for
  do_sync_mmap_readahead().
- Change exec_folio_order() to return 2M (cont-pte size) for 64K base
  page size for arm64.
- remove bprm->file NULL check (Matthew)
- Change filp to file (Matthew)
- Improve checking of p_vaddr and p_vaddr (Rui and Matthew)

v1 -> v2: https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev/
- disable mmap_miss logic for VM_EXEC (Jan Kara)
- Align in elf only when segment VA and file offset are already aligned (Rui)
- preferred_exec_order() for VM_EXEC sync mmap_readahead which takes into
  account zone high watermarks (as an approximation of memory pressure)
  (David, or at least my approach to what David suggested in [1] :))
- Extend max alignment to mapping_max_folio_size() instead of
  exec_folio_order()
 
Usama Arif (2):
  mm: bypass mmap_miss heuristic for VM_EXEC readahead
  mm: use mapping_max_folio_order() for force_thp_readahead order

 mm/filemap.c | 44 +++++++++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 15 deletions(-)

-- 
2.52.0




* [PATCH v7 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead
  2026-06-01 10:21 [PATCH v7 0/2] mm: improve large folio readahead for exec memory Usama Arif
@ 2026-06-01 10:21 ` Usama Arif
  2026-06-01 10:21 ` [PATCH v7 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order Usama Arif
  1 sibling, 0 replies; 4+ messages in thread
From: Usama Arif @ 2026-06-01 10:21 UTC (permalink / raw)
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: pfalcato, r, jack, Andrew Donnellan, apopple, baohua, baolin.wang,
	brauner, catalin.marinas, dev.jain, kees, kevin.brodsky,
	lance.yang, Liam R. Howlett, linux-arm-kernel, linux-fsdevel,
	linux-kernel, ljs, mhocko, npache, pasha.tatashin, rmclure, rppt,
	surenb, vbabka, Al Viro, ziy, hannes, kas, shakeel.butt,
	kernel-team, Usama Arif

The mmap_miss heuristic is intended to stop speculative mmap readahead
when a file looks like a random-access workload. That does not fit the
VM_EXEC path very well.

VM_EXEC readahead is already constrained differently from ordinary mmap
read-around: it is bounded by the VMA, uses exec_folio_order() to choose
an order useful for executable mappings, and sets async_size to 0 so it
does not create follow-on readahead. When VM_HUGEPAGE is also present,
the larger readahead is an explicit userspace opt-in.

The mmap_miss counter is decremented from cache-hit paths in
do_async_mmap_readahead() and filemap_map_pages(). Those paths are not
always enough to balance the synchronous miss increments for executable
mappings. In particular, when fault-around is effectively disabled, such
as configurations where fault_around_pages is 1, filemap_map_pages() is
not reached from the fault path. The counter can then become a stale
throttle for VM_EXEC mappings and suppress the readahead behavior that
the executable-specific path is trying to provide.

Skip both mmap_miss increments and decrements for VM_EXEC mappings,
matching the existing VM_SEQ_READ treatment and keeping the counter
accounting symmetric.
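
As a rough userspace model of the accounting described above (not kernel
code): the toy_* names and flag values below are hypothetical stand-ins,
READ_ONCE()/WRITE_ONCE() are omitted, and the throttle threshold is assumed
to be the MMAP_LOTSAMISS check used on the non-forced mmap readahead path.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's VM_* bits; values are illustrative. */
#define TOY_VM_SEQ_READ		0x1u
#define TOY_VM_EXEC		0x2u

/* Assumed to mirror MMAP_LOTSAMISS in mm/filemap.c. */
#define TOY_MMAP_LOTSAMISS	100u

struct toy_ra {
	unsigned short mmap_miss;
};

/* After this patch, both flags opt a mapping out of mmap_miss accounting. */
static bool toy_skips_accounting(unsigned int vm_flags)
{
	return (vm_flags & (TOY_VM_SEQ_READ | TOY_VM_EXEC)) != 0;
}

/* Models the increment taken on a synchronous readahead miss. */
static void toy_sync_miss(struct toy_ra *ra, unsigned int vm_flags)
{
	if (toy_skips_accounting(vm_flags))
		return;
	if (ra->mmap_miss < TOY_MMAP_LOTSAMISS * 10)
		ra->mmap_miss++;
}

/* Models the decrement taken on a cache-hit path. */
static void toy_cache_hit(struct toy_ra *ra, unsigned int vm_flags)
{
	if (toy_skips_accounting(vm_flags))
		return;
	if (ra->mmap_miss)
		ra->mmap_miss--;
}

/* Read-around is suppressed once too many misses have accumulated. */
static bool toy_readahead_throttled(const struct toy_ra *ra)
{
	return ra->mmap_miss > TOY_MMAP_LOTSAMISS;
}
```

In this model, a mapping without either flag that only ever sees misses
crosses the threshold and stays throttled, while a mapping with the
executable flag never touches the counter in either direction, so it can
neither be throttled by stale misses nor drift out of balance.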

Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
---
 mm/filemap.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index cca20e350c95..a16b33e0fc71 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3339,7 +3339,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 		}
 	}
 
-	if (!(vm_flags & VM_SEQ_READ)) {
+	if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
 		/* Avoid banging the cache line if not needed */
 		mmap_miss = READ_ONCE(ra->mmap_miss);
 		if (mmap_miss < MMAP_LOTSAMISS * 10)
@@ -3434,12 +3434,12 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 	 * times for a single folio and break the balance with mmap_miss
 	 * increase in do_sync_mmap_readahead().
 	 *
-	 * VM_SEQ_READ mappings skip the mmap_miss increment in
+	 * VM_SEQ_READ and VM_EXEC mappings skip the mmap_miss increment in
 	 * do_sync_mmap_readahead(), so skip the decrement here as well to
 	 * keep the counter symmetric.
 	 */
 	if (likely(!folio_test_locked(folio)) &&
-	    !(vmf->vma->vm_flags & VM_SEQ_READ)) {
+	    !(vmf->vma->vm_flags & (VM_SEQ_READ | VM_EXEC))) {
 		mmap_miss = READ_ONCE(ra->mmap_miss);
 		if (mmap_miss)
 			WRITE_ONCE(ra->mmap_miss, --mmap_miss);
@@ -3941,14 +3941,14 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		 * Don't decrease mmap_miss in this scenario to make sure
 		 * we can stop read-ahead.
 		 *
-		 * VM_SEQ_READ mappings skip the mmap_miss increment in
-		 * do_sync_mmap_readahead(), so skip the decrement here as
-		 * well to keep the counter symmetric.
+		 * VM_SEQ_READ and VM_EXEC mappings skip the mmap_miss
+		 * increment in do_sync_mmap_readahead(), so skip the
+		 * decrement here as well to keep the counter symmetric.
 		 */
 		if ((map_ret & VM_FAULT_NOPAGE) &&
 		    !(vmf->flags & FAULT_FLAG_TRIED) &&
 		    !folio_test_workingset(folio) &&
-		    !(vma->vm_flags & VM_SEQ_READ)) {
+		    !(vma->vm_flags & (VM_SEQ_READ | VM_EXEC))) {
 			unsigned short mmap_miss;
 
 			mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
-- 
2.52.0




* [PATCH v7 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order
  2026-06-01 10:21 [PATCH v7 0/2] mm: improve large folio readahead for exec memory Usama Arif
  2026-06-01 10:21 ` [PATCH v7 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
@ 2026-06-01 10:21 ` Usama Arif
  2026-06-01 16:22   ` Jan Kara
  1 sibling, 1 reply; 4+ messages in thread
From: Usama Arif @ 2026-06-01 10:21 UTC (permalink / raw)
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: pfalcato, r, jack, Andrew Donnellan, apopple, baohua, baolin.wang,
	brauner, catalin.marinas, dev.jain, kees, kevin.brodsky,
	lance.yang, Liam R. Howlett, linux-arm-kernel, linux-fsdevel,
	linux-kernel, ljs, mhocko, npache, pasha.tatashin, rmclure, rppt,
	surenb, vbabka, Al Viro, ziy, hannes, kas, shakeel.butt,
	kernel-team, Usama Arif

The force_thp_readahead path in do_sync_mmap_readahead() is gated on
HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and always requests
HPAGE_PMD_ORDER / HPAGE_PMD_NR. On configurations where HPAGE_PMD_ORDER
exceeds MAX_PAGECACHE_ORDER, notably arm64 with a 64K base page size,
VM_HUGEPAGE mappings cannot use this path and fall back to the non-forced
mmap readahead path even when the mapping supports useful large folios.

Enable forced readahead for mappings that support large folios and request
the max folio order supported by the mapping, capped at 2MB.

2MB is chosen as the cap because it matches the PMD size on x86_64
and on arm64 with 4K base pages, so the size/memory-pressure tradeoff
for folios of that size is already well understood. On arm64 with 16K
and 64K base page sizes, 2MB is also the contiguous-PTE (contpte)
block size, so the resulting folios coalesce into a single TLB entry
and reduce TLB pressure on the readahead path. As a result, 32MB
folios will not be faulted in on arm64 with 16K base pages, but with
contpte the performance difference should be negligible.

The final allocation order may still be clamped by page_cache_ra_order()
to the mapping and request geometry, but this gives VM_HUGEPAGE mappings
on such configurations a large-folio readahead request instead of
dropping back to base-page readahead.
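
To make the order arithmetic concrete, here is a small userspace sketch of
the cap and the index alignment (not the kernel implementation). The toy_*
helpers are hypothetical; toy_get_order() only approximates get_order() by
taking the base page size as a parameter instead of using PAGE_SIZE.

```c
#include <assert.h>

/*
 * Smallest order such that (page_size << order) >= size. Hypothetical
 * stand-in for get_order(); assumes size is at least one base page.
 */
static unsigned int toy_get_order(unsigned long size, unsigned long page_size)
{
	unsigned int order = 0;

	assert(page_size != 0 && size >= page_size);
	while ((page_size << order) < size)
		order++;
	return order;
}

/* Requested order: the mapping's max folio order, capped at 2MB. */
static unsigned int toy_thp_order(unsigned int mapping_max_order,
				  unsigned long page_size)
{
	unsigned int cap = toy_get_order(2UL << 20, page_size);

	return mapping_max_order < cap ? mapping_max_order : cap;
}

/* Round a page index down to the start of its order-sized block. */
static unsigned long toy_align_index(unsigned long index, unsigned int order)
{
	unsigned long nr_pages = 1UL << order;

	return index & ~(nr_pages - 1);
}
```

Under these assumptions the 2MB cap works out to order 9 with 4K base pages,
order 7 with 16K and order 5 with 64K. A mapping whose max folio order is
11 on a 64K kernel (128MB) would therefore request order 5, and a fault at
page index 100 would be aligned down to index 96.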

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/filemap.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a16b33e0fc71..9cf89efaf3f1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3312,14 +3312,26 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	struct file *fpin = NULL;
 	vm_flags_t vm_flags = vmf->vma->vm_flags;
 	bool force_thp_readahead = false;
+	unsigned int thp_order = 0;
 	unsigned short mmap_miss;

 	ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;

 	/* Use the readahead code, even if readahead is disabled */
-	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
-		force_thp_readahead = true;
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
+		/*
+		 * Cap max THP order at 2MB: this is the common PMD-sized
+		 * hugepage size, and it avoids memory pressure from very
+		 * large forced readahead when mapping_max_folio_order() is
+		 * high (for example, 128MB with 64K base pages on arm64).
+		 */
+		if (mapping_large_folio_support(mapping)) {
+			force_thp_readahead = true;
+			thp_order = min_t(unsigned int,
+					  mapping_max_folio_order(mapping),
+					  get_order(SZ_2M));
+		}
+	}

 	if (!force_thp_readahead) {
 		/*
@@ -3354,17 +3366,19 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	}

 	if (force_thp_readahead) {
+		unsigned long folio_nr_pages = 1UL << thp_order;
+
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
-		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
-		ra->size = HPAGE_PMD_NR;
+		ractl._index &= ~(folio_nr_pages - 1);
+		ra->size = folio_nr_pages;
 		/*
-		 * Fetch two PMD folios, so we get the chance to actually
+		 * Fetch two folios so we get the chance to actually
 		 * readahead, unless we've been told not to.
 		 */
 		if (!(vm_flags & VM_RAND_READ))
 			ra->size *= 2;
-		ra->async_size = HPAGE_PMD_NR;
-		ra->order = HPAGE_PMD_ORDER;
+		ra->async_size = folio_nr_pages;
+		ra->order = thp_order;
 		page_cache_ra_order(&ractl, ra);
 		return fpin;
 	}
-- 
2.52.0


* Re: [PATCH v7 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order
  2026-06-01 10:21 ` [PATCH v7 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order Usama Arif
@ 2026-06-01 16:22   ` Jan Kara
  0 siblings, 0 replies; 4+ messages in thread
From: Jan Kara @ 2026-06-01 16:22 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, pfalcato, r,
	jack, Andrew Donnellan, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam R. Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	ljs, mhocko, npache, pasha.tatashin, rmclure, rppt, surenb,
	vbabka, Al Viro, ziy, hannes, kas, shakeel.butt, kernel-team

On Mon 01-06-26 03:21:18, Usama Arif wrote:
> The force_thp_readahead path in do_sync_mmap_readahead() is gated on
> HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and always requests
> HPAGE_PMD_ORDER / HPAGE_PMD_NR. On configurations where HPAGE_PMD_ORDER
> exceeds MAX_PAGECACHE_ORDER, notably arm64 with a 64K base page size,
> VM_HUGEPAGE mappings cannot use this path and fall back to the non-forced
> mmap readahead path even when the mapping supports useful large folios.
> 
> Enable forced readahead for mappings that support large folios and request
> the max folio order supported by the mapping, capped at 2M.
> 2MB is chosen as the cap because it matches the PMD size on x86_64
> and on arm64 with 4K base pages, so the size/memory-pressure tradeoff
> for folios of that size is already well understood. On arm64 with 16K
> and 64K base page sizes, 2MB is also the contiguous-PTE (contpte)
> block size, so the resulting folios coalesce into a single TLB entry
> and reduce TLB pressure on the readahead path. This will result
> in 32M folios not being faulted in with 16K base page size for arm64,
> but with contpte, the performance difference should be negligible.
> 
> The final allocation order may still be clamped by page_cache_ra_order()
> to the mapping and request geometry, but this gives VM_HUGEPAGE mappings
> on such configurations a large-folio readahead request instead of
> dropping back to base-page readahead.
> 
> Signed-off-by: Usama Arif <usama.arif@linux.dev>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/filemap.c | 30 ++++++++++++++++++++++--------
>  1 file changed, 22 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a16b33e0fc71..9cf89efaf3f1 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3312,14 +3312,26 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	struct file *fpin = NULL;
>  	vm_flags_t vm_flags = vmf->vma->vm_flags;
>  	bool force_thp_readahead = false;
> +	unsigned int thp_order = 0;
>  	unsigned short mmap_miss;
>  
>  	ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
>  
>  	/* Use the readahead code, even if readahead is disabled */
> -	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> -	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
> -		force_thp_readahead = true;
> +	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE)) {
> +		/*
> +		 * Cap max THP order at 2MB: this is the common PMD-sized
> +		 * hugepage size, and it avoids memory pressure from very
> +		 * large forced readahead when mapping_max_folio_order() is
> +		 * high (for example, 128MB with 64K base pages on arm64).
> +		 */
> +		if (mapping_large_folio_support(mapping)) {
> +			force_thp_readahead = true;
> +			thp_order = min_t(unsigned int,
> +					  mapping_max_folio_order(mapping),
> +					  get_order(SZ_2M));
> +		}
> +	}
>  
>  	if (!force_thp_readahead) {
>  		/*
> @@ -3354,17 +3366,19 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	}
>  
>  	if (force_thp_readahead) {
> +		unsigned long folio_nr_pages = 1UL << thp_order;
> +
>  		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> -		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
> -		ra->size = HPAGE_PMD_NR;
> +		ractl._index &= ~(folio_nr_pages - 1);
> +		ra->size = folio_nr_pages;
>  		/*
> -		 * Fetch two PMD folios, so we get the chance to actually
> +		 * Fetch two folios so we get the chance to actually
>  		 * readahead, unless we've been told not to.
>  		 */
>  		if (!(vm_flags & VM_RAND_READ))
>  			ra->size *= 2;
> -		ra->async_size = HPAGE_PMD_NR;
> -		ra->order = HPAGE_PMD_ORDER;
> +		ra->async_size = folio_nr_pages;
> +		ra->order = thp_order;
>  		page_cache_ra_order(&ractl, ra);
>  		return fpin;
>  	}
> -- 
> 2.52.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


