[PATCH v5 0/2] mm: improve large folio readahead for exec memory

* [PATCH v5 0/2] mm: improve large folio readahead for exec memory
@ 2026-05-22 16:23 Usama Arif
  2026-05-22 16:23 ` [PATCH v5 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Usama Arif @ 2026-05-22 16:23 UTC
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: r, jack, Andrew Donnellan, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam R.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	ljs, mhocko, npache, pasha.tatashin, rmclure, rppt, surenb,
	vbabka, Al Viro, wilts.infradead.org, "linux-fsdevel, ziy,
	hannes, kas, shakeel.butt, kernel-team, Usama Arif

Two checks in do_sync_mmap_readahead() limit large-folio readahead:

  1. The mmap_miss heuristic is meant to throttle wasteful speculative
     readahead. It is currently also applied to the VM_EXEC readahead
     path, which is targeted rather than speculative. Once mmap_miss exceeds
     MMAP_LOTSAMISS, exec readahead - including the large-folio
     order requested by exec_folio_order() - is disabled. On
     configurations where the mmap_miss decrement paths are not
     active (see patch 1) the counter only grows, so exec readahead
     is permanently disabled after the first 100 faults.

  2. The force_thp_readahead path is gated only on
     HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and always drives the
     readahead at HPAGE_PMD_ORDER. Configurations where
     HPAGE_PMD_ORDER exceeds MAX_PAGECACHE_ORDER never reach this
     path, even when the mapping itself supports usefully large
     folios well below the cap.

Both issues are most visible on arm64 with a 64K base page size,
where HPAGE_PMD_ORDER is 13 (512MB) -- above MAX_PAGECACHE_ORDER
(11) -- and where fault_around_pages collapses to 1, which disables
should_fault_around() and with it filemap_map_pages(), one of the two
mmap_miss decrement sites. However, the fixes are
architecture-agnostic: patch 1 reflects the
nature of VM_EXEC readahead regardless of base page size, and
patch 2 generalises the gate so any mapping advertising a usefully
large maximum folio order can benefit.
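
The numbers above follow from simple shift arithmetic. A minimal
sketch, assuming PAGE_SHIFT = 16 and a 512MB PMD (shift 29) as on the
arm64 64K configuration described here; the helper names are
hypothetical and are not kernel functions:

```c
/* Illustrative values for arm64 with a 64K base page size. */
enum {
	MODEL_PAGE_SHIFT          = 16,	/* 64K base pages */
	MODEL_PMD_SHIFT           = 29,	/* one PMD maps 512MB */
	MODEL_MAX_PAGECACHE_ORDER = 11,	/* value quoted in this cover letter */
};

/* Order of a 2^size_shift byte block, counted in base pages. */
static inline unsigned int order_of(unsigned int size_shift,
				    unsigned int page_shift)
{
	return size_shift - page_shift;
}

/* A 64K fault-around window expressed in base pages. */
static inline unsigned long fault_around_pages(unsigned int page_shift)
{
	return 65536UL >> page_shift;
}

/* Old gate: the PMD order must fit in the page cache order limit. */
static inline int pmd_order_fits(unsigned int pmd_order,
				 unsigned int max_order)
{
	return pmd_order <= max_order;
}
```

With these values order_of(29, 16) is 13, which does not fit under
the limit of 11, and fault_around_pages(16) is 1.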

I created a benchmark that mmaps a large executable file and calls
RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
fault + readahead cost. "Random" first faults in all pages with a
sequential sweep (not measured), then measures time for calling random
offsets, isolating iTLB miss cost for scattered execution.

The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
512MB executable file on ext4, averaged over 3 runs:

  Phase      | Baseline     | Patched      | Improvement
  -----------|--------------|--------------|------------------
  Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
  Random     | 76.0 ms      | 58.3 ms      | 23% faster

The patches are based on fb61d7dda82e416a89e1f918a08535ee38976995 (akpm/mm-unstable)
from 22 May.


v3 -> v4: https://lore.kernel.org/all/20260402181326.3107102-1-usama.arif@linux.dev/
- Drop the patches for ELF THP unmapped area alignment and deal with
  them separately. The patches in this series just bring folios
  smaller than PMD size to the same level as PMD-sized ones. The 2
  patches should now be much easier to merge.
- Tackle size of THP for exec pages at the same point as PMD instead
  of tackling using exec_folio_order() (Ryan during LSFMM, Thanks!)

v2 -> v3: https://lore.kernel.org/all/20260320140315.979307-1-usama.arif@linux.dev/
- Take into account READ_ONLY_THP_FOR_FS for elf alignment by aligning
  to HPAGE_PMD_SIZE limited to 2M (Rui)
- Reviewed-by tags for patch 1 from Kiryl and Jan
- Remove preferred_exec_order() (Jan)
- Change ra->order to HPAGE_PMD_ORDER if vma_pages(vma) >= HPAGE_PMD_NR
  otherwise use exec_folio_order() with gfp &= ~__GFP_RECLAIM for
  do_sync_mmap_readahead().
- Change exec_folio_order() to return 2M (cont-pte size) for 64K base
  page size for arm64.
- remove bprm->file NULL check (Matthew)
- Change filp to file (Matthew)
- Improve checking of p_vaddr and p_offset (Rui and Matthew)

v1 -> v2: https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev/
- disable mmap_miss logic for VM_EXEC (Jan Kara)
- Align in elf only when segment VA and file offset are already aligned (Rui)
- preferred_exec_order() for VM_EXEC sync mmap_readahead which takes into
  account zone high watermarks (as an approximation of memory pressure)
  (David, or at least my approach to what David suggested in [1] :))
- Extend max alignment to mapping_max_folio_size() instead of
  exec_folio_order()
 
Usama Arif (2):
  mm: bypass mmap_miss heuristic for VM_EXEC readahead
  mm: use mapping_max_folio_order() for force_thp_readahead order

 mm/filemap.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

-- 
2.52.0




* [PATCH v5 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead
  2026-05-22 16:23 [PATCH v5 0/2] mm: improve large folio readahead for exec memory Usama Arif
@ 2026-05-22 16:23 ` Usama Arif
  2026-05-22 16:23 ` [PATCH v5 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order Usama Arif
  2026-05-22 19:20 ` [PATCH v5 0/2] mm: improve large folio readahead for exec memory Andrew Morton
  2 siblings, 0 replies; 4+ messages in thread
From: Usama Arif @ 2026-05-22 16:23 UTC
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: r, jack, Andrew Donnellan, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam R.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	ljs, mhocko, npache, pasha.tatashin, rmclure, rppt, surenb,
	vbabka, Al Viro, wilts.infradead.org, "linux-fsdevel, ziy,
	hannes, kas, shakeel.butt, kernel-team, Usama Arif

The mmap_miss counter in do_sync_mmap_readahead() tracks whether
readahead is useful for mmap'd file access. It is incremented by 1 on
every page cache miss in do_sync_mmap_readahead(), and decremented in
two places:

  - filemap_map_pages(): decremented by N for each of N pages
    successfully mapped via fault-around (pages found already in cache,
    evidence readahead was useful). Only pages not in the workingset
    count as hits.

  - do_async_mmap_readahead(): decremented by 1 when a page with
    PG_readahead is found in cache.

When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
disabled, including the targeted VM_EXEC readahead [1] that requests
large folio orders for contpte mapping.

On arm64 with 64K base pages, both decrement paths are inactive:

  1. filemap_map_pages() is never called because fault_around_pages
     (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
     requires fault_around_pages > 1. With only 1 page in the
     fault-around window, there is nothing "around" to map.

  2. do_async_mmap_readahead() never fires for exec mappings because
     exec readahead sets async_size = 0, so no PG_readahead markers
     are placed.

With no decrements, mmap_miss monotonically increases past
MMAP_LOTSAMISS after 100 page faults, disabling all subsequent
exec readahead.

Fix this by excluding VM_EXEC VMAs from the mmap_miss logic, similar
to how VM_SEQ_READ is already excluded. The exec readahead path is
targeted (one folio at the fault location, async_size=0), not
speculative prefetch, so the mmap_miss heuristic designed to throttle
wasteful speculative readahead should not apply to it.
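
The counter behaviour described above can be modelled in user space.
This is only a sketch under stated assumptions: no decrement path is
active, the threshold check happens after the increment, and
struct ra_model, model_sync_fault() and model_faults() are
hypothetical names, not kernel code:

```c
#include <stdbool.h>

#define MMAP_LOTSAMISS 100	/* threshold quoted in this commit message */

/* Hypothetical stand-in for the per-file readahead state. */
struct ra_model {
	unsigned int mmap_miss;
};

/*
 * Model one sync-readahead fault with no decrement path active.
 * 'exempt' models a VMA excluded from the heuristic (VM_SEQ_READ
 * today, VM_EXEC as well with this patch).
 * Returns true if readahead would still be attempted.
 */
static inline bool model_sync_fault(struct ra_model *ra, bool exempt)
{
	if (exempt)
		return true;
	if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
		ra->mmap_miss++;
	return ra->mmap_miss <= MMAP_LOTSAMISS;
}

/* Count how many of 'nr_faults' faults still get readahead. */
static inline unsigned int model_faults(unsigned int nr_faults, bool exempt)
{
	struct ra_model ra = { .mmap_miss = 0 };
	unsigned int served = 0;

	for (unsigned int i = 0; i < nr_faults; i++)
		if (model_sync_fault(&ra, exempt))
			served++;
	return served;
}
```

In this model, 1000 faults on a non-exempt VMA get readahead only for
the first 100, while an exempt VMA gets it for all 1000.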

[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/

Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 5aaba0d3e81d..f45a1b74870d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3339,7 +3339,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 		}
 	}
 
-	if (!(vm_flags & VM_SEQ_READ)) {
+	if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
 		/* Avoid banging the cache line if not needed */
 		mmap_miss = READ_ONCE(ra->mmap_miss);
 		if (mmap_miss < MMAP_LOTSAMISS * 10)
-- 
2.52.0




* [PATCH v5 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order
  2026-05-22 16:23 [PATCH v5 0/2] mm: improve large folio readahead for exec memory Usama Arif
  2026-05-22 16:23 ` [PATCH v5 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
@ 2026-05-22 16:23 ` Usama Arif
  2026-05-22 19:20 ` [PATCH v5 0/2] mm: improve large folio readahead for exec memory Andrew Morton
  2 siblings, 0 replies; 4+ messages in thread
From: Usama Arif @ 2026-05-22 16:23 UTC
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: r, jack, Andrew Donnellan, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam R.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	ljs, mhocko, npache, pasha.tatashin, rmclure, rppt, surenb,
	vbabka, Al Viro, wilts.infradead.org, "linux-fsdevel, ziy,
	hannes, kas, shakeel.butt, kernel-team, Usama Arif

The force_thp_readahead path in do_sync_mmap_readahead() was gated on
the compile-time check HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and
always drove the readahead at HPAGE_PMD_ORDER / HPAGE_PMD_NR. On
configurations where HPAGE_PMD_ORDER exceeds MAX_PAGECACHE_ORDER --
notably arm64 with a 64K base page size, where HPAGE_PMD_ORDER is 13
(512MB) -- VM_HUGEPAGE mappings could never reach this path and fell
back to base-page readahead, even when the mapping itself could serve
usefully large folios well below the cap.

Widen the gate to

  min(HPAGE_PMD_ORDER, mapping_max_folio_order(mapping))
      <= MAX_PAGECACHE_ORDER

so force_thp_readahead engages whenever either order fits, and pick
ra->order accordingly:

  - HPAGE_PMD_ORDER when it fits (existing behaviour);
  - otherwise min(mapping_max_folio_order(mapping), get_order(SZ_2M)),
    capping the readahead folio at 2MB regardless of what the mapping
    advertises.

Size and align the readahead window from (1UL << ra->order) instead
of the hardcoded HPAGE_PMD_NR / HPAGE_PMD_ORDER so the chosen order
is honoured end-to-end.
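
The gate and order selection described above can be sketched as a
plain user-space model. Assumptions: the kernel macros are passed in
as ordinary numbers, the VM_HUGEPAGE and CONFIG_TRANSPARENT_HUGEPAGE
checks are left out, page_shift is at most 21 so that get_order(SZ_2M)
reduces to a shift difference, and all function names are
hypothetical:

```c
#include <stdbool.h>

#define MODEL_SZ_2M_SHIFT 21	/* log2(2MB) */

static inline unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Model of the widened gate. Returns true and stores the chosen order
 * in *order when the forced path would engage, false otherwise.
 */
static inline bool model_pick_order(unsigned int hpage_pmd_order,
				    unsigned int mapping_max_order,
				    unsigned int max_pagecache_order,
				    unsigned int page_shift,
				    unsigned int *order)
{
	if (min_uint(hpage_pmd_order, mapping_max_order) > max_pagecache_order)
		return false;

	if (hpage_pmd_order <= max_pagecache_order)
		*order = hpage_pmd_order;	/* existing behaviour */
	else
		*order = min_uint(mapping_max_order,
				  MODEL_SZ_2M_SHIFT - page_shift); /* 2MB cap */
	return true;
}

/* Align a page index down to the start of its 2^order page window. */
static inline unsigned long model_align_index(unsigned long index,
					      unsigned int order)
{
	unsigned long nr_pages = 1UL << order;

	return index & ~(nr_pages - 1);
}
```

For the 64K case (PMD order 13, limit 11, mapping maximum 11) the
model picks order 5, i.e. 32 pages of 64K, and a fault at page index
37 is aligned down to index 32.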

On arm64 with a 64K base page size this lets VM_HUGEPAGE mappings get
large-folio readahead at the mapping's supported order (capped at
2MB) rather than dropping back to base pages. 2MB is also the size of
an arm64 contiguous-PTE (contpte) block on a 64K base, so the
resulting folios coalesce into a single TLB entry and reduce TLB
pressure on the readahead path.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/filemap.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index f45a1b74870d..56fa715d66cd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3317,9 +3317,16 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	ractl._max_index = vmf->vma->vm_pgoff + vma_pages(vmf->vma) - 1;
 
 	/* Use the readahead code, even if readahead is disabled */
-	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-	    (vm_flags & VM_HUGEPAGE) && HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (vm_flags & VM_HUGEPAGE) &&
+	    min(HPAGE_PMD_ORDER, mapping_max_folio_order(mapping)) <= MAX_PAGECACHE_ORDER) {
 		force_thp_readahead = true;
+		if (HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER)
+			ra->order = HPAGE_PMD_ORDER;
+		else
+			ra->order = min_t(unsigned int,
+					  mapping_max_folio_order(mapping),
+					  get_order(SZ_2M));
+	}
 
 	if (!force_thp_readahead) {
 		/*
@@ -3354,17 +3361,18 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	}
 
 	if (force_thp_readahead) {
+		unsigned long folio_nr_pages = 1UL << ra->order;
+
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
-		ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
-		ra->size = HPAGE_PMD_NR;
+		ractl._index &= ~(folio_nr_pages - 1);
+		ra->size = folio_nr_pages;
 		/*
-		 * Fetch two PMD folios, so we get the chance to actually
+		 * Fetch two folios so we get the chance to actually
 		 * readahead, unless we've been told not to.
 		 */
 		if (!(vm_flags & VM_RAND_READ))
 			ra->size *= 2;
-		ra->async_size = HPAGE_PMD_NR;
-		ra->order = HPAGE_PMD_ORDER;
+		ra->async_size = folio_nr_pages;
 		page_cache_ra_order(&ractl, ra);
 		return fpin;
 	}
-- 
2.52.0




* Re: [PATCH v5 0/2] mm: improve large folio readahead for exec memory
  2026-05-22 16:23 [PATCH v5 0/2] mm: improve large folio readahead for exec memory Usama Arif
  2026-05-22 16:23 ` [PATCH v5 1/2] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
  2026-05-22 16:23 ` [PATCH v5 2/2] mm: use mapping_max_folio_order() for force_thp_readahead order Usama Arif
@ 2026-05-22 19:20 ` Andrew Morton
  2 siblings, 0 replies; 4+ messages in thread
From: Andrew Morton @ 2026-05-22 19:20 UTC
  To: Usama Arif
  Cc: david, willy, ryan.roberts, linux-mm, r, jack, Andrew Donnellan,
	apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
	kees, kevin.brodsky, lance.yang, Liam R.Howlett, linux-arm-kernel,
	linux-fsdevel, linux-kernel, ljs, mhocko, npache, pasha.tatashin,
	rmclure, rppt, surenb, vbabka, Al Viro, wilts.infradead.org,
	"linux-fsdevel, ziy, hannes, kas, shakeel.butt, kernel-team

On Fri, 22 May 2026 09:23:46 -0700 Usama Arif <usama.arif@linux.dev> wrote:

> Two checks in do_sync_mmap_readahead() limit large-folio readahead:
> 
>   1. The mmap_miss heuristic is meant to throttle wasteful speculative
>      readahead. It is currently also applied to the VM_EXEC readahead
>      path, which is targeted rather than speculative. Once mmap_miss exceeds
>      MMAP_LOTSAMISS, exec readahead - including the large-folio
>      order requested by exec_folio_order() - is disabled. On
>      configurations where the mmap_miss decrement paths are not
>      active (see patch 1) the counter only grows, so exec readahead
>      is permanently disabled after the first 100 faults.
> 
>   2. The force_thp_readahead path is gated only on
>      HPAGE_PMD_ORDER <= MAX_PAGECACHE_ORDER and always drives the
>      readahead at HPAGE_PMD_ORDER. Configurations where
>      HPAGE_PMD_ORDER exceeds MAX_PAGECACHE_ORDER never reach this
>      path, even when the mapping itself supports usefully large
>      folios well below the cap.
> 
> Both issues are most visible on arm64 with a 64K base page size,
> where HPAGE_PMD_ORDER is 13 (512MB) -- above MAX_PAGECACHE_ORDER
> (11) -- and where fault_around_pages collapses to 1, which disables
> should_fault_around() and with it filemap_map_pages(), one of the two
> mmap_miss decrement sites. However, the fixes are
> architecture-agnostic: patch 1 reflects the
> nature of VM_EXEC readahead regardless of base page size, and
> patch 2 generalises the gate so any mapping advertising a usefully
> large maximum folio order can benefit.
> 
> I created a benchmark that mmaps a large executable file and calls
> RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
> fault + readahead cost. "Random" first faults in all pages with a
> sequential sweep (not measured), then measures time for calling random
> offsets, isolating iTLB miss cost for scattered execution.
> 
> The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
> 512MB executable file on ext4, averaged over 3 runs:
> 
>   Phase      | Baseline     | Patched      | Improvement
>   -----------|--------------|--------------|------------------
>   Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
>   Random     | 76.0 ms      | 58.3 ms      | 23% faster

Well that's nice.

AI review might have found a few things:
	https://sashiko.dev/#/patchset/20260522162422.3856502-1-usama.arif@linux.dev




