public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
* [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
@ 2026-03-10 14:51 Usama Arif
  2026-03-10 14:51 ` [PATCH 1/4] arm64: request contpte-sized folios for exec memory Usama Arif
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Usama Arif @ 2026-03-10 14:51 UTC (permalink / raw)
  To: Andrew Morton, ryan.roberts, david
  Cc: ajd, anshuman.khandual, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, lorenzo.stoakes, npache, rmclure, Al Viro, will, willy,
	ziy, hannes, kas, shakeel.butt, kernel-team, Usama Arif

On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
into a single iTLB entry, reducing iTLB pressure for large executable
mappings.

exec_folio_order() was introduced [1] to request readahead at an
arch-preferred folio order for executable memory, enabling contpte
mapping on the fault path.

However, several things prevent this from working optimally on 16K and
64K page configurations:

1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
   produces the optimal contpte order for 4K pages. For 16K pages it
   returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
   returns order 0 (64K) instead of order 5 (2M). Patch 1 fixes this by
   using ilog2(CONT_PTES) which evaluates to the optimal order for all
   page sizes.

2. Even with the optimal order, the mmap_miss heuristic in
   do_sync_mmap_readahead() silently disables exec readahead after 100
   page faults. The mmap_miss counter tracks whether readahead is useful
   for mmap'd file access:

   - Incremented by 1 in do_sync_mmap_readahead() on every page cache
     miss (page needed IO).

   - Decremented by N in filemap_map_pages() for N pages successfully
     mapped via fault-around (pages found in cache without faulting,
     evidence that readahead was useful). Only non-workingset pages
     count as hits; recently evicted and re-read pages do not.

   - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
     marker page is found (indicates sequential consumption of readahead
     pages).

   When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
   disabled. On 64K pages, both decrement paths are inactive:

   - filemap_map_pages() is never called because fault_around_pages
     (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
     requires fault_around_pages > 1. With only 1 page in the
     fault-around window, there is nothing "around" to map.

   - do_async_mmap_readahead() never fires for exec mappings because
     exec readahead sets async_size = 0, so no PG_readahead markers
     are placed.

   With no decrements, mmap_miss monotonically increases past
   MMAP_LOTSAMISS after 100 faults, disabling exec readahead
   for the remainder of the mapping.
   Patch 2 fixes this by moving the VM_EXEC readahead block
   above the mmap_miss check, since exec readahead is targeted (one
   folio at the fault location, async_size=0), not speculative prefetch.

3. Even with correct folio order and readahead, contpte mapping requires
   the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
   The readahead path aligns file offsets and the buddy allocator aligns
   physical memory, but the virtual address depends on the VMA start.
   For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
   granularity, giving only a 1/32 chance of 2M alignment. When
   misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
   any folio in the VMA, resulting in zero iTLB coalescing benefit.

   Patch 3 fixes this for the main binary by bumping the ELF loader's
   alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.

   Patch 4 fixes this for shared libraries by adding a contpte-size
   alignment fallback in thp_get_unmapped_area_vmflags(). The existing
   PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
   libraries, so this smaller fallback (2M) succeeds where PMD fails.
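The 1/32 figure in point 3 can be sanity-checked with a toy model
(illustrative only; the PAGE_SHIFT and order values are the arm64
64K-page numbers quoted above, and the uniform-slot ASLR model is a
simplification):

```c
#include <assert.h>

#define PAGE_SHIFT	16			/* 64K base pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define CONT_PTE_SIZE	(PAGE_SIZE << 5)	/* order 5 -> 2M */

/* Count how many of nslots consecutive PAGE_SIZE-aligned ASLR slots
 * are also CONT_PTE_SIZE (2M) aligned. */
static unsigned long aligned_slots(unsigned long nslots)
{
	unsigned long hits = 0;

	for (unsigned long i = 0; i < nslots; i++)
		if (!((i * PAGE_SIZE) & (CONT_PTE_SIZE - 1)))
			hits++;
	return hits;
}
```

Exactly one of every 32 page-aligned slots lands on a 2M boundary,
matching the 1/32 probability above.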

I created a benchmark that mmaps a large executable file and calls
RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
fault + readahead cost. "Random" first faults in all pages with a
sequential sweep (not measured), then measures time for calling random
offsets, isolating iTLB miss cost for scattered execution.

The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
512MB executable file on ext4, averaged over 3 runs:

  Phase      | Baseline     | Patched      | Improvement
  -----------|--------------|--------------|------------------
  Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
  Random     | 76.0 ms      | 58.3 ms      | 23% faster

[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/

Usama Arif (4):
  arm64: request contpte-sized folios for exec memory
  mm: bypass mmap_miss heuristic for VM_EXEC readahead
  elf: align ET_DYN base to exec folio order for contpte mapping
  mm: align file-backed mmap to exec folio order in
    thp_get_unmapped_area

 arch/arm64/include/asm/pgtable.h |  9 ++--
 fs/binfmt_elf.c                  | 15 +++++++
 mm/filemap.c                     | 72 +++++++++++++++++---------------
 mm/huge_memory.c                 | 17 ++++++++
 4 files changed, 75 insertions(+), 38 deletions(-)

-- 
2.47.3



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/4] arm64: request contpte-sized folios for exec memory
  2026-03-10 14:51 [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages Usama Arif
@ 2026-03-10 14:51 ` Usama Arif
  2026-03-19  7:35   ` David Hildenbrand (Arm)
  2026-03-10 14:51 ` [PATCH 2/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Usama Arif @ 2026-03-10 14:51 UTC (permalink / raw)
  To: Andrew Morton, ryan.roberts, david
  Cc: ajd, anshuman.khandual, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, lorenzo.stoakes, npache, rmclure, Al Viro, will, willy,
	ziy, hannes, kas, shakeel.butt, kernel-team, Usama Arif

exec_folio_order() was introduced [1] to request readahead of executable
file-backed pages at an arch-preferred folio order, so that the hardware
can coalesce contiguous PTEs into fewer iTLB entries (contpte).

The current implementation uses ilog2(SZ_64K >> PAGE_SHIFT), which
requests 64K folios. This is optimal for 4K base pages (where CONT_PTES
= 16, contpte size = 64K), but suboptimal for 16K and 64K base pages:

Page size | Before (order) | After (order) | contpte
----------|----------------|---------------|--------
4K        | 4 (64K)        | 4 (64K)       | Yes (unchanged)
16K       | 2 (64K)        | 7 (2M)        | Yes (new)
64K       | 0 (64K)        | 5 (2M)        | Yes (new)

For 16K pages, CONT_PTES = 128 and the contpte size is 2M (order 7).
For 64K pages, CONT_PTES = 32 and the contpte size is 2M (order 5).

Use ilog2(CONT_PTES) instead, which directly evaluates to contpte-aligned
order for all page sizes.

The worst-case waste is at most one folio (up to 2M - 64K)
at the end of the file, since page_cache_ra_order() reduces the folio
order near EOF to avoid allocating past i_size.
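The before/after orders can be checked with a small userspace model
(illustrative; cont_ptes() hard-codes the arm64 geometry stated above,
and ilog2_ul() stands in for the kernel's ilog2()):

```c
/* arm64 contpte geometry per base page size (from the table above). */
static int cont_ptes(int page_shift)
{
	switch (page_shift) {
	case 12: return 16;	/* 4K:  16 x 4K   = 64K */
	case 14: return 128;	/* 16K: 128 x 16K = 2M  */
	case 16: return 32;	/* 64K: 32 x 64K  = 2M  */
	default: return 1;
	}
}

static int ilog2_ul(unsigned long v)
{
	int order = -1;

	while (v) {
		v >>= 1;
		order++;
	}
	return order;
}

/* Old formula: ilog2(SZ_64K >> PAGE_SHIFT). */
static int old_exec_folio_order(int page_shift)
{
	return ilog2_ul((1UL << 16) >> page_shift);
}

/* New formula: ilog2(CONT_PTES). */
static int new_exec_folio_order(int page_shift)
{
	return ilog2_ul((unsigned long)cont_ptes(page_shift));
}
```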

[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 arch/arm64/include/asm/pgtable.h | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b3e58735c49bd..a1110a33acb35 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1600,12 +1600,11 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
 #define arch_wants_old_prefaulted_pte	cpu_has_hw_af
 
 /*
- * Request exec memory is read into pagecache in at least 64K folios. This size
- * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
- * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
- * pages are in use.
+ * Request that exec memory be read into the pagecache in contpte-sized
+ * folios. The contpte size is the span of contiguous PTEs that the hardware
+ * can coalesce into a single iTLB entry: 64K for 4K pages, 2M for 16K and 64K pages.
  */
-#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
+#define exec_folio_order() ilog2(CONT_PTES)
 
 static inline bool pud_sect_supported(void)
 {
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
  2026-03-10 14:51 [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages Usama Arif
  2026-03-10 14:51 ` [PATCH 1/4] arm64: request contpte-sized folios for exec memory Usama Arif
@ 2026-03-10 14:51 ` Usama Arif
  2026-03-18 16:43   ` Jan Kara
  2026-03-10 14:51 ` [PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping Usama Arif
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Usama Arif @ 2026-03-10 14:51 UTC (permalink / raw)
  To: Andrew Morton, ryan.roberts, david
  Cc: ajd, anshuman.khandual, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, lorenzo.stoakes, npache, rmclure, Al Viro, will, willy,
	ziy, hannes, kas, shakeel.butt, kernel-team, Usama Arif

The mmap_miss counter in do_sync_mmap_readahead() tracks whether
readahead is useful for mmap'd file access. It is incremented by 1 on
every page cache miss in do_sync_mmap_readahead(), and decremented in
two places:

  - filemap_map_pages(): decremented by N for each of N pages
    successfully mapped via fault-around (pages found already in cache,
    evidence readahead was useful). Only pages not in the workingset
    count as hits.

  - do_async_mmap_readahead(): decremented by 1 when a page with
    PG_readahead is found in cache.

When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
disabled, including the targeted VM_EXEC readahead [1] that requests
arch-preferred folio orders for contpte mapping.

On arm64 with 64K base pages, both decrement paths are inactive:

  1. filemap_map_pages() is never called because fault_around_pages
     (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
     requires fault_around_pages > 1. With only 1 page in the
     fault-around window, there is nothing "around" to map.

  2. do_async_mmap_readahead() never fires for exec mappings because
     exec readahead sets async_size = 0, so no PG_readahead markers
     are placed.

With no decrements, mmap_miss monotonically increases past
MMAP_LOTSAMISS after 100 page faults, disabling all subsequent
exec readahead.

Fix this by moving the VM_EXEC readahead block above the mmap_miss
check. The exec readahead path is targeted: it reads a single folio at
the fault location with async_size = 0 rather than speculatively
prefetching, so the mmap_miss heuristic, which is designed to throttle
wasteful speculative readahead, should not gate it. The page would need
to be faulted in regardless; the only question is at what order.
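The accounting above can be sketched as a toy model (names and the
hits_per_fault knob are illustrative, not kernel code; MMAP_LOTSAMISS
is the threshold from mm/filemap.c):

```c
#define MMAP_LOTSAMISS	100

/* Every fault increments mmap_miss; hits_per_fault stands in for the
 * filemap_map_pages() / do_async_mmap_readahead() decrements, which
 * are 0 on 64K pages. Returns the fault number at which readahead is
 * disabled, or 0 if it survives max_faults faults. */
static int faults_until_disabled(unsigned int hits_per_fault,
				 int max_faults)
{
	unsigned int mmap_miss = 0;

	for (int fault = 1; fault <= max_faults; fault++) {
		mmap_miss++;			/* do_sync_mmap_readahead() */
		if (mmap_miss > MMAP_LOTSAMISS)
			return fault;		/* readahead disabled */
		/* decrement paths, clamped at zero */
		mmap_miss -= (hits_per_fault < mmap_miss) ?
			     hits_per_fault : mmap_miss;
	}
	return 0;
}
```

With no decrements the counter crosses the threshold on fault 101 and
never recovers; with even one hit per fault it never crosses at all.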

[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/filemap.c | 72 ++++++++++++++++++++++++++++------------------------
 1 file changed, 39 insertions(+), 33 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6cd7974d4adab..c064f31ecec5a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3331,6 +3331,37 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 		}
 	}
 
+	if (vm_flags & VM_EXEC) {
+		/*
+		 * Allow arch to request a preferred minimum folio order for
+		 * executable memory. This can often be beneficial to
+		 * performance if (e.g.) arm64 can contpte-map the folio.
+		 * Executable memory rarely benefits from readahead, due to its
+		 * random access nature, so set async_size to 0.
+		 *
+		 * Limit to the boundaries of the VMA to avoid reading in any
+		 * pad that might exist between sections, which would be a waste
+		 * of memory.
+		 *
+		 * This is targeted readahead (one folio at the fault location),
+		 * not speculative prefetch, so bypass the mmap_miss heuristic
+		 * which would otherwise disable it after MMAP_LOTSAMISS faults.
+		 */
+		struct vm_area_struct *vma = vmf->vma;
+		unsigned long start = vma->vm_pgoff;
+		unsigned long end = start + vma_pages(vma);
+		unsigned long ra_end;
+
+		ra->order = exec_folio_order();
+		ra->start = round_down(vmf->pgoff, 1UL << ra->order);
+		ra->start = max(ra->start, start);
+		ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
+		ra_end = min(ra_end, end);
+		ra->size = ra_end - ra->start;
+		ra->async_size = 0;
+		goto do_readahead;
+	}
+
 	if (!(vm_flags & VM_SEQ_READ)) {
 		/* Avoid banging the cache line if not needed */
 		mmap_miss = READ_ONCE(ra->mmap_miss);
@@ -3361,40 +3392,15 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 		return fpin;
 	}
 
-	if (vm_flags & VM_EXEC) {
-		/*
-		 * Allow arch to request a preferred minimum folio order for
-		 * executable memory. This can often be beneficial to
-		 * performance if (e.g.) arm64 can contpte-map the folio.
-		 * Executable memory rarely benefits from readahead, due to its
-		 * random access nature, so set async_size to 0.
-		 *
-		 * Limit to the boundaries of the VMA to avoid reading in any
-		 * pad that might exist between sections, which would be a waste
-		 * of memory.
-		 */
-		struct vm_area_struct *vma = vmf->vma;
-		unsigned long start = vma->vm_pgoff;
-		unsigned long end = start + vma_pages(vma);
-		unsigned long ra_end;
-
-		ra->order = exec_folio_order();
-		ra->start = round_down(vmf->pgoff, 1UL << ra->order);
-		ra->start = max(ra->start, start);
-		ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
-		ra_end = min(ra_end, end);
-		ra->size = ra_end - ra->start;
-		ra->async_size = 0;
-	} else {
-		/*
-		 * mmap read-around
-		 */
-		ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
-		ra->size = ra->ra_pages;
-		ra->async_size = ra->ra_pages / 4;
-		ra->order = 0;
-	}
+	/*
+	 * mmap read-around
+	 */
+	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
+	ra->size = ra->ra_pages;
+	ra->async_size = ra->ra_pages / 4;
+	ra->order = 0;
 
+do_readahead:
 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	ractl._index = ra->start;
 	page_cache_ra_order(&ractl, ra);
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping
  2026-03-10 14:51 [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages Usama Arif
  2026-03-10 14:51 ` [PATCH 1/4] arm64: request contpte-sized folios for exec memory Usama Arif
  2026-03-10 14:51 ` [PATCH 2/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
@ 2026-03-10 14:51 ` Usama Arif
  2026-03-13 14:42   ` WANG Rui
  2026-03-10 14:51 ` [PATCH 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area Usama Arif
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Usama Arif @ 2026-03-10 14:51 UTC (permalink / raw)
  To: Andrew Morton, ryan.roberts, david
  Cc: ajd, anshuman.khandual, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, lorenzo.stoakes, npache, rmclure, Al Viro, will, willy,
	ziy, hannes, kas, shakeel.butt, kernel-team, Usama Arif

For PIE binaries (ET_DYN), the load address is randomized at PAGE_SIZE
granularity via arch_mmap_rnd(). On arm64 with 64K base pages, this
means the binary is 64K-aligned, but contpte mapping requires 2M
(CONT_PTE_SIZE) alignment.

Without proper virtual address alignment, the readahead patches that
allocate 2M folios with 2M-aligned file offsets and physical addresses
cannot benefit from contpte mapping. The contpte fold check in
contpte_set_ptes() requires the virtual address to be CONT_PTE_SIZE-
aligned, and since the misalignment from vma->vm_start is constant
across all folios in the VMA, no folio gets the contiguous PTE bit
set, resulting in zero iTLB coalescing benefit.

Fix this by bumping the ELF alignment to PAGE_SIZE << exec_folio_order()
when the arch defines a non-zero exec_folio_order(). This ensures
load_bias is aligned to the folio size, so that file-offset-aligned
folios map to properly aligned virtual addresses.
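The effect of the bump can be sketched as follows (a minimal model with
illustrative names and arm64 64K-page values; the real code also folds
in maximum_alignment() from the program headers):

```c
#define PAGE_SHIFT	16			/* 64K base pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* The loader alignment becomes at least the exec folio size, so every
 * ASLR slot chosen at that alignment is CONT_PTE_SIZE aligned. */
static unsigned long bump_alignment(unsigned long alignment,
				    unsigned int exec_folio_order)
{
	unsigned long folio_size = PAGE_SIZE << exec_folio_order;

	if (exec_folio_order && folio_size > alignment)
		alignment = folio_size;
	return alignment;
}
```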

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 fs/binfmt_elf.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 8e89cc5b28200..2d2b3e9fd474f 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -49,6 +49,7 @@
 #include <uapi/linux/rseq.h>
 #include <asm/param.h>
 #include <asm/page.h>
+#include <linux/pgtable.h>
 
 #ifndef ELF_COMPAT
 #define ELF_COMPAT 0
@@ -1106,6 +1107,20 @@ static int load_elf_binary(struct linux_binprm *bprm)
 			/* Calculate any requested alignment. */
 			alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
 
+			/*
+			 * If the arch requested large folios for exec
+			 * memory via exec_folio_order(), ensure the
+			 * binary is mapped with sufficient alignment so
+			 * that virtual addresses of exec pages are
+			 * aligned to the folio boundary. Without this,
+			 * the hardware cannot coalesce PTEs (e.g. arm64
+			 * contpte) even though the physical memory and
+			 * file offset are correctly aligned.
+			 */
+			if (exec_folio_order())
+				alignment = max(alignment,
+					(unsigned long)PAGE_SIZE << exec_folio_order());
+
 			/**
 			 * DOC: PIE handling
 			 *
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area
  2026-03-10 14:51 [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages Usama Arif
                   ` (2 preceding siblings ...)
  2026-03-10 14:51 ` [PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping Usama Arif
@ 2026-03-10 14:51 ` Usama Arif
  2026-03-14  3:47   ` WANG Rui
  2026-03-13 13:20 ` [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages David Hildenbrand (Arm)
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Usama Arif @ 2026-03-10 14:51 UTC (permalink / raw)
  To: Andrew Morton, ryan.roberts, david
  Cc: ajd, anshuman.khandual, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, lorenzo.stoakes, npache, rmclure, Al Viro, will, willy,
	ziy, hannes, kas, shakeel.butt, kernel-team, Usama Arif

thp_get_unmapped_area() is the get_unmapped_area callback for
filesystems like ext4, xfs, and btrfs. It attempts to align the virtual
address for PMD_SIZE THP mappings, but on arm64 with 64K base pages
PMD_SIZE is 512M, which is too large for typical shared library mappings,
so the alignment always fails and falls back to PAGE_SIZE.

This means shared libraries loaded by ld.so via mmap() get 64K-aligned
virtual addresses, preventing contpte mapping even when 2M folios are
allocated with properly aligned file offsets and physical addresses.

Add a fallback in thp_get_unmapped_area_vmflags() that tries
PAGE_SIZE << exec_folio_order() alignment (2M on arm64 64K pages)
when PMD_SIZE alignment fails. This is small enough that shared
libraries could qualify, enabling contpte mapping for their executable
segments.

This applies to all file-backed mappings (not just exec). Non-exec
file-backed mappings also benefit from contpte mapping when large
folios are used. Aligning all file-backed mappings ensures that any
large folio in the page cache can be contpte-mapped regardless of
the mapping's protection flags, reducing dTLB misses for read-heavy
workloads.

The fallback is gated by exec_folio_order() which returns 0 by default,
making this a no-op on architectures that don't define it.
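The fallback ordering can be sketched as a toy model (sizes are the
arm64 64K-page values; the len check is a simplified stand-in for the
real __thp_get_unmapped_area() feasibility tests):

```c
#define PAGE_SHIFT	16			/* 64K base pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PMD_SIZE	(1UL << 29)		/* 512M on 64K pages */
#define EXEC_FOLIO_SIZE	(PAGE_SIZE << 5)	/* 2M */

/* Try PMD_SIZE alignment first, then the exec folio size, then fall
 * back to plain PAGE_SIZE. */
static unsigned long pick_alignment(unsigned long len)
{
	if (len >= PMD_SIZE)
		return PMD_SIZE;
	if (len >= EXEC_FOLIO_SIZE)
		return EXEC_FOLIO_SIZE;
	return PAGE_SIZE;
}
```

A typical multi-megabyte shared library is far below 512M but above 2M,
so only the new fallback succeeds for it.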

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/huge_memory.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e2746ea74adf..1c9476a5ed51c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1242,6 +1242,23 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 	if (ret)
 		return ret;
 
+	/*
+	 * If the arch requested large folios for exec memory, try to align
+	 * to the folio size as a fallback. This is much smaller than PMD_SIZE
+	 * (e.g. 2M vs 512M on arm64 64K pages), so it succeeds for mappings
+	 * that are too small for PMD alignment. Proper alignment ensures that
+	 * the hardware can coalesce PTEs (e.g. arm64 contpte) when large
+	 * folios are mapped.
+	 */
+	if (exec_folio_order()) {
+		unsigned long folio_size = PAGE_SIZE << exec_folio_order();
+
+		ret = __thp_get_unmapped_area(filp, addr, len, off, flags,
+					      folio_size, vm_flags);
+		if (ret)
+			return ret;
+	}
+
 	return mm_get_unmapped_area_vmflags(filp, addr, len, pgoff, flags,
 					    vm_flags);
 }
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-10 14:51 [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages Usama Arif
                   ` (3 preceding siblings ...)
  2026-03-10 14:51 ` [PATCH 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area Usama Arif
@ 2026-03-13 13:20 ` David Hildenbrand (Arm)
  2026-03-13 19:59   ` Usama Arif
  2026-03-13 16:33 ` Ryan Roberts
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-13 13:20 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, ryan.roberts
  Cc: ajd, anshuman.khandual, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, lorenzo.stoakes, npache, rmclure, Al Viro, will, willy,
	ziy, hannes, kas, shakeel.butt, kernel-team

On 3/10/26 15:51, Usama Arif wrote:
> On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
> into a single iTLB entry, reducing iTLB pressure for large executable
> mappings.
> 
> exec_folio_order() was introduced [1] to request readahead at an
> arch-preferred folio order for executable memory, enabling contpte
> mapping on the fault path.
> 
> However, several things prevent this from working optimally on 16K and
> 64K page configurations:
> 
> 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
>    produces the optimal contpte order for 4K pages. For 16K pages it
>    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
>    returns order 0 (64K) instead of order 5 (2M). Patch 1 fixes this by
>    using ilog2(CONT_PTES) which evaluates to the optimal order for all
>    page sizes.
> 
> 2. Even with the optimal order, the mmap_miss heuristic in
>    do_sync_mmap_readahead() silently disables exec readahead after 100
>    page faults. The mmap_miss counter tracks whether readahead is useful
>    for mmap'd file access:
> 
>    - Incremented by 1 in do_sync_mmap_readahead() on every page cache
>      miss (page needed IO).
> 
>    - Decremented by N in filemap_map_pages() for N pages successfully
>      mapped via fault-around (pages found in cache without faulting,
>      evidence that readahead was useful). Only non-workingset pages
>      count as hits; recently evicted and re-read pages do not.
> 
>    - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
>      marker page is found (indicates sequential consumption of readahead
>      pages).
> 
>    When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
>    disabled. On 64K pages, both decrement paths are inactive:
> 
>    - filemap_map_pages() is never called because fault_around_pages
>      (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
>      requires fault_around_pages > 1. With only 1 page in the
>      fault-around window, there is nothing "around" to map.
> 
>    - do_async_mmap_readahead() never fires for exec mappings because
>      exec readahead sets async_size = 0, so no PG_readahead markers
>      are placed.
> 
>    With no decrements, mmap_miss monotonically increases past
>    MMAP_LOTSAMISS after 100 faults, disabling exec readahead
>    for the remainder of the mapping.
>    Patch 2 fixes this by moving the VM_EXEC readahead block
>    above the mmap_miss check, since exec readahead is targeted (one
>    folio at the fault location, async_size=0), not speculative prefetch.
> 
> 3. Even with correct folio order and readahead, contpte mapping requires
>    the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
>    The readahead path aligns file offsets and the buddy allocator aligns
>    physical memory, but the virtual address depends on the VMA start.
>    For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
>    granularity, giving only a 1/32 chance of 2M alignment. When
>    misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
>    any folio in the VMA, resulting in zero iTLB coalescing benefit.
> 
>    Patch 3 fixes this for the main binary by bumping the ELF loader's
>    alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.
> 
>    Patch 4 fixes this for shared libraries by adding a contpte-size
>    alignment fallback in thp_get_unmapped_area_vmflags(). The existing
>    PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
>    libraries, so this smaller fallback (2M) succeeds where PMD fails.
> 
> I created a benchmark that mmaps a large executable file and calls
> RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
> fault + readahead cost. "Random" first faults in all pages with a
> sequential sweep (not measured), then measures time for calling random
> offsets, isolating iTLB miss cost for scattered execution.
> 
> The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
> 512MB executable file on ext4, averaged over 3 runs:
> 
>   Phase      | Baseline     | Patched      | Improvement
>   -----------|--------------|--------------|------------------
>   Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
>   Random     | 76.0 ms      | 58.3 ms      | 23% faster

I'm curious: is a single order really what we want?

I'd instead assume that we might want to make decisions based on the
mapping size.

Assume you have a 128M mapping, wouldn't we want to use a different
alignment than, say, for a 1M mapping, a 128K mapping or a 8k mapping?

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping
  2026-03-10 14:51 ` [PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping Usama Arif
@ 2026-03-13 14:42   ` WANG Rui
  2026-03-13 19:47     ` Usama Arif
  0 siblings, 1 reply; 26+ messages in thread
From: WANG Rui @ 2026-03-13 14:42 UTC (permalink / raw)
  To: usama.arif
  Cc: Liam.Howlett, ajd, akpm, anshuman.khandual, apopple, baohua,
	baolin.wang, brauner, catalin.marinas, david, dev.jain, hannes,
	jack, kas, kees, kernel-team, kevin.brodsky, lance.yang,
	linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
	lorenzo.stoakes, npache, rmclure, ryan.roberts, shakeel.butt,
	viro, will, willy, ziy, WANG Rui

Hi Usama,

Glad to see you're pushing on this; I'm also following it. I first noticed this when rustc's perf regressed after a binutils upgrade. I'm trying to make ld.so THP-aware and adjust PT_LOAD alignment to increase the chances of shared libraries being mapped by THP [1]. As you've probably seen, I'm doing something similar in the kernel to improve this for executables [2].

> +			if (exec_folio_order())
> +				alignment = max(alignment,
> +					(unsigned long)PAGE_SIZE << exec_folio_order());

I'm curious: does it make sense to add some constraints here, such as only increasing p_align when the segment length, virtual address, and file offset are all hugepage-aligned, as I did in my patch? This has come up several times in the glibc review, where increasing alignment was noted to reduce ASLR entropy.

[1] https://sourceware.org/pipermail/libc-alpha/2026-March/175776.html
[2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc

Thanks,
Rui


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-10 14:51 [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages Usama Arif
                   ` (4 preceding siblings ...)
  2026-03-13 13:20 ` [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages David Hildenbrand (Arm)
@ 2026-03-13 16:33 ` Ryan Roberts
  2026-03-13 20:55   ` Usama Arif
  2026-03-14 13:20   ` WANG Rui
  2026-03-13 16:35 ` hev
  2026-03-14  9:50 ` WANG Rui
  7 siblings, 2 replies; 26+ messages in thread
From: Ryan Roberts @ 2026-03-13 16:33 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, david
  Cc: ajd, anshuman.khandual, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, lorenzo.stoakes, npache, rmclure, Al Viro, will, willy,
	ziy, hannes, kas, shakeel.butt, kernel-team

On 10/03/2026 14:51, Usama Arif wrote:
> On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
> into a single iTLB entry, reducing iTLB pressure for large executable
> mappings.
> 
> exec_folio_order() was introduced [1] to request readahead at an
> arch-preferred folio order for executable memory, enabling contpte
> mapping on the fault path.
> 
> However, several things prevent this from working optimally on 16K and
> 64K page configurations:
> 
> 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
>    produces the optimal contpte order for 4K pages. For 16K pages it
>    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
>    returns order 0 (64K) instead of order 5 (2M). 

This was deliberate, although perhaps a bit conservative. I was concerned about
the possibility of read amplification: pointlessly reading in a load of memory
that never actually gets used. And that is independent of page size.

2M seems quite big as a default IMHO, I could imagine Android might complain
about memory pressure in their 16K config, for example.

Additionally, ELF files are normally only aligned to 64K and you can only get
the TLB benefits if the memory is aligned in physical and virtual memory.

> Patch 1 fixes this by
>    using ilog2(CONT_PTES) which evaluates to the optimal order for all
>    page sizes.
> 
> 2. Even with the optimal order, the mmap_miss heuristic in
>    do_sync_mmap_readahead() silently disables exec readahead after 100
>    page faults. The mmap_miss counter tracks whether readahead is useful
>    for mmap'd file access:
> 
>    - Incremented by 1 in do_sync_mmap_readahead() on every page cache
>      miss (page needed IO).
> 
>    - Decremented by N in filemap_map_pages() for N pages successfully
>      mapped via fault-around (pages found in cache without faulting,
>      evidence that readahead was useful). Only non-workingset pages
>      count and recently evicted and re-read pages don't count as hits.
> 
>    - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
>      marker page is found (indicates sequential consumption of readahead
>      pages).
> 
>    When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
>    disabled. On 64K pages, both decrement paths are inactive:
> 
>    - filemap_map_pages() is never called because fault_around_pages
>      (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
>      requires fault_around_pages > 1. With only 1 page in the
>      fault-around window, there is nothing "around" to map.
> 
>    - do_async_mmap_readahead() never fires for exec mappings because
>      exec readahead sets async_size = 0, so no PG_readahead markers
>      are placed.
> 
>    With no decrements, mmap_miss monotonically increases past
>    MMAP_LOTSAMISS after 100 faults, disabling exec readahead
>    for the remainder of the mapping.
>    Patch 2 fixes this by moving the VM_EXEC readahead block
>    above the mmap_miss check, since exec readahead is targeted (one
>    folio at the fault location, async_size=0) not speculative prefetch.

Interesting!

> 
> 3. Even with correct folio order and readahead, contpte mapping requires
>    the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
>    The readahead path aligns file offsets and the buddy allocator aligns
>    physical memory, but the virtual address depends on the VMA start.
>    For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
>    granularity, giving only a 1/32 chance of 2M alignment. When
>    misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
>    any folio in the VMA, resulting in zero iTLB coalescing benefit.
> 
>    Patch 3 fixes this for the main binary by bumping the ELF loader's
>    alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.
> 
>    Patch 4 fixes this for shared libraries by adding a contpte-size
>    alignment fallback in thp_get_unmapped_area_vmflags(). The existing
>    PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
>    libraries, so this smaller fallback (2M) succeeds where PMD fails.

I don't see how you can reliably influence this from the kernel? The ELF file
alignment is, by default, 64K (16K on Android) and there is no guarantee that
the text section is the first section in the file. You need to align the start
of the text section to the 2M boundary and to do that, you'll need to align the
start of the file to some 64K boundary at a specific offset to the 2M boundary,
based on the size of any sections before the text section. That's a job for the
dynamic loader I think? Perhaps I've misunderstood what you're doing...

> 
> I created a benchmark that mmaps a large executable file and calls
> RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
> fault + readahead cost. "Random" first faults in all pages with a
> sequential sweep (not measured), then measures time for calling random
> offsets, isolating iTLB miss cost for scattered execution.
> 
> The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
> 512MB executable file on ext4, averaged over 3 runs:
> 
>   Phase      | Baseline     | Patched      | Improvement
>   -----------|--------------|--------------|------------------
>   Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
>   Random     | 76.0 ms      | 58.3 ms      | 23% faster

I think the proper way to do this is to link the text section with 2M alignment
and have the dynamic linker mark the region with MADV_HUGEPAGE?

Thanks,
Ryan


> 
> [1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
> 
> Usama Arif (4):
>   arm64: request contpte-sized folios for exec memory
>   mm: bypass mmap_miss heuristic for VM_EXEC readahead
>   elf: align ET_DYN base to exec folio order for contpte mapping
>   mm: align file-backed mmap to exec folio order in
>     thp_get_unmapped_area
> 
>  arch/arm64/include/asm/pgtable.h |  9 ++--
>  fs/binfmt_elf.c                  | 15 +++++++
>  mm/filemap.c                     | 72 +++++++++++++++++---------------
>  mm/huge_memory.c                 | 17 ++++++++
>  4 files changed, 75 insertions(+), 38 deletions(-)
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-10 14:51 [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages Usama Arif
                   ` (5 preceding siblings ...)
  2026-03-13 16:33 ` Ryan Roberts
@ 2026-03-13 16:35 ` hev
  2026-03-14  9:50 ` WANG Rui
  7 siblings, 0 replies; 26+ messages in thread
From: hev @ 2026-03-13 16:35 UTC (permalink / raw)
  To: usama.arif
  Cc: Liam.Howlett, ajd, akpm, anshuman.khandual, apopple, baohua,
	baolin.wang, brauner, catalin.marinas, david, dev.jain, hannes,
	jack, kas, kees, kernel-team, kevin.brodsky, lance.yang,
	linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
	lorenzo.stoakes, npache, rmclure, ryan.roberts, shakeel.butt,
	viro, will, willy, ziy, WANG Rui

From: WANG Rui <r@hev.cc>

Hi,

I ran a quick bench on RK3399:

Binutils: 2.46
GCC: 15.2.1 (--enable-host-pie)

Workload: building vmlinux from Linux v7.0-rc1 with allnoconfig.

                Base                 Patchset [1]         Patchset [2]
instructions    3,115,852,636,773    3,194,533,947,809    3,235,417,205,947
cpu-cycles      8,374,429,970,450    8,457,398,871,141    8,323,881,987,768
itlb-misses         9,250,336,037        8,033,415,293        2,946,152,935
                                             ~ -13.16%            ~ -68.15%
time elapsed             610.51 s             605.12 s             593.83 s

[1] https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev
[2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc

Should we prefer PMD alignment here?

Thanks,
Rui



* Re: [PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping
  2026-03-13 14:42   ` WANG Rui
@ 2026-03-13 19:47     ` Usama Arif
  2026-03-14  2:10       ` hev
  0 siblings, 1 reply; 26+ messages in thread
From: Usama Arif @ 2026-03-13 19:47 UTC (permalink / raw)
  To: WANG Rui
  Cc: Liam.Howlett, ajd, akpm, anshuman.khandual, apopple, baohua,
	baolin.wang, brauner, catalin.marinas, david, dev.jain, hannes,
	jack, kas, kees, kernel-team, kevin.brodsky, lance.yang,
	linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
	lorenzo.stoakes, npache, rmclure, ryan.roberts, shakeel.butt,
	viro, will, willy, ziy



On 13/03/2026 17:42, WANG Rui wrote:
> Hi Usama,
> 

Hello!

> Glad to see you're pushing on this; I'm also following it. I first noticed this when rustc's perf regressed after a binutils upgrade. I'm trying to make ld.so THP-aware and adjust PT_LOAD alignment to increase the chances of shared libraries being mapped by THP [1]. As you've probably seen, I'm doing something similar in the kernel to improve it for executables [2].

For us it came about because we use 64K page size on ARM, and none of the
text sections were getting hugified (because PMD size is 512M). I went with
exec_folio_order() = cont-pte size (2M) for 16K and 64K as we can get both page
fault benefit (which might not be that important) and iTLB coverage (due to
cont-pte).
x86 already faults in at 2M (HPAGE_PMD_ORDER) due to the force_thp_readahead
path in do_sync_mmap_readahead(), so the memory pressure introduced on arm64
won't be worse than what already exists on x86.

> 
>> +			if (exec_folio_order())
>> +				alignment = max(alignment,
>> +					(unsigned long)PAGE_SIZE << exec_folio_order());
> 
> I’m curious, does it make sense to add some constraints here, like only increasing p_align when the segment length, virtual address, and file offset are all huge-aligned, as I did in my patch? This has come up several times in the glibc review, where increasing alignment was noted to reduce ASLR entropy.
> 

Yes I think this makes sense!

Although maybe we should check all segments with PT_LOAD. So maybe something
like below over this series?

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 2d2b3e9fd474f..a0e83b541a7d8 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1116,10 +1116,30 @@ static int load_elf_binary(struct linux_binprm *bprm)
                         * the hardware cannot coalesce PTEs (e.g. arm64
                         * contpte) even though the physical memory and
                         * file offset are correctly aligned.
+                        *
+                        * Only increase alignment when at least one
+                        * PT_LOAD segment is large enough to contain a
+                        * full folio and has its file offset and virtual
+                        * address folio-aligned. This avoids reducing
+                        * ASLR entropy for small binaries that cannot
+                        * benefit from contpte mapping.
                         */
-                       if (exec_folio_order())
-                               alignment = max(alignment,
-                                       (unsigned long)PAGE_SIZE << exec_folio_order());
+                       if (exec_folio_order()) {
+                               unsigned long folio_sz = PAGE_SIZE << exec_folio_order();
+
+                               for (i = 0; i < elf_ex->e_phnum; i++) {
+                                       if (elf_phdata[i].p_type != PT_LOAD)
+                                               continue;
+                                       if (elf_phdata[i].p_filesz < folio_sz)
+                                               continue;
+                                       if (!IS_ALIGNED(elf_phdata[i].p_vaddr, folio_sz))
+                                               continue;
+                                       if (!IS_ALIGNED(elf_phdata[i].p_offset, folio_sz))
+                                               continue;
+                                       alignment = max(alignment, folio_sz);
+                                       break;
+                               }
+                       }
 
                        /**
                         * DOC: PIE handling

> [1] https://sourceware.org/pipermail/libc-alpha/2026-March/175776.html
> [2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
> 
> Thanks,
> Rui




* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-13 13:20 ` [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages David Hildenbrand (Arm)
@ 2026-03-13 19:59   ` Usama Arif
  2026-03-16 16:06     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 26+ messages in thread
From: Usama Arif @ 2026-03-13 19:59 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, ryan.roberts
  Cc: ajd, anshuman.khandual, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, lorenzo.stoakes, npache, rmclure, Al Viro, will, willy,
	ziy, hannes, kas, shakeel.butt, kernel-team



On 13/03/2026 16:20, David Hildenbrand (Arm) wrote:
> On 3/10/26 15:51, Usama Arif wrote:
>> On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
>> into a single iTLB entry, reducing iTLB pressure for large executable
>> mappings.
>>
>> exec_folio_order() was introduced [1] to request readahead at an
>> arch-preferred folio order for executable memory, enabling contpte
>> mapping on the fault path.
>>
>> However, several things prevent this from working optimally on 16K and
>> 64K page configurations:
>>
>> 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
>>    produces the optimal contpte order for 4K pages. For 16K pages it
>>    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
>>    returns order 0 (64K) instead of order 5 (2M). Patch 1 fixes this by
>>    using ilog2(CONT_PTES) which evaluates to the optimal order for all
>>    page sizes.
>>
>> 2. Even with the optimal order, the mmap_miss heuristic in
>>    do_sync_mmap_readahead() silently disables exec readahead after 100
>>    page faults. The mmap_miss counter tracks whether readahead is useful
>>    for mmap'd file access:
>>
>>    - Incremented by 1 in do_sync_mmap_readahead() on every page cache
>>      miss (page needed IO).
>>
>>    - Decremented by N in filemap_map_pages() for N pages successfully
>>      mapped via fault-around (pages found in cache without faulting,
>>      evidence that readahead was useful). Only non-workingset pages
>>      count and recently evicted and re-read pages don't count as hits.
>>
>>    - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
>>      marker page is found (indicates sequential consumption of readahead
>>      pages).
>>
>>    When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
>>    disabled. On 64K pages, both decrement paths are inactive:
>>
>>    - filemap_map_pages() is never called because fault_around_pages
>>      (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
>>      requires fault_around_pages > 1. With only 1 page in the
>>      fault-around window, there is nothing "around" to map.
>>
>>    - do_async_mmap_readahead() never fires for exec mappings because
>>      exec readahead sets async_size = 0, so no PG_readahead markers
>>      are placed.
>>
>>    With no decrements, mmap_miss monotonically increases past
>>    MMAP_LOTSAMISS after 100 faults, disabling exec readahead
>>    for the remainder of the mapping.
>>    Patch 2 fixes this by moving the VM_EXEC readahead block
>>    above the mmap_miss check, since exec readahead is targeted (one
>>    folio at the fault location, async_size=0) not speculative prefetch.
>>
>> 3. Even with correct folio order and readahead, contpte mapping requires
>>    the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
>>    The readahead path aligns file offsets and the buddy allocator aligns
>>    physical memory, but the virtual address depends on the VMA start.
>>    For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
>>    granularity, giving only a 1/32 chance of 2M alignment. When
>>    misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
>>    any folio in the VMA, resulting in zero iTLB coalescing benefit.
>>
>>    Patch 3 fixes this for the main binary by bumping the ELF loader's
>>    alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.
>>
>>    Patch 4 fixes this for shared libraries by adding a contpte-size
>>    alignment fallback in thp_get_unmapped_area_vmflags(). The existing
>>    PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
>>    libraries, so this smaller fallback (2M) succeeds where PMD fails.
>>
>> I created a benchmark that mmaps a large executable file and calls
>> RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
>> fault + readahead cost. "Random" first faults in all pages with a
>> sequential sweep (not measured), then measures time for calling random
>> offsets, isolating iTLB miss cost for scattered execution.
>>
>> The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
>> 512MB executable file on ext4, averaged over 3 runs:
>>
>>   Phase      | Baseline     | Patched      | Improvement
>>   -----------|--------------|--------------|------------------
>>   Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
>>   Random     | 76.0 ms      | 58.3 ms      | 23% faster
> 
> I'm curious: is a single order really what we want?
> 
> I'd instead assume that we might want to make decisions based on the
> mapping size.
> 
> Assume you have a 128M mapping, wouldn't we want to use a different
> alignment than, say, for a 1M mapping, a 128K mapping or a 8k mapping?
> 

So I see 2 benefits from this. Page fault and iTLB coverage. IMHO page
faults are not that big of a deal? If the text section is hot, it wont
get flushed after faulting in. So the real benefit comes from improved
iTLB coverage.

For a 128M mapping, 2M alignment gives 64 contpte entries. Aligning
to something larger (say 128M) wouldn't give any additional TLB
coalescing; each 2M-aligned region independently qualifies for contpte.

Mappings smaller than 2M can't benefit from contpte regardless of
alignment, so falling back to PAGE_SIZE would be the optimal behaviour.
Adding intermediate sizes (e.g. 512K, 128K) wouldn't map to any
hardware boundary and adds complexity without TLB benefit?



* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-13 16:33 ` Ryan Roberts
@ 2026-03-13 20:55   ` Usama Arif
  2026-03-18 10:52     ` Usama Arif
  2026-03-14 13:20   ` WANG Rui
  1 sibling, 1 reply; 26+ messages in thread
From: Usama Arif @ 2026-03-13 20:55 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Usama Arif, Andrew Morton, david, ajd, anshuman.khandual, apopple,
	baohua, baolin.wang, brauner, catalin.marinas, dev.jain, jack,
	kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
	linux-fsdevel, linux-kernel, linux-mm, lorenzo.stoakes, npache,
	rmclure, Al Viro, will, willy, ziy, hannes, kas, shakeel.butt,
	kernel-team

On Fri, 13 Mar 2026 16:33:42 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:

> On 10/03/2026 14:51, Usama Arif wrote:
> > On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
> > into a single iTLB entry, reducing iTLB pressure for large executable
> > mappings.
> > 
> > exec_folio_order() was introduced [1] to request readahead at an
> > arch-preferred folio order for executable memory, enabling contpte
> > mapping on the fault path.
> > 
> > However, several things prevent this from working optimally on 16K and
> > 64K page configurations:
> > 
> > 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
> >    produces the optimal contpte order for 4K pages. For 16K pages it
> >    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
> >    returns order 0 (64K) instead of order 5 (2M). 
> 
> This was deliberate, although perhaps a bit conservative. I was concerned about
> the possibility of read amplification; pointlessly reading in a load of memory
> that never actually gets used. And that is independent of page size.
> 
> 2M seems quite big as a default IMHO, I could imagine Android might complain
> about memory pressure in their 16K config, for example.
> 

The force_thp_readahead path in do_sync_mmap_readahead() reads at
HPAGE_PMD_ORDER (2M on x86) and even doubles it to 4M for
non-VM_RAND_READ mappings (ra->size *= 2), with async readahead
enabled. exec_folio_order() is more conservative: a single 2M folio
with async_size=0 and no speculative prefetch. So I think the memory
pressure would not be worse than what x86 already has?

For memory pressure on Android 16K: the readahead is clamped to VMA
boundaries, so a small shared library won't read 2M.
page_cache_ra_order() reduces folio order near EOF and on allocation
failure, so the 2M order is a preference, not a guarantee with the
current code?

> Additionally, ELF files are normally only aligned to 64K and you can only get
> the TLB benefits if the memory is aligned in physical and virtual memory.
> 
> > Patch 1 fixes this by
> >    using ilog2(CONT_PTES) which evaluates to the optimal order for all
> >    page sizes.
> > 
> > 2. Even with the optimal order, the mmap_miss heuristic in
> >    do_sync_mmap_readahead() silently disables exec readahead after 100
> >    page faults. The mmap_miss counter tracks whether readahead is useful
> >    for mmap'd file access:
> > 
> >    - Incremented by 1 in do_sync_mmap_readahead() on every page cache
> >      miss (page needed IO).
> > 
> >    - Decremented by N in filemap_map_pages() for N pages successfully
> >      mapped via fault-around (pages found in cache without faulting,
> >      evidence that readahead was useful). Only non-workingset pages
> >      count and recently evicted and re-read pages don't count as hits.
> > 
> >    - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
> >      marker page is found (indicates sequential consumption of readahead
> >      pages).
> > 
> >    When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
> >    disabled. On 64K pages, both decrement paths are inactive:
> > 
> >    - filemap_map_pages() is never called because fault_around_pages
> >      (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
> >      requires fault_around_pages > 1. With only 1 page in the
> >      fault-around window, there is nothing "around" to map.
> > 
> >    - do_async_mmap_readahead() never fires for exec mappings because
> >      exec readahead sets async_size = 0, so no PG_readahead markers
> >      are placed.
> > 
> >    With no decrements, mmap_miss monotonically increases past
> >    MMAP_LOTSAMISS after 100 faults, disabling exec readahead
> >    for the remainder of the mapping.
> >    Patch 2 fixes this by moving the VM_EXEC readahead block
> >    above the mmap_miss check, since exec readahead is targeted (one
> >    folio at the fault location, async_size=0) not speculative prefetch.
> 
> Interesting!
> 
> > 
> > 3. Even with correct folio order and readahead, contpte mapping requires
> >    the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
> >    The readahead path aligns file offsets and the buddy allocator aligns
> >    physical memory, but the virtual address depends on the VMA start.
> >    For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
> >    granularity, giving only a 1/32 chance of 2M alignment. When
> >    misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
> >    any folio in the VMA, resulting in zero iTLB coalescing benefit.
> > 
> >    Patch 3 fixes this for the main binary by bumping the ELF loader's
> >    alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.
> > 
> >    Patch 4 fixes this for shared libraries by adding a contpte-size
> >    alignment fallback in thp_get_unmapped_area_vmflags(). The existing
> >    PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
> >    libraries, so this smaller fallback (2M) succeeds where PMD fails.
> 
> I don't see how you can reliably influence this from the kernel? The ELF file
> alignment is, by default, 64K (16K on Android) and there is no guarantee that
> the text section is the first section in the file. You need to align the start
> of the text section to the 2M boundary and to do that, you'll need to align the
> start of the file to some 64K boundary at a specific offset to the 2M boundary,
> based on the size of any sections before the text section. That's a job for the
> dynamic loader I think? Perhaps I've misunderstood what you're doing...
>

I only started looking into how this works a few days before sending these
patches, so I could be wrong (please do correct me if that's the case!)

For the main binary (patch 3): load_elf_binary() controls load_bias.
Each PT_LOAD segment is mapped at load_bias + p_vaddr via elf_map().
The alignment variable feeds directly into load_bias calculation.
If p_vaddr=0 and p_offset=0, mapped_addr = load_bias + 0 = load_bias. By
ensuring load_bias is folio-size aligned, the text segment's virtual address
is also folio-size aligned.

For shared libraries (patch 4): ld.so loads these via mmap(), and the
kernel's get_unmapped_area callback (thp_get_unmapped_area for ext4,
xfs, btrfs) picks the virtual address. The existing code tries
PMD_SIZE alignment first (512M on 64K pages), which is too large for
typical shared libraries and always fails. Patch 4 adds a fallback
that tries folio-size alignment (2M), which is small enough to succeed
for most libraries.

> > 
> > I created a benchmark that mmaps a large executable file and calls
> > RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
> > fault + readahead cost. "Random" first faults in all pages with a
> > sequential sweep (not measured), then measures time for calling random
> > offsets, isolating iTLB miss cost for scattered execution.
> > 
> > The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
> > 512MB executable file on ext4, averaged over 3 runs:
> > 
> >   Phase      | Baseline     | Patched      | Improvement
> >   -----------|--------------|--------------|------------------
> >   Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
> >   Random     | 76.0 ms      | 58.3 ms      | 23% faster
> 
> I think the proper way to do this is to link the text section with 2M alignment
> and have the dynamic linker mark the region with MADV_HUGEPAGE?
> 

On arm64 with 64K pages, the force_thp_readahead path triggered by
MADV_HUGEPAGE reads at HPAGE_PMD_ORDER (512M). Even with file and
anon collapse support added to khugepaged, the collapse won't happen
from the start.

Yes, I think the dynamic linker is also a good alternative approach, as in
Wang's patches [1]. But doing it in the kernel would be more transparent?

[1] https://sourceware.org/pipermail/libc-alpha/2026-March/175776.html

> Thanks,
> Ryan
> 
> 
> > 
> > [1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
> > 
> > Usama Arif (4):
> >   arm64: request contpte-sized folios for exec memory
> >   mm: bypass mmap_miss heuristic for VM_EXEC readahead
> >   elf: align ET_DYN base to exec folio order for contpte mapping
> >   mm: align file-backed mmap to exec folio order in
> >     thp_get_unmapped_area
> > 
> >  arch/arm64/include/asm/pgtable.h |  9 ++--
> >  fs/binfmt_elf.c                  | 15 +++++++
> >  mm/filemap.c                     | 72 +++++++++++++++++---------------
> >  mm/huge_memory.c                 | 17 ++++++++
> >  4 files changed, 75 insertions(+), 38 deletions(-)
> > 
> 
> 



* Re: [PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping
  2026-03-13 19:47     ` Usama Arif
@ 2026-03-14  2:10       ` hev
  0 siblings, 0 replies; 26+ messages in thread
From: hev @ 2026-03-14  2:10 UTC (permalink / raw)
  To: Usama Arif
  Cc: Liam.Howlett, ajd, akpm, anshuman.khandual, apopple, baohua,
	baolin.wang, brauner, catalin.marinas, david, dev.jain, hannes,
	jack, kas, kees, kernel-team, kevin.brodsky, lance.yang,
	linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
	lorenzo.stoakes, npache, rmclure, ryan.roberts, shakeel.butt,
	viro, will, willy, ziy

On Sat, Mar 14, 2026 at 3:47 AM Usama Arif <usama.arif@linux.dev> wrote:
>
>
>
> On 13/03/2026 17:42, WANG Rui wrote:
> > Hi Usama,
> >
>
> Hello!
>
> > Glad to see you're pushing on this; I'm also following it. I first noticed this when rustc's perf regressed after a binutils upgrade. I'm trying to make ld.so THP-aware and adjust PT_LOAD alignment to increase the chances of shared libraries being mapped by THP [1]. As you've probably seen, I'm doing something similar in the kernel to improve it for executables [2].
>
> For us it came about because we use 64K page size on ARM, and none of the
> text sections were getting hugified (because PMD size is 512M). I went with
> exec_folio_order() = cont-pte size (2M) for 16K and 64K as we can get both page
> fault benefit (which might not be that important) and iTLB coverage (due to
> cont-pte).
> x86 already faults in at 2M (HPAGE_PMD_ORDER) due to the force_thp_readahead
> path in do_sync_mmap_readahead(), so the memory pressure introduced on arm64
> won't be worse than what already exists on x86.
>
> >
> >> +                    if (exec_folio_order())
> >> +                            alignment = max(alignment,
> >> +                                    (unsigned long)PAGE_SIZE << exec_folio_order());
> >
> > I’m curious, does it make sense to add some constraints here, like only increasing p_align when the segment length, virtual address, and file offset are all huge-aligned, as I did in my patch? This has come up several times in the glibc review, where increasing alignment was noted to reduce ASLR entropy.
> >
>
> Yes I think this makes sense!
>
> Although maybe we should check all segments with PT_LOAD. So maybe something
> like below over this series?
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 2d2b3e9fd474f..a0e83b541a7d8 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1116,10 +1116,30 @@ static int load_elf_binary(struct linux_binprm *bprm)
>                          * the hardware cannot coalesce PTEs (e.g. arm64
>                          * contpte) even though the physical memory and
>                          * file offset are correctly aligned.
> +                        *
> +                        * Only increase alignment when at least one
> +                        * PT_LOAD segment is large enough to contain a
> +                        * full folio and has its file offset and virtual
> +                        * address folio-aligned. This avoids reducing
> +                        * ASLR entropy for small binaries that cannot
> +                        * benefit from contpte mapping.
>                          */
> -                       if (exec_folio_order())
> -                               alignment = max(alignment,
> -                                       (unsigned long)PAGE_SIZE << exec_folio_order());
> +                       if (exec_folio_order()) {
> +                               unsigned long folio_sz = PAGE_SIZE << exec_folio_order();
> +
> +                               for (i = 0; i < elf_ex->e_phnum; i++) {
> +                                       if (elf_phdata[i].p_type != PT_LOAD)
> +                                               continue;
> +                                       if (elf_phdata[i].p_filesz < folio_sz)
> +                                               continue;
> +                                       if (!IS_ALIGNED(elf_phdata[i].p_vaddr, folio_sz))
> +                                               continue;
> +                                       if (!IS_ALIGNED(elf_phdata[i].p_offset, folio_sz))
> +                                               continue;
> +                                       alignment = max(alignment, folio_sz);
> +                                       break;
> +                               }
> +                       }

I think this logic should live in maximum_alignment(), so we don't
have to walk the segments twice. It might be better to move it into a
separate helper, something like should_align_to_exec_folio()?

>
>                         /**
>                          * DOC: PIE handling
>
> > [1] https://sourceware.org/pipermail/libc-alpha/2026-March/175776.html
> > [2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
> >
> > Thanks,
> > Rui
>



* Re: [PATCH 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area
  2026-03-10 14:51 ` [PATCH 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area Usama Arif
@ 2026-03-14  3:47   ` WANG Rui
  0 siblings, 0 replies; 26+ messages in thread
From: WANG Rui @ 2026-03-14  3:47 UTC (permalink / raw)
  To: usama.arif
  Cc: Liam.Howlett, ajd, akpm, anshuman.khandual, apopple, baohua,
	baolin.wang, brauner, catalin.marinas, david, dev.jain, hannes,
	jack, kas, kees, kernel-team, kevin.brodsky, lance.yang,
	linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
	lorenzo.stoakes, npache, rmclure, ryan.roberts, shakeel.butt,
	viro, will, willy, ziy, WANG Rui

> +	if (exec_folio_order()) {
> +		unsigned long folio_size = PAGE_SIZE << exec_folio_order();
> +
> +		ret = __thp_get_unmapped_area(filp, addr, len, off, flags,
> +					      folio_size, vm_flags);
> +		if (ret)
> +			return ret;
> +	}
> +

I noticed that even when the code segment of a user-space shared library
is at least PMD_SIZE (32MB), it still doesn’t end up at a PMD-aligned
virtual address. This might be the fallback you mentioned. Adjusting
p_align in the ld.so ELF loader does work, though it also avoids
extremely large PMD_SIZE values (capped at ≤32M).

It would probably be better to skip the PMD_SIZE == folio_sz case here,
so we don’t end up calling __thp_get_unmapped_area() twice with the same
parameters.
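The condition could be distilled into a tiny predicate like the one below (an illustrative sketch only, outside the real thp_get_unmapped_area_vmflags() context; sizes are in bytes):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Distilled sketch of the suggestion: only attempt the exec-folio
 * alignment fallback when it would not repeat the PMD_SIZE attempt
 * that __thp_get_unmapped_area() already made with identical
 * parameters.
 */
static bool want_exec_folio_fallback(unsigned long folio_size,
				     unsigned long pmd_size)
{
	return folio_size != 0 && folio_size != pmd_size;
}
```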

Thanks,
Rui


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-10 14:51 [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages Usama Arif
                   ` (6 preceding siblings ...)
  2026-03-13 16:35 ` hev
@ 2026-03-14  9:50 ` WANG Rui
  2026-03-18 10:57   ` Usama Arif
  7 siblings, 1 reply; 26+ messages in thread
From: WANG Rui @ 2026-03-14  9:50 UTC (permalink / raw)
  To: usama.arif
  Cc: Liam.Howlett, ajd, akpm, anshuman.khandual, apopple, baohua,
	baolin.wang, brauner, catalin.marinas, david, dev.jain, hannes,
	jack, kas, kees, kernel-team, kevin.brodsky, lance.yang,
	linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
	lorenzo.stoakes, npache, rmclure, ryan.roberts, shakeel.butt,
	viro, will, willy, ziy, WANG Rui

I only just realized your focus was on 64K base pages; what I was
referring to here is AArch64 with 4K base pages.

Sorry about the earlier numbers; they were a bit imprecise.
RK3399 has pretty limited PMU events, and it looks like it can’t
collect events from the A53 and A72 clusters at the same time, so
I reran the measurements on the A53.

Even though the A53 backend isn’t very wide, we can still see the
impact from iTLB pressure. With 4K pages, aligning the code to PMD
size (2M) performs slightly better than 64K.

Binutils: 2.46
GCC: 15.2.1 (--enable-host-pie)

Workload: building vmlinux from Linux v7.0-rc1 with allnoconfig.
Loop: 5

                Base                 Patchset [1]         Patchset [2]
instructions    1,994,512,163,037    1,994,528,896,322    1,994,536,148,574
cpu-cycles      6,890,054,789,351    6,870,685,379,047    6,720,442,248,967
                                              ~ -0.28%             ~ -2.46%
itlb-misses           579,692,117          455,848,211           43,814,795
                                             ~ -21.36%            ~ -92.44%
time elapsed            1331.15 s            1325.50 s            1296.35 s
                                              ~ -0.42%             ~ -2.61%

Maybe we could make exec_folio_order() choose a different folio size
depending on the configuration, conditioned in some way, for example
on the size of the code segment?

[1] https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev
[2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc

Thanks,
Rui


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-13 16:33 ` Ryan Roberts
  2026-03-13 20:55   ` Usama Arif
@ 2026-03-14 13:20   ` WANG Rui
  1 sibling, 0 replies; 26+ messages in thread
From: WANG Rui @ 2026-03-14 13:20 UTC (permalink / raw)
  To: ryan.roberts
  Cc: Liam.Howlett, ajd, akpm, anshuman.khandual, apopple, baohua,
	baolin.wang, brauner, catalin.marinas, david, dev.jain, hannes,
	jack, kas, kees, kernel-team, kevin.brodsky, lance.yang,
	linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
	lorenzo.stoakes, npache, rmclure, shakeel.butt, usama.arif, viro,
	will, willy, ziy, WANG Rui

Hi Ryan,

> I don't see how you can reliably influence this from the kernel? The ELF file
> alignment is, by default, 64K (16K on Android) and there is no guarantee that
> the text section is the first section in the file. You need to align the start
> of the text section to the 2M boundary and to do that, you'll need to align the
> start of the file to some 64K boundary at a specific offset to the 2M boundary,
> based on the size of any sections before the text section. That's a job for the
> dynamic loader I think? Perhaps I've misunderstood what you're doing...

On Arch Linux for AArch64 and LoongArch64 I've observed that most
binaries place the executable segment in the first PT_LOAD. In that
case both the virtual address and file offset are 0, which happens to
satisfy the alignment requirements for PMD-sized or large folio
mappings.

x86 looks quite different. The executable segment is usually not the
first one.

After digging into this I realized this mostly comes from the linker
defaults. With GNU ld, -z noseparate-code merges the read-only and
read-only-executable segments into one, while -z separate-code
splits them, placing a non-executable read-only segment first. The
latter is the default on x86, partly to avoid making the ELF headers
executable when mappings start from the beginning of the file.

Other architectures tend to default to -z noseparate-code, which
makes it more likely that the text segment is the first PT_LOAD.

LLVM lld behaves differently again: --rosegment (equivalent to
-z separate-code) is enabled by default on all architectures, which
similarly tends to place the executable segment after an initial
read-only one. That default is clearly less friendly to large-page
text mappings.

Thanks,
Rui


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-13 19:59   ` Usama Arif
@ 2026-03-16 16:06     ` David Hildenbrand (Arm)
  2026-03-18 10:41       ` Usama Arif
  0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-16 16:06 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, ryan.roberts
  Cc: ajd, anshuman.khandual, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, lorenzo.stoakes, npache, rmclure, Al Viro, will, willy,
	ziy, hannes, kas, shakeel.butt, kernel-team

On 3/13/26 20:59, Usama Arif wrote:
> 
> 
> On 13/03/2026 16:20, David Hildenbrand (Arm) wrote:
>> On 3/10/26 15:51, Usama Arif wrote:
>>> On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
>>> into a single iTLB entry, reducing iTLB pressure for large executable
>>> mappings.
>>>
>>> exec_folio_order() was introduced [1] to request readahead at an
>>> arch-preferred folio order for executable memory, enabling contpte
>>> mapping on the fault path.
>>>
>>> However, several things prevent this from working optimally on 16K and
>>> 64K page configurations:
>>>
>>> 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
>>>    produces the optimal contpte order for 4K pages. For 16K pages it
>>>    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
>>>    returns order 0 (64K) instead of order 5 (2M). Patch 1 fixes this by
>>>    using ilog2(CONT_PTES) which evaluates to the optimal order for all
>>>    page sizes.
>>>
>>> 2. Even with the optimal order, the mmap_miss heuristic in
>>>    do_sync_mmap_readahead() silently disables exec readahead after 100
>>>    page faults. The mmap_miss counter tracks whether readahead is useful
>>>    for mmap'd file access:
>>>
>>>    - Incremented by 1 in do_sync_mmap_readahead() on every page cache
>>>      miss (page needed IO).
>>>
>>>    - Decremented by N in filemap_map_pages() for N pages successfully
>>>      mapped via fault-around (pages found in cache without faulting,
>>>      evidence that readahead was useful). Only non-workingset pages
>>>      count and recently evicted and re-read pages don't count as hits.
>>>
>>>    - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
>>>      marker page is found (indicates sequential consumption of readahead
>>>      pages).
>>>
>>>    When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
>>>    disabled. On 64K pages, both decrement paths are inactive:
>>>
>>>    - filemap_map_pages() is never called because fault_around_pages
>>>      (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
>>>      requires fault_around_pages > 1. With only 1 page in the
>>>      fault-around window, there is nothing "around" to map.
>>>
>>>    - do_async_mmap_readahead() never fires for exec mappings because
>>>      exec readahead sets async_size = 0, so no PG_readahead markers
>>>      are placed.
>>>
>>>    With no decrements, mmap_miss monotonically increases past
>>>    MMAP_LOTSAMISS after 100 faults, disabling exec readahead
>>>    for the remainder of the mapping.
>>>    Patch 2 fixes this by moving the VM_EXEC readahead block
>>>    above the mmap_miss check, since exec readahead is targeted (one
>>>    folio at the fault location, async_size=0) not speculative prefetch.
>>>
>>> 3. Even with correct folio order and readahead, contpte mapping requires
>>>    the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
>>>    The readahead path aligns file offsets and the buddy allocator aligns
>>>    physical memory, but the virtual address depends on the VMA start.
>>>    For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
>>>    granularity, giving only a 1/32 chance of 2M alignment. When
>>>    misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
>>>    any folio in the VMA, resulting in zero iTLB coalescing benefit.
>>>
>>>    Patch 3 fixes this for the main binary by bumping the ELF loader's
>>>    alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.
>>>
>>>    Patch 4 fixes this for shared libraries by adding a contpte-size
>>>    alignment fallback in thp_get_unmapped_area_vmflags(). The existing
>>>    PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
>>>    libraries, so this smaller fallback (2M) succeeds where PMD fails.
>>>
>>> I created a benchmark that mmaps a large executable file and calls
>>> RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
>>> fault + readahead cost. "Random" first faults in all pages with a
>>> sequential sweep (not measured), then measures time for calling random
>>> offsets, isolating iTLB miss cost for scattered execution.
>>>
>>> The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
>>> 512MB executable file on ext4, averaged over 3 runs:
>>>
>>>   Phase      | Baseline     | Patched      | Improvement
>>>   -----------|--------------|--------------|------------------
>>>   Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
>>>   Random     | 76.0 ms      | 58.3 ms      | 23% faster
>>
>> I'm curious: is a single order really what we want?
>>
>> I'd instead assume that we might want to make decisions based on the
>> mapping size.
>>
>> Assume you have a 128M mapping, wouldn't we want to use a different
>> alignment than, say, for a 1M mapping, a 128K mapping or a 8k mapping?
>>
> 
> So I see 2 benefits from this. Page fault and iTLB coverage. IMHO page
> faults are not that big of a deal? If the text section is hot, it wont
> get flushed after faulting in. So the real benefit comes from improved
> iTLB coverage.
> 
> For a 128M mapping, 2M alignment gives 64 contpte entries. Aligning
> to something larger (say 128M) wouldn't give any additional TLB
> coalescing, each 2M-aligned region independently qualifies for contpte.
> 
> Mappings smaller than 2M can't benefit from contpte regardless of
> alignment, so falling back to PAGE_SIZE would be the optimal behaviour.
> Adding intermediate sizes (e.g. 512K, 128K) wouldn't map to any
> hardware boundary and adds complexity without TLB benefit?

I might be wrong, but I think you are mixing two things here:

(1) "Minimum" folio size (exec_folio_order())

(2) VMA alignment.


(2) should certainly be as large as (1), but assume we can get a 2M
folio on arm64 4k, why shouldn't we align it to 2M if the region is
reasonably sized, and use a PMD?


-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-16 16:06     ` David Hildenbrand (Arm)
@ 2026-03-18 10:41       ` Usama Arif
  2026-03-18 12:41         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 26+ messages in thread
From: Usama Arif @ 2026-03-18 10:41 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Andrew Morton, ryan.roberts
  Cc: ajd, anshuman.khandual, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, lorenzo.stoakes, npache, rmclure, Al Viro, will, willy,
	ziy, hannes, kas, shakeel.butt, kernel-team, WANG Rui



On 16/03/2026 19:06, David Hildenbrand (Arm) wrote:
> On 3/13/26 20:59, Usama Arif wrote:
>>
>>
>> On 13/03/2026 16:20, David Hildenbrand (Arm) wrote:
>>> On 3/10/26 15:51, Usama Arif wrote:
>>>> On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
>>>> into a single iTLB entry, reducing iTLB pressure for large executable
>>>> mappings.
>>>>
>>>> exec_folio_order() was introduced [1] to request readahead at an
>>>> arch-preferred folio order for executable memory, enabling contpte
>>>> mapping on the fault path.
>>>>
>>>> However, several things prevent this from working optimally on 16K and
>>>> 64K page configurations:
>>>>
>>>> 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
>>>>    produces the optimal contpte order for 4K pages. For 16K pages it
>>>>    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
>>>>    returns order 0 (64K) instead of order 5 (2M). Patch 1 fixes this by
>>>>    using ilog2(CONT_PTES) which evaluates to the optimal order for all
>>>>    page sizes.
>>>>
>>>> 2. Even with the optimal order, the mmap_miss heuristic in
>>>>    do_sync_mmap_readahead() silently disables exec readahead after 100
>>>>    page faults. The mmap_miss counter tracks whether readahead is useful
>>>>    for mmap'd file access:
>>>>
>>>>    - Incremented by 1 in do_sync_mmap_readahead() on every page cache
>>>>      miss (page needed IO).
>>>>
>>>>    - Decremented by N in filemap_map_pages() for N pages successfully
>>>>      mapped via fault-around (pages found in cache without faulting,
>>>>      evidence that readahead was useful). Only non-workingset pages
>>>>      count and recently evicted and re-read pages don't count as hits.
>>>>
>>>>    - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
>>>>      marker page is found (indicates sequential consumption of readahead
>>>>      pages).
>>>>
>>>>    When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
>>>>    disabled. On 64K pages, both decrement paths are inactive:
>>>>
>>>>    - filemap_map_pages() is never called because fault_around_pages
>>>>      (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
>>>>      requires fault_around_pages > 1. With only 1 page in the
>>>>      fault-around window, there is nothing "around" to map.
>>>>
>>>>    - do_async_mmap_readahead() never fires for exec mappings because
>>>>      exec readahead sets async_size = 0, so no PG_readahead markers
>>>>      are placed.
>>>>
>>>>    With no decrements, mmap_miss monotonically increases past
>>>>    MMAP_LOTSAMISS after 100 faults, disabling exec readahead
>>>>    for the remainder of the mapping.
>>>>    Patch 2 fixes this by moving the VM_EXEC readahead block
>>>>    above the mmap_miss check, since exec readahead is targeted (one
>>>>    folio at the fault location, async_size=0) not speculative prefetch.
>>>>
>>>> 3. Even with correct folio order and readahead, contpte mapping requires
>>>>    the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
>>>>    The readahead path aligns file offsets and the buddy allocator aligns
>>>>    physical memory, but the virtual address depends on the VMA start.
>>>>    For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
>>>>    granularity, giving only a 1/32 chance of 2M alignment. When
>>>>    misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
>>>>    any folio in the VMA, resulting in zero iTLB coalescing benefit.
>>>>
>>>>    Patch 3 fixes this for the main binary by bumping the ELF loader's
>>>>    alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.
>>>>
>>>>    Patch 4 fixes this for shared libraries by adding a contpte-size
>>>>    alignment fallback in thp_get_unmapped_area_vmflags(). The existing
>>>>    PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
>>>>    libraries, so this smaller fallback (2M) succeeds where PMD fails.
>>>>
>>>> I created a benchmark that mmaps a large executable file and calls
>>>> RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
>>>> fault + readahead cost. "Random" first faults in all pages with a
>>>> sequential sweep (not measured), then measures time for calling random
>>>> offsets, isolating iTLB miss cost for scattered execution.
>>>>
>>>> The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
>>>> 512MB executable file on ext4, averaged over 3 runs:
>>>>
>>>>   Phase      | Baseline     | Patched      | Improvement
>>>>   -----------|--------------|--------------|------------------
>>>>   Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
>>>>   Random     | 76.0 ms      | 58.3 ms      | 23% faster
>>>
>>> I'm curious: is a single order really what we want?
>>>
>>> I'd instead assume that we might want to make decisions based on the
>>> mapping size.
>>>
>>> Assume you have a 128M mapping, wouldn't we want to use a different
>>> alignment than, say, for a 1M mapping, a 128K mapping or a 8k mapping?
>>>
>>
>> So I see 2 benefits from this. Page fault and iTLB coverage. IMHO page
>> faults are not that big of a deal? If the text section is hot, it won't
>> get flushed after faulting in. So the real benefit comes from improved
>> iTLB coverage.
>>
>> For a 128M mapping, 2M alignment gives 64 contpte entries. Aligning
>> to something larger (say 128M) wouldn't give any additional TLB
>> coalescing, each 2M-aligned region independently qualifies for contpte.
>>
>> Mappings smaller than 2M can't benefit from contpte regardless of
>> alignment, so falling back to PAGE_SIZE would be the optimal behaviour.
>> Adding intermediate sizes (e.g. 512K, 128K) wouldn't map to any
>> hardware boundary and adds complexity without TLB benefit?
> 
> I might be wrong, but I think you are mixing two things here:
> 
> (1) "Minimum" folio size (exec_folio_order())
> 
> (2) VMA alignment.
> 
> 
> (2) should certainly be as large as (1), but assume we can get a 2M
> folio on arm64 4k, why shouldn't we align it to 2M if the region is
> reasonably sized, and use a PMD?
> 
> 

So this series is tackling both (1) and (2). When I started making changes
to the code, what I wanted was 2M folios at fault with 64K base page size
to reduce iTLB misses. This is what patch 1 (and 2) will achieve.

Yes, completely agree, (2) should be as large as (1). I didn't think about
PMD size on 4K, which you pointed out. do_sync_mmap_readahead() can provide
that via force_thp_readahead, so this should be supported.

But we shouldn't align to PMD size for all base page sizes. As Rui pointed
out, increasing the alignment size reduces ASLR entropy [1]. Should we cap
the alignment at 2M?

[1] https://lore.kernel.org/all/20260313144213.95686-1-r@hev.cc/
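One possible shape for that cap, illustrative only (whether 2M is the right ceiling is exactly the open question here):

```c
#include <assert.h>

#define SZ_2M	(2UL << 20)

/*
 * Illustrative sketch: align exec mappings to the exec folio size,
 * but never above 2M, so the ASLR entropy loss stays bounded across
 * base page sizes.
 */
static unsigned long exec_mmap_alignment(unsigned long folio_size)
{
	return folio_size < SZ_2M ? folio_size : SZ_2M;
}
```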


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-13 20:55   ` Usama Arif
@ 2026-03-18 10:52     ` Usama Arif
  2026-03-19  7:40       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 26+ messages in thread
From: Usama Arif @ 2026-03-18 10:52 UTC (permalink / raw)
  To: ryan.roberts, r
  Cc: Usama Arif, Andrew Morton, david, ajd, anshuman.khandual, apopple,
	baohua, baolin.wang, brauner, catalin.marinas, dev.jain, jack,
	kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
	linux-fsdevel, linux-kernel, linux-mm, lorenzo.stoakes, npache,
	rmclure, Al Viro, will, willy, ziy, hannes, kas, shakeel.butt,
	kernel-team

On Fri, 13 Mar 2026 13:55:38 -0700 Usama Arif <usama.arif@linux.dev> wrote:

> On Fri, 13 Mar 2026 16:33:42 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:
> 
> > On 10/03/2026 14:51, Usama Arif wrote:
> > > On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
> > > into a single iTLB entry, reducing iTLB pressure for large executable
> > > mappings.
> > > 
> > > exec_folio_order() was introduced [1] to request readahead at an
> > > arch-preferred folio order for executable memory, enabling contpte
> > > mapping on the fault path.
> > > 
> > > However, several things prevent this from working optimally on 16K and
> > > 64K page configurations:
> > > 
> > > 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
> > >    produces the optimal contpte order for 4K pages. For 16K pages it
> > >    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
> > >    returns order 0 (64K) instead of order 5 (2M). 
> > 
> > This was deliberate, although perhaps a bit conservative. I was concerned about
> > the possibility of read amplification; pointlessly reading in a load of memory
> > that never actually gets used. And that is independent of page size.
> > 
> > 2M seems quite big as a default IMHO, I could imagine Android might complain
> > about memory pressure in their 16K config, for example.
> > 
> 
> The force_thp_readahead path in do_sync_mmap_readahead() reads at
> HPAGE_PMD_ORDER (2M on x86) and even doubles it to 4M for
> non VM_RAND_READ mappings (ra->size *= 2), with async readahead
> enabled. exec_folio_order() is more conservative: a single 2M folio
> with async_size=0 and no speculative prefetch. So I think the memory
> pressure would not be worse than what x86 has?
> 
> For memory pressure on Android 16K: the readahead is clamped to VMA
> boundaries, so a small shared library won't read 2M.
> page_cache_ra_order() reduces folio order near EOF and on allocation
> failure, so the 2M order is a preference, not a guarantee with the
> current code?
> 

I am not a big fan of introducing Kconfig options, but would
CONFIG_EXEC_FOLIO_ORDER with a default value of 64K be a better
solution? Or maybe a default of 64K for 4K and 16K base page sizes,
but 2M for 64K, since the 64K base page size is mostly used on servers.

Using a default value of 64K would mean no change in behaviour.
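A sketch of what such a knob might look like; this option is hypothetical, nothing like it exists today:

```
config EXEC_FOLIO_ORDER_KB
	int "Preferred folio size for executable file mappings (KB)"
	default 64
	help
	  Request readahead of executable file mappings in folios of
	  this size. The default of 64K preserves today's behaviour;
	  larger values (e.g. 2048 for 2M) can reduce iTLB pressure
	  at the cost of read amplification and memory pressure.
```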


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-14  9:50 ` WANG Rui
@ 2026-03-18 10:57   ` Usama Arif
  2026-03-18 11:46     ` WANG Rui
  0 siblings, 1 reply; 26+ messages in thread
From: Usama Arif @ 2026-03-18 10:57 UTC (permalink / raw)
  To: WANG Rui, Ryan Roberts, David Hildenbrand
  Cc: Liam.Howlett, ajd, akpm, anshuman.khandual, apopple, baohua,
	baolin.wang, brauner, catalin.marinas, david, dev.jain, hannes,
	jack, kas, kees, kernel-team, kevin.brodsky, lance.yang,
	linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
	lorenzo.stoakes, npache, rmclure, ryan.roberts, shakeel.butt,
	viro, will, willy, ziy



On 14/03/2026 12:50, WANG Rui wrote:
> I only just realized your focus was on 64K normal pages, what I was
> referring to here is AArch64 with 4K normal pages.
> 
> Sorry about the earlier numbers. They were a bit low precision.
> RK3399 has pretty limited PMU events, and it looks like it can’t
> collect events from the A53 and A72 clusters at the same time, so
> I reran the measurements on the A53.
> 
> Even though the A53 backend isn’t very wide, we can still see the
> impact from iTLB pressure. With 4K pages, aligning the code to PMD
> size (2M) performs slightly better than 64K.
> 
> Binutils: 2.46
> GCC: 15.2.1 (--enable-host-pie)
> 
> Workload: building vmlinux from Linux v7.0-rc1 with allnoconfig.
> Loop: 5
> 
>                 Base                 Patchset [1]         Patchset [2]
> instructions    1,994,512,163,037    1,994,528,896,322    1,994,536,148,574
> cpu-cycles      6,890,054,789,351    6,870,685,379,047    6,720,442,248,967
>                                               ~ -0.28%             ~ -2.46%
> itlb-misses           579,692,117          455,848,211           43,814,795
>                                              ~ -21.36%            ~ -92.44%
> time elapsed            1331.15 s            1325.50 s            1296.35 s
>                                               ~ -0.42%             ~ -2.61%
> 


Thanks for running these! Just wanted to check: what is the base page size
in this experiment?

Of course PMD is going to perform better than TLB coalescing (the page
fault walk itself is one level shorter). But it's a tradeoff between
memory pressure + reduced ASLR and performance. As Ryan pointed out in
[1], even 2M for the 16K base page size might introduce too much memory
pressure for Android phones, and the PMD size for 16K is 32M!

[1] https://lore.kernel.org/all/cfdfca9c-4752-4037-a289-03e6e7a00d47@arm.com/
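For reference, those PMD spans fall straight out of the page table geometry, assuming 8-byte PTEs so one page holds PAGE_SIZE/8 entries:

```c
#include <assert.h>

/* Area covered by one PMD entry: one full page of 8-byte PTEs,
 * each mapping one base page. */
static unsigned long pmd_span(unsigned long page_size)
{
	return page_size * (page_size / 8);
}
```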

> Maybe we could make exec_folio_order() choose a different folio size
> depending on the configuration, conditioned in some way, for example
> on the size of the code segment?

Yeah I think introducing Kconfig might be an option.

> 
> [1] https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev
> [2] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
> 
> Thanks,
> Rui



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-18 10:57   ` Usama Arif
@ 2026-03-18 11:46     ` WANG Rui
  0 siblings, 0 replies; 26+ messages in thread
From: WANG Rui @ 2026-03-18 11:46 UTC (permalink / raw)
  To: usama.arif
  Cc: Liam.Howlett, ajd, akpm, anshuman.khandual, apopple, baohua,
	baolin.wang, brauner, catalin.marinas, david, dev.jain, hannes,
	jack, kas, kees, kernel-team, kevin.brodsky, lance.yang,
	linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
	lorenzo.stoakes, npache, r, rmclure, ryan.roberts, shakeel.butt,
	viro, will, willy, ziy

> Thanks for running these! Just wanted to check what is the base page size
> of this experiment?

base page size: 4K

> Yeah I think introducing Kconfig might be an option.

I wonder if it would make sense for exec_folio_order() to vary the
order based on the code size, instead of always returning a fixed
value for a given architecture and base page size.

For example, on AArch64 with 4K base pages, in the load_elf_binary()
case: if exec_folio_order() only ever returns cont-PTE (64K), we may
miss the opportunity to use PMD mappings. On the other hand, if it
always returns PMD (2M), then for binaries smaller than 2M we end up
reducing ASLR entropy.

Maybe something along these lines would work better:

unsigned int exec_folio_order(size_t code_size)
{
#if PAGE_SIZE == 4096
    if (code_size >= PMD_SIZE)
        return ilog2(SZ_2M >> PAGE_SHIFT);
    else if (code_size >= SZ_64K)
        return ilog2(SZ_64K >> PAGE_SHIFT);
    else
        return 0;
#elif PAGE_SIZE == 16384
    ...
#elif PAGE_SIZE == ...
    /* let the arch cap the max order here, rather
       than hard-coding it at the use sites */
#endif
}

Thanks,
Rui


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-18 10:41       ` Usama Arif
@ 2026-03-18 12:41         ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-18 12:41 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, ryan.roberts
  Cc: ajd, anshuman.khandual, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, lorenzo.stoakes, npache, rmclure, Al Viro, will, willy,
	ziy, hannes, kas, shakeel.butt, kernel-team, WANG Rui

On 3/18/26 11:41, Usama Arif wrote:
> 
> 
> On 16/03/2026 19:06, David Hildenbrand (Arm) wrote:
>> On 3/13/26 20:59, Usama Arif wrote:
>>>
>>>
>>>
>>> So I see 2 benefits from this. Page fault and iTLB coverage. IMHO page
>>> faults are not that big of a deal? If the text section is hot, it won't
>>> get flushed after faulting in. So the real benefit comes from improved
>>> iTLB coverage.
>>>
>>> For a 128M mapping, 2M alignment gives 64 contpte entries. Aligning
>>> to something larger (say 128M) wouldn't give any additional TLB
>>> coalescing, each 2M-aligned region independently qualifies for contpte.
>>>
>>> Mappings smaller than 2M can't benefit from contpte regardless of
>>> alignment, so falling back to PAGE_SIZE would be the optimal behaviour.
>>> Adding intermediate sizes (e.g. 512K, 128K) wouldn't map to any
>>> hardware boundary and adds complexity without TLB benefit?
>>
>> I might be wrong, but I think you are mixing two things here:
>>
>> (1) "Minimum" folio size (exec_folio_order())
>>
>> (2) VMA alignment.
>>
>>
>> (2) should certainly be as large as (1), but assume we can get a 2M
>> folio on arm64 4k, why shouldn't we align it to 2M if the region is
>> reasonably sized, and use a PMD?
>>
>>
> 
> So this series is tackling both (1) and (2). When I started making changes
> to the code, what I wanted was 2M folios at fault with 64K base page size
> to reduce iTLB misses. This is what patch 1 (and 2) will achieve.
> 
> Yes, completely agree, (2) should be as large as (1). I didn't think about
> PMD size on 4K which you pointed out. do_sync_mmap_readahead can give
> that with force_thp_readahead, so this should be supported.

In particular, if hardware starts optimizing transparently at other
granularities, the "smallest granularity" (exec_folio_order())
decision will soon be wrong.

> 
> But we shouldn't align to PMD size for all base page sizes. As Rui pointed
> out, increasing alignment size reduces ASLR entropy [1]. Should we max alignement
> to 2M?

That's why I said that, as an input, we'd likely want to use the mapping
size or other heuristics.

We wouldn't want to align a 4k mapping to either 64k or 2M.

Long story short: the change in thp_get_unmapped_area_vmflags() needs
some thought IMHO.
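To make the point concrete, a size-aware policy might look roughly like this; the thresholds are invented for illustration (arm64 with 4K base pages) and this is not a proposal from the patches in this thread:

```c
#include <assert.h>

#define SZ_64K	(64UL << 10)
#define SZ_2M	(2UL << 20)

/*
 * Illustrative size-aware policy: pick the largest alignment the
 * mapping can actually fill, and keep full ASLR granularity for
 * small mappings.
 */
static unsigned long exec_align_for_len(unsigned long len)
{
	if (len >= SZ_2M)
		return SZ_2M;	/* large enough for PMD / 2M folios */
	if (len >= SZ_64K)
		return SZ_64K;	/* contpte unit on 4K base pages */
	return 0;		/* small mapping: no extra alignment */
}
```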

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
  2026-03-10 14:51 ` [PATCH 2/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
@ 2026-03-18 16:43   ` Jan Kara
  2026-03-19  7:37     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2026-03-18 16:43 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, ryan.roberts, david, ajd, anshuman.khandual,
	apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
	jack, kees, kevin.brodsky, lance.yang, Liam.Howlett,
	linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
	lorenzo.stoakes, npache, rmclure, Al Viro, will, willy, ziy,
	hannes, kas, shakeel.butt, kernel-team

On Tue 10-03-26 07:51:15, Usama Arif wrote:
> The mmap_miss counter in do_sync_mmap_readahead() tracks whether
> readahead is useful for mmap'd file access. It is incremented by 1 on
> every page cache miss in do_sync_mmap_readahead(), and decremented in
> two places:
> 
>   - filemap_map_pages(): decremented by N for each of N pages
>     successfully mapped via fault-around (pages found already in cache,
>     evidence readahead was useful). Only pages not in the workingset
>     count as hits.
> 
>   - do_async_mmap_readahead(): decremented by 1 when a page with
>     PG_readahead is found in cache.
> 
> When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
> disabled, including the targeted VM_EXEC readahead [1] that requests
> arch-preferred folio orders for contpte mapping.
> 
> On arm64 with 64K base pages, both decrement paths are inactive:
> 
>   1. filemap_map_pages() is never called because fault_around_pages
>      (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
>      requires fault_around_pages > 1. With only 1 page in the
>      fault-around window, there is nothing "around" to map.
> 
>   2. do_async_mmap_readahead() never fires for exec mappings because
>      exec readahead sets async_size = 0, so no PG_readahead markers
>      are placed.
> 
> With no decrements, mmap_miss monotonically increases past
> MMAP_LOTSAMISS after 100 page faults, disabling all subsequent
> exec readahead.
> 
> Fix this by moving the VM_EXEC readahead block above the mmap_miss
> check. The exec readahead path is targeted: it reads a single folio at
> the fault location with async_size=0, not speculative prefetch, so the
> mmap_miss heuristic, which is designed to throttle wasteful speculative
> readahead, should not gate it. The page would need to be faulted in
> regardless; the only question is at what order.
> 
> [1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
> 
> Signed-off-by: Usama Arif <usama.arif@linux.dev>

I can see the problem but I'm not sure what you propose is the right fix.
If you move the VM_EXEC logic earlier, you'll effectively disable
VM_HUGEPAGE handling for VM_EXEC vmas which I don't think we want. So
shouldn't we rather disable mmap_miss logic for VM_EXEC vmas like:

	if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
		...
	}

								Honza

> ---
>  mm/filemap.c | 72 ++++++++++++++++++++++++++++------------------------
>  1 file changed, 39 insertions(+), 33 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6cd7974d4adab..c064f31ecec5a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3331,6 +3331,37 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  		}
>  	}
>  
> +	if (vm_flags & VM_EXEC) {
> +		/*
> +		 * Allow arch to request a preferred minimum folio order for
> +		 * executable memory. This can often be beneficial to
> +		 * performance if (e.g.) arm64 can contpte-map the folio.
> +		 * Executable memory rarely benefits from readahead, due to its
> +		 * random access nature, so set async_size to 0.
> +		 *
> +		 * Limit to the boundaries of the VMA to avoid reading in any
> +		 * pad that might exist between sections, which would be a waste
> +		 * of memory.
> +		 *
> +		 * This is targeted readahead (one folio at the fault location),
> +		 * not speculative prefetch, so bypass the mmap_miss heuristic
> +		 * which would otherwise disable it after MMAP_LOTSAMISS faults.
> +		 */
> +		struct vm_area_struct *vma = vmf->vma;
> +		unsigned long start = vma->vm_pgoff;
> +		unsigned long end = start + vma_pages(vma);
> +		unsigned long ra_end;
> +
> +		ra->order = exec_folio_order();
> +		ra->start = round_down(vmf->pgoff, 1UL << ra->order);
> +		ra->start = max(ra->start, start);
> +		ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
> +		ra_end = min(ra_end, end);
> +		ra->size = ra_end - ra->start;
> +		ra->async_size = 0;
> +		goto do_readahead;
> +	}
> +
>  	if (!(vm_flags & VM_SEQ_READ)) {
>  		/* Avoid banging the cache line if not needed */
>  		mmap_miss = READ_ONCE(ra->mmap_miss);
> @@ -3361,40 +3392,15 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  		return fpin;
>  	}
>  
> -	if (vm_flags & VM_EXEC) {
> -		/*
> -		 * Allow arch to request a preferred minimum folio order for
> -		 * executable memory. This can often be beneficial to
> -		 * performance if (e.g.) arm64 can contpte-map the folio.
> -		 * Executable memory rarely benefits from readahead, due to its
> -		 * random access nature, so set async_size to 0.
> -		 *
> -		 * Limit to the boundaries of the VMA to avoid reading in any
> -		 * pad that might exist between sections, which would be a waste
> -		 * of memory.
> -		 */
> -		struct vm_area_struct *vma = vmf->vma;
> -		unsigned long start = vma->vm_pgoff;
> -		unsigned long end = start + vma_pages(vma);
> -		unsigned long ra_end;
> -
> -		ra->order = exec_folio_order();
> -		ra->start = round_down(vmf->pgoff, 1UL << ra->order);
> -		ra->start = max(ra->start, start);
> -		ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
> -		ra_end = min(ra_end, end);
> -		ra->size = ra_end - ra->start;
> -		ra->async_size = 0;
> -	} else {
> -		/*
> -		 * mmap read-around
> -		 */
> -		ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> -		ra->size = ra->ra_pages;
> -		ra->async_size = ra->ra_pages / 4;
> -		ra->order = 0;
> -	}
> +	/*
> +	 * mmap read-around
> +	 */
> +	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> +	ra->size = ra->ra_pages;
> +	ra->async_size = ra->ra_pages / 4;
> +	ra->order = 0;
>  
> +do_readahead:
>  	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>  	ractl._index = ra->start;
>  	page_cache_ra_order(&ractl, ra);
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
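The counter starvation described in the quoted commit message can be modeled
with a toy loop (illustrative Python, not kernel code; MMAP_LOTSAMISS matches
the value cited above, everything else is deliberately simplified):

```python
MMAP_LOTSAMISS = 100

def fault_sequence(nr_faults, decrements_active):
    """Model the do_sync_mmap_readahead() gate: every page cache miss
    increments mmap_miss, and readahead is skipped once the counter
    exceeds the threshold. On 64K arm64 neither decrement path fires,
    which corresponds to decrements_active=False."""
    mmap_miss = 0
    allowed = 0
    for _ in range(nr_faults):
        mmap_miss += 1
        if mmap_miss > MMAP_LOTSAMISS:
            continue  # all readahead disabled, incl. the VM_EXEC path
        allowed += 1
        if decrements_active:
            mmap_miss = max(mmap_miss - 1, 0)  # fault-around hit credit
    return allowed

print(fault_sequence(150, decrements_active=False))  # starves after 100
print(fault_sequence(150, decrements_active=True))   # never starves
```

This reproduces the failure mode in the patch description: with both decrement
paths inactive, only the first 100 faults ever get exec readahead.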


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/4] arm64: request contpte-sized folios for exec memory
  2026-03-10 14:51 ` [PATCH 1/4] arm64: request contpte-sized folios for exec memory Usama Arif
@ 2026-03-19  7:35   ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-19  7:35 UTC (permalink / raw)
  To: Usama Arif, Andrew Morton, ryan.roberts
  Cc: ajd, anshuman.khandual, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	linux-mm, lorenzo.stoakes, npache, rmclure, Al Viro, will, willy,
	ziy, hannes, kas, shakeel.butt, kernel-team

On 3/10/26 15:51, Usama Arif wrote:
> exec_folio_order() was introduced [1] to request readahead of executable
> file-backed pages at an arch-preferred folio order, so that the hardware
> can coalesce contiguous PTEs into fewer iTLB entries (contpte).
> 
> The current implementation uses ilog2(SZ_64K >> PAGE_SHIFT), which
> requests 64K folios. This is optimal for 4K base pages (where CONT_PTES
> = 16, contpte size = 64K), but suboptimal for 16K and 64K base pages:
> 
> Page size | Before (order) | After (order) | contpte
> ----------|----------------|---------------|--------
> 4K        | 4 (64K)        | 4 (64K)       | Yes (unchanged)
> 16K       | 2 (64K)        | 7 (2M)        | Yes (new)
> 64K       | 0 (64K)        | 5 (2M)        | Yes (new)
> 
> For 16K pages, CONT_PTES = 128 and the contpte size is 2M (order 7).
> For 64K pages, CONT_PTES = 32 and the contpte size is 2M (order 5).
> 
> Use ilog2(CONT_PTES) instead, which directly evaluates to contpte-aligned
> order for all page sizes.
> 
> The worst-case waste is bounded to one folio (up to 2MB - 64KB)
> at the end of the file, since page_cache_ra_order() reduces the folio
> order near EOF to avoid allocating past i_size.

So, if you have a smallish text segment in a larger file, we'd always
try to allocate 2M on 16k/64k?

That feels wrong.

Asking the other way around: why not also use 2M on a 4k system and end
up with a PMD?

And no, I don't think we should default to that, just emphasizing my
point that *maybe* we really want to consider mapping (vma) size as well.

-- 
Cheers,

David
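The before/after columns in the quoted table can be reproduced numerically
(Python sketch; the contpte sizes are restated from the commit message rather
than derived from arm64 headers):

```python
import math

SZ_64K = 1 << 16
# contpte block size in bytes per base page shift, per the commit message:
# 4K pages -> 64K block, 16K pages -> 2M block, 64K pages -> 2M block
CONTPTE_BYTES = {12: 1 << 16, 14: 1 << 21, 16: 1 << 21}

def order_before(page_shift):
    # old code: ilog2(SZ_64K >> PAGE_SHIFT)
    return int(math.log2(SZ_64K >> page_shift))

def order_after(page_shift):
    # new code: ilog2(CONT_PTES), where CONT_PTES = contpte size / page size
    return int(math.log2(CONTPTE_BYTES[page_shift] >> page_shift))

for shift in (12, 14, 16):
    print(f"{1 << (shift - 10)}K pages: order {order_before(shift)} -> "
          f"{order_after(shift)}")
```

For 4K pages the two expressions agree (order 4); for 16K and 64K pages the
new expression yields the 2M contpte orders (7 and 5) from the table.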


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
  2026-03-18 16:43   ` Jan Kara
@ 2026-03-19  7:37     ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-19  7:37 UTC (permalink / raw)
  To: Jan Kara, Usama Arif
  Cc: Andrew Morton, ryan.roberts, ajd, anshuman.khandual, apopple,
	baohua, baolin.wang, brauner, catalin.marinas, dev.jain, kees,
	kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
	linux-fsdevel, linux-kernel, linux-mm, lorenzo.stoakes, npache,
	rmclure, Al Viro, will, willy, ziy, hannes, kas, shakeel.butt,
	kernel-team

On 3/18/26 17:43, Jan Kara wrote:
> On Tue 10-03-26 07:51:15, Usama Arif wrote:
>> The mmap_miss counter in do_sync_mmap_readahead() tracks whether
>> readahead is useful for mmap'd file access. It is incremented by 1 on
>> every page cache miss in do_sync_mmap_readahead(), and decremented in
>> two places:
>>
>>   - filemap_map_pages(): decremented by N for each of N pages
>>     successfully mapped via fault-around (pages found already in cache,
>>     evidence readahead was useful). Only pages not in the workingset
>>     count as hits.
>>
>>   - do_async_mmap_readahead(): decremented by 1 when a page with
>>     PG_readahead is found in cache.
>>
>> When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
>> disabled, including the targeted VM_EXEC readahead [1] that requests
>> arch-preferred folio orders for contpte mapping.
>>
>> On arm64 with 64K base pages, both decrement paths are inactive:
>>
>>   1. filemap_map_pages() is never called because fault_around_pages
>>      (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
>>      requires fault_around_pages > 1. With only 1 page in the
>>      fault-around window, there is nothing "around" to map.
>>
>>   2. do_async_mmap_readahead() never fires for exec mappings because
>>      exec readahead sets async_size = 0, so no PG_readahead markers
>>      are placed.
>>
>> With no decrements, mmap_miss monotonically increases past
>> MMAP_LOTSAMISS after 100 page faults, disabling all subsequent
>> exec readahead.
>>
>> Fix this by moving the VM_EXEC readahead block above the mmap_miss
>> check. The exec readahead path is targeted: it reads a single folio at
>> the fault location with async_size=0, not speculative prefetch, so the
>> mmap_miss heuristic, which is designed to throttle wasteful speculative
>> readahead, should not gate it. The page would need to be faulted in
>> regardless; the only question is at what order.
>>
>> [1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> 
> I can see the problem but I'm not sure what you propose is the right fix.
> If you move the VM_EXEC logic earlier, you'll effectively disable
> VM_HUGEPAGE handling for VM_EXEC vmas which I don't think we want. So
> shouldn't we rather disable mmap_miss logic for VM_EXEC vmas like:
> 
> 	if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
> 		...
> 	}
> 

That sounds reasonable to me.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
  2026-03-18 10:52     ` Usama Arif
@ 2026-03-19  7:40       ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-19  7:40 UTC (permalink / raw)
  To: Usama Arif, ryan.roberts, r
  Cc: Andrew Morton, ajd, anshuman.khandual, apopple, baohua,
	baolin.wang, brauner, catalin.marinas, dev.jain, jack, kees,
	kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
	linux-fsdevel, linux-kernel, linux-mm, lorenzo.stoakes, npache,
	rmclure, Al Viro, will, willy, ziy, hannes, kas, shakeel.butt,
	kernel-team

On 3/18/26 11:52, Usama Arif wrote:
> On Fri, 13 Mar 2026 13:55:38 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> 
>> On Fri, 13 Mar 2026 16:33:42 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>>>
>>> This was deliberate, although perhaps a bit conservative. I was concerned about
>>> the possibility of read amplification; pointlessly reading in a load of memory
>>> that never actually gets used. And that is independent of page size.
>>>
>>> 2M seems quite big as a default IMHO, I could imagine Android might complain
>>> about memory pressure in their 16K config, for example.
>>>
>>
>> The force_thp_readahead path in do_sync_mmap_readahead() reads at
>> HPAGE_PMD_ORDER (2M on x86) and even doubles it to 4M for
>> non VM_RAND_READ mappings (ra->size *= 2), with async readahead
>> enabled. exec_folio_order() is more conservative: a single 2M folio
>> with async_size=0 and no speculative prefetch. So I think the memory
>> pressure would not be worse than what x86 already has?
>>
>> For memory pressure on Android 16K: the readahead is clamped to VMA
>> boundaries, so a small shared library won't read 2M.
>> page_cache_ra_order() reduces folio order near EOF and on allocation
>> failure, so the 2M order is a preference, not a guarantee with the
>> current code?
>>
> 
> I am not a big fan of introducing Kconfig options, but would
> CONFIG_EXEC_FOLIO_ORDER with the default value being 64K be a better
> solution? Or maybe a default of 64K for the 4K and 16K base page sizes,
> but 2M for 64K, since the 64K base page size is mostly used on servers.
> 
> Using a default value of 64K would mean no change in behaviour.
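The VMA-boundary clamping mentioned above mirrors the arithmetic in patch 2's
VM_EXEC block; here is a Python restatement of those round_down/round_up/
min/max steps (function and parameter names are illustrative, not kernel API):

```python
def round_down(x, align):
    return x & ~(align - 1)

def round_up(x, align):
    return (x + align - 1) & ~(align - 1)

def exec_ra_window(pgoff, vma_start, vma_pages, ra_pages, order):
    """Align the readahead window to the requested folio order, then
    clamp it to the VMA so padding between sections is not read in.
    All quantities are in pages; returns (ra->start, ra->size)."""
    align = 1 << order
    start = max(round_down(pgoff, align), vma_start)
    end = min(round_up(start + ra_pages, align), vma_start + vma_pages)
    return start, end - start

# A small 64K text mapping (16 x 4K pages) faulting at page 3, with a
# 2M-equivalent (order-9) request: the window clamps to the whole VMA.
print(exec_ra_window(pgoff=3, vma_start=0, vma_pages=16, ra_pages=32, order=9))
```

This is why a small shared library never reads a full 2M: the window can never
extend past the VMA, regardless of the requested order.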

I don't think such a tunable is the right approach. We should try to do
something smarter in the kernel.

We should have access to the mapping size and whether there is currently
real memory pressure. We might even know the estimated speed of the
device we're loading data from :)

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2026-03-19  7:40 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-10 14:51 [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages Usama Arif
2026-03-10 14:51 ` [PATCH 1/4] arm64: request contpte-sized folios for exec memory Usama Arif
2026-03-19  7:35   ` David Hildenbrand (Arm)
2026-03-10 14:51 ` [PATCH 2/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
2026-03-18 16:43   ` Jan Kara
2026-03-19  7:37     ` David Hildenbrand (Arm)
2026-03-10 14:51 ` [PATCH 3/4] elf: align ET_DYN base to exec folio order for contpte mapping Usama Arif
2026-03-13 14:42   ` WANG Rui
2026-03-13 19:47     ` Usama Arif
2026-03-14  2:10       ` hev
2026-03-10 14:51 ` [PATCH 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area Usama Arif
2026-03-14  3:47   ` WANG Rui
2026-03-13 13:20 ` [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages David Hildenbrand (Arm)
2026-03-13 19:59   ` Usama Arif
2026-03-16 16:06     ` David Hildenbrand (Arm)
2026-03-18 10:41       ` Usama Arif
2026-03-18 12:41         ` David Hildenbrand (Arm)
2026-03-13 16:33 ` Ryan Roberts
2026-03-13 20:55   ` Usama Arif
2026-03-18 10:52     ` Usama Arif
2026-03-19  7:40       ` David Hildenbrand (Arm)
2026-03-14 13:20   ` WANG Rui
2026-03-13 16:35 ` hev
2026-03-14  9:50 ` WANG Rui
2026-03-18 10:57   ` Usama Arif
2026-03-18 11:46     ` WANG Rui

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox