public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
* [PATCH v3 0/4] mm: improve large folio readahead and alignment for exec memory
@ 2026-04-02 18:08 Usama Arif
  2026-04-02 18:08 ` [PATCH v3 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	Lorenzo Stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
	surenb, vbabka, Al Viro, wilts.infradead.org,
	ziy, hannes, kas, shakeel.butt, leitao, kernel-team, Usama Arif

v2 -> v3: https://lore.kernel.org/all/20260320140315.979307-1-usama.arif@linux.dev/
- Take READ_ONLY_THP_FOR_FS into account for ELF alignment by aligning
  to HPAGE_PMD_SIZE, limited to 2M (Rui)
- Reviewed-by tags for patch 1 from Kiryl and Jan
- Remove preferred_exec_order() (Jan)
- Change ra->order to HPAGE_PMD_ORDER if vma_pages(vma) >= HPAGE_PMD_NR
  otherwise use exec_folio_order() with gfp &= ~__GFP_RECLAIM for
  do_sync_mmap_readahead().
- Change exec_folio_order() to return 2M (cont-pte size) for 64K base
  page size for arm64.
- remove bprm->file NULL check (Matthew)
- Change filp to file (Matthew)
- Improve checking of p_vaddr and p_offset (Rui and Matthew)


v1 -> v2: https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev/
- disable mmap_miss logic for VM_EXEC (Jan Kara)
- Align in elf only when segment VA and file offset are already aligned (Rui)
- preferred_exec_order() for VM_EXEC sync mmap_readahead which takes into
  account zone high watermarks (as an approximation of memory pressure)
  (David, or at least my approach to what David suggested in [1] :))
- Extend max alignment to mapping_max_folio_size() instead of
  exec_folio_order()

Motivation
==========
exec_folio_order() was introduced [2] to request readahead at an
arch-preferred folio order for executable memory, enabling hardware PTE
coalescing (e.g. arm64 contpte) and PMD mappings on the fault path.

However, several things prevent this from working optimally:

1. The mmap_miss heuristic in do_sync_mmap_readahead() silently disables
   exec readahead after 100 page faults. The mmap_miss counter tracks
   whether readahead is useful for mmap'd file access:

   - Incremented by 1 in do_sync_mmap_readahead() on every page cache
     miss (page needed IO).

   - Decremented by N in filemap_map_pages() for N pages successfully
     mapped via fault-around (pages found in cache without faulting,
     evidence that readahead was useful). Only non-workingset pages
     count as hits; recently evicted and re-read pages do not.

   - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
     marker page is found (indicates sequential consumption of readahead
     pages).

   When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
   disabled. On arm64 with 64K pages, both decrement paths are inactive:

   - filemap_map_pages() is never called because fault_around_pages
     (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
     requires fault_around_pages > 1. With only 1 page in the
     fault-around window, there is nothing "around" to map.

   - do_async_mmap_readahead() never fires for exec mappings because
     exec readahead sets async_size = 0, so no PG_readahead markers
     are placed.

   With no decrements, mmap_miss monotonically increases past
   MMAP_LOTSAMISS after 100 faults, disabling exec readahead
   for the remainder of the mapping. Patch 1 fixes this by excluding
   VM_EXEC VMAs from the mmap_miss logic, similar to how VM_SEQ_READ
   is already excluded.

2. exec_folio_order() is an arch-specific hook that returns a static
   order (ilog2(SZ_64K >> PAGE_SHIFT)), which is suboptimal for non-4K
   page sizes. Patch 2 updates the arm64 exec_folio_order() to return
   the order corresponding to 2M on 64K page configurations (for contpte
   coalescing, where the previous SZ_64K value collapsed to order 0) and
   uses a tiered
   allocation strategy in do_sync_mmap_readahead(): if the VMA is large
   enough for a full PMD, request HPAGE_PMD_ORDER so the folio can be
   PMD-mapped; otherwise fall back to exec_folio_order() for hardware
   PTE coalescing. The allocation uses ~__GFP_RECLAIM so it is
   opportunistic, falling back to smaller folios without stalling on
   reclaim or compaction.

3. Even with correct folio order and readahead, hardware PTE coalescing
   (e.g. contpte) and PMD mapping require the virtual address to be
   aligned to the folio size. The readahead path aligns file offsets and
   the buddy allocator aligns physical memory, but the virtual address
   depends on the VMA start. For PIE binaries, ASLR randomizes the load
   address at PAGE_SIZE granularity, so on arm64 with 64K pages only
   1/32 of load addresses are 2M-aligned. When misaligned, contpte
   cannot be used for any folio in the VMA.

   Patch 3 fixes this for the main binary by extending maximum_alignment()
   in the ELF loader with a folio_alignment() helper that tries two
   tiers matching the readahead strategy: first HPAGE_PMD_SIZE for PMD
   mapping, then exec_folio_order() as a fallback for hardware TLB
   coalescing. The alignment is capped to the segment size to avoid
   reducing ASLR entropy for small binaries.

   Patch 4 fixes this for shared libraries by adding an
   exec_folio_order() alignment fallback in
   thp_get_unmapped_area_vmflags(). The existing PMD_SIZE alignment
   (512M on arm64 64K pages) is too large for typical shared libraries,
   so this smaller fallback succeeds where PMD fails.

I created a benchmark that mmaps a large executable file and calls
RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
fault + readahead cost. "Random" first faults in all pages with a
sequential sweep (not measured), then measures time for calling random
offsets, isolating iTLB miss cost for scattered execution.

The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
512MB executable file on ext4, averaged over 3 runs:

  Phase      | Baseline     | Patched      | Improvement
  -----------|--------------|--------------|------------------
  Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
  Random     | 76.0 ms      | 58.3 ms      | 23% faster

[1] https://lore.kernel.org/all/d72d5ca3-4b92-470e-9f89-9f39a3975f1e@kernel.org/
[2] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
 
Usama Arif (4):
  mm: bypass mmap_miss heuristic for VM_EXEC readahead
  mm: use tiered folio allocation for VM_EXEC readahead
  elf: align ET_DYN base for PTE coalescing and PMD mapping
  mm: align file-backed mmap to exec folio order in
    thp_get_unmapped_area

 arch/arm64/include/asm/pgtable.h | 16 ++++++----
 fs/binfmt_elf.c                  | 50 ++++++++++++++++++++++++++++++++
 mm/filemap.c                     | 42 +++++++++++++++++++--------
 mm/huge_memory.c                 | 13 +++++++++
 mm/internal.h                    |  3 +-
 mm/readahead.c                   |  7 ++---
 6 files changed, 109 insertions(+), 22 deletions(-)

-- 
2.52.0



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v3 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
  2026-04-02 18:08 [PATCH v3 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
@ 2026-04-02 18:08 ` Usama Arif
  2026-04-02 18:08 ` [PATCH v3 2/4] mm: use tiered folio allocation " Usama Arif
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	Lorenzo Stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
	surenb, vbabka, Al Viro, wilts.infradead.org,
	ziy, hannes, kas, shakeel.butt, leitao, kernel-team, Usama Arif

The mmap_miss counter in do_sync_mmap_readahead() tracks whether
readahead is useful for mmap'd file access. It is incremented by 1 on
every page cache miss in do_sync_mmap_readahead(), and decremented in
two places:

  - filemap_map_pages(): decremented by N for each of N pages
    successfully mapped via fault-around (pages found already in cache,
    evidence readahead was useful). Only pages not in the workingset
    count as hits.

  - do_async_mmap_readahead(): decremented by 1 when a page with
    PG_readahead is found in cache.

When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
disabled, including the targeted VM_EXEC readahead [1] that requests
large folio orders for contpte mapping.

On arm64 with 64K base pages, both decrement paths are inactive:

  1. filemap_map_pages() is never called because fault_around_pages
     (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
     requires fault_around_pages > 1. With only 1 page in the
     fault-around window, there is nothing "around" to map.

  2. do_async_mmap_readahead() never fires for exec mappings because
     exec readahead sets async_size = 0, so no PG_readahead markers
     are placed.

With no decrements, mmap_miss monotonically increases past
MMAP_LOTSAMISS after 100 page faults, disabling all subsequent
exec readahead.

Fix this by excluding VM_EXEC VMAs from the mmap_miss logic, similar
to how VM_SEQ_READ is already excluded. The exec readahead path is
targeted (one folio at the fault location, async_size=0), not
speculative prefetch, so the mmap_miss heuristic designed to throttle
wasteful speculative readahead should not apply to it.

[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/

Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 2b933a1da9bd..a4ea869b2ca1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3337,7 +3337,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 		}
 	}
 
-	if (!(vm_flags & VM_SEQ_READ)) {
+	if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
 		/* Avoid banging the cache line if not needed */
 		mmap_miss = READ_ONCE(ra->mmap_miss);
 		if (mmap_miss < MMAP_LOTSAMISS * 10)
-- 
2.52.0




* [PATCH v3 2/4] mm: use tiered folio allocation for VM_EXEC readahead
  2026-04-02 18:08 [PATCH v3 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
  2026-04-02 18:08 ` [PATCH v3 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
@ 2026-04-02 18:08 ` Usama Arif
  2026-04-02 18:08 ` [PATCH v3 3/4] elf: align ET_DYN base for PTE coalescing and PMD mapping Usama Arif
  2026-04-02 18:08 ` [PATCH v3 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area Usama Arif
  3 siblings, 0 replies; 5+ messages in thread
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	Lorenzo Stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
	surenb, vbabka, Al Viro, wilts.infradead.org,
	ziy, hannes, kas, shakeel.butt, leitao, kernel-team, Usama Arif

When executable pages are faulted via do_sync_mmap_readahead(), request
a folio order that enables the best hardware TLB coalescing available:

- If the VMA is large enough to contain a full PMD, request
  HPAGE_PMD_ORDER so the folio can be PMD-mapped. This benefits
  architectures where PMD_SIZE is reasonable (e.g. 2M on x86-64
  and arm64 with 4K pages). VM_EXEC VMAs are very unlikely to be
  large enough for this to take effect with the 512M PMD size of
  arm64 64K-page configurations.

- Otherwise, fall back to exec_folio_order(), which returns the
  minimum order for hardware PTE coalescing for arm64:
  - arm64 4K:  order 4 (64K) for contpte (16 PTEs → 1 iTLB entry)
  - arm64 16K: order 2 (64K) for HPA (4 pages → 1 TLB entry)
  - arm64 64K: order 5 (2M) for contpte (32 PTEs → 1 iTLB entry)
  - generic:   order 0 (no coalescing)

Update the arm64 exec_folio_order() to return ilog2(SZ_2M >>
PAGE_SHIFT) on 64K page configurations, where the previous SZ_64K
value collapsed to order 0 (a single page) and provided no coalescing
benefit.

Use ~__GFP_RECLAIM so the allocation is opportunistic: if a large
folio is readily available, use it, otherwise fall back to smaller
folios without stalling on reclaim or compaction. The existing fallback
in page_cache_ra_order() handles this naturally.

The readahead window is already clamped to the VMA boundaries, so
ra->size naturally caps the folio order via ilog2(ra->size) in
page_cache_ra_order().

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 arch/arm64/include/asm/pgtable.h | 16 +++++++++----
 mm/filemap.c                     | 40 +++++++++++++++++++++++---------
 mm/internal.h                    |  3 ++-
 mm/readahead.c                   |  7 +++---
 4 files changed, 45 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 52bafe79c10a..9ce9f73a6f35 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1591,12 +1591,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
 #define arch_wants_old_prefaulted_pte	cpu_has_hw_af
 
 /*
- * Request exec memory is read into pagecache in at least 64K folios. This size
- * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
- * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
- * pages are in use.
+ * Request exec memory is read into pagecache in folios large enough for
+ * hardware TLB coalescing. On 4K and 16K page configs this is 64K, which
+ * enables contpte mapping (16 × 4K) and HPA coalescing (4 × 16K). On
+ * 64K page configs, contpte requires 2M (32 × 64K).
  */
-#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
+#define exec_folio_order exec_folio_order
+static inline unsigned int exec_folio_order(void)
+{
+	if (PAGE_SIZE == SZ_64K)
+		return ilog2(SZ_2M >> PAGE_SHIFT);
+	return ilog2(SZ_64K >> PAGE_SHIFT);
+}
 
 static inline bool pud_sect_supported(void)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index a4ea869b2ca1..7ffea986b3b4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3311,6 +3311,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
 	struct file *fpin = NULL;
 	vm_flags_t vm_flags = vmf->vma->vm_flags;
+	gfp_t gfp = readahead_gfp_mask(mapping);
 	bool force_thp_readahead = false;
 	unsigned short mmap_miss;
 
@@ -3363,28 +3364,45 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 			ra->size *= 2;
 		ra->async_size = HPAGE_PMD_NR;
 		ra->order = HPAGE_PMD_ORDER;
-		page_cache_ra_order(&ractl, ra);
+		page_cache_ra_order(&ractl, ra, gfp);
 		return fpin;
 	}
 
 	if (vm_flags & VM_EXEC) {
 		/*
-		 * Allow arch to request a preferred minimum folio order for
-		 * executable memory. This can often be beneficial to
-		 * performance if (e.g.) arm64 can contpte-map the folio.
-		 * Executable memory rarely benefits from readahead, due to its
-		 * random access nature, so set async_size to 0.
+		 * Request large folios for executable memory to enable
+		 * hardware PTE coalescing and PMD mappings:
 		 *
-		 * Limit to the boundaries of the VMA to avoid reading in any
-		 * pad that might exist between sections, which would be a waste
-		 * of memory.
+		 *  - If the VMA is large enough for a PMD, request
+		 *    HPAGE_PMD_ORDER so the folio can be PMD-mapped.
+		 *  - Otherwise, use exec_folio_order() which returns
+		 *    the minimum order for hardware TLB coalescing
+		 *    (e.g. arm64 contpte/HPA).
+		 *
+		 * Use ~__GFP_RECLAIM so large folio allocation is
+		 * opportunistic — if memory isn't readily available,
+		 * fall back to smaller folios rather than stalling on
+		 * reclaim or compaction.
+		 *
+		 * Executable memory rarely benefits from speculative
+		 * readahead due to its random access nature, so set
+		 * async_size to 0.
+		 *
+		 * Limit to the boundaries of the VMA to avoid reading
+		 * in any pad that might exist between sections, which
+		 * would be a waste of memory.
 		 */
 		struct vm_area_struct *vma = vmf->vma;
 		unsigned long start = vma->vm_pgoff;
 		unsigned long end = start + vma_pages(vma);
 		unsigned long ra_end;
 
-		ra->order = exec_folio_order();
+		gfp &= ~__GFP_RECLAIM;
+		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+		    vma_pages(vma) >= HPAGE_PMD_NR)
+			ra->order = HPAGE_PMD_ORDER;
+		else
+			ra->order = exec_folio_order();
 		ra->start = round_down(vmf->pgoff, 1UL << ra->order);
 		ra->start = max(ra->start, start);
 		ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
@@ -3403,7 +3421,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 
 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	ractl._index = ra->start;
-	page_cache_ra_order(&ractl, ra);
+	page_cache_ra_order(&ractl, ra, gfp);
 	return fpin;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 475bd281a10d..e624cb619057 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -545,7 +545,8 @@ int zap_vma_for_reaping(struct vm_area_struct *vma);
 int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
 			   gfp_t gfp);
 
-void page_cache_ra_order(struct readahead_control *, struct file_ra_state *);
+void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
+			 gfp_t gfp);
 void force_page_cache_ra(struct readahead_control *, unsigned long nr);
 static inline void force_page_cache_readahead(struct address_space *mapping,
 		struct file *file, pgoff_t index, unsigned long nr_to_read)
diff --git a/mm/readahead.c b/mm/readahead.c
index 7b05082c89ea..b3dc08cf180c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -465,7 +465,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 }
 
 void page_cache_ra_order(struct readahead_control *ractl,
-		struct file_ra_state *ra)
+		struct file_ra_state *ra, gfp_t gfp)
 {
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t start = readahead_index(ractl);
@@ -475,7 +475,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	pgoff_t mark = index + ra->size - ra->async_size;
 	unsigned int nofs;
 	int err = 0;
-	gfp_t gfp = readahead_gfp_mask(mapping);
 	unsigned int new_order = ra->order;
 
 	trace_page_cache_ra_order(mapping->host, start, ra);
@@ -626,7 +625,7 @@ void page_cache_sync_ra(struct readahead_control *ractl,
 readit:
 	ra->order = 0;
 	ractl->_index = ra->start;
-	page_cache_ra_order(ractl, ra);
+	page_cache_ra_order(ractl, ra, readahead_gfp_mask(ractl->mapping));
 }
 EXPORT_SYMBOL_GPL(page_cache_sync_ra);
 
@@ -697,7 +696,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
 		ra->size -= end - aligned_end;
 	ra->async_size = ra->size;
 	ractl->_index = ra->start;
-	page_cache_ra_order(ractl, ra);
+	page_cache_ra_order(ractl, ra, readahead_gfp_mask(ractl->mapping));
 }
 EXPORT_SYMBOL_GPL(page_cache_async_ra);
 
-- 
2.52.0




* [PATCH v3 3/4] elf: align ET_DYN base for PTE coalescing and PMD mapping
  2026-04-02 18:08 [PATCH v3 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
  2026-04-02 18:08 ` [PATCH v3 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
  2026-04-02 18:08 ` [PATCH v3 2/4] mm: use tiered folio allocation " Usama Arif
@ 2026-04-02 18:08 ` Usama Arif
  2026-04-02 18:08 ` [PATCH v3 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area Usama Arif
  3 siblings, 0 replies; 5+ messages in thread
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	Lorenzo Stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
	surenb, vbabka, Al Viro, wilts.infradead.org,
	ziy, hannes, kas, shakeel.butt, leitao, kernel-team, Usama Arif

For PIE binaries (ET_DYN), the load address is randomized at PAGE_SIZE
granularity via arch_mmap_rnd(). On arm64 with 64K base pages, this
means the binary is 64K-aligned, but contpte mapping requires 2M
(CONT_PTE_SIZE) alignment.

Without proper virtual address alignment, readahead patches that
allocate large folios with aligned file offsets and physical addresses
cannot benefit from contpte mapping, as the contpte fold check in
contpte_set_ptes() requires the virtual address to be CONT_PTE_SIZE-
aligned.

Fix this by extending maximum_alignment() to consider folio alignment
at two tiers, matching the readahead allocation strategy:

- HPAGE_PMD_SIZE, so large folios can be PMD-mapped on
  architectures where PMD_SIZE is reasonable (e.g. 2M on x86-64
  and arm64 with 4K pages).

- exec_folio_order(), the minimum order for hardware TLB
  coalescing (e.g. arm64 contpte/HPA).

For each PT_LOAD segment, folio_alignment() tries both tiers and
returns the largest power-of-2 alignment that fits within the segment
size, with both p_vaddr and p_offset aligned to that size. This
ensures load_bias is folio-aligned so that file-offset-aligned folios
map to properly aligned virtual addresses, enabling hardware PTE
coalescing and PMD mappings for large folios.

The segment size check in folio_alignment() avoids reducing ASLR
entropy for small binaries that cannot benefit from large folio
alignment.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 fs/binfmt_elf.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 16a56b6b3f6c..f84fae6daf23 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -488,6 +488,54 @@ static int elf_read(struct file *file, void *buf, size_t len, loff_t pos)
 	return 0;
 }
 
+/*
+ * Return the largest folio alignment for a PT_LOAD segment, so the
+ * hardware can coalesce PTEs (e.g. arm64 contpte) or use PMD mappings
+ * for large folios.
+ *
+ * Try PMD alignment so large folios can be PMD-mapped. Then try
+ * exec_folio_order() alignment for hardware TLB coalescing (e.g.
+ * arm64 contpte/HPA).
+ *
+ * Use the largest power-of-2 that fits within the segment size, capped
+ * by the target folio size.
+ * Only align when the segment's virtual address and file offset are
+ * already aligned to that size, as misalignment would prevent coalescing
+ * anyway.
+ *
+ * The segment size check avoids reducing ASLR entropy for small binaries
+ * that cannot benefit.
+ */
+static unsigned long folio_alignment(struct elf_phdr *cmd)
+{
+	unsigned long alignment = 0;
+	unsigned long seg_size;
+
+	if (!cmd->p_filesz)
+		return 0;
+
+	seg_size = rounddown_pow_of_two(cmd->p_filesz);
+
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		unsigned long size = min(seg_size, HPAGE_PMD_SIZE);
+
+		if (size > PAGE_SIZE &&
+		    IS_ALIGNED(cmd->p_vaddr | cmd->p_offset, size))
+			alignment = size;
+	}
+
+	if (!alignment && exec_folio_order()) {
+		unsigned long size = min(seg_size,
+					PAGE_SIZE << exec_folio_order());
+
+		if (size > PAGE_SIZE &&
+		    IS_ALIGNED(cmd->p_vaddr | cmd->p_offset, size))
+			alignment = size;
+	}
+
+	return alignment;
+}
+
 static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
 {
 	unsigned long alignment = 0;
@@ -501,6 +549,8 @@ static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
 			if (!is_power_of_2(p_align))
 				continue;
 			alignment = max(alignment, p_align);
+			alignment = max(alignment,
+					folio_alignment(&cmds[i]));
 		}
 	}
 
-- 
2.52.0




* [PATCH v3 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area
  2026-04-02 18:08 [PATCH v3 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
                   ` (2 preceding siblings ...)
  2026-04-02 18:08 ` [PATCH v3 3/4] elf: align ET_DYN base for PTE coalescing and PMD mapping Usama Arif
@ 2026-04-02 18:08 ` Usama Arif
  3 siblings, 0 replies; 5+ messages in thread
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
  To: Andrew Morton, david, willy, ryan.roberts, linux-mm
  Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
	catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
	Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
	Lorenzo Stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
	surenb, vbabka, Al Viro, wilts.infradead.org,
	ziy, hannes, kas, shakeel.butt, leitao, kernel-team, Usama Arif

thp_get_unmapped_area() is the get_unmapped_area callback for
filesystems like ext4, xfs, and btrfs. It attempts to align the virtual
address for PMD_SIZE THP mappings, but on arm64 with 64K base pages
PMD_SIZE is 512M, which is too large for typical shared library mappings,
so the alignment always fails and falls back to PAGE_SIZE.

This means shared libraries loaded by ld.so via mmap() get 64K-aligned
virtual addresses, preventing contpte mapping even when large folios are
allocated with properly aligned file offsets and physical addresses.

Add a fallback in thp_get_unmapped_area_vmflags() that uses
exec_folio_order() to determine alignment, matching the readahead
allocation strategy. This aligns mappings to the hardware TLB
coalescing size (e.g. 2M for contpte on arm64 64K pages, 64K for
contpte/HPA on arm64 4K/16K pages), capped to the mapping length via
rounddown_pow_of_two(len).

The fallback is naturally a no-op on architectures where
exec_folio_order() returns 0, and skips the retry when the alignment
would equal PMD_SIZE (already attempted above) in case another
architecture changes exec_folio_order() in the future.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/huge_memory.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b2a6060b3c20..ad97ac8406dc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1218,6 +1218,19 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 	if (ret)
 		return ret;
 
+	if (filp && exec_folio_order()) {
+		unsigned long exec_folio_size = PAGE_SIZE << exec_folio_order();
+		unsigned long size = rounddown_pow_of_two(len);
+
+		size = min(size, exec_folio_size);
+		if (size > PAGE_SIZE && size != PMD_SIZE) {
+			ret = __thp_get_unmapped_area(filp, addr, len, off,
+						      flags, size, vm_flags);
+			if (ret)
+				return ret;
+		}
+	}
+
 	return mm_get_unmapped_area_vmflags(filp, addr, len, pgoff, flags,
 					    vm_flags);
 }
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 5+ messages in thread
