* [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
2026-03-20 13:58 [PATCH v2 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
@ 2026-03-20 13:58 ` Usama Arif
2026-03-20 14:18 ` Jan Kara
2026-03-20 14:26 ` Kiryl Shutsemau
2026-03-20 13:58 ` [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order() Usama Arif
` (2 subsequent siblings)
3 siblings, 2 replies; 16+ messages in thread
From: Usama Arif @ 2026-03-20 13:58 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
lorenzo.stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, linux-fsdevel
The mmap_miss counter in do_sync_mmap_readahead() tracks whether
readahead is useful for mmap'd file access. It is incremented by 1 on
every page cache miss in do_sync_mmap_readahead(), and decremented in
two places:
- filemap_map_pages(): decremented by N for each of N pages
successfully mapped via fault-around (pages found already in cache,
evidence readahead was useful). Only pages not in the workingset
count as hits.
- do_async_mmap_readahead(): decremented by 1 when a page with
PG_readahead is found in cache.
When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
disabled, including the targeted VM_EXEC readahead [1] that requests
large folio orders for contpte mapping.
On arm64 with 64K base pages, both decrement paths are inactive:
1. filemap_map_pages() is never called because fault_around_pages
(65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
requires fault_around_pages > 1. With only 1 page in the
fault-around window, there is nothing "around" to map.
2. do_async_mmap_readahead() never fires for exec mappings because
exec readahead sets async_size = 0, so no PG_readahead markers
are placed.
With no decrements, mmap_miss monotonically increases past
MMAP_LOTSAMISS after 100 page faults, disabling all subsequent
exec readahead.
Fix this by excluding VM_EXEC VMAs from the mmap_miss logic, similar
to how VM_SEQ_READ is already excluded. The exec readahead path is
targeted (one folio at the fault location, async_size=0), not
speculative prefetch, so the mmap_miss heuristic designed to throttle
wasteful speculative readahead should not apply to it.
[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/filemap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 6cd7974d4adab..7d89c6b384cc4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3331,7 +3331,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
}
}
- if (!(vm_flags & VM_SEQ_READ)) {
+ if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
/* Avoid banging the cache line if not needed */
mmap_miss = READ_ONCE(ra->mmap_miss);
if (mmap_miss < MMAP_LOTSAMISS * 10)
--
2.52.0
^ permalink raw reply related [flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
2026-03-20 13:58 ` [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
@ 2026-03-20 14:18 ` Jan Kara
2026-03-20 14:26 ` Kiryl Shutsemau
1 sibling, 0 replies; 16+ messages in thread
From: Jan Kara @ 2026-03-20 14:18 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On Fri 20-03-26 06:58:51, Usama Arif wrote:
> The mmap_miss counter in do_sync_mmap_readahead() tracks whether
> readahead is useful for mmap'd file access. It is incremented by 1 on
> every page cache miss in do_sync_mmap_readahead(), and decremented in
> two places:
>
> - filemap_map_pages(): decremented by N for each of N pages
> successfully mapped via fault-around (pages found already in cache,
> evidence readahead was useful). Only pages not in the workingset
> count as hits.
>
> - do_async_mmap_readahead(): decremented by 1 when a page with
> PG_readahead is found in cache.
>
> When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
> disabled, including the targeted VM_EXEC readahead [1] that requests
> large folio orders for contpte mapping.
>
> On arm64 with 64K base pages, both decrement paths are inactive:
>
> 1. filemap_map_pages() is never called because fault_around_pages
> (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
> requires fault_around_pages > 1. With only 1 page in the
> fault-around window, there is nothing "around" to map.
>
> 2. do_async_mmap_readahead() never fires for exec mappings because
> exec readahead sets async_size = 0, so no PG_readahead markers
> are placed.
>
> With no decrements, mmap_miss monotonically increases past
> MMAP_LOTSAMISS after 100 page faults, disabling all subsequent
> exec readahead.
>
> Fix this by excluding VM_EXEC VMAs from the mmap_miss logic, similar
> to how VM_SEQ_READ is already excluded. The exec readahead path is
> targeted (one folio at the fault location, async_size=0), not
> speculative prefetch, so the mmap_miss heuristic designed to throttle
> wasteful speculative readahead should not apply to it.
>
> [1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> mm/filemap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6cd7974d4adab..7d89c6b384cc4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3331,7 +3331,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> }
> }
>
> - if (!(vm_flags & VM_SEQ_READ)) {
> + if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
> /* Avoid banging the cache line if not needed */
> mmap_miss = READ_ONCE(ra->mmap_miss);
> if (mmap_miss < MMAP_LOTSAMISS * 10)
> --
> 2.52.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
2026-03-20 13:58 ` [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
2026-03-20 14:18 ` Jan Kara
@ 2026-03-20 14:26 ` Kiryl Shutsemau
1 sibling, 0 replies; 16+ messages in thread
From: Kiryl Shutsemau @ 2026-03-20 14:26 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro, wilts,
ziy, hannes, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:51AM -0700, Usama Arif wrote:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6cd7974d4adab..7d89c6b384cc4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3331,7 +3331,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> }
> }
>
> - if (!(vm_flags & VM_SEQ_READ)) {
> + if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
Strictly speaking, the fact that the file is mapped as executable doesn't
mean we are serving an instruction fetch page fault.
FAULT_FLAG_INSTRUCTION would be the signal, but it is only provided by a
handful of architectures.
VM_EXEC is a good enough proxy.
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
2026-03-20 13:58 [PATCH v2 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
2026-03-20 13:58 ` [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
@ 2026-03-20 13:58 ` Usama Arif
2026-03-20 14:41 ` Kiryl Shutsemau
2026-03-20 14:42 ` Jan Kara
2026-03-20 13:58 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Usama Arif
2026-03-20 13:58 ` [PATCH v2 4/4] mm: align file-backed mmap to max folio order in thp_get_unmapped_area Usama Arif
3 siblings, 2 replies; 16+ messages in thread
From: Usama Arif @ 2026-03-20 13:58 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
lorenzo.stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, linux-fsdevel
Replace the arch-specific exec_folio_order() hook with a generic
preferred_exec_order() that dynamically computes the readahead folio
order for executable memory. It targets min(PMD_ORDER, 2M) as the
maximum, which optimally gives the right answer for contpte (arm64),
PMD mapping (x86, arm64 4K), and architectures with smaller PMDs
(s390 1M). It adapts at runtime based on:
- VMA size: caps the order so folios fit within the mapping
- Memory pressure: steps down the order when the local node's free
memory is below the high watermark for the requested order
This avoids over-allocating on memory-constrained systems while still
requesting the optimal order when memory is plentiful.
Since exec_folio_order() is no longer needed, remove the arm64
definition and the generic default from pgtable.h.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
arch/arm64/include/asm/pgtable.h | 8 -----
include/linux/pgtable.h | 11 ------
mm/filemap.c | 57 ++++++++++++++++++++++++++++----
3 files changed, 51 insertions(+), 25 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b3e58735c49bd..b1e74940624d8 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1599,14 +1599,6 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
*/
#define arch_wants_old_prefaulted_pte cpu_has_hw_af
-/*
- * Request exec memory is read into pagecache in at least 64K folios. This size
- * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
- * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
- * pages are in use.
- */
-#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
-
static inline bool pud_sect_supported(void)
{
return PAGE_SIZE == SZ_4K;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a50df42a893fb..874333549eb3c 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -577,17 +577,6 @@ static inline bool arch_has_hw_pte_young(void)
}
#endif
-#ifndef exec_folio_order
-/*
- * Returns preferred minimum folio order for executable file-backed memory. Must
- * be in range [0, PMD_ORDER). Default to order-0.
- */
-static inline unsigned int exec_folio_order(void)
-{
- return 0;
-}
-#endif
-
#ifndef arch_check_zapped_pte
static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
pte_t pte)
diff --git a/mm/filemap.c b/mm/filemap.c
index 7d89c6b384cc4..aebfb78e487d7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3290,6 +3290,52 @@ static int lock_folio_maybe_drop_mmap(struct vm_fault *vmf, struct folio *folio,
return 1;
}
+/*
+ * Compute the preferred folio order for executable memory readahead.
+ * Targets min(PMD_ORDER, 2M) as the maximum, which gives the
+ * optimal order for contpte (arm64), PMD mapping (x86, arm64 4K), and
+ * architectures with smaller PMDs (s390 1M). The 2M cap also avoids
+ * requesting excessively large folios on configurations where PMD_ORDER
+ * is much larger (32M on 16K pages, 512M on 64K pages), which would cause
+ * unnecessary memory pressure. Adapts at runtime based on:
+ *
+ * - VMA size: cap the order so folios fit within the mapping.
+ *
+ * - Memory pressure: step down the order when free memory on the local
+ * node is below the high watermark for the requested order. This
+ * avoids expensive reclaim or compaction to satisfy large folio
+ * allocations when memory is tight.
+ */
+static unsigned int preferred_exec_order(struct vm_area_struct *vma)
+{
+ int order;
+ unsigned long vma_len = vma_pages(vma);
+ struct zone *zone;
+ gfp_t gfp;
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return 0;
+
+ /* Cap at min(PMD_ORDER, 2M) */
+ order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT));
+
+ /* Don't request folios larger than the VMA */
+ order = min(order, ilog2(vma_len));
+
+ /* Step down under memory pressure */
+ gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
+ zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
+ gfp_zone(gfp), NULL)->zone;
+ if (zone) {
+ while (order > 0 &&
+ !zone_watermark_ok(zone, order,
+ high_wmark_pages(zone), 0, 0))
+ order--;
+ }
+
+ return order;
+}
+
/*
* Synchronous readahead happens when we don't even find a page in the page
* cache at all. We don't want to perform IO under the mmap sem, so if we have
@@ -3363,11 +3409,10 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
if (vm_flags & VM_EXEC) {
/*
- * Allow arch to request a preferred minimum folio order for
- * executable memory. This can often be beneficial to
- * performance if (e.g.) arm64 can contpte-map the folio.
- * Executable memory rarely benefits from readahead, due to its
- * random access nature, so set async_size to 0.
+ * Request a preferred folio order for executable memory,
+ * dynamically adapted to VMA size and memory pressure.
+ * Executable memory rarely benefits from speculative readahead
+ * due to its random access nature, so set async_size to 0.
*
* Limit to the boundaries of the VMA to avoid reading in any
* pad that might exist between sections, which would be a waste
@@ -3378,7 +3423,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
unsigned long end = start + vma_pages(vma);
unsigned long ra_end;
- ra->order = exec_folio_order();
+ ra->order = preferred_exec_order(vma);
ra->start = round_down(vmf->pgoff, 1UL << ra->order);
ra->start = max(ra->start, start);
ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
--
2.52.0
^ permalink raw reply related [flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
2026-03-20 13:58 ` [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order() Usama Arif
@ 2026-03-20 14:41 ` Kiryl Shutsemau
2026-03-20 14:42 ` Jan Kara
1 sibling, 0 replies; 16+ messages in thread
From: Kiryl Shutsemau @ 2026-03-20 14:41 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro, wilts,
ziy, hannes, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:52AM -0700, Usama Arif wrote:
> + * allocations when memory is tight.
> + */
> +static unsigned int preferred_exec_order(struct vm_area_struct *vma)
> +{
> + int order;
> + unsigned long vma_len = vma_pages(vma);
> + struct zone *zone;
> + gfp_t gfp;
> +
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + return 0;
> +
> + /* Cap at min(PMD_ORDER, 2M) */
> + order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT));
> +
> + /* Don't request folios larger than the VMA */
> + order = min(order, ilog2(vma_len));
> +
> + /* Step down under memory pressure */
> + gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
> + zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
> + gfp_zone(gfp), NULL)->zone;
> + if (zone) {
> + while (order > 0 &&
> + !zone_watermark_ok(zone, order,
> + high_wmark_pages(zone), 0, 0))
> + order--;
> + }
Eww. That's overkill and a layering violation.
If we need something like this, it has to be done within the page
allocator: an allocation interface that takes a range (or mask) of
acceptable orders.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
2026-03-20 13:58 ` [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order() Usama Arif
2026-03-20 14:41 ` Kiryl Shutsemau
@ 2026-03-20 14:42 ` Jan Kara
2026-03-26 12:40 ` Usama Arif
1 sibling, 1 reply; 16+ messages in thread
From: Jan Kara @ 2026-03-20 14:42 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On Fri 20-03-26 06:58:52, Usama Arif wrote:
> Replace the arch-specific exec_folio_order() hook with a generic
> preferred_exec_order() that dynamically computes the readahead folio
> order for executable memory. It targets min(PMD_ORDER, 2M) as the
> maximum, which optimally gives the right answer for contpte (arm64),
> PMD mapping (x86, arm64 4K), and architectures with smaller PMDs
> (s390 1M). It adapts at runtime based on:
>
> - VMA size: caps the order so folios fit within the mapping
> - Memory pressure: steps down the order when the local node's free
> memory is below the high watermark for the requested order
>
> This avoids over-allocating on memory-constrained systems while still
> requesting the optimal order when memory is plentiful.
>
> Since exec_folio_order() is no longer needed, remove the arm64
> definition and the generic default from pgtable.h.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
...
> +static unsigned int preferred_exec_order(struct vm_area_struct *vma)
> +{
> + int order;
> + unsigned long vma_len = vma_pages(vma);
> + struct zone *zone;
> + gfp_t gfp;
> +
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + return 0;
> +
> + /* Cap at min(PMD_ORDER, 2M) */
> + order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT));
> +
> + /* Don't request folios larger than the VMA */
> + order = min(order, ilog2(vma_len));
Hum, as far as I'm checking page_cache_ra_order() used in
do_sync_mmap_readahead(), ra->order is the preferred order but it will be
trimmed down to fit both within the file and within ra->size. And ra->size
is set for the readahead to fit within the vma so I don't think any order
trimming based on vma length is needed in this place?
> + /* Step down under memory pressure */
> + gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
> + zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
> + gfp_zone(gfp), NULL)->zone;
> + if (zone) {
> + while (order > 0 &&
> + !zone_watermark_ok(zone, order,
> + high_wmark_pages(zone), 0, 0))
> + order--;
> + }
It looks wrong for this logic to be here. Trimming order based on memory
pressure makes sense (and we've already got reports that on memory limited
devices large order folios in the page cache have too big memory overhead
so we'll likely need to handle that for page cache allocations in general)
but IMHO it belongs to page_cache_ra_order() or some other common place
like that.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
2026-03-20 14:42 ` Jan Kara
@ 2026-03-26 12:40 ` Usama Arif
2026-03-26 16:21 ` Jan Kara
0 siblings, 1 reply; 16+ messages in thread
From: Usama Arif @ 2026-03-26 12:40 UTC (permalink / raw)
To: Jan Kara, david, ryan.roberts
Cc: Andrew Morton, willy, linux-mm, r, ajd, apopple, baohua,
baolin.wang, brauner, catalin.marinas, dev.jain, kees,
kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On 20/03/2026 17:42, Jan Kara wrote:
> On Fri 20-03-26 06:58:52, Usama Arif wrote:
>> Replace the arch-specific exec_folio_order() hook with a generic
>> preferred_exec_order() that dynamically computes the readahead folio
>> order for executable memory. It targets min(PMD_ORDER, 2M) as the
>> maximum, which optimally gives the right answer for contpte (arm64),
>> PMD mapping (x86, arm64 4K), and architectures with smaller PMDs
>> (s390 1M). It adapts at runtime based on:
>>
>> - VMA size: caps the order so folios fit within the mapping
>> - Memory pressure: steps down the order when the local node's free
>> memory is below the high watermark for the requested order
>>
>> This avoids over-allocating on memory-constrained systems while still
>> requesting the optimal order when memory is plentiful.
>>
>> Since exec_folio_order() is no longer needed, remove the arm64
>> definition and the generic default from pgtable.h.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ...
>> +static unsigned int preferred_exec_order(struct vm_area_struct *vma)
>> +{
>> + int order;
>> + unsigned long vma_len = vma_pages(vma);
>> + struct zone *zone;
>> + gfp_t gfp;
>> +
>> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
>> + return 0;
>> +
>> + /* Cap at min(PMD_ORDER, 2M) */
>> + order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT));
>> +
>> + /* Don't request folios larger than the VMA */
>> + order = min(order, ilog2(vma_len));
>
Hi Jan,
Thanks for the feedback and sorry for the late reply! I was travelling
during the week.
> Hum, as far as I'm checking page_cache_ra_order() used in
> do_sync_mmap_readahead(), ra->order is the preferred order but it will be
> trimmed down to fit both within the file and within ra->size. And ra->size
> is set for the readahead to fit within the vma so I don't think any order
> trimming based on vma length is needed in this place?
Ack, yes makes sense.
>
>> + /* Step down under memory pressure */
>> + gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
>> + zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
>> + gfp_zone(gfp), NULL)->zone;
>> + if (zone) {
>> + while (order > 0 &&
>> + !zone_watermark_ok(zone, order,
>> + high_wmark_pages(zone), 0, 0))
>> + order--;
>> + }
>
> It looks wrong for this logic to be here. Trimming order based on memory
> pressure makes sense (and we've already got reports that on memory limited
> devices large order folios in the page cache have too big memory overhead
> so we'll likely need to handle that for page cache allocations in general)
> but IMHO it belongs to page_cache_ra_order() or some other common place
> like that.
>
> Honza
So I have been thinking about this. readahead_gfp_mask() already sets
__GFP_NORETRY, so we won't try aggressive reclaim/compaction to satisfy
the allocation. page_cache_ra_order() falls through to the fallback path,
faulting in an order-0 page when the allocation is not satisfied.
Since the allocator already naturally steps down under memory pressure,
the explicit zone_watermark_ok() loop might be redundant?
What are your thoughts on just setting
ra->order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT))?
We can do the higher-order allocation with gfp &= ~__GFP_RECLAIM
for the VM_EXEC case.
^ permalink raw reply [flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
2026-03-26 12:40 ` Usama Arif
@ 2026-03-26 16:21 ` Jan Kara
0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2026-03-26 16:21 UTC (permalink / raw)
To: Usama Arif
Cc: Jan Kara, david, ryan.roberts, Andrew Morton, willy, linux-mm, r,
ajd, apopple, baohua, baolin.wang, brauner, catalin.marinas,
dev.jain, kees, kevin.brodsky, lance.yang, Liam.Howlett,
linux-arm-kernel, linux-fsdevel, linux-kernel, lorenzo.stoakes,
mhocko, npache, pasha.tatashin, rmclure, rppt, surenb, vbabka,
Al Viro, wilts.infradead.org, ziy, hannes, kas, shakeel.butt,
kernel-team
On Thu 26-03-26 08:40:21, Usama Arif wrote:
> On 20/03/2026 17:42, Jan Kara wrote:
> >> + /* Step down under memory pressure */
> >> + gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
> >> + zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
> >> + gfp_zone(gfp), NULL)->zone;
> >> + if (zone) {
> >> + while (order > 0 &&
> >> + !zone_watermark_ok(zone, order,
> >> + high_wmark_pages(zone), 0, 0))
> >> + order--;
> >> + }
> >
> > It looks wrong for this logic to be here. Trimming order based on memory
> > pressure makes sense (and we've already got reports that on memory limited
> > devices large order folios in the page cache have too big memory overhead
> > so we'll likely need to handle that for page cache allocations in general)
> > but IMHO it belongs to page_cache_ra_order() or some other common place
> > like that.
> >
> > Honza
>
> So I have been thinking about this. readahead_gfp_mask() already sets
> __GFP_NORETRY, so we wont try aggressive reclaim/compaction to satisfy
> the allocation. page_cache_ra_order() falls through to the fallback path
> faulting in order 0 page when allocation is not satsified.
>
> So the allocator already naturally steps down under memory pressure,
> the explicit zone_watermark_ok() loop might be redundant?
Probably yes. I still think we'll have to somehow better tune the used
order based on the expected size of the page cache (2M folios seem
unreasonably large for a machine that has e.g. 1G of memory in total, even
if it has enough free memory at this point in time - we'll benefit from
smaller folios and thus finer-grained folio activity tracking for such
cases). But that's not for this patch set.
> What are your thoughts on just setting
> ra->order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT))?
> We can do the higher orlder allocation with gfp &= ~__GFP_RECLAIM
> for the VM_EXEC case.
Yes, it's simple and it makes sense to me so if others are fine with it...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 13:58 [PATCH v2 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
2026-03-20 13:58 ` [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
2026-03-20 13:58 ` [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order() Usama Arif
@ 2026-03-20 13:58 ` Usama Arif
2026-03-20 14:55 ` Kiryl Shutsemau
` (2 more replies)
2026-03-20 13:58 ` [PATCH v2 4/4] mm: align file-backed mmap to max folio order in thp_get_unmapped_area Usama Arif
3 siblings, 3 replies; 16+ messages in thread
From: Usama Arif @ 2026-03-20 13:58 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
lorenzo.stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, linux-fsdevel
For PIE binaries (ET_DYN), the load address is randomized at PAGE_SIZE
granularity via arch_mmap_rnd(). On arm64 with 64K base pages, this
means the binary is 64K-aligned, but contpte mapping requires 2M
(CONT_PTE_SIZE) alignment.
Without proper virtual address alignment, readahead patches that
allocate 2M folios with 2M-aligned file offsets and physical addresses
cannot benefit from contpte mapping, as the contpte fold check in
contpte_set_ptes() requires the virtual address to be CONT_PTE_SIZE-
aligned.
Fix this by extending maximum_alignment() to consider the maximum folio
size supported by the page cache (via mapping_max_folio_size()). For
each PT_LOAD segment, the alignment is bumped to the largest
power-of-2 that fits within the segment size, capped by the max folio
size the filesystem will allocate, if:
- Both p_vaddr and p_offset are aligned to that size
- The segment is large enough (p_filesz >= size)
This ensures load_bias is folio-aligned so that file-offset-aligned
folios map to properly aligned virtual addresses, enabling hardware PTE
coalescing (e.g. arm64 contpte) and PMD mappings for large folios.
The segment size check avoids reducing ASLR entropy for small binaries
that cannot benefit from large folio alignment.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
fs/binfmt_elf.c | 38 ++++++++++++++++++++++++++++++++++++--
1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 8e89cc5b28200..042af81766fcd 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -49,6 +49,7 @@
#include <uapi/linux/rseq.h>
#include <asm/param.h>
#include <asm/page.h>
+#include <linux/pagemap.h>
#ifndef ELF_COMPAT
#define ELF_COMPAT 0
@@ -488,19 +489,51 @@ static int elf_read(struct file *file, void *buf, size_t len, loff_t pos)
return 0;
}
-static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
+static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr,
+ struct file *filp)
{
unsigned long alignment = 0;
+ unsigned long max_folio_size = PAGE_SIZE;
int i;
+ if (filp && filp->f_mapping)
+ max_folio_size = mapping_max_folio_size(filp->f_mapping);
+
for (i = 0; i < nr; i++) {
if (cmds[i].p_type == PT_LOAD) {
unsigned long p_align = cmds[i].p_align;
+ unsigned long size;
/* skip non-power of two alignments as invalid */
if (!is_power_of_2(p_align))
continue;
alignment = max(alignment, p_align);
+
+ /*
+ * Try to align the binary to the largest folio
+ * size that the page cache supports, so the
+ * hardware can coalesce PTEs (e.g. arm64
+ * contpte) or use PMD mappings for large folios.
+ *
+ * Use the largest power-of-2 that fits within
+ * the segment size, capped by what the page
+ * cache will allocate. Only align when the
+ * segment's virtual address and file offset are
+ * already aligned to the folio size, as
+ * misalignment would prevent coalescing anyway.
+ *
+ * The segment size check avoids reducing ASLR
+ * entropy for small binaries that cannot
+ * benefit.
+ */
+ if (!cmds[i].p_filesz)
+ continue;
+ size = rounddown_pow_of_two(cmds[i].p_filesz);
+ size = min(size, max_folio_size);
+ if (size > PAGE_SIZE &&
+ IS_ALIGNED(cmds[i].p_vaddr, size) &&
+ IS_ALIGNED(cmds[i].p_offset, size))
+ alignment = max(alignment, size);
}
}
@@ -1104,7 +1137,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
}
/* Calculate any requested alignment. */
- alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
+ alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum,
+ bprm->file);
/**
* DOC: PIE handling
--
2.52.0
^ permalink raw reply related [flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 13:58 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Usama Arif
@ 2026-03-20 14:55 ` Kiryl Shutsemau
2026-03-20 15:58 ` Matthew Wilcox
2026-03-20 16:05 ` WANG Rui
2 siblings, 0 replies; 16+ messages in thread
From: Kiryl Shutsemau @ 2026-03-20 14:55 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro, wilts,
ziy, hannes, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:53AM -0700, Usama Arif wrote:
> For PIE binaries (ET_DYN), the load address is randomized at PAGE_SIZE
> granularity via arch_mmap_rnd(). On arm64 with 64K base pages, this
> means the binary is 64K-aligned, but contpte mapping requires 2M
> (CONT_PTE_SIZE) alignment.
>
> Without proper virtual address alignment, readahead patches that
> allocate 2M folios with 2M-aligned file offsets and physical addresses
> cannot benefit from contpte mapping, as the contpte fold check in
> contpte_set_ptes() requires the virtual address to be CONT_PTE_SIZE-
> aligned.
>
> Fix this by extending maximum_alignment() to consider the maximum folio
> size supported by the page cache (via mapping_max_folio_size()). For
> each PT_LOAD segment, the alignment is bumped to the largest
> power-of-2 that fits within the segment size, capped by the max folio
> size the filesystem will allocate, if:
>
> - Both p_vaddr and p_offset are aligned to that size
> - The segment is large enough (p_filesz >= size)
>
> This ensures load_bias is folio-aligned so that file-offset-aligned
> folios map to properly aligned virtual addresses, enabling hardware PTE
> coalescing (e.g. arm64 contpte) and PMD mappings for large folios.
>
> The segment size check avoids reducing ASLR entropy for small binaries
> that cannot benefit from large folio alignment.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 13:58 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Usama Arif
2026-03-20 14:55 ` Kiryl Shutsemau
@ 2026-03-20 15:58 ` Matthew Wilcox
2026-03-20 16:05 ` WANG Rui
2 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2026-03-20 15:58 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:53AM -0700, Usama Arif wrote:
> -static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
> +static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr,
> + struct file *filp)
> {
> unsigned long alignment = 0;
> + unsigned long max_folio_size = PAGE_SIZE;
> int i;
>
> + if (filp && filp->f_mapping)
> + max_folio_size = mapping_max_folio_size(filp->f_mapping);
Under what circumstances can bprm->file be NULL?
Also we tend to prefer the name "file" rather than "filp" for new code
(yes, there's a lot of old code out there).
> +
> + /*
> + * Try to align the binary to the largest folio
> + * size that the page cache supports, so the
> + * hardware can coalesce PTEs (e.g. arm64
> + * contpte) or use PMD mappings for large folios.
> + *
> + * Use the largest power-of-2 that fits within
> + * the segment size, capped by what the page
> + * cache will allocate. Only align when the
> + * segment's virtual address and file offset are
> + * already aligned to the folio size, as
> + * misalignment would prevent coalescing anyway.
> + *
> + * The segment size check avoids reducing ASLR
> + * entropy for small binaries that cannot
> + * benefit.
> + */
> + if (!cmds[i].p_filesz)
> + continue;
> + size = rounddown_pow_of_two(cmds[i].p_filesz);
> + size = min(size, max_folio_size);
> + if (size > PAGE_SIZE &&
> + IS_ALIGNED(cmds[i].p_vaddr, size) &&
> + IS_ALIGNED(cmds[i].p_offset, size))
> + alignment = max(alignment, size);
Can this not all be factored out into a different function? Also, I
think it was done a bit better here:
https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc/
+ if (!IS_ALIGNED(cmd->p_vaddr | cmd->p_offset, PMD_SIZE))
+ return false;
* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 13:58 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Usama Arif
2026-03-20 14:55 ` Kiryl Shutsemau
2026-03-20 15:58 ` Matthew Wilcox
@ 2026-03-20 16:05 ` WANG Rui
2026-03-20 17:47 ` Matthew Wilcox
2 siblings, 1 reply; 16+ messages in thread
From: WANG Rui @ 2026-03-20 16:05 UTC (permalink / raw)
To: usama.arif
Cc: Liam.Howlett, ajd, akpm, apopple, baohua, baolin.wang, brauner,
catalin.marinas, david, dev.jain, jack, kees, kevin.brodsky,
lance.yang, linux-arm-kernel, linux-fsdevel, linux-fsdevel,
linux-kernel, linux-mm, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, r, rmclure, rppt, ryan.roberts, surenb, vbabka,
viro, willy
Hi Usama,
On Fri, Mar 20, 2026 at 10:04 PM Usama Arif <usama.arif@linux.dev> wrote:
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 8e89cc5b28200..042af81766fcd 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -49,6 +49,7 @@
> #include <uapi/linux/rseq.h>
> #include <asm/param.h>
> #include <asm/page.h>
> +#include <linux/pagemap.h>
>
> #ifndef ELF_COMPAT
> #define ELF_COMPAT 0
> @@ -488,19 +489,51 @@ static int elf_read(struct file *file, void *buf, size_t len, loff_t pos)
> return 0;
> }
>
> -static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
> +static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr,
> + struct file *filp)
> {
> unsigned long alignment = 0;
> + unsigned long max_folio_size = PAGE_SIZE;
> int i;
>
> + if (filp && filp->f_mapping)
> + max_folio_size = mapping_max_folio_size(filp->f_mapping);
From experiments (with 16K base pages), mapping_max_folio_size() appears to
depend on the filesystem. It returns 8M on ext4, while on btrfs it always
falls back to PAGE_SIZE (it seems CONFIG_BTRFS_EXPERIMENTAL=y may change this).
This looks overly conservative and ends up missing practical optimization
opportunities.
> +
> for (i = 0; i < nr; i++) {
> if (cmds[i].p_type == PT_LOAD) {
> unsigned long p_align = cmds[i].p_align;
> + unsigned long size;
>
> /* skip non-power of two alignments as invalid */
> if (!is_power_of_2(p_align))
> continue;
> alignment = max(alignment, p_align);
> +
> + /*
> + * Try to align the binary to the largest folio
> + * size that the page cache supports, so the
> + * hardware can coalesce PTEs (e.g. arm64
> + * contpte) or use PMD mappings for large folios.
> + *
> + * Use the largest power-of-2 that fits within
> + * the segment size, capped by what the page
> + * cache will allocate. Only align when the
> + * segment's virtual address and file offset are
> + * already aligned to the folio size, as
> + * misalignment would prevent coalescing anyway.
> + *
> + * The segment size check avoids reducing ASLR
> + * entropy for small binaries that cannot
> + * benefit.
> + */
> + if (!cmds[i].p_filesz)
> + continue;
> + size = rounddown_pow_of_two(cmds[i].p_filesz);
> + size = min(size, max_folio_size);
> + if (size > PAGE_SIZE &&
> + IS_ALIGNED(cmds[i].p_vaddr, size) &&
> + IS_ALIGNED(cmds[i].p_offset, size))
> + alignment = max(alignment, size);
In my patch [1], by aligning eligible segments to PMD_SIZE, THP can quickly
collapse them into large mappings with minimal warmup. That doesn’t happen
with the current behavior. I think allowing a reasonably sized PMD (say <= 32M)
is worth considering. All we really need here is to ensure virtual address
alignment. The rest can be left to THP in "always" mode, which can decide
whether to collapse based on memory pressure and other factors.
[1] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
> }
> }
>
> @@ -1104,7 +1137,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
> }
>
> /* Calculate any requested alignment. */
> - alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
> + alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum,
> + bprm->file);
>
> /**
> * DOC: PIE handling
> --
> 2.52.0
>
Thanks,
Rui
* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 16:05 ` WANG Rui
@ 2026-03-20 17:47 ` Matthew Wilcox
0 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2026-03-20 17:47 UTC (permalink / raw)
To: WANG Rui
Cc: usama.arif, Liam.Howlett, ajd, akpm, apopple, baohua, baolin.wang,
brauner, catalin.marinas, david, dev.jain, jack, kees,
kevin.brodsky, lance.yang, linux-arm-kernel, linux-fsdevel,
linux-fsdevel, linux-kernel, linux-mm, lorenzo.stoakes, mhocko,
npache, pasha.tatashin, rmclure, rppt, ryan.roberts, surenb,
vbabka, viro
On Sat, Mar 21, 2026 at 12:05:18AM +0800, WANG Rui wrote:
> From experiments (with 16K base pages), mapping_max_folio_size() appears to
> depend on the filesystem. It returns 8M on ext4, while on btrfs it always
> falls back to PAGE_SIZE (it seems CONFIG_BTRFS_EXPERIMENTAL=y may change this).
> This looks overly conservative and ends up missing practical optimization
> opportunities.
btrfs only supports large folios with CONFIG_BTRFS_EXPERIMENTAL.
I mean, it's only been five years since it was added to XFS, can't rush
these things.
* [PATCH v2 4/4] mm: align file-backed mmap to max folio order in thp_get_unmapped_area
2026-03-20 13:58 [PATCH v2 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
` (2 preceding siblings ...)
2026-03-20 13:58 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Usama Arif
@ 2026-03-20 13:58 ` Usama Arif
2026-03-20 15:06 ` Kiryl Shutsemau
3 siblings, 1 reply; 16+ messages in thread
From: Usama Arif @ 2026-03-20 13:58 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
lorenzo.stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, linux-fsdevel
thp_get_unmapped_area() is the get_unmapped_area callback for
filesystems like ext4, xfs, and btrfs. It attempts to align the virtual
address for PMD_SIZE THP mappings, but on arm64 with 64K base pages
PMD_SIZE is 512M, which is too large for typical shared library mappings,
so the alignment always fails and falls back to PAGE_SIZE.
This means shared libraries loaded by ld.so via mmap() get 64K-aligned
virtual addresses, preventing contpte mapping even when 2M folios are
allocated with properly aligned file offsets and physical addresses.
Add a fallback in thp_get_unmapped_area_vmflags() that uses the
filesystem's mapping_max_folio_size() to determine alignment, capped to
the mapping length via rounddown_pow_of_two(len). This aligns mappings
to the largest folio the page cache will actually allocate, without
over-aligning small mappings.
The fallback is naturally a no-op for filesystems that don't support
large folios and skips the retry when the alignment would equal PMD_SIZE
(already attempted above).
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/huge_memory.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e2746ea74adf..4005084c9c65b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1242,6 +1242,20 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
if (ret)
return ret;
+ if (filp && filp->f_mapping) {
+ unsigned long max_folio_size =
+ mapping_max_folio_size(filp->f_mapping);
+ unsigned long size = rounddown_pow_of_two(len);
+
+ size = min(size, max_folio_size);
+ if (size > PAGE_SIZE && size != PMD_SIZE) {
+ ret = __thp_get_unmapped_area(filp, addr, len, off,
+ flags, size, vm_flags);
+ if (ret)
+ return ret;
+ }
+ }
+
return mm_get_unmapped_area_vmflags(filp, addr, len, pgoff, flags,
vm_flags);
}
--
2.52.0
* Re: [PATCH v2 4/4] mm: align file-backed mmap to max folio order in thp_get_unmapped_area
2026-03-20 13:58 ` [PATCH v2 4/4] mm: align file-backed mmap to max folio order in thp_get_unmapped_area Usama Arif
@ 2026-03-20 15:06 ` Kiryl Shutsemau
0 siblings, 0 replies; 16+ messages in thread
From: Kiryl Shutsemau @ 2026-03-20 15:06 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro, ziy,
hannes, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:54AM -0700, Usama Arif wrote:
> thp_get_unmapped_area() is the get_unmapped_area callback for
> filesystems like ext4, xfs, and btrfs. It attempts to align the virtual
> address for PMD_SIZE THP mappings, but on arm64 with 64K base pages
> PMD_SIZE is 512M, which is too large for typical shared library mappings,
> so the alignment always fails and falls back to PAGE_SIZE.
>
> This means shared libraries loaded by ld.so via mmap() get 64K-aligned
> virtual addresses, preventing contpte mapping even when 2M folios are
> allocated with properly aligned file offsets and physical addresses.
>
> Add a fallback in thp_get_unmapped_area_vmflags() that uses the
> filesystem's mapping_max_folio_size() to determine alignment, capped to
> the mapping length via rounddown_pow_of_two(len). This aligns mappings
> to the largest folio the page cache will actually allocate, without
> over-aligning small mappings.
>
> The fallback is naturally a no-op for filesystems that don't support
> large folios and skips the retry when the alignment would equal PMD_SIZE
> (already attempted above).
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
> mm/huge_memory.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8e2746ea74adf..4005084c9c65b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1242,6 +1242,20 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> if (ret)
> return ret;
>
> + if (filp && filp->f_mapping) {
> + unsigned long max_folio_size =
> + mapping_max_folio_size(filp->f_mapping);
> + unsigned long size = rounddown_pow_of_two(len);
> +
> + size = min(size, max_folio_size);
> + if (size > PAGE_SIZE && size != PMD_SIZE) {
> + ret = __thp_get_unmapped_area(filp, addr, len, off,
> + flags, size, vm_flags);
Have you considered integrating this inside __thp_get_unmapped_area?
Like, start with PMD_SIZE alignment, then lower to
mapping_max_folio_size if needed and then lower further based on mapping
size.
--
Kiryl Shutsemau / Kirill A. Shutemov