* [PATCH v3 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
2026-04-02 18:08 [PATCH v3 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
@ 2026-04-02 18:08 ` Usama Arif
2026-04-02 18:08 ` [PATCH v3 2/4] mm: use tiered folio allocation " Usama Arif
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
Lorenzo Stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, linux-fsdevel,
ziy, hannes, kas, shakeel.butt, leitao, kernel-team, Usama Arif
The mmap_miss counter in do_sync_mmap_readahead() tracks whether
readahead is useful for mmap'd file access. It is incremented by 1 on
every page cache miss in do_sync_mmap_readahead(), and decremented in
two places:
- filemap_map_pages(): decremented by N for each of N pages
successfully mapped via fault-around (pages found already in cache,
evidence readahead was useful). Only pages not in the workingset
count as hits.
- do_async_mmap_readahead(): decremented by 1 when a page with
PG_readahead is found in cache.
When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
disabled, including the targeted VM_EXEC readahead [1] that requests
large folio orders for contpte mapping.
On arm64 with 64K base pages, both decrement paths are inactive:
1. filemap_map_pages() is never called because fault_around_pages
(65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
requires fault_around_pages > 1. With only 1 page in the
fault-around window, there is nothing "around" to map.
2. do_async_mmap_readahead() never fires for exec mappings because
exec readahead sets async_size = 0, so no PG_readahead markers
are placed.
With no decrements, mmap_miss monotonically increases past
MMAP_LOTSAMISS after 100 page faults, disabling all subsequent
exec readahead.
Fix this by excluding VM_EXEC VMAs from the mmap_miss logic, similar
to how VM_SEQ_READ is already excluded. The exec readahead path is
targeted (one folio at the fault location, async_size=0), not
speculative prefetch, so the mmap_miss heuristic designed to throttle
wasteful speculative readahead should not apply to it.
[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
---
mm/filemap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 2b933a1da9bd..a4ea869b2ca1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3337,7 +3337,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
}
}
- if (!(vm_flags & VM_SEQ_READ)) {
+ if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
/* Avoid banging the cache line if not needed */
mmap_miss = READ_ONCE(ra->mmap_miss);
if (mmap_miss < MMAP_LOTSAMISS * 10)
--
2.52.0
* [PATCH v3 2/4] mm: use tiered folio allocation for VM_EXEC readahead
2026-04-02 18:08 [PATCH v3 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
2026-04-02 18:08 ` [PATCH v3 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
@ 2026-04-02 18:08 ` Usama Arif
2026-04-02 18:08 ` [PATCH v3 3/4] elf: align ET_DYN base for PTE coalescing and PMD mapping Usama Arif
2026-04-02 18:08 ` [PATCH v3 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area Usama Arif
3 siblings, 0 replies; 5+ messages in thread
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
Lorenzo Stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, linux-fsdevel,
ziy, hannes, kas, shakeel.butt, leitao, kernel-team, Usama Arif
When executable pages are faulted via do_sync_mmap_readahead(), request
a folio order that enables the best hardware TLB coalescing available:
- If the VMA is large enough to contain a full PMD, request
HPAGE_PMD_ORDER so the folio can be PMD-mapped. This benefits
architectures where PMD_SIZE is reasonable (e.g. 2M on x86-64
and arm64 with 4K pages). VM_EXEC VMAs are very unlikely to be
large enough for this to take effect with the 512M PMD_SIZE of
arm64 with 64K pages.
- Otherwise, fall back to exec_folio_order(), which returns the
minimum order for hardware PTE coalescing for arm64:
- arm64 4K: order 4 (64K) for contpte (16 PTEs → 1 iTLB entry)
- arm64 16K: order 2 (64K) for HPA (4 pages → 1 TLB entry)
- arm64 64K: order 5 (2M) for contpte (32 PTEs → 1 iTLB entry)
- generic: order 0 (no coalescing)
Update the arm64 exec_folio_order() to return ilog2(SZ_2M >>
PAGE_SHIFT) on 64K page configurations, where the previous SZ_64K
value collapsed to order 0 (a single page) and provided no coalescing
benefit.
Use ~__GFP_RECLAIM so the allocation is opportunistic: if a large
folio is readily available, use it, otherwise fall back to smaller
folios without stalling on reclaim or compaction. The existing fallback
in page_cache_ra_order() handles this naturally.
The readahead window is already clamped to the VMA boundaries, so
ra->size naturally caps the folio order via ilog2(ra->size) in
page_cache_ra_order().
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
arch/arm64/include/asm/pgtable.h | 16 +++++++++----
mm/filemap.c | 40 +++++++++++++++++++++++---------
mm/internal.h | 3 ++-
mm/readahead.c | 7 +++---
4 files changed, 45 insertions(+), 21 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 52bafe79c10a..9ce9f73a6f35 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1591,12 +1591,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
#define arch_wants_old_prefaulted_pte cpu_has_hw_af
/*
- * Request exec memory is read into pagecache in at least 64K folios. This size
- * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
- * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
- * pages are in use.
+ * Request exec memory is read into pagecache in folios large enough for
+ * hardware TLB coalescing. On 4K and 16K page configs this is 64K, which
+ * enables contpte mapping (16 × 4K) and HPA coalescing (4 × 16K). On
+ * 64K page configs, contpte requires 2M (32 × 64K).
*/
-#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
+#define exec_folio_order exec_folio_order
+static inline unsigned int exec_folio_order(void)
+{
+ if (PAGE_SIZE == SZ_64K)
+ return ilog2(SZ_2M >> PAGE_SHIFT);
+ return ilog2(SZ_64K >> PAGE_SHIFT);
+}
static inline bool pud_sect_supported(void)
{
diff --git a/mm/filemap.c b/mm/filemap.c
index a4ea869b2ca1..7ffea986b3b4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3311,6 +3311,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
struct file *fpin = NULL;
vm_flags_t vm_flags = vmf->vma->vm_flags;
+ gfp_t gfp = readahead_gfp_mask(mapping);
bool force_thp_readahead = false;
unsigned short mmap_miss;
@@ -3363,28 +3364,45 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
ra->size *= 2;
ra->async_size = HPAGE_PMD_NR;
ra->order = HPAGE_PMD_ORDER;
- page_cache_ra_order(&ractl, ra);
+ page_cache_ra_order(&ractl, ra, gfp);
return fpin;
}
if (vm_flags & VM_EXEC) {
/*
- * Allow arch to request a preferred minimum folio order for
- * executable memory. This can often be beneficial to
- * performance if (e.g.) arm64 can contpte-map the folio.
- * Executable memory rarely benefits from readahead, due to its
- * random access nature, so set async_size to 0.
+ * Request large folios for executable memory to enable
+ * hardware PTE coalescing and PMD mappings:
*
- * Limit to the boundaries of the VMA to avoid reading in any
- * pad that might exist between sections, which would be a waste
- * of memory.
+ * - If the VMA is large enough for a PMD, request
+ * HPAGE_PMD_ORDER so the folio can be PMD-mapped.
+ * - Otherwise, use exec_folio_order() which returns
+ * the minimum order for hardware TLB coalescing
+ * (e.g. arm64 contpte/HPA).
+ *
+ * Use ~__GFP_RECLAIM so large folio allocation is
+ * opportunistic — if memory isn't readily available,
+ * fall back to smaller folios rather than stalling on
+ * reclaim or compaction.
+ *
+ * Executable memory rarely benefits from speculative
+ * readahead due to its random access nature, so set
+ * async_size to 0.
+ *
+ * Limit to the boundaries of the VMA to avoid reading
+ * in any pad that might exist between sections, which
+ * would be a waste of memory.
*/
+ gfp &= ~__GFP_RECLAIM;
struct vm_area_struct *vma = vmf->vma;
unsigned long start = vma->vm_pgoff;
unsigned long end = start + vma_pages(vma);
unsigned long ra_end;
- ra->order = exec_folio_order();
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ vma_pages(vma) >= HPAGE_PMD_NR)
+ ra->order = HPAGE_PMD_ORDER;
+ else
+ ra->order = exec_folio_order();
ra->start = round_down(vmf->pgoff, 1UL << ra->order);
ra->start = max(ra->start, start);
ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
@@ -3403,7 +3421,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
ractl._index = ra->start;
- page_cache_ra_order(&ractl, ra);
+ page_cache_ra_order(&ractl, ra, gfp);
return fpin;
}
diff --git a/mm/internal.h b/mm/internal.h
index 475bd281a10d..e624cb619057 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -545,7 +545,8 @@ int zap_vma_for_reaping(struct vm_area_struct *vma);
int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
gfp_t gfp);
-void page_cache_ra_order(struct readahead_control *, struct file_ra_state *);
+void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
+ gfp_t gfp);
void force_page_cache_ra(struct readahead_control *, unsigned long nr);
static inline void force_page_cache_readahead(struct address_space *mapping,
struct file *file, pgoff_t index, unsigned long nr_to_read)
diff --git a/mm/readahead.c b/mm/readahead.c
index 7b05082c89ea..b3dc08cf180c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -465,7 +465,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
}
void page_cache_ra_order(struct readahead_control *ractl,
- struct file_ra_state *ra)
+ struct file_ra_state *ra, gfp_t gfp)
{
struct address_space *mapping = ractl->mapping;
pgoff_t start = readahead_index(ractl);
@@ -475,7 +475,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
pgoff_t mark = index + ra->size - ra->async_size;
unsigned int nofs;
int err = 0;
- gfp_t gfp = readahead_gfp_mask(mapping);
unsigned int new_order = ra->order;
trace_page_cache_ra_order(mapping->host, start, ra);
@@ -626,7 +625,7 @@ void page_cache_sync_ra(struct readahead_control *ractl,
readit:
ra->order = 0;
ractl->_index = ra->start;
- page_cache_ra_order(ractl, ra);
+ page_cache_ra_order(ractl, ra, readahead_gfp_mask(ractl->mapping));
}
EXPORT_SYMBOL_GPL(page_cache_sync_ra);
@@ -697,7 +696,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size -= end - aligned_end;
ra->async_size = ra->size;
ractl->_index = ra->start;
- page_cache_ra_order(ractl, ra);
+ page_cache_ra_order(ractl, ra, readahead_gfp_mask(ractl->mapping));
}
EXPORT_SYMBOL_GPL(page_cache_async_ra);
--
2.52.0
* [PATCH v3 3/4] elf: align ET_DYN base for PTE coalescing and PMD mapping
2026-04-02 18:08 [PATCH v3 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
2026-04-02 18:08 ` [PATCH v3 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
2026-04-02 18:08 ` [PATCH v3 2/4] mm: use tiered folio allocation " Usama Arif
@ 2026-04-02 18:08 ` Usama Arif
2026-04-02 18:08 ` [PATCH v3 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area Usama Arif
3 siblings, 0 replies; 5+ messages in thread
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
Lorenzo Stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, linux-fsdevel,
ziy, hannes, kas, shakeel.butt, leitao, kernel-team, Usama Arif
For PIE binaries (ET_DYN), the load address is randomized at PAGE_SIZE
granularity via arch_mmap_rnd(). On arm64 with 64K base pages, this
means the binary is 64K-aligned, but contpte mapping requires 2M
(CONT_PTE_SIZE) alignment.
Without proper virtual address alignment, readahead patches that
allocate large folios with aligned file offsets and physical addresses
cannot benefit from contpte mapping, as the contpte fold check in
contpte_set_ptes() requires the virtual address to be CONT_PTE_SIZE-
aligned.
Fix this by extending maximum_alignment() to consider folio alignment
at two tiers, matching the readahead allocation strategy:
- HPAGE_PMD_SIZE, so large folios can be PMD-mapped on
architectures where PMD_SIZE is reasonable (e.g. 2M on x86-64
and arm64 with 4K pages).
- exec_folio_order(), the minimum order for hardware TLB
coalescing (e.g. arm64 contpte/HPA).
For each PT_LOAD segment, folio_alignment() tries both tiers and
returns the largest power-of-2 alignment that fits within the segment
size, with both p_vaddr and p_offset aligned to that size. This
ensures load_bias is folio-aligned so that file-offset-aligned folios
map to properly aligned virtual addresses, enabling hardware PTE
coalescing and PMD mappings for large folios.
The segment size check in folio_alignment() avoids reducing ASLR
entropy for small binaries that cannot benefit from large folio
alignment.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
fs/binfmt_elf.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 16a56b6b3f6c..f84fae6daf23 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -488,6 +488,54 @@ static int elf_read(struct file *file, void *buf, size_t len, loff_t pos)
return 0;
}
+/*
+ * Return the largest folio alignment for a PT_LOAD segment, so the
+ * hardware can coalesce PTEs (e.g. arm64 contpte) or use PMD mappings
+ * for large folios.
+ *
+ * Try PMD alignment so large folios can be PMD-mapped. Then try
+ * exec_folio_order() alignment for hardware TLB coalescing (e.g.
+ * arm64 contpte/HPA).
+ *
+ * Use the largest power-of-2 that fits within the segment size, capped
+ * by the target folio size.
+ * Only align when the segment's virtual address and file offset are
+ * already aligned to that size, as misalignment would prevent coalescing
+ * anyway.
+ *
+ * The segment size check avoids reducing ASLR entropy for small binaries
+ * that cannot benefit.
+ */
+static unsigned long folio_alignment(struct elf_phdr *cmd)
+{
+ unsigned long alignment = 0;
+ unsigned long seg_size;
+
+ if (!cmd->p_filesz)
+ return 0;
+
+ seg_size = rounddown_pow_of_two(cmd->p_filesz);
+
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+ unsigned long size = min(seg_size, HPAGE_PMD_SIZE);
+
+ if (size > PAGE_SIZE &&
+ IS_ALIGNED(cmd->p_vaddr | cmd->p_offset, size))
+ alignment = size;
+ }
+
+ if (!alignment && exec_folio_order()) {
+ unsigned long size = min(seg_size,
+ PAGE_SIZE << exec_folio_order());
+
+ if (size > PAGE_SIZE &&
+ IS_ALIGNED(cmd->p_vaddr | cmd->p_offset, size))
+ alignment = size;
+ }
+
+ return alignment;
+}
+
static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
{
unsigned long alignment = 0;
@@ -501,6 +549,8 @@ static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
if (!is_power_of_2(p_align))
continue;
alignment = max(alignment, p_align);
+ alignment = max(alignment,
+ folio_alignment(&cmds[i]));
}
}
--
2.52.0
* [PATCH v3 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area
2026-04-02 18:08 [PATCH v3 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
` (2 preceding siblings ...)
2026-04-02 18:08 ` [PATCH v3 3/4] elf: align ET_DYN base for PTE coalescing and PMD mapping Usama Arif
@ 2026-04-02 18:08 ` Usama Arif
3 siblings, 0 replies; 5+ messages in thread
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
Lorenzo Stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, linux-fsdevel,
ziy, hannes, kas, shakeel.butt, leitao, kernel-team, Usama Arif
thp_get_unmapped_area() is the get_unmapped_area callback for
filesystems like ext4, xfs, and btrfs. It attempts to align the virtual
address for PMD_SIZE THP mappings, but on arm64 with 64K base pages
PMD_SIZE is 512M, which is too large for typical shared library mappings,
so the alignment always fails and falls back to PAGE_SIZE.
This means shared libraries loaded by ld.so via mmap() get 64K-aligned
virtual addresses, preventing contpte mapping even when large folios are
allocated with properly aligned file offsets and physical addresses.
Add a fallback in thp_get_unmapped_area_vmflags() that uses
exec_folio_order() to determine alignment, matching the readahead
allocation strategy. This aligns mappings to the hardware TLB
coalescing size (e.g. 2M for contpte on arm64 64K pages, 64K for
contpte/HPA on arm64 4K/16K pages), capped to the mapping length via
rounddown_pow_of_two(len).
The fallback is naturally a no-op on architectures where
exec_folio_order() returns 0, and skips the retry when the alignment
would equal PMD_SIZE (already attempted above), in case another
architecture changes exec_folio_order() in the future.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/huge_memory.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b2a6060b3c20..ad97ac8406dc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1218,6 +1218,19 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
if (ret)
return ret;
+ if (filp && exec_folio_order()) {
+ unsigned long exec_folio_size = PAGE_SIZE << exec_folio_order();
+ unsigned long size = rounddown_pow_of_two(len);
+
+ size = min(size, exec_folio_size);
+ if (size > PAGE_SIZE && size != PMD_SIZE) {
+ ret = __thp_get_unmapped_area(filp, addr, len, off,
+ flags, size, vm_flags);
+ if (ret)
+ return ret;
+ }
+ }
+
return mm_get_unmapped_area_vmflags(filp, addr, len, pgoff, flags,
vm_flags);
}
--
2.52.0