* [PATCH v3 0/4] mm: improve large folio readahead and alignment for exec memory
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
Lorenzo Stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro,
ziy, hannes, kas, shakeel.butt, leitao, kernel-team, Usama Arif
v2 -> v3: https://lore.kernel.org/all/20260320140315.979307-1-usama.arif@linux.dev/
- Take into account READ_ONLY_THP_FOR_FS for elf alignment by aligning
to HPAGE_PMD_SIZE limited to 2M (Rui)
- Reviewed-by tags for patch 1 from Kiryl and Jan
- Remove preferred_exec_order() (Jan)
- Change ra->order to HPAGE_PMD_ORDER if vma_pages(vma) >= HPAGE_PMD_NR
otherwise use exec_folio_order() with gfp &= ~__GFP_RECLAIM for
do_sync_mmap_readahead().
- Change exec_folio_order() to return 2M (cont-pte size) for 64K base
page size for arm64.
- Remove bprm->file NULL check (Matthew)
- Change filp to file (Matthew)
- Improve checking of p_vaddr and p_offset (Rui and Matthew)
v1 -> v2: https://lore.kernel.org/all/20260310145406.3073394-1-usama.arif@linux.dev/
- Disable mmap_miss logic for VM_EXEC (Jan Kara)
- Align in elf only when segment VA and file offset are already aligned (Rui)
- preferred_exec_order() for VM_EXEC sync mmap_readahead which takes into
account zone high watermarks (as an approximation of memory pressure)
(David, or at least my approach to what David suggested in [1] :))
- Extend max alignment to mapping_max_folio_size() instead of
exec_folio_order()
Motivation
==========
exec_folio_order() was introduced [2] to request readahead at an
arch-preferred folio order for executable memory, enabling hardware PTE
coalescing (e.g. arm64 contpte) and PMD mappings on the fault path.
However, several things prevent this from working optimally:
1. The mmap_miss heuristic in do_sync_mmap_readahead() silently disables
exec readahead after 100 page faults. The mmap_miss counter tracks
whether readahead is useful for mmap'd file access:
- Incremented by 1 in do_sync_mmap_readahead() on every page cache
miss (page needed IO).
- Decremented by N in filemap_map_pages() for N pages successfully
mapped via fault-around (pages found in cache without faulting,
evidence that readahead was useful). Only non-workingset pages
count as hits; recently evicted and re-read pages do not.
- Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
marker page is found (indicates sequential consumption of readahead
pages).
When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
disabled. On arm64 with 64K pages, both decrement paths are inactive:
- filemap_map_pages() is never called because fault_around_pages
(65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
requires fault_around_pages > 1. With only 1 page in the
fault-around window, there is nothing "around" to map.
- do_async_mmap_readahead() never fires for exec mappings because
exec readahead sets async_size = 0, so no PG_readahead markers
are placed.
With no decrements, mmap_miss monotonically increases past
MMAP_LOTSAMISS after 100 faults, disabling exec readahead
for the remainder of the mapping. Patch 1 fixes this by excluding
VM_EXEC VMAs from the mmap_miss logic, similar to how VM_SEQ_READ
is already excluded.
2. exec_folio_order() is an arch-specific hook that returns a static
order (ilog2(SZ_64K >> PAGE_SHIFT)), which is suboptimal for non-4K
page sizes. Patch 2 updates the arm64 exec_folio_order() to return
2M on 64K page configurations (for contpte coalescing, where the
previous SZ_64K value collapsed to order 0) and uses a tiered
allocation strategy in do_sync_mmap_readahead(): if the VMA is large
enough for a full PMD, request HPAGE_PMD_ORDER so the folio can be
PMD-mapped; otherwise fall back to exec_folio_order() for hardware
PTE coalescing. The allocation uses ~__GFP_RECLAIM so it is
opportunistic, falling back to smaller folios without stalling on
reclaim or compaction.
3. Even with correct folio order and readahead, hardware PTE coalescing
(e.g. contpte) and PMD mapping require the virtual address to be
aligned to the folio size. The readahead path aligns file offsets and
the buddy allocator aligns physical memory, but the virtual address
depends on the VMA start. For PIE binaries, ASLR randomizes the load
address at PAGE_SIZE granularity, so on arm64 with 64K pages only
1/32 of load addresses are 2M-aligned. When misaligned, contpte
cannot be used for any folio in the VMA.
Patch 3 fixes this for the main binary by extending maximum_alignment()
in the ELF loader with a folio_alignment() helper that tries two
tiers matching the readahead strategy: first HPAGE_PMD_SIZE for PMD
mapping, then exec_folio_order() as a fallback for hardware TLB
coalescing. The alignment is capped to the segment size to avoid
reducing ASLR entropy for small binaries.
Patch 4 fixes this for shared libraries by adding an
exec_folio_order() alignment fallback in
thp_get_unmapped_area_vmflags(). The existing PMD_SIZE alignment
(512M on arm64 64K pages) is too large for typical shared libraries,
so this smaller fallback succeeds where PMD fails.
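The tiered order selection described in (2) can be modelled in userspace.
The constants below mirror an arm64 config with 64K base pages
(HPAGE_PMD_ORDER 13 for a 512M PMD, exec folio order 5 for 2M contpte),
and pick_exec_order() is a hypothetical stand-in for the logic added to
do_sync_mmap_readahead(), not a kernel function:

```c
/* arm64 / 64K base pages: PMD covers 512M, contpte covers 2M */
#define HPAGE_PMD_ORDER		13
#define HPAGE_PMD_NR		(1UL << HPAGE_PMD_ORDER)
#define EXEC_FOLIO_ORDER	5

/*
 * Hypothetical stand-in for the readahead order selection: prefer a
 * PMD-mappable folio when the VMA can hold a full PMD, otherwise fall
 * back to the hardware coalescing size.
 */
static unsigned int pick_exec_order(unsigned long vma_pages)
{
	if (vma_pages >= HPAGE_PMD_NR)
		return HPAGE_PMD_ORDER;
	return EXEC_FOLIO_ORDER;
}
```

A 512M text segment takes the PMD tier; a typical 16M shared library
falls back to the 2M contpte order.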
I created a benchmark that mmaps a large executable file and calls
RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
fault + readahead cost. "Random" first faults in all pages with a
sequential sweep (not measured), then measures time for calling random
offsets, isolating iTLB miss cost for scattered execution.
The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
512MB executable file on ext4, averaged over 3 runs:
Phase | Baseline | Patched | Improvement
-----------|--------------|--------------|------------------
Cold fault | 83.4 ms | 41.3 ms | 50% faster
Random | 76.0 ms | 58.3 ms | 23% faster
[1] https://lore.kernel.org/all/d72d5ca3-4b92-470e-9f89-9f39a3975f1e@kernel.org/
[2] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
Usama Arif (4):
mm: bypass mmap_miss heuristic for VM_EXEC readahead
mm: use tiered folio allocation for VM_EXEC readahead
elf: align ET_DYN base for PTE coalescing and PMD mapping
mm: align file-backed mmap to exec folio order in
thp_get_unmapped_area
arch/arm64/include/asm/pgtable.h | 16 ++++++----
fs/binfmt_elf.c | 50 ++++++++++++++++++++++++++++++++
mm/filemap.c | 42 +++++++++++++++++++--------
mm/huge_memory.c | 13 +++++++++
mm/internal.h | 3 +-
mm/readahead.c | 7 ++---
6 files changed, 109 insertions(+), 22 deletions(-)
--
2.52.0
* [PATCH v3 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
The mmap_miss counter in do_sync_mmap_readahead() tracks whether
readahead is useful for mmap'd file access. It is incremented by 1 on
every page cache miss in do_sync_mmap_readahead(), and decremented in
two places:
- filemap_map_pages(): decremented by N for each of N pages
successfully mapped via fault-around (pages found already in cache,
evidence readahead was useful). Only pages not in the workingset
count as hits.
- do_async_mmap_readahead(): decremented by 1 when a page with
PG_readahead is found in cache.
When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
disabled, including the targeted VM_EXEC readahead [1] that requests
large folio orders for contpte mapping.
On arm64 with 64K base pages, both decrement paths are inactive:
1. filemap_map_pages() is never called because fault_around_pages
(65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
requires fault_around_pages > 1. With only 1 page in the
fault-around window, there is nothing "around" to map.
2. do_async_mmap_readahead() never fires for exec mappings because
exec readahead sets async_size = 0, so no PG_readahead markers
are placed.
With no decrements, mmap_miss monotonically increases past
MMAP_LOTSAMISS after 100 page faults, disabling all subsequent
exec readahead.
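As a userspace illustration of the failure mode (MMAP_LOTSAMISS mirrors
the kernel constant; the helpers are a model of the bookkeeping described
above, not kernel code):

```c
#include <stdbool.h>

#define MMAP_LOTSAMISS	100

/*
 * Each sync fault that misses the page cache bumps the counter,
 * saturating at 10 * MMAP_LOTSAMISS as the kernel does.
 */
static unsigned short account_miss(unsigned short mmap_miss)
{
	if (mmap_miss < MMAP_LOTSAMISS * 10)
		mmap_miss++;
	return mmap_miss;
}

/* Readahead is skipped once the counter passes the threshold. */
static bool readahead_disabled(unsigned short mmap_miss)
{
	return mmap_miss > MMAP_LOTSAMISS;
}

/*
 * With both decrement paths inactive (arm64 with 64K pages), n
 * consecutive misses simply accumulate.
 */
static unsigned short after_faults(int n)
{
	unsigned short mmap_miss = 0;

	while (n--)
		mmap_miss = account_miss(mmap_miss);
	return mmap_miss;
}
```

After 101 faults the counter crosses the threshold and never comes back,
which is exactly the behaviour this patch exempts VM_EXEC from.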
Fix this by excluding VM_EXEC VMAs from the mmap_miss logic, similar
to how VM_SEQ_READ is already excluded. The exec readahead path is
targeted (one folio at the fault location, async_size=0), not
speculative prefetch, so the mmap_miss heuristic designed to throttle
wasteful speculative readahead should not apply to it.
[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
---
mm/filemap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 2b933a1da9bd..a4ea869b2ca1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3337,7 +3337,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
}
}
- if (!(vm_flags & VM_SEQ_READ)) {
+ if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
/* Avoid banging the cache line if not needed */
mmap_miss = READ_ONCE(ra->mmap_miss);
if (mmap_miss < MMAP_LOTSAMISS * 10)
--
2.52.0
* [PATCH v3 2/4] mm: use tiered folio allocation for VM_EXEC readahead
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
Lorenzo Stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, "linux-fsdevel,
ziy, hannes, kas, shakeel.butt, leitao, kernel-team, Usama Arif
When executable pages are faulted via do_sync_mmap_readahead(), request
a folio order that enables the best hardware TLB coalescing available:
- If the VMA is large enough to contain a full PMD, request
HPAGE_PMD_ORDER so the folio can be PMD-mapped. This benefits
architectures where PMD_SIZE is reasonable (e.g. 2M on x86-64
and arm64 with 4K pages). VM_EXEC VMAs are very unlikely to be
large enough for the 512M PMD size on arm64 with 64K pages, so
this tier rarely takes effect there.
- Otherwise, fall back to exec_folio_order(), which returns the
minimum order for hardware PTE coalescing for arm64:
- arm64 4K: order 4 (64K) for contpte (16 PTEs → 1 iTLB entry)
- arm64 16K: order 2 (64K) for HPA (4 pages → 1 TLB entry)
- arm64 64K: order 5 (2M) for contpte (32 PTEs → 1 iTLB entry)
- generic: order 0 (no coalescing)
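The orders above follow from order = ilog2(folio_size / page_size); a
quick userspace check, where ilog2_ul() is a local helper standing in
for the kernel's ilog2() and arm64_exec_folio_order() is a hypothetical
model of the arm64 hook, not the kernel function:

```c
/* Integer log2 for power-of-2 inputs. */
static unsigned int ilog2_ul(unsigned long v)
{
	unsigned int r = 0;

	while (v > 1) {
		v >>= 1;
		r++;
	}
	return r;
}

/*
 * Folio size targeted per base page size on arm64: 2M on 64K pages
 * (contpte), 64K otherwise (contpte on 4K, HPA on 16K).
 */
static unsigned int arm64_exec_folio_order(unsigned long page_size)
{
	unsigned long folio_size = (page_size == 0x10000UL) ?
					0x200000UL : 0x10000UL;

	return ilog2_ul(folio_size / page_size);
}
```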
Update the arm64 exec_folio_order() to return ilog2(SZ_2M >>
PAGE_SHIFT) on 64K page configurations, where the previous SZ_64K
value collapsed to order 0 (a single page) and provided no coalescing
benefit.
Use ~__GFP_RECLAIM so the allocation is opportunistic: if a large
folio is readily available, use it, otherwise fall back to smaller
folios without stalling on reclaim or compaction. The existing fallback
in page_cache_ra_order() handles this naturally.
The readahead window is already clamped to the VMA boundaries, so
ra->size naturally caps the folio order via ilog2(ra->size) in
page_cache_ra_order().
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
arch/arm64/include/asm/pgtable.h | 16 +++++++++----
mm/filemap.c | 40 +++++++++++++++++++++++---------
mm/internal.h | 3 ++-
mm/readahead.c | 7 +++---
4 files changed, 45 insertions(+), 21 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 52bafe79c10a..9ce9f73a6f35 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1591,12 +1591,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
#define arch_wants_old_prefaulted_pte cpu_has_hw_af
/*
- * Request exec memory is read into pagecache in at least 64K folios. This size
- * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
- * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
- * pages are in use.
+ * Request exec memory is read into pagecache in folios large enough for
+ * hardware TLB coalescing. On 4K and 16K page configs this is 64K, which
+ * enables contpte mapping (16 × 4K) and HPA coalescing (4 × 16K). On
+ * 64K page configs, contpte requires 2M (32 × 64K).
*/
-#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
+#define exec_folio_order exec_folio_order
+static inline unsigned int exec_folio_order(void)
+{
+ if (PAGE_SIZE == SZ_64K)
+ return ilog2(SZ_2M >> PAGE_SHIFT);
+ return ilog2(SZ_64K >> PAGE_SHIFT);
+}
static inline bool pud_sect_supported(void)
{
diff --git a/mm/filemap.c b/mm/filemap.c
index a4ea869b2ca1..7ffea986b3b4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3311,6 +3311,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
struct file *fpin = NULL;
vm_flags_t vm_flags = vmf->vma->vm_flags;
+ gfp_t gfp = readahead_gfp_mask(mapping);
bool force_thp_readahead = false;
unsigned short mmap_miss;
@@ -3363,28 +3364,45 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
ra->size *= 2;
ra->async_size = HPAGE_PMD_NR;
ra->order = HPAGE_PMD_ORDER;
- page_cache_ra_order(&ractl, ra);
+ page_cache_ra_order(&ractl, ra, gfp);
return fpin;
}
if (vm_flags & VM_EXEC) {
/*
- * Allow arch to request a preferred minimum folio order for
- * executable memory. This can often be beneficial to
- * performance if (e.g.) arm64 can contpte-map the folio.
- * Executable memory rarely benefits from readahead, due to its
- * random access nature, so set async_size to 0.
+ * Request large folios for executable memory to enable
+ * hardware PTE coalescing and PMD mappings:
*
- * Limit to the boundaries of the VMA to avoid reading in any
- * pad that might exist between sections, which would be a waste
- * of memory.
+ * - If the VMA is large enough for a PMD, request
+ * HPAGE_PMD_ORDER so the folio can be PMD-mapped.
+ * - Otherwise, use exec_folio_order() which returns
+ * the minimum order for hardware TLB coalescing
+ * (e.g. arm64 contpte/HPA).
+ *
+ * Use ~__GFP_RECLAIM so large folio allocation is
+ * opportunistic — if memory isn't readily available,
+ * fall back to smaller folios rather than stalling on
+ * reclaim or compaction.
+ *
+ * Executable memory rarely benefits from speculative
+ * readahead due to its random access nature, so set
+ * async_size to 0.
+ *
+ * Limit to the boundaries of the VMA to avoid reading
+ * in any pad that might exist between sections, which
+ * would be a waste of memory.
*/
+ gfp &= ~__GFP_RECLAIM;
struct vm_area_struct *vma = vmf->vma;
unsigned long start = vma->vm_pgoff;
unsigned long end = start + vma_pages(vma);
unsigned long ra_end;
- ra->order = exec_folio_order();
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ vma_pages(vma) >= HPAGE_PMD_NR)
+ ra->order = HPAGE_PMD_ORDER;
+ else
+ ra->order = exec_folio_order();
ra->start = round_down(vmf->pgoff, 1UL << ra->order);
ra->start = max(ra->start, start);
ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
@@ -3403,7 +3421,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
ractl._index = ra->start;
- page_cache_ra_order(&ractl, ra);
+ page_cache_ra_order(&ractl, ra, gfp);
return fpin;
}
diff --git a/mm/internal.h b/mm/internal.h
index 475bd281a10d..e624cb619057 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -545,7 +545,8 @@ int zap_vma_for_reaping(struct vm_area_struct *vma);
int folio_unmap_invalidate(struct address_space *mapping, struct folio *folio,
gfp_t gfp);
-void page_cache_ra_order(struct readahead_control *, struct file_ra_state *);
+void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
+ gfp_t gfp);
void force_page_cache_ra(struct readahead_control *, unsigned long nr);
static inline void force_page_cache_readahead(struct address_space *mapping,
struct file *file, pgoff_t index, unsigned long nr_to_read)
diff --git a/mm/readahead.c b/mm/readahead.c
index 7b05082c89ea..b3dc08cf180c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -465,7 +465,7 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
}
void page_cache_ra_order(struct readahead_control *ractl,
- struct file_ra_state *ra)
+ struct file_ra_state *ra, gfp_t gfp)
{
struct address_space *mapping = ractl->mapping;
pgoff_t start = readahead_index(ractl);
@@ -475,7 +475,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
pgoff_t mark = index + ra->size - ra->async_size;
unsigned int nofs;
int err = 0;
- gfp_t gfp = readahead_gfp_mask(mapping);
unsigned int new_order = ra->order;
trace_page_cache_ra_order(mapping->host, start, ra);
@@ -626,7 +625,7 @@ void page_cache_sync_ra(struct readahead_control *ractl,
readit:
ra->order = 0;
ractl->_index = ra->start;
- page_cache_ra_order(ractl, ra);
+ page_cache_ra_order(ractl, ra, readahead_gfp_mask(ractl->mapping));
}
EXPORT_SYMBOL_GPL(page_cache_sync_ra);
@@ -697,7 +696,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
ra->size -= end - aligned_end;
ra->async_size = ra->size;
ractl->_index = ra->start;
- page_cache_ra_order(ractl, ra);
+ page_cache_ra_order(ractl, ra, readahead_gfp_mask(ractl->mapping));
}
EXPORT_SYMBOL_GPL(page_cache_async_ra);
--
2.52.0
* [PATCH v3 3/4] elf: align ET_DYN base for PTE coalescing and PMD mapping
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
For PIE binaries (ET_DYN), the load address is randomized at PAGE_SIZE
granularity via arch_mmap_rnd(). On arm64 with 64K base pages, this
means the binary is 64K-aligned, but contpte mapping requires 2M
(CONT_PTE_SIZE) alignment.
Without proper virtual address alignment, readahead patches that
allocate large folios with aligned file offsets and physical addresses
cannot benefit from contpte mapping, as the contpte fold check in
contpte_set_ptes() requires the virtual address to be CONT_PTE_SIZE-
aligned.
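The 1/32 figure follows directly from the granularities involved: ASLR
picks the load address at PAGE_SIZE granularity, so one candidate in
CONT_PTE_SIZE / PAGE_SIZE is suitably aligned. A sketch with arm64/64K
constants (the helper names are illustrative, not kernel APIs):

```c
#include <stdbool.h>

#define SZ_64K	0x10000UL
#define SZ_2M	0x200000UL

/* One in (align / granularity) randomized addresses is aligned. */
static unsigned long aligned_one_in(unsigned long granularity,
				    unsigned long align)
{
	return align / granularity;
}

/* Whether a load address permits contpte mapping of a 2M folio. */
static bool contpte_mappable(unsigned long addr)
{
	return (addr & (SZ_2M - 1)) == 0;
}
```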
Fix this by extending maximum_alignment() to consider folio alignment
at two tiers, matching the readahead allocation strategy:
- HPAGE_PMD_SIZE, so large folios can be PMD-mapped on
architectures where PMD_SIZE is reasonable (e.g. 2M on x86-64
and arm64 with 4K pages).
- exec_folio_order(), the minimum order for hardware TLB
coalescing (e.g. arm64 contpte/HPA).
For each PT_LOAD segment, folio_alignment() tries both tiers and
returns the largest power-of-2 alignment that fits within the segment
size, with both p_vaddr and p_offset aligned to that size. This
ensures load_bias is folio-aligned so that file-offset-aligned folios
map to properly aligned virtual addresses, enabling hardware PTE
coalescing and PMD mappings for large folios.
The segment size check in folio_alignment() avoids reducing ASLR
entropy for small binaries that cannot benefit from large folio
alignment.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
fs/binfmt_elf.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 16a56b6b3f6c..f84fae6daf23 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -488,6 +488,54 @@ static int elf_read(struct file *file, void *buf, size_t len, loff_t pos)
return 0;
}
+/*
+ * Return the largest folio alignment for a PT_LOAD segment, so the
+ * hardware can coalesce PTEs (e.g. arm64 contpte) or use PMD mappings
+ * for large folios.
+ *
+ * Try PMD alignment so large folios can be PMD-mapped. Then try
+ * exec_folio_order() alignment for hardware TLB coalescing (e.g.
+ * arm64 contpte/HPA).
+ *
+ * Use the largest power-of-2 that fits within the segment size, capped
+ * by the target folio size.
+ * Only align when the segment's virtual address and file offset are
+ * already aligned to that size, as misalignment would prevent coalescing
+ * anyway.
+ *
+ * The segment size check avoids reducing ASLR entropy for small binaries
+ * that cannot benefit.
+ */
+static unsigned long folio_alignment(struct elf_phdr *cmd)
+{
+ unsigned long alignment = 0;
+ unsigned long seg_size;
+
+ if (!cmd->p_filesz)
+ return 0;
+
+ seg_size = rounddown_pow_of_two(cmd->p_filesz);
+
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+ unsigned long size = min(seg_size, HPAGE_PMD_SIZE);
+
+ if (size > PAGE_SIZE &&
+ IS_ALIGNED(cmd->p_vaddr | cmd->p_offset, size))
+ alignment = size;
+ }
+
+ if (!alignment && exec_folio_order()) {
+ unsigned long size = min(seg_size,
+ PAGE_SIZE << exec_folio_order());
+
+ if (size > PAGE_SIZE &&
+ IS_ALIGNED(cmd->p_vaddr | cmd->p_offset, size))
+ alignment = size;
+ }
+
+ return alignment;
+}
+
static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
{
unsigned long alignment = 0;
@@ -501,6 +549,8 @@ static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
if (!is_power_of_2(p_align))
continue;
alignment = max(alignment, p_align);
+ alignment = max(alignment,
+ folio_alignment(&cmds[i]));
}
}
--
2.52.0
* [PATCH v3 4/4] mm: align file-backed mmap to exec folio order in thp_get_unmapped_area
From: Usama Arif @ 2026-04-02 18:08 UTC (permalink / raw)
thp_get_unmapped_area() is the get_unmapped_area callback for
filesystems like ext4, xfs, and btrfs. It attempts to align the virtual
address for PMD_SIZE THP mappings, but on arm64 with 64K base pages
PMD_SIZE is 512M, which is too large for typical shared library mappings,
so the alignment always fails and falls back to PAGE_SIZE.
This means shared libraries loaded by ld.so via mmap() get 64K-aligned
virtual addresses, preventing contpte mapping even when large folios are
allocated with properly aligned file offsets and physical addresses.
Add a fallback in thp_get_unmapped_area_vmflags() that uses
exec_folio_order() to determine alignment, matching the readahead
allocation strategy. This aligns mappings to the hardware TLB
coalescing size (e.g. 2M for contpte on arm64 64K pages, 64K for
contpte/HPA on arm64 4K/16K pages), capped to the mapping length via
rounddown_pow_of_two(len).
The fallback is naturally a no-op on architectures where
exec_folio_order() returns 0, and skips the retry when the alignment
would equal PMD_SIZE (already attempted above) in case another
architecture changes exec_folio_order() in the future.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/huge_memory.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b2a6060b3c20..ad97ac8406dc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1218,6 +1218,19 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
if (ret)
return ret;
+ if (filp && exec_folio_order()) {
+ unsigned long exec_folio_size = PAGE_SIZE << exec_folio_order();
+ unsigned long size = rounddown_pow_of_two(len);
+
+ size = min(size, exec_folio_size);
+ if (size > PAGE_SIZE && size != PMD_SIZE) {
+ ret = __thp_get_unmapped_area(filp, addr, len, off,
+ flags, size, vm_flags);
+ if (ret)
+ return ret;
+ }
+ }
+
return mm_get_unmapped_area_vmflags(filp, addr, len, pgoff, flags,
vm_flags);
}
--
2.52.0