* [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
2026-03-20 13:58 [PATCH v2 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
@ 2026-03-20 13:58 ` Usama Arif
2026-03-20 14:18 ` Jan Kara
2026-03-20 14:26 ` Kiryl Shutsemau
2026-03-20 13:58 ` [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order() Usama Arif
` (2 subsequent siblings)
3 siblings, 2 replies; 16+ messages in thread
From: Usama Arif @ 2026-03-20 13:58 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
lorenzo.stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, linux-fsdevel
The mmap_miss counter in do_sync_mmap_readahead() tracks whether
readahead is useful for mmap'd file access. It is incremented by 1 on
every page cache miss in do_sync_mmap_readahead(), and decremented in
two places:
- filemap_map_pages(): decremented by N for each of N pages
successfully mapped via fault-around (pages found already in cache,
evidence readahead was useful). Only pages not in the workingset
count as hits.
- do_async_mmap_readahead(): decremented by 1 when a page with
PG_readahead is found in cache.
When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
disabled, including the targeted VM_EXEC readahead [1] that requests
large folio orders for contpte mapping.
On arm64 with 64K base pages, both decrement paths are inactive:
1. filemap_map_pages() is never called because fault_around_pages
(65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
requires fault_around_pages > 1. With only 1 page in the
fault-around window, there is nothing "around" to map.
2. do_async_mmap_readahead() never fires for exec mappings because
exec readahead sets async_size = 0, so no PG_readahead markers
are placed.
With no decrements, mmap_miss monotonically increases past
MMAP_LOTSAMISS after 100 page faults, disabling all subsequent
exec readahead.
Fix this by excluding VM_EXEC VMAs from the mmap_miss logic, similar
to how VM_SEQ_READ is already excluded. The exec readahead path is
targeted (one folio at the fault location, async_size=0), not
speculative prefetch, so the mmap_miss heuristic designed to throttle
wasteful speculative readahead should not apply to it.
[1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/filemap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 6cd7974d4adab..7d89c6b384cc4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3331,7 +3331,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
}
}
- if (!(vm_flags & VM_SEQ_READ)) {
+ if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
/* Avoid banging the cache line if not needed */
mmap_miss = READ_ONCE(ra->mmap_miss);
if (mmap_miss < MMAP_LOTSAMISS * 10)
--
2.52.0
^ permalink raw reply related [flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
2026-03-20 13:58 ` [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
@ 2026-03-20 14:18 ` Jan Kara
2026-03-20 14:26 ` Kiryl Shutsemau
1 sibling, 0 replies; 16+ messages in thread
From: Jan Kara @ 2026-03-20 14:18 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On Fri 20-03-26 06:58:51, Usama Arif wrote:
> The mmap_miss counter in do_sync_mmap_readahead() tracks whether
> readahead is useful for mmap'd file access. It is incremented by 1 on
> every page cache miss in do_sync_mmap_readahead(), and decremented in
> two places:
>
> - filemap_map_pages(): decremented by N for each of N pages
> successfully mapped via fault-around (pages found already in cache,
> evidence readahead was useful). Only pages not in the workingset
> count as hits.
>
> - do_async_mmap_readahead(): decremented by 1 when a page with
> PG_readahead is found in cache.
>
> When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
> disabled, including the targeted VM_EXEC readahead [1] that requests
> large folio orders for contpte mapping.
>
> On arm64 with 64K base pages, both decrement paths are inactive:
>
> 1. filemap_map_pages() is never called because fault_around_pages
> (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
> requires fault_around_pages > 1. With only 1 page in the
> fault-around window, there is nothing "around" to map.
>
> 2. do_async_mmap_readahead() never fires for exec mappings because
> exec readahead sets async_size = 0, so no PG_readahead markers
> are placed.
>
> With no decrements, mmap_miss monotonically increases past
> MMAP_LOTSAMISS after 100 page faults, disabling all subsequent
> exec readahead.
>
> Fix this by excluding VM_EXEC VMAs from the mmap_miss logic, similar
> to how VM_SEQ_READ is already excluded. The exec readahead path is
> targeted (one folio at the fault location, async_size=0), not
> speculative prefetch, so the mmap_miss heuristic designed to throttle
> wasteful speculative readahead should not apply to it.
>
> [1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> mm/filemap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6cd7974d4adab..7d89c6b384cc4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3331,7 +3331,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> }
> }
>
> - if (!(vm_flags & VM_SEQ_READ)) {
> + if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
> /* Avoid banging the cache line if not needed */
> mmap_miss = READ_ONCE(ra->mmap_miss);
> if (mmap_miss < MMAP_LOTSAMISS * 10)
> --
> 2.52.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
2026-03-20 13:58 ` [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
2026-03-20 14:18 ` Jan Kara
@ 2026-03-20 14:26 ` Kiryl Shutsemau
1 sibling, 0 replies; 16+ messages in thread
From: Kiryl Shutsemau @ 2026-03-20 14:26 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro, wilts,
ziy, hannes, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:51AM -0700, Usama Arif wrote:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6cd7974d4adab..7d89c6b384cc4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3331,7 +3331,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> }
> }
>
> - if (!(vm_flags & VM_SEQ_READ)) {
> + if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
Strictly speaking, the fact that the file is mapped as executable doesn't
mean we are serving an instruction fetch page fault.
FAULT_FLAG_INSTRUCTION would be the signal, but it is only provided by a
handful of architectures.
VM_EXEC is a good enough proxy.
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
2026-03-20 13:58 [PATCH v2 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
2026-03-20 13:58 ` [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
@ 2026-03-20 13:58 ` Usama Arif
2026-03-20 14:41 ` Kiryl Shutsemau
2026-03-20 14:42 ` Jan Kara
2026-03-20 13:58 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Usama Arif
2026-03-20 13:58 ` [PATCH v2 4/4] mm: align file-backed mmap to max folio order in thp_get_unmapped_area Usama Arif
3 siblings, 2 replies; 16+ messages in thread
From: Usama Arif @ 2026-03-20 13:58 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
lorenzo.stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, linux-fsdevel
Replace the arch-specific exec_folio_order() hook with a generic
preferred_exec_order() that dynamically computes the readahead folio
order for executable memory. It targets min(PMD_ORDER, 2M) as the
maximum, which optimally gives the right answer for contpte (arm64),
PMD mapping (x86, arm64 4K), and architectures with smaller PMDs
(s390 1M). It adapts at runtime based on:
- VMA size: caps the order so folios fit within the mapping
- Memory pressure: steps down the order when the local node's free
memory is below the high watermark for the requested order
This avoids over-allocating on memory-constrained systems while still
requesting the optimal order when memory is plentiful.
Since exec_folio_order() is no longer needed, remove the arm64
definition and the generic default from pgtable.h.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
arch/arm64/include/asm/pgtable.h | 8 -----
include/linux/pgtable.h | 11 ------
mm/filemap.c | 57 ++++++++++++++++++++++++++++----
3 files changed, 51 insertions(+), 25 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b3e58735c49bd..b1e74940624d8 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1599,14 +1599,6 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
*/
#define arch_wants_old_prefaulted_pte cpu_has_hw_af
-/*
- * Request exec memory is read into pagecache in at least 64K folios. This size
- * can be contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB
- * entry), and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base
- * pages are in use.
- */
-#define exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
-
static inline bool pud_sect_supported(void)
{
return PAGE_SIZE == SZ_4K;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a50df42a893fb..874333549eb3c 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -577,17 +577,6 @@ static inline bool arch_has_hw_pte_young(void)
}
#endif
-#ifndef exec_folio_order
-/*
- * Returns preferred minimum folio order for executable file-backed memory. Must
- * be in range [0, PMD_ORDER). Default to order-0.
- */
-static inline unsigned int exec_folio_order(void)
-{
- return 0;
-}
-#endif
-
#ifndef arch_check_zapped_pte
static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
pte_t pte)
diff --git a/mm/filemap.c b/mm/filemap.c
index 7d89c6b384cc4..aebfb78e487d7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3290,6 +3290,52 @@ static int lock_folio_maybe_drop_mmap(struct vm_fault *vmf, struct folio *folio,
return 1;
}
+/*
+ * Compute the preferred folio order for executable memory readahead.
+ * Targets min(PMD_ORDER, 2M) as the maximum, which gives the
+ * optimal order for contpte (arm64), PMD mapping (x86, arm64 4K), and
+ * architectures with smaller PMDs (s390 1M). The 2M cap also avoids
+ * requesting excessively large folios on configurations where PMD_ORDER
+ * is much larger (32M on 16K pages, 512M on 64K pages), which would cause
+ * unnecessary memory pressure. Adapts at runtime based on:
+ *
+ * - VMA size: cap the order so folios fit within the mapping.
+ *
+ * - Memory pressure: step down the order when free memory on the local
+ * node is below the high watermark for the requested order. This
+ * avoids expensive reclaim or compaction to satisfy large folio
+ * allocations when memory is tight.
+ */
+static unsigned int preferred_exec_order(struct vm_area_struct *vma)
+{
+ int order;
+ unsigned long vma_len = vma_pages(vma);
+ struct zone *zone;
+ gfp_t gfp;
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return 0;
+
+ /* Cap at min(PMD_ORDER, 2M) */
+ order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT));
+
+ /* Don't request folios larger than the VMA */
+ order = min(order, ilog2(vma_len));
+
+ /* Step down under memory pressure */
+ gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
+ zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
+ gfp_zone(gfp), NULL)->zone;
+ if (zone) {
+ while (order > 0 &&
+ !zone_watermark_ok(zone, order,
+ high_wmark_pages(zone), 0, 0))
+ order--;
+ }
+
+ return order;
+}
+
/*
* Synchronous readahead happens when we don't even find a page in the page
* cache at all. We don't want to perform IO under the mmap sem, so if we have
@@ -3363,11 +3409,10 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
if (vm_flags & VM_EXEC) {
/*
- * Allow arch to request a preferred minimum folio order for
- * executable memory. This can often be beneficial to
- * performance if (e.g.) arm64 can contpte-map the folio.
- * Executable memory rarely benefits from readahead, due to its
- * random access nature, so set async_size to 0.
+ * Request a preferred folio order for executable memory,
+ * dynamically adapted to VMA size and memory pressure.
+ * Executable memory rarely benefits from speculative readahead
+ * due to its random access nature, so set async_size to 0.
*
* Limit to the boundaries of the VMA to avoid reading in any
* pad that might exist between sections, which would be a waste
@@ -3378,7 +3423,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
unsigned long end = start + vma_pages(vma);
unsigned long ra_end;
- ra->order = exec_folio_order();
+ ra->order = preferred_exec_order(vma);
ra->start = round_down(vmf->pgoff, 1UL << ra->order);
ra->start = max(ra->start, start);
ra_end = round_up(ra->start + ra->ra_pages, 1UL << ra->order);
--
2.52.0
^ permalink raw reply related [flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
2026-03-20 13:58 ` [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order() Usama Arif
@ 2026-03-20 14:41 ` Kiryl Shutsemau
2026-03-20 14:42 ` Jan Kara
1 sibling, 0 replies; 16+ messages in thread
From: Kiryl Shutsemau @ 2026-03-20 14:41 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro, wilts,
ziy, hannes, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:52AM -0700, Usama Arif wrote:
> + * allocations when memory is tight.
> + */
> +static unsigned int preferred_exec_order(struct vm_area_struct *vma)
> +{
> + int order;
> + unsigned long vma_len = vma_pages(vma);
> + struct zone *zone;
> + gfp_t gfp;
> +
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + return 0;
> +
> + /* Cap at min(PMD_ORDER, 2M) */
> + order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT));
> +
> + /* Don't request folios larger than the VMA */
> + order = min(order, ilog2(vma_len));
> +
> + /* Step down under memory pressure */
> + gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
> + zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
> + gfp_zone(gfp), NULL)->zone;
> + if (zone) {
> + while (order > 0 &&
> + !zone_watermark_ok(zone, order,
> + high_wmark_pages(zone), 0, 0))
> + order--;
> + }
Eww. That's overkill and a layering violation.
If we need something like this, it has to be done within the page
allocator: an allocation interface that takes a range (or mask) of
acceptable orders.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
2026-03-20 13:58 ` [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order() Usama Arif
2026-03-20 14:41 ` Kiryl Shutsemau
@ 2026-03-20 14:42 ` Jan Kara
2026-03-26 12:40 ` Usama Arif
1 sibling, 1 reply; 16+ messages in thread
From: Jan Kara @ 2026-03-20 14:42 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On Fri 20-03-26 06:58:52, Usama Arif wrote:
> Replace the arch-specific exec_folio_order() hook with a generic
> preferred_exec_order() that dynamically computes the readahead folio
> order for executable memory. It targets min(PMD_ORDER, 2M) as the
> maximum, which optimally gives the right answer for contpte (arm64),
> PMD mapping (x86, arm64 4K), and architectures with smaller PMDs
> (s390 1M). It adapts at runtime based on:
>
> - VMA size: caps the order so folios fit within the mapping
> - Memory pressure: steps down the order when the local node's free
> memory is below the high watermark for the requested order
>
> This avoids over-allocating on memory-constrained systems while still
> requesting the optimal order when memory is plentiful.
>
> Since exec_folio_order() is no longer needed, remove the arm64
> definition and the generic default from pgtable.h.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
...
> +static unsigned int preferred_exec_order(struct vm_area_struct *vma)
> +{
> + int order;
> + unsigned long vma_len = vma_pages(vma);
> + struct zone *zone;
> + gfp_t gfp;
> +
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + return 0;
> +
> + /* Cap at min(PMD_ORDER, 2M) */
> + order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT));
> +
> + /* Don't request folios larger than the VMA */
> + order = min(order, ilog2(vma_len));
Hum, as far as I'm checking page_cache_ra_order() used in
do_sync_mmap_readahead(), ra->order is the preferred order but it will be
trimmed down to fit both within the file and within ra->size. And ra->size
is set for the readahead to fit within the vma so I don't think any order
trimming based on vma length is needed in this place?
> + /* Step down under memory pressure */
> + gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
> + zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
> + gfp_zone(gfp), NULL)->zone;
> + if (zone) {
> + while (order > 0 &&
> + !zone_watermark_ok(zone, order,
> + high_wmark_pages(zone), 0, 0))
> + order--;
> + }
It looks wrong for this logic to be here. Trimming order based on memory
pressure makes sense (and we've already got reports that on memory limited
devices large order folios in the page cache have too big memory overhead
so we'll likely need to handle that for page cache allocations in general)
but IMHO it belongs to page_cache_ra_order() or some other common place
like that.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
2026-03-20 14:42 ` Jan Kara
@ 2026-03-26 12:40 ` Usama Arif
2026-03-26 16:21 ` Jan Kara
0 siblings, 1 reply; 16+ messages in thread
From: Usama Arif @ 2026-03-26 12:40 UTC (permalink / raw)
To: Jan Kara, david, ryan.roberts
Cc: Andrew Morton, willy, linux-mm, r, ajd, apopple, baohua,
baolin.wang, brauner, catalin.marinas, dev.jain, kees,
kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On 20/03/2026 17:42, Jan Kara wrote:
> On Fri 20-03-26 06:58:52, Usama Arif wrote:
>> Replace the arch-specific exec_folio_order() hook with a generic
>> preferred_exec_order() that dynamically computes the readahead folio
>> order for executable memory. It targets min(PMD_ORDER, 2M) as the
>> maximum, which optimally gives the right answer for contpte (arm64),
>> PMD mapping (x86, arm64 4K), and architectures with smaller PMDs
>> (s390 1M). It adapts at runtime based on:
>>
>> - VMA size: caps the order so folios fit within the mapping
>> - Memory pressure: steps down the order when the local node's free
>> memory is below the high watermark for the requested order
>>
>> This avoids over-allocating on memory-constrained systems while still
>> requesting the optimal order when memory is plentiful.
>>
>> Since exec_folio_order() is no longer needed, remove the arm64
>> definition and the generic default from pgtable.h.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ...
>> +static unsigned int preferred_exec_order(struct vm_area_struct *vma)
>> +{
>> + int order;
>> + unsigned long vma_len = vma_pages(vma);
>> + struct zone *zone;
>> + gfp_t gfp;
>> +
>> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
>> + return 0;
>> +
>> + /* Cap at min(PMD_ORDER, 2M) */
>> + order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT));
>> +
>> + /* Don't request folios larger than the VMA */
>> + order = min(order, ilog2(vma_len));
>
Hi Jan,
Thanks for the feedback and sorry for the late reply! I was travelling
during the week.
> Hum, as far as I'm checking page_cache_ra_order() used in
> do_sync_mmap_readahead(), ra->order is the preferred order but it will be
> trimmed down to fit both within the file and within ra->size. And ra->size
> is set for the readahead to fit within the vma so I don't think any order
> trimming based on vma length is needed in this place?
Ack, yes makes sense.
>
>> + /* Step down under memory pressure */
>> + gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
>> + zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
>> + gfp_zone(gfp), NULL)->zone;
>> + if (zone) {
>> + while (order > 0 &&
>> + !zone_watermark_ok(zone, order,
>> + high_wmark_pages(zone), 0, 0))
>> + order--;
>> + }
>
> It looks wrong for this logic to be here. Trimming order based on memory
> pressure makes sense (and we've already got reports that on memory limited
> devices large order folios in the page cache have too big memory overhead
> so we'll likely need to handle that for page cache allocations in general)
> but IMHO it belongs to page_cache_ra_order() or some other common place
> like that.
>
> Honza
So I have been thinking about this. readahead_gfp_mask() already sets
__GFP_NORETRY, so we won't try aggressive reclaim/compaction to satisfy
the allocation. page_cache_ra_order() falls through to the fallback path,
faulting in an order-0 page when the allocation is not satisfied.
Since the allocator already naturally steps down under memory pressure,
the explicit zone_watermark_ok() loop might be redundant?
What are your thoughts on just setting
ra->order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT))?
We can do the higher-order allocation with gfp &= ~__GFP_RECLAIM
for the VM_EXEC case.
^ permalink raw reply [flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
2026-03-26 12:40 ` Usama Arif
@ 2026-03-26 16:21 ` Jan Kara
0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2026-03-26 16:21 UTC (permalink / raw)
To: Usama Arif
Cc: Jan Kara, david, ryan.roberts, Andrew Morton, willy, linux-mm, r,
ajd, apopple, baohua, baolin.wang, brauner, catalin.marinas,
dev.jain, kees, kevin.brodsky, lance.yang, Liam.Howlett,
linux-arm-kernel, linux-fsdevel, linux-kernel, lorenzo.stoakes,
mhocko, npache, pasha.tatashin, rmclure, rppt, surenb, vbabka,
Al Viro, wilts.infradead.org, ziy, hannes, kas, shakeel.butt,
kernel-team
On Thu 26-03-26 08:40:21, Usama Arif wrote:
> On 20/03/2026 17:42, Jan Kara wrote:
> >> + /* Step down under memory pressure */
> >> + gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
> >> + zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
> >> + gfp_zone(gfp), NULL)->zone;
> >> + if (zone) {
> >> + while (order > 0 &&
> >> + !zone_watermark_ok(zone, order,
> >> + high_wmark_pages(zone), 0, 0))
> >> + order--;
> >> + }
> >
> > It looks wrong for this logic to be here. Trimming order based on memory
> > pressure makes sense (and we've already got reports that on memory limited
> > devices large order folios in the page cache have too big memory overhead
> > so we'll likely need to handle that for page cache allocations in general)
> > but IMHO it belongs to page_cache_ra_order() or some other common place
> > like that.
> >
> > Honza
>
> So I have been thinking about this. readahead_gfp_mask() already sets
> __GFP_NORETRY, so we wont try aggressive reclaim/compaction to satisfy
> the allocation. page_cache_ra_order() falls through to the fallback path
> faulting in order 0 page when allocation is not satsified.
>
> So the allocator already naturally steps down under memory pressure,
> the explicit zone_watermark_ok() loop might be redundant?
Probably yes. I still think we'll have to somehow better tune the used
order based on the expected size of the page cache (2M folios seem
unreasonably large for a machine that has e.g. 1G of memory in total, even
if it has enough free memory at this point in time - we'll benefit from
smaller folios and thus finer-grained folio activity tracking for such
cases). But that's not for this patch set.
> What are your thoughts on just setting
> ra->order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT))?
> We can do the higher orlder allocation with gfp &= ~__GFP_RECLAIM
> for the VM_EXEC case.
Yes, it's simple and it makes sense to me so if others are fine with it...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 13:58 [PATCH v2 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
2026-03-20 13:58 ` [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Usama Arif
2026-03-20 13:58 ` [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order() Usama Arif
@ 2026-03-20 13:58 ` Usama Arif
2026-03-20 14:55 ` Kiryl Shutsemau
` (2 more replies)
2026-03-20 13:58 ` [PATCH v2 4/4] mm: align file-backed mmap to max folio order in thp_get_unmapped_area Usama Arif
3 siblings, 3 replies; 16+ messages in thread
From: Usama Arif @ 2026-03-20 13:58 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
lorenzo.stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, linux-fsdevel
For PIE binaries (ET_DYN), the load address is randomized at PAGE_SIZE
granularity via arch_mmap_rnd(). On arm64 with 64K base pages, this
means the binary is 64K-aligned, but contpte mapping requires 2M
(CONT_PTE_SIZE) alignment.
Without proper virtual address alignment, readahead patches that
allocate 2M folios with 2M-aligned file offsets and physical addresses
cannot benefit from contpte mapping, as the contpte fold check in
contpte_set_ptes() requires the virtual address to be CONT_PTE_SIZE-
aligned.
Fix this by extending maximum_alignment() to consider the maximum folio
size supported by the page cache (via mapping_max_folio_size()). For
each PT_LOAD segment, the alignment is bumped to the largest
power-of-2 that fits within the segment size, capped by the max folio
size the filesystem will allocate, if:
- Both p_vaddr and p_offset are aligned to that size
- The segment is large enough (p_filesz >= size)
This ensures load_bias is folio-aligned so that file-offset-aligned
folios map to properly aligned virtual addresses, enabling hardware PTE
coalescing (e.g. arm64 contpte) and PMD mappings for large folios.
The segment size check avoids reducing ASLR entropy for small binaries
that cannot benefit from large folio alignment.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
fs/binfmt_elf.c | 38 ++++++++++++++++++++++++++++++++++++--
1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 8e89cc5b28200..042af81766fcd 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -49,6 +49,7 @@
#include <uapi/linux/rseq.h>
#include <asm/param.h>
#include <asm/page.h>
+#include <linux/pagemap.h>
#ifndef ELF_COMPAT
#define ELF_COMPAT 0
@@ -488,19 +489,51 @@ static int elf_read(struct file *file, void *buf, size_t len, loff_t pos)
return 0;
}
-static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
+static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr,
+ struct file *filp)
{
unsigned long alignment = 0;
+ unsigned long max_folio_size = PAGE_SIZE;
int i;
+ if (filp && filp->f_mapping)
+ max_folio_size = mapping_max_folio_size(filp->f_mapping);
+
for (i = 0; i < nr; i++) {
if (cmds[i].p_type == PT_LOAD) {
unsigned long p_align = cmds[i].p_align;
+ unsigned long size;
/* skip non-power of two alignments as invalid */
if (!is_power_of_2(p_align))
continue;
alignment = max(alignment, p_align);
+
+ /*
+ * Try to align the binary to the largest folio
+ * size that the page cache supports, so the
+ * hardware can coalesce PTEs (e.g. arm64
+ * contpte) or use PMD mappings for large folios.
+ *
+ * Use the largest power-of-2 that fits within
+ * the segment size, capped by what the page
+ * cache will allocate. Only align when the
+ * segment's virtual address and file offset are
+ * already aligned to the folio size, as
+ * misalignment would prevent coalescing anyway.
+ *
+ * The segment size check avoids reducing ASLR
+ * entropy for small binaries that cannot
+ * benefit.
+ */
+ if (!cmds[i].p_filesz)
+ continue;
+ size = rounddown_pow_of_two(cmds[i].p_filesz);
+ size = min(size, max_folio_size);
+ if (size > PAGE_SIZE &&
+ IS_ALIGNED(cmds[i].p_vaddr, size) &&
+ IS_ALIGNED(cmds[i].p_offset, size))
+ alignment = max(alignment, size);
}
}
@@ -1104,7 +1137,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
}
/* Calculate any requested alignment. */
- alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
+ alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum,
+ bprm->file);
/**
* DOC: PIE handling
--
2.52.0
^ permalink raw reply related [flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 13:58 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Usama Arif
@ 2026-03-20 14:55 ` Kiryl Shutsemau
2026-03-20 15:58 ` Matthew Wilcox
2026-03-20 16:05 ` WANG Rui
2 siblings, 0 replies; 16+ messages in thread
From: Kiryl Shutsemau @ 2026-03-20 14:55 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro, wilts,
ziy, hannes, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:53AM -0700, Usama Arif wrote:
> For PIE binaries (ET_DYN), the load address is randomized at PAGE_SIZE
> granularity via arch_mmap_rnd(). On arm64 with 64K base pages, this
> means the binary is 64K-aligned, but contpte mapping requires 2M
> (CONT_PTE_SIZE) alignment.
>
> Without proper virtual address alignment, readahead patches that
> allocate 2M folios with 2M-aligned file offsets and physical addresses
> cannot benefit from contpte mapping, as the contpte fold check in
> contpte_set_ptes() requires the virtual address to be CONT_PTE_SIZE-
> aligned.
>
> Fix this by extending maximum_alignment() to consider the maximum folio
> size supported by the page cache (via mapping_max_folio_size()). For
> each PT_LOAD segment, the alignment is bumped to the largest
> power-of-2 that fits within the segment size, capped by the max folio
> size the filesystem will allocate, if:
>
> - Both p_vaddr and p_offset are aligned to that size
> - The segment is large enough (p_filesz >= size)
>
> This ensures load_bias is folio-aligned so that file-offset-aligned
> folios map to properly aligned virtual addresses, enabling hardware PTE
> coalescing (e.g. arm64 contpte) and PMD mappings for large folios.
>
> The segment size check avoids reducing ASLR entropy for small binaries
> that cannot benefit from large folio alignment.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 13:58 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Usama Arif
2026-03-20 14:55 ` Kiryl Shutsemau
@ 2026-03-20 15:58 ` Matthew Wilcox
2026-03-20 16:05 ` WANG Rui
2 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2026-03-20 15:58 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:53AM -0700, Usama Arif wrote:
> -static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
> +static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr,
> + struct file *filp)
> {
> unsigned long alignment = 0;
> + unsigned long max_folio_size = PAGE_SIZE;
> int i;
>
> + if (filp && filp->f_mapping)
> + max_folio_size = mapping_max_folio_size(filp->f_mapping);
Under what circumstances can bprm->file be NULL?
Also we tend to prefer the name "file" rather than "filp" for new code
(yes, there's a lot of old code out there).
> +
> + /*
> + * Try to align the binary to the largest folio
> + * size that the page cache supports, so the
> + * hardware can coalesce PTEs (e.g. arm64
> + * contpte) or use PMD mappings for large folios.
> + *
> + * Use the largest power-of-2 that fits within
> + * the segment size, capped by what the page
> + * cache will allocate. Only align when the
> + * segment's virtual address and file offset are
> + * already aligned to the folio size, as
> + * misalignment would prevent coalescing anyway.
> + *
> + * The segment size check avoids reducing ASLR
> + * entropy for small binaries that cannot
> + * benefit.
> + */
> + if (!cmds[i].p_filesz)
> + continue;
> + size = rounddown_pow_of_two(cmds[i].p_filesz);
> + size = min(size, max_folio_size);
> + if (size > PAGE_SIZE &&
> + IS_ALIGNED(cmds[i].p_vaddr, size) &&
> + IS_ALIGNED(cmds[i].p_offset, size))
> + alignment = max(alignment, size);
Can this not all be factored out into a different function? Also, I
think it was done a bit better here:
https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc/
+ if (!IS_ALIGNED(cmd->p_vaddr | cmd->p_offset, PMD_SIZE))
+ return false;
* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 13:58 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Usama Arif
2026-03-20 14:55 ` Kiryl Shutsemau
2026-03-20 15:58 ` Matthew Wilcox
@ 2026-03-20 16:05 ` WANG Rui
2026-03-20 17:47 ` Matthew Wilcox
2 siblings, 1 reply; 16+ messages in thread
From: WANG Rui @ 2026-03-20 16:05 UTC (permalink / raw)
To: usama.arif
Cc: Liam.Howlett, ajd, akpm, apopple, baohua, baolin.wang, brauner,
catalin.marinas, david, dev.jain, jack, kees, kevin.brodsky,
lance.yang, linux-arm-kernel, linux-fsdevel, linux-fsdevel,
linux-kernel, linux-mm, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, r, rmclure, rppt, ryan.roberts, surenb, vbabka,
viro, willy
Hi Usama,
On Fri, Mar 20, 2026 at 10:04 PM Usama Arif <usama.arif@linux.dev> wrote:
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 8e89cc5b28200..042af81766fcd 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -49,6 +49,7 @@
> #include <uapi/linux/rseq.h>
> #include <asm/param.h>
> #include <asm/page.h>
> +#include <linux/pagemap.h>
>
> #ifndef ELF_COMPAT
> #define ELF_COMPAT 0
> @@ -488,19 +489,51 @@ static int elf_read(struct file *file, void *buf, size_t len, loff_t pos)
> return 0;
> }
>
> -static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
> +static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr,
> + struct file *filp)
> {
> unsigned long alignment = 0;
> + unsigned long max_folio_size = PAGE_SIZE;
> int i;
>
> + if (filp && filp->f_mapping)
> + max_folio_size = mapping_max_folio_size(filp->f_mapping);
From experiments (with 16K base pages), mapping_max_folio_size() appears to
depend on the filesystem. It returns 8M on ext4, while on btrfs it always
falls back to PAGE_SIZE (it seems CONFIG_BTRFS_EXPERIMENTAL=y may change this).
This looks overly conservative and ends up missing practical optimization
opportunities.
> +
> for (i = 0; i < nr; i++) {
> if (cmds[i].p_type == PT_LOAD) {
> unsigned long p_align = cmds[i].p_align;
> + unsigned long size;
>
> /* skip non-power of two alignments as invalid */
> if (!is_power_of_2(p_align))
> continue;
> alignment = max(alignment, p_align);
> +
> + /*
> + * Try to align the binary to the largest folio
> + * size that the page cache supports, so the
> + * hardware can coalesce PTEs (e.g. arm64
> + * contpte) or use PMD mappings for large folios.
> + *
> + * Use the largest power-of-2 that fits within
> + * the segment size, capped by what the page
> + * cache will allocate. Only align when the
> + * segment's virtual address and file offset are
> + * already aligned to the folio size, as
> + * misalignment would prevent coalescing anyway.
> + *
> + * The segment size check avoids reducing ASLR
> + * entropy for small binaries that cannot
> + * benefit.
> + */
> + if (!cmds[i].p_filesz)
> + continue;
> + size = rounddown_pow_of_two(cmds[i].p_filesz);
> + size = min(size, max_folio_size);
> + if (size > PAGE_SIZE &&
> + IS_ALIGNED(cmds[i].p_vaddr, size) &&
> + IS_ALIGNED(cmds[i].p_offset, size))
> + alignment = max(alignment, size);
In my patch [1], by aligning eligible segments to PMD_SIZE, THP can quickly
collapse them into large mappings with minimal warmup. That doesn’t happen
with the current behavior. I think allowing a reasonably sized PMD (say <= 32M)
is worth considering. All we really need here is to ensure virtual address
alignment. The rest can be left to THP in "always" mode, which can decide
whether to collapse based on memory pressure and other factors.
[1] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
> }
> }
>
> @@ -1104,7 +1137,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
> }
>
> /* Calculate any requested alignment. */
> - alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
> + alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum,
> + bprm->file);
>
> /**
> * DOC: PIE handling
> --
> 2.52.0
>
Thanks,
Rui
* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 16:05 ` WANG Rui
@ 2026-03-20 17:47 ` Matthew Wilcox
0 siblings, 0 replies; 16+ messages in thread
From: Matthew Wilcox @ 2026-03-20 17:47 UTC (permalink / raw)
To: WANG Rui
Cc: usama.arif, Liam.Howlett, ajd, akpm, apopple, baohua, baolin.wang,
brauner, catalin.marinas, david, dev.jain, jack, kees,
kevin.brodsky, lance.yang, linux-arm-kernel, linux-fsdevel,
linux-fsdevel, linux-kernel, linux-mm, lorenzo.stoakes, mhocko,
npache, pasha.tatashin, rmclure, rppt, ryan.roberts, surenb,
vbabka, viro
On Sat, Mar 21, 2026 at 12:05:18AM +0800, WANG Rui wrote:
> From experiments (with 16K base pages), mapping_max_folio_size() appears to
> depend on the filesystem. It returns 8M on ext4, while on btrfs it always
> falls back to PAGE_SIZE (it seems CONFIG_BTRFS_EXPERIMENTAL=y may change this).
> This looks overly conservative and ends up missing practical optimization
> opportunities.
btrfs only supports large folios with CONFIG_BTRFS_EXPERIMENTAL.
I mean, it's only been five years since it was added to XFS, can't rush
these things.
* [PATCH v2 4/4] mm: align file-backed mmap to max folio order in thp_get_unmapped_area
2026-03-20 13:58 [PATCH v2 0/4] mm: improve large folio readahead and alignment for exec memory Usama Arif
` (2 preceding siblings ...)
2026-03-20 13:58 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Usama Arif
@ 2026-03-20 13:58 ` Usama Arif
2026-03-20 15:06 ` Kiryl Shutsemau
3 siblings, 1 reply; 16+ messages in thread
From: Usama Arif @ 2026-03-20 13:58 UTC (permalink / raw)
To: Andrew Morton, david, willy, ryan.roberts, linux-mm
Cc: r, jack, ajd, apopple, baohua, baolin.wang, brauner,
catalin.marinas, dev.jain, kees, kevin.brodsky, lance.yang,
Liam.Howlett, linux-arm-kernel, linux-fsdevel, linux-kernel,
lorenzo.stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
surenb, vbabka, Al Viro, wilts.infradead.org, linux-fsdevel
thp_get_unmapped_area() is the get_unmapped_area callback for
filesystems like ext4, xfs, and btrfs. It attempts to align the virtual
address for PMD_SIZE THP mappings, but on arm64 with 64K base pages
PMD_SIZE is 512M, which is too large for typical shared library mappings,
so the alignment always fails and falls back to PAGE_SIZE.
This means shared libraries loaded by ld.so via mmap() get 64K-aligned
virtual addresses, preventing contpte mapping even when 2M folios are
allocated with properly aligned file offsets and physical addresses.
Add a fallback in thp_get_unmapped_area_vmflags() that uses the
filesystem's mapping_max_folio_size() to determine alignment, capped to
the mapping length via rounddown_pow_of_two(len). This aligns mappings
to the largest folio the page cache will actually allocate, without
over-aligning small mappings.
The fallback is naturally a no-op for filesystems that don't support
large folios and skips the retry when the alignment would equal PMD_SIZE
(already attempted above).
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
mm/huge_memory.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e2746ea74adf..4005084c9c65b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1242,6 +1242,20 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
if (ret)
return ret;
+ if (filp && filp->f_mapping) {
+ unsigned long max_folio_size =
+ mapping_max_folio_size(filp->f_mapping);
+ unsigned long size = rounddown_pow_of_two(len);
+
+ size = min(size, max_folio_size);
+ if (size > PAGE_SIZE && size != PMD_SIZE) {
+ ret = __thp_get_unmapped_area(filp, addr, len, off,
+ flags, size, vm_flags);
+ if (ret)
+ return ret;
+ }
+ }
+
return mm_get_unmapped_area_vmflags(filp, addr, len, pgoff, flags,
vm_flags);
}
--
2.52.0
* Re: [PATCH v2 4/4] mm: align file-backed mmap to max folio order in thp_get_unmapped_area
2026-03-20 13:58 ` [PATCH v2 4/4] mm: align file-backed mmap to max folio order in thp_get_unmapped_area Usama Arif
@ 2026-03-20 15:06 ` Kiryl Shutsemau
0 siblings, 0 replies; 16+ messages in thread
From: Kiryl Shutsemau @ 2026-03-20 15:06 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro, ziy,
hannes, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:54AM -0700, Usama Arif wrote:
> thp_get_unmapped_area() is the get_unmapped_area callback for
> filesystems like ext4, xfs, and btrfs. It attempts to align the virtual
> address for PMD_SIZE THP mappings, but on arm64 with 64K base pages
> PMD_SIZE is 512M, which is too large for typical shared library mappings,
> so the alignment always fails and falls back to PAGE_SIZE.
>
> This means shared libraries loaded by ld.so via mmap() get 64K-aligned
> virtual addresses, preventing contpte mapping even when 2M folios are
> allocated with properly aligned file offsets and physical addresses.
>
> Add a fallback in thp_get_unmapped_area_vmflags() that uses the
> filesystem's mapping_max_folio_size() to determine alignment, capped to
> the mapping length via rounddown_pow_of_two(len). This aligns mappings
> to the largest folio the page cache will actually allocate, without
> over-aligning small mappings.
>
> The fallback is naturally a no-op for filesystems that don't support
> large folios and skips the retry when the alignment would equal PMD_SIZE
> (already attempted above).
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
> mm/huge_memory.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8e2746ea74adf..4005084c9c65b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1242,6 +1242,20 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> if (ret)
> return ret;
>
> + if (filp && filp->f_mapping) {
> + unsigned long max_folio_size =
> + mapping_max_folio_size(filp->f_mapping);
> + unsigned long size = rounddown_pow_of_two(len);
> +
> + size = min(size, max_folio_size);
> + if (size > PAGE_SIZE && size != PMD_SIZE) {
> + ret = __thp_get_unmapped_area(filp, addr, len, off,
> + flags, size, vm_flags);
Have you considered integrating this inside __thp_get_unmapped_area?
Like, start with PMD_SIZE alignment, then lower to
mapping_max_folio_size if needed and then lower further based on mapping
size.
--
Kiryl Shutsemau / Kirill A. Shutemov