* Re: [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
[not found] ` <20260320140315.979307-2-usama.arif@linux.dev>
@ 2026-03-20 14:18 ` Jan Kara
2026-03-20 14:26 ` Kiryl Shutsemau
1 sibling, 0 replies; 13+ messages in thread
From: Jan Kara @ 2026-03-20 14:18 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On Fri 20-03-26 06:58:51, Usama Arif wrote:
> The mmap_miss counter in do_sync_mmap_readahead() tracks whether
> readahead is useful for mmap'd file access. It is incremented by 1 on
> every page cache miss in do_sync_mmap_readahead(), and decremented in
> two places:
>
> - filemap_map_pages(): decremented by N for each of N pages
> successfully mapped via fault-around (pages found already in cache,
> evidence readahead was useful). Only pages not in the workingset
> count as hits.
>
> - do_async_mmap_readahead(): decremented by 1 when a page with
> PG_readahead is found in cache.
>
> When the counter exceeds MMAP_LOTSAMISS (100), all readahead is
> disabled, including the targeted VM_EXEC readahead [1] that requests
> large folio orders for contpte mapping.
>
> On arm64 with 64K base pages, both decrement paths are inactive:
>
> 1. filemap_map_pages() is never called because fault_around_pages
> (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
> requires fault_around_pages > 1. With only 1 page in the
> fault-around window, there is nothing "around" to map.
>
> 2. do_async_mmap_readahead() never fires for exec mappings because
> exec readahead sets async_size = 0, so no PG_readahead markers
> are placed.
>
> With no decrements, mmap_miss monotonically increases past
> MMAP_LOTSAMISS after 100 page faults, disabling all subsequent
> exec readahead.
>
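
[Editorial note: to make the failure mode above concrete, here is a small
userspace sketch — not kernel code. MMAP_LOTSAMISS and the 10x cap mirror
mm/filemap.c; the helper names and the rest of the model are illustrative.]

```c
#include <assert.h>
#include <stdbool.h>

#define MMAP_LOTSAMISS 100

/* Simplified model: on arm64/64K neither decrement path ever runs,
 * so the counter only moves up (capped at 10x, as in the kernel). */
static unsigned int mmap_miss;

static void sync_fault_on_cache_miss(void)
{
	if (mmap_miss < MMAP_LOTSAMISS * 10)
		mmap_miss++;
}

static bool readahead_enabled(void)
{
	/* do_sync_mmap_readahead() bails out once the counter exceeds 100 */
	return mmap_miss <= MMAP_LOTSAMISS;
}

/* With 64K base pages the default fault-around window (65536 bytes)
 * covers a single page (65536 >> 16 == 1), so should_fault_around()
 * rejects it and filemap_map_pages() never decrements the counter. */
```

After 101 cache-miss faults the model disables readahead permanently,
matching the monotonic-increase behavior described above.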
> Fix this by excluding VM_EXEC VMAs from the mmap_miss logic, similar
> to how VM_SEQ_READ is already excluded. The exec readahead path is
> targeted (one folio at the fault location, async_size=0), not
> speculative prefetch, so the mmap_miss heuristic designed to throttle
> wasteful speculative readahead should not apply to it.
>
> [1] https://lore.kernel.org/all/20250430145920.3748738-6-ryan.roberts@arm.com/
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> mm/filemap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6cd7974d4adab..7d89c6b384cc4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3331,7 +3331,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> }
> }
>
> - if (!(vm_flags & VM_SEQ_READ)) {
> + if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
> /* Avoid banging the cache line if not needed */
> mmap_miss = READ_ONCE(ra->mmap_miss);
> if (mmap_miss < MMAP_LOTSAMISS * 10)
> --
> 2.52.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead
[not found] ` <20260320140315.979307-2-usama.arif@linux.dev>
2026-03-20 14:18 ` [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Jan Kara
@ 2026-03-20 14:26 ` Kiryl Shutsemau
1 sibling, 0 replies; 13+ messages in thread
From: Kiryl Shutsemau @ 2026-03-20 14:26 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro, wilts,
ziy, hannes, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:51AM -0700, Usama Arif wrote:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6cd7974d4adab..7d89c6b384cc4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3331,7 +3331,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> }
> }
>
> - if (!(vm_flags & VM_SEQ_READ)) {
> + if (!(vm_flags & (VM_SEQ_READ | VM_EXEC))) {
Strictly speaking, the fact that the file is mapped as executable doesn't
mean we are serving an instruction fetch page fault.
FAULT_FLAG_INSTRUCTION would be the signal, but it is only provided by a
handful of architectures.
VM_EXEC is a good enough proxy.
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
[not found] ` <20260320140315.979307-3-usama.arif@linux.dev>
@ 2026-03-20 14:41 ` Kiryl Shutsemau
2026-03-20 14:42 ` Jan Kara
1 sibling, 0 replies; 13+ messages in thread
From: Kiryl Shutsemau @ 2026-03-20 14:41 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro, wilts,
ziy, hannes, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:52AM -0700, Usama Arif wrote:
> + * allocations when memory is tight.
> + */
> +static unsigned int preferred_exec_order(struct vm_area_struct *vma)
> +{
> + int order;
> + unsigned long vma_len = vma_pages(vma);
> + struct zone *zone;
> + gfp_t gfp;
> +
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + return 0;
> +
> + /* Cap at min(PMD_ORDER, 2M) */
> + order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT));
> +
> + /* Don't request folios larger than the VMA */
> + order = min(order, ilog2(vma_len));
> +
> + /* Step down under memory pressure */
> + gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
> + zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
> + gfp_zone(gfp), NULL)->zone;
> + if (zone) {
> + while (order > 0 &&
> + !zone_watermark_ok(zone, order,
> + high_wmark_pages(zone), 0, 0))
> + order--;
> + }
Eww. That's overkill and a layering violation.
If we need something like this, it has to be done within the page
allocator: an allocation interface that takes a range (or mask) of
acceptable orders.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
[not found] ` <20260320140315.979307-3-usama.arif@linux.dev>
2026-03-20 14:41 ` [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order() Kiryl Shutsemau
@ 2026-03-20 14:42 ` Jan Kara
2026-03-26 12:40 ` Usama Arif
1 sibling, 1 reply; 13+ messages in thread
From: Jan Kara @ 2026-03-20 14:42 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On Fri 20-03-26 06:58:52, Usama Arif wrote:
> Replace the arch-specific exec_folio_order() hook with a generic
> preferred_exec_order() that dynamically computes the readahead folio
> order for executable memory. It targets min(PMD_ORDER, 2M) as the
> maximum, which optimally gives the right answer for contpte (arm64),
> PMD mapping (x86, arm64 4K), and architectures with smaller PMDs
> (s390 1M). It adapts at runtime based on:
>
> - VMA size: caps the order so folios fit within the mapping
> - Memory pressure: steps down the order when the local node's free
> memory is below the high watermark for the requested order
>
> This avoids over-allocating on memory-constrained systems while still
> requesting the optimal order when memory is plentiful.
>
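
[Editorial note: as a worked example of the min(PMD_ORDER, 2M) target, the
userspace sketch below — not the patch itself, helper names are
illustrative — computes the resulting order for a few base-page/PMD-size
combinations.]

```c
#include <assert.h>

#define SZ_2M (2UL << 20)

/* ilog2 for power-of-two values (illustrative helper) */
static unsigned int ilog2u(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Model of the cap: min(PMD order, order of a 2M folio in base pages) */
static unsigned int preferred_exec_order_model(unsigned long page_size,
					       unsigned long pmd_size)
{
	unsigned int pmd_order = ilog2u(pmd_size / page_size);
	unsigned int order_2m = ilog2u(SZ_2M / page_size);

	return pmd_order < order_2m ? pmd_order : order_2m;
}

/*
 * 4K pages, 2M PMD (x86-64, arm64/4K): order 9 -> 2M PMD mapping
 * 64K pages, 512M PMD (arm64/64K):     order 5 -> 2M contpte-sized folio
 * 4K pages, 1M PMD (s390):             order 8 -> 1M folio
 */
```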
> Since exec_folio_order() is no longer needed, remove the arm64
> definition and the generic default from pgtable.h.
>
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
...
> +static unsigned int preferred_exec_order(struct vm_area_struct *vma)
> +{
> + int order;
> + unsigned long vma_len = vma_pages(vma);
> + struct zone *zone;
> + gfp_t gfp;
> +
> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + return 0;
> +
> + /* Cap at min(PMD_ORDER, 2M) */
> + order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT));
> +
> + /* Don't request folios larger than the VMA */
> + order = min(order, ilog2(vma_len));
Hum, as far as I can see, page_cache_ra_order() used in
do_sync_mmap_readahead() treats ra->order as the preferred order, but it
will be trimmed down to fit both within the file and within ra->size. And
ra->size is set so the readahead fits within the VMA, so I don't think any
order trimming based on the VMA length is needed in this place?
> + /* Step down under memory pressure */
> + gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
> + zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
> + gfp_zone(gfp), NULL)->zone;
> + if (zone) {
> + while (order > 0 &&
> + !zone_watermark_ok(zone, order,
> + high_wmark_pages(zone), 0, 0))
> + order--;
> + }
It looks wrong for this logic to be here. Trimming order based on memory
pressure makes sense (and we've already got reports that on memory limited
devices large order folios in the page cache have too big memory overhead
so we'll likely need to handle that for page cache allocations in general)
but IMHO it belongs to page_cache_ra_order() or some other common place
like that.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
[not found] ` <20260320140315.979307-4-usama.arif@linux.dev>
@ 2026-03-20 14:55 ` Kiryl Shutsemau
2026-03-20 15:58 ` Matthew Wilcox
2026-03-20 16:05 ` WANG Rui
2 siblings, 0 replies; 13+ messages in thread
From: Kiryl Shutsemau @ 2026-03-20 14:55 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro, wilts,
ziy, hannes, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:53AM -0700, Usama Arif wrote:
> For PIE binaries (ET_DYN), the load address is randomized at PAGE_SIZE
> granularity via arch_mmap_rnd(). On arm64 with 64K base pages, this
> means the binary is 64K-aligned, but contpte mapping requires 2M
> (CONT_PTE_SIZE) alignment.
>
> Without proper virtual address alignment, readahead patches that
> allocate 2M folios with 2M-aligned file offsets and physical addresses
> cannot benefit from contpte mapping, as the contpte fold check in
> contpte_set_ptes() requires the virtual address to be CONT_PTE_SIZE-
> aligned.
>
> Fix this by extending maximum_alignment() to consider the maximum folio
> size supported by the page cache (via mapping_max_folio_size()). For
> each PT_LOAD segment, the alignment is bumped to the largest
> power-of-2 that fits within the segment size, capped by the max folio
> size the filesystem will allocate, if:
>
> - Both p_vaddr and p_offset are aligned to that size
> - The segment is large enough (p_filesz >= size)
>
> This ensures load_bias is folio-aligned so that file-offset-aligned
> folios map to properly aligned virtual addresses, enabling hardware PTE
> coalescing (e.g. arm64 contpte) and PMD mappings for large folios.
>
> The segment size check avoids reducing ASLR entropy for small binaries
> that cannot benefit from large folio alignment.
>
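
[Editorial note: the per-segment eligibility rules above can be sketched in
userspace as follows — an illustrative model of the patch's logic, with
PAGE_SIZE assuming arm64 64K base pages.]

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE 65536UL	/* assumption: arm64 64K base pages */

static unsigned long rounddown_pow2(unsigned long n)
{
	while (n & (n - 1))
		n &= n - 1;	/* clear lowest set bit until one remains */
	return n;
}

static bool aligned(unsigned long x, unsigned long a)
{
	return (x & (a - 1)) == 0;
}

/* Alignment contributed by one PT_LOAD segment, or 0 if ineligible */
static unsigned long segment_alignment(unsigned long p_vaddr,
				       unsigned long p_offset,
				       unsigned long p_filesz,
				       unsigned long max_folio_size)
{
	unsigned long size;

	if (!p_filesz)
		return 0;
	size = rounddown_pow2(p_filesz);
	if (size > max_folio_size)
		size = max_folio_size;
	if (size > PAGE_SIZE && aligned(p_vaddr, size) && aligned(p_offset, size))
		return size;
	return 0;
}
```

A 5M text segment with 2M-aligned p_vaddr and p_offset bumps the alignment
to 2M; an 80K segment rounds down to a single page and contributes nothing,
preserving ASLR entropy for small binaries.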
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v2 4/4] mm: align file-backed mmap to max folio order in thp_get_unmapped_area
[not found] ` <20260320140315.979307-5-usama.arif@linux.dev>
@ 2026-03-20 15:06 ` Kiryl Shutsemau
0 siblings, 0 replies; 13+ messages in thread
From: Kiryl Shutsemau @ 2026-03-20 15:06 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, willy, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro, ziy,
hannes, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:54AM -0700, Usama Arif wrote:
> thp_get_unmapped_area() is the get_unmapped_area callback for
> filesystems like ext4, xfs, and btrfs. It attempts to align the virtual
> address for PMD_SIZE THP mappings, but on arm64 with 64K base pages
> PMD_SIZE is 512M, which is too large for typical shared library mappings,
> so the alignment always fails and falls back to PAGE_SIZE.
>
> This means shared libraries loaded by ld.so via mmap() get 64K-aligned
> virtual addresses, preventing contpte mapping even when 2M folios are
> allocated with properly aligned file offsets and physical addresses.
>
> Add a fallback in thp_get_unmapped_area_vmflags() that uses the
> filesystem's mapping_max_folio_size() to determine alignment, capped to
> the mapping length via rounddown_pow_of_two(len). This aligns mappings
> to the largest folio the page cache will actually allocate, without
> over-aligning small mappings.
>
> The fallback is naturally a no-op for filesystems that don't support
> large folios and skips the retry when the alignment would equal PMD_SIZE
> (already attempted above).
>
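
[Editorial note: the fallback alignment above can be modeled as the
following userspace sketch of the patch's arithmetic, with arm64/64K
constants assumed.]

```c
#include <assert.h>

#define PAGE_SIZE 65536UL	/* assumption: arm64 64K base pages */
#define PMD_SIZE (512UL << 20)	/* 512M PMD with 64K pages */

static unsigned long rounddown_pow2(unsigned long n)
{
	while (n & (n - 1))
		n &= n - 1;
	return n;
}

/* Alignment the fallback would request, or 0 when it is a no-op */
static unsigned long fallback_alignment(unsigned long len,
					unsigned long max_folio_size)
{
	unsigned long size = rounddown_pow2(len);

	if (size > max_folio_size)
		size = max_folio_size;
	if (size > PAGE_SIZE && size != PMD_SIZE)
		return size;
	return 0;
}
```

A 3M shared library on a large-folio filesystem gets 2M alignment; on a
filesystem whose max folio is one page, or for a single-page mapping, the
fallback does nothing.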
> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ---
> mm/huge_memory.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8e2746ea74adf..4005084c9c65b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1242,6 +1242,20 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> if (ret)
> return ret;
>
> + if (filp && filp->f_mapping) {
> + unsigned long max_folio_size =
> + mapping_max_folio_size(filp->f_mapping);
> + unsigned long size = rounddown_pow_of_two(len);
> +
> + size = min(size, max_folio_size);
> + if (size > PAGE_SIZE && size != PMD_SIZE) {
> + ret = __thp_get_unmapped_area(filp, addr, len, off,
> + flags, size, vm_flags);
Have you considered integrating this inside __thp_get_unmapped_area()?
Like, start with PMD_SIZE alignment, then lower to
mapping_max_folio_size() if needed, and then lower further based on the
mapping size.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
[not found] ` <20260320140315.979307-4-usama.arif@linux.dev>
2026-03-20 14:55 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Kiryl Shutsemau
@ 2026-03-20 15:58 ` Matthew Wilcox
2026-03-27 16:51 ` Usama Arif
2026-03-20 16:05 ` WANG Rui
2 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2026-03-20 15:58 UTC (permalink / raw)
To: Usama Arif
Cc: Andrew Morton, david, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On Fri, Mar 20, 2026 at 06:58:53AM -0700, Usama Arif wrote:
> -static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
> +static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr,
> + struct file *filp)
> {
> unsigned long alignment = 0;
> + unsigned long max_folio_size = PAGE_SIZE;
> int i;
>
> + if (filp && filp->f_mapping)
> + max_folio_size = mapping_max_folio_size(filp->f_mapping);
Under what circumstances can bprm->file be NULL?
Also we tend to prefer the name "file" rather than "filp" for new code
(yes, there's a lot of old code out there).
> +
> + /*
> + * Try to align the binary to the largest folio
> + * size that the page cache supports, so the
> + * hardware can coalesce PTEs (e.g. arm64
> + * contpte) or use PMD mappings for large folios.
> + *
> + * Use the largest power-of-2 that fits within
> + * the segment size, capped by what the page
> + * cache will allocate. Only align when the
> + * segment's virtual address and file offset are
> + * already aligned to the folio size, as
> + * misalignment would prevent coalescing anyway.
> + *
> + * The segment size check avoids reducing ASLR
> + * entropy for small binaries that cannot
> + * benefit.
> + */
> + if (!cmds[i].p_filesz)
> + continue;
> + size = rounddown_pow_of_two(cmds[i].p_filesz);
> + size = min(size, max_folio_size);
> + if (size > PAGE_SIZE &&
> + IS_ALIGNED(cmds[i].p_vaddr, size) &&
> + IS_ALIGNED(cmds[i].p_offset, size))
> + alignment = max(alignment, size);
Can this not all be factored out into a different function? Also, I
think it was done a bit better here:
https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc/
+ if (!IS_ALIGNED(cmd->p_vaddr | cmd->p_offset, PMD_SIZE))
+ return false;
* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
[not found] ` <20260320140315.979307-4-usama.arif@linux.dev>
2026-03-20 14:55 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Kiryl Shutsemau
2026-03-20 15:58 ` Matthew Wilcox
@ 2026-03-20 16:05 ` WANG Rui
2026-03-20 17:47 ` Matthew Wilcox
2026-03-27 16:53 ` Usama Arif
2 siblings, 2 replies; 13+ messages in thread
From: WANG Rui @ 2026-03-20 16:05 UTC (permalink / raw)
To: usama.arif
Cc: Liam.Howlett, ajd, akpm, apopple, baohua, baolin.wang, brauner,
catalin.marinas, david, dev.jain, jack, kees, kevin.brodsky,
lance.yang, linux-arm-kernel, linux-fsdevel, linux-fsdevel,
linux-kernel, linux-mm, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, r, rmclure, rppt, ryan.roberts, surenb, vbabka,
viro, willy
Hi Usama,
On Fri, Mar 20, 2026 at 10:04 PM Usama Arif <usama.arif@linux.dev> wrote:
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 8e89cc5b28200..042af81766fcd 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -49,6 +49,7 @@
> #include <uapi/linux/rseq.h>
> #include <asm/param.h>
> #include <asm/page.h>
> +#include <linux/pagemap.h>
>
> #ifndef ELF_COMPAT
> #define ELF_COMPAT 0
> @@ -488,19 +489,51 @@ static int elf_read(struct file *file, void *buf, size_t len, loff_t pos)
> return 0;
> }
>
> -static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
> +static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr,
> + struct file *filp)
> {
> unsigned long alignment = 0;
> + unsigned long max_folio_size = PAGE_SIZE;
> int i;
>
> + if (filp && filp->f_mapping)
> + max_folio_size = mapping_max_folio_size(filp->f_mapping);
From experiments (with 16K base pages), mapping_max_folio_size() appears to
depend on the filesystem. It returns 8M on ext4, while on btrfs it always
falls back to PAGE_SIZE (it seems CONFIG_BTRFS_EXPERIMENTAL=y may change this).
This looks overly conservative and ends up missing practical optimization
opportunities.
> +
> for (i = 0; i < nr; i++) {
> if (cmds[i].p_type == PT_LOAD) {
> unsigned long p_align = cmds[i].p_align;
> + unsigned long size;
>
> /* skip non-power of two alignments as invalid */
> if (!is_power_of_2(p_align))
> continue;
> alignment = max(alignment, p_align);
> +
> + /*
> + * Try to align the binary to the largest folio
> + * size that the page cache supports, so the
> + * hardware can coalesce PTEs (e.g. arm64
> + * contpte) or use PMD mappings for large folios.
> + *
> + * Use the largest power-of-2 that fits within
> + * the segment size, capped by what the page
> + * cache will allocate. Only align when the
> + * segment's virtual address and file offset are
> + * already aligned to the folio size, as
> + * misalignment would prevent coalescing anyway.
> + *
> + * The segment size check avoids reducing ASLR
> + * entropy for small binaries that cannot
> + * benefit.
> + */
> + if (!cmds[i].p_filesz)
> + continue;
> + size = rounddown_pow_of_two(cmds[i].p_filesz);
> + size = min(size, max_folio_size);
> + if (size > PAGE_SIZE &&
> + IS_ALIGNED(cmds[i].p_vaddr, size) &&
> + IS_ALIGNED(cmds[i].p_offset, size))
> + alignment = max(alignment, size);
In my patch [1], by aligning eligible segments to PMD_SIZE, THP can quickly
collapse them into large mappings with minimal warmup. That doesn't happen
with the current behavior. I think allowing a reasonably sized PMD (say <= 32M)
is worth considering. All we really need here is to ensure virtual address
alignment. The rest can be left to THP in "always" mode, which can decide
whether to collapse or not based on memory pressure and other factors.
[1] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
> }
> }
>
> @@ -1104,7 +1137,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
> }
>
> /* Calculate any requested alignment. */
> - alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
> + alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum,
> + bprm->file);
>
> /**
> * DOC: PIE handling
> --
> 2.52.0
>
Thanks,
Rui
* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 16:05 ` WANG Rui
@ 2026-03-20 17:47 ` Matthew Wilcox
2026-03-27 16:53 ` Usama Arif
1 sibling, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2026-03-20 17:47 UTC (permalink / raw)
To: WANG Rui
Cc: usama.arif, Liam.Howlett, ajd, akpm, apopple, baohua, baolin.wang,
brauner, catalin.marinas, david, dev.jain, jack, kees,
kevin.brodsky, lance.yang, linux-arm-kernel, linux-fsdevel,
linux-fsdevel, linux-kernel, linux-mm, lorenzo.stoakes, mhocko,
npache, pasha.tatashin, rmclure, rppt, ryan.roberts, surenb,
vbabka, viro
On Sat, Mar 21, 2026 at 12:05:18AM +0800, WANG Rui wrote:
> > From experiments (with 16K base pages), mapping_max_folio_size() appears to
> depend on the filesystem. It returns 8M on ext4, while on btrfs it always
> falls back to PAGE_SIZE (it seems CONFIG_BTRFS_EXPERIMENTAL=y may change this).
> This looks overly conservative and ends up missing practical optimization
> opportunities.
btrfs only supports large folios with CONFIG_BTRFS_EXPERIMENTAL.
I mean, it's only been five years since it was added to XFS, can't rush
these things.
* Re: [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
2026-03-20 14:42 ` Jan Kara
@ 2026-03-26 12:40 ` Usama Arif
2026-03-26 16:21 ` Jan Kara
0 siblings, 1 reply; 13+ messages in thread
From: Usama Arif @ 2026-03-26 12:40 UTC (permalink / raw)
To: Jan Kara, david, ryan.roberts
Cc: Andrew Morton, willy, linux-mm, r, ajd, apopple, baohua,
baolin.wang, brauner, catalin.marinas, dev.jain, kees,
kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On 20/03/2026 17:42, Jan Kara wrote:
> On Fri 20-03-26 06:58:52, Usama Arif wrote:
>> Replace the arch-specific exec_folio_order() hook with a generic
>> preferred_exec_order() that dynamically computes the readahead folio
>> order for executable memory. It targets min(PMD_ORDER, 2M) as the
>> maximum, which optimally gives the right answer for contpte (arm64),
>> PMD mapping (x86, arm64 4K), and architectures with smaller PMDs
>> (s390 1M). It adapts at runtime based on:
>>
>> - VMA size: caps the order so folios fit within the mapping
>> - Memory pressure: steps down the order when the local node's free
>> memory is below the high watermark for the requested order
>>
>> This avoids over-allocating on memory-constrained systems while still
>> requesting the optimal order when memory is plentiful.
>>
>> Since exec_folio_order() is no longer needed, remove the arm64
>> definition and the generic default from pgtable.h.
>>
>> Signed-off-by: Usama Arif <usama.arif@linux.dev>
> ...
>> +static unsigned int preferred_exec_order(struct vm_area_struct *vma)
>> +{
>> + int order;
>> + unsigned long vma_len = vma_pages(vma);
>> + struct zone *zone;
>> + gfp_t gfp;
>> +
>> + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
>> + return 0;
>> +
>> + /* Cap at min(PMD_ORDER, 2M) */
>> + order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT));
>> +
>> + /* Don't request folios larger than the VMA */
>> + order = min(order, ilog2(vma_len));
>
Hi Jan,
Thanks for the feedback and sorry for the late reply! I was travelling
during the week.
> Hum, as far as I'm checking page_cache_ra_order() used in
> do_sync_mmap_readahead(), ra->order is the preferred order but it will be
> trimmed down to fit both within the file and within ra->size. And ra->size
> is set for the readahead to fit within the vma so I don't think any order
> trimming based on vma length is needed in this place?
Ack, yes makes sense.
>
>> + /* Step down under memory pressure */
>> + gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
>> + zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
>> + gfp_zone(gfp), NULL)->zone;
>> + if (zone) {
>> + while (order > 0 &&
>> + !zone_watermark_ok(zone, order,
>> + high_wmark_pages(zone), 0, 0))
>> + order--;
>> + }
>
> It looks wrong for this logic to be here. Trimming order based on memory
> pressure makes sense (and we've already got reports that on memory limited
> devices large order folios in the page cache have too big memory overhead
> so we'll likely need to handle that for page cache allocations in general)
> but IMHO it belongs to page_cache_ra_order() or some other common place
> like that.
>
> Honza
So I have been thinking about this. readahead_gfp_mask() already sets
__GFP_NORETRY, so we won't try aggressive reclaim/compaction to satisfy
the allocation. page_cache_ra_order() falls through to the fallback path
of faulting in an order-0 page when the allocation is not satisfied.
So the allocator already naturally steps down under memory pressure;
the explicit zone_watermark_ok() loop might be redundant?
What are your thoughts on just setting
ra->order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT))?
We can do the higher order allocation with gfp &= ~__GFP_RECLAIM
for the VM_EXEC case.
* Re: [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order()
2026-03-26 12:40 ` Usama Arif
@ 2026-03-26 16:21 ` Jan Kara
0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2026-03-26 16:21 UTC (permalink / raw)
To: Usama Arif
Cc: Jan Kara, david, ryan.roberts, Andrew Morton, willy, linux-mm, r,
ajd, apopple, baohua, baolin.wang, brauner, catalin.marinas,
dev.jain, kees, kevin.brodsky, lance.yang, Liam.Howlett,
linux-arm-kernel, linux-fsdevel, linux-kernel, lorenzo.stoakes,
mhocko, npache, pasha.tatashin, rmclure, rppt, surenb, vbabka,
Al Viro, wilts.infradead.org, ziy, hannes, kas, shakeel.butt,
kernel-team
On Thu 26-03-26 08:40:21, Usama Arif wrote:
> On 20/03/2026 17:42, Jan Kara wrote:
> >> + /* Step down under memory pressure */
> >> + gfp = mapping_gfp_mask(vma->vm_file->f_mapping);
> >> + zone = first_zones_zonelist(node_zonelist(numa_node_id(), gfp),
> >> + gfp_zone(gfp), NULL)->zone;
> >> + if (zone) {
> >> + while (order > 0 &&
> >> + !zone_watermark_ok(zone, order,
> >> + high_wmark_pages(zone), 0, 0))
> >> + order--;
> >> + }
> >
> > It looks wrong for this logic to be here. Trimming order based on memory
> > pressure makes sense (and we've already got reports that on memory limited
> > devices large order folios in the page cache have too big memory overhead
> > so we'll likely need to handle that for page cache allocations in general)
> > but IMHO it belongs to page_cache_ra_order() or some other common place
> > like that.
> >
> > Honza
>
> So I have been thinking about this. readahead_gfp_mask() already sets
> __GFP_NORETRY, so we wont try aggressive reclaim/compaction to satisfy
> the allocation. page_cache_ra_order() falls through to the fallback path
> faulting in order 0 page when allocation is not satsified.
>
> So the allocator already naturally steps down under memory pressure,
> the explicit zone_watermark_ok() loop might be redundant?
Probably yes. I still think we'll have to somehow better tweak the used
order based on the expected size of the page cache (2M folios seem
unreasonably large for a machine that has e.g. 1G of memory in total, even
if it has enough free memory at this point in time - we'll benefit from
smaller folios and thus finer-grained folio activity tracking for such
cases). But that's not for this patch set.
> What are your thoughts on just setting
> ra->order = min(HPAGE_PMD_ORDER, ilog2(SZ_2M >> PAGE_SHIFT))?
> We can do the higher orlder allocation with gfp &= ~__GFP_RECLAIM
> for the VM_EXEC case.
Yes, it's simple and it makes sense to me so if others are fine with it...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 15:58 ` Matthew Wilcox
@ 2026-03-27 16:51 ` Usama Arif
0 siblings, 0 replies; 13+ messages in thread
From: Usama Arif @ 2026-03-27 16:51 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, david, ryan.roberts, linux-mm, r, jack, ajd,
apopple, baohua, baolin.wang, brauner, catalin.marinas, dev.jain,
kees, kevin.brodsky, lance.yang, Liam.Howlett, linux-arm-kernel,
linux-fsdevel, linux-kernel, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, surenb, vbabka, Al Viro,
wilts.infradead.org, ziy, hannes, kas, shakeel.butt, kernel-team
On 20/03/2026 18:58, Matthew Wilcox wrote:
> On Fri, Mar 20, 2026 at 06:58:53AM -0700, Usama Arif wrote:
>> -static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
>> +static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr,
>> + struct file *filp)
>> {
>> unsigned long alignment = 0;
>> + unsigned long max_folio_size = PAGE_SIZE;
>> int i;
>>
>> + if (filp && filp->f_mapping)
>> + max_folio_size = mapping_max_folio_size(filp->f_mapping);
>
> Under what circumstances can bprm->file be NULL?
Yeah, it's unnecessary here. It's used in other places where it is never
checked, so we can remove the check.
>
> Also we tend to prefer the name "file" rather than "filp" for new code
> (yes, there's a lot of old code out there).
>
ack. will change in next revision.
>> +
>> + /*
>> + * Try to align the binary to the largest folio
>> + * size that the page cache supports, so the
>> + * hardware can coalesce PTEs (e.g. arm64
>> + * contpte) or use PMD mappings for large folios.
>> + *
>> + * Use the largest power-of-2 that fits within
>> + * the segment size, capped by what the page
>> + * cache will allocate. Only align when the
>> + * segment's virtual address and file offset are
>> + * already aligned to the folio size, as
>> + * misalignment would prevent coalescing anyway.
>> + *
>> + * The segment size check avoids reducing ASLR
>> + * entropy for small binaries that cannot
>> + * benefit.
>> + */
>> + if (!cmds[i].p_filesz)
>> + continue;
>> + size = rounddown_pow_of_two(cmds[i].p_filesz);
>> + size = min(size, max_folio_size);
>> + if (size > PAGE_SIZE &&
>> + IS_ALIGNED(cmds[i].p_vaddr, size) &&
>> + IS_ALIGNED(cmds[i].p_offset, size))
>> + alignment = max(alignment, size);
>
> Can this not all be factored out into a different function? Also, I
> think it was done a bit better here:
> https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc/
>
> + if (!IS_ALIGNED(cmd->p_vaddr | cmd->p_offset, PMD_SIZE))
> + return false;
>
Ack, will try to address this accordingly.
Thanks for the reviews!
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
2026-03-20 16:05 ` WANG Rui
2026-03-20 17:47 ` Matthew Wilcox
@ 2026-03-27 16:53 ` Usama Arif
1 sibling, 0 replies; 13+ messages in thread
From: Usama Arif @ 2026-03-27 16:53 UTC (permalink / raw)
To: WANG Rui
Cc: Liam.Howlett, ajd, akpm, apopple, baohua, baolin.wang, brauner,
catalin.marinas, david, dev.jain, jack, kees, kevin.brodsky,
lance.yang, linux-arm-kernel, linux-fsdevel,
linux-kernel, linux-mm, lorenzo.stoakes, mhocko, npache,
pasha.tatashin, rmclure, rppt, ryan.roberts, surenb, vbabka, viro,
willy
On 20/03/2026 19:05, WANG Rui wrote:
> Hi Usama,
>
> On Fri, Mar 20, 2026 at 10:04 PM Usama Arif <usama.arif@linux.dev> wrote:
>> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
>> index 8e89cc5b28200..042af81766fcd 100644
>> --- a/fs/binfmt_elf.c
>> +++ b/fs/binfmt_elf.c
>> @@ -49,6 +49,7 @@
>> #include <uapi/linux/rseq.h>
>> #include <asm/param.h>
>> #include <asm/page.h>
>> +#include <linux/pagemap.h>
>>
>> #ifndef ELF_COMPAT
>> #define ELF_COMPAT 0
>> @@ -488,19 +489,51 @@ static int elf_read(struct file *file, void *buf, size_t len, loff_t pos)
>> return 0;
>> }
>>
>> -static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
>> +static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr,
>> + struct file *filp)
>> {
>> unsigned long alignment = 0;
>> + unsigned long max_folio_size = PAGE_SIZE;
>> int i;
>>
>> + if (filp && filp->f_mapping)
>> + max_folio_size = mapping_max_folio_size(filp->f_mapping);
>
> From experiments (with 16K base pages), mapping_max_folio_size() appears to
> depend on the filesystem. It returns 8M on ext4, while on btrfs it always
> falls back to PAGE_SIZE (it seems CONFIG_BTRFS_EXPERIMENTAL=y may change this).
> This looks overly conservative and ends up missing practical optimization
> opportunities.
mapping_max_folio_size() reflects what the page cache will actually
allocate for a given filesystem, since readahead caps folio allocation
at mapping_max_folio_order() (in page_cache_ra_order()). If btrfs
reports PAGE_SIZE, readahead won't allocate large folios for it, so
there are no large folios to coalesce PTEs for; aligning the binary
beyond that would only reduce ASLR entropy for no benefit.
I don't think we should over-align binaries on filesystems that can't
take advantage of it.
>
>> +
>> for (i = 0; i < nr; i++) {
>> if (cmds[i].p_type == PT_LOAD) {
>> unsigned long p_align = cmds[i].p_align;
>> + unsigned long size;
>>
>> /* skip non-power of two alignments as invalid */
>> if (!is_power_of_2(p_align))
>> continue;
>> alignment = max(alignment, p_align);
>> +
>> + /*
>> + * Try to align the binary to the largest folio
>> + * size that the page cache supports, so the
>> + * hardware can coalesce PTEs (e.g. arm64
>> + * contpte) or use PMD mappings for large folios.
>> + *
>> + * Use the largest power-of-2 that fits within
>> + * the segment size, capped by what the page
>> + * cache will allocate. Only align when the
>> + * segment's virtual address and file offset are
>> + * already aligned to the folio size, as
>> + * misalignment would prevent coalescing anyway.
>> + *
>> + * The segment size check avoids reducing ASLR
>> + * entropy for small binaries that cannot
>> + * benefit.
>> + */
>> + if (!cmds[i].p_filesz)
>> + continue;
>> + size = rounddown_pow_of_two(cmds[i].p_filesz);
>> + size = min(size, max_folio_size);
>> + if (size > PAGE_SIZE &&
>> + IS_ALIGNED(cmds[i].p_vaddr, size) &&
>> + IS_ALIGNED(cmds[i].p_offset, size))
>> + alignment = max(alignment, size);
>
> In my patch [1], aligning eligible segments to PMD_SIZE lets THP quickly
> collapse them into large mappings with minimal warmup. That doesn't happen
> with the current behavior. I think allowing a reasonably sized PMD alignment
> (say <= 32M) is worth considering. All we really need here is to ensure
> virtual address alignment; the rest can be left to THP under "always", which
> can decide whether to collapse based on memory pressure and other factors.
>
> [1] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
>
>> }
>> }
>>
>> @@ -1104,7 +1137,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
>> }
>>
>> /* Calculate any requested alignment. */
>> - alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
>> + alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum,
>> + bprm->file);
>>
>> /**
>> * DOC: PIE handling
>> --
>> 2.52.0
>>
>
> Thanks,
> Rui
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2026-03-27 16:53 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20260320140315.979307-1-usama.arif@linux.dev>
[not found] ` <20260320140315.979307-2-usama.arif@linux.dev>
2026-03-20 14:18 ` [PATCH v2 1/4] mm: bypass mmap_miss heuristic for VM_EXEC readahead Jan Kara
2026-03-20 14:26 ` Kiryl Shutsemau
[not found] ` <20260320140315.979307-3-usama.arif@linux.dev>
2026-03-20 14:41 ` [PATCH v2 2/4] mm: replace exec_folio_order() with generic preferred_exec_order() Kiryl Shutsemau
2026-03-20 14:42 ` Jan Kara
2026-03-26 12:40 ` Usama Arif
2026-03-26 16:21 ` Jan Kara
[not found] ` <20260320140315.979307-5-usama.arif@linux.dev>
2026-03-20 15:06 ` [PATCH v2 4/4] mm: align file-backed mmap to max folio order in thp_get_unmapped_area Kiryl Shutsemau
[not found] ` <20260320140315.979307-4-usama.arif@linux.dev>
2026-03-20 14:55 ` [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing Kiryl Shutsemau
2026-03-20 15:58 ` Matthew Wilcox
2026-03-27 16:51 ` Usama Arif
2026-03-20 16:05 ` WANG Rui
2026-03-20 17:47 ` Matthew Wilcox
2026-03-27 16:53 ` Usama Arif