From: Usama Arif <usama.arif@linux.dev>
To: WANG Rui <r@hev.cc>
Cc: Liam.Howlett@oracle.com, ajd@linux.ibm.com,
akpm@linux-foundation.org, apopple@nvidia.com, baohua@kernel.org,
baolin.wang@linux.alibaba.com, brauner@kernel.org,
catalin.marinas@arm.com, david@kernel.org, dev.jain@arm.com,
jack@suse.cz, kees@kernel.org, kevin.brodsky@arm.com,
lance.yang@linux.dev, linux-arm-kernel@lists.infradead.org,
linux-fsdevel@vger.kernel.l, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
lorenzo.stoakes@oracle.com, mhocko@suse.com, npache@redhat.com,
pasha.tatashin@soleen.com, rmclure@linux.ibm.com,
rppt@kernel.org, ryan.roberts@arm.com, surenb@google.com,
vbabka@kernel.org, viro@zeniv.linux.org.uk, willy@infradead.org
Subject: Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
Date: Fri, 27 Mar 2026 12:53:34 -0400
Message-ID: <0725ce97-b8a3-47c9-952f-7b512873cc35@linux.dev>
In-Reply-To: <20260320160519.80962-1-r@hev.cc>
On 20/03/2026 19:05, WANG Rui wrote:
> Hi Usama,
>
> On Fri, Mar 20, 2026 at 10:04 PM Usama Arif <usama.arif@linux.dev> wrote:
>> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
>> index 8e89cc5b28200..042af81766fcd 100644
>> --- a/fs/binfmt_elf.c
>> +++ b/fs/binfmt_elf.c
>> @@ -49,6 +49,7 @@
>> #include <uapi/linux/rseq.h>
>> #include <asm/param.h>
>> #include <asm/page.h>
>> +#include <linux/pagemap.h>
>>
>> #ifndef ELF_COMPAT
>> #define ELF_COMPAT 0
>> @@ -488,19 +489,51 @@ static int elf_read(struct file *file, void *buf, size_t len, loff_t pos)
>> return 0;
>> }
>>
>> -static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
>> +static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr,
>> + struct file *filp)
>> {
>> unsigned long alignment = 0;
>> + unsigned long max_folio_size = PAGE_SIZE;
>> int i;
>>
>> + if (filp && filp->f_mapping)
>> + max_folio_size = mapping_max_folio_size(filp->f_mapping);
>
> From experiments (with 16K base pages), mapping_max_folio_size() appears to
> depend on the filesystem. It returns 8M on ext4, while on btrfs it always
> falls back to PAGE_SIZE (it seems CONFIG_BTRFS_EXPERIMENTAL=y may change this).
> This looks overly conservative and ends up missing practical optimization
> opportunities.

mapping_max_folio_size() reflects what the page cache will actually
allocate for a given filesystem, since readahead caps folio allocation
at mapping_max_folio_order() (in page_cache_ra_order()). If btrfs
reports PAGE_SIZE, readahead won't allocate large folios for it, so
there are no large folios to coalesce PTEs for. Aligning the binary
beyond that would only reduce ASLR entropy for no benefit.

I don't think we should over-align binaries on filesystems that can't
take advantage of it.
>
>> +
>> for (i = 0; i < nr; i++) {
>> if (cmds[i].p_type == PT_LOAD) {
>> unsigned long p_align = cmds[i].p_align;
>> + unsigned long size;
>>
>> /* skip non-power of two alignments as invalid */
>> if (!is_power_of_2(p_align))
>> continue;
>> alignment = max(alignment, p_align);
>> +
>> + /*
>> + * Try to align the binary to the largest folio
>> + * size that the page cache supports, so the
>> + * hardware can coalesce PTEs (e.g. arm64
>> + * contpte) or use PMD mappings for large folios.
>> + *
>> + * Use the largest power-of-2 that fits within
>> + * the segment size, capped by what the page
>> + * cache will allocate. Only align when the
>> + * segment's virtual address and file offset are
>> + * already aligned to the folio size, as
>> + * misalignment would prevent coalescing anyway.
>> + *
>> + * The segment size check avoids reducing ASLR
>> + * entropy for small binaries that cannot
>> + * benefit.
>> + */
>> + if (!cmds[i].p_filesz)
>> + continue;
>> + size = rounddown_pow_of_two(cmds[i].p_filesz);
>> + size = min(size, max_folio_size);
>> + if (size > PAGE_SIZE &&
>> + IS_ALIGNED(cmds[i].p_vaddr, size) &&
>> + IS_ALIGNED(cmds[i].p_offset, size))
>> + alignment = max(alignment, size);
>
> In my patch [1], by aligning eligible segments to PMD_SIZE, THP can quickly
> collapse them into large mappings with minimal warmup. That doesn’t happen
> with the current behavior. I think allowing a reasonably sized PMD (say <= 32M)
> is worth considering. All we really need here is to ensure virtual address
> alignment. The rest can be left to THP under always, which can decide whether
> to collapse or not based on memory pressure and other factors.
>
> [1] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
>
>> }
>> }
>>
>> @@ -1104,7 +1137,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
>> }
>>
>> /* Calculate any requested alignment. */
>> - alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
>> + alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum,
>> + bprm->file);
>>
>> /**
>> * DOC: PIE handling
>> --
>> 2.52.0
>>
>
> Thanks,
> Rui