From: Usama Arif <usama.arif@linux.dev>
To: WANG Rui <r@hev.cc>
Cc: Liam.Howlett@oracle.com, ajd@linux.ibm.com,
akpm@linux-foundation.org, apopple@nvidia.com, baohua@kernel.org,
baolin.wang@linux.alibaba.com, brauner@kernel.org,
catalin.marinas@arm.com, david@kernel.org, dev.jain@arm.com,
jack@suse.cz, kees@kernel.org, kevin.brodsky@arm.com,
lance.yang@linux.dev, linux-arm-kernel@lists.infradead.org,
linux-fsdevel@vger.kernel.l, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
lorenzo.stoakes@oracle.com, mhocko@suse.com, npache@redhat.com,
pasha.tatashin@soleen.com, rmclure@linux.ibm.com,
rppt@kernel.org, ryan.roberts@arm.com, surenb@google.com,
vbabka@kernel.org, viro@zeniv.linux.org.uk, willy@infradead.org
Subject: Re: [PATCH v2 3/4] elf: align ET_DYN base to max folio size for PTE coalescing
Date: Fri, 27 Mar 2026 12:53:34 -0400
Message-ID: <0725ce97-b8a3-47c9-952f-7b512873cc35@linux.dev>
In-Reply-To: <20260320160519.80962-1-r@hev.cc>
On 20/03/2026 19:05, WANG Rui wrote:
> Hi Usama,
>
> On Fri, Mar 20, 2026 at 10:04 PM Usama Arif <usama.arif@linux.dev> wrote:
>> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
>> index 8e89cc5b28200..042af81766fcd 100644
>> --- a/fs/binfmt_elf.c
>> +++ b/fs/binfmt_elf.c
>> @@ -49,6 +49,7 @@
>>  #include <uapi/linux/rseq.h>
>>  #include <asm/param.h>
>>  #include <asm/page.h>
>> +#include <linux/pagemap.h>
>>
>>  #ifndef ELF_COMPAT
>>  #define ELF_COMPAT 0
>> @@ -488,19 +489,51 @@ static int elf_read(struct file *file, void *buf, size_t len, loff_t pos)
>>  	return 0;
>>  }
>>
>> -static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
>> +static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr,
>> +				       struct file *filp)
>>  {
>>  	unsigned long alignment = 0;
>> +	unsigned long max_folio_size = PAGE_SIZE;
>>  	int i;
>>
>> +	if (filp && filp->f_mapping)
>> +		max_folio_size = mapping_max_folio_size(filp->f_mapping);
>
> From experiments (with 16K base pages), mapping_max_folio_size() appears to
> depend on the filesystem. It returns 8M on ext4, while on btrfs it always
> falls back to PAGE_SIZE (it seems CONFIG_BTRFS_EXPERIMENTAL=y may change this).
> This looks overly conservative and ends up missing practical optimization
> opportunities.

mapping_max_folio_size() reflects what the page cache will actually
allocate for a given filesystem: readahead caps folio allocation at
mapping_max_folio_order() (in page_cache_ra_order()). If btrfs reports
PAGE_SIZE, readahead won't allocate large folios for it, so there are
no large folios to coalesce PTEs for. Aligning the binary beyond that
would only reduce ASLR entropy for no benefit.

I don't think we should over-align binaries on filesystems that can't
take advantage of it.
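
To make the effect of the cap concrete, here is a minimal userspace
sketch of the per-segment selection done in the hunk quoted below. It is
not kernel code: struct seg, seg_alignment(), is_pow2() and
rounddown_pow2() are hypothetical stand-ins for the ELF program header
and for the kernel's is_power_of_2(), rounddown_pow_of_two() and
IS_ALIGNED() helpers, and the page size and folio cap are passed in as
plain numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of a PT_LOAD program header (illustration only). */
struct seg {
	uint64_t p_vaddr, p_offset, p_filesz, p_align;
};

/* Userspace stand-in for the kernel's is_power_of_2(). */
static int is_pow2(uint64_t x)
{
	return x && !(x & (x - 1));
}

/* Userspace stand-in for rounddown_pow_of_two(); x must be non-zero. */
static uint64_t rounddown_pow2(uint64_t x)
{
	uint64_t p = 1;

	assert(x != 0);
	while (p <= x / 2)
		p *= 2;
	return p;
}

/*
 * Alignment one segment contributes: p_align, raised to the largest
 * power of two that fits in the segment, capped at max_folio_size,
 * and only if vaddr and file offset are already aligned to it.
 */
static uint64_t seg_alignment(const struct seg *s, uint64_t page_size,
			      uint64_t max_folio_size)
{
	uint64_t alignment, size;

	if (!is_pow2(s->p_align))
		return 0;	/* invalid p_align: segment is skipped */
	alignment = s->p_align;

	if (!s->p_filesz)
		return alignment;

	size = rounddown_pow2(s->p_filesz);
	if (size > max_folio_size)
		size = max_folio_size;

	if (size > page_size &&
	    s->p_vaddr % size == 0 &&
	    s->p_offset % size == 0 &&
	    size > alignment)
		alignment = size;

	return alignment;
}
```

Taking the 16K base page numbers from the report above, and assuming a
3M segment at vaddr/offset 0 with a 64K p_align (made-up values), a cap
of 8M (ext4) yields 2M alignment, while a cap of PAGE_SIZE (btrfs)
leaves the alignment at the 64K p_align.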
>
>> +
>>  	for (i = 0; i < nr; i++) {
>>  		if (cmds[i].p_type == PT_LOAD) {
>>  			unsigned long p_align = cmds[i].p_align;
>> +			unsigned long size;
>>
>>  			/* skip non-power of two alignments as invalid */
>>  			if (!is_power_of_2(p_align))
>>  				continue;
>>  			alignment = max(alignment, p_align);
>> +
>> +			/*
>> +			 * Try to align the binary to the largest folio
>> +			 * size that the page cache supports, so the
>> +			 * hardware can coalesce PTEs (e.g. arm64
>> +			 * contpte) or use PMD mappings for large folios.
>> +			 *
>> +			 * Use the largest power-of-2 that fits within
>> +			 * the segment size, capped by what the page
>> +			 * cache will allocate. Only align when the
>> +			 * segment's virtual address and file offset are
>> +			 * already aligned to the folio size, as
>> +			 * misalignment would prevent coalescing anyway.
>> +			 *
>> +			 * The segment size check avoids reducing ASLR
>> +			 * entropy for small binaries that cannot
>> +			 * benefit.
>> +			 */
>> +			if (!cmds[i].p_filesz)
>> +				continue;
>> +			size = rounddown_pow_of_two(cmds[i].p_filesz);
>> +			size = min(size, max_folio_size);
>> +			if (size > PAGE_SIZE &&
>> +			    IS_ALIGNED(cmds[i].p_vaddr, size) &&
>> +			    IS_ALIGNED(cmds[i].p_offset, size))
>> +				alignment = max(alignment, size);
>
> In my patch [1], by aligning eligible segments to PMD_SIZE, THP can quickly
> collapse them into large mappings with minimal warmup. That doesn’t happen
> with the current behavior. I think allowing a reasonably sized PMD (say <= 32M)
> is worth considering. All we really need here is to ensure virtual address
> alignment. The rest can be left to THP under always, which can decide whether
> to collapse or not based on memory pressure and other factors.
>
> [1] https://lore.kernel.org/linux-fsdevel/20260313005211.882831-1-r@hev.cc
>
>>  		}
>>  	}
>>
>> @@ -1104,7 +1137,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
>>  	}
>>
>>  	/* Calculate any requested alignment. */
>> -	alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
>> +	alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum,
>> +				      bprm->file);
>>
>>  	/**
>>  	 * DOC: PIE handling
>> --
>> 2.52.0
>>
>
> Thanks,
> Rui