public inbox for linux-fsdevel@vger.kernel.org
* [PATCH v5] binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for THP
@ 2026-03-13  0:52 WANG Rui
  2026-03-15  3:46 ` Lance Yang
  0 siblings, 1 reply; 7+ messages in thread
From: WANG Rui @ 2026-03-13  0:52 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, David Hildenbrand, Jan Kara,
	Kees Cook, Matthew Wilcox
  Cc: linux-fsdevel, linux-mm, linux-kernel, WANG Rui

File-backed mappings can only be collapsed into PMD-sized THP when the
virtual address and file offset are both hugepage-aligned and the
mapping is large enough to cover a huge page.

For ELF executables loaded by the kernel ELF binary loader, PT_LOAD
segments are aligned according to p_align, which is often just the
normal page size.  As a result, large read-only segments that would
otherwise be eligible may fail to get PMD-sized mappings.

Even when a PT_LOAD segment itself is not PMD-aligned, it may still
contain a PMD-aligned subrange.  In that case only that subrange can
be mapped with huge pages, while the unaligned head of the segment
remains mapped with normal pages.

In practice, many executables already have PMD-aligned file offsets
for their text segments, but the virtual address is not aligned due
to the small p_align value.  Aligning the segment to PMD_SIZE in such
cases increases the chance of getting PMD-sized THP mappings.

This matters especially for 2MB huge pages, where many programs have
text segments only slightly larger than a single huge page. If the
start address is not aligned, the leading unaligned region can prevent
the mapping from forming a huge page. For larger huge pages (e.g. 32MB),
the unaligned head region may be close to the huge page size itself,
making the potential performance impact even more significant.
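To make the head-room arithmetic concrete, here is an illustrative user-space sketch (plain Python, not part of the patch; a 2M PMD is assumed) counting how many whole huge pages fit in a mapping for a PMD-aligned versus a merely page-aligned start:

```python
# Illustrative only -- not kernel code.  Counts PMD-sized (2M assumed)
# huge pages fully contained in a mapping, to show how an unaligned
# start can eliminate the only candidate huge page.
PMD = 2 * 1024 * 1024

def whole_huge_pages(start, size):
    """Number of PMD-sized pages fully inside [start, start + size)."""
    first = -(-start // PMD) * PMD        # round start up to a PMD boundary
    last = (start + size) // PMD * PMD    # round end down to a PMD boundary
    return max(0, (last - first) // PMD)

# A text segment slightly larger than one huge page (2M + 64K):
size = PMD + 64 * 1024
print(whole_huge_pages(0x400000, size))   # PMD-aligned start: 1 huge page
print(whole_huge_pages(0x401000, size))   # 4K-aligned start:  0 huge pages
```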

A segment is considered eligible if:

* it is not writable,
* both p_vaddr and p_offset are PMD-aligned,
* its size is at least PMD_SIZE, and
* its existing p_align is smaller than PMD_SIZE.

To avoid excessive virtual address space padding on systems with very
large PMD_SIZE values, this is only applied when PMD_SIZE <= 32MB.

This mainly benefits large text segments of executables by reducing
iTLB pressure.

This only affects ELF executables loaded directly by the kernel ELF
binary loader.  Shared libraries loaded from user space (e.g. by the
dynamic linker) are not affected.
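For checking whether a given binary would benefit, the same eligibility test can be mirrored in user space. The sketch below (illustrative Python, not part of the patch; a 64-bit little-endian ELF and a 2M PMD_SIZE are assumed) parses the program headers directly and applies the conditions listed above:

```python
# Illustrative only -- a user-space mirror of the eligibility test,
# assuming a 64-bit little-endian ELF file and a 2M PMD_SIZE.
import struct

PMD_SIZE = 2 * 1024 * 1024   # assumed: x86_64
PT_LOAD, PF_W = 1, 2

def eligible_loads(path):
    """Yield (p_vaddr, p_offset, p_filesz, p_align, eligible) per PT_LOAD."""
    with open(path, "rb") as f:
        eh = f.read(64)
        assert eh[:4] == b"\x7fELF" and eh[4] == 2, "need ELF64"
        e_phoff, = struct.unpack_from("<Q", eh, 32)
        e_phentsize, e_phnum = struct.unpack_from("<HH", eh, 54)
        f.seek(e_phoff)
        for _ in range(e_phnum):
            phdr = f.read(e_phentsize)
            (p_type, p_flags, p_offset, p_vaddr, _paddr,
             p_filesz, _memsz, p_align) = struct.unpack_from("<IIQQQQQQ", phdr)
            if p_type != PT_LOAD:
                continue
            eligible = (not (p_flags & PF_W)
                        and (p_vaddr | p_offset) % PMD_SIZE == 0
                        and p_filesz >= PMD_SIZE
                        and p_align < PMD_SIZE)
            yield p_vaddr, p_offset, p_filesz, p_align, eligible
```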

Benchmark

Machine: AMD Ryzen 9 7950X (x86_64)
Binutils: 2.46
GCC: 15.2.1 (built with -z,noseparate-code + --enable-host-pie)

Workload: building Linux v7.0-rc1 vmlinux with x86_64_defconfig.

                Without patch        With patch
instructions    8,246,133,611,932    8,246,025,137,750
cpu-cycles      8,001,028,142,928    7,565,925,107,502
itlb-misses     3,672,158,331        26,821,242
time elapsed    64.66 s              61.97 s

Instructions are basically unchanged. iTLB misses drop from ~3.67B to
~26M (a ~99.27% reduction), which translates into a ~5.44% reduction in
cycles and ~4.16% shorter wall time for this workload.
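The quoted percentages can be reproduced from the table with a few lines of Python (an arithmetic check only):

```python
# Quick arithmetic check of the reductions derived from the perf table.
def reduction(before, after):
    """Percentage reduction from 'before' to 'after'."""
    return (before - after) / before * 100

print(round(reduction(3_672_158_331, 26_821_242), 2))             # itlb-misses: 99.27
print(round(reduction(8_001_028_142_928, 7_565_925_107_502), 2))  # cpu-cycles:  5.44
print(round(reduction(64.66, 61.97), 2))                          # wall time:   4.16
```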

Signed-off-by: WANG Rui <r@hev.cc>
---
Changes since [v4]:
* Drop runtime THP mode check, only gate on CONFIG_TRANSPARENT_HUGEPAGE.

Changes since [v3]:
* Fix compilation failure under !CONFIG_TRANSPARENT_HUGEPAGE.
* No functional changes otherwise.

Changes since [v2]:
* Rename align_to_pmd() to should_align_to_pmd().
* Add benchmark results to the commit message.

Changes since [v1]:
* Drop the Kconfig option CONFIG_ELF_RO_LOAD_THP_ALIGNMENT.
* Move the alignment logic into a helper align_to_pmd() for clarity.
* Improve the comment explaining why we skip the optimization
  when PMD_SIZE > 32MB.

[v4]: https://lore.kernel.org/linux-fsdevel/20260310031138.509730-1-r@hev.cc
[v3]: https://lore.kernel.org/linux-fsdevel/20260310013958.103636-1-r@hev.cc
[v2]: https://lore.kernel.org/linux-fsdevel/20260304114727.384416-1-r@hev.cc
[v1]: https://lore.kernel.org/linux-fsdevel/20260302155046.286650-1-r@hev.cc
---
 fs/binfmt_elf.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index fb857faaf0d6..d5f5154079de 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -489,6 +489,32 @@ static int elf_read(struct file *file, void *buf, size_t len, loff_t pos)
 	return 0;
 }
 
+static inline bool should_align_to_pmd(const struct elf_phdr *cmd)
+{
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return false;
+
+	/*
+	 * Avoid excessive virtual address space padding when PMD_SIZE is very
+	 * large, since this function increases PT_LOAD alignment.
+	 * This threshold roughly matches the largest commonly used hugepage
+	 * sizes on current architectures (e.g. x86 2M, arm64 32M with 16K pages).
+	 */
+	if (PMD_SIZE > SZ_32M)
+		return false;
+
+	if (!IS_ALIGNED(cmd->p_vaddr | cmd->p_offset, PMD_SIZE))
+		return false;
+
+	if (cmd->p_filesz < PMD_SIZE)
+		return false;
+
+	if (cmd->p_flags & PF_W)
+		return false;
+
+	return true;
+}
+
 static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
 {
 	unsigned long alignment = 0;
@@ -501,6 +527,10 @@ static unsigned long maximum_alignment(struct elf_phdr *cmds, int nr)
 			/* skip non-power of two alignments as invalid */
 			if (!is_power_of_2(p_align))
 				continue;
+
+			if (p_align < PMD_SIZE && should_align_to_pmd(&cmds[i]))
+				p_align = PMD_SIZE;
+
 			alignment = max(alignment, p_align);
 		}
 	}
-- 
2.53.0



* Re: [PATCH v5] binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for THP
  2026-03-13  0:52 [PATCH v5] binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for THP WANG Rui
@ 2026-03-15  3:46 ` Lance Yang
  2026-03-15  4:10   ` WANG Rui
  2026-03-15 13:12   ` Usama Arif
  0 siblings, 2 replies; 7+ messages in thread
From: Lance Yang @ 2026-03-15  3:46 UTC (permalink / raw)
  To: r
  Cc: viro, brauner, david, jack, kees, willy, linux-fsdevel, linux-mm,
	linux-kernel, baolin.wang, usama.arif, Lance Yang

Hi Rui,

CC Baolin and Usama

It would be better to keep the earlier CC list across revisions, especially
for people who have already reviewed or were explicitly included before.
Otherwise, they may simply never see subsequent revisions, and from
their side the discussion just stops there ...

Thanks,
Lance

On Fri, Mar 13, 2026 at 08:52:11AM +0800, WANG Rui wrote:
>[...]


* Re: [PATCH v5] binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for THP
  2026-03-15  3:46 ` Lance Yang
@ 2026-03-15  4:10   ` WANG Rui
  2026-03-15 13:12   ` Usama Arif
  1 sibling, 0 replies; 7+ messages in thread
From: WANG Rui @ 2026-03-15  4:10 UTC (permalink / raw)
  To: lance.yang
  Cc: baolin.wang, brauner, david, jack, kees, linux-fsdevel,
	linux-kernel, linux-mm, r, usama.arif, viro, willy

Hi Lance,

> It would be better to keep earlier CC list across revisions, especially
> for people who have already reviewed or were explicitly include before.
> Otherwise, they may simply never see subsequent revisions, and from
> their side the discussion just stops there ...

Thanks for pointing this out.

After addressing David's feedback on v4 [1], v5 no longer looked
mm-related to me, so I only sent it to linux-fsdevel.

But you're right, I should have kept the earlier CC list, especially
for people who had already reviewed it or were CC'ed explicitly in the
previous rounds.

[1] https://lore.kernel.org/linux-fsdevel/60ba4311-01f8-4ff3-a2df-e1b3fb6db699@kernel.org

Thanks,
Rui


* Re: [PATCH v5] binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for THP
  2026-03-15  3:46 ` Lance Yang
  2026-03-15  4:10   ` WANG Rui
@ 2026-03-15 13:12   ` Usama Arif
  2026-03-20 11:44     ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 7+ messages in thread
From: Usama Arif @ 2026-03-15 13:12 UTC (permalink / raw)
  To: Lance Yang, r, Ryan Roberts
  Cc: viro, brauner, david, jack, kees, willy, linux-fsdevel, linux-mm,
	linux-kernel, baolin.wang



On 15/03/2026 06:46, Lance Yang wrote:
> Hi Rui,
> 
> CC Baolin and Usama
> 
> It would be better to keep earlier CC list across revisions, especially
> for people who have already reviewed or were explicitly include before.
> Otherwise, they may simply never see subsequent revisions, and from
> their side the discussion just stops there ...
> 
> Thanks,
> Lance
> 
Thanks! Also adding Ryan, who did the exec_folio_order() work for ARM
and raised good concerns in [1].

The problem is not just ELF alignment; we also need to fix more things,
like the mmap heuristics [2] and how unmapped areas are selected [3].

[1] https://lore.kernel.org/all/cfdfca9c-4752-4037-a289-03e6e7a00d47@arm.com/
[2] https://lore.kernel.org/all/20260310145406.3073394-3-usama.arif@linux.dev/
[3] https://lore.kernel.org/all/20260310145406.3073394-5-usama.arif@linux.dev/



* Re: [PATCH v5] binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for THP
  2026-03-15 13:12   ` Usama Arif
@ 2026-03-20 11:44     ` David Hildenbrand (Arm)
  2026-03-20 17:11       ` WANG Rui
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-20 11:44 UTC (permalink / raw)
  To: Usama Arif, Lance Yang, r, Ryan Roberts
  Cc: viro, brauner, jack, kees, willy, linux-fsdevel, linux-mm,
	linux-kernel, baolin.wang

On 3/15/26 14:12, Usama Arif wrote:
> 
> 
> On 15/03/2026 06:46, Lance Yang wrote:
>> Hi Rui,
>>
>> CC Baolin and Usama
>>
>> It would be better to keep earlier CC list across revisions, especially
>> for people who have already reviewed or were explicitly include before.
>> Otherwise, they may simply never see subsequent revisions, and from
>> their side the discussion just stops there ...
>>
>> Thanks,
>> Lance
>>
> Thanks! Also adding Ryan who did the exec_folio_order() work for ARM,
> and also raised good concerns in [1]
> 
> The problem is not just alignment for elf, we need to fix more things like
> mmap heuristics [2] and how unmapped areas are gotten [3].

I agree, ideally, that would all be tackled in one go.

-- 
Cheers,

David


* Re: [PATCH v5] binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for THP
  2026-03-20 11:44     ` David Hildenbrand (Arm)
@ 2026-03-20 17:11       ` WANG Rui
  2026-03-21 14:21         ` WANG Rui
  0 siblings, 1 reply; 7+ messages in thread
From: WANG Rui @ 2026-03-20 17:11 UTC (permalink / raw)
  To: david, usama.arif
  Cc: baolin.wang, brauner, jack, kees, lance.yang, linux-fsdevel,
	linux-kernel, linux-mm, r, ryan.roberts, viro, willy,
	Liam.Howlett, ajd, akpm, apopple, baohua, catalin.marinas,
	dev.jain, kevin.brodsky, linux-arm-kernel, lorenzo.stoakes,
	mhocko, npache, pasha.tatashin, rmclure, rppt, surenb, vbabka

>> Thanks! Also adding Ryan who did the exec_folio_order() work for ARM,
>> and also raised good concerns in [1]
>>
>> The problem is not just alignment for elf, we need to fix more things like
>> mmap heuristics [2] and how unmapped areas are gotten [3].
>
> I agree, ideally, that would all be tackled in one go.

From Usama’s v2 [1], it looks like we may be operating under slightly
different assumptions. His approach seems to key off page cache
characteristics when deciding segment alignment, while my patch is more
about proactively making things THP-friendly so that more code can end
up backed by large mappings. That helps in cases where a segment size is
just over a large mapping boundary.

Maybe what we really need here is to make sure the virtual address is
properly aligned, while avoiding overly aggressive alignment (e.g. capping
it at something like 32M, which is fairly common across architectures).
Beyond that, we can just leave it to THP in “always” mode. THP already has
its own heuristics to decide whether collapsing into large pages makes sense.

It also looks like this approach would work fine with Usama’s cont-pte
mappings. If so, would it make sense to implement [1] along these lines
instead?

[1] https://lore.kernel.org/linux-fsdevel/20260320140315.979307-4-usama.arif@linux.dev

Thanks,
Rui


* Re: [PATCH v5] binfmt_elf: Align eligible read-only PT_LOAD segments to PMD_SIZE for THP
  2026-03-20 17:11       ` WANG Rui
@ 2026-03-21 14:21         ` WANG Rui
  0 siblings, 0 replies; 7+ messages in thread
From: WANG Rui @ 2026-03-21 14:21 UTC (permalink / raw)
  To: david, usama.arif, willy, baolin.wang
  Cc: r, Liam.Howlett, ajd, akpm, apopple, baohua, brauner,
	catalin.marinas, dev.jain, jack, kees, kevin.brodsky, lance.yang,
	linux-arm-kernel, linux-fsdevel, linux-kernel, linux-mm,
	lorenzo.stoakes, mhocko, npache, pasha.tatashin, rmclure, rppt,
	ryan.roberts, surenb, vbabka, viro

One clarification regarding my earlier comment about compatibility with
cont-pte.

What I meant there is that the alignment logic in my patch does work for
systems with 4K and 16K base pages, where the PMD size remains within a
practical range. In those configurations, providing PMD-level alignment
already creates the conditions needed for both THP collapse and cont-pte
coalescing.

This does not fully extend to 64K base page systems. There the PMD size
can be quite large (e.g. 512M), which exceeds the 32M cap used in my
patch, so PMD-sized alignment may not be achievable in practice.

One way to structure this could be to treat alignment in a layered manner.
My patch focuses on establishing a reliable PMD-level alignment baseline
so that THP has the opportunity to form large mappings where it is practical.
On top of that, Usama's work can further improve behavior at smaller
granularities, for example by enabling cont-pte mappings when PMD alignment
is not feasible.

 			/* skip non-power of two alignments as invalid */
 			if (!is_power_of_2(p_align))
 				continue;
 
 			if (p_align < PMD_SIZE && should_align_to_pmd(&cmds[i]))
 				p_align = PMD_SIZE;
+			else if (p_align < CONT_PTE_SIZE && should_align_to_cont_pte(&cmds[i]))
+				p_align = CONT_PTE_SIZE;
 
 			alignment = max(alignment, p_align);
 		}
 	}

With that separation of roles, the two approaches complement each other,
and we can get the benefit of both without changing the core alignment
policy in binfmt_elf.

Thanks,
Rui

