Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages

From: Usama Arif <usama.arif@linux.dev>
To: "David Hildenbrand (Arm)" <david@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	ryan.roberts@arm.com
Cc: ajd@linux.ibm.com, anshuman.khandual@arm.com, apopple@nvidia.com,
	baohua@kernel.org, baolin.wang@linux.alibaba.com,
	brauner@kernel.org, catalin.marinas@arm.com, dev.jain@arm.com,
	jack@suse.cz, kees@kernel.org, kevin.brodsky@arm.com,
	lance.yang@linux.dev, Liam.Howlett@oracle.com,
	linux-arm-kernel@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, lorenzo.stoakes@oracle.com,
	npache@redhat.com, rmclure@linux.ibm.com,
	Al Viro <viro@zeniv.linux.org.uk>,
	will@kernel.org, willy@infradead.org, ziy@nvidia.com,
	hannes@cmpxchg.org, kas@kernel.org, shakeel.butt@linux.dev,
	kernel-team@meta.com, WANG Rui <r@hev.cc>
Subject: Re: [PATCH 0/4] arm64/mm: contpte-sized exec folios for 16K and 64K pages
Date: Wed, 18 Mar 2026 13:41:33 +0300
Message-ID: <562fb349-c2d2-4fbe-83f9-75c26cc4b7ae@linux.dev>
In-Reply-To: <be65e710-997c-413c-8455-2d687fc51fc6@kernel.org>



On 16/03/2026 19:06, David Hildenbrand (Arm) wrote:
> On 3/13/26 20:59, Usama Arif wrote:
>>
>>
>> On 13/03/2026 16:20, David Hildenbrand (Arm) wrote:
>>> On 3/10/26 15:51, Usama Arif wrote:
>>>> On arm64, the contpte hardware feature coalesces multiple contiguous PTEs
>>>> into a single iTLB entry, reducing iTLB pressure for large executable
>>>> mappings.
>>>>
>>>> exec_folio_order() was introduced [1] to request readahead at an
>>>> arch-preferred folio order for executable memory, enabling contpte
>>>> mapping on the fault path.
>>>>
>>>> However, several things prevent this from working optimally on 16K and
>>>> 64K page configurations:
>>>>
>>>> 1. exec_folio_order() returns ilog2(SZ_64K >> PAGE_SHIFT), which only
>>>>    produces the optimal contpte order for 4K pages. For 16K pages it
>>>>    returns order 2 (64K) instead of order 7 (2M), and for 64K pages it
>>>>    returns order 0 (64K) instead of order 5 (2M). Patch 1 fixes this by
>>>>    using ilog2(CONT_PTES) which evaluates to the optimal order for all
>>>>    page sizes.
>>>>
>>>> 2. Even with the optimal order, the mmap_miss heuristic in
>>>>    do_sync_mmap_readahead() silently disables exec readahead after 100
>>>>    page faults. The mmap_miss counter tracks whether readahead is useful
>>>>    for mmap'd file access:
>>>>
>>>>    - Incremented by 1 in do_sync_mmap_readahead() on every page cache
>>>>      miss (page needed IO).
>>>>
>>>>    - Decremented by N in filemap_map_pages() for N pages successfully
>>>>      mapped via fault-around (pages found in cache without faulting,
>>>>      evidence that readahead was useful). Only non-workingset pages
>>>>      count and recently evicted and re-read pages don't count as hits.
>>>>
>>>>    - Decremented by 1 in do_async_mmap_readahead() when a PG_readahead
>>>>      marker page is found (indicates sequential consumption of readahead
>>>>      pages).
>>>>
>>>>    When mmap_miss exceeds MMAP_LOTSAMISS (100), all readahead is
>>>>    disabled. On 64K pages, both decrement paths are inactive:
>>>>
>>>>    - filemap_map_pages() is never called because fault_around_pages
>>>>      (65536 >> PAGE_SHIFT = 1) disables should_fault_around(), which
>>>>      requires fault_around_pages > 1. With only 1 page in the
>>>>      fault-around window, there is nothing "around" to map.
>>>>
>>>>    - do_async_mmap_readahead() never fires for exec mappings because
>>>>      exec readahead sets async_size = 0, so no PG_readahead markers
>>>>      are placed.
>>>>
>>>>    With no decrements, mmap_miss monotonically increases past
>>>>    MMAP_LOTSAMISS after 100 faults, disabling exec readahead
>>>>    for the remainder of the mapping.
>>>>    Patch 2 fixes this by moving the VM_EXEC readahead block
>>>>    above the mmap_miss check, since exec readahead is targeted (one
>>>>    folio at the fault location, async_size=0) not speculative prefetch.
>>>>
>>>> 3. Even with correct folio order and readahead, contpte mapping requires
>>>>    the virtual address to be aligned to CONT_PTE_SIZE (2M on 64K pages).
>>>>    The readahead path aligns file offsets and the buddy allocator aligns
>>>>    physical memory, but the virtual address depends on the VMA start.
>>>>    For PIE binaries, ASLR randomizes the load address at PAGE_SIZE (64K)
>>>>    granularity, giving only a 1/32 chance of 2M alignment. When
>>>>    misaligned, contpte_set_ptes() never sets the contiguous PTE bit for
>>>>    any folio in the VMA, resulting in zero iTLB coalescing benefit.
>>>>
>>>>    Patch 3 fixes this for the main binary by bumping the ELF loader's
>>>>    alignment to PAGE_SIZE << exec_folio_order() for ET_DYN binaries.
>>>>
>>>>    Patch 4 fixes this for shared libraries by adding a contpte-size
>>>>    alignment fallback in thp_get_unmapped_area_vmflags(). The existing
>>>>    PMD_SIZE alignment (512M on 64K pages) is too large for typical shared
>>>>    libraries, so this smaller fallback (2M) succeeds where PMD fails.
>>>>
>>>> I created a benchmark that mmaps a large executable file and calls
>>>> RET-stub functions at PAGE_SIZE offsets across it. "Cold" measures
>>>> fault + readahead cost. "Random" first faults in all pages with a
>>>> sequential sweep (not measured), then measures time for calling random
>>>> offsets, isolating iTLB miss cost for scattered execution.
>>>>
>>>> The benchmark results on Neoverse V2 (Grace), arm64 with 64K base pages,
>>>> 512MB executable file on ext4, averaged over 3 runs:
>>>>
>>>>   Phase      | Baseline     | Patched      | Improvement
>>>>   -----------|--------------|--------------|------------------
>>>>   Cold fault | 83.4 ms      | 41.3 ms      | 50% faster
>>>>   Random     | 76.0 ms      | 58.3 ms      | 23% faster
>>>
>>> I'm curious: is a single order really what we want?
>>>
>>> I'd instead assume that we might want to make decisions based on the
>>> mapping size.
>>>
>>> Assume you have a 128M mapping, wouldn't we want to use a different
>>> alignment than, say, for a 1M mapping, a 128K mapping or a 8k mapping?
>>>
>>
>> So I see 2 benefits from this. Page fault and iTLB coverage. IMHO page
>> faults are not that big of a deal? If the text section is hot, it wont
>> get flushed after faulting in. So the real benefit comes from improved
>> iTLB coverage.
>>
>> For a 128M mapping, 2M alignment gives 64 contpte entries. Aligning
>> to something larger (say 128M) wouldn't give any additional TLB
>> coalescing, each 2M-aligned region independently qualifies for contpte.
>>
>> Mappings smaller than 2M can't benefit from contpte regardless of
>> alignment, so falling back to PAGE_SIZE would be the optimal behaviour.
>> Adding intermediate sizes (e.g. 512K, 128K) wouldn't map to any
>> hardware boundary and adds complexity without TLB benefit?
> 
> I might be wrong, but I think you are mixing two things here:
> 
> (1) "Minimum" folio size (exec_folio_order())
> 
> (2) VMA alignment.
> 
> 
> (2) should certainly be as large as (1), but assume we can get a 2M
> folio on arm64 4k, why shouldn't we align it to 2M if the region is
> reasonably sized, and use a PMD?
> 
> 

So this series is tackling both (1) and (2). When I started making changes
to the code, what I wanted was 2M folios at fault with 64K base page size
to reduce iTLB misses. This is what patch 1 (and 2) will achieve.
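
For reference, the order arithmetic behind patch 1 can be sketched as
below. This is only an illustrative Python model, not kernel code: the
function names are made up for the example, and the CONT_PTES table is
an assumption derived from the orders quoted in the cover letter
(order 4, 7 and 5 for 4K, 16K and 64K base pages).

```python
SZ_64K = 64 * 1024

# Assumed CONT_PTES per base page size, derived from the orders quoted
# in the cover letter (4K -> order 4, 16K -> order 7, 64K -> order 5).
CONT_PTES = {4096: 16, 16384: 128, 65536: 32}


def ilog2(n):
    """Integer log2 for a positive integer, like the kernel's ilog2()."""
    return n.bit_length() - 1


def old_exec_folio_order(page_size):
    """Model of the current formula: ilog2(SZ_64K >> PAGE_SHIFT)."""
    page_shift = ilog2(page_size)
    return ilog2(SZ_64K >> page_shift)


def new_exec_folio_order(page_size):
    """Model of the formula proposed in patch 1: ilog2(CONT_PTES)."""
    return ilog2(CONT_PTES[page_size])


def folio_bytes(page_size, order):
    """Size in bytes of a folio of the given order."""
    return page_size << order


for page_size in (4096, 16384, 65536):
    old = old_exec_folio_order(page_size)
    new = new_exec_folio_order(page_size)
    print(f"{page_size // 1024}K pages: "
          f"old order {old} ({folio_bytes(page_size, old) // 1024}K), "
          f"new order {new} ({folio_bytes(page_size, new) // 1024}K)")
```

With 64K pages the old formula gives order 0 (a single 64K page) while
the new one gives order 5 (2M), which is the folio size I was after.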

Yes, completely agree, (2) should be at least as large as (1). I hadn't
considered PMD size on 4K, which you pointed out. do_sync_mmap_readahead()
can already provide that via force_thp_readahead, so this should be supported.

But we shouldn't align to PMD size for all base page sizes. As Rui pointed
out, increasing the alignment reduces ASLR entropy [1]. Should we cap the
alignment at 2M?
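
To put rough numbers on the entropy trade-off, here is a small Python
sketch. It is a back-of-the-envelope model only: pmd_size() assumes
8-byte page table descriptors, capped_align() is a hypothetical helper
for the "cap at 2M" idea, and the bit count is simply
log2(alignment / PAGE_SIZE), assuming the load address is otherwise
randomized at PAGE_SIZE granularity.

```python
SZ_2M = 2 * 1024 * 1024


def pmd_size(page_size):
    """PMD coverage, assuming 8-byte descriptors (page_size // 8 entries per table)."""
    return page_size * (page_size // 8)


def aslr_bits_lost(page_size, align):
    """Random bits removed when alignment grows from page_size to align."""
    ratio = max(align, page_size) // page_size
    return ratio.bit_length() - 1


def capped_align(page_size, cap=SZ_2M):
    """Hypothetical policy: align to PMD size, but never beyond the cap."""
    return min(pmd_size(page_size), cap)


for page_size in (4096, 16384, 65536):
    pmd = pmd_size(page_size)
    cap = capped_align(page_size)
    print(f"{page_size // 1024}K pages: "
          f"PMD {pmd // SZ_2M * 2}M costs {aslr_bits_lost(page_size, pmd)} bits, "
          f"capped {cap // SZ_2M * 2}M costs {aslr_bits_lost(page_size, cap)} bits")
```

Under these assumptions, 2M alignment on 64K pages costs 5 bits (the
1/32 chance mentioned in the cover letter), while PMD alignment (512M)
would cost 13 bits. On 4K pages the cap and the PMD size coincide at 2M.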

[1] https://lore.kernel.org/all/20260313144213.95686-1-r@hev.cc/