From: "Adrian Barnaś" <abarnas@google.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
David Hildenbrand <david@kernel.org>,
"Mike Rapoport (Microsoft)" <rppt@kernel.org>,
Ard Biesheuvel <ardb@kernel.org>,
Christoph Lameter <cl@gentwo.org>,
Yang Shi <yang@os.amperecomputing.com>,
Brendan Jackman <jackmanb@google.com>
Subject: Re: [RFC PATCH 6/6] arm64: mm: support PMD page coalescing in the linear map
Date: Wed, 24 Jun 2026 14:32:39 +0000 [thread overview]
Message-ID: <ajvqh8Ac0Sj5UzP_@google.com> (raw)
In-Reply-To: <799181c3-a1a1-4de7-bc6a-576d3282efb0@arm.com>
On Fri, Jun 19, 2026 at 02:40:40PM +0100, Ryan Roberts wrote:
>On 11/06/2026 14:01, Adrian Barnaś wrote:
>> Implement PMD block coalescing to merge fragmented linear mapping regions
>> back into huge pages when restoring the read-only attribute.
>>
>> When memory allocated with VM_ALLOW_HUGE_VMAP (such as for the execmem
>> ROX cache) has its permissions modified, the PMD block mapping is split
>> into individual PTEs. Without this change, when that memory have its RO
>> attribute subsequently cleared and set the mapping remains permanently
>> fragmented into 4K pages.
>
>I'd like to make sure I understand this correctly: So the execmem ROX cache is
>allocated using vmalloc_huge such that all it's backing memory are PMD-sized
>pages. It is initially poisoned with a faulting instruction and marked ROX. When
>a module is loaded, it's text is copied into a portion of the ROX cache and
>executed from there? To do that, the required number of (4K) pages are allocated
>from ROX cache, and the permissions are changed on that sub-region to RW,
>meaning we have to split the mapping from a PMD to (cont)PTEs in both vmalloc
>space and in the linear map. After the text is copied, the sub-region is
>switched back to ROX. But without that change, the 2M region is now mapped by
>(cont)PTEs for all of time?
That is correct.
>
>>
>> Signed-off-by: Adrian Barnaś <abarnas@google.com>
>> ---
>> arch/arm64/include/asm/mmu.h | 1 +
>> arch/arm64/mm/mmu.c | 95 ++++++++++++++++++++++++++++++++++++
>> arch/arm64/mm/pageattr.c | 7 +++
>> 3 files changed, 103 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
>> index 137a173df1ff..19158bacb2df 100644
>> --- a/arch/arm64/include/asm/mmu.h
>> +++ b/arch/arm64/include/asm/mmu.h
>> @@ -80,6 +80,7 @@ extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
>> extern void mark_linear_text_alias_ro(void);
>> extern int split_kernel_leaf_mapping(unsigned long start, unsigned long end);
>> extern void linear_map_maybe_split_to_ptes(void);
>> +void try_collapse_kernel_pmd(unsigned long addr);
>>
>> /*
>> * This check is triggered during the early boot before the cpufeature
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index a6a00accf4f9..d74226fa1c9b 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -769,6 +769,101 @@ static inline bool force_pte_mapping(void)
>>
>> static DEFINE_MUTEX(pgtable_split_lock);
>>
>> +static inline bool __pte_can_be_collapsed(pte_t pte, unsigned long pfn, pgprot_t prot)
>> +{
>> + if (!pte_valid(pte))
>> + return false;
>> + if (pte_pfn(pte) != pfn)
>> + return false;
>> + if ((pgprot_val(pte_pgprot(pte)) & ~PTE_CONT) != pgprot_val(prot))
>
>Do we want to permit differing values for access and dirty? We do for user
>space, but I think for kernel, those bits are always set? In which case, what
>you're doing is correct.
>
>> + return false;
>> +
>> + return true;
>> +}
>> +
>> +static void __try_collapse_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr)
>> +{
>> + pte_t *ptep;
>> + pte_t first_pte;
>> + unsigned long pfn;
>> + pgprot_t prot;
>> + int i;
>> +
>> + ptep = (pte_t *)pmd_page_vaddr(pmd);
>> + first_pte = __ptep_get(ptep);
>> +
>> + if (!pte_valid(first_pte))
>> + return;
>> +
>> + prot = pte_pgprot(first_pte);
>> + prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
>> + pfn = pte_pfn(first_pte);
>> +
>> + if (!IS_ALIGNED(pfn, PMD_SIZE >> PAGE_SHIFT))
>> + return;
>> +
>> + for (i = 1; i < PTRS_PER_PTE; i++) {
>> + if (!__pte_can_be_collapsed(__ptep_get(ptep + i), pfn + i, prot))
>> + return;
>> + }
>> +
>> + set_pmd(pmdp, pmd_mkhuge(pfn_pmd(pfn, prot)));
>> +
>> + __flush_tlb_kernel_pgtable(addr);
>> +
>
>nit: suggest adding the same comment that other sites has, since this is not
>obvious:
>
>/* See comment in pud_free_pmd_page for static key logic */
>
>> + if (static_branch_unlikely(&arm64_ptdump_lock_key)) {
>> + mmap_read_lock(&init_mm);
>> + mmap_read_unlock(&init_mm);
>> + }
>> +
>> + pte_free_kernel(NULL, ptep);
>
>Ooh, fun. Per my comments below we definitely have opportunity for racy pte
>table accesses (beyond ptdump). I _think_ if you fix that then this will be safe.
>
>> +}
>> +
>> +void try_collapse_kernel_pmd(unsigned long addr)
>> +{
>> + pgd_t *pgdp;
>> + p4d_t *p4dp;
>> + pud_t *pudp;
>> + pmd_t *pmdp;
>> + pmd_t pmd;
>> +
>> + /*
>> + * collapse_pmd expects exact address of block to be collapsed
>> + */
>> + if (WARN_ON(ALIGN_DOWN(addr, PMD_SIZE) != addr))
>> + return;
>> +
>> + mutex_lock(&pgtable_split_lock);
>
>I don't think this is safe in general. Let's say we have a 2M region split into
>512 x 4K PTEs. It's possible that the first 1M is one object and the second 1M
>is another object. Different CPUs could set_memory_*() on those 2 objects
>concurrently. If one of them then calls this function, we could end up
>collapsing the whole 2M while the other is trying to modify the PTEs and they
>will race.
>
>Note that splitting _is_ safe (and protected by this lock) because you'd have 2
>objects backed by the same PMD, so they would both have to split before
>modifying the PTEs.
>
>I think you'd need to ensure mutual exclusion at a higher level if doing this;
>probably execmem is the place that can ensure that no objects within a 2M region
>are racily trying to modify their permissions?
>
>> +
>> + pgdp = pgd_offset_k(addr);
>> + if (pgd_none(READ_ONCE(*pgdp)))
>
>nit: use the getters (e.g. pXdp_get()) instead of READ_ONCE().
>
>> + goto out;
>> +
>> + p4dp = p4d_offset(pgdp, addr);
>> + if (p4d_none(READ_ONCE(*p4dp)))
>> + goto out;
>> +
>> + pudp = pud_offset(p4dp, addr);
>> + if (pud_none(READ_ONCE(*pudp)))
>> + goto out;
>> +
>> + if (pud_leaf(READ_ONCE(*pudp)))
>> + goto out;
>> +
>> + pmdp = pmd_offset(pudp, addr);
>> + pmd = pmdp_get(pmdp);
>> +
>> + if (!pmd_table(pmd))
>> + goto out;
>> +
>> + lazy_mmu_mode_enable();
>> + __try_collapse_pmd(pmdp, pmd, addr);
>> + lazy_mmu_mode_disable();
>> +
>> +out:
>> + mutex_unlock(&pgtable_split_lock);
>> +}
>> +
>> int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>> {
>> int ret;
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index eaefdf90b0d5..11e0b60264c3 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -200,6 +200,13 @@ static int change_memory_common(unsigned long addr, int numpages,
>> if (ret)
>> return ret;
>> }
>
>The loop here in the unchanged code, calls __change_memory_common() for each
>PAGE_SIZE page in the linear alias. Perhaps it should be area->page_order aware?
>That way, if we are changing the permissions of the whole 2M chunk of vmalloc
>space and it's backed by a 2M page, then we won't split the mapping to PTEs in
>the linear map. Similarly, if we are changing permissions on a sub-region of the
>2M area, we might be able to split only as far as contpte and still have some
>performance benefit?
>
This is out of scope of execmem (as you noted we are mostly changing permision
for only a part of the block) but your sugestion seems to be valid for other
cases.
>> + /*
>> + * When setting a read-only flag on the linear region, the memory
>> + * may have been backed by a PMD before being split. Try to
>> + * collapse it back into a PMD to restore huge page performance.
>> + */
>> + if (pgprot_val(set_mask) == PTE_RDONLY && area->flags & VM_ALLOW_HUGE_VMAP)
>> + try_collapse_kernel_pmd((u64)page_address(area->pages[0]));
>
>try_collapse_kernel_pmd() requires a PMD-aligned address. VM_ALLOW_HUGE_VMAP
>doesn't guarrantee that - it only says that it's _allowed_ to be huge (and 64K
>would still be huge). vmalloc may have failed to allocate a PMD-sized page and
>reverted to something smaller (64K contpte or 4K). You need to look at
>area->page_order, I think.
>
Agree.
>You're only calling this for the linear alias. Don't you want to call it for the
>vmalloc range too? I assume the module text executes using the vmalloc address
>so you'd want it mapped by a single PMD for best performance?
>
I think it is a good idea to have them both (linear alias and vmalloc area)
mapped as PMD.
>But in general, I don't really like the idea of special-casing a collapse to pmd
>here for just execmem. I think we should either investigate a general purpose
>collapse mechanism or execmem should have extra hooks added to request specific
>collapse.>
>Thanks,
>Ryan
>
>
>> }
>>
>> /*
>
I will rethink this based on your suggestion and try to properly fix the
concurrency issues.
I am also wondering if we should avoid splitting linear aliases altogether.
I think I understand why it was done originally (to restrict writing through
linear mapping aliases for read-only vmalloc areas). However, it is not
necessary to make them writable when the vmalloc area is writable. Maybe
we should create a special case for execmem to prevent this. WDYT?
Thank you for the review,
Adrian
prev parent reply other threads:[~2026-06-24 14:32 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-11 13:01 [RFC PATCH 0/6] arm64: mm: Introducing ROX CACHE to ARM64 systems with bbml2 no abort Adrian Barnaś
2026-06-11 13:01 ` [RFC PATCH 1/6] arm64: mm: explicitly declare module and ftrace execmem regions Adrian Barnaś
2026-06-11 13:36 ` Brendan Jackman
2026-06-11 13:01 ` [RFC PATCH 2/6] arm64: mm: allow huge vmap permission adjustments with bbml2_no_abort Adrian Barnaś
2026-06-18 14:21 ` Ryan Roberts
2026-06-24 13:54 ` Adrian Barnaś
2026-06-11 13:01 ` [RFC PATCH 3/6] arm64: mm: fix restoring linear map permissions on execmem cache clean Adrian Barnaś
2026-06-11 13:54 ` Brendan Jackman
2026-06-12 7:17 ` Mike Rapoport
2026-06-17 15:18 ` Adrian Barnaś
2026-06-17 18:40 ` Mike Rapoport
2026-06-24 13:57 ` Adrian Barnaś
2026-06-18 15:05 ` Ryan Roberts
2026-06-19 8:33 ` Ryan Roberts
2026-06-24 13:52 ` Adrian Barnaś
2026-06-11 13:01 ` [RFC PATCH 4/6] arm64: mm: add helper to fill execmem with trapping instructions Adrian Barnaś
2026-06-19 10:54 ` Ryan Roberts
2026-06-19 10:58 ` Mike Rapoport
2026-06-11 13:01 ` [RFC PATCH 5/6] arm64: execmem: enable EXECMEM_ROX_CACHE on supported CPUs Adrian Barnaś
2026-06-19 12:09 ` Ryan Roberts
2026-06-11 13:01 ` [RFC PATCH 6/6] arm64: mm: support PMD page coalescing in the linear map Adrian Barnaś
2026-06-19 13:40 ` Ryan Roberts
2026-06-24 14:32 ` Adrian Barnaś [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ajvqh8Ac0Sj5UzP_@google.com \
--to=abarnas@google.com \
--cc=ardb@kernel.org \
--cc=catalin.marinas@arm.com \
--cc=cl@gentwo.org \
--cc=david@kernel.org \
--cc=jackmanb@google.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-mm@kvack.org \
--cc=rppt@kernel.org \
--cc=ryan.roberts@arm.com \
--cc=will@kernel.org \
--cc=yang@os.amperecomputing.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox