Re: [RFC PATCH 6/6] arm64: mm: support PMD page coalescing in the linear map

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: "Adrian Barnaś" <abarnas@google.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	David Hildenbrand <david@kernel.org>,
	"Mike Rapoport (Microsoft)" <rppt@kernel.org>,
	Ard Biesheuvel <ardb@kernel.org>,
	Christoph Lameter <cl@gentwo.org>,
	Yang Shi <yang@os.amperecomputing.com>,
	Brendan Jackman <jackmanb@google.com>
Subject: Re: [RFC PATCH 6/6] arm64: mm: support PMD page coalescing in the linear map
Date: Wed, 24 Jun 2026 14:32:39 +0000	[thread overview]
Message-ID: <ajvqh8Ac0Sj5UzP_@google.com> (raw)
In-Reply-To: <799181c3-a1a1-4de7-bc6a-576d3282efb0@arm.com>

On Fri, Jun 19, 2026 at 02:40:40PM +0100, Ryan Roberts wrote:
>On 11/06/2026 14:01, Adrian Barnaś wrote:
>> Implement PMD block coalescing to merge fragmented linear mapping regions
>> back into huge pages when restoring the read-only attribute.
>>
>> When memory allocated with VM_ALLOW_HUGE_VMAP (such as for the execmem
>> ROX cache) has its permissions modified, the PMD block mapping is split
>> into individual PTEs. Without this change, when that memory have its RO
>> attribute subsequently cleared and set the mapping remains permanently
>> fragmented into 4K pages.
>
>I'd like to make sure I understand this correctly: So the execmem ROX cache is
>allocated using vmalloc_huge such that all it's backing memory are PMD-sized
>pages. It is initially poisoned with a faulting instruction and marked ROX. When
>a module is loaded, it's text is copied into a portion of the ROX cache and
>executed from there? To do that, the required number of (4K) pages are allocated
>from ROX cache, and the permissions are changed on that sub-region to RW,
>meaning we have to split the mapping from a PMD to (cont)PTEs in both vmalloc
>space and in the linear map. After the text is copied, the sub-region is
>switched back to ROX. But without that change, the 2M region is now mapped by
>(cont)PTEs for all of time?

That is correct.

>
>>
>> Signed-off-by: Adrian Barnaś <abarnas@google.com>
>> ---
>>  arch/arm64/include/asm/mmu.h |  1 +
>>  arch/arm64/mm/mmu.c          | 95 ++++++++++++++++++++++++++++++++++++
>>  arch/arm64/mm/pageattr.c     |  7 +++
>>  3 files changed, 103 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
>> index 137a173df1ff..19158bacb2df 100644
>> --- a/arch/arm64/include/asm/mmu.h
>> +++ b/arch/arm64/include/asm/mmu.h
>> @@ -80,6 +80,7 @@ extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
>>  extern void mark_linear_text_alias_ro(void);
>>  extern int split_kernel_leaf_mapping(unsigned long start, unsigned long end);
>>  extern void linear_map_maybe_split_to_ptes(void);
>> +void try_collapse_kernel_pmd(unsigned long addr);
>>
>>  /*
>>   * This check is triggered during the early boot before the cpufeature
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index a6a00accf4f9..d74226fa1c9b 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -769,6 +769,101 @@ static inline bool force_pte_mapping(void)
>>
>>  static DEFINE_MUTEX(pgtable_split_lock);
>>
>> +static inline bool __pte_can_be_collapsed(pte_t pte, unsigned long pfn, pgprot_t prot)
>> +{
>> +	if (!pte_valid(pte))
>> +		return false;
>> +	if (pte_pfn(pte) != pfn)
>> +		return false;
>> +	if ((pgprot_val(pte_pgprot(pte)) & ~PTE_CONT) != pgprot_val(prot))
>
>Do we want to permit differing values for access and dirty? We do for user
>space, but I think for kernel, those bits are always set? In which case, what
>you're doing is correct.
>
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static void __try_collapse_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr)
>> +{
>> +	pte_t *ptep;
>> +	pte_t first_pte;
>> +	unsigned long pfn;
>> +	pgprot_t prot;
>> +	int i;
>> +
>> +	ptep = (pte_t *)pmd_page_vaddr(pmd);
>> +	first_pte = __ptep_get(ptep);
>> +
>> +	if (!pte_valid(first_pte))
>> +		return;
>> +
>> +	prot = pte_pgprot(first_pte);
>> +	prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
>> +	pfn = pte_pfn(first_pte);
>> +
>> +	if (!IS_ALIGNED(pfn, PMD_SIZE >> PAGE_SHIFT))
>> +		return;
>> +
>> +	for (i = 1; i < PTRS_PER_PTE; i++) {
>> +		if (!__pte_can_be_collapsed(__ptep_get(ptep + i), pfn + i, prot))
>> +			return;
>> +	}
>> +
>> +	set_pmd(pmdp, pmd_mkhuge(pfn_pmd(pfn, prot)));
>> +
>> +	__flush_tlb_kernel_pgtable(addr);
>> +
>
>nit: suggest adding the same comment that other sites has, since this is not
>obvious:
>
>/* See comment in pud_free_pmd_page for static key logic */
>
>> +	if (static_branch_unlikely(&arm64_ptdump_lock_key)) {
>> +		mmap_read_lock(&init_mm);
>> +		mmap_read_unlock(&init_mm);
>> +	}
>> +
>> +	pte_free_kernel(NULL, ptep);
>
>Ooh, fun. Per my comments below we definitely have opportunity for racy pte
>table accesses (beyond ptdump). I _think_ if you fix that then this will be safe.
>
>> +}
>> +
>> +void try_collapse_kernel_pmd(unsigned long addr)
>> +{
>> +	pgd_t *pgdp;
>> +	p4d_t *p4dp;
>> +	pud_t *pudp;
>> +	pmd_t *pmdp;
>> +	pmd_t pmd;
>> +
>> +	/*
>> +	 * collapse_pmd expects exact address of block to be collapsed
>> +	 */
>> +	if (WARN_ON(ALIGN_DOWN(addr, PMD_SIZE) != addr))
>> +		return;
>> +
>> +	mutex_lock(&pgtable_split_lock);
>
>I don't think this is safe in general. Let's say we have a 2M region split into
>512 x 4K PTEs. It's possible that the first 1M is one object and the second 1M
>is another object. Different CPUs could set_memory_*() on those 2 objects
>concurrently. If one of them then calls this function, we could end up
>collapsing the whole 2M while the other is trying to modify the PTEs and they
>will race.
>
>Note that splitting _is_ safe (and protected by this lock) because you'd have 2
>objects backed by the same PMD, so they would both have to split before
>modifying the PTEs.
>
>I think you'd need to ensure mutual exclusion at a higher level if doing this;
>probably execmem is the place that can ensure that no objects within a 2M region
>are racily trying to modify their permissions?
>
>> +
>> +	pgdp = pgd_offset_k(addr);
>> +	if (pgd_none(READ_ONCE(*pgdp)))
>
>nit: use the getters (e.g. pXdp_get()) instead of READ_ONCE().
>
>> +		goto out;
>> +
>> +	p4dp = p4d_offset(pgdp, addr);
>> +	if (p4d_none(READ_ONCE(*p4dp)))
>> +		goto out;
>> +
>> +	pudp = pud_offset(p4dp, addr);
>> +	if (pud_none(READ_ONCE(*pudp)))
>> +		goto out;
>> +
>> +	if (pud_leaf(READ_ONCE(*pudp)))
>> +		goto out;
>> +
>> +	pmdp = pmd_offset(pudp, addr);
>> +	pmd = pmdp_get(pmdp);
>> +
>> +	if (!pmd_table(pmd))
>> +		goto out;
>> +
>> +	lazy_mmu_mode_enable();
>> +	__try_collapse_pmd(pmdp, pmd, addr);
>> +	lazy_mmu_mode_disable();
>> +
>> +out:
>> +	mutex_unlock(&pgtable_split_lock);
>> +}
>> +
>>  int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>>  {
>>  	int ret;
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index eaefdf90b0d5..11e0b60264c3 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -200,6 +200,13 @@ static int change_memory_common(unsigned long addr, int numpages,
>>  			if (ret)
>>  				return ret;
>>  		}
>
>The loop here in the unchanged code, calls __change_memory_common() for each
>PAGE_SIZE page in the linear alias. Perhaps it should be area->page_order aware?
>That way, if we are changing the permissions of the whole 2M chunk of vmalloc
>space and it's backed by a 2M page, then we won't split the mapping to PTEs in
>the linear map. Similarly, if we are changing permissions on a sub-region of the
>2M area, we might be able to split only as far as contpte and still have some
>performance benefit?
>

This is out of scope of execmem (as you noted we are mostly changing permision
for only a part of the block) but your sugestion seems to be valid for other
cases.

>> +		/*
>> +		 * When setting a read-only flag on the linear region, the memory
>> +		 * may have been backed by a PMD before being split. Try to
>> +		 * collapse it back into a PMD to restore huge page performance.
>> +		 */
>> +		if (pgprot_val(set_mask) == PTE_RDONLY && area->flags & VM_ALLOW_HUGE_VMAP)
>> +			try_collapse_kernel_pmd((u64)page_address(area->pages[0]));
>
>try_collapse_kernel_pmd() requires a PMD-aligned address. VM_ALLOW_HUGE_VMAP
>doesn't guarrantee that - it only says that it's _allowed_ to be huge (and 64K
>would still be huge). vmalloc may have failed to allocate a PMD-sized page and
>reverted to something smaller (64K contpte or 4K). You need to look at
>area->page_order, I think.
>
Agree.

>You're only calling this for the linear alias. Don't you want to call it for the
>vmalloc range too? I assume the module text executes using the vmalloc address
>so you'd want it mapped by a single PMD for best performance?
>

I think it is a good idea to have them both (linear alias and vmalloc area)
mapped as PMD. 

>But in general, I don't really like the idea of special-casing a collapse to pmd
>here for just execmem. I think we should either investigate a general purpose
>collapse mechanism or execmem should have extra hooks added to request specific
>collapse.>

>Thanks,
>Ryan
>
>
>>  	}
>>
>>  	/*
>


I will rethink this based on your suggestion and try to properly fix the
concurrency issues.

I am also wondering if we should avoid splitting linear aliases altogether. 
I think I understand why it was done originally (to restrict writing through
linear mapping aliases for read-only vmalloc areas). However, it is not 
necessary to make them writable when the vmalloc area is writable. Maybe 
we should create a special case for execmem to prevent this. WDYT?

Thank you for the review,
Adrian

     prev parent reply	other threads:[~2026-06-24 14:32 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-11 13:01 [RFC PATCH 0/6] arm64: mm: Introducing ROX CACHE to ARM64 systems with bbml2 no abort Adrian Barnaś
2026-06-11 13:01 ` [RFC PATCH 1/6] arm64: mm: explicitly declare module and ftrace execmem regions Adrian Barnaś
2026-06-11 13:36   ` Brendan Jackman
2026-06-11 13:01 ` [RFC PATCH 2/6] arm64: mm: allow huge vmap permission adjustments with bbml2_no_abort Adrian Barnaś
2026-06-18 14:21   ` Ryan Roberts
2026-06-24 13:54     ` Adrian Barnaś
2026-06-11 13:01 ` [RFC PATCH 3/6] arm64: mm: fix restoring linear map permissions on execmem cache clean Adrian Barnaś
2026-06-11 13:54   ` Brendan Jackman
2026-06-12  7:17     ` Mike Rapoport
2026-06-17 15:18       ` Adrian Barnaś
2026-06-17 18:40         ` Mike Rapoport
2026-06-24 13:57           ` Adrian Barnaś
2026-06-18 15:05   ` Ryan Roberts
2026-06-19  8:33     ` Ryan Roberts
2026-06-24 13:52       ` Adrian Barnaś
2026-06-11 13:01 ` [RFC PATCH 4/6] arm64: mm: add helper to fill execmem with trapping instructions Adrian Barnaś
2026-06-19 10:54   ` Ryan Roberts
2026-06-19 10:58     ` Mike Rapoport
2026-06-11 13:01 ` [RFC PATCH 5/6] arm64: execmem: enable EXECMEM_ROX_CACHE on supported CPUs Adrian Barnaś
2026-06-19 12:09   ` Ryan Roberts
2026-06-11 13:01 ` [RFC PATCH 6/6] arm64: mm: support PMD page coalescing in the linear map Adrian Barnaś
2026-06-19 13:40   ` Ryan Roberts
2026-06-24 14:32     ` Adrian Barnaś [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ajvqh8Ac0Sj5UzP_@google.com \
    --to=abarnas@google.com \
    --cc=ardb@kernel.org \
    --cc=catalin.marinas@arm.com \
    --cc=cl@gentwo.org \
    --cc=david@kernel.org \
    --cc=jackmanb@google.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-mm@kvack.org \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=will@kernel.org \
    --cc=yang@os.amperecomputing.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox