All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Adrian Barnaś" <abarnas@google.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	David Hildenbrand <david@kernel.org>,
	"Mike Rapoport (Microsoft)" <rppt@kernel.org>,
	Ard Biesheuvel <ardb@kernel.org>,
	Christoph Lameter <cl@gentwo.org>,
	Yang Shi <yang@os.amperecomputing.com>,
	Brendan Jackman <jackmanb@google.com>
Subject: Re: [RFC PATCH 6/6] arm64: mm: support PMD page coalescing in the linear map
Date: Wed, 24 Jun 2026 14:32:39 +0000	[thread overview]
Message-ID: <ajvqh8Ac0Sj5UzP_@google.com> (raw)
In-Reply-To: <799181c3-a1a1-4de7-bc6a-576d3282efb0@arm.com>

On Fri, Jun 19, 2026 at 02:40:40PM +0100, Ryan Roberts wrote:
>On 11/06/2026 14:01, Adrian Barnaś wrote:
>> Implement PMD block coalescing to merge fragmented linear mapping regions
>> back into huge pages when restoring the read-only attribute.
>>
>> When memory allocated with VM_ALLOW_HUGE_VMAP (such as for the execmem
>> ROX cache) has its permissions modified, the PMD block mapping is split
>> into individual PTEs. Without this change, when that memory have its RO
>> attribute subsequently cleared and set the mapping remains permanently
>> fragmented into 4K pages.
>
>I'd like to make sure I understand this correctly: So the execmem ROX cache is
>allocated using vmalloc_huge such that all it's backing memory are PMD-sized
>pages. It is initially poisoned with a faulting instruction and marked ROX. When
>a module is loaded, it's text is copied into a portion of the ROX cache and
>executed from there? To do that, the required number of (4K) pages are allocated
>from ROX cache, and the permissions are changed on that sub-region to RW,
>meaning we have to split the mapping from a PMD to (cont)PTEs in both vmalloc
>space and in the linear map. After the text is copied, the sub-region is
>switched back to ROX. But without that change, the 2M region is now mapped by
>(cont)PTEs for all of time?

That is correct.

>
>>
>> Signed-off-by: Adrian Barnaś <abarnas@google.com>
>> ---
>>  arch/arm64/include/asm/mmu.h |  1 +
>>  arch/arm64/mm/mmu.c          | 95 ++++++++++++++++++++++++++++++++++++
>>  arch/arm64/mm/pageattr.c     |  7 +++
>>  3 files changed, 103 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
>> index 137a173df1ff..19158bacb2df 100644
>> --- a/arch/arm64/include/asm/mmu.h
>> +++ b/arch/arm64/include/asm/mmu.h
>> @@ -80,6 +80,7 @@ extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
>>  extern void mark_linear_text_alias_ro(void);
>>  extern int split_kernel_leaf_mapping(unsigned long start, unsigned long end);
>>  extern void linear_map_maybe_split_to_ptes(void);
>> +void try_collapse_kernel_pmd(unsigned long addr);
>>
>>  /*
>>   * This check is triggered during the early boot before the cpufeature
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index a6a00accf4f9..d74226fa1c9b 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -769,6 +769,101 @@ static inline bool force_pte_mapping(void)
>>
>>  static DEFINE_MUTEX(pgtable_split_lock);
>>
>> +static inline bool __pte_can_be_collapsed(pte_t pte, unsigned long pfn, pgprot_t prot)
>> +{
>> +	if (!pte_valid(pte))
>> +		return false;
>> +	if (pte_pfn(pte) != pfn)
>> +		return false;
>> +	if ((pgprot_val(pte_pgprot(pte)) & ~PTE_CONT) != pgprot_val(prot))
>
>Do we want to permit differing values for access and dirty? We do for user
>space, but I think for kernel, those bits are always set? In which case, what
>you're doing is correct.
>
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static void __try_collapse_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr)
>> +{
>> +	pte_t *ptep;
>> +	pte_t first_pte;
>> +	unsigned long pfn;
>> +	pgprot_t prot;
>> +	int i;
>> +
>> +	ptep = (pte_t *)pmd_page_vaddr(pmd);
>> +	first_pte = __ptep_get(ptep);
>> +
>> +	if (!pte_valid(first_pte))
>> +		return;
>> +
>> +	prot = pte_pgprot(first_pte);
>> +	prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
>> +	pfn = pte_pfn(first_pte);
>> +
>> +	if (!IS_ALIGNED(pfn, PMD_SIZE >> PAGE_SHIFT))
>> +		return;
>> +
>> +	for (i = 1; i < PTRS_PER_PTE; i++) {
>> +		if (!__pte_can_be_collapsed(__ptep_get(ptep + i), pfn + i, prot))
>> +			return;
>> +	}
>> +
>> +	set_pmd(pmdp, pmd_mkhuge(pfn_pmd(pfn, prot)));
>> +
>> +	__flush_tlb_kernel_pgtable(addr);
>> +
>
>nit: suggest adding the same comment that other sites has, since this is not
>obvious:
>
>/* See comment in pud_free_pmd_page for static key logic */
>
>> +	if (static_branch_unlikely(&arm64_ptdump_lock_key)) {
>> +		mmap_read_lock(&init_mm);
>> +		mmap_read_unlock(&init_mm);
>> +	}
>> +
>> +	pte_free_kernel(NULL, ptep);
>
>Ooh, fun. Per my comments below we definitely have opportunity for racy pte
>table accesses (beyond ptdump). I _think_ if you fix that then this will be safe.
>
>> +}
>> +
>> +void try_collapse_kernel_pmd(unsigned long addr)
>> +{
>> +	pgd_t *pgdp;
>> +	p4d_t *p4dp;
>> +	pud_t *pudp;
>> +	pmd_t *pmdp;
>> +	pmd_t pmd;
>> +
>> +	/*
>> +	 * collapse_pmd expects exact address of block to be collapsed
>> +	 */
>> +	if (WARN_ON(ALIGN_DOWN(addr, PMD_SIZE) != addr))
>> +		return;
>> +
>> +	mutex_lock(&pgtable_split_lock);
>
>I don't think this is safe in general. Let's say we have a 2M region split into
>512 x 4K PTEs. It's possible that the first 1M is one object and the second 1M
>is another object. Different CPUs could set_memory_*() on those 2 objects
>concurrently. If one of them then calls this function, we could end up
>collapsing the whole 2M while the other is trying to modify the PTEs and they
>will race.
>
>Note that splitting _is_ safe (and protected by this lock) because you'd have 2
>objects backed by the same PMD, so they would both have to split before
>modifying the PTEs.
>
>I think you'd need to ensure mutual exclusion at a higher level if doing this;
>probably execmem is the place that can ensure that no objects within a 2M region
>are racily trying to modify their permissions?
>
>> +
>> +	pgdp = pgd_offset_k(addr);
>> +	if (pgd_none(READ_ONCE(*pgdp)))
>
>nit: use the getters (e.g. pXdp_get()) instead of READ_ONCE().
>
>> +		goto out;
>> +
>> +	p4dp = p4d_offset(pgdp, addr);
>> +	if (p4d_none(READ_ONCE(*p4dp)))
>> +		goto out;
>> +
>> +	pudp = pud_offset(p4dp, addr);
>> +	if (pud_none(READ_ONCE(*pudp)))
>> +		goto out;
>> +
>> +	if (pud_leaf(READ_ONCE(*pudp)))
>> +		goto out;
>> +
>> +	pmdp = pmd_offset(pudp, addr);
>> +	pmd = pmdp_get(pmdp);
>> +
>> +	if (!pmd_table(pmd))
>> +		goto out;
>> +
>> +	lazy_mmu_mode_enable();
>> +	__try_collapse_pmd(pmdp, pmd, addr);
>> +	lazy_mmu_mode_disable();
>> +
>> +out:
>> +	mutex_unlock(&pgtable_split_lock);
>> +}
>> +
>>  int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>>  {
>>  	int ret;
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index eaefdf90b0d5..11e0b60264c3 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -200,6 +200,13 @@ static int change_memory_common(unsigned long addr, int numpages,
>>  			if (ret)
>>  				return ret;
>>  		}
>
>The loop here in the unchanged code, calls __change_memory_common() for each
>PAGE_SIZE page in the linear alias. Perhaps it should be area->page_order aware?
>That way, if we are changing the permissions of the whole 2M chunk of vmalloc
>space and it's backed by a 2M page, then we won't split the mapping to PTEs in
>the linear map. Similarly, if we are changing permissions on a sub-region of the
>2M area, we might be able to split only as far as contpte and still have some
>performance benefit?
>

This is out of scope of execmem (as you noted we are mostly changing permision
for only a part of the block) but your sugestion seems to be valid for other
cases.

>> +		/*
>> +		 * When setting a read-only flag on the linear region, the memory
>> +		 * may have been backed by a PMD before being split. Try to
>> +		 * collapse it back into a PMD to restore huge page performance.
>> +		 */
>> +		if (pgprot_val(set_mask) == PTE_RDONLY && area->flags & VM_ALLOW_HUGE_VMAP)
>> +			try_collapse_kernel_pmd((u64)page_address(area->pages[0]));
>
>try_collapse_kernel_pmd() requires a PMD-aligned address. VM_ALLOW_HUGE_VMAP
>doesn't guarrantee that - it only says that it's _allowed_ to be huge (and 64K
>would still be huge). vmalloc may have failed to allocate a PMD-sized page and
>reverted to something smaller (64K contpte or 4K). You need to look at
>area->page_order, I think.
>
Agree.

>You're only calling this for the linear alias. Don't you want to call it for the
>vmalloc range too? I assume the module text executes using the vmalloc address
>so you'd want it mapped by a single PMD for best performance?
>

I think it is a good idea to have them both (linear alias and vmalloc area)
mapped as PMD. 

>But in general, I don't really like the idea of special-casing a collapse to pmd
>here for just execmem. I think we should either investigate a general purpose
>collapse mechanism or execmem should have extra hooks added to request specific
>collapse.>

>Thanks,
>Ryan
>
>
>>  	}
>>
>>  	/*
>


I will rethink this based on your suggestion and try to properly fix the
concurrency issues.

I am also wondering if we should avoid splitting linear aliases altogether. 
I think I understand why it was done originally (to restrict writing through
linear mapping aliases for read-only vmalloc areas). However, it is not 
necessary to make them writable when the vmalloc area is writable. Maybe 
we should create a special case for execmem to prevent this. WDYT?

Thank you for the review,
Adrian



      reply	other threads:[~2026-06-24 14:32 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-11 13:01 [RFC PATCH 0/6] arm64: mm: Introducing ROX CACHE to ARM64 systems with bbml2 no abort Adrian Barnaś
2026-06-11 13:01 ` [RFC PATCH 1/6] arm64: mm: explicitly declare module and ftrace execmem regions Adrian Barnaś
2026-06-11 13:36   ` Brendan Jackman
2026-06-11 13:01 ` [RFC PATCH 2/6] arm64: mm: allow huge vmap permission adjustments with bbml2_no_abort Adrian Barnaś
2026-06-18 14:21   ` Ryan Roberts
2026-06-24 13:54     ` Adrian Barnaś
2026-06-11 13:01 ` [RFC PATCH 3/6] arm64: mm: fix restoring linear map permissions on execmem cache clean Adrian Barnaś
2026-06-11 13:54   ` Brendan Jackman
2026-06-12  7:17     ` Mike Rapoport
2026-06-17 15:18       ` Adrian Barnaś
2026-06-17 18:40         ` Mike Rapoport
2026-06-24 13:57           ` Adrian Barnaś
2026-06-18 15:05   ` Ryan Roberts
2026-06-19  8:33     ` Ryan Roberts
2026-06-24 13:52       ` Adrian Barnaś
2026-06-11 13:01 ` [RFC PATCH 4/6] arm64: mm: add helper to fill execmem with trapping instructions Adrian Barnaś
2026-06-19 10:54   ` Ryan Roberts
2026-06-19 10:58     ` Mike Rapoport
2026-06-11 13:01 ` [RFC PATCH 5/6] arm64: execmem: enable EXECMEM_ROX_CACHE on supported CPUs Adrian Barnaś
2026-06-19 12:09   ` Ryan Roberts
2026-06-11 13:01 ` [RFC PATCH 6/6] arm64: mm: support PMD page coalescing in the linear map Adrian Barnaś
2026-06-19 13:40   ` Ryan Roberts
2026-06-24 14:32     ` Adrian Barnaś [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ajvqh8Ac0Sj5UzP_@google.com \
    --to=abarnas@google.com \
    --cc=ardb@kernel.org \
    --cc=catalin.marinas@arm.com \
    --cc=cl@gentwo.org \
    --cc=david@kernel.org \
    --cc=jackmanb@google.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-mm@kvack.org \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=will@kernel.org \
    --cc=yang@os.amperecomputing.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.