From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 16CF5CDB481 for ; Wed, 24 Jun 2026 14:32:57 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender:List-Subscribe:List-Help :List-Post:List-Archive:List-Unsubscribe:List-Id:In-Reply-To: Content-Transfer-Encoding:Content-Type:MIME-Version:References:Message-ID: Subject:Cc:To:From:Date:Reply-To:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=0ZY3mpmKpMn6hPTo5zempi8aavh2AOA9wO8Ai718llA=; b=hp0hJ/siidLyOEwT6NteMO/N7Z Jf5GPy5KcWwPf+RkoxJyPU0mbMB6mVUxZNR+TWJKQa9kXycwrulvGLVvCTdNDlWeDkzi0haXu/4DL 2vE011piM31dkQuKjDhw253iSDUbiSFOKIOLVRtxoyA/yl+0Hn1h1FVzFAoopAkEDILkRY6sJHlAg fvdIAhMLB6v/wgNlPNDqpD91NO9TuOak2hifIi1KoeocB3mOZ8z4dkxtL14ebexB5IdIb/GPaQFaP OeFIdnYtrQ+DR+JqsY1Iht6N5bKv18FEkhoGObAZ38A8/BexYPer9QFHtjBwmJS9qGc5fI/DyEGhZ i7Z2zJmg==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.99.1 #2 (Red Hat Linux)) id 1wcOen-00000007v48-20Xd; Wed, 24 Jun 2026 14:32:49 +0000 Received: from mail-wr1-x434.google.com ([2a00:1450:4864:20::434]) by bombadil.infradead.org with esmtps (Exim 4.99.1 #2 (Red Hat Linux)) id 1wcOel-00000007v3J-0vvD for linux-arm-kernel@lists.infradead.org; Wed, 24 Jun 2026 14:32:48 +0000 Received: by mail-wr1-x434.google.com with SMTP id ffacd0b85a97d-46cbf263113so542737f8f.1 for ; Wed, 24 Jun 2026 07:32:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1782311565; x=1782916365; darn=lists.infradead.org; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date:from:to :cc:subject:date:message-id:reply-to; bh=0ZY3mpmKpMn6hPTo5zempi8aavh2AOA9wO8Ai718llA=; b=e0pLtLRPtsDe08UfvLP1021HHBpR8Hv9oESpyB7QX2ZgcQ7sJUTdhfwZiXAKpHbc4w 0pJQW7GB3VkDmH8HtHnCKn5LlJSYkX0/d0Ib05eitZ7F3N4g681/f+3wUT5kcWek8xsb cPHk/u8pP/y0TZkiGgfr9JZjEJcPSGXBkKRPIU2h/5EC2Q4y3kE+nZ7UckPQb9TaEOwI 9vwAkbnu7iP2jvgIU5BN77JSQgBY/EMZhcC9h8nO527ackIaBt78wdZPKf2TWDnYOK1k Wn2b2tQXHbd8rpZnzneAsZL4Dm0eo7euHZraeGo7m16hfYiJp1bzvgX14HrnR0sfqAxg l7Xg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1782311565; x=1782916365; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date:x-gm-gg :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=0ZY3mpmKpMn6hPTo5zempi8aavh2AOA9wO8Ai718llA=; b=bzKgnS4aRD6hEfnoCa9t5eiVwXfTWl5uCVVN261Y9OAtB/M7bxxT8jtzcHNTkFWThW mA5wN3mNnhC2VQ5Befxa/pu0b6VHHbjmYP20tzA9XwyrTPxONTSolN4NYvD7lbSzaHBj o1ued6K+3aIXUzSh11b7zanjmuBdxQ7sjr6Dua4yna/duCwdVN4E4peuGYdHC9Jv4Mh1 0eaeH0mZHFlf3gmiay4VTqg6FPBJQCP5+0z28Mnu0UlN6DfLjxJPYY+ybnnsr6cwt67g lD2TEPBAMnGjJIwapAzx1iKM4Pv2a8KusPbC7fsSdt58aYgferY7NU36MCtL3+ZGylBl 9zkA== X-Gm-Message-State: AOJu0YwwuyGnCUhm9szEygqsfAno9diIh6v8TpVZvl4bHpex+EcTW5V6 5Z8JU7Rza/+PazQX2HYYiWE+MpTPOJq1r2pUtJNsmsv2QHP5FrXLNLKWWOQHEiA6tQ== X-Gm-Gg: AfdE7cnvXxlSBIIJl12N+5LExQlhhvFbuoFOBxwHB7EXf5FIxa8nS1vPiJrxrJ98fPQ pqbfiXSzksuR25XC+1fi588fJPqRKT9+LQ49oEwaoAQprMqaOKN3r+4no4oqY/UzztxEiycz49Z yWxElQu+1KUh0DlrjbgZwXdx9B9ZCSiHga2O5AE9vrR4sc1StPJPye5C4OxP762eXyVzjGPydyr LPyMZLU47IMafSSJvoubnOoJyDwhxw+VAlvU0Z+xb1uqDf1P9FvwFAuoMavWZI4Zq7pYSzmRvh/ WJQfcbe0APlx/qg3EzNok3st5l5yw7ZtvM1AZEXQlMTRkmz+LAyEBzPBxKA8xbJeE0V9JTxxZET 8veJ5KCW6XuNBBFuHgdSnzWE9YLmO0n/nlLUmWpJZ1TIgKa5MmrD6KsPpxMHNp3QAPdqyTB46Ut OphOpewnrzB/m+BzMCyr44L40B2V3Jpq4vadApPANj5GZflSI= X-Received: by 2002:a05:600c:1553:b0:490:cb90:3e13 with SMTP id 5b1f17b1804b1-49260872798mr53816085e9.21.1782311562517; Wed, 24 Jun 2026 07:32:42 -0700 (PDT) Received: from google.com (97.113.76.34.bc.googleusercontent.com. [34.76.113.97]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-4923fd154fdsm428332795e9.1.2026.06.24.07.32.41 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 24 Jun 2026 07:32:41 -0700 (PDT) Date: Wed, 24 Jun 2026 14:32:39 +0000 From: Adrian =?utf-8?Q?Barna=C5=9B?= To: Ryan Roberts Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org, Catalin Marinas , Will Deacon , David Hildenbrand , "Mike Rapoport (Microsoft)" , Ard Biesheuvel , Christoph Lameter , Yang Shi , Brendan Jackman Subject: Re: [RFC PATCH 6/6] arm64: mm: support PMD page coalescing in the linear map Message-ID: References: <20260611130144.1385343-1-abarnas@google.com> <20260611130144.1385343-7-abarnas@google.com> <799181c3-a1a1-4de7-bc6a-576d3282efb0@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <799181c3-a1a1-4de7-bc6a-576d3282efb0@arm.com> X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.9.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20260624_073247_319832_A203E210 X-CRM114-Status: GOOD ( 45.57 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On Fri, Jun 19, 2026 at 02:40:40PM +0100, Ryan Roberts wrote: >On 11/06/2026 14:01, Adrian Barnaś wrote: >> Implement PMD block coalescing to merge fragmented linear mapping regions >> back into huge pages when restoring the read-only attribute. >> >> When memory allocated with VM_ALLOW_HUGE_VMAP (such as for the execmem >> ROX cache) has its permissions modified, the PMD block mapping is split >> into individual PTEs. Without this change, when that memory have its RO >> attribute subsequently cleared and set the mapping remains permanently >> fragmented into 4K pages. > >I'd like to make sure I understand this correctly: So the execmem ROX cache is >allocated using vmalloc_huge such that all it's backing memory are PMD-sized >pages. It is initially poisoned with a faulting instruction and marked ROX. When >a module is loaded, it's text is copied into a portion of the ROX cache and >executed from there? To do that, the required number of (4K) pages are allocated >from ROX cache, and the permissions are changed on that sub-region to RW, >meaning we have to split the mapping from a PMD to (cont)PTEs in both vmalloc >space and in the linear map. After the text is copied, the sub-region is >switched back to ROX. But without that change, the 2M region is now mapped by >(cont)PTEs for all of time? That is correct. > >> >> Signed-off-by: Adrian Barnaś >> --- >> arch/arm64/include/asm/mmu.h | 1 + >> arch/arm64/mm/mmu.c | 95 ++++++++++++++++++++++++++++++++++++ >> arch/arm64/mm/pageattr.c | 7 +++ >> 3 files changed, 103 insertions(+) >> >> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h >> index 137a173df1ff..19158bacb2df 100644 >> --- a/arch/arm64/include/asm/mmu.h >> +++ b/arch/arm64/include/asm/mmu.h >> @@ -80,6 +80,7 @@ extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot); >> extern void mark_linear_text_alias_ro(void); >> extern int split_kernel_leaf_mapping(unsigned long start, unsigned long end); >> extern void linear_map_maybe_split_to_ptes(void); >> +void try_collapse_kernel_pmd(unsigned long addr); >> >> /* >> * This check is triggered during the early boot before the cpufeature >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c >> index a6a00accf4f9..d74226fa1c9b 100644 >> --- a/arch/arm64/mm/mmu.c >> +++ b/arch/arm64/mm/mmu.c >> @@ -769,6 +769,101 @@ static inline bool force_pte_mapping(void) >> >> static DEFINE_MUTEX(pgtable_split_lock); >> >> +static inline bool __pte_can_be_collapsed(pte_t pte, unsigned long pfn, pgprot_t prot) >> +{ >> + if (!pte_valid(pte)) >> + return false; >> + if (pte_pfn(pte) != pfn) >> + return false; >> + if ((pgprot_val(pte_pgprot(pte)) & ~PTE_CONT) != pgprot_val(prot)) > >Do we want to permit differing values for access and dirty? We do for user >space, but I think for kernel, those bits are always set? In which case, what >you're doing is correct. > >> + return false; >> + >> + return true; >> +} >> + >> +static void __try_collapse_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr) >> +{ >> + pte_t *ptep; >> + pte_t first_pte; >> + unsigned long pfn; >> + pgprot_t prot; >> + int i; >> + >> + ptep = (pte_t *)pmd_page_vaddr(pmd); >> + first_pte = __ptep_get(ptep); >> + >> + if (!pte_valid(first_pte)) >> + return; >> + >> + prot = pte_pgprot(first_pte); >> + prot = __pgprot(pgprot_val(prot) & ~PTE_CONT); >> + pfn = pte_pfn(first_pte); >> + >> + if (!IS_ALIGNED(pfn, PMD_SIZE >> PAGE_SHIFT)) >> + return; >> + >> + for (i = 1; i < PTRS_PER_PTE; i++) { >> + if (!__pte_can_be_collapsed(__ptep_get(ptep + i), pfn + i, prot)) >> + return; >> + } >> + >> + set_pmd(pmdp, pmd_mkhuge(pfn_pmd(pfn, prot))); >> + >> + __flush_tlb_kernel_pgtable(addr); >> + > >nit: suggest adding the same comment that other sites has, since this is not >obvious: > >/* See comment in pud_free_pmd_page for static key logic */ > >> + if (static_branch_unlikely(&arm64_ptdump_lock_key)) { >> + mmap_read_lock(&init_mm); >> + mmap_read_unlock(&init_mm); >> + } >> + >> + pte_free_kernel(NULL, ptep); > >Ooh, fun. Per my comments below we definitely have opportunity for racy pte >table accesses (beyond ptdump). I _think_ if you fix that then this will be safe. > >> +} >> + >> +void try_collapse_kernel_pmd(unsigned long addr) >> +{ >> + pgd_t *pgdp; >> + p4d_t *p4dp; >> + pud_t *pudp; >> + pmd_t *pmdp; >> + pmd_t pmd; >> + >> + /* >> + * collapse_pmd expects exact address of block to be collapsed >> + */ >> + if (WARN_ON(ALIGN_DOWN(addr, PMD_SIZE) != addr)) >> + return; >> + >> + mutex_lock(&pgtable_split_lock); > >I don't think this is safe in general. Let's say we have a 2M region split into >512 x 4K PTEs. It's possible that the first 1M is one object and the second 1M >is another object. Different CPUs could set_memory_*() on those 2 objects >concurrently. If one of them then calls this function, we could end up >collapsing the whole 2M while the other is trying to modify the PTEs and they >will race. > >Note that splitting _is_ safe (and protected by this lock) because you'd have 2 >objects backed by the same PMD, so they would both have to split before >modifying the PTEs. > >I think you'd need to ensure mutual exclusion at a higher level if doing this; >probably execmem is the place that can ensure that no objects within a 2M region >are racily trying to modify their permissions? > >> + >> + pgdp = pgd_offset_k(addr); >> + if (pgd_none(READ_ONCE(*pgdp))) > >nit: use the getters (e.g. pXdp_get()) instead of READ_ONCE(). > >> + goto out; >> + >> + p4dp = p4d_offset(pgdp, addr); >> + if (p4d_none(READ_ONCE(*p4dp))) >> + goto out; >> + >> + pudp = pud_offset(p4dp, addr); >> + if (pud_none(READ_ONCE(*pudp))) >> + goto out; >> + >> + if (pud_leaf(READ_ONCE(*pudp))) >> + goto out; >> + >> + pmdp = pmd_offset(pudp, addr); >> + pmd = pmdp_get(pmdp); >> + >> + if (!pmd_table(pmd)) >> + goto out; >> + >> + lazy_mmu_mode_enable(); >> + __try_collapse_pmd(pmdp, pmd, addr); >> + lazy_mmu_mode_disable(); >> + >> +out: >> + mutex_unlock(&pgtable_split_lock); >> +} >> + >> int split_kernel_leaf_mapping(unsigned long start, unsigned long end) >> { >> int ret; >> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c >> index eaefdf90b0d5..11e0b60264c3 100644 >> --- a/arch/arm64/mm/pageattr.c >> +++ b/arch/arm64/mm/pageattr.c >> @@ -200,6 +200,13 @@ static int change_memory_common(unsigned long addr, int numpages, >> if (ret) >> return ret; >> } > >The loop here in the unchanged code, calls __change_memory_common() for each >PAGE_SIZE page in the linear alias. Perhaps it should be area->page_order aware? >That way, if we are changing the permissions of the whole 2M chunk of vmalloc >space and it's backed by a 2M page, then we won't split the mapping to PTEs in >the linear map. Similarly, if we are changing permissions on a sub-region of the >2M area, we might be able to split only as far as contpte and still have some >performance benefit? > This is out of scope of execmem (as you noted we are mostly changing permision for only a part of the block) but your sugestion seems to be valid for other cases. >> + /* >> + * When setting a read-only flag on the linear region, the memory >> + * may have been backed by a PMD before being split. Try to >> + * collapse it back into a PMD to restore huge page performance. >> + */ >> + if (pgprot_val(set_mask) == PTE_RDONLY && area->flags & VM_ALLOW_HUGE_VMAP) >> + try_collapse_kernel_pmd((u64)page_address(area->pages[0])); > >try_collapse_kernel_pmd() requires a PMD-aligned address. VM_ALLOW_HUGE_VMAP >doesn't guarrantee that - it only says that it's _allowed_ to be huge (and 64K >would still be huge). vmalloc may have failed to allocate a PMD-sized page and >reverted to something smaller (64K contpte or 4K). You need to look at >area->page_order, I think. > Agree. >You're only calling this for the linear alias. Don't you want to call it for the >vmalloc range too? I assume the module text executes using the vmalloc address >so you'd want it mapped by a single PMD for best performance? > I think it is a good idea to have them both (linear alias and vmalloc area) mapped as PMD. >But in general, I don't really like the idea of special-casing a collapse to pmd >here for just execmem. I think we should either investigate a general purpose >collapse mechanism or execmem should have extra hooks added to request specific >collapse.> >Thanks, >Ryan > > >> } >> >> /* > I will rethink this based on your suggestion and try to properly fix the concurrency issues. I am also wondering if we should avoid splitting linear aliases altogether. I think I understand why it was done originally (to restrict writing through linear mapping aliases for read-only vmalloc areas). However, it is not necessary to make them writable when the vmalloc area is writable. Maybe we should create a special case for execmem to prevent this. WDYT? Thank you for the review, Adrian