From: Anshuman Khandual
Date: Thu, 24 Jul 2025 11:20:26 +0530
Subject: Re: [RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump
To: Dev Jain, catalin.marinas@arm.com, will@kernel.org
Cc: quic_zhenhuah@quicinc.com, ryan.roberts@arm.com, kevin.brodsky@arm.com,
    yangyicong@hisilicon.com, joey.gouly@arm.com,
    linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
    david@redhat.com, mark.rutland@arm.com, urezki@gmail.com
Message-ID: <90f56b8b-cc09-45c5-88d5-36d39a00ea14@arm.com>
In-Reply-To: <20250723161827.15802-1-dev.jain@arm.com>
References: <20250723161827.15802-1-dev.jain@arm.com>

A very small nit - arm64/mm: Enable vmalloc-huge with ptdump

On 23/07/25 9:48 PM, Dev Jain wrote:
> Our goal is to move towards enabling vmalloc-huge by default on arm64 so
> as to reduce TLB pressure.
> Therefore, we need a way to analyze the portion
> of block mappings in vmalloc space we can get on a production system; this
> can be done through ptdump, but currently we disable vmalloc-huge if
> CONFIG_PTDUMP_DEBUGFS is on. The reason is that lazy freeing of kernel
> pagetables via vmap_try_huge_pxd() may race with ptdump, so ptdump
> may dereference a bogus address.
>
> To solve this, we need to synchronize ptdump_walk_pgd() with
> pud_free_pmd_page() and pmd_free_pte_page().
>
> Since this race is very unlikely to happen in practice, we do not want to
> penalize other code paths taking the init_mm mmap_lock. Therefore, we use
> static keys. ptdump will enable the static key - upon observing that, the
> vmalloc table freeing path will get patched in with an
> mmap_read_lock/unlock sequence. A code comment explains in detail, how
> a combination of acquire sematics of static_branch_enable() and the
                           ^^^^^^^^
typo - s/sematics/semantics

> barriers in __flush_tlb_kernel_pgtable() ensures that ptdump will never
> get a hold on the address of a freed PMD or PTE table.
>
> For an (almost) rigorous proof of correctness, one may look at:
> https://lore.kernel.org/all/e1e87f16-1c48-481b-8f7c-9333ac5d13e7@arm.com/
>
> Reviewed-by: Ryan Roberts
> Signed-off-by: Dev Jain
> ---
> Rebased on 6.16-rc7.
>
> mm-selftests pass. No issues were observed while parallelly running
> test_vmalloc.sh (which stresses the vmalloc subsystem) and dumping the
> kernel pagetable through sysfs in a loop.
>
> v4->v5:
>  - The arch_vmap_pxd_supported() changes were dropped by mistake in between
>    the revisions, add them back (Anshuman)
>  - Rewrite cover letter, drop stray change, add arm64_ptdump_walk_pgd()
>    helper, rename ptdump_lock_key -> arm64_ptdump_lock_key, add comment
>    above __pmd_free_pte_page() to explain when the lock won't
>    be taken (Anshuman)
>  - Rewrite the comment explaining the synchronization logic (Catalin)
>
> v3->v4:
>  - Lock-unlock immediately
>  - Simplify includes
>
> v2->v3:
>  - Use static key mechanism
>
> v1->v2:
>  - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
>  - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
>    the lock 512 times again via pmd_free_pte_page()
>
> ---
>  arch/arm64/include/asm/ptdump.h  |  2 +
>  arch/arm64/include/asm/vmalloc.h |  6 +--
>  arch/arm64/mm/mmu.c              | 86 ++++++++++++++++++++++++++++++--
>  arch/arm64/mm/ptdump.c           | 11 +++-
>  4 files changed, 95 insertions(+), 10 deletions(-)
>
> diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
> index fded5358641f..baff24004459 100644
> --- a/arch/arm64/include/asm/ptdump.h
> +++ b/arch/arm64/include/asm/ptdump.h
> @@ -7,6 +7,8 @@
>
>  #include <linux/ptdump.h>
>
> +DECLARE_STATIC_KEY_FALSE(arm64_ptdump_lock_key);
> +
>  #ifdef CONFIG_PTDUMP
>
>  #include <linux/mm_types.h>
> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
> index 12f534e8f3ed..e835fd437ae0 100644
> --- a/arch/arm64/include/asm/vmalloc.h
> +++ b/arch/arm64/include/asm/vmalloc.h
> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>  	/*
>  	 * SW table walks can't handle removal of intermediate entries.
>  	 */

The above comment can now be dropped.
> -	return pud_sect_supported() &&
> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> +	return pud_sect_supported();
>  }
>
>  #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>  static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>  {
> -	/* See arch_vmap_pud_supported() */
> -	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
> +	return true;
>  }
>
>  #define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 00ab1d648db6..49932c1dd34e 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -46,6 +46,8 @@
>  #define NO_CONT_MAPPINGS	BIT(1)
>  #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>
> +DEFINE_STATIC_KEY_FALSE(arm64_ptdump_lock_key);

A short introduction about the static key's purpose might be helpful.

> +
>  enum pgtable_type {
>  	TABLE_PTE,
>  	TABLE_PMD,
> @@ -1267,7 +1269,12 @@ int pmd_clear_huge(pmd_t *pmdp)
>  	return 1;
>  }
>
> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> +/*
> + * If PMD has been isolated via pud_free_pmd_page(), ptdump cannot get
> + * a hold to it, so no need to serialize with mmap_lock, hence lock
> + * will be passed as false here. Otherwise, lock will be true.
> + */

This comment should be split in two, with each half added near its
respective call site - one with and one without the kernel mmap_lock
requirement.

> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)

IIUC 'lock' indicates the need to take the mmap_lock in init_mm. Hence
renaming it to 'mmap_lock' or 'kernel_mmap_lock' might be better, which
will also add some required context.

>  {
>  	pte_t *table;
>  	pmd_t pmd;
> @@ -1279,13 +1286,24 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>  		return 1;
>  	}
>
> +	/* See comment in pud_free_pmd_page for static key logic */
>  	table = pte_offset_kernel(pmdp, addr);
>  	pmd_clear(pmdp);
>  	__flush_tlb_kernel_pgtable(addr);
> +	if (static_branch_unlikely(&arm64_ptdump_lock_key) && lock) {
> +		mmap_read_lock(&init_mm);
> +		mmap_read_unlock(&init_mm);
> +	}
> +
>  	pte_free_kernel(NULL, table);
>  	return 1;
>  }
>
> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
> +{
> +	return __pmd_free_pte_page(pmdp, addr, true);
> +}
> +
>  int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>  {
>  	pmd_t *table;
> @@ -1301,16 +1319,76 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>  	}
>
>  	table = pmd_offset(pudp, addr);
> +
> +	/*
> +	 * Our objective is to prevent ptdump from reading a PMD table which has
> +	 * been freed. Assume that ptdump_walk_pgd() (call this thread T1)
> +	 * executes completely on CPU1 and pud_free_pmd_page() (call this thread
> +	 * T2) executes completely on CPU2. Let the region sandwiched by the
> +	 * mmap_write_lock/unlock in T1 be called CS (the critical section).
> +	 *
> +	 * Claim: The CS of T1 will never operate on a freed PMD table.
> +	 *
> +	 * Proof:
> +	 *
> +	 * Case 1: The static branch is visible to T2.
> +	 *
> +	 * Case 1 (a): T1 acquires the lock before T2 can.
> +	 * T2 will block until T1 drops the lock, so pmd_free() will only be
> +	 * executed after T1 exits CS.

T2 blocks here because ptdump_walk_pgd() takes mmap_write_lock(). This
needs to be mentioned.

> +	 *
> +	 * Case 1 (b): T2 acquires the lock before T1 can.
> +	 * The sequence of barriers issued in __flush_tlb_kernel_pgtable()
> +	 * ensures that an empty PUD (via pud_clear()) is visible to T1 before
> +	 * T1 can enter CS, therefore it is impossible for the CS to get hold
> +	 * of the address of the isolated PMD table.

Agreed.
mmap_write_lock() in ptdump_walk_pgd() will proceed after T2 calls
mmap_read_unlock(). So the protection from the race comes from the fact
that T1 can never get hold of the isolated PMD table, not from the
mmap_lock itself.

> +	 *
> +	 * Case 2: The static branch is not visible to T2.
> +	 *
> +	 * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
> +	 * have acquire semantics, it is guaranteed that the static branch
> +	 * will be visible to all CPUs before T1 can enter CS. The static
> +	 * branch not being visible to T2 therefore guarantees that T1 has
> +	 * not yet entered CS .... (i)
> +	 * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
> +	 * implies that if the invisibility of the static branch has been
> +	 * observed by T2 (i.e static_branch_unlikely() is observed as false),
> +	 * then all CPUs will have observed an empty PUD ... (ii)
> +	 * Combining (i) and (ii), we conclude that T1 observes an empty PUD
> +	 * before entering CS => it is impossible for the CS to get hold of
> +	 * the address of the isolated PMD table. Q.E.D
> +	 *
> +	 * We have proven that the claim is true on the assumption that
> +	 * there is no context switch for T1 and T2. Note that the reasoning
> +	 * of the proof uses barriers operating on the inner shareable domain,
> +	 * which means that they will affect all CPUs, and also a context switch
> +	 * will insert extra barriers into the code paths => the claim will
> +	 * stand true even if we drop the assumption.

Does all this rest on the fact that static_branch_enable() and
mmap_write_lock() have acquire semantics available via a particular class
of barrier instructions? In which case, should the generation of those
instructions by these functions be asserted for the above T1/T2 guarantee
to hold? Just curious.

> +	 *
> +	 * It is also worth reasoning whether something can go wrong via
> +	 * pud_free_pmd_page() -> __pmd_free_pte_page(), since the latter
> +	 * will be called locklessly on this code path.
> +	 *
> +	 * For Case 1 (a), T2 will block until CS is finished, so we are safe.
> +	 * For Case 1 (b) and Case 2, the PMD table will be isolated before
> +	 * T1 can enter CS, therefore it is safe for T2 to operate on the
> +	 * PMD table locklessly.
> +	 */

Although it looks reasonable - the above comment block is going to be
difficult to process after time has passed :) Just wondering, could it
somehow be simplified?
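For what it's worth, Case 2 above is essentially the classic
store-buffering litmus test with a full barrier on each side. Below is a
rough userspace model of it in C11 atomics - purely illustrative, not
kernel code: 'key', 'pud', 't1' and 't2' are stand-ins, and the seq_cst
fences model the dmb(ish) behind static_branch_enable() and the barriers
in __flush_tlb_kernel_pgtable(), on the assumption (per my question
above) that those amount to full-fence ordering. Restating the comment
in these terms might make it easier to digest:

/* sb_model.c - build with: cc -O2 -pthread sb_model.c
 *
 * T1 models arm64_ptdump_walk_pgd(), T2 models pud_free_pmd_page().
 * 'key' stands for arm64_ptdump_lock_key, 'pud' for the PUD entry
 * (1: PMD table linked, 0: cleared). The forbidden outcome is
 * k == 0 && p == 1, i.e. "T2 skips the lock but T1 still sees the
 * PMD table" - exactly what the Case 2 proof rules out.
 */
#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>
#include <stdio.h>

static atomic_int key, pud;
static int p, k;

static void *t1(void *arg)	/* ptdump side */
{
	(void)arg;
	atomic_store_explicit(&key, 1, memory_order_relaxed);	/* static_branch_enable() */
	atomic_thread_fence(memory_order_seq_cst);		/* dmb(ish) + mmap_write_lock() */
	p = atomic_load_explicit(&pud, memory_order_relaxed);	/* the walk reads the PUD */
	return NULL;
}

static void *t2(void *arg)	/* table freeing side */
{
	(void)arg;
	atomic_store_explicit(&pud, 0, memory_order_relaxed);	/* pud_clear() */
	atomic_thread_fence(memory_order_seq_cst);		/* __flush_tlb_kernel_pgtable() */
	k = atomic_load_explicit(&key, memory_order_relaxed);	/* static_branch_unlikely() */
	return NULL;
}

int main(void)
{
	for (int i = 0; i < 100000; i++) {
		pthread_t a, b;

		atomic_store(&key, 0);
		atomic_store(&pud, 1);
		pthread_create(&a, NULL, t1, NULL);
		pthread_create(&b, NULL, t2, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		assert(!(k == 0 && p == 1));	/* forbidden by the fences */
	}
	puts("no forbidden outcome observed");
	return 0;
}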
> +	pud_clear(pudp);
> +	__flush_tlb_kernel_pgtable(addr);
> +	if (static_branch_unlikely(&arm64_ptdump_lock_key)) {
> +		mmap_read_lock(&init_mm);
> +		mmap_read_unlock(&init_mm);
> +	}
> +
>  	pmdp = table;
>  	next = addr;
>  	end = addr + PUD_SIZE;
>  	do {
>  		if (pmd_present(pmdp_get(pmdp)))
> -			pmd_free_pte_page(pmdp, next);
> +			__pmd_free_pte_page(pmdp, next, false);
>  	} while (pmdp++, next += PMD_SIZE, next != end);
>
> -	pud_clear(pudp);
> -	__flush_tlb_kernel_pgtable(addr);
>  	pmd_free(NULL, table);
>  	return 1;
>  }
> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> index 421a5de806c6..65335c7ba482 100644
> --- a/arch/arm64/mm/ptdump.c
> +++ b/arch/arm64/mm/ptdump.c
> @@ -283,6 +283,13 @@ void note_page_flush(struct ptdump_state *pt_st)
>  	note_page(pt_st, 0, -1, pte_val(pte_zero));
>  }
>
> +static void arm64_ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
> +{
> +	static_branch_enable(&arm64_ptdump_lock_key);
> +	ptdump_walk_pgd(st, mm, NULL);
> +	static_branch_disable(&arm64_ptdump_lock_key);
> +}

Encapsulating the generic ptdump_walk_pgd() between the arm64 specific
static key enable/disable does make sense.

> +
>  void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
>  {
>  	unsigned long end = ~0UL;
> @@ -311,7 +318,7 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
>  		}
>  	};
>
> -	ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
> +	arm64_ptdump_walk_pgd(&st.ptdump, info->mm);
>  }
>
>  static void __init ptdump_initialize(void)
> @@ -353,7 +360,7 @@ bool ptdump_check_wx(void)
>  		}
>  	};
>
> -	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
> +	arm64_ptdump_walk_pgd(&st.ptdump, &init_mm);
>
>  	if (st.wx_pages || st.uxn_pages) {
>  		pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",