From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id AF143C5B543 for ; Fri, 30 May 2025 13:14:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender:List-Subscribe:List-Help :List-Post:List-Archive:List-Unsubscribe:List-Id:Content-Transfer-Encoding: Content-Type:In-Reply-To:From:References:Cc:To:Subject:MIME-Version:Date: Message-ID:Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=bYGgQxhNq7A81glSs5pyPmsLNIk2C60jJUsKFfVg7a8=; b=ZjR3xKg7irsyGINYTFYT/RvWB0 5t4FtShrB5Wm6dYOA383dd9y7006aABq9o37nybA9Mp/AK9rOLl3EQ2i9I42ERIBgbUbxiO3sEORm cCnwIDZcburb5QpvtT7uCaOA4DGnExBRkuCiQMXIaZk21CFRpjqwcHdAh/xVH8lJDID4mPAybfp/o Sujq65b1vTEn2/6FCyuhq5EUKgYwewBbjcmRR7Y15z89xxPZl9izmI+n4GOYeLpXFh5w39x0HEams hATce1HKSZJv+tqNj7vyLF0VnGh0hX7zUyUg0+IM/aA2wnGGl+mLaBiPBN8X+AWxCNZnXgs1gZ1Zn q6NTql6Q==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.98.2 #2 (Red Hat Linux)) id 1uKzYW-00000000j4a-28Zu; Fri, 30 May 2025 13:13:52 +0000 Received: from foss.arm.com ([217.140.110.172]) by bombadil.infradead.org with esmtp (Exim 4.98.2 #2 (Red Hat Linux)) id 1uKzWN-00000000id8-3vx1 for linux-arm-kernel@lists.infradead.org; Fri, 30 May 2025 13:11:41 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id C9B1C1692; Fri, 30 May 2025 06:11:22 -0700 (PDT) Received: from [10.57.95.14] (unknown [10.57.95.14]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 9744B3F5A1; Fri, 30 May 2025 06:11:37 -0700 (PDT) Message-ID: Date: Fri, 30 May 2025 14:11:36 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH] arm64: Enable vmalloc-huge with ptdump Content-Language: en-GB To: Will Deacon Cc: Dev Jain , catalin.marinas@arm.com, anshuman.khandual@arm.com, quic_zhenhuah@quicinc.com, kevin.brodsky@arm.com, yangyicong@hisilicon.com, joey.gouly@arm.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, david@redhat.com References: <20250530082021.18182-1-dev.jain@arm.com> <832e84a9-4303-4e21-a88b-94395898fa3e@arm.com> <4202a03d-dacd-429c-91e6-81a5d05726a4@arm.com> <20250530123527.GA30463@willie-the-truck> From: Ryan Roberts In-Reply-To: <20250530123527.GA30463@willie-the-truck> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20250530_061140_068786_6AB3DA29 X-CRM114-Status: GOOD ( 20.84 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On 30/05/2025 13:35, Will Deacon wrote: > On Fri, May 30, 2025 at 12:50:40PM +0100, Ryan Roberts wrote: >> On 30/05/2025 10:14, Dev Jain wrote: >>> >>> On 30/05/25 2:10 pm, Ryan Roberts wrote: >>>> On 30/05/2025 09:20, Dev Jain wrote: >>>>> arm64 disables vmalloc-huge when kernel page table dumping is enabled, >>>>> because an intermediate table may be removed, potentially causing the >>>>> ptdump code to dereference an invalid address. We want to be able to >>>>> analyze block vs page mappings for kernel mappings with ptdump, so to >>>>> enable vmalloc-huge with ptdump, synchronize between page table removal in >>>>> pmd_free_pte_page()/pud_free_pmd_page() and ptdump pagetable walking. We >>>>> use mmap_read_lock and not write lock because we don't need to synchronize >>>>> between two different vm_structs; two vmalloc objects running this same >>>>> code path will point to different page tables, hence there is no race. >>> >>> My "correction" from race->no problem was incorrect after all :) There will >>> be no race too since the vm_struct object has exclusive access to whatever >>> table it is clearing. >>> >>>>> >>>>> Signed-off-by: Dev Jain >>>>> --- >>>>>   arch/arm64/include/asm/vmalloc.h | 6 ++---- >>>>>   arch/arm64/mm/mmu.c              | 7 +++++++ >>>>>   2 files changed, 9 insertions(+), 4 deletions(-) >>>>> >>>>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h >>>>> index 38fafffe699f..28b7173d8693 100644 >>>>> --- a/arch/arm64/include/asm/vmalloc.h >>>>> +++ b/arch/arm64/include/asm/vmalloc.h >>>>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot) >>>>>       /* >>>>>        * SW table walks can't handle removal of intermediate entries. >>>>>        */ >>>>> -    return pud_sect_supported() && >>>>> -           !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); >>>>> +    return pud_sect_supported(); >>>>>   } >>>>>     #define arch_vmap_pmd_supported arch_vmap_pmd_supported >>>>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot) >>>>>   { >>>>> -    /* See arch_vmap_pud_supported() */ >>>>> -    return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); >>>>> +    return true; >>>>>   } >>>>>     #endif >>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c >>>>> index ea6695d53fb9..798cebd9e147 100644 >>>>> --- a/arch/arm64/mm/mmu.c >>>>> +++ b/arch/arm64/mm/mmu.c >>>>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr) >>>>>       } >>>>>         table = pte_offset_kernel(pmdp, addr); >>>>> + >>>>> +    /* Synchronize against ptdump_walk_pgd() */ >>>>> +    mmap_read_lock(&init_mm); >>>>>       pmd_clear(pmdp); >>>>> +    mmap_read_unlock(&init_mm); >>>> So this works because ptdump_walk_pgd() takes the write_lock (which is mutually >>>> exclusive with any read_lock holders) for the duration of the table walk, so it >>>> will either consistently see the pgtables before or after this removal. It will >>>> never disappear during the walk, correct? >>>> >>>> I guess there is a risk of this showing up as contention with other init_mm >>>> write_lock holders. But I expect that pmd_free_pte_page()/pud_free_pmd_page() >>>> are called sufficiently rarely that the risk is very small. Let's fix any perf >>>> problem if/when we see it. >>> >>> We can avoid all of that by my initial approach - to wrap the lock around >>> CONFIG_PTDUMP_DEBUGFS. >>> I don't have a strong opinion, just putting it out there. >> >> (I wrote then failed to send earlier): >> >> It's ugly though. Personally I'd prefer to keep it simple unless we have clear >> evidence that its needed. I was just laying out my justification for not needing >> to doing the conditional wrapping in this comment. > > I really don't think we should be adding unconditional locking overhead > to core mm routines purely to facilitate a rarely used debug option. > > Instead, can we either adopt something like the RCU-like walk used by > fast GUP or stick the locking behind a static key that's only enabled > when a ptdump walk is in progress (a bit like how > hugetlb_vmemmap_optimize_folio() manipulates hugetlb_optimize_vmemmap_key)? My sense is that the static key will be less effort and can be contained fully in arm64. I think we would need to enable the key around the call to ptdump_walk_pgd() in both ptdump_walk() and ptdump_check_wx(). Then where Dev is currently taking the read lock, that would be contingent on the key being enabled and the unlock would be contingent on having taken the lock. Does that sound like an acceptable approach? > > Will