Message-ID: <1147332f-790e-487f-8816-1860b8744ab2@huawei.com>
Date: Fri, 16 Aug 2024 11:05:33 +0800
Subject: Re: [PATCH 00/19] mm: Support huge pfnmaps
To: Peter Xu, Jason Gunthorpe
CC: Sean Christopherson, Oscar Salvador, Axel Rasmussen, Will Deacon,
 Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
 Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
 Thomas Gleixner, Dave Hansen, Alex Williamson, Yan Zhao
References: <20240809160909.1023470-1-peterx@redhat.com>
 <20240814123715.GB2032816@nvidia.com>
From: Kefeng Wang

On 2024/8/16 3:20, Peter Xu wrote:
> On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote:
>>> Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.
>>
>> There is definitely interest here in extending ARM to support the 1G
>> size too, what is missing?
>
> Currently PUD pfnmap relies on THP_PUD config option:
>
> config ARCH_SUPPORTS_PUD_PFNMAP
> 	def_bool y
> 	depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>
> Arm64 unfortunately doesn't yet support dax 1G, so not applicable yet.
>
> Ideally, pfnmap is too simple comparing to real THPs and it shouldn't
> require to depend on THP at all, but we'll need things like below to land
> first:
>
> https://lore.kernel.org/r/20240717220219.3743374-1-peterx@redhat.com
>
> I sent that first a while ago, but I didn't collect enough inputs, and I
> decided to unblock this series from that, so x86_64 shouldn't be affected,
> and arm64 will at least start to have 2M.
>
>>
>>> The other trick is how to allow gup-fast working for such huge mappings
>>> even if there's no direct sign of knowing whether it's a normal page or
>>> MMIO mapping. This series chose to keep the pte_special solution, so that
>>> it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so that
>>> gup-fast will be able to identify them and fail properly.
>>
>> Make sense
>>
>>> More architectures / More page sizes
>>> ------------------------------------
>>>
>>> Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
>>>
>>> For example, if arm64 can start to support THP_PUD one day, the huge pfnmap
>>> on 1G will be automatically enabled.

Below is a draft patch to enable THP_PUD on arm64. It has only passed
DEBUG_VM_PGTABLE so far, but with it we may be able to test pud pfnmaps
on arm64.

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2f8ff354ca6..ff0d27c72020 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -184,6 +184,7 @@ config ARM64
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if PGTABLE_LEVELS > 2
 	select HAVE_ARCH_VMAP_STACK
 	select HAVE_ARM_SMCCC
 	select HAVE_ASM_MODVERSIONS
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7a4f5604be3f..e013fe458476 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -763,6 +763,25 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 #define pud_valid(pud)		pte_valid(pud_pte(pud))
 #define pud_user(pud)		pte_user(pud_pte(pud))
 #define pud_user_exec(pud)	pte_user_exec(pud_pte(pud))
+#define pud_dirty(pud)		pte_dirty(pud_pte(pud))
+#define pud_devmap(pud)		pte_devmap(pud_pte(pud))
+#define pud_wrprotect(pud)	pte_pud(pte_wrprotect(pud_pte(pud)))
+#define pud_mkold(pud)		pte_pud(pte_mkold(pud_pte(pud)))
+#define pud_mkwrite(pud)	pte_pud(pte_mkwrite_novma(pud_pte(pud)))
+#define pud_mkclean(pud)	pte_pud(pte_mkclean(pud_pte(pud)))
+#define pud_mkdirty(pud)	pte_pud(pte_mkdirty(pud_pte(pud)))
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline int pud_trans_huge(pud_t pud)
+{
+	return pud_val(pud) && pud_present(pud) && !(pud_val(pud) & PUD_TABLE_BIT);
+}
+
+static inline pud_t pud_mkdevmap(pud_t pud)
+{
+	return pte_pud(set_pte_bit(pud_pte(pud), __pgprot(PTE_DEVMAP)));
+}
+#endif
 
 static inline bool pgtable_l4_enabled(void);
 
@@ -1137,10 +1156,20 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
 					pmd_pte(entry), dirty);
 }
 
+static inline int pudp_set_access_flags(struct vm_area_struct *vma,
+					unsigned long address, pud_t *pudp,
+					pud_t entry, int dirty)
+{
+	return __ptep_set_access_flags(vma, address, (pte_t *)pudp,
+					pud_pte(entry), dirty);
+}
+
+#ifndef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static inline int pud_devmap(pud_t pud)
 {
 	return 0;
 }
+#endif
 
 static inline int pgd_devmap(pgd_t pgd)
 {
@@ -1213,6 +1242,13 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 {
 	return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
 }
+
+static inline int pudp_test_and_clear_young(struct vm_area_struct *vma,
+					    unsigned long address,
+					    pud_t *pudp)
+{
+	return __ptep_test_and_clear_young(vma, address, (pte_t *)pudp);
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
@@ -1433,6 +1469,7 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
 #define update_mmu_cache(vma, addr, ptep) \
 	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
 #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)
+#define update_mmu_cache_pud(vma, address, pud) do { } while (0)
 
 #ifdef CONFIG_ARM64_PA_BITS_52
 #define phys_to_ttbr(addr)	(((addr) | ((addr) >> 46)) & TTBR_BADDR_MASK_52)
-- 
2.27.0

>>
>> Oh that sounds like a bigger step..
>
> Just to mention, no real THP 1G needed here for pfnmaps. The real gap here
> is only about the pud helpers that only exists so far with CONFIG_THP_PUD
> in huge_memory.c.
>
>>
>>> VFIO is so far the only consumer for the huge pfnmaps after this series
>>> applied. Besides above remap_pfn_range() generic optimization, device
>>> driver can also try to optimize its mmap() on a better VA alignment for
>>> either PMD/PUD sizes. This may, iiuc, normally require userspace changes,
>>> as the driver doesn't normally decide the VA to map a bar. But I don't
>>> think I know all the drivers to know the full picture.
>>
>> How does alignment work? In most cases I'm aware of the userspace does
>> not use MAP_FIXED so the expectation would be for the kernel to
>> automatically select a high alignment. I suppose your cases are
>> working because qemu uses MAP_FIXED and naturally aligns the BAR
>> addresses?
>>
>>> - x86_64 + AMD GPU
>>>   - Needs Alex's modified QEMU to guarantee proper VA alignment to make
>>>     sure all pages to be mapped with PUDs
>>
>> Oh :(
>
> So I suppose this answers above. :) Yes, alignment needed.
>
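
On the alignment point above, here is a rough userspace sketch of how a
PMD- or PUD-aligned VA can be obtained without relying on the kernel to
pick one: over-reserve with PROT_NONE, round up, trim, then map with
MAP_FIXED on top. This is only an illustration under assumptions, not what
QEMU or VFIO actually does. The helper names (reserve_aligned, map_aligned)
and the size macros are made up for this example, 4K base pages are
assumed, `len` is assumed to be a multiple of the page size, and anonymous
memory stands in for a device BAR when fd is -1. An aligned VA alone is
presumably not sufficient; the backing offset/PFN has to be aligned the
same way for a huge mapping to be possible.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>

/* Hypothetical names; values assume 4K base pages. */
#define PMD_ALIGN_2M (2UL << 20)
#define PUD_ALIGN_1G (1UL << 30)

/*
 * Reserve `len` bytes of address space whose start is a multiple of
 * `align` (a power of two, at least the page size).
 * Returns MAP_FAILED on error.
 */
static void *reserve_aligned(size_t len, size_t align)
{
	assert(align && (align & (align - 1)) == 0);

	/* Over-reserve so that an aligned start always fits inside. */
	size_t span = len + align;
	char *raw = mmap(NULL, span, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (raw == MAP_FAILED)
		return MAP_FAILED;

	uintptr_t start = ((uintptr_t)raw + align - 1) & ~((uintptr_t)align - 1);
	char *aligned = (char *)start;

	/* Give back the unused head and tail of the reservation. */
	if (aligned > raw)
		munmap(raw, (size_t)(aligned - raw));
	size_t tail = (size_t)((raw + span) - (aligned + len));
	if (tail)
		munmap(aligned + len, tail);

	return aligned;
}

/*
 * Map `len` bytes of `fd` at `offset` on top of an aligned reservation.
 * fd == -1 maps anonymous memory as a stand-in for a device BAR.
 */
static void *map_aligned(int fd, off_t offset, size_t len, size_t align)
{
	void *va = reserve_aligned(len, align);
	if (va == MAP_FAILED)
		return MAP_FAILED;

	int flags = MAP_FIXED |
		    (fd < 0 ? MAP_PRIVATE | MAP_ANONYMOUS : MAP_SHARED);
	void *p = mmap(va, len, PROT_READ | PROT_WRITE, flags, fd, offset);
	if (p == MAP_FAILED)
		munmap(va, len);
	return p;
}
```

A caller would use something like map_aligned(bar_fd, bar_offset, bar_size,
PUD_ALIGN_1G), where bar_fd, bar_offset and bar_size are placeholders for
whatever the device interface reports. MAP_FIXED is safe here only because
the target range was reserved by the same process just before.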