From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 28 Jun 2023 07:03:03 +0530
Message-Id: <87r0pwnzg0.fsf@doe.com>
From: Ritesh Harjani (IBM)
To: "Aneesh Kumar K.V", linux-mm@kvack.org, akpm@linux-foundation.org, mpe@ellerman.id.au, linuxppc-dev@lists.ozlabs.org, npiggin@gmail.com, christophe.leroy@csgroup.eu
Subject: Re: [PATCH v2 14/16] powerpc/book3s64/vmemmap: Switch radix to use
 a different vmemmap handling function
In-Reply-To: <20230616110826.344417-15-aneesh.kumar@linux.ibm.com>
List-Id: Linux on PowerPC Developers Mail List
Cc: Catalin Marinas, Muchun Song, "Aneesh Kumar K.V", Dan Williams, Oscar Salvador, Will Deacon, Joao Martins, Mike Kravetz

"Aneesh Kumar K.V" writes:

> This is in preparation to update radix to implement vmemmap optimization
> for devdax. Below are the rules w.r.t radix vmemmap mapping
>
> 1. First try to map things using PMD (2M)
> 2. With altmap if altmap cross-boundary check returns true, fall back to
>    PAGE_SIZE
> 3. If we can't allocate PMD_SIZE backing memory for vmemmap, fallback to
>    PAGE_SIZE
>
> On removing vmemmap mapping, check if every subsection that is using the
> vmemmap area is invalid. If found to be invalid, that implies we can safely
> free the vmemmap area. We don't use the PAGE_UNUSED pattern used by x86
> because with 64K page size, we need to do the above check even at the
> PAGE_SIZE granularity.
>
> Signed-off-by: Aneesh Kumar K.V
> ---
>  arch/powerpc/include/asm/book3s/64/radix.h |   2 +
>  arch/powerpc/include/asm/pgtable.h         |   3 +
>  arch/powerpc/mm/book3s64/radix_pgtable.c   | 319 +++++++++++++++++++--
>  arch/powerpc/mm/init_64.c                  |  26 +-
>  4 files changed, 319 insertions(+), 31 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
> index 8cdff5a05011..87d4c1e62491 100644
> --- a/arch/powerpc/include/asm/book3s/64/radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> @@ -332,6 +332,8 @@ extern int __meminit radix__vmemmap_create_mapping(unsigned long start,
>  					     unsigned long phys);
>  int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end,
>  				      int node, struct vmem_altmap *altmap);
> +void __ref radix__vmemmap_free(unsigned long start, unsigned long end,
> +			       struct vmem_altmap *altmap);
>  extern void radix__vmemmap_remove_mapping(unsigned long start,
>  				    unsigned long page_size);
>
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index 9972626ddaf6..6d4cd2ebae6e 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -168,6 +168,9 @@ static inline bool is_ioremap_addr(const void *x)
>
>  struct seq_file;
>  void arch_report_meminfo(struct seq_file *m);
> +int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size);
> +bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
> +			   unsigned long page_size);
>  #endif /* CONFIG_PPC64 */
>
>  #endif /* __ASSEMBLY__ */
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index d7e2dd3d4add..ef886fab643d 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -742,8 +742,57 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
>  	p4d_clear(p4d);
>  }
>
> +static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned long end)
> +{
> +	unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
> +
> +	return !vmemmap_populated(start, PMD_SIZE);
> +}
> +
> +static bool __meminit vmemmap_page_is_unused(unsigned long addr, unsigned long end)
> +{
> +	unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE);
> +
> +	return !vmemmap_populated(start, PAGE_SIZE);
> +
> +}
> +
> +static void __meminit free_vmemmap_pages(struct page *page,
> +					 struct vmem_altmap *altmap,
> +					 int order)
> +{
> +	unsigned int nr_pages = 1 << order;
> +
> +	if (altmap) {
> +		unsigned long alt_start, alt_end;
> +		unsigned long base_pfn = page_to_pfn(page);
> +
> +		/*
> +		 * with 1G vmemmap mmaping we can have things setup
> +		 * such that even though atlmap is specified we never
> +		 * used altmap.
> +		 */
> +		alt_start = altmap->base_pfn;
> +		alt_end = altmap->base_pfn + altmap->reserve +
> +			  altmap->free + altmap->alloc + altmap->align;
> +
> +		if (base_pfn >= alt_start && base_pfn < alt_end) {
> +			vmem_altmap_free(altmap, nr_pages);
> +			return;
> +		}
> +	}
> +
> +	if (PageReserved(page)) {
> +		/* allocated from memblock */
> +		while (nr_pages--)
> +			free_reserved_page(page++);
> +	} else
> +		free_pages((unsigned long)page_address(page), order);
> +}
> +
>  static void remove_pte_table(pte_t *pte_start, unsigned long addr,
> -			     unsigned long end, bool direct)
> +			     unsigned long end, bool direct,
> +			     struct vmem_altmap *altmap)
>  {
>  	unsigned long next, pages = 0;
>  	pte_t *pte;
> @@ -757,24 +806,23 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr,
>  		if (!pte_present(*pte))
>  			continue;
>
> -		if (!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(next)) {
> -			/*
> -			 * The vmemmap_free() and remove_section_mapping()
> -			 * codepaths call us with aligned addresses.
> -			 */
> -			WARN_ONCE(1, "%s: unaligned range\n", __func__);
> -			continue;
> +		if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
> +			if (!direct)
> +				free_vmemmap_pages(pte_page(*pte), altmap, 0);
> +			pte_clear(&init_mm, addr, pte);
> +			pages++;
> +		} else if (!direct && vmemmap_page_is_unused(addr, next)) {
> +			free_vmemmap_pages(pte_page(*pte), altmap, 0);
> +			pte_clear(&init_mm, addr, pte);
>  		}
> -
> -		pte_clear(&init_mm, addr, pte);
> -		pages++;
>  	}
>  	if (direct)
>  		update_page_count(mmu_virtual_psize, -pages);
>  }
>
>  static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
> -				       unsigned long end, bool direct)
> +				       unsigned long end, bool direct,
> +				       struct vmem_altmap *altmap)
>  {
>  	unsigned long next, pages = 0;
>  	pte_t *pte_base;
> @@ -788,18 +836,21 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
>  			continue;
>
>  		if (pmd_is_leaf(*pmd)) {
> -			if (!IS_ALIGNED(addr, PMD_SIZE) ||
> -			    !IS_ALIGNED(next, PMD_SIZE)) {
> -				WARN_ONCE(1, "%s: unaligned range\n", __func__);
> -				continue;
> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
> +			    IS_ALIGNED(next, PMD_SIZE)) {
> +				if (!direct)
> +					free_vmemmap_pages(pmd_page(*pmd), altmap, get_order(PMD_SIZE));
> +				pte_clear(&init_mm, addr, (pte_t *)pmd);
> +				pages++;
> +			} else if (vmemmap_pmd_is_unused(addr, next)) {
> +				free_vmemmap_pages(pmd_page(*pmd), altmap, get_order(PMD_SIZE));
> +				pte_clear(&init_mm, addr, (pte_t *)pmd);
>  			}
> -			pte_clear(&init_mm, addr, (pte_t *)pmd);
> -			pages++;
>  			continue;
>  		}
>
>  		pte_base = (pte_t *)pmd_page_vaddr(*pmd);
> -		remove_pte_table(pte_base, addr, next, direct);
> +		remove_pte_table(pte_base, addr, next, direct, altmap);
>  		free_pte_table(pte_base, pmd);
>  	}
>  	if (direct)
> @@ -807,7 +858,8 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
>  }
>
>  static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
> -				       unsigned long end, bool direct)
> +				       unsigned long end, bool direct,
> +				       struct vmem_altmap *altmap)
>  {
>  	unsigned long next, pages = 0;
>  	pmd_t *pmd_base;
> @@ -832,15 +884,16 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
>  		}
>
>  		pmd_base = pud_pgtable(*pud);
> -		remove_pmd_table(pmd_base, addr, next, direct);
> +		remove_pmd_table(pmd_base, addr, next, direct, altmap);
>  		free_pmd_table(pmd_base, pud);
>  	}
>  	if (direct)
>  		update_page_count(MMU_PAGE_1G, -pages);
>  }
>
> -static void __meminit remove_pagetable(unsigned long start, unsigned long end,
> -				       bool direct)
> +static void __meminit
> +remove_pagetable(unsigned long start, unsigned long end, bool direct,
> +		 struct vmem_altmap *altmap)
>  {
>  	unsigned long addr, next;
>  	pud_t *pud_base;
> @@ -869,7 +922,7 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end,
>  		}
>
>  		pud_base = p4d_pgtable(*p4d);
> -		remove_pud_table(pud_base, addr, next, direct);
> +		remove_pud_table(pud_base, addr, next, direct, altmap);
>  		free_pud_table(pud_base, p4d);
>  	}
>
> @@ -892,7 +945,7 @@ int __meminit radix__create_section_mapping(unsigned long start,
>
>  int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end)
>  {
> -	remove_pagetable(start, end, true);
> +	remove_pagetable(start, end, true, NULL);
>  	return 0;
>  }
>  #endif /* CONFIG_MEMORY_HOTPLUG */
> @@ -924,10 +977,224 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start,
>  	return 0;
>  }
>
> +int __meminit vmemmap_check_pmd(pmd_t *pmd, int node,
> +				unsigned long addr, unsigned long next)
> +{
> +	int large = pmd_large(*pmd);
> +
> +	if (pmd_large(*pmd))

We already have the value of pmd_large() in the "large" variable, so this
can just be if (large), right?

> +		vmemmap_verify((pte_t *)pmd, node, addr, next);

Maybe we can use the pmdp_ptep() function here, which we used in the 1st
patch? Also, shouldn't the function argument be named pmdp instead of pmd?

> +
> +	return large;
> +}
> +
> +void __meminit vmemmap_set_pmd(pmd_t *pmdp, void *p, int node,
> +			       unsigned long addr, unsigned long next)
> +{
> +	pte_t entry;
> +	pte_t *ptep = pmdp_ptep(pmdp);
> +
> +	VM_BUG_ON(!IS_ALIGNED(addr, PMD_SIZE));
> +	entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
> +	set_pte_at(&init_mm, addr, ptep, entry);
> +	asm volatile("ptesync": : :"memory");
> +
> +	vmemmap_verify(ptep, node, addr, next);
> +}
> +
> +static pte_t * __meminit radix__vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> +						       struct vmem_altmap *altmap,
> +						       struct page *reuse)
> +{
> +	pte_t *pte = pte_offset_kernel(pmd, addr);
> +
> +	if (pte_none(*pte)) {
> +		pte_t entry;
> +		void *p;
> +
> +		if (!reuse) {
> +			/*
> +			 * make sure we don't create altmap mappings
> +			 * covering things outside the device.
> +			 */
> +			if (altmap && altmap_cross_boundary(altmap, addr, PAGE_SIZE))
> +				altmap = NULL;
> +
> +			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> +			if (!p) {
> +				if (altmap)
> +					p = vmemmap_alloc_block_buf(PAGE_SIZE, node, NULL);
> +				if (!p)
> +					return NULL;
> +			}

The nested if conditions above are confusing on a first read. Can we
flatten them like this? Did I get it right?

	if (!p && altmap)
		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, NULL);
	if (!p)
		return NULL;

-ritesh
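
A quick user-space sketch suggests the nested and the flattened shapes behave the same, including how many times the allocator is called. Everything here is an assumption made for illustration and not kernel code: stub_alloc() is a hypothetical stand-in for vmemmap_alloc_block_buf(), the have_altmap flag stands for "altmap is non-NULL after the cross-boundary check", and alloc_nested(), alloc_flat() and check_equivalence() are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static int backing;	/* dummy object so a successful allocation is non-NULL */
static int nr_calls;	/* number of stub_alloc() calls since the last reset */

/*
 * Hypothetical stand-in for vmemmap_alloc_block_buf(): an altmap-backed
 * request succeeds only if altmap_ok, a normal request only if normal_ok.
 */
static void *stub_alloc(bool use_altmap, bool altmap_ok, bool normal_ok)
{
	nr_calls++;
	if (use_altmap)
		return altmap_ok ? &backing : NULL;
	return normal_ok ? &backing : NULL;
}

/* Control flow as posted in the patch. */
static void *alloc_nested(bool have_altmap, bool altmap_ok, bool normal_ok)
{
	void *p = stub_alloc(have_altmap, altmap_ok, normal_ok);

	if (!p) {
		if (have_altmap)
			p = stub_alloc(false, altmap_ok, normal_ok);
		if (!p)
			return NULL;
	}
	return p;
}

/* Control flow as suggested in this review. */
static void *alloc_flat(bool have_altmap, bool altmap_ok, bool normal_ok)
{
	void *p = stub_alloc(have_altmap, altmap_ok, normal_ok);

	if (!p && have_altmap)
		p = stub_alloc(false, altmap_ok, normal_ok);
	if (!p)
		return NULL;
	return p;
}

/* Walk all 8 input combinations; return how many were compared. */
static int check_equivalence(void)
{
	int checked = 0;

	for (int mask = 0; mask < 8; mask++) {
		bool have_altmap = mask & 1;
		bool altmap_ok = mask & 2;
		bool normal_ok = mask & 4;
		void *a, *b;
		int calls_a, calls_b;

		nr_calls = 0;
		a = alloc_nested(have_altmap, altmap_ok, normal_ok);
		calls_a = nr_calls;

		nr_calls = 0;
		b = alloc_flat(have_altmap, altmap_ok, normal_ok);
		calls_b = nr_calls;

		assert(a == b);			/* same result */
		assert(calls_a == calls_b);	/* same number of attempts */
		checked++;
	}
	return checked;
}
```

Calling check_equivalence() from a small main() exercises every combination. This only compares control flow; it says nothing about the real allocator's side effects.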