Date: Wed, 30 Aug 2023 16:09:58 +0800
Subject: Re: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
To: Mike Kravetz, Joao Martins
Cc: Muchun Song, Oscar Salvador, David
 Hildenbrand, Miaohe Lin, David Rientjes, Anshuman Khandual,
 Naoya Horiguchi, Barry Song, Michal Hocko, Matthew Wilcox,
 Xiongchun Duan, Andrew Morton, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <20230825190436.55045-1-mike.kravetz@oracle.com>
 <20230825190436.55045-11-mike.kravetz@oracle.com>
From: Muchun Song
In-Reply-To: <20230825190436.55045-11-mike.kravetz@oracle.com>

On 2023/8/26 03:04, Mike Kravetz wrote:
> From: Joao Martins
>
> In an effort to minimize amount of TLB flushes, batch all PMD splits
> belonging to a range of pages in order to perform only 1 (global) TLB
> flush. This brings down from 14.2secs into 7.9secs a 1T hugetlb
> allocation.
>
> Rebased by Mike Kravetz
>
> Signed-off-by: Joao Martins
> Signed-off-by: Mike Kravetz
> ---
>  mm/hugetlb_vmemmap.c | 94 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 90 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 500a118915ff..904a64fe5669 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -26,6 +26,7 @@
>   * @reuse_addr:      the virtual address of the @reuse_page page.
>   * @vmemmap_pages:   the list head of the vmemmap pages that can be freed
>   *                   or is mapped from.
> + * @flags            used to modify behavior in bulk operations
>   */
>  struct vmemmap_remap_walk {
>          void            (*remap_pte)(pte_t *pte, unsigned long addr,
> @@ -34,9 +35,11 @@ struct vmemmap_remap_walk {
>          struct page     *reuse_page;
>          unsigned long   reuse_addr;
>          struct list_head *vmemmap_pages;
> +#define VMEMMAP_REMAP_ONLY_SPLIT        BIT(0)
> +        unsigned long   flags;
>  };
>
> -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool bulk)
>  {
>          pmd_t __pmd;
>          int i;
> @@ -79,7 +82,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
>                  /* Make pte visible before pmd. See comment in pmd_install(). */
>                  smp_wmb();
>                  pmd_populate_kernel(&init_mm, pmd, pgtable);
> -                flush_tlb_kernel_range(start, start + PMD_SIZE);
> +                if (!bulk)
> +                        flush_tlb_kernel_range(start, start + PMD_SIZE);

A little weird to me. @bulk is used to indicate whether the TLB should
be flushed, but the flag name is VMEMMAP_REMAP_ONLY_SPLIT, which seems
to say that @bulk (calculated from walk->flags & VMEMMAP_REMAP_ONLY_SPLIT)
is an indicator to only split the huge pmd entry. I think it would be
better to introduce another flag like VMEMMAP_SPLIT_WITHOUT_FLUSH to
indicate whether the TLB should be flushed.
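
An untested sketch of what I mean (the exact signature, e.g. passing the
walk flags down instead of a bool, is just for illustration):

    #define VMEMMAP_SPLIT_WITHOUT_FLUSH    BIT(0)

    static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start,
                                      unsigned long flags)
    {
            /* ... split the huge pmd into a pte page table as before ... */

            /* The flag names exactly what it suppresses: the TLB flush. */
            if (!(flags & VMEMMAP_SPLIT_WITHOUT_FLUSH))
                    flush_tlb_kernel_range(start, start + PMD_SIZE);

            return 0;
    }

vmemmap_remap_split() would then set VMEMMAP_SPLIT_WITHOUT_FLUSH in
walk.flags, so a single global flush can still be done after the whole
batch.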
>          } else {
>                  pte_free_kernel(&init_mm, pgtable);
>          }
> @@ -119,18 +123,28 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
>                               unsigned long end,
>                               struct vmemmap_remap_walk *walk)
>  {
> +        bool bulk;
>          pmd_t *pmd;
>          unsigned long next;
>
> +        bulk = walk->flags & VMEMMAP_REMAP_ONLY_SPLIT;
>          pmd = pmd_offset(pud, addr);
>          do {
>                  int ret;
>
> -                ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
> +                ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK, bulk);
>                  if (ret)
>                          return ret;
>
>                  next = pmd_addr_end(addr, end);
> +
> +                /*
> +                 * We are only splitting, not remapping the hugetlb vmemmap
> +                 * pages.
> +                 */
> +                if (bulk)
> +                        continue;

Actually, we do not need a flag to detect this situation; you could use
"!@walk->remap_pte" to determine whether we should go into the next
level traversal of the page table. ->remap_pte is used to traverse the
pte entries, so it makes sense to continue to the next pmd entry if it
is NULL.
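
That is, roughly (untested, assuming the flags are passed down to
split_vmemmap_huge_pmd() as sketched above):

    static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
                                 unsigned long end,
                                 struct vmemmap_remap_walk *walk)
    {
            pmd_t *pmd;
            unsigned long next;

            pmd = pmd_offset(pud, addr);
            do {
                    int ret;

                    ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK,
                                                 walk->flags);
                    if (ret)
                            return ret;

                    next = pmd_addr_end(addr, end);

                    /*
                     * A split-only walk has no ->remap_pte, so there is
                     * nothing to do at the pte level.
                     */
                    if (!walk->remap_pte)
                            continue;

                    vmemmap_pte_range(pmd, addr, next, walk);
            } while (pmd++, addr = next, addr != end);

            return 0;
    }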
> +
>                  vmemmap_pte_range(pmd, addr, next, walk);
>          } while (pmd++, addr = next, addr != end);
>
> @@ -197,7 +211,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>                          return ret;
>          } while (pgd++, addr = next, addr != end);
>
> -        flush_tlb_kernel_range(start, end);
> +        if (!(walk->flags & VMEMMAP_REMAP_ONLY_SPLIT))
> +                flush_tlb_kernel_range(start, end);

This could be:

    if (walk->remap_pte)
            flush_tlb_kernel_range(start, end);

>
>          return 0;
>  }
> @@ -296,6 +311,48 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>          set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>  }
>
> +/**
> + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
> + *                       backing PMDs of the directmap into PTEs
> + * @start:      start address of the vmemmap virtual address range that we want
> + *              to remap.
> + * @end:        end address of the vmemmap virtual address range that we want to
> + *              remap.
> + * @reuse:      reuse address.
> + *
> + * Return: %0 on success, negative error code otherwise.
> + */
> +static int vmemmap_remap_split(unsigned long start, unsigned long end,
> +                               unsigned long reuse)
> +{
> +        int ret;
> +        LIST_HEAD(vmemmap_pages);

Unused variable?

> +        struct vmemmap_remap_walk walk = {
> +                .flags = VMEMMAP_REMAP_ONLY_SPLIT,
> +        };
> +
> +        /*
> +         * In order to make remapping routine most efficient for the huge pages,
> +         * the routine of vmemmap page table walking has the following rules
> +         * (see more details from the vmemmap_pte_range()):
> +         *
> +         * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
> +         *   should be continuous.
> +         * - The @reuse address is part of the range [@reuse, @end) that we are
> +         *   walking which is passed to vmemmap_remap_range().
> +         * - The @reuse address is the first in the complete range.
> +         *
> +         * So we need to make sure that @start and @reuse meet the above rules.
> +         */

The comments are duplicated; something like:

    /* See the comment in the vmemmap_remap_free(). */

is enough.

> +        BUG_ON(start - reuse != PAGE_SIZE);
> +
> +        mmap_read_lock(&init_mm);
> +        ret = vmemmap_remap_range(reuse, end, &walk);
> +        mmap_read_unlock(&init_mm);
> +
> +        return ret;
> +}
> +
>  /**
>   * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
>   *                      to the page which @reuse is mapped to, then free vmemmap
> @@ -320,6 +377,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>                  .remap_pte      = vmemmap_remap_pte,
>                  .reuse_addr     = reuse,
>                  .vmemmap_pages  = &vmemmap_pages,
> +                .flags          = 0,
>          };
>          int nid = page_to_nid((struct page *)start);
>          gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
> @@ -606,11 +664,39 @@ void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
>          __hugetlb_vmemmap_optimize(h, head, bulk_pages);
>  }
>
> +void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
> +{
> +        unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> +        unsigned long vmemmap_reuse;
> +
> +        if (!vmemmap_should_optimize(h, head))
> +                return;
> +
> +        static_branch_inc(&hugetlb_optimize_vmemmap_key);

Why? hugetlb_optimize_vmemmap_key is used as a switch to let
page_fixed_fake_head() work properly for vmemmap-optimized HugeTLB
pages; however, this function only splits the huge pmd entry without
optimizing the vmemmap pages, so it is wrong to increase the static_key
here (see the sketch after the remaining quoted hunk below).

Thanks.

> +
> +        vmemmap_end     = vmemmap_start + hugetlb_vmemmap_size(h);
> +        vmemmap_reuse   = vmemmap_start;
> +        vmemmap_start   += HUGETLB_VMEMMAP_RESERVE_SIZE;
> +
> +        /*
> +         * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
> +         * to the page which @vmemmap_reuse is mapped to, then free the pages
> +         * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
> +         */
> +        if (vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse))
> +                static_branch_dec(&hugetlb_optimize_vmemmap_key);
> +}
> +
>  void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
>  {
>          struct folio *folio;
>          LIST_HEAD(vmemmap_pages);
>
> +        list_for_each_entry(folio, folio_list, lru)
> +                hugetlb_vmemmap_split(h, &folio->page);
> +
> +        flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
> +
>          list_for_each_entry(folio, folio_list, lru)
>                  hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
>
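
FWIW, an untested sketch of hugetlb_vmemmap_split() with the static_key
accounting dropped, just to illustrate the comment above (assuming a
failed split needs no unwinding here, since the later optimize step will
simply find the PMD still intact and split it then):

    void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
    {
            unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
            unsigned long vmemmap_reuse;

            if (!vmemmap_should_optimize(h, head))
                    return;

            vmemmap_end   = vmemmap_start + hugetlb_vmemmap_size(h);
            vmemmap_reuse = vmemmap_start;
            vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;

            /*
             * Only split the PMDs backing [vmemmap_start, vmemmap_end); the
             * actual remapping (and the static_branch accounting) is left to
             * hugetlb_vmemmap_optimize_bulk().
             */
            vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
    }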