From: Muchun Song <muchun.song@linux.dev>
To: Mike Kravetz <mike.kravetz@oracle.com>,
Joao Martins <joao.m.martins@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>,
Oscar Salvador <osalvador@suse.de>,
David Hildenbrand <david@redhat.com>,
Miaohe Lin <linmiaohe@huawei.com>,
David Rientjes <rientjes@google.com>,
Anshuman Khandual <anshuman.khandual@arm.com>,
Naoya Horiguchi <naoya.horiguchi@linux.dev>,
Barry Song <song.bao.hua@hisilicon.com>,
Michal Hocko <mhocko@suse.com>,
Matthew Wilcox <willy@infradead.org>,
Xiongchun Duan <duanxiongchun@bytedance.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 10/12] hugetlb: batch PMD split for bulk vmemmap dedup
Date: Wed, 30 Aug 2023 16:09:58 +0800
Message-ID: <e769f96e-cd03-0530-da7a-35d9de03edfc@linux.dev>
In-Reply-To: <20230825190436.55045-11-mike.kravetz@oracle.com>
On 2023/8/26 03:04, Mike Kravetz wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
>
> In an effort to minimize amount of TLB flushes, batch all PMD splits
> belonging to a range of pages in order to perform only 1 (global) TLB
> flush. This brings down from 14.2secs into 7.9secs a 1T hugetlb
> allocation.
>
> Rebased by Mike Kravetz
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> mm/hugetlb_vmemmap.c | 94 ++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 90 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 500a118915ff..904a64fe5669 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -26,6 +26,7 @@
> * @reuse_addr: the virtual address of the @reuse_page page.
> * @vmemmap_pages: the list head of the vmemmap pages that can be freed
> * or is mapped from.
> + * @flags used to modify behavior in bulk operations
> */
> struct vmemmap_remap_walk {
> void (*remap_pte)(pte_t *pte, unsigned long addr,
> @@ -34,9 +35,11 @@ struct vmemmap_remap_walk {
> struct page *reuse_page;
> unsigned long reuse_addr;
> struct list_head *vmemmap_pages;
> +#define VMEMMAP_REMAP_ONLY_SPLIT BIT(0)
> + unsigned long flags;
> };
>
> -static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, bool bulk)
> {
> pmd_t __pmd;
> int i;
> @@ -79,7 +82,8 @@ static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
> /* Make pte visible before pmd. See comment in pmd_install(). */
> smp_wmb();
> pmd_populate_kernel(&init_mm, pmd, pgtable);
> - flush_tlb_kernel_range(start, start + PMD_SIZE);
> + if (!bulk)
> + flush_tlb_kernel_range(start, start + PMD_SIZE);
A little weird to me. @bulk is used to indicate whether the TLB
should be flushed; however, the flag name is VMEMMAP_REMAP_ONLY_SPLIT,
which suggests that @bulk (calculated from walk->flags &
VMEMMAP_REMAP_ONLY_SPLIT) is an indicator to only split the huge pmd
entry. I think it would be better to introduce another flag, like
VMEMMAP_SPLIT_WITHOUT_FLUSH, to indicate whether the TLB should be
flushed.
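E.g. something like this (an untested sketch; the flag name is only
my suggestion, not something in this series):

	#define VMEMMAP_SPLIT_WITHOUT_FLUSH	BIT(0)

	static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start,
					  unsigned long flags)
	{
		...
		/* The caller batches the flush, so skip the per-PMD one. */
		if (!(flags & VMEMMAP_SPLIT_WITHOUT_FLUSH))
			flush_tlb_kernel_range(start, start + PMD_SIZE);
		...
	}

Then the flag name matches the behavior it actually controls.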
> } else {
> pte_free_kernel(&init_mm, pgtable);
> }
> @@ -119,18 +123,28 @@ static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
> unsigned long end,
> struct vmemmap_remap_walk *walk)
> {
> + bool bulk;
> pmd_t *pmd;
> unsigned long next;
>
> + bulk = walk->flags & VMEMMAP_REMAP_ONLY_SPLIT;
> pmd = pmd_offset(pud, addr);
> do {
> int ret;
>
> - ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK);
> + ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK, bulk);
> if (ret)
> return ret;
>
> next = pmd_addr_end(addr, end);
> +
> + /*
> + * We are only splitting, not remapping the hugetlb vmemmap
> + * pages.
> + */
> + if (bulk)
> + continue;
Actually, we do not need a flag to detect this situation; you could
use "!walk->remap_pte" to determine whether we should descend into
the pte level of the page table. ->remap_pte is used to traverse the
pte entries, so it makes sense to continue to the next pmd entry when
it is NULL.
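IOW, something like (untested):

	next = pmd_addr_end(addr, end);

	/*
	 * A NULL ->remap_pte means we are only splitting the huge
	 * pmd, so there is no pte-level work to do.
	 */
	if (!walk->remap_pte)
		continue;

	vmemmap_pte_range(pmd, addr, next, walk);

so neither @bulk nor VMEMMAP_REMAP_ONLY_SPLIT is needed here.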
> +
> vmemmap_pte_range(pmd, addr, next, walk);
> } while (pmd++, addr = next, addr != end);
>
> @@ -197,7 +211,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
> return ret;
> } while (pgd++, addr = next, addr != end);
>
> - flush_tlb_kernel_range(start, end);
> + if (!(walk->flags & VMEMMAP_REMAP_ONLY_SPLIT))
> + flush_tlb_kernel_range(start, end);
This could be:

	if (walk->remap_pte)
		flush_tlb_kernel_range(start, end);
>
> return 0;
> }
> @@ -296,6 +311,48 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> }
>
> +/**
> + * vmemmap_remap_split - split the vmemmap virtual address range [@start, @end)
> + * backing PMDs of the directmap into PTEs
> + * @start: start address of the vmemmap virtual address range that we want
> + * to remap.
> + * @end: end address of the vmemmap virtual address range that we want to
> + * remap.
> + * @reuse: reuse address.
> + *
> + * Return: %0 on success, negative error code otherwise.
> + */
> +static int vmemmap_remap_split(unsigned long start, unsigned long end,
> + unsigned long reuse)
> +{
> + int ret;
> + LIST_HEAD(vmemmap_pages);
Unused variable?
> + struct vmemmap_remap_walk walk = {
> + .flags = VMEMMAP_REMAP_ONLY_SPLIT,
> + };
> +
> + /*
> + * In order to make remapping routine most efficient for the huge pages,
> + * the routine of vmemmap page table walking has the following rules
> + * (see more details from the vmemmap_pte_range()):
> + *
> + * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
> + * should be continuous.
> + * - The @reuse address is part of the range [@reuse, @end) that we are
> + * walking which is passed to vmemmap_remap_range().
> + * - The @reuse address is the first in the complete range.
> + *
> + * So we need to make sure that @start and @reuse meet the above rules.
> + */
The comments are duplicated; something like:

	/* See the comment in vmemmap_remap_free(). */

is enough.
> + BUG_ON(start - reuse != PAGE_SIZE);
> +
> + mmap_read_lock(&init_mm);
> + ret = vmemmap_remap_range(reuse, end, &walk);
> + mmap_read_unlock(&init_mm);
> +
> + return ret;
> +}
> +
> /**
> * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> * to the page which @reuse is mapped to, then free vmemmap
> @@ -320,6 +377,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> .remap_pte = vmemmap_remap_pte,
> .reuse_addr = reuse,
> .vmemmap_pages = &vmemmap_pages,
> + .flags = 0,
> };
> int nid = page_to_nid((struct page *)start);
> gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
> @@ -606,11 +664,39 @@ void hugetlb_vmemmap_optimize_bulk(const struct hstate *h, struct page *head,
> __hugetlb_vmemmap_optimize(h, head, bulk_pages);
> }
>
> +void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
> +{
> + unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
> + unsigned long vmemmap_reuse;
> +
> + if (!vmemmap_should_optimize(h, head))
> + return;
> +
> + static_branch_inc(&hugetlb_optimize_vmemmap_key);
Why? hugetlb_optimize_vmemmap_key is used as a switch to let
page_fixed_fake_head work properly for vmemmap-optimized HugeTLB
pages; however, this function only splits the huge pmd entry without
optimizing the vmemmap pages. So it is wrong to increase the
static key here.
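IOW, I would expect this function to look more like the following
(an untested sketch, error handling omitted):

	void hugetlb_vmemmap_split(const struct hstate *h, struct page *head)
	{
		unsigned long vmemmap_start = (unsigned long)head, vmemmap_end;
		unsigned long vmemmap_reuse;

		if (!vmemmap_should_optimize(h, head))
			return;

		vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
		vmemmap_reuse = vmemmap_start;
		vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;

		/*
		 * Only split the huge pmd entries here; the static key
		 * should only be touched when the vmemmap pages are
		 * actually remapped and freed.
		 */
		vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse);
	}

i.e. with no static_branch_inc()/static_branch_dec() at all, since
the vmemmap is not yet optimized at this point.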
Thanks.
> +
> + vmemmap_end = vmemmap_start + hugetlb_vmemmap_size(h);
> + vmemmap_reuse = vmemmap_start;
> + vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
> +
> + /*
> + * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
> + * to the page which @vmemmap_reuse is mapped to, then free the pages
> + * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
> + */
> + if (vmemmap_remap_split(vmemmap_start, vmemmap_end, vmemmap_reuse))
> + static_branch_dec(&hugetlb_optimize_vmemmap_key);
> +}
> +
> void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
> {
> struct folio *folio;
> LIST_HEAD(vmemmap_pages);
>
> + list_for_each_entry(folio, folio_list, lru)
> + hugetlb_vmemmap_split(h, &folio->page);
> +
> + flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
> +
> list_for_each_entry(folio, folio_list, lru)
> hugetlb_vmemmap_optimize_bulk(h, &folio->page, &vmemmap_pages);
>