* [PATCH v4 0/3] Optimizations for khugepaged
@ 2025-07-24  5:22 Dev Jain
  2025-07-24  5:22 ` [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes() Dev Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Dev Jain @ 2025-07-24  5:22 UTC (permalink / raw)
  To: akpm, david
  Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, baohua, linux-mm, linux-kernel, Dev Jain

If the underlying folio mapped by the ptes is large, we can process those
ptes in a batch using folio_pte_batch().

For arm64 specifically, this results in a 16x reduction in the number of
ptep_get() calls, since on a contig block, ptep_get() on arm64 will iterate
through all 16 entries to collect a/d bits. Next, ptep_clear() will cause
a TLBI for every contig block in the range via contpte_try_unfold().
Instead, use clear_ptes() to only do the TLBI at the first and last
contig block of the range.

For fully split folios, there will be no PTE batching; the batch size
returned by folio_pte_batch() will be 1. For folios where only the page
table has been split, the PTEs still point to the same large folio; on
arm64 this gives the optimization described above, and on other arches a
minor improvement is expected from fewer function calls and the batching
of atomic operations.
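
To illustrate the shape of the change, here is a minimal sketch of such a
batched walk (the function name is hypothetical and this is not code from
the series; it mirrors the hunks in patches 2 and 3):

	/* Hypothetical sketch of the batched PTE walk described above. */
	static void example_batched_pte_walk(struct vm_area_struct *vma,
					     pte_t *pte, unsigned long address,
					     unsigned long end, spinlock_t *ptl)
	{
		unsigned int nr_ptes;
		pte_t *_pte;

		for (_pte = pte; address < end; _pte += nr_ptes,
		     address += nr_ptes * PAGE_SIZE) {
			pte_t pteval = ptep_get(_pte);
			struct folio *folio;

			nr_ptes = 1;
			if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
				continue;

			folio = page_folio(pte_page(pteval));
			if (folio_test_large(folio)) {
				unsigned int max_nr = (end - address) >> PAGE_SHIFT;

				/* One batch per folio instead of one ptep_get() per PTE. */
				nr_ptes = folio_pte_batch(folio, _pte, pteval, max_nr);
			}

			spin_lock(ptl);
			/* On arm64 contpte, TLBI only at the batch boundaries. */
			clear_ptes(vma->vm_mm, address, _pte, nr_ptes);
			folio_remove_rmap_ptes(folio, pte_page(pteval), nr_ptes, vma);
			spin_unlock(ptl);
		}
	}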

---
Rebased on today's mm-new - the v3 of this patchset was already in, so
I reverted those commits and then rebased on top of that.
mm-selftests pass.

v3->v4:
 - Use unsigned int for nr_ptes and max_nr_ptes (David)
 - Define the functions in patch 1 as inline functions with kernel docs
   instead of macros (akpm)


v2->v3:
 - Drop patch 3 (was merged separately)
 - Add patch 1 (David)
 - Coding style change, drop mapped_folio (Lorenzo)

v1->v2:
 - Use for loop instead of do-while loop (Lorenzo)
 - Remove folio_test_large check since the subpage-check condition
   will imply that (Baolin)
 - Combine patch 1 and 2 into this series, add new patch 3

David Hildenbrand (1):
  mm: add get_and_clear_ptes() and clear_ptes()

Dev Jain (2):
  khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE
    batching
  khugepaged: Optimize collapse_pte_mapped_thp() by PTE batching

 arch/arm64/mm/mmu.c     |  2 +-
 include/linux/pgtable.h | 45 ++++++++++++++++++++++++++++++++
 mm/khugepaged.c         | 58 +++++++++++++++++++++++++++--------------
 mm/mremap.c             |  2 +-
 mm/rmap.c               |  2 +-
 5 files changed, 87 insertions(+), 22 deletions(-)

-- 
2.30.2



* Re: [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching
@ 2025-07-24 17:32 Lorenzo Stoakes
  2025-07-24 17:40 ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 17:32 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	baohua, linux-mm, linux-kernel

Message-ID: <32843cfb-a70b-4dfb-965c-4e1b0623a1b4@lucifer.local>
In-Reply-To: <20250724052301.23844-3-dev.jain@arm.com>

NIT: Please don't capitalise 'Optimize' here.

I think Andrew fixed this for you actually in the repo though :P

On Thu, Jul 24, 2025 at 10:53:00AM +0530, Dev Jain wrote:
> Use PTE batching to batch process PTEs mapping the same large folio. An
> improvement is expected due to batching refcount-mapcount manipulation on
> the folios, and for arm64 which supports contig mappings, the number of
> TLB flushes is also reduced.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/khugepaged.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a55fb1dcd224..f23e943506bc 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -700,12 +700,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  						spinlock_t *ptl,
>  						struct list_head *compound_pagelist)
>  {
> +	unsigned long end = address + HPAGE_PMD_SIZE;
>  	struct folio *src, *tmp;
> -	pte_t *_pte;
>  	pte_t pteval;
> +	pte_t *_pte;
> +	unsigned int nr_ptes;
>
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, address += PAGE_SIZE) {
> +	for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> +	     address += nr_ptes * PAGE_SIZE) {
> +		nr_ptes = 1;
>  		pteval = ptep_get(_pte);
>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>  			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> @@ -722,18 +725,26 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  			struct page *src_page = pte_page(pteval);
>
>  			src = page_folio(src_page);
> -			if (!folio_test_large(src))
> +
> +			if (folio_test_large(src)) {
> +				unsigned int max_nr_ptes = (end - address) >> PAGE_SHIFT;
> +
> +				nr_ptes = folio_pte_batch(src, _pte, pteval, max_nr_ptes);
> +			} else {
>  				release_pte_folio(src);
> +			}
> +
>  			/*
>  			 * ptl mostly unnecessary, but preempt has to
>  			 * be disabled to update the per-cpu stats
>  			 * inside folio_remove_rmap_pte().
>  			 */
>  			spin_lock(ptl);
> -			ptep_clear(vma->vm_mm, address, _pte);
> -			folio_remove_rmap_pte(src, src_page, vma);
> +			clear_ptes(vma->vm_mm, address, _pte, nr_ptes);
> +			folio_remove_rmap_ptes(src, src_page, nr_ptes, vma);
>  			spin_unlock(ptl);
> -			free_folio_and_swap_cache(src);
> +			free_swap_cache(src);
> +			folio_put_refs(src, nr_ptes);

Hm, one thing here though is that free_folio_and_swap_cache() does:

	free_swap_cache(folio);
	if (!is_huge_zero_folio(folio))
		folio_put(folio);

Whereas here you unconditionally reduce the reference count. Might this
cause issues with the shrinker version of the huge zero folio?

Should this be:

			if (!is_huge_zero_folio(src))
				folio_put_refs(src, nr_ptes);

Or do we otherwise avoid issues with this?

>  		}
>  	}
>
> --
> 2.30.2
>
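For reference, a sketch of the guarded variant suggested above, mirroring
what free_folio_and_swap_cache() does for the huge zero folio (this is only
the suggestion from this thread, not code from the series):

	free_swap_cache(src);
	/* Keep the huge zero folio's reference, as free_folio_and_swap_cache() would. */
	if (!is_huge_zero_folio(src))
		folio_put_refs(src, nr_ptes);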



End of thread (newest message: 2025-07-29 14:41 UTC).

Thread overview: 16+ messages
2025-07-24  5:22 [PATCH v4 0/3] Optimizations for khugepaged Dev Jain
2025-07-24  5:22 ` [PATCH v4 1/3] mm: add get_and_clear_ptes() and clear_ptes() Dev Jain
2025-07-24  9:31   ` Barry Song
2025-07-24 17:17   ` Lorenzo Stoakes
2025-07-29 14:30   ` Zi Yan
2025-07-24  5:23 ` [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() by PTE batching Dev Jain
2025-07-24 17:55   ` Lorenzo Stoakes
2025-07-24 17:57     ` David Hildenbrand
2025-07-24 18:01       ` Lorenzo Stoakes
2025-07-24 18:02     ` Lorenzo Stoakes
2025-07-29 14:38   ` Zi Yan
2025-07-24  5:23 ` [PATCH v4 3/3] khugepaged: Optimize collapse_pte_mapped_thp() " Dev Jain
2025-07-24 18:07   ` Lorenzo Stoakes
2025-07-29 14:41   ` Zi Yan
  -- strict thread matches above, loose matches on Subject: below --
2025-07-24 17:32 [PATCH v4 2/3] khugepaged: Optimize __collapse_huge_page_copy_succeeded() " Lorenzo Stoakes
2025-07-24 17:40 ` David Hildenbrand
